Re: Corpora: efficient string matching

Karine Wagner (karine@ilsp.gr)
Mon, 13 Apr 1998 11:54:17 +0300

Tu savais que t'avais un nom d'arbre toi?.....;-))))

At 12:04 PM 4/11/98 +0000, Mark Johnson wrote:
>Tom Vanallemeersch writes:
>> What interests
>> me ... is to know whether there a programs which some way store
>> information on the substrings of a text (e.g. all substrings up to a
>> length of 10), and possibly give the context of those strings.
>
>This sounds like an application for a Patricia tree (I think this is the
>name). A good book on algorithms should describe this data structure.
>
>Another very simple method that I have found useful is the following.
>Think of the corpus as being one long vector, so that each character position
>described by an integer. Run through your corpus, constructing another
>vector which records the starting positions of all of the substrings you
>want to index on (e.g., all of the starting positions of words). Regard
these
>as pointers into your original corpus. Sort this vector of indices by the
>suffix string beginning at the position in the corpus that each index points
>to.
>
>This sorted index now gives an alphabetic listing of suffix strings in
>the corpus. It is a trivial matter to search for any substring of any
>length with this sorted index using a binary search.
>
>Hope this helps,
>
>Mark
>
>