Re[2]: Corpora: query

Max Schulze (bschulze@xis.xerox.com)
Fri, 29 May 1998 11:44:05 PDT

Sounds like a good idea. However, the biggest problem is a
morphological one: foreign words very often do not occur in their
original spelling, or they carry affixes (either derivational or
inflectional). The intersection of such word-lists should therefore
take this into account, i.e. run morphological analyzers over the
lists before performing the intersection proper.
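
As a rough illustration of normalizing before intersecting, here is a
minimal Python sketch. It assumes a toy suffix stripper in place of a
real morphological analyzer; the suffix list, the names crude_stem and
intersect_by_stem, and the sample words are all hypothetical and only
cover a few English inflections.

```python
# Toy stand-in for a morphological analyzer: strips a few English
# inflectional suffixes. A real analyzer would handle far more cases.
SUFFIXES = ("ings", "ing", "es", "ed", "s")  # hypothetical, illustrative only


def crude_stem(word, min_stem=3):
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def intersect_by_stem(english_words, foreign_words):
    """Return the English words whose crude stem also occurs as a
    crude stem in the foreign word-list."""
    foreign_stems = {crude_stem(w) for w in foreign_words}
    return {w for w in english_words if crude_stem(w) in foreign_stems}


# Made-up sample data: inflected English forms against German base forms.
english = ["kindergartens", "zeitgeist", "blitzed", "window"]
german = ["Kindergarten", "Zeitgeist", "Blitz", "Angst"]
borrowed = intersect_by_stem(english, german)
```

A plain set intersection of the raw lists would find only "zeitgeist"
here (after lowercasing); the inflected forms match only once both
sides are reduced to a common stem.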

Max
---
Bruno Maximilian Schulze
Pagis Indexing Sr. SW Engineer
ScanSoft, Inc. -- A Xerox Company
Peabody MA, USA

______________________________ Reply Separator _________________________________
Subject: Re: Corpora: query
Author: "Ted E. Dunning" <ted@aptex.com> at intergate
Date: 5/29/98 9:52 AM



a simpler approach would be to intersect an english word-list with a
word-list from another language (such as french, spanish, german,
latin). these word lists are readily available. deaccenting the
non-english word-list would probably be a good idea.
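
a minimal python sketch of this intersection, assuming the word-lists
are already loaded as lists of strings. the function names and the
sample words are hypothetical; deaccenting is done here with unicode
decomposition, which is one possible way to do it and will not cover
letters that have no decomposition (such as ß or ø).

```python
import unicodedata


def deaccent(word):
    """Remove combining accent marks, e.g. 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def shared_words(english_words, foreign_words):
    """Intersect an English word-list with a deaccented foreign one."""
    english = {w.lower() for w in english_words}
    foreign = {deaccent(w.lower()) for w in foreign_words}
    return english & foreign


# Made-up sample data standing in for real word-lists.
english = ["cafe", "naive", "house", "resume", "dog"]
french = ["café", "naïve", "maison", "résumé", "chien"]
candidates = shared_words(english, french)
```

without the deaccenting step none of the three accented french words
would match their english spellings.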

this approach will, of course, miss out on words like skosh and honcho
which are taken from japanese. there are approaches which can
approximate the matching between english words and japanese words, but
these approaches are relatively difficult to implement well.