Re[2]: Corpora: query

Max Schulze (bschulze@xis.xerox.com)
Fri, 29 May 1998 08:11:11 PDT

I completely disagree. That approach would throw out the many foreign
words that consist only of ASCII characters, such as 'kindergarten',
'festschrift', 'verboten', 'faux pas' etc. In addition, many words
with diacritics very often occur 'de-accented', e.g. 'resume'.

A two-step approach might work better: use on-line dictionaries to
generate a list of candidates (this yields the foreign words, but
without frequencies), and then use corpora to determine the
frequencies. The biggest problem will, of course, be the availability
of suitable on-line dictionaries ...
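
A minimal sketch of the second step, in Python, assuming the candidate
list has already been extracted from a dictionary. The function names
(strip_accents, candidate_frequencies) and the sample data are
hypothetical, and the corpus is assumed to be available as one decoded
string. Accented and de-accented spellings are counted under the same
candidate:

```python
from collections import Counter
import re
import unicodedata


def strip_accents(word):
    """Return the word with combining diacritics removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def candidate_frequencies(candidates, corpus_text):
    """Count corpus occurrences of each dictionary candidate.

    Both the accented and the de-accented spelling of a candidate
    are credited to that candidate.
    """
    # Map every accepted spelling (original and de-accented) to its candidate.
    spellings = {}
    for cand in candidates:
        key = cand.lower()
        spellings[key] = cand
        spellings[strip_accents(key)] = cand

    # Start every candidate at zero so unseen words still appear in the result.
    counts = Counter({cand: 0 for cand in candidates})
    for token in re.findall(r"\w+", corpus_text.lower()):
        if token in spellings:
            counts[spellings[token]] += 1
    return counts


# Hypothetical candidate list and corpus text, for illustration only.
candidates = ["caf\u00e9", "verboten", "fa\u00e7ade"]
corpus = "The cafe was verboten; the caf\u00e9 closed. Verboten!"
frequencies = candidate_frequencies(candidates, corpus)
```

Two limitations of this sketch: multi-word entries such as 'faux pas'
are never matched, because the corpus is split into single tokens, and
a de-accented form such as 'resume' is also counted when it is the
ordinary English verb, so the resulting counts still need a manual
check.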

Max
---
Bruno Maximilian Schulze
Pagis Indexing Sr. SW Engineer
ScanSoft, Inc. -- A Xerox Company
Peabody MA, USA


______________________________ Reply Separator _________________________________
Subject: RE: Corpora: query
Author: keith@mitre.org (Keith J. Miller) at intergate
Date: 5/28/98 3:47 PM

I'm not aware of any such list, but I'm sure that it would be corpus/domain
specific in any case. A quick way to generate candidates for a list for
your own corpus would be to throw together a Perl script that keeps a
count of every word containing any character above (decimal) 127
(assuming your text is in ISO Latin-1 [ISO-8859-1] or some similar
encoding), and then to weed junk out of that list based on your idea of
what high-frequency means. Of course, the range above 127 contains other
things besides accented characters, but that should give you a pretty good
start without much effort.
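
The counting step described above can be sketched as follows. This is
Python rather than Perl, and only an illustration under stated
assumptions: the function name count_non_ascii_words and the sample
sentence are hypothetical, and the text is assumed to be decoded
already (for example, read with encoding="latin-1").

```python
from collections import Counter
import re


def count_non_ascii_words(text):
    """Count words containing at least one character with code point > 127."""
    counts = Counter()
    # \w+ matches runs of word characters, including accented letters.
    for word in re.findall(r"\w+", text.lower()):
        if any(ord(ch) > 127 for ch in word):
            counts[word] += 1
    return counts


# Hypothetical sample text, for illustration only.
sample = "She sat in the caf\u00e9, read her r\u00e9sum\u00e9, and left the caf\u00e9."
word_counts = count_non_ascii_words(sample)
```

Calling word_counts.most_common() lists the candidates in descending
order of frequency, which is a convenient starting point for weeding
out junk by hand.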

----- Keith J. Miller
millerk@gusun.georgetown.edu
keith@mitre.org

>-----Original Message-----
>From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]On
>Behalf Of afc@nwnexus.com
>Sent: Thursday, May 28, 1998 5:14 PM
>To: corpora@hd.uib.no
>Subject: Corpora: query
>
>
>I'm looking for a list of high-frequency foreign words found in English
>text, e.g. words like "cafe" (with an acute accent over the final e),
>"resume" (acute accent over both e's) and "facade" (where c ==
>c-cedilla), etc.
>
>Does anyone know of such a list? Or pointers to a listing from which
>the list I'm looking for could be extracted?
>
>Many thanks,
>
>Alexander
>
>
>