Re: word frequency lists?

Centre for Lexical Information (celex@mpi.nl)
Thu, 23 Nov 1995 21:31:07 +0100 (MET)

On Thu, 23 Nov 1995, Ted Dunning wrote:

> moreover, this coin of domain specificity has another side. it also
> invalidates counts taken from the so-called "balanced" corpora such as
> the brown corpus or the british national corpus. by conjoining data
> from diverse sources, an average count is obtained which might be
> supposed to be better in some sense than the counts obtained from any
> domain specific source.
>
> this is not true, however. the act of balancing has created a corpus
> which is utterly unlike any real bit of text.

Yes, caution should be used even with 'balanced' or 'representative'
corpora, but such corpora do have their uses in applications that depend
on how familiar language users in general are with certain words; for
educated speakers, that familiarity draws on an amalgam of words from the
spoken and written media. I am thinking of studies of the mental lexicon, the
compilation of learner's dictionaries and general-purpose dictionaries
(including spelling checkers). Also, while acknowledging the need for
special-purpose R&D restricted to a particular domain, I believe that
applications based on general counts, which provide a rough-and-ready
initial analysis with less than complete coverage and success rate, are
useful for reducing the drudgery of many tasks (apart from the fact that
they are more marketable!).

Richard Piepenbrock
CELEX