Corpora: Summary: corpus frequencies for psycholinguistics experiments

From: Philip Resnik (resnik@umiacs.umd.edu)
Date: Sat Jul 15 2000 - 21:07:20 MET DST

    This is a summary of replies to my question about alternatives to
    Francis and Kucera for getting word frequency data to use in
    psycholinguistics experiments. The rationale behind the query is
    nicely summed up by one of the respondents, who wrote: "[F&K] has been
    shown (by Burgess I think) to be inaccurate for many words that it
    calls low frequency and of course it's out of date by now, some words
    that are now considered politically incorrect for example, and
    therefore, not used very often, are relatively frequent in there."

    Thanks to the following people for replies:

      Chris Brew <cbrew@ling.ohio-state.edu>
      Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk>
      Jim Magnuson <magnuson@ling.ling.rochester.edu>
      Paul Rayson <paul@comp.lancs.ac.uk>
      Nina Silverberg <nsilverb@astro.ocis.temple.edu>

    Here's the summary.

    1a. British National Corpus (http://info.ox.ac.uk/bnc/)

       The corpus itself is available only to Europeans, but Adam
       Kilgarriff has produced word frequency lists and put them on the
       Web at http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/bnc-readme.html.
       He writes, "the lists from the BNC on my web page - particularly
       the lemmatised ones - were produced with English teaching and
       dictionaries in mind, and have been quite widely used for
       experiment-type purposes. The BNC is clearly appropriate, as it
       was designed with 'general English' in mind. (though it is
       British, but I suspect the differences there are quite marginal.)
       It's been getting 200 files downloaded per month for 4 years now,
       and I think it is quite widely used."

       Adam's paper

        @article{ak-ijl,
            author  = "Adam Kilgarriff",
            title   = "Putting Frequencies into the Dictionary",
            journal = "International Journal of Lexicography",
            year    = 1997,
            volume  = 10,
            number  = 2,
            pages   = {135--155}
        }

       argues for the list and explains how it was done, and there's an
       on-line copy available from his Web page.

       Paul Rayson has been working on the BNC and writes:

         I have been working on frequency lists for the second version of
         the BNC (POS tagging and file headers updated) and short versions
         of those lists will appear in

           Leech, G., Wilson, A., Rayson, P. (forthcoming). Word Frequencies
           in Spoken and Written English: based on the British National
           Corpus. Longman, London.

         Due to the size of the lists, we plan to make the longer versions
         available on the UCREL website later this year when the book is
         published.

         http://www.comp.lancs.ac.uk/ucrel/

    1b. BNC Online (http://sara.natcorp.ox.ac.uk/)

       Although the corpus itself is not available outside Europe, there
       is worldwide access to search capabilities, so you can search
       for instances of particular words, phrases, or patterns. I've tried
       it and it's quite nice.

    2. For the future: the American National Corpus, presumably modeled
       on the BNC. See http://www.cs.vassar.edu/~ide/anc/.

    3. Curt Burgess at UC Riverside has reportedly been building corpora
       from Usenet postings. I couldn't find a Web page on this project,
       but his home page is http://locutus.ucr.edu/~curt/.

    4. The CELEX database. "They used a database of about 17 million words
       as opposed to the 1 million from FK. However, that is a British
       English count. It would be nice if there were something available
       on the web that allowed a person to enter a word (or preferably a
       list of words) and to get a count of its frequency per million out
       of a very large corpus. Seems doable, but I don't think it's been
       done." CELEX is on the Web at http://www.kun.nl/celex/.
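
       The per-million lookup wished for in that quote is easy to sketch
       once token counts are in hand. The Python below is only an
       illustration: the names count_tokens and per_million, the crude
       letters-and-apostrophes tokenizer, and the toy sample text are all
       assumptions of mine, not part of CELEX or anyone's actual tooling.

```python
import re
from collections import Counter


def count_tokens(text):
    """Count lowercase word tokens (hypothetical, deliberately crude tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def per_million(counts, words):
    """Return each word's frequency per million tokens in the counted corpus."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("corpus contains no tokens")
    # Words absent from the corpus get a rate of 0.0 rather than an error.
    return {w: counts.get(w.lower(), 0) * 1_000_000 / total for w in words}


# Toy stand-in for a corpus; a real count would come from a large corpus
# or from a published frequency list.
sample = "the cat sat on the mat and the dog sat too"
counts = count_tokens(sample)
rates = per_million(counts, ["the", "sat", "zebra"])

for word, rate in rates.items():
    print(f"{word}\t{rate:.1f}")
```

       With a real corpus one would load counts from a frequency list
       instead of counting a string, but the normalization step (raw
       count times one million, divided by corpus size) stays the same,
       and it is what makes counts from a 1-million-word and a
       17-million-word source comparable.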

    Thanks again to those who replied!

      Philip


