Corpora: Collaborative effort

From: Jem Clear (jem@cobuild.collins.co.uk)
Date: Fri Jun 09 2000 - 16:54:39 MET DST

    Dear Corpora-people

    I've just had an idea for a collaborative venture that might benefit
    the whole language-research community.

    Suppose I were to post on the CORPORA list a word plus its definition
    as taken from some good dictionary. I would then invite any and all
    CORPORA list members to select, from whatever corpora they have
    access to, one or more authentic contexts in which the word is used
    in the given sense. Anyone could email their selected examples to an
    address which simply files them away under the heading of the
    word/sense combination.
    Easy, eh? How many examples might I expect to get? One hundred? Or
    just ten? Or maybe 1,000? I think I might get a hundred or more.

    Now of course I could continue to post different word/sense pairs
    and invite anyone anywhere to contribute some examples taken from
    real text. This might grow into a database of corpus data sorted into
    sense categories (as delimited by the chosen dictionary).
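
    As a rough illustration of the filing scheme described above, here
    is a minimal Python sketch of a store that keeps contributed examples
    under a word/sense heading. The class name SenseExampleStore, its
    method names, and the short sense label "intense" are all
    hypothetical; the message itself does not specify any implementation.

```python
from collections import defaultdict


class SenseExampleStore:
    """Hypothetical store: files each example under a (word, sense) heading."""

    def __init__(self):
        # Maps (word, sense label) -> list of contributed example records.
        self._examples = defaultdict(list)

    def add(self, word, sense, example, contributor=None):
        """File one example, skipping empty text and exact duplicates."""
        key = (word.lower(), sense)
        text = " ".join(example.split())  # normalise whitespace
        already_filed = any(e["text"] == text for e in self._examples[key])
        if text and not already_filed:
            self._examples[key].append(
                {"text": text, "contributor": contributor}
            )

    def examples(self, word, sense):
        """Return the example texts filed under one word/sense heading."""
        return [e["text"] for e in self._examples.get((word.lower(), sense), [])]

    def count(self, word, sense):
        """Return how many examples are filed under one word/sense heading."""
        return len(self._examples.get((word.lower(), sense), []))


store = SenseExampleStore()
store.add("fierce", "intense", "He inspires fierce loyalty in his friends.")
store.add("Fierce", "intense", "He inspires  fierce loyalty in his friends.")
print(store.count("fierce", "intense"))
```

    Keying on the word/sense pair mirrors the idea of sorting corpus data
    into the sense categories of the chosen dictionary; a real collection
    would presumably also record which corpus each example came from.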

    We could then share that growing resource freely, gratis, in the
    public domain -- since the cost of building it would be spread so
    widely that no-one would have incurred any significant costs.
    Moreover the range and variety of corpora which would in some
    small measure contribute to the database could be very extensive
    indeed and offer thereby a comprehensiveness not achievable by
    the use of any single corpus (such as the British National Corpus,
    or the Bank of English). There would be no copyright problem involved
    in disseminating this database freely, since no text source would be
    reproduced beyond the 20 or 30 words of context necessary to
    illustrate the word in its particular sense.
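
    One way a contributor might trim a longer passage down to the 20 or
    30 words mentioned above is sketched below. The function name
    context_window and its max_words parameter are assumptions made for
    illustration, and the matching is deliberately naive: it finds only
    the exact word form, not inflected forms.

```python
def context_window(text, target, max_words=30):
    """Return up to max_words words of text around the first use of target.

    Returns None if the target word does not occur in the text.
    """
    tokens = text.split()
    for i, token in enumerate(tokens):
        # Ignore surrounding punctuation and case when comparing.
        if token.strip(".,;:!?\"'()").lower() == target.lower():
            half = (max_words - 1) // 2
            start = max(0, i - half)
            end = min(len(tokens), start + max_words)
            # If the window hit the end of the text, extend it leftwards.
            start = max(0, end - max_words)
            return " ".join(tokens[start:end])
    return None


sentence = ("A fierce battle has been raging all day "
            "in the Croatian town of Pakrac")
print(context_window(sentence, "fierce", max_words=5))
```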

    My preliminary investigations of the Cobuild English Dictionary show
    that the number of lexemes (lemmas, headwords, whatever term you like
    to use) having more than one dictionary sense within the same
    part-of-speech class is measured in the low thousands.
    (By contrast, instances of "exhaust" can be divided into verb and
    noun uses, each with a different sense, so part of speech alone
    separates them.) So it may take only a few years
    to compile a database giving hundreds of corpus instances of each
    sense of most polysemous words of English. The potential for
    exploiting such a database in information retrieval, machine
    translation, and similar applications is clearly significant.

    This scheme would be a truly collaborative effort, driven entirely
    through the co-operation of the corpus linguistics community. It
    might be more effective and more comprehensive than an EU-funded
    project with participants from numerous EU member states, where
    restrictions on proprietary software or data, copyright concerns,
    and multiple layers of bureaucratic administration tend to hold
    back the process of compiling and disseminating useful information.

    Just for starters, here's a definition for the word "fierce" followed
    by some illustrative examples:

        Fierce feelings or actions are very intense or enthusiastic, or
        involve great activity.

        Ex: A fierce battle has been raging all day in the Croatian town
            of Pakrac
        Ex: He inspires fierce loyalty in his friends.

    Send any examples you can find (from any corpora to which you have
    access) to "jem@cobuild.collins.co.uk".

    Jem Clear

    Electronic Development Director     phone: +44 (0)121-414-3926
    Collins Dictionaries                fax:   +44 (0)121-414-6203
    Westmere, 50 Edgbaston Park Road    email: jem@cobuild.collins.co.uk
    Birmingham, B15 2RX, UK             WWW:   www.cobuild.collins.co.uk


