Corpora: Subsets and "partially-tagged" corpora

From: Mark Davies (mdavies@ilstu.edu)
Date: Wed May 10 2000 - 15:43:32 MET DST


    I am in the process of creating a 100,000,000 word corpus of historical
    Spanish texts (1200s-1900s) and have a question regarding possible
    alternatives to POS tagging for the entire corpus.

    It would be extremely difficult (if not impossible) to tag the entire
    corpus, because of the large amount of variation in forms (e.g. hubiese,
    hubiesse, ouiese, ouisse, ouyese, ouyesse, etc. for the past subjunctive of
    haber "to have") as well as because of the sheer size and lexical
    complexity of the corpus.

    I am considering an alternative scheme in which I tag just the most common
    words/forms for a given syntactic or verbal category, such as the 100 most
    common nouns and infinitives, the 25 most common adjectives, the 35 most
    common preterites, etc. The "tagged" elements would be identified by a
    prefix, such as:

            VI-estar (= verb/infinitive-"to be")
            N-hombre (= noun-"man")
            VPT-supo (= verb/preterite-"knew")
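    A minimal sketch of this prefixing step, assuming a plain lookup table
    from surface form to tag prefix. The table contents and the names
    TAGGED_FORMS and tag_tokens are hypothetical; in practice the table
    would hold the most frequent forms of each category, and ambiguous
    forms would need separate handling.

```python
# Hypothetical lookup table: surface form -> tag prefix.
# Only the three forms from the examples above are listed here.
TAGGED_FORMS = {
    "estar": "VI",    # verb/infinitive
    "hombre": "N",    # noun
    "supo": "VPT",    # verb/preterite
}


def tag_tokens(tokens, tagged_forms=TAGGED_FORMS):
    """Prefix each token found in the lookup table; leave all others untouched."""
    tagged = []
    for token in tokens:
        prefix = tagged_forms.get(token.lower())
        tagged.append(f"{prefix}-{token}" if prefix else token)
    return tagged


if __name__ == "__main__":
    print(" ".join(tag_tokens("el hombre supo estar".split())))
```

    Forms that are not in the table pass through unchanged, which is what
    makes the corpus only "partially tagged".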

    Users could then search for a construction like

            parece* VI-* [parecer "to seem" + infinitive]
            deb* CL-* VI-* [deber "should" + clitic + infinitive]

    which would give cases of "parecer" followed by (just) one of the 100 most
    common infinitives, or "deber" followed by a clitic and one of the 100
    most common infinitives (these are just two of many possible examples).
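    One way such wildcard searches might be run over the prefixed text is
    to translate each query into a regular expression. The sketch below
    assumes that "*" stands for any run of non-space characters, that
    tokens are separated by whitespace, and that the infinitive prefix is
    written VI- as in the tag examples. The function names and the sample
    sentence are invented for illustration.

```python
import re


def query_to_regex(query):
    """Turn a query such as 'deb* CL-* VI-*' into a compiled regex.

    Each '*' matches any run of non-space characters, and consecutive
    query terms must match consecutive whitespace-separated tokens.
    """
    terms = []
    for term in query.split():
        literal_parts = [re.escape(part) for part in term.split("*")]
        terms.append(r"\S*".join(literal_parts))
    # The lookarounds anchor the match at token boundaries.
    return re.compile(r"(?<!\S)" + r"\s+".join(terms) + r"(?!\S)")


def search(query, tagged_text):
    """Return every matching token sequence in the tagged text."""
    return [m.group(0) for m in query_to_regex(query).finditer(tagged_text)]


if __name__ == "__main__":
    # Invented sample of already-prefixed text.
    sample = "me parece VI-ser bueno y debe CL-lo VI-hacer pronto"
    print(search("parece* VI-*", sample))
    print(search("deb* CL-* VI-*", sample))
```

    Because only the most frequent forms carry a prefix, a query like this
    silently misses any construction built on an untagged infinitive.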

    (I'm aware of problems of polysemy, such as ser = "to be / a (human) being
    (N)", habla = "speak-3SG / speech (N)", and these will have to be dealt
    with as well as possible. But a POS tagger will have similar (if not
    worse) problems identifying the correct POS for each form, considering the
    incredible range in forms in a corpus this size, covering a period of 800
    years).

    So my question deals with what percentage of all of the occurrences of a
    particular category would be included in this subset of most frequent
    forms. For example, if there are 100,000 occurrences of infinitives in a
    particular block of text (representing 2000 different forms) and I tag just
    the 100 most common forms, what percentage of all of the occurrences will
    get marked -- 25%, 50%, etc.? I'm going to be carrying out some tests
    myself, but would like to be able to compare the results to other studies
    that might have already been done.
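    The percentage in question is just the share of all tokens of a
    category accounted for by its N most frequent forms. A minimal sketch
    of that calculation follows, assuming the per-form frequencies for one
    category are already available. The function name top_n_coverage and
    the toy counts are made up and say nothing about real Spanish data.

```python
from collections import Counter


def top_n_coverage(form_counts, n):
    """Fraction of all occurrences covered by the n most frequent forms."""
    total = sum(form_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in Counter(form_counts).most_common(n))
    return covered / total


if __name__ == "__main__":
    # Toy frequencies for four infinitive forms (invented numbers).
    toy_counts = {"ser": 50, "estar": 30, "haber": 15, "yazer": 5}
    print(f"{top_n_coverage(toy_counts, 2):.0%}")
```

    Running the same function for N = 25, 35, 100 and so on over a test
    block of text would give figures that can be set beside any published
    ones.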

    The main question, then, is whether anyone might be aware of statistical
    studies that have been done along these lines, especially for one of the
    Western European languages. I realize that grammatical categories are
    divided differently in different languages (e.g. infinitives in German
    might not compare directly to infinitives in Spanish, and the same for
    clitics in French and Spanish), but what I'm looking for here are just very
    approximate figures.

    Again, I realize that a "partially-tagged" corpus such as this has very
    real shortcomings, both in terms of theory [representativeness] and
    practice [only having access to a limited number of forms for a given
    category, and missing interesting occurrences with less common forms]. But
    if the alternative is a corpus that is not tagged at all, it is probably
    still worth doing.

    Thanks in advance for any comments that you might have.

    Mark D.

    =======================================
    Mark Davies, Associate Professor, Spanish Linguistics
    Dept. of Foreign Languages, Illinois State University
    Normal, IL 61790-4300

    Voice:309/438-7975 email:mdavies@ilstu.edu
    Fax:309/438-8038 http://mdavies.for.ilstu.edu/personal/
    =======================================


