RE: Corpora: Question about a Brown Corpus tag

From: E S Atwell (eric@comp.leeds.ac.uk)
Date: Thu Sep 14 2000 - 16:32:21 MET DST

    Mark Lewellen asked:

    >An alternative to underspecification of POS information is to develop a
    >POS tagger that records multiple POS in ambiguous contexts (ideally with
    >probabilities attached to each POS choice). An advantage to this
    >approach is that POS-ambiguity information is not 'hard-coded' in advance
    >by the tag set, but is rather determined by sentence context, and may be
    >extended to other ambiguities (such as N vs. V).
    >
    >Could anyone point out projects that have developed such POS taggers, or
    >submit opinions as to their viability? One difficulty I notice is that a
    >typical tagger using an HMM with the Viterbi algorithm determines a most
    >likely _sequence_, which would make it difficult to establish
    >probabilities of multiple POS tags for a given word.

    The CLAWS tagger, originally developed to PoS-tag the LOB Corpus, did this
    (as presumably did the later versions of CLAWS used on SEC, BNC, etc.): the
    tagger included the option of outputting all tags allocated by
    lexicon+suffixlist, along with context-dependent weights. The proofreader
    had to mark all cases where the correct tag wasn't the highest-weighted;
    then a "cleanup" program "rubbed out" all but the first tag (unless the
    proofreader had marked another tag, in which case this was left as the single
    correct tag). Using a Markov sequence-based model is not a problem: the
    relative weight attached to a tag can be the weight from the best sequence
    using it, or the sum of all sequences passing through the tag, or some
    other function of all sequences including the tag.
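
    As a rough illustration of that last point, the sketch below computes
    per-word tag weights from a toy first-order Markov model in two ways: the
    weight of the best sequence through each tag (combine=max) and the sum
    over all sequences through each tag (combine=sum). This is not the CLAWS
    code; the two-tag tag set, the numbers in START, TRANS and EMIT, and the
    function name tag_weights are all invented for the example.

```python
# Toy first-order Markov tagger model. All tags and probabilities here are
# hypothetical values chosen only to make the example runnable.
TAGS = ["N", "V"]
START = {"N": 0.7, "V": 0.3}
TRANS = {("N", "N"): 0.4, ("N", "V"): 0.6, ("V", "N"): 0.7, ("V", "V"): 0.3}
EMIT = {
    "N": {"time": 0.6, "flies": 0.4},
    "V": {"time": 0.2, "flies": 0.8},
}


def tag_weights(words, combine):
    """Return one {tag: relative weight} dict per word.

    combine=max scores a tag by the best sequence passing through it;
    combine=sum scores it by the total of all sequences passing through it.
    """
    n = len(words)

    # Forward pass: score of the sequence prefixes that end in tag t at word i.
    fwd = [{t: START[t] * EMIT[t].get(words[0], 0.0) for t in TAGS}]
    for i in range(1, n):
        fwd.append({
            t: combine(fwd[i - 1][p] * TRANS[(p, t)] for p in TAGS)
            * EMIT[t].get(words[i], 0.0)
            for t in TAGS
        })

    # Backward pass: score of the sequence suffixes that follow tag t at word i.
    bwd = [None] * n
    bwd[n - 1] = {t: 1.0 for t in TAGS}
    for i in range(n - 2, -1, -1):
        bwd[i] = {
            t: combine(
                TRANS[(t, q)] * EMIT[q].get(words[i + 1], 0.0) * bwd[i + 1][q]
                for q in TAGS
            )
            for t in TAGS
        }

    # Combine both directions and normalise so the weights for a word sum to 1.
    weights = []
    for i in range(n):
        raw = {t: fwd[i][t] * bwd[i][t] for t in TAGS}
        total = sum(raw.values()) or 1.0  # avoid dividing by zero
        weights.append({t: raw[t] / total for t in TAGS})
    return weights


if __name__ == "__main__":
    sentence = ["time", "flies"]
    print("best-sequence weights:", tag_weights(sentence, max))
    print("summed weights:       ", tag_weights(sentence, sum))
```

    Either way every candidate tag keeps a weight, so the tagger can output
    all of them rather than only the tags on the single best path.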

    The ENGCG English Constraint Grammar tagger/parser would probably appeal
    to you even more. It assigns each word all of its tags from the lexicon,
    then applies constraint rules to rule out candidates incompatible with the
    context. Usually this leaves only one candidate PoS-tag per word, but where
    the context is ambiguous it leaves more than one tag.
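
    The general idea of elimination by constraints can be sketched as
    follows. This is only a toy, not ENGCG itself: the three-word lexicon,
    the single rule, and the names LEXICON, RULES and disambiguate are made
    up for illustration, and real constraint rules are far richer. The sketch
    also assumes that a rule is never allowed to delete a word's last
    remaining tag.

```python
# Hypothetical mini-lexicon: every word starts with all of its possible tags.
LEXICON = {
    "the": {"DET"},
    "can": {"N", "V", "AUX"},
    "rusts": {"N", "V"},
}


def no_verb_after_determiner(prev_tags, tags):
    """Discard verb readings when the previous word can only be a determiner."""
    if prev_tags == {"DET"}:
        return tags - {"V", "AUX"}
    return tags


# Illustrative rule set; each rule maps (previous tags, tags) to surviving tags.
RULES = [no_verb_after_determiner]


def disambiguate(words):
    """Apply the rules repeatedly until no more candidate tags can be removed."""
    readings = [set(LEXICON[w]) for w in words]
    changed = True
    while changed:
        changed = False
        for i in range(len(words)):
            prev_tags = readings[i - 1] if i > 0 else set()
            for rule in RULES:
                kept = rule(prev_tags, readings[i])
                # Only accept a result that still leaves at least one tag.
                if kept and kept != readings[i]:
                    readings[i] = kept
                    changed = True
    return readings


if __name__ == "__main__":
    for word, tags in zip(["the", "can", "rusts"],
                          disambiguate(["the", "can", "rusts"])):
        print(word, sorted(tags))
```

    Here "can" is reduced to a single tag, while "rusts" keeps two because
    no rule in this toy set applies to its context.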

    For refs to these and more tagsets, see:
    Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, and Wilcock S. 2000.
    A comparative evaluation of modern English corpus grammatical annotation schemes.
    ICAME Journal, volume 24, pages 7-23, International Computer Archive of
    Modern and Medieval English, HIT Centre, Bergen University. ISSN: 0801-5775

    -- 
    Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Tutor
    School of Computing, University of Leeds, LEEDS LS2 9JT
    TEL: (44)113-2335430  FAX: (44)113-2335468
    WWW: http://www.comp.leeds.ac.uk/eric  EMAIL: eric@comp.leeds.ac.uk
    


