Re: Corpora: Question about a Brown Corpus tag

From: Atro Voutilainen (atro.voutilainen@conexor.fi)
Date: Thu Sep 14 2000 - 16:57:12 MET DST

    Eric,

    Thanks for mentioning ENGCG. A recent version, called EngCG-2, can be tested at
    http://www.conexor.fi; an online evaluation paper is also available there.

    Atro Voutilainen

    E S Atwell wrote:
    >
    > Mark Lewellen asked:
    >
    > >An alternative to underspecification of POS information is to develop a
    > >POS tagger that records multiple POS in ambiguous contexts (ideally with
    > >probabilities attached to each POS choice). An advantage to this
    > >approach is that POS-ambiguity information is not 'hard-coded' in advance
    > >by the tag set, but is rather determined by sentence context, and may be
    > >extended to other ambiguities (such as N vs. V).
    > >
    > >Could anyone point out projects that have developed such POS taggers, or
    > >submit opinions as to their viability? One difficulty I notice is that a
    > >typical tagger using an HMM with the Viterbi algorithm determines a most
    > >likely _sequence_, which would make it difficult to establish
    > >probabilities of multiple POS tags for a given word.
    >
    > The CLAWS tagger originally developed to PoS-tag the LOB Corpus did this
    > (as presumably did later versions of CLAWS used on SEC, BNC etc) - the
    > tagger included the option of outputting all tags allocated by
    > lexicon+suffixlist, along with context-dependent weights. The proofreader
    > had to mark all cases where the correct tag wasn't the highest-weighted;
    > then a "cleanup" program "rubbed out" all but the first tag (unless
    > proofreading marked another tag, in which case this was left as the single
    > correct tag). Using a Markov sequence-based model is not a problem - the
    > relative weight attached to a tag can be the weight from the best sequence
    > using it, or the sum of all sequences passing through the tag, or some
    > other function of all sequences including the tag.
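To make the two weighting options above concrete, here is a minimal sketch on a toy Markov model. It is not CLAWS code: the tag set, the probabilities, and the names `sequence_score` and `tag_weights` are all invented for illustration. It enumerates every tag sequence by brute force, which is only feasible for very short sentences; a practical tagger would presumably obtain the same quantities with dynamic programming (Viterbi-style for the best sequence, forward-backward for the sum).

```python
from itertools import product

# Toy model: every number below is an invented assumption, not CLAWS data.
TAGS = ["N", "V"]
START = {"N": 0.6, "V": 0.4}
TRANS = {("N", "N"): 0.3, ("N", "V"): 0.7, ("V", "N"): 0.8, ("V", "V"): 0.2}
EMIT = {
    "N": {"time": 0.6, "flies": 0.4},
    "V": {"time": 0.2, "flies": 0.8},
}


def sequence_score(words, tags):
    """Joint score of one complete tag sequence under the toy model."""
    score = START[tags[0]] * EMIT[tags[0]][words[0]]
    for i in range(1, len(words)):
        score *= TRANS[(tags[i - 1], tags[i])] * EMIT[tags[i]][words[i]]
    return score


def normalise(weights):
    """Scale a {tag: weight} dict so that its values sum to 1."""
    z = sum(weights.values())
    return {tag: w / z for tag, w in weights.items()}


def tag_weights(words):
    """Return (best, total): per-word {tag: weight} dicts.

    best[i][t]  : score of the best sequence that uses tag t at word i
    total[i][t] : sum of the scores of all sequences that use t at word i
    Both are normalised per word so they can be read as relative weights.
    """
    n = len(words)
    best = [dict.fromkeys(TAGS, 0.0) for _ in range(n)]
    total = [dict.fromkeys(TAGS, 0.0) for _ in range(n)]
    for tags in product(TAGS, repeat=n):  # brute force: len(TAGS) ** n sequences
        s = sequence_score(words, tags)
        for i, t in enumerate(tags):
            best[i][t] = max(best[i][t], s)
            total[i][t] += s
    return [normalise(d) for d in best], [normalise(d) for d in total]


if __name__ == "__main__":
    best, total = tag_weights(["time", "flies"])
    for word, b, t in zip(["time", "flies"], best, total):
        print(word, "best:", b, "sum:", t)
```

Either table gives every candidate tag of every word a weight, even though the underlying model scores whole sequences, which is the point made in the paragraph above.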
    >
    > The ENGCG English Constraint Grammar tagger/parser would probably appeal
    > to you even more. This applies all tags from lexicon, then applies
    > constraint-rules to rule out candidates incompatible with context. Usually
    > this leaves only one candidate PoS-tag per word, but where there is an
    > ambiguous context it leaves more than one tag.
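The eliminative idea described above can be sketched roughly as follows. This is not ENGCG itself: the lexicon, the tag names, the single rule, and the function names are hypothetical, chosen only to show the control flow (assign every lexicon tag, discard readings that a contextual rule forbids, never discard the last remaining reading).

```python
# Hypothetical toy lexicon: each word maps to all of its candidate tags.
LEXICON = {
    "the": {"DET"},
    "can": {"N", "V", "AUX"},
    "rusts": {"N", "V"},
}


def no_verb_after_determiner(cands, i):
    """Toy constraint: drop verb-like readings right after an unambiguous DET."""
    if i > 0 and cands[i - 1] == {"DET"}:
        return cands[i] - {"V", "AUX"}
    return cands[i]


RULES = [no_verb_after_determiner]


def disambiguate(words, rules=RULES):
    """Start from all lexicon tags and apply the rules until nothing changes."""
    cands = [set(LEXICON[w]) for w in words]
    changed = True
    while changed:
        changed = False
        for i in range(len(cands)):
            for rule in rules:
                kept = rule(cands, i)
                # Never remove the last reading; only accept a real reduction.
                if kept and kept != cands[i]:
                    cands[i] = kept
                    changed = True
    return cands


if __name__ == "__main__":
    for word, tags in zip(["the", "can", "rusts"],
                          disambiguate(["the", "can", "rusts"])):
        print(word, sorted(tags))
```

With this one rule, "can" is reduced to a single tag while "rusts" keeps both N and V, because no rule in the toy set covers its context. A fuller rule set might resolve it; the sketch only illustrates how residual ambiguity is left in the output rather than forced to a single choice.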
    >
    > For refs to these and more tagsets, see:
    > Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, and Wilcock S. 2000.
    > A comparative evaluation of modern English corpus grammatical annotation schemes.
    > ICAME Journal, volume 24, pages 7-23, International Computer Archive of
    > Modern and Medieval English, HIT Centre, Bergen University. ISSN:0801-5775
    >
    > --
    > Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Tutor
    > School of Computing, University of Leeds, LEEDS LS2 9JT
    > TEL: (44)113-2335430 FAX: (44)113-2335468
    > WWW: http://www.comp.leeds.ac.uk/eric EMAIL: eric@comp.leeds.ac.uk

    -- 
    Atro Voutilainen                              mobile: +358 50 5437452
    Conexor oy                                       fax: +358 9 37468502
    Helsinki Science Park                     atro.voutilainen@conexor.fi
    Koetilantie 3, 00710 Helsinki, Finland          http://www.conexor.fi
    


