Re: [Corpora-List] Part-of-speech tagger

From: Chris Brew (cbrew@ling.ohio-state.edu)
Date: Tue Nov 12 2002 - 13:33:04 MET

    On Mon, Nov 11, 2002 at 08:52:20PM -0500, Afsaneh Fazly wrote:
    >
    > Greetings,
    >
    > I need to build a part-of-speech tagger for a new language
    > (for which there is no PoS-tagger available). For this, I need
    > to hand-annotate a minimum amount of text. I would like to know
    > how much text (minimum of course) I need to hand-tag. Also,
    > for this much text, what is the reasonable size of the tagset
    > used for annotation?
    >
    > Regards,
    >
    > Afsaneh

    The minimal amount of annotated text, strictly speaking, is probably none.
    There is a Computer Speech and Language paper by Julian Kupiec explaining
    how and why it is possible to train an HMM-based POS tagger without annotated
    text. You do need to make decisions about the tagset, and to create a lexicon
    relating words to their possible tags.
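
As a rough illustration of the idea (not the specific method of the paper mentioned above), here is a minimal sketch of plain Baum-Welch (EM) training of an HMM tagger from unannotated sentences, where the only supervision is a lexicon listing the tags each word may take. The tagset, lexicon, corpus and all function names are made up for the example, and there is no probability scaling or smoothing, so it is only suitable for toy sentences.

```python
from collections import defaultdict

# Hypothetical toy tagset, lexicon and unannotated corpus (illustration only).
TAGS = ["DET", "NOUN", "VERB"]
LEXICON = {"the": {"DET"}, "dog": {"NOUN"}, "cat": {"NOUN"},
           "walks": {"NOUN", "VERB"}, "sleeps": {"VERB"}}
CORPUS = [["the", "dog", "walks"], ["the", "cat", "sleeps"],
          ["the", "dog", "sleeps"], ["the", "cat", "walks"]]

def init_hmm(lexicon, tags):
    """Uniform start/transition tables; each tag emits only the words the lexicon allows."""
    start = {t: 1.0 / len(tags) for t in tags}
    trans = {s: {t: 1.0 / len(tags) for t in tags} for s in tags}
    emit = {t: {} for t in tags}
    for word, allowed in lexicon.items():
        for t in allowed:
            emit[t][word] = 1.0
    for t in tags:
        total = sum(emit[t].values())
        for w in emit[t]:
            emit[t][w] /= total
    return start, trans, emit

def forward_backward(sent, tags, start, trans, emit):
    """Unscaled forward and backward probabilities plus the sentence likelihood z."""
    n = len(sent)
    e = lambda t, w: emit[t].get(w, 0.0)  # words a tag cannot emit get probability 0
    alpha = [{t: start[t] * e(t, sent[0]) for t in tags}]
    for i in range(1, n):
        alpha.append({t: e(t, sent[i]) * sum(alpha[i - 1][s] * trans[s][t] for s in tags)
                      for t in tags})
    beta = [dict.fromkeys(tags, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for s in tags:
            beta[i][s] = sum(trans[s][t] * e(t, sent[i + 1]) * beta[i + 1][t] for t in tags)
    z = sum(alpha[n - 1][t] for t in tags)
    return alpha, beta, z

def em_step(corpus, tags, start, trans, emit):
    """One Baum-Welch iteration: collect expected counts, then renormalise."""
    c_start = defaultdict(float)
    c_trans = {s: defaultdict(float) for s in tags}
    c_emit = {t: defaultdict(float) for t in tags}
    for sent in corpus:
        alpha, beta, z = forward_backward(sent, tags, start, trans, emit)
        if z == 0.0:  # e.g. the sentence contains a word missing from the lexicon
            continue
        for i, w in enumerate(sent):
            for t in tags:
                gamma = alpha[i][t] * beta[i][t] / z  # P(tag t at position i | sentence)
                c_emit[t][w] += gamma
                if i == 0:
                    c_start[t] += gamma
                if i + 1 < len(sent):
                    for u in tags:
                        c_trans[t][u] += (alpha[i][t] * trans[t][u]
                                          * emit[u].get(sent[i + 1], 0.0)
                                          * beta[i + 1][u] / z)
    def norm(counts, keys, fallback):
        total = sum(counts[k] for k in keys)
        # Keep the old distribution when a state collected no expected counts.
        return {k: counts[k] / total for k in keys} if total > 0 else fallback
    new_start = norm(c_start, tags, start)
    new_trans = {s: norm(c_trans[s], tags, trans[s]) for s in tags}
    new_emit = {t: norm(c_emit[t], list(c_emit[t]), emit[t]) for t in tags}
    return new_start, new_trans, new_emit

def train(corpus, lexicon, tags, iterations=5):
    params = init_hmm(lexicon, tags)
    for _ in range(iterations):
        params = em_step(corpus, tags, *params)
    return params

def tag(sent, tags, start, trans, emit):
    """Posterior decoding: choose the most probable tag at each position."""
    alpha, beta, _ = forward_backward(sent, tags, start, trans, emit)
    return [max(tags, key=lambda t: alpha[i][t] * beta[i][t]) for i in range(len(sent))]

if __name__ == "__main__":
    model = train(CORPUS, LEXICON, TAGS)
    print(tag(["the", "dog", "walks"], TAGS, *model))
```

Because the lexicon zeroes out impossible word/tag pairs at initialisation, EM can only redistribute probability among the tags the lexicon permits. Whether it settles on the linguistically sensible reading of an ambiguous word depends on the data and the starting point, which is one reason annotated text is still commonly used.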

    But in practice, most people still use annotated text. A Computational
    Linguistics paper by Bernard Merialdo includes a careful measurement of
    when using annotated text is helpful (and there is similar work, from
    about the same time, by David Elworthy). How much you need depends on
    the complexity of the tagset and the text that you use. Once again there
    is good work by Elworthy (from a 1995 EACL workshop) that explains the
    tradeoffs. In practice, many people use tagsets which are close to the
    Brown and/or CLAWS tagsets developed in the early years. But languages
    differ a lot, so it is probably worth thinking carefully about what you
    are doing and why. For languages with richer morphology than English,
    part-of-speech tagging might turn out to be trivial if (a big if) you have
    a good morphological analyser, and impossible otherwise. And so on...

    Chris


