Corpora: Style guides for hand tagging POS

From: David A. Campbell (campbed@flux.cpmc.columbia.edu)
Date: Wed Apr 12 2000 - 09:10:57 MET DST

    Hi,
        I've been using the Penn Treebank Project guidelines for POS tagging
    of English text (Beatrice Santorini). I'm tagging raw (unedited and
    uncorrected) text, and I've had problems assigning tags in the following
    cases:
        1. Misspellings, especially when a word is misspelled as another
    English word: "He was (d)one eating."
        2. Compound nouns that should be hyphenated but aren't: "I had a
    follow up yesterday" vs. "I had a follow-up yesterday."
        3. Tokenization of dates. Should 3/5/00 be tokenized into 3 / 5 /
    00, with each token tagged individually, or should it be kept as a
    single token?
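
    To make the date question concrete, here is a minimal Python sketch of
    the two options. The function names are hypothetical, and the regular
    expression is only an assumption about what counts as a date; neither
    comes from the Penn Treebank guidelines.

```python
import re

# Assumed pattern (not from any guideline): numeric dates such as
# 3/5/00 or 12/31/1999.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")


def tokenize_keep_dates(text):
    """Split on whitespace only, so a date like 3/5/00 stays one token."""
    return text.split()


def tokenize_split_dates(text):
    """Split on whitespace, then break each date into numbers and slashes."""
    tokens = []
    for tok in text.split():
        if DATE_RE.fullmatch(tok):
            # "3/5/00" becomes "3", "/", "5", "/", "00"
            tokens.extend(re.findall(r"\d+|/", tok))
        else:
            tokens.append(tok)
    return tokens


sentence = "Follow up on 3/5/00"
print(tokenize_keep_dates(sentence))   # ['Follow', 'up', 'on', '3/5/00']
print(tokenize_split_dates(sentence))  # ['Follow', 'up', 'on', '3', '/', '5', '/', '00']
```

    With the first option the whole date receives a single tag; with the
    second, each of the five pieces needs a tag of its own.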

    Can someone point me to a guide for tagging when there are errors in the
    text?

    Thank you,

    David Campbell
    Department of Medical Informatics
    Columbia University


