Re: Corpora: Style guides for hand tagging POS

From: E S Atwell (eric@scs.leeds.ac.uk)
Date: Thu Apr 13 2000 - 11:20:11 MET DST

Next message: Geoffrey Sampson: "Corpora: lemmatizer"

Previous message: Monica MERINO: "Corpora: Corpora and language testing"
In reply to: David A. Campbell: "Corpora: Style guides for hand tagging POS"

David,

I think the short answer is "there is no standard answer; different
tagging projects and tagged corpora adopt different policies on corrections
and tokenisation". For example, see

Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, & Wilcock S. 2000.
A comparative evaluation of modern English corpus grammatical annotation
schemes. To appear in ICAME Journal, vol. 24.

Even if you are using existing guidelines (you mention Penn Treebank) you
need not be bound by them if you have good reason to modify them for your
specific text genre (as long as you document your decision in your Corpus
Handbook); after all, this is what other corpus-tagging projects have
done.
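
To make the tokenisation point concrete, here is a minimal Python sketch
of one way a project could turn its date policy into an explicit,
documented switch. The function name tokenize, the split_dates flag and
both regular expressions are illustrative assumptions, not part of the
Penn Treebank guidelines or of any existing tagging tool.

```python
import re

# Hypothetical pattern for numeric dates such as 3/5/00 or 12/31/1999.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")

# Assumed token pattern: a numeric date, a word (optionally with internal
# hyphens, so "follow-up" stays whole), or a single punctuation mark.
TOKEN_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}|\w+(?:-\w+)*|[^\w\s]")


def tokenize(text, split_dates=False):
    """Split text into tokens under one of two date policies.

    split_dates=False keeps a date such as 3/5/00 as a single token;
    split_dates=True breaks it into digits and slashes.
    """
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if split_dates and DATE_RE.fullmatch(tok):
            # The capturing group makes re.split keep the "/" separators.
            tokens.extend(re.split(r"(/)", tok))
        else:
            tokens.append(tok)
    return tokens


print(tokenize("I had a follow-up on 3/5/00."))
print(tokenize("I had a follow-up on 3/5/00.", split_dates=True))
```

Whichever setting is chosen, the point is that it is applied uniformly
and written down in the Corpus Handbook, so that anyone using the tagged
text knows which tokens to expect.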

-- 
Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Tutor
Centre for Computer Analysis of Language And Speech (CCALAS)
School of Computer Studies, University of Leeds, LEEDS LS2 9JT
TEL: (44)113-2335430  FAX: (44)113-2335468
WWW: http://www.scs.leeds.ac.uk/eric  EMAIL: eric@scs.leeds.ac.uk

On Wed, 12 Apr 2000, David A. Campbell wrote:
> Hi,
>     I've been using the Penn Treebank Project Guidelines for POS tagging
> of English text (Beatrice Santorini).  I'm tagging raw (unedited and
> uncorrected) text and I've had some problems assigning tags in some
> cases:
>     1.  Misspellings, especially when a word is misspelled as another
> English word:  "He was (d)one eating."
>     2.  Compound nouns that should be hyphenated, but aren't.  "I had a
> follow up yesterday" vs. "I had a follow-up yesterday"
>     3.  Tokenization of dates.  Should 3/5/00 be tokenized into 3 / 5 /
> 00 and each marked up individually or should it be kept as is?
> 
> Can someone point me to a guide for tagging when there are errors in the
> text?
> 
> Thank you,
> 
> David Campbell
> Department of Medical Informatics
> Columbia University

This archive was generated by hypermail 2b29 : Thu Apr 13 2000 - 11:21:13 MET DST