> > Does anybody know of an existing tool to translate between the BNC C5
> > tag-set and the Penn Tree Bank tag-set?
>
> [...]
> You could alternatively just retag the BNC using a Penn-style tagger, of
> course, given that the BNC data was for the most part automatically tagged.
I'd be very careful there. The 2-million-word BNC core corpus is
hand-corrected, which according to Leech (1997) reduced the error
rate to less than 0.3%. For the full 100-million-word BNC, that paper
mentions an error rate of 1.7% (of all words, excluding punctuation
marks). For the BNC2, the "BNC2 POS-tagging Manual" that comes with
the corpus estimates the overall error rate at 1.15% (cf. also the
BNC Tagging Enhancement Project). So "simple automatic retagging
with a Penn-style tagger" is likely to double or triple your error
rate.
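
To make the arithmetic concrete, here is a minimal sketch that takes the two
full-corpus error rates quoted above and shows what "double or triple" would
mean in absolute terms. The rates come from the message itself; the function
names and the per-million-words framing are only illustrative assumptions,
and no measured Penn-style tagger accuracy is used.

```python
# Error rates quoted in the message, as fractions of tagged words.
# Note: the 1.7% figure excludes punctuation marks (Leech 1997).
BNC_ERROR_RATES = {
    "100-million-word BNC (Leech 1997)": 0.017,
    "BNC2 (POS-tagging manual estimate)": 0.0115,
}


def implied_error_rate(baseline: float, factor: float) -> float:
    """Error rate after multiplying the baseline by a rough factor (hypothetical helper)."""
    return baseline * factor


def errors_per_million(rate: float) -> int:
    """Expected number of mistagged words per million words at the given rate."""
    return round(rate * 1_000_000)


for name, rate in BNC_ERROR_RATES.items():
    doubled = implied_error_rate(rate, 2)
    tripled = implied_error_rate(rate, 3)
    print(
        f"{name}: {rate:.2%} now, {doubled:.2%} to {tripled:.2%} after retagging "
        f"({errors_per_million(rate)} vs. {errors_per_million(doubled)} to "
        f"{errors_per_million(tripled)} errors per million words)"
    )
```

The factors 2 and 3 are just the rough estimate from the paragraph above, so
the output is an illustration of scale rather than a measurement.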
Kind regards,
Detmar
@Manual{leech:97,
  title        = {A Brief Users' Guide to the Grammatical Tagging of the British National Corpus},
  author       = {Geoffrey Leech},
  organization = {UCREL, Lancaster University},
  year         = 1997,
  note         = {\url{http://www.hcu.ox.ac.uk/BNC/what/gramtag.html}}
}
--
Detmar Meurers
The Ohio State University, Department of Linguistics
1712 Neil Avenue, Oxley Hall, Columbus OH 43210-1298, USA
Tel: Int + 614 292-0461   Fax: Int + 614 292-8833
E-Mail: dm@ling.osu.edu   Homepage: http://ling.osu.edu/~dm/
PGP key on web page (use encouraged)

"It is a capital mistake to theorize before one has data. Insensibly
one begins to twist facts to suit theories, instead of theories to
suit facts." Sherlock Holmes in "A Scandal in Bohemia" (A. C. Doyle)