Dear Colleagues,
Those interested in a new-generation part-of-speech tagger are welcome
to address their reflections to gojol@sunu.rnc.ro, especially if they have
purchasing or collaboration intentions (or hints about such). Thank you.
Best regards,
Vladimir V. Gojol
Senior Software Engineer
Institutul National de Informatica
Bucuresti, Romania
............................................................................
I created a part-of-speech tagger with an unusual capacity for dealing
with large contexts, especially for German. I used Negra (seemingly the
best-known German corpus, with a freely obtainable licence). The tagger
currently reputed to be the most accurate for German is perhaps TnT. It
reports an error rate of 3.4% on this corpus. But I have found a
systematic error in Negra: all the occurrences of the auxiliary verbs are
tagged as auxiliary (VAFIN), though in 50% of the cases they function as
finite verbs (VVFIN). I corrected a part of the corpus (ca. 40,000
tokens). In this more correct environment (where the error rate of TnT
would probably be around 4.5%), my tagger gets 1.7%.
On another German corpus (I call it X), with comparable contents
(newspaper articles) and tagset, but with an attached exterior lexicon
(i.e. not extracted from the corpus), the result is 2.4%.
I also used Susanne (the only English corpus I could get for free).
The reported result for TnT is 3.8%. Mine is 2.8%. On the "A" texts,
which are best comparable with those in Negra, being journalistic,
it's 2.3%.
Initially I had used a Romanian corpus, with a result of 0.9% (compared
to 1.7%, 2.5% and 4.2% respectively, obtained by the Xerox, Birmingham
and Brill taggers).
The speed is comparable to that of TnT and modifiable by parameter
setting, in inverse proportion to the accuracy (but without affecting it
much).
The incremental operating mode and the segmentation of the data
structures allow running on computers with very little memory.
There is the advantage of an intuitive output (no hostile binary matrix),
in a form analogous to the input of some expert systems.
Special facilities exist, such as virtual tags, or context
essentialisation (permitting one to obtain the minimal set of contexts
characteristic of a certain linguistic style, useful not only for maximum
accuracy and speed), etc.
All is built on two essentially new concepts: organicity and context
propagation. I didn't publish anything about them, to keep up their
commercial appeal. The accuracy, comparable to that of manual tagging,
made me find many errors in the corpora used: 98 in Negra, 36 in Susanne.
Prof. G. Sampson replied gratefully, saying that it's the first time
somebody has reported more than 2 errors, and that my findings make a new
version of Susanne necessary.
The handling of very large contexts could even modify the current tagset
designs, by cancelling some unnatural decisions (motivated only by the
incapacity of the existing taggers to see beyond a 3-token neighborhood),
such as those concerning the auxiliary verbs, participles, etc., thus
removing some burden from the subsequent stages of text processing.
It is written in C (Linux). Demos for German (Negra or X) and English
(Susanne) are available.
This archive was generated by hypermail 2b29 : Wed Feb 16 2000 - 17:30:05 MET