RE: Corpora: Question about a Brown Corpus tag

From: Mark Lewellen (lewellen@erols.com)
Date: Thu Sep 14 2000 - 15:42:25 MET DST

    Frank Mueller pointed out that it is reasonable to leave POS-information
    underspecified (i.e. group together POS categories that are difficult to
    tag), since POS tagging typically takes place in the context of a larger
    task, such as parsing. The parser can then decide what is the most
    appropriate category (e.g. preposition or conjunction).
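
    To make the underspecification idea concrete, here is a minimal sketch
    of collapsing hard-to-separate tags into one merged tag that a parser
    could later expand. IN (preposition) and CS (subordinating conjunction)
    are Brown tags; the merged label "IN|CS" and both function names are
    assumptions made up for this illustration, not part of any existing
    tagger.

```python
# Hypothetical merged label: "IN|CS" is invented for this sketch.
UNDERSPECIFIED = {"IN": "IN|CS", "CS": "IN|CS"}


def underspecify(tagged_words):
    """Replace tags that are hard to tell apart with a merged tag."""
    return [(word, UNDERSPECIFIED.get(tag, tag)) for word, tag in tagged_words]


def candidates(tag):
    """List the concrete tags a parser would have to choose between."""
    return tag.split("|")


# Illustrative input, tagged in the style of the Brown tag set.
sentence = [("until", "CS"), ("the", "AT"), ("morning", "NN"), ("comes", "VBZ")]
merged = underspecify(sentence)
```

    The parser would then call something like candidates() on each merged
    tag and keep whichever reading yields a well-formed analysis.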

    An alternative to underspecifying POS information is to develop a POS
    tagger that records multiple POS tags in ambiguous contexts (ideally with
    a probability attached to each candidate tag). An advantage of this
    approach is that POS-ambiguity information is not 'hard-coded' in advance
    by the tag set but is determined by sentence context, and the approach
    may be extended to other ambiguities (such as N vs. V).

    Could anyone point out projects that have developed such POS taggers, or
    offer opinions as to their viability? One difficulty I notice is that a
    typical tagger using an HMM with the Viterbi algorithm determines the
    single most likely tag _sequence_, which makes it difficult to establish
    probabilities of multiple POS tags for a given word.
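
    One possible way around this, offered as a sketch and not as a
    description of any existing tagger: on the same HMM, the forward-backward
    algorithm computes for each word the posterior probability of every tag
    given the whole sentence, instead of the single best sequence. The
    two-tag model below and all of its probabilities are invented purely for
    illustration.

```python
def forward_backward(words, states, start_p, trans_p, emit_p):
    """Return one dict per word mapping tag -> P(tag at that word | sentence)."""
    n = len(words)

    # Forward pass: alpha[i][s] = P(words[0..i], tag at i is s)
    alpha = [{} for _ in range(n)]
    for s in states:
        alpha[0][s] = start_p[s] * emit_p[s].get(words[0], 0.0)
    for i in range(1, n):
        for s in states:
            incoming = sum(alpha[i - 1][r] * trans_p[r][s] for r in states)
            alpha[i][s] = incoming * emit_p[s].get(words[i], 0.0)

    # Backward pass: beta[i][s] = P(words[i+1..n-1] | tag at i is s)
    beta = [{} for _ in range(n)]
    for s in states:
        beta[n - 1][s] = 1.0
    for i in range(n - 2, -1, -1):
        for s in states:
            beta[i][s] = sum(
                trans_p[s][r] * emit_p[r].get(words[i + 1], 0.0) * beta[i + 1][r]
                for r in states
            )

    # Probability of the whole sentence under the model.
    total = sum(alpha[n - 1][s] for s in states)
    if total == 0.0:
        raise ValueError("sentence has zero probability under this model")

    return [
        {s: alpha[i][s] * beta[i][s] / total for s in states}
        for i in range(n)
    ]


# Toy N-vs-V model; every number here is an assumption for the example.
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {
    "N": {"time": 0.5, "flies": 0.3, "fast": 0.2},
    "V": {"time": 0.2, "flies": 0.6, "fast": 0.2},
}

posteriors = forward_backward(["time", "flies"], states, start_p, trans_p, emit_p)
for word, dist in zip(["time", "flies"], posteriors):
    print(word, dist)
```

    Each word then carries a full distribution over tags, which a later
    component could threshold or pass on unchanged. Note that the most
    probable tag per word need not coincide with the Viterbi sequence.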

    Mark Lewellen

    > > on 17 Aug 2000 Eric S Atwell wrote:
    > >
    > > > Some tag definitions in Brown were clearly
    > > > decided by what TAGGIT found computable;
    > > > I *guess* linguistic inconsistencies in tagging
    > > > some words may be down to drawing boundaries on
    > > > grounds of computational tractability rather than
    > > > purely linguistic reasons
    > >
    > > on 17 Aug 2000 Andrew Harley wrote:
    > >
    > > > This explains how so many taggers can claim 95% or higher
    > > > success rates!
    > >
    > > > I also know taggers that tagged IN as "preposition
    > > > or conjunction" on the same grounds.
    > > ------------------------
    >
    > This is a reasonable decision, because you cannot resolve this ambiguity
    > on the grounds of the immediate context (which most taggers use). It is,
    > thus, better to keep the POS-information underspecified and resolve the
    > ambiguity, when you are doing the parse. Otherwise, your parser has to
    > work with unreliable information.
    >
    > > So what could be the linguistic reasons that Eric was mentioning? For me
    > > (with a rather limited linguistic background) the "traditional" criteria
    > > for POS determination look quite arbitrary or let's say heuristic.
    > >
    > > I cannot, for instance, see any advantage of separating "until" in:
    > > * until tomorrow (preposition)
    > > * until the morning comes (subordinating conjunction)
    >
    > I agree that you can (or even should) also leave this underspecified
    > until you do a full parse. However, at some point you have to make a
    > decision, because you have to annotate clauses and you have to annotate
    > prepositional phrases. Now, the 'until' (when it is a connector) gives
    > you a good cue where the clause starts.
    >
    > > while not separating "and" in:
    > > * you and me (coordinating conjunction)
    > > * I go and see (coordinating conjunction)
    >
    > As 'and' coordinates constituents of the same kind, you can analyse
    > sentences like:
    >
    > 'I came and see.' as: [CL [NP [N I]] [VP [V came] [CO and] [V see]]]
    > (my ad-hoc annotation ;-))
    >
    > The use of 'and' does not affect the 'global' structure of the clause.
    > However, this is clearly different for 'until' as it introduces a
    > prepositional phrase in the one case and a clause in the other.
    >


