Re: Corpora: Question about a Brown Corpus tag

From: Dirk Ludtke (i60x0378@ip.media.kyoto-u.ac.jp)
Date: Thu Sep 14 2000 - 09:22:41 MET DST

    This thread is already a month old, but I still have some questions.

    -----------------------
     
    on 16 Aug 2000 David Campbell wrote:

    > 'Who' and 'That' are tagged by Brown as 'Wh'
    > pronouns (WPS) when introducing relative
    > clauses, but 'which' retains its determiner
    > tag WDT. I am at a loss as to why.

    on 17 Aug 2000 Eric S Atwell wrote:

    > Some tag definitions in Brown were clearly
    > decided by what TAGGIT found computable;
    > I *guess* linguistic inconsistencies in tagging
    > some words may be down to drawing boundaries on
    > grounds of computational tractability rather than
    > purely linguistic reasons

    on 17 Aug 2000 Andrew Harley wrote:

    > This explains how so many taggers can claim 95% or higher success rates!

    > I also know taggers that tagged IN as "preposition
    > or conjunction" on the same grounds.

    ------------------------

    So what could be the linguistic reasons that Eric mentioned? To me
    (with a rather limited linguistic background), the "traditional" criteria
    for POS determination look quite arbitrary, or let's say heuristic.

    I cannot, for instance, see any advantage of separating "until" in:
    * until tomorrow (preposition)
    * until the morning comes (subordinating conjunction)

    while not separating "and" in:
    * you and me (coordinating conjunction)
    * I go and see (coordinating conjunction)

    or "with" in:
    * to see with a telescope (preposition)
    * the man with the telescope (preposition).

    Or why should I call the German "entlang" (along) a PREposition,
    even if it is behind the noun phrase:
    * den Fluss entlang (along the river)
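
    To make the English examples above concrete, here is a minimal sketch
    that counts how many distinct tags each connective receives. The tag
    labels follow my reading of the Brown tagset (IN = preposition, CS =
    subordinating conjunction, CC = coordinating conjunction); the tags are
    assigned by hand, not by a real tagger, and the names TAGGED_EXAMPLES,
    tags_per_word and merge_tags are made up for this illustration.

```python
from collections import defaultdict

# Hand-tagged toy data: (phrase, connective, tag).
# Assumption: Brown-style labels IN / CS / CC, assigned by hand.
TAGGED_EXAMPLES = [
    ("until tomorrow", "until", "IN"),
    ("until the morning comes", "until", "CS"),
    ("you and me", "and", "CC"),
    ("I go and see", "and", "CC"),
    ("to see with a telescope", "with", "IN"),
    ("the man with the telescope", "with", "IN"),
]


def tags_per_word(examples):
    """Collect the set of tags each connective receives across its contexts."""
    tags = defaultdict(set)
    for _phrase, word, tag in examples:
        tags[word].add(tag)
    return dict(tags)


def merge_tags(examples, mapping):
    """Relabel tags through a mapping, e.g. collapsing CS into IN."""
    return [(phrase, word, mapping.get(tag, tag)) for phrase, word, tag in examples]


if __name__ == "__main__":
    # "until" is split across two tags; "and" and "with" are not.
    for word, tags in sorted(tags_per_word(TAGGED_EXAMPLES).items()):
        print(word, sorted(tags))

    # A tagset with a single "preposition or conjunction" tag removes the split.
    collapsed = merge_tags(TAGGED_EXAMPLES, {"CS": "IN"})
    print("until after merging:", sorted(tags_per_word(collapsed)["until"]))
```

    Only "until" is split across two tags, although "and" and "with" also
    appear in two different syntactic environments. Merging CS into IN, as
    in the "preposition or conjunction" tag Andrew mentioned, makes the
    split disappear but of course also discards the distinction.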

    --------------------------

    But I am sure that there is theoretical linguistic work on POS
    categorization without these kinds of inconsistencies. And I am almost
    sure that people who tag corpora think not only about the accuracy of
    their results, but also about the needs of future users, or at least
    about linguistic credibility.

    I therefore don't understand why connective parts of speech (like
    relative pronouns, conjunctions, conjunctive adverbs, ...) are modelled
    in such a neglectful way in all the corpora I have seen so far.

    Or are there maybe approaches I am not aware of?
    Or is it maybe too difficult or even impossible to make it "good"?

    --------------------------

    Dirk Ludtke

    Language Media Lab
    Kyoto University


