Corpora: Collaborative venture

From: Jem Clear (jem@cobuild.collins.co.uk)
Date: Tue Jun 13 2000 - 13:23:20 MET DST

    Re: the points raised by Eric Atwell (et al.) (see snippet below).

    > >I agreed if the sense tags have completely different meaning. However,
    > >the differences in meaning between tags may be in shades of meaning
    > >rather than the crisp decision that they are or not same....

    > ... I don't believe there is a clear, "self-evident" set of semantic
    > tags. Semantic tagging could instead aim to annotate each word with
    > a SET of semantic features, and "disambiguation" could aim to
    > eliminate semantic features incompatible with context; this would
    > allow for overlap and indeterminate sense-tagging. The set of
    > semantic features for a word could be a bundle of semantic
    > information, for example the lemma/root, subject-category code,
    > selection restrictions, and meaning definition from LDOCE; instead
    > of sense-tagging, if the aim was to eliminate features which were
    > incompatible with context, you should get more inter-annotator
    > agreement.

    Oh dear! No, no, no. OK. Maybe I was being a little naive in
    thinking that a large group of corpus linguists could even begin
    to agree on a simple, but potentially useful, collaborative
    scheme. A project in "semantic tagging" seems to my way of
    thinking precisely what we do *not* need -- or rather we have
    plenty of such projects going on at the moment anyway so there's
    no widespread benefit to the linguistic community in having
    a few more people sitting round discussing what exactly *is*
    the set of primitive semantic components or how a semantic "entry"
    should be structured or whatever.

    I was feeling reckless last Friday afternoon so thought I'd float
    an extremely simple idea based on the assumption that speakers
    of English (native or non-native) have some ability to pick from
    a number of offered citations those which in their opinion match
    a given dictionary definition. I am not so foolish as to believe:

    a) that all respondents would select the same citations if offered the
    same source set (this is the Consensus Issue)

    b) that the dictionary definition is "true" or "correct" or clearly defines
    the boundaries of a word sense (this is the Which Tagset? Issue)

    c) that all citations selected by respondents would be "correct" (this
    is the Quality Control Issue: aka the Noise Problem)
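
    To make the idea concrete, here is a rough sketch of how the
    responses might be pooled. It is only an illustration: the data
    layout, the function names and the min_share cut-off are all
    assumptions of mine, not part of any agreed scheme, and a simple
    vote share is just one crude way of living with the Consensus
    and Noise issues.

```python
from collections import Counter


def tally_selections(selections):
    """Count how many respondents picked each citation for one definition.

    `selections` maps a (hypothetical) respondent id to the list of
    citation ids that respondent judged to match the definition.
    """
    counts = Counter()
    for picked in selections.values():
        # A respondent counts at most once per citation, even if listed twice.
        counts.update(set(picked))
    return counts


def agreed_citations(selections, min_share=0.5):
    """Return citations chosen by at least `min_share` of respondents.

    `min_share` is an arbitrary, assumed threshold for illustration only.
    """
    total = len(selections)
    if total == 0:
        return []
    counts = tally_selections(selections)
    return sorted(c for c, k in counts.items() if k / total >= min_share)


# Made-up responses for a single dictionary definition.
selections = {
    "r1": ["c1", "c2"],
    "r2": ["c1", "c3"],
    "r3": ["c1", "c2", "c2"],
}

majority = agreed_citations(selections)          # picked by at least half
unanimous = agreed_citations(selections, 1.0)    # picked by everyone
```

    Nothing here decides whether a definition is "right"; it only
    records how many people offered the same bucketload.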

    Suppose that in primitive times, when the only routes connecting towns
    and villages were rough, muddy tracks, someone proposes that the
    community build a road by bringing bucketloads of rubble, stones, ash,
    or whatever and packing it down to make a hard flat surface. As soon as this
    idea is proposed, one group of villagers get very excited because
    no-one has told them how wide the proposed road should be (just wide
    enough for one cart -- or wide enough for two carts to pass?). A wise
    man from another town questions whether straw should be added to the
    stones being thrown down -- straw may disintegrate and not last
    through winter rains. Others get into fierce arguments about whether
    the road should go straight from one village to another or should wind
    around avoiding hills, deep valleys, marshland, etc.

    You get the idea! Just a few people bring along a few bucketloads of
    stones and rubble and the road extends for no more than 5 metres,
    despite the fact that almost everyone agrees that a road of some sort
    would be much better than the rutted, filthy, muddy track along which
    they have to walk, ride, or drive their livestock.

    Linguistics is such fun, isn't it?

    Jem Clear

    Electronic Development Director     phone: +44 (0)121-414-3926
    Collins Dictionaries                fax:   +44 (0)121-414-6203
    Westmere, 50 Edgbaston Park Road    email: jem@cobuild.collins.co.uk
    Birmingham, B15 2RX, UK             WWW:   www.cobuild.collins.co.uk