Corpora: XML programmes and tagging

From: Tom Emerson (tree@cymru.basistech.com)
Date: Fri May 19 2000 - 16:34:57 MET DST


    Gabriella Rundblad writes:
    > 1) As far as I understand, it is today recommended to use
    > XML for tagging purposes. For this I'll need user-friendly
    > programme(s), the question is which. I know there are both
    > free ware, share ware and commercial products out there,
    > though I've never tried (yet) either of them and don't
    > know how user-friendly they are. I know HTML and use
    > Hotmetal Pro for this (great!) and there is obviously an
    > XML equivalent (XMetal). Could you advice what programme(s)
    > to use?! Is XMetal good for a never-before-tagger?!

    I've been building large monolingual and parallel corpora for
    Simplified and Traditional Chinese texts (SC<>TC parallel) and have
    ended up using GNU Emacs for all of my editing. I have not been able
    to find another tool that allows me to create and edit documents using
    Unicode (the only way to handle SC and TC within a single document).
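
    To make the encoding point concrete, here is a minimal sketch, assuming a
    current Python 3 with its standard gb2312 and big5 codecs. The two sample
    characters (the Simplified and Traditional forms of "han") are illustrative
    choices, not taken from any corpus. Each one fails in the other script's
    legacy encoding, while UTF-8 accepts both in one string.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


SIMPLIFIED = "\u6c49"    # Simplified Chinese form of "han"
TRADITIONAL = "\u6f22"   # Traditional Chinese form of "han"
MIXED = SIMPLIFIED + TRADITIONAL

# Report, per encoding, whether each sample and the mixed string fit.
for enc in ("gb2312", "big5", "utf-8"):
    print(enc,
          can_encode(SIMPLIFIED, enc),
          can_encode(TRADITIONAL, enc),
          can_encode(MIXED, enc))
```

    Only the UTF-8 row reports that the mixed string can be stored, which is
    why a Unicode-capable editor is needed for SC<>TC parallel text.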

    There are SGML and XML modes for Emacs that are useful, though for my
    purposes (and with my DTD, see below) I just insert the markup
    manually or with the help of a few Python scripts I put together to
    massage the source texts.
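
    As a rough idea of what such a markup-inserting script can look like, here
    is a hedged sketch in present-day Python 3. The element names (text, p),
    the id attribute, and the function name wrap_paragraphs are hypothetical
    and are not taken from the DTD mentioned above. It splits plain text on
    blank lines, escapes the XML special characters, and wraps each paragraph
    in an element.

```python
import re
from xml.sax.saxutils import escape


def wrap_paragraphs(raw):
    """Wrap blank-line-separated paragraphs in hypothetical <p> elements."""
    # Split on blank lines, then collapse internal whitespace in each block.
    blocks = (" ".join(b.split()) for b in re.split(r"\n\s*\n", raw))
    paras = [b for b in blocks if b]
    body = "\n".join(
        '  <p id="p{}">{}</p>'.format(i, escape(p))  # escape handles &, <, >
        for i, p in enumerate(paras, 1)
    )
    return "<text>\n{}\n</text>".format(body)


sample = "AT&T rose.\nMore.\n\nSecond <para>."
print(wrap_paragraphs(sample))
```

    Escaping before wrapping matters: a stray ampersand or angle bracket in a
    source text would otherwise leave the result ill-formed.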

    I ruled out XMetaL when it was first released because of its makers'
    refusal to fully support Unicode, which is essential for my purposes.
    For your needs it probably is not a problem: eth and thorn (upper- and
    lowercase) are both in ISO 8859-1 (Latin-1). If you need yogh, though,
    you're out of luck, since it is not in Latin-1.
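
    The Latin-1 coverage is easy to check for yourself. The following is a
    small sketch assuming Python 3 and its standard unicodedata module; the
    helper name in_latin1 is made up for illustration. It tries to encode eth,
    thorn, and yogh (upper- and lowercase) as ISO 8859-1 and prints whether
    each one fits.

```python
import unicodedata


def in_latin1(ch):
    """Return True if the character can be encoded in ISO 8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Eth, thorn, and yogh, each in upper- and lowercase.
LETTERS = "\u00d0\u00f0\u00de\u00fe\u021c\u021d"

for ch in LETTERS:
    print("U+{:04X} {:<30} {}".format(ord(ch), unicodedata.name(ch), in_latin1(ch)))
```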

    > 2) The tagging I would like to do (I'm reading up on TEI
    > etc) is a tagging of phrases and clauses, not parts of
    > speech. What's been done on this earlier? Any lists of tags
    > etc?

    Take a look at the Corpus Encoding Standard,

        http://www.cs.vassar.edu/CES/

    and its XML counterpart, XCES,

        http://www.cs.vassar.edu/XCES

    I couldn't use these for my purposes because they lack support for
    East Asian languages, and right now I don't need that complexity for
    my internal work. So I rolled my own DTD, which works fine for me. In
    the long term I would like to move to XCES. Unfortunately, my attempts
    to get involved in that effort have gone unanswered.

           -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Language Hacker                                    http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
    


