RE: Corpora: non-alphabetic language databases

From: Mcenery, Tony (eiaamme@exchange.lancs.ac.uk)
Date: Thu Nov 30 2000 - 13:19:06 MET


    Hi,

    I agree with Thomas that Unicode is promising, at least in terms of encoding
    the characters. However, rendering Unicode in a readable form can be a major
    task. This is especially true for non-alphabetic writing systems, where glyphs
    of the writing system may not actually be represented in Unicode, but are
    instead generated by a rendering engine. The available font rendering engines
    which can actually display Unicode text accurately are few and far between. In
    terms of corpus processing I am aware of no system which actually renders
    Unicode accurately for all languages. Help is on the horizon:

    1.) I believe Mike Scott is working on a version of WordSmith which may both
    read Unicode text and render it appropriately.
    2.) I am currently working with the GATE team at Sheffield towards making a
    version of GATE which renders a wide range of writing systems encoded in
    Unicode, but it is laborious work.
    3.) SIL International are developing a font rendering engine called Graphite
    which should be embeddable in corpus processing systems.

    So while I think Unicode is the way for corpus work to go in the future,
    treading that path with non-alphabetic writing systems is, at present,
    somewhat difficult.

    T
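
To illustrate the encoding-versus-rendering point above, here is a minimal Python sketch. The Devanagari syllable is an illustrative choice and does not come from the original message. The text is stored as four code points, but a reader expects to see a single conjunct shape, and producing that shape is the job of the rendering engine, not of the encoding.

```python
import unicodedata

# Devanagari syllable "kshi": an illustrative example, not from the original post.
text = "\u0915\u094d\u0937\u093f"  # KA + VIRAMA + SSA + VOWEL SIGN I


def describe(s):
    """Return (code point, Unicode name) for every character stored in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]


for cp, name in describe(text):
    print(cp, name)

# Four code points are encoded; a shaping engine has to merge them into one
# conjunct glyph and place the vowel sign to its left when drawing the text.
print(len(text))
```

A corpus tool that only reads the code points can count and search this text correctly, yet still display it badly if its rendering layer does no shaping.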

    > -----Original Message-----
    > From: Thomas Schmidt [SMTP:thomas.schmidt@uni-hamburg.de]
    > Sent: 30 November 2000 12:00
    > To: corpora@hd.uib.no
    > Subject: AW: Corpora: non-alphabetic language databases
    >
    > The Unicode standard is indeed a promising solution for representing
    > non-alphabetic characters of any kind. Concerning the original question: I
    > don't know much about sign languages, but I wouldn't be surprised if the
    > Unicode Consortium has taken or will take these into account. If they don't,
    > the design of the Unicode standard leaves room for user-defined symbols, so
    > it should be possible, for instance, to code alphabetic and sign language
    > symbols within one document.
    > The Unicode homepage is at
    >
    > http://www.unicode.org/
    >
    > -----Original Message-----
    > From: Simon G. J. Smith [SMTP:smithsgj@eee.bham.ac.uk]
    > Sent: Thursday, 30 November 2000 12:34
    > To: corpora@hd.uib.no
    > Subject: Re: Corpora: non-alphabetic language databases
    >
    >
    > Paula
    >
    > Have a look at www.chinesecomputing.com
    >
    > Are you a student of one of these languages? Take a look at a website from
    > one of the countries, without character-reading software running, and you
    > will see that each character is represented by two ASCII characters -
    > usually obscure things like ^ or ` and others that are not on the qwerty
    > keyboard at all.
    >
    > My understanding is this: order of database entry is not based on any
    > phonetic system, nor on any arrangement of radicals or character
    > components, but on a standard (for Chinese, usually one of Big-5 or GB
    > (Guo-Biao)) which maps each character on to an arbitrary pair of ASCII
    > characters. With the advent of the Unicode standard, a one-to-one mapping
    > is also now possible, but implementations are rare.
    >
    > I'm not an expert: perhaps there's one around who would care to add their
    > comments?
    >
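
The two-characters-per-character effect described in the quoted message can be reproduced with a short Python sketch. The sample character is an illustrative choice, not taken from the thread, and the codec names are those of Python's standard library. It shows one Chinese character encoded under GB, Big-5 and Unicode, and what the GB bytes look like when software treats them as ordinary 8-bit Western text.

```python
# Illustrative character (zhong, "middle"); not taken from the original post.
ch = "\u4e2d"

gb = ch.encode("gb2312")         # GB (Guo-Biao) encoding
big5 = ch.encode("big5")         # Big-5 encoding
utf16 = ch.encode("utf-16-be")   # one 16-bit Unicode code unit

# Each legacy standard assigns the same character a different pair of bytes.
print(gb.hex(), big5.hex(), utf16.hex())

# Without Chinese-aware software the two GB bytes are shown as two unrelated
# 8-bit characters (decoded here as Latin-1).
garbled = gb.decode("latin-1")
print(garbled)

# Decoding with the right standard recovers the original character.
print(gb.decode("gb2312") == ch)
```

Strictly speaking the pair consists of bytes with the high bit set, so what a Western browser shows are accented Latin letters and symbols, not 7-bit ASCII.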


