Corpora: corpus of Information Tech.

From: Kim Tan (kimmy1003@hotmail.com)
Date: Thu Nov 23 2000 - 03:35:45 MET

    Hi all,

    I'm currently involved in a small project on terminology building for
    Information Technology (IT) from two content-parallel corpora, i.e. English
    and Malay. The texts are not translations of each other but weekly newspaper
    pullouts on IT, both dealing with the same subject area.

    At this stage, we're still handling the English articles and trying to
    identify IT-specific words. From a corpus of nearly 400,000 words, a
    wordlist has been generated based on frequency counts. This list is compared
    with the wordlist of a general corpus of Malaysian English (ME) of 300,000
    words. The ME frequencies are then adjusted, after which a frequency index
    is calculated for each word. Words whose index shows them to be
    over-represented in the specialized corpus compared with the general corpus
    are then said to be IT-specific words.
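
    To make the procedure concrete, here is a rough sketch in Python. It is
    only an illustration under assumptions of my own: that "adjusting" the ME
    frequencies means scaling them to the size of the IT corpus, that the
    frequency index is the ratio of the IT frequency to the adjusted ME
    frequency, and that 1 is added to every ME count so that words absent from
    the general corpus do not cause a division by zero. The function names
    (word_frequencies, frequency_index, over_represented) and the threshold
    are hypothetical, not part of any existing tool.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lower-case the text and count alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def frequency_index(special, general, special_size, general_size):
    """Ratio of specialized-corpus frequency to adjusted general frequency.

    Assumption: the general-corpus counts are adjusted by scaling them to
    the size of the specialized corpus.
    """
    scale = special_size / general_size
    index = {}
    for word, freq in special.items():
        # Assumption: add 1 to the general count so that words unseen in
        # the general corpus still get a finite index.
        adjusted = (general.get(word, 0) + 1) * scale
        index[word] = freq / adjusted
    return index


def over_represented(index, threshold=2.0):
    """Words whose index exceeds the threshold, highest index first."""
    hits = [(word, value) for word, value in index.items() if value > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy texts standing in for the two corpora.
    it_text = "the server sends the file to the client over the network"
    me_text = "the man went to the market and the man came home"
    it_freq = word_frequencies(it_text)
    me_freq = word_frequencies(me_text)
    idx = frequency_index(it_freq, me_freq,
                          sum(it_freq.values()), sum(me_freq.values()))
    for word, value in over_represented(idx, threshold=0.9):
        print(f"{word}\t{value:.2f}")
```

    The threshold is arbitrary here; where to draw the line between
    "over-represented" and "not" is exactly the part I am unsure about.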

    My question is whether this would be a valid claim, and also whether there
    are other ways (statistical or otherwise) of identifying words that are
    specific to a specialized area. As I'm rather new to this area, I'd
    appreciate any form of input.

    Seeking your expertise

    KIM
    National Univ. of Malaysia


