Re: Corpora: Relative text length

From: spela vintar (vintar@dfki.de)
Date: Wed Apr 24 2002 - 15:49:25 MET DST

    Hi Andrew,

    For Eastern European languages, you can compare the lengths of Orwell's 1984
    and its translations, which were collected within the Multext-East project.
    The original Multext project (http://www.lpl.univ-aix.fr/projects/multext/)
    should provide the same for English, German, French, Spanish, etc.; however,
    I wasn't able to find it on their homepage at first glance...

    Best,
    Spela

    http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
    //////////////
    ...
    Below we give an estimate for the number of words, by language. The
    wordcounts were produced by removing the SGML tags from the texts and then
    using a 'wc'-like procedure.

      English      104.302
      Romanian     101.460
      Slovene       91.619
      Bulgarian     87.235
      Czech         80.366
      Hungarian     81.147
      Estonian      79.334
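
    To turn these counts into the kind of index Andrew asked for, one could
    divide each count by the count for a chosen base language and multiply by
    100. The sketch below is only an illustration, not the Multext-East
    procedure: it assumes the dot in figures like "104.302" is a thousands
    separator, and the names WORD_COUNTS, count_words and relative_lengths are
    made up for this example. The tag stripping is a rough regular-expression
    stand-in for "removing the SGML tags", not a real SGML parser.

```python
import re

# Word counts from the table above, reading "104.302" as 104302
# (assumption: the dot is a thousands separator).
WORD_COUNTS = {
    "English": 104302,
    "Romanian": 101460,
    "Slovene": 91619,
    "Bulgarian": 87235,
    "Czech": 80366,
    "Hungarian": 81147,
    "Estonian": 79334,
}

# Anything between "<" and ">" is treated as a tag (a crude approximation).
TAG_RE = re.compile(r"<[^>]*>")


def count_words(sgml_text):
    """Replace tag-like spans with spaces, then count whitespace-separated
    tokens, roughly what 'wc -w' would report."""
    return len(TAG_RE.sub(" ", sgml_text).split())


def relative_lengths(counts, base="English"):
    """Return each count as an index relative to the base language (= 100)."""
    base_count = counts[base]
    return {lang: round(100 * n / base_count, 1) for lang, n in counts.items()}


if __name__ == "__main__":
    for lang, index in relative_lengths(WORD_COUNTS).items():
        print(f"{lang}: {index}")
```

    With English as the base, Romanian comes out a little above 97 and
    Estonian a little above 76. Note that this is an index of word counts,
    not of characters or printed length, so it may differ from what you would
    get by measuring the texts in characters.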

    Andrew Bredenkamp wrote:

    > Hello everyone,
    >
    > Does anyone know where I can find a list of relative text length?
    >
    > Taking one language as an index (100), I would like a list of the (other)
    > main European languages - e.g. (made up):
    >
    > Spanish: 100
    > English: 105
    > French: 110
    > German: 85
    >
    > ... etc.
    >
    > Thanks a lot in advance for any help you can give me.
    >
    > Cheers,
    > Andrew
    > =========================================
    > Andrew Bredenkamp
    > acrolinx GmbH
    > URL: www.acrolinx.com
    >
    > =========================================


