Corpora: ELRA News

From: Valerie Mapelli (mapelli@elda.fr)
Date: Thu Dec 28 2000 - 11:54:20 MET

    [ We apologise for the duplicate posting of this announcement ]
    ___________________________________________________________
                                    ELRA
                    European Language Resources Association
                                   ELRA News
    ___________________________________________________________

                         *** ELRA NEW RESOURCES ***

    We are happy to announce new resources available via ELRA:

    - Telephone Speech Resources
          ELRA-S0090 Polish SpeechDat(E) Database
          ELRA-S0092 Portuguese SpeechDat(II) FDB-4000
    - Desktop/Microphone Speech Resources
          ELRA-S0087 BABEL Hungarian Database
          ELRA-S0088 Twin database - TWINDB1
          ELRA-S0089 Albayzin corpus
          ELRA-S0093 IBNC - An Italian Broadcast News Corpus
    - Speech Related Resources
          ELRA-S0091 Pronunciation lexicon of British place names,
          surnames and first names
    - Written Corpus
          ELRA-W0025 A "scientific" corpus of modern French
          (La Recherche magazine)
    - Multilingual Lexicons
          ELRA-M0025 Bilingual English-Russian Russian-English Dictionaries

    A short description of each database is given below.
    _______________________________________
    TELEPHONE SPEECH RESOURCES
    _______________________________________
    - ELRA-S0090 Polish SpeechDat(E) Database
    This database comprises 1000 Polish speakers (488 males,
    512 females) recorded over the Polish fixed telephone network.
    - ELRA-S0092 Portuguese SpeechDat(II) FDB-4000
    This database comprises 4027 Portuguese speakers (1861 males,
    2166 females) recorded over the Portuguese fixed telephone network.
    _______________________________________
    DESKTOP/MICROPHONE SPEECH RESOURCES
    _______________________________________
    - ELRA-S0087 BABEL Hungarian Database
    The BABEL Database is a speech database that was produced by
    a research consortium funded by the European Union under the
    COPERNICUS programme (COPERNICUS Project 1304).
    The Hungarian database consists of:
    - the basic "common" set which contains the Many Talker Set (30 males,
    30 females), Few Talker Set (4 males, 4 females), Very Few Talker Set
    (1 male, 1 female);
    - and the extension part: a short description of the Hungarian sound system.
    - ELRA-S0088 Twin database - TWINDB1
    The Twin database named TWINDB1 includes recordings of 45 French
    speakers, consisting of 9 pairs of identical twins (8 males and 10 females)
    with similar voices, and 27 other speakers (13 males and 14 females)
    including 4 non-twin siblings.
    - ELRA-S0089 Albayzin corpus
    This corpus consists of 3 sub-corpora of 16 kHz, 16-bit signals,
    recorded by 304 Castilian speakers: Phonetic corpus, Geographic corpus,
    and "Lombard" corpus.
    - ELRA-S0093 IBNC - An Italian Broadcast News Corpus
    Produced within the European Commission funded project LRsP&P
    (Language Resources Production & Packaging - LE4-8335), the collection
    consists of 150 broadcast programs from the RAI, for a total time of about
    30 hours, aired on 36 different days between 1992 and 1999. The audio is
    down-sampled to 16 kHz, 16 bit, and encoded in the NIST Sphere PCM
    format.
    _______________________________________
    SPEECH RELATED RESOURCES
    _______________________________________
    - ELRA-S0091 Pronunciation lexicon of British place names, surnames and
    first names
    This pronunciation lexicon, produced within the European Commission funded
    project LRsP&P (Language Resources Production & Packaging - LE4-8335),
    is an SGML-encoded database. It contains 160,000 entries of British
    place names, surnames and first names. All phonemic transcriptions in the
    database are based on the SAMPA phonetic alphabet.
    _______________________________________
    WRITTEN CORPUS
    _______________________________________
    - ELRA-W0025 A "scientific" corpus of modern French (La Recherche magazine)
    Produced within the European Commission funded project LRsP&P (Language
    Resources Production & Packaging - LE4-8335), the corpus contains all articles
    published in La Recherche magazine in 1998, issues 305 (January) to
    315 (December), which amounts to 447,244 tokens and 30,238 types. Two
    versions are available: the raw data (XML format) and the complete version
    (XML and SGML formats).
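
As a rough illustration of what the token and type counts quoted for ELRA-W0025 imply, the short Python sketch below computes the type/token ratio and the average number of occurrences per type. Only the two counts come from the announcement; the variable names are illustrative, and how ELRA tokenised the corpus is not stated here.

```python
# Counts quoted in the announcement for ELRA-W0025 (La Recherche, 1998).
tokens = 447_244  # running words
types = 30_238    # distinct word forms

# Type/token ratio: share of distinct forms among all running words.
ttr = types / tokens

# Inverse view: how often each distinct form occurs on average.
tokens_per_type = tokens / types

print(f"type/token ratio: {ttr:.4f}")
print(f"average occurrences per type: {tokens_per_type:.1f}")
```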
    _______________________________________
    MULTILINGUAL LEXICONS
    _______________________________________
    - ELRA-M0025 Bilingual English-Russian Russian-English Dictionaries
    Produced within the European Commission funded project LRsP&P (Language
    Resources Production & Packaging - LE4-8335), these bilingual dictionaries
    contain more than 350,000 pairs of words (in tabular form) in XML format:
         1) Russian-English dictionary - more than 130,000 entries
         2) English-Russian dictionary - more than 95,000 entries
    Each entry contains: source word (lemma); part of speech of source word;
    target word(s) (lemma(s)), grouped by same meaning; part of speech of target
    word(s); domain(s).
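
The entry layout described above could be modelled along the following lines. This is a minimal Python sketch, not the actual schema: the class name, the field names and the sample entry are all hypothetical, and the real XML element names in ELRA-M0025 are not given in this announcement.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DictionaryEntry:
    """Hypothetical in-memory model of one bilingual dictionary entry."""
    source_lemma: str               # source word (lemma)
    source_pos: str                 # part of speech of the source word
    target_groups: List[List[str]]  # target lemmas, one inner list per meaning
    target_pos: List[str]           # part of speech for each meaning group
    domains: List[str] = field(default_factory=list)  # subject domain(s)

    def all_targets(self) -> List[str]:
        """Flatten the meaning groups into a single list of target lemmas."""
        return [lemma for group in self.target_groups for lemma in group]


# Invented sample entry for illustration only; not taken from the dictionary.
entry = DictionaryEntry(
    source_lemma="bank",
    source_pos="noun",
    target_groups=[["банк"], ["берег"]],
    target_pos=["noun", "noun"],
    domains=["finance", "geography"],
)

print(entry.all_targets())
```

Grouping target words in nested lists keeps translations that share a meaning together, which mirrors the "grouped by same meaning" wording of the description.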

    =====================================
    For further information, please contact:

          ELRA/ELDA                      Tel +33 01 43 13 33 33
          55-57 rue Brillat-Savarin      Fax +33 01 43 13 33 30
          F-75013 Paris, France          E-mail mapelli@elda.fr

    or visit our Web site:

          http://www.icp.grenet.fr/ELRA/home.html
          or http://www.elda.fr
    =====================================


