Corpora: ELRA News

From: Valerie Mapelli (mapelli@elda.fr)
Date: Mon Jul 31 2000 - 15:22:11 MET DST

    [ We apologise for the duplicate posting of this announcement ]
    ___________________________________________________________
                                    ELRA
                    European Language Resources Association
                                   ELRA News
    ___________________________________________________________

                         *** ELRA NEW RESOURCES ***

    We are happy to announce new resources available via ELRA:

    ELRA-S0034 New Verbmobil databases
    ELRA-S0084 SALA Spanish Colombian Database
    ELRA-L0042 PAROLE Spanish lexicon

    A description of each database is given below.

    _______________________________________
    ELRA-S0034 Verbmobil
    _______________________________________

    This resource consists of spontaneous speech recorded in
    a dialog task (appointment scheduling). The BAS edition of
    the German part is fully labelled and segmented into
    phonemic/phonetic SAM-PA by the MAUS system and partly
    segmented manually.
    New corpora available via ELRA (for the complete list, please
    contact ELRA or visit the ELRA or BAS Web sites):
    VM CD 30.1 - VM30.1 (BAS edition)
    Verbmobil II - German, 58 spontaneous dialogues (33 close mic,
    0 room mic, 25 phone line (GSM) recordings), 3024 turns,
    transliteration (Verbmobil II Format)

    VM CD 31.1 - VM31.1 (BAS edition)
    Verbmobil II - American English, 32 spontaneous dialogues
    (32 close mic, 0 room mic, 0 phone line (GSM) recordings),
    2512 turns, transliteration (Verbmobil II Format)

    VM CD 32.1 - VM32.1 (BAS edition)
    Verbmobil II - Multilingual, 17 spontaneous dialogues (17
    close mic, 0 room mic, 0 phone line (GSM) recordings),
    992 turns, transliteration (Verbmobil II Format)

    _______________________________________
    ELRA-S0084 SALA Spanish Colombian Database
    _______________________________________

    The SALA Spanish Colombian Database comprises 1000
    Colombian speakers (475 males, 525 females) recorded
    over the Colombian fixed telephone network. Corpus design,
    recruiting of speakers, annotation and formatting were done by
    the Universitat Politècnica de Catalunya (UPC). Collection was
    performed at Siemens Colombia. Six speakers repeated the
    same prompt sheet in different calls. This database is
    partitioned into 4 CDs, each of which comprises 300 speaker
    sessions (except for CD 4, with 100 speaker sessions). The
    speech databases made within the SALA project were
    validated by SPEX, the Netherlands, to assess their
    compliance with the SALA format and content specifications.

    The speech files are stored as sequences of 8-bit, 8 kHz A-law
    samples and are not compressed, according to the
    specifications of SALA. Each prompt utterance is stored within
    a separate file and has an accompanying ASCII SAM label file.
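
    As an illustration of the storage format described above, the
    following minimal Python sketch expands 8-bit A-law bytes into
    16-bit linear PCM values using the standard G.711 expansion. It
    assumes the speech files are raw and headerless (the announcement
    only says they are uncompressed); the function names are
    hypothetical and are not part of any SALA tooling.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate stated in the announcement


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code into a 16-bit linear PCM value (G.711)."""
    code ^= 0x55                      # undo the even-bit inversion of A-law
    value = (code & 0x0F) << 4        # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        value += 8
    else:
        value += 0x108
        value <<= segment - 1
    # In A-law, a set sign bit marks a positive sample.
    return value if code & 0x80 else -value


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte string, one byte per sample."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """Duration of a raw 8-bit, 8 kHz recording: one byte per sample."""
    return len(data) / SAMPLE_RATE_HZ
```

    Under the same assumption, the bytes of one utterance file could
    be read with pathlib.Path(...).read_bytes() and passed to
    decode_alaw(); the actual file naming is defined by the SALA
    specifications and is not reproduced here.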

    Corpus contents:
    · 6 application words;
    · 1 sequence of 10 isolated digits;
    · 4 connected digits: 1 sheet number (6 digits), 1 telephone
    number (9-11 digits), 1 credit card number (14-16 digits), 1 PIN
    code (6 digits);
    · 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date
    (word style), 1 relative and general date expression;
    · 1 spotting phrase using an application word (embedded);
    · 1 isolated digit;
    · 3 spelled-out words (letter sequences): 1 spelling of surname;
    1 spelling of directory assistance city name; 1 real/artificial
    name for coverage;
    · 1 currency money amount;
    · 1 natural number;
    · 5 directory assistance names: 1 surname (out of 500); 1 city
    of birth / growing up (spontaneous); 1 most frequent city (out of
    500); 1 most frequent company/agency (out of 500); 1 "forename
    surname" (set of 150);
    · 2 questions, including "fuzzy" yes/no: 1 predominantly "yes"
    question, 1 predominantly "no" question;
    · 9 phonetically rich sentences;
    · 2 time phrases: 1 time of day (spontaneous), 1 time phrase
    (word style);
    · 4 phonetically rich words.

    The following age distribution has been obtained: 11 speakers
    are below 16 years old, 486 speakers are between 16 and 30,
    305 speakers are between 31 and 45, 163 speakers are between
    46 and 60, and 35 speakers are over 60.
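
    The speaker figures quoted in this description can be
    cross-checked with a short Python sketch. The numbers are copied
    from the text above; the variable and function names are
    hypothetical and serve only as an illustration.

```python
# Figures copied from the announcement text.
TOTAL_SPEAKERS = 1000
SPEAKERS_BY_GENDER = {"male": 475, "female": 525}
SPEAKERS_BY_AGE = {
    "below 16": 11,
    "16-30": 486,
    "31-45": 305,
    "46-60": 163,
    "over 60": 35,
}
SESSIONS_PER_CD = {1: 300, 2: 300, 3: 300, 4: 100}


def breakdown_matches(breakdown: dict, expected_total: int) -> bool:
    """Return True when the counts in a breakdown sum to the expected total."""
    return sum(breakdown.values()) == expected_total


def shares(breakdown: dict, total: int) -> dict:
    """Express each count in a breakdown as a fraction of the total."""
    return {key: count / total for key, count in breakdown.items()}


if __name__ == "__main__":
    for name, table in [
        ("gender", SPEAKERS_BY_GENDER),
        ("age", SPEAKERS_BY_AGE),
        ("sessions per CD", SESSIONS_PER_CD),
    ]:
        print(name, "consistent:", breakdown_matches(table, TOTAL_SPEAKERS))
    print(shares(SPEAKERS_BY_AGE, TOTAL_SPEAKERS))
```

    All three breakdowns (gender, age band, sessions per CD) sum to
    the stated total of 1000.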

    A pronunciation lexicon with a phonemic transcription in SAMPA
    is also included.

    _______________________________________
    ELRA-L0042 PAROLE Spanish lexicon
    _______________________________________

    The PAROLE Spanish lexicon follows the standard PAROLE
    architecture, which includes morphological and syntactic layers.
    It includes the most frequent words found in a 1-million-word
    corpus, coded according to the PAROLE specifications.

    The lexicon contains about 22,000 morphological units, of which
    12,209 are common nouns, 3,367 are verbs and 4,996 are adjectives.
    Closed-class categories are fully covered.

    The information associated with each morphological unit covers
    part-of-speech and subtype, inflection paradigm (with
    morphosyntactic information for the endings organised in about
    132 models), the possible stems associated with the relevant
    endings, and links to the syntactic layer. In the syntactic layer, information
    regarding subcategorisation for verbs and insertion context for
    nouns is encoded following the PAROLE model.

    =====================================
    For further information, please contact:

         ELRA/ELDA                     Tel +33 01 43 13 33 33
         55-57 rue Brillat-Savarin     Fax +33 01 43 13 33 30
         F-75013 Paris, France         E-mail mapelli@elda.fr

    or visit the online catalogue on our Web site:

         http://www.icp.grenet.fr/ELRA/home.html
         or http://www.elda.fr
    =====================================


