Corpora: New Corpora

From: LDC Office (ldc@unagi.cis.upenn.edu)
Date: Wed Oct 11 2000 - 23:12:56 MET DST


    The Linguistic Data Consortium is pleased to announce three new
    corpora.

    Voice of America (VOA) Czech Broadcast News Audio
    http://morph.ldc.upenn.edu/Catalog/LDC2000S89.html
    $900 for nonmembers

    Between February 9 and May 28, 1999, the Linguistic Data
    Consortium collected approximately 30 hours of broadcast audio
    from the Voice of America news service in Czech. The 62 data
    files presented in this corpus represent the audio of the daily
    broadcasts of 30-minute news programs.

    Voice of America (VOA) Czech Broadcast News Transcript Corpus
    http://morph.ldc.upenn.edu/Catalog/LDC2000T53.html
    $200 for nonmembers

    The transcriptions were created by native Czech speakers,
    working at the Department of Cybernetics, University of West
    Bohemia (UWB) in Pilsen, under the direction of Josef Psutka and
    Pavel Ircing. They used transcription software provided by the
    LDC (the "transcriber" package, developed by Edouard Geoffrois
    and Claude Barras at DGA, France, with assistance from Zhibiao
    Wu at the LDC; the package is currently available from the LDC
    web site: www.ldc.upenn.edu). The transcript files are presented
    here in a format that was defined by the speech group at NIST,
    who refer to it as the "Universal Transcription Format" (UTF --
    not to be confused with the "Unicode Transformation Formats").
    The transcription text is rendered using the ISO 8859-2
    character set.
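
    Since the transcripts use ISO 8859-2 rather than a Unicode
    encoding, they may need converting before use with tools that
    expect UTF-8. The following is a minimal sketch in Python, not
    part of the corpus or its tools; the function name and the
    sample string are hypothetical and chosen only for illustration.

```python
# Sketch: convert ISO 8859-2 (Latin-2) bytes to UTF-8 bytes.
# The function name and sample text are illustrative assumptions,
# not taken from the corpus documentation.

def latin2_to_utf8(raw: bytes) -> bytes:
    """Decode bytes as ISO 8859-2 and re-encode the text as UTF-8."""
    text = raw.decode("iso8859_2")  # standard Python codec name for Latin-2
    return text.encode("utf-8")


# Build a small Latin-2 sample in memory instead of reading a corpus file.
sample_text = "Hlas Ameriky: zprávy v češtině"
sample_bytes = sample_text.encode("iso8859_2")

converted = latin2_to_utf8(sample_bytes)
print(converted.decode("utf-8"))
```

    For a real transcript file, the same conversion could be applied
    to the bytes read from disk; any file name would depend on the
    corpus layout, which is not described here.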

    TREC Spanish
    http://morph.ldc.upenn.edu/Catalog/LDC2000T51.html
    $200 for nonmembers

    This is the set of documents used for the Spanish task in TRECs
    3-5. It consists of approximately 250 megabytes of the Mexican
    newspaper El Norte and 300 megabytes of Agence France Presse
    1994 newswire text, formatted to include TREC document IDs. The
    El Norte documents were used for TRECs 3-4, and the Agence
    France Presse documents for TREC 5. The topics (questions) and
    relevance judgments (right answers) that complete the test
    collections can be downloaded from the TREC web site
    (http://trec.nist.gov) in the Data/Non-English section. Users
    who wish to receive this corpus must sign the user license, which
    can be obtained from
    http://morph.ldc.upenn.edu/Catalog/mem_agree/trec-spanish.html.

    If you would like to order a copy of these corpora, please email
    your request to <ldc@unagi.cis.upenn.edu>. If you need
    additional information before placing your order, or would like
    to inquire about membership in the LDC, please send email or
    call (215) 573-1275.

    Further information about the LDC and its available corpora can
    be accessed on the Linguistic Data Consortium WWW Home Page at
    URL: http://www.ldc.upenn.edu/


