Corpora: New Corpora

From: LDC Office (ldc@unagi.cis.upenn.edu)
Date: Wed Nov 15 2000 - 23:17:35 MET

    The Linguistic Data Consortium is pleased to announce 3 new
    corpora.

    Topic Detection and Tracking (TDT) 2 Careful Transcription Audio
    http://www.ldc.upenn.edu/Catalog/LDC2000S92.html
    $300 for nonmembers

    Topic Detection and Tracking (TDT) 2 Careful Transcription Text
    http://www.ldc.upenn.edu/Catalog/LDC2000T44.html
    $200 for nonmembers

    This release contains broadcast news speech and transcripts from
    the following sources:

    ABC January-June 1998
    CNN January-June 1998
    PRI January-June 1998
    VOA March-June 1998

    The audio files are single-channel, 16 kHz, 16-bit linear SPHERE
    files. Topic Detection and Tracking (TDT) refers to automatic
    techniques for finding topically related material in streams of
    data such as newswire and broadcast news. The TDT2 corpus was
    created to support three TDT2 tasks: find topically homogeneous
    sections (segmentation), detect the occurrence of new events
    (detection), and track the reoccurrence of old or new events
    (tracking).
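
    For readers who have not handled SPHERE files before, the audio
    format described above can be checked by reading the ASCII header
    that begins each file. The Python sketch below is illustrative
    only: the function names are hypothetical, the sample header is
    made up to match the stated format, and the field names used
    (channel_count, sample_rate, sample_n_bytes, sample_coding) are
    the conventional SPHERE ones, assumed rather than confirmed for
    this release.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    # The first two 8-byte lines hold the magic string and the header size.
    if data[:8] != b"NIST_1A\n":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[16:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":      # integer field
            fields[name] = int(value)
        elif ftype == "-r":    # real-valued field
            fields[name] = float(value)
        else:                  # "-sN": string field of N bytes
            fields[name] = value
    fields["header_size"] = header_size
    return fields


def bytes_per_second(fields: dict) -> int:
    """Audio data rate implied by the header fields."""
    return (fields["sample_rate"]
            * fields["sample_n_bytes"]
            * fields["channel_count"])


# Made-up header matching "single-channel, 16 kHz, 16-bit linear"
# (an assumption for illustration, not copied from the corpus).
sample = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

fields = parse_sphere_header(sample)
print(bytes_per_second(fields))  # 32000
```

    At this rate one hour of audio takes roughly 115 megabytes
    (32000 bytes per second times 3600 seconds), not counting
    headers.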

    TREC Mandarin
    http://www.ldc.upenn.edu/Catalog/LDC2000T52.html
    AGREEMENT: http://www.ldc.upenn.edu/Catalog/mem_agree/trec_mandarin.html
    $200 for nonmembers

    This publication contains the TREC (Text REtrieval Conference)
    Mandarin Corpus used for the Chinese task in TRECs 5-6 and
    consists of approximately 170 megabytes of articles drawn from
    the People's Daily newspaper (1991-1993) and the Xinhua newswire
    (1994-1995) formatted to include TREC document ids. The text is
    Mandarin Chinese and is encoded using the GB encoding scheme. The
    topics (questions) and relevance judgments (right answers) are
    not included in this publication but can be downloaded from the
    Data/Non-English section of the TREC web site.
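
    Because the text is GB-encoded, it has to be decoded before it
    can be processed as Unicode. The Python sketch below shows one
    way to do that. It assumes that "GB" here means GB2312 and falls
    back to the GB18030 superset if strict decoding fails; the
    function name and the sample string are hypothetical and are not
    taken from the corpus.

```python
def decode_gb(raw: bytes) -> str:
    """Decode GB-encoded bytes to a Python (Unicode) string."""
    try:
        # Assumption: the corpus uses GB2312.
        return raw.decode("gb2312")
    except UnicodeDecodeError:
        # GB18030 is a superset of GB2312; replace anything still invalid.
        return raw.decode("gb18030", errors="replace")


# Illustrative sample: the newspaper name "People's Daily" in Chinese.
sample_bytes = "人民日报".encode("gb2312")
print(len(sample_bytes))        # 8 (two bytes per character)

text = decode_gb(sample_bytes)
utf8_bytes = text.encode("utf-8")  # re-encode for UTF-8 based tools
```

    For a whole file, the same codec name can be passed to open(),
    for example open(path, encoding="gb2312"), where path is
    whatever file is being read.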

    This collection of text was originally gathered by the Linguistic
    Data Consortium (LDC), and then adapted by the National Institute
    of Standards and Technology (NIST) for use in the TREC Mandarin
    evaluation program.

    If you would like to order a copy of these corpora, please email
    your request to <ldc@unagi.cis.upenn.edu>. If you need
    additional information before placing your order, or would like
    to inquire about membership in the LDC, please send email or
    call (215) 573-1275.

    Further information about the LDC and its available corpora can
    be accessed on the Linguistic Data Consortium WWW Home Page at
    URL: http://www.ldc.upenn.edu/


