Corpora: Re: English dialogue corpora

From: Matthew Purver (matthew.purver@kcl.ac.uk)
Date: Wed Oct 18 2000 - 14:20:02 MET DST


    As promised, here's a summary of the information sent to me by helpful
    people in response to my query about English dialogue corpora. Thanks to
    all who helped.

    Matt

    -- 
    Matthew Purver  <matthew.purver@kcl.ac.uk>
    

    Computational Linguistics and Natural Language Processing Group
    Department of Computer Science
    King's College London, Strand, London WC2R 2LS

    ---------- Forwarded message ----------

    > Spoken Professional American-English (CSPA):
    >   Size:
    >     1M words academic committee meetings
    >     1M words White House press conferences
    >   Cost:
    >     79 dollars US (or 49 without PoS tags) (from Athelstan)
    >   Features:
    >     SGML but little detail (no prosody, overlaps). PoS tags.
    >
    > Santa Barbara Corpus of Spoken American English (CSAE)
    > (forms the US part of the ICE)
    >   Size:
    >     14 texts of 15-30 mins each / 3 CD-ROMs
    >   Cost:
    >     75 dollars US (from LDC)
    >   Features:
    >     Overlaps, timing, prosody. No PoS tags.
    >
    > CALLHOME:
    >   Size:
    >     230K words / 120 texts of 5 or 10 mins each (telephone conversations)
    >   Cost:
    >     500 dollars US (from LDC)
    >   Features:
    >     Not SGML - transcripts only.
    >
    > SWITCHBOARD:
    >   Size:
    >     3M words / 2400 texts (telephone conversations) / 1 CD-ROM
    >   Cost:
    >     100 dollars US (from LDC)
    >   Features:
    >     Not SGML, but includes overlaps, pauses, non-speech events, timings.
    >
    > Verbmobil:
    >   Size:
    >     about 500 dialogues / 3 CD-ROMs
    >   Cost:
    >     255 euro (150 pounds) from ELDA
    >   Features:
    >     Most of Verbmobil is German - these 3 CDs are the English part - some
    >     German words & "Denglish" though.
    >     Not sure of format - probably straight transliteration.
    >
    > British National Corpus (BNC):
    >   Size:
    >     natural dialogue (volunteer wearing microphone, others unaware):
    >       4M words / 153 texts / 85 Mb
    >     context-governed (meetings etc.):
    >       6M words / 762 texts / 100 Mb
    >   Cost:
    >     220 pounds from OU, or 245 euro (= 150 pounds) from ELDA
    >   Features:
    >     SGML (DTD available), PoS tags (CLAWS), speakers, timing, overlaps,
    >     prosody.
    >
    > International Corpus of English, GB section (ICE-GB):
    >   Size:
    >     about 0.6M words in 180 dialogue texts, of which 100 private
    >     conversations, 80 context-governed
    >   Cost:
    >     300 pounds from UCL
    >   Features:
    >     SGML, PoS tags, speakers, timing, overlaps, parse tree,
    >     no prosody.
    >
    > London-Lund Corpus (LLC):
    >   Size:
    >     0.5M words / 100 texts, about 75% spontaneous dialogue (some
    >     surreptitious)
    >   Cost:
    >     3500 Norwegian kroner (= 260 pounds) as part of ICAME CD
    >   Features:
    >     NOT standard SGML, includes prosody but no PoS tags.
    >
    > Bergen Corpus of London Teenage Language (COLT):
    >   Size:
    >     0.5M words, all spontaneous dialogue (volunteer wearing microphone,
    >     others unaware)
    >   Cost:
    >     (part of ICAME CD)
    >   Features:
    >     SGML, includes PoS tags (CLAWS), speakers, prosody.
    >
    > Wellington Corpus of Spoken English (WSC):
    >   Size:
    >     0.5M words conversation (non-surreptitious), small amounts of
    >     telephone, interviews etc.
    >   Cost:
    >     (part of ICAME CD)
    >   Features:
    >     SGML including prosody, no PoS tags. Some Maori words.
    >
    > Edinburgh HCRC Map Task Corpus (MTC):
    >   Size:
    >     128 dialogue texts
    >   Cost:
    >     165 pounds from Edinburgh, or 200 dollars US (136 pounds) from LDC
    >   Features:
    >     SGML, includes actual recordings
    >
    > TRAINS spoken dialog corpus:
    >   Size:
    >     55K words / 98 texts / 1 CD-ROM of task-oriented (goods shipment in
    >     railway system) dialogues
    >   Cost:
    >     150 dollars US (= 103 pounds) from LDC
    >   Features:
    >     Plain text transcription, includes actual recordings



    This archive was generated by hypermail 2b29 : Wed Oct 18 2000 - 14:17:55 MET DST