ENGLISH COMPUTER CORPORA

MATERIAL AVAILABLE THROUGH ICAME

 

The Norwegian Computing Centre for the Humanities (NCCH) functions as secretariat and clearing house for machine-readable English texts. The secretariat publishes ICAME Journal.

The following material is currently available through the International Computer Archive of Modern English (ICAME):
 - Brown Corpus, untagged text format I (available on tape or diskette): A revised version of the Brown Corpus with upper- and lower-case letters and other features which reduce the need for special codes and make the material more easily readable. It contains approximately a million words of printed text (500 text samples of about 2,000 words). A number of errors found during the tagging of the corpus have been corrected. Typographical information is preserved; the same line division is used as in the original version from Brown University except that words at the end of the line are never divided.
 - Brown Corpus, untagged text format II (tape or diskette): This version is identical to text format I, but typographical information is reduced and the line division is new.
 - Brown Corpus, KWIC concordance (tape or microfiche): A complete concordance for all the words in the corpus, including word statistics showing the distribution in text samples and genre categories. The microfiche set includes the complete text of the corpus.
 - Brown Corpus, WordCruncher version (diskette): This is an indexed version of the Brown Corpus. It can only be used with WordCruncher (for MS-DOS). See the article by Randall Jones, ICAME Journal 11, pp. 44-47.
 - LOB Corpus, untagged version, text (tape or diskette): The LOB Corpus is a British English counterpart of the Brown Corpus. It contains approximately a million words of printed text (500 text samples of about 2,000 words). The text of the LOB Corpus is not available on microfiche.
 - LOB Corpus, untagged version, KWIC concordance (tape or microfiche): A complete concordance for all the words in the corpus. It includes word statistics for both the LOB Corpus and the Brown Corpus, showing the distribution in text samples and genre categories for both corpora.
 - LOB Corpus, tagged version, horizontal format (tape or diskette): A running text where each word is followed immediately by a word-class tag (number of different tags: 134).
 - LOB Corpus, tagged version, vertical format (available on tape only): Each word is on a separate line, together with its tag, a reference number, and some additional information (indicating whether the word is part of a heading, a naming expression, a quotation, etc.).
 - LOB Corpus, tagged version, KWIC concordance (tape or microfiche): A complete concordance for all the words in the corpus, sorted by key word and tag. At the beginning of the entry for each graphic word there is a frequency survey giving the following information: (1) total frequency of each tag found with the word, (2) relative frequency of each tag, and (3) absolute and relative frequencies of each tag in the individual text categories.
 - LOB Corpus, WordCruncher version (diskette): This is an indexed version of the tagged LOB Corpus (horizontal format). It can only be used with WordCruncher (for MS-DOS).
 - London-Lund Corpus, text, original version (computer tape or diskette): The London-Lund Corpus contains samples of educated spoken British English, in orthographic transcription with detailed prosodic marking. It consists of 87 'texts', each of some 5,000 running words. The text categories represented are spontaneous conversation, spontaneous commentary, spontaneous and prepared oration, etc.
 - London-Lund Corpus, KWIC concordance I (computer tape): A complete concordance for the 34 texts representing spontaneous, surreptitiously recorded conversation (text categories 1-3), made available both in computerized and printed form (J. Svartvik and R. Quirk (eds.) A Corpus of English Conversation, Lund Studies in English 56, Lund: C.W.K. Gleerup, 1980).
 - London-Lund Corpus, KWIC concordance II (computer tape): A complete concordance for the remaining 53 original texts of the London-Lund Corpus (text categories 4-12).
 - London-Lund Corpus, supplement (diskette): The remaining 13 texts of the 100 spoken texts collected and transcribed at the Survey of English Usage, University College London. See the presentation by S. Greenbaum, ICAME Journal 14 (1990), pp. 108-110.
 - Melbourne-Surrey Corpus (tape or diskette): 100,000 words of Australian newspaper texts (see the article by Ahmad and Corbett, ICAME Journal 11, pp. 39-43).
 - Kolhapur Corpus (tape or diskette): A million-word corpus of printed Indian English texts. See the article by S.V. Shastri, ICAME Journal 12, pp. 15-26.
 - Kolhapur Corpus, WordCruncher version (diskette): This is an indexed version of the Kolhapur Corpus. It can only be used with WordCruncher (for MS-DOS).
 - Lancaster/IBM Spoken English Corpus (tape or diskette): A corpus of approximately 52,000 words of contemporary spoken British English. The material is available in orthographic and prosodic transcription and in two versions with grammatical tagging (like those for the LOB Corpus). There is an accompanying manual. See further ICAME Journal 12, pp. 76-77.
 - Polytechnic of Wales Corpus (tape or diskette): Orthographic transcriptions of some 61,000 words of child language data. The corpus is parsed according to Hallidayan systemic-functional grammar. There is no prosodic information. See further ICAME Journal 13 (1989), p. 20ff, and the presentation of the edited version of the corpus in this issue.
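
For readers who want to reproduce something like the frequency survey described for the tagged LOB concordance, the three figures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frequency_survey`, the input shape (word, tag, category triples) and the sample tokens are hypothetical and do not reflect the actual file layout of the ICAME material. In particular, the relative frequency within a text category is computed here against the word's occurrences in that category, which may differ from the definition used in the concordance itself.

```python
from collections import Counter, defaultdict


def frequency_survey(tokens):
    """Build a per-word tag survey from (word, tag, category) triples.

    The input shape is an assumption for this sketch; it is not the
    format of the distributed LOB files.
    """
    tag_counts = defaultdict(Counter)                       # word -> tag -> count
    cat_counts = defaultdict(lambda: defaultdict(Counter))  # word -> category -> tag -> count

    for word, tag, category in tokens:
        tag_counts[word][tag] += 1
        cat_counts[word][category][tag] += 1

    survey = {}
    for word, tags in tag_counts.items():
        total = sum(tags.values())
        survey[word] = {
            # (1) total frequency of each tag found with the word
            "total": dict(tags),
            # (2) relative frequency of each tag for this word
            "relative": {tag: n / total for tag, n in tags.items()},
            # (3) absolute and relative frequency of each tag per text category
            #     (relative to the word's occurrences in that category: an assumption)
            "by_category": {
                cat: {tag: (n, n / sum(counts.values())) for tag, n in counts.items()}
                for cat, counts in cat_counts[word].items()
            },
        }
    return survey


# Invented sample tokens, for illustration only.
sample = [
    ("light", "NN", "A"),
    ("light", "JJ", "A"),
    ("light", "NN", "B"),
    ("light", "NN", "B"),
]
survey = frequency_survey(sample)
print(survey["light"]["relative"])
```

Keeping the counts in nested `Counter` objects means the same pass over the tokens yields all three figures; a different definition of the per-category relative frequency would only change the divisor in the last dictionary.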

Most of the material has been described in greater detail in previous issues of our journal. Prices and technical specifications are given on the order forms which accompany the journal. Note that tagged versions of the Brown Corpus cannot be obtained through ICAME. The same applies to audio tapes for the London-Lund Corpus, the Lancaster/IBM Spoken English Corpus, and the Polytechnic of Wales Corpus.

Printed manuals are available for the LOB Corpus (the original manual and a supplementary manual for the tagged version). Printed manuals for the Brown Corpus cannot be obtained from Bergen. Some information on the London-Lund Corpus is distributed together with copies of the text and the KWIC concordance for the corpus. Users of the London-Lund material are also recommended to consult J. Svartvik (ed.), The London-Lund Corpus: Description and Research, Lund University Press, 1990.

A manual for the Kolhapur Corpus can be ordered from: S.V. Shastri, Department of English, Shivaji University, Vidyanagar, Kolhapur-416006, India. The price of this manual is US $15 (including airmail charges). Payment should be sent along with the order by cheque or international postal order drawn in favour of The Registrar, Shivaji University, Kolhapur.

 

