ICAME Collection of English Language Corpora

 

Coordinators: Knut Hofland (project leader, conversion and indexing of texts),
Anne Lindebjerg (manuals and CD label/booklet),
Jørn Tunestvedt (manuals),
HIT-Centre,
University of Bergen
Version 2, Bergen, June 1999
ISBN 82-7283-091-4
Publisher/Distributor:
The HIT Centre
Allégt. 27
N-5007 Bergen
Norway
  Telephone: +47 5558 2954
Telefax: +47 5558 9470
Electronic mail: icame@hit.uib.no

Each corpus was produced by a different research team, as explained below.

The Brown Corpus

The Brown Corpus was compiled in the early 1960s at Brown University, USA, under the direction of W. Nelson Francis and Henry Kucera.

The WordCruncher version of the Brown Corpus was made by Randall Jones, Brigham Young University.

 

The LOB Corpus

The Lancaster-Oslo/Bergen (LOB) Corpus was compiled in the 1970s under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo. The tagging was done by researchers at Lancaster, Oslo, and Bergen. The principal members of the research teams were:

Lancaster: Geoffrey Leech, Roger Garside, Eric Atwell, Ian Marshall

Oslo/Bergen: Stig Johansson, Knut Hofland, Mette-Cathrine Jahr

 

The Kolhapur Corpus

The Kolhapur Corpus is an Indian English counterpart of the Brown and LOB corpora, compiled under the direction of S. V. Shastri, Shivaji University, Kolhapur. It contains 500 text samples selected from English texts printed in India in 1978.

The WordCruncher version of the Kolhapur Corpus was made by Gerhard Leitner, Free University of Berlin and Knut Hofland, HIT-Centre, University of Bergen.

 

The London-Lund Corpus

The London-Lund Corpus contains 100 spoken English texts of some 5,000 words each, collected and transcribed at the Survey of English Usage, University College London, under the direction of Randolph Quirk, and computerized at the University of Lund, under the direction of Jan Svartvik (13 of the texts were computerized at University College London, under the direction of Sidney Greenbaum). The principal members of the research teams were:

London: Sidney Greenbaum, Andrew Rosta, Akiva Quinn

Lund: Bengt Altenberg, Mats Eeg-Olofsson, Lennart Månsby, Bengt Oreström, Jan Svartvik, Cecilia Thavenius

The WordCruncher version has been made by Knut Hofland, University of Bergen.

 

The Helsinki Corpus of English Texts: Diachronic Part

This corpus was compiled at the University of Helsinki, under the direction of Matti Rissanen. Other members of the research team were:

Old English: Leena Kahlas-Tarkka, Matti Kilpiö, Ilkka Mönkkönen, Aune Österman

Middle English: Inkeri Blomstedt, Juha Hannula, Mailis Järviö, Leena Koskinen, Saara Nevanlinna, Tesma Outakoski, Päivi Pahta, Kirsti Peitsara, Irma Taavitsainen

Early Modern English: Merja Kytö, Anneli Meurman-Solin, Terttu Nevalainen, Helena Raumolin-Brunberg, Ritva Tiusanen

The project secretary was Merja Kytö, and the research assistants, who keyed in and proofread texts, were:

Kirsi Heikkonen, Jussi Klemola, Asta Kuusinen, Tuula Lehtonen, Tom Löfström, Arja Nurmi, Minna Palander, Tiina Selki, Päivi Öhman.

The WordCruncher version of the Helsinki Corpus was made by Merja Kytö, University of Helsinki.

 

Freiburg-LOB Corpus of British English (FLOB)

In 1991, Christian Mair, at Englisches Seminar at Albert-Ludwigs-Universität Freiburg, took the initiative to compile a set of corpora that would match the well-known and widely used Brown and LOB corpora with the only difference that they should represent the language of the early 1990s. The project started in April 1991. To speed up the process of compilation, Christian Mair was granted additional funding by the DFG (German Research Foundation) for the years 1994-1996.

The following have all been involved in the often tedious process of typing the text-extracts and/or the proofreading: Birgit Felleisen, Heike Fiedler, Elke Frings, Elke Gebhard, Dorothee Graf, Ulrike Günther, Matthias Kaufmann, Manfred Krug, Christoph Lindner, Isolde Mattmüller-Ofori, Nadja Nesselhauf, Christine Oesterlee, and Heike Schnitzler. Heike Fiedler helped in the final stages of proofreading and the editing of the manual.

Special thanks go to Christoph Lindner, who wrote the programs that were used in assigning the category references and line-numbers to the ASCII-texts, and to Heide Peper-Ludwig, the main troubleshooter in computer-related emergencies.

The WordCruncher version has been made by Knut Hofland, University of Bergen.

 

Freiburg-Brown Corpus of American English (Frown)

In 1991, Christian Mair, at Englisches Seminar at Albert-Ludwigs-Universität Freiburg, took the initiative to compile a set of corpora that would match the Brown and LOB corpora with the only difference that they should represent the language of the early 1990s. 1992 saw the beginning of the new Freiburg Brown Corpus, Frown. To speed up the process of compilation, Christian Mair was granted additional funding by the DFG (German Research Foundation) for the years 1994-1996.

The following have all been involved in the often tedious process of typing the text-extracts and/or the proofreading for the Frown corpus: Jost Burger, Birgit Felleisen, Elke Gebhard, Dorothee Graf, Ulrike Günther, Matthias Kaufmann, Manfred Krug, Christoph Lindner, Tobias Maier, Nadja Nesselhauf, Christine Oesterlee, Stefanie Rapp, Heike Schnitzler, Anne Schröder. Heike Fiedler and Nicole Knäble helped in the final stages of proofreading and the editing of the manual.

Special thanks go to Christoph Lindner, who wrote the programs that were used in assigning the category references and line-numbers to the ASCII-texts, and to Heide Peper-Ludwig, the main troubleshooter in computer-related emergencies.

The WordCruncher version has been made by Knut Hofland, University of Bergen.

 

The Australian Corpus of English (ACE)

The Australian Corpus of English (ACE) was compiled in the Department of Linguistics at Macquarie University, NSW, Australia, from 1986 on. It was supported by a small grant in 1988-1989 from the Australian Research Grants Council, and by a series of grants from Macquarie University. Other support came from the National Languages and Literacy Institute of Australia and the University of New South Wales. The project was conceived by Pam Peters, Peter Collins and David Blair, and was carried through with the help of a number of research assistants, notably Alison Moore, Elizabeth Green, Robert Jenkins, Catherine Martin, Diana Grace, Heather Middleton, Wendy Young and Adam Smith. Computational help and advice were provided by Harry Purvis and Steve Cassidy, and the project enjoyed continuous infrastructure support from Macquarie's Speech, Hearing and Language Research Centre.

The WordCruncher version has been made by Knut Hofland, University of Bergen.

 

The Wellington Corpus of Written New Zealand English

The Wellington Corpus was developed in the Department of Linguistics at Victoria University of Wellington in the years 1986-1992.

The idea of a New Zealand corpus had been around since the first half of the 1980s, was canvassed at a Linguistic Society of New Zealand Conference in Wellington in 1985 by Derek Davy, and was warmly supported by the Linguistic Society. In 1986 planning for such a project was begun by a group of people interested in the idea of a corpus from the Department of Linguistics and the English Language Institute. In 1987 a tentative start was made on collecting the material for the Press section.

Laurie Bauer took on the task of directing the collection of the written material.

The project has been generously supported by the Internal Grants Committee of Victoria University of Wellington, and by the (now defunct) University Grants Committee.

We have also been helped considerably by the staff of Victoria University's Computer Services Centre, under the directorship of Frank March, and we should like to express our appreciation of the effort made by them in aid of this project.

We were fortunate to be able to employ a number of current and former Linguistics students as research assistants, and it is their work and care which have brought the project to a successful conclusion so quickly. I should like to thank for their hard work on this corpus Anna Adams, Debra Beckett, Rachel Dickinson, Katrina Foster, Lisa Matthewson, Ruth Pemberton, Mary Roberts, Shelley Robertson, Jane Sayers, Robert Sigley, Rowena Simpson.

The WordCruncher version has been made by Knut Hofland, University of Bergen.

 

The Spoken English Corpus (SEC)

The SEC project was supported in 1984-1985 by the University of Lancaster Humanities Research Fund and by IBM UK Ltd., and subsequently by IBM UK Ltd. IBM have not only given financial support, but have actively participated in the project.

A large number of people have contributed to the project:

The project team comprised Dr G Knowles (University of Lancaster), Dr P Alderson (IBM), Dr B Williams (IBM) and L Taylor (University of Lancaster). Prof G Leech (University of Lancaster) and Prof G Kaye (IBM) initiated the project and maintained an active collaborative role in it. Additional help was provided by A Seil and N Campbell (IBM), and S Elliot, C Grover, and Dr E Briscoe (University of Lancaster).

The majority of texts in the corpus were obtained from the BBC, and thanks must go to Norma Jones in the BBC Sound Archives for her help in organising contracts, contacting speakers, and providing information for the three years of the project.

The WordCruncher version has been made by Knut Hofland, University of Bergen.

 

The Wellington Corpus of Spoken New Zealand English (WSC)

Project Director
Janet Holmes

Corpus Research Advisory Group

Laurie Bauer, Allan Bell, David Britain, Graeme Kennedy, Chris Lane, Miriam Meyerhoff and Maria Stubbe.

Corpus Managers
Miriam Meyerhoff 1989-1991
Maria Stubbe 1991-1992
Raewyn Whyte 1992-1993
Sue Petris 1993-1994
Jane Pilkington 1994
Jennifer O'Brien 1994
Gary Johnson 1994-1997
Bernadette Vine 1997-

Transcribers
Alexander Tripp, Gary Johnson, Martin Paviour-Smith
Angela Lavender, Jane Pilkington, Meg Sloane
Anissa Bain, Jen Hay, Michaela Stirling
Anita Easton, Jennifer O'Brien, Nina Flinkenberg
Ben Taylor, Jenny Allan, Penny Wilson
Bernadette Vine, Kate Kilkenny, Rachel Lum
Camille Plimmer, Kate Wadsworth, Rowena Samaraweera
Claire Solon, Kerry McCarty, Sarah Dreyer
Elizabeth Smith, Lynnette Sollitt-Morris, Shelley Robertson
Esther Griffiths, Margaret Cain, Sue Petris

Research Assistants
Alexandra Manolis, Keri Shepherd, Meredith Marra
Anna Adams, Louise Burns, Ruth Katene
Anthony Singleton, Maria Aptekar, Robert Sigley
Clare Taylor, Maria Tuinman, Shannon Marra
Inga Fillary, Maryann Nesbit, Sue Jones

The WordCruncher version has been made by Knut Hofland, University of Bergen.

 

The Bergen Corpus of London Teenage Language (COLT)

The project was initiated by Anna-Brita Stenström in collaboration with Leiv Egil Breivik and was carried through with the help of postgraduate students employed as research assistants, notably Gisle Andersen, Vibecke Haslerud, Kristine Hasund, Migle Miliauskaite, Kristine Monstad, Ingrida Strazdaite, Nina Sørli, Ingrid Thompson and Hanne Aas. In addition, Lars Johannessen was engaged for the preparation of the material for text-to-sound conversion, which was completed by Tony Robinson at SoftSound, St Albans.

We are extremely grateful to the Department of Education in London for suggesting suitable London schools for collecting the material; to the Longman Group, London, not only for letting us use the method of corpus collection that was used for the collection of the British National Corpus but also for carrying out the orthographic transcription; and finally to the researchers at Lancaster University, in particular Elizabeth Eyes, for doing the word class tagging.

The project could hardly have been carried through without the assistance of Knut Hofland at The Norwegian Computing Centre for the Humanities and, at a later stage, Manfred Thaller at the Centre for Humanities Information Technologies Research, both at the University of Bergen.

 

The Helsinki Corpus of Older Scots

The compiler of the corpus is Anneli Meurman-Solin, University of Helsinki. Research assistants were Kirsi Heikkonen and Arja Nurmi. The compiler would like to thank Matti Rissanen, Merja Kytö and A. J. Aitken for support in the compilation.

 

The Corpus of Early English Correspondence

The Corpus of Early English Correspondence (CEEC) and the Corpus of Early English Correspondence Sampler (CEECS) have been compiled by the Sociolinguistics and Language History project team at the Department of English, University of Helsinki. The project has been funded by the Academy of Finland (1993-95) and the University of Helsinki (1996-98). The team is led by Professor Terttu Nevalainen and includes senior researcher Helena Raumolin-Brunberg, and researchers Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin. We have been helped in the compilation of the corpus by Kirsi Heikkonen, and in the proofreading by Alistair Melville-Smith, Taru Nurmi, Arja-Liisa Rossi, Reza Sanatnama, Heli Tissari and Anne Virolainen.

 

The Newdigate Newsletters

Compiled by: Philip Hines, Jr., Norfolk, VA USA

Advice and help have come to me from many friends, colleagues, and former students, all of which I gratefully acknowledge. I wish especially to thank Laetitia Yeandle, Manuscript Curator at the Folger Library; Garland F. White III, former Director of the Computer-Based Laboratory for Instruction and Analysis at Old Dominion University; and Henry L. Snyder of the University of California, Riverside, Director of "The Eighteenth Century Short Title Catalogue--North America," for much very fundamental aid. I thank the Research Foundation and the Research and Publication Committees of the College of Arts and Letters and of the Department of English (all of Old Dominion University) for grants-in-aid in support of this project. And for their faithful and effective help in transcribing the letters I thank Eric Bing, Wayne E. Bowman, Kevin Farley, Frances Johnson, Daniel Martin, Gwen McAlpine, Alison Rand, Nancy Rector, and Mark Thorsen.

The WordCruncher version has been made by Knut Hofland, University of Bergen.

 

The Lampeter Corpus of Early Modern English Tracts

The Lampeter project was initiated in 1991 by Prof. Dr. Josef Schmied and Eva Hertel at Bayreuth University and moved with them to Chemnitz in 1993. It has been funded by the Deutsche Forschungsgemeinschaft (DFG), the German Research Foundation, since 1994. Travel grants made available by the Deutscher Akademischer Austauschdienst (DAAD), the German Academic Exchange Service, have made possible research collaboration with the English Department at Helsinki University and the Department of Linguistics & Modern English Language at Lancaster University (Gerald Knowles, Tony McEnery and Andrew Wilson) on questions of corpus compilation and annotation. The current compilers are Claudia Claridge and Rainer Siemund, both of them linguists with an accompanying major in history. Eva Hertel was responsible for the early stages of compilation. Student assistants: Jeannine Stöhrer, Angelika Giesecke, Astrid Lohse, Anja Ficker, Daniela Zierold, Mario Nyeki and Manuela Sachs. The corpus passages in Greek script were transliterated by Daniela Schindling and the items in Semitic by Gerry Knowles. Hildegard Schäffler provided the ESTC-information for the headers.

The markup follows the guidelines of the Text Encoding Initiative (TEI) and uses the Standard Generalized Markup Language (SGML); it was carried out in collaboration with Lou Burnard and the Oxford Text Archive.

The WordCruncher version has been made by Knut Hofland, University of Bergen.

 

Lancaster Parsed Corpus

Roger Garside, Geoffrey Leech and Tamás Váradi, Lancaster University.

We are grateful for help received from the following sources:

(a) The development of the Parsed Corpus was originally undertaken in 1983-6, with the support of Research Grant GR/C/47700 funded by the Science and Engineering Research Council (SERC).

(b) The automatic probabilistic parser which produced the parses (prior to post-editing) derived its frequency data from another parsed database of sentences from the LOB Corpus, known as the Lancaster-Leeds Treebank, manually parsed by Geoffrey Sampson (see R. Garside, G. Leech and G. Sampson, The Computational Analysis of English: a Corpus-based Approach, London: Longman, 1987, Chapter 7). The Lancaster Parsed Corpus implements a simplified version of the parsing scheme more fully instantiated in Sampson's treebank.

(c) The post-editing of the corpus was undertaken by a number of research students at Lancaster. We particularly acknowledge the major post-editing work undertaken by Heather Kempson and by Srikant Sarangi. Finally, the whole corpus was thoroughly checked and corrected by Tamás Váradi.

(d) Steve Fligelstone and Andrew Wilson gave invaluable help in the final stages of checking and producing the corpus.

(e) A number of errors were reported by Qiao Hong Liang of the University of Queensland. Corrections were made to the corpus in April 1995.

 

The International Corpus of English - East African component

The East African component of The International Corpus of English (ICE-EA) is a computerized collection of spoken and written texts from Kenya and Tanzania. It is the result of a project started in 1989 at the University of Bayreuth and continued from 1995 at the Chemnitz University of Technology within the framework of the Special Research Programme on Africa, which was financially supported by the German Research Foundation (DFG).

The following team of researchers are responsible for the compilation of ICE-EA:

Diana Hudson-Ettle (Co-ordinator), Barbara Krohne (Assistant Co-ordinator), Josef Schmied (Project Director). Our work would not have been possible without the help of many friends and colleagues during fieldwork but we would like to mention and thank especially Casmir Rubagumya (University of Dar es Salaam), who gained access to and provided the main part of the Tanzanian spoken data, Eunice Nyamasyo (Kenyatta University) and Kembo Sure (Moi University), who were of assistance in acquiring some of the Kenyan texts.

We should also like to express our gratitude to Eva Hertel, colleague and PhD scholar, for her support, Paul Skandera, PhD scholar, for the material he provided from his stay in Kenya, and Jemimah Mwakisha, the Kenyan journalist, for her part in helping us to obtain written texts and sociolinguistic information about the authors.

A number of undergraduate student assistants helped us by typing, scanning and proofreading the corpus texts. We thank them and also those who undertook the first stages in the particularly time-consuming and often quite demanding task of transcribing the spoken texts. Special mention must be made here of Gabriele Engelhardt, Astrid Lohse, Dirk Schmerschneider and Katrin Voigt.

 

Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET)

Compiler: Manfred Markus, Institut für Anglistik, Universität Innsbruck.

 

Polytechnic of Wales Corpus

Compiled by Robin Fawcett and Michael R. Perkins, Polytechnic of Wales, Pontypridd. Handbook by Clive Souter, University of Leeds.

 

WordCruncher ViewLtd 4.5 DOS

Copyright 1985-92 Brigham Young University. Licensed from Electronic Text Corporation/CD Danmark A/S, Copenhagen.

 

LEXA and Linguafont

The programs are written by Raymond Hickey, Essen University.

 

Qwick

Qwick is a Java application which uses the CUE system, which was originally developed at Birmingham University by Oliver Mason and John Sinclair.

 

WordSmith

The program is written by Mike Scott at Liverpool University.

 

Textual Analysis Computing Tools (TACT)

TACT is owned and managed by the University of Toronto and the following principals of the TACT Group, all members of the University:

John Bradley: Computing and Communications
Lidio Presutti: Humanities Programmer, Computing and Communications (1983-1991)
Michael Stairs: Centre for Computing in the Humanities, Faculty of Arts and Science
Ian Lancashire: Department of English

Full credits may be found in each program by pressing F6, or in the case of UseBase, Ctrl-F1. The credits for the Install program are as follows:

 

PRINCIPAL FOR INSTALL 2.1

Michael Stairs - Designer-Programmer, Vers 2.1

Copyright for this program is held by the University of Toronto and the above named principal of the TACT Group, all members of the University. The system is managed by the TACT Group, which includes all principals and the following University members: