Corpora at Victoria University of Wellington

Linguists at Victoria University of Wellington have been involved in the collection of New Zealand English for three corpora: one written, one spoken, and a third which includes both spoken and written data.

1. The Wellington Corpus of Written New Zealand English (WWC)

One million words of written New Zealand English collected from writings published in the years 1986 to 1990.

The WWC has the same basic categories as the Brown Corpus of written American English (1961) and the Lancaster-Oslo-Bergen corpus (LOB) of written British English (1961). The corpus also parallels the structure of the Macquarie Corpus of written Australian English (1986). The WWC consists of 2,000-word excerpts on a variety of topics. Text categories include press material; religious texts; skills, trades and hobbies; popular lore; biography; scholarly writing; and fiction. (For further information see Bauer 1993.)

2. The Wellington Corpus of Spoken New Zealand English (WSC)

One million words of spoken New Zealand English collected in the years 1988 to 1994. Ninety-nine percent of the data (543 out of 551 extracts) was collected in the years 1990 to 1994. Of the eight remaining files, four were collected in 1988 (4 oral history interviews) and four in 1989 (4 social dialect interviews).

The WSC was formerly known as A Computerised Corpus of English in New Zealand (ACCENZ). The corpus consists of 2,000-word extracts (where possible) and comprises different proportions of formal, semi-formal and informal speech. Both monologue and dialogue categories are included, and the material covers broadcast as well as private speech collected in a range of settings.

3. The New Zealand component of the International Corpus of English (ICE-NZ)

One million words of spoken and written New Zealand English collected in the years 1990 to 1996. ICE-NZ consists of 600,000 words of speech and 400,000 words of written text.

The WSC and the spoken component of ICE-NZ share nine categories. Because informal conversational data in particular was so difficult to collect, the two corpora overlap by 339,248 words (173 files), which made data collection more economical.

The shared categories are identified in section 8.3, WSC and ICE-NZ Overlap.
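
To give a sense of how large the WSC and ICE-NZ overlap is, the figures quoted above can be turned into proportions. The following is only an illustrative sketch in Python: the variable and function names are hypothetical, the corpus sizes are the nominal totals stated in this section rather than exact token counts, and it assumes that the 173 shared files are counted among the 551 WSC extracts.

```python
# Figures quoted in this section. Corpus sizes are nominal totals,
# not exact token counts (an assumption of this sketch).
WSC_WORDS = 1_000_000          # Wellington Corpus of Spoken NZ English
WSC_EXTRACTS = 551             # total number of WSC extracts
ICE_NZ_SPOKEN_WORDS = 600_000  # spoken component of ICE-NZ
OVERLAP_WORDS = 339_248        # words shared by WSC and ICE-NZ
OVERLAP_FILES = 173            # files shared by WSC and ICE-NZ


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


overlap_summary = {
    "pct_of_ice_nz_spoken_words": share(OVERLAP_WORDS, ICE_NZ_SPOKEN_WORDS),
    "pct_of_wsc_words": share(OVERLAP_WORDS, WSC_WORDS),
    # Assumes the shared files are a subset of the 551 WSC extracts.
    "pct_of_wsc_extracts": share(OVERLAP_FILES, WSC_EXTRACTS),
    "mean_words_per_shared_file": round(OVERLAP_WORDS / OVERLAP_FILES),
}

for name, value in overlap_summary.items():
    print(f"{name}: {value}")
```

On these nominal figures the shared material makes up roughly 57 percent of the spoken ICE-NZ component and about a third of the WSC, and the shared files average just under 2,000 words each, in line with the extract length described for the WSC.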