Genesis and Development of the VUW Corpora


1. Background

New Zealand linguists had discussed collecting a corpus of New Zealand English since the mid-1980s. They had used corpora to research vocabulary (Kennedy 1991, Bauer and Nation 1993) and the expression of speech functions such as quantity (Kennedy 1987), causation (Fang and Kennedy 1992) and certainty (Holmes 1982, 1983). They were aware of the value of resources such as the 1961 Brown Corpus of American English, the 1961 LOB Corpus of British written English, and the 1980 London-Lund Corpus of British spoken English.

At the Seventh New Zealand Linguistic Society Conference in 1987, Derek Davy proposed that New Zealand linguists should cooperate in collecting a corpus of New Zealand English, comprising half written data and half spoken data. The proposal was supported, but there was little agreement on how the corpus should be composed (Davy 1988).

In 1987, after much debate about design and methodology, linguists at Victoria University of Wellington began collecting data. In 1989, the university accepted the task of compiling two corpora of New Zealand English, eventually to be named the Wellington Corpus of Written New Zealand English (WWC) and the Wellington Corpus of Spoken New Zealand English (WSC). Laurie Bauer took responsibility for assembling the written corpus and Janet Holmes for assembling the spoken corpus. It was agreed that each corpus should comprise one million words. By the end of 1989, the basic structure of the spoken corpus had been agreed by the Corpus Research Advisory Group (see section 3, Project Team).

In 1988, the late Sidney Greenbaum proposed that an International Corpus of English (ICE) be gathered (Greenbaum 1988). The New Zealand component of ICE was completed in conjunction with the Wellington corpora.

As initially designed, the WSC was to consist of formal speech/monologue (10%), semi-formal speech/elicited monologue (10%) and informal speech/dialogue (80%), collected between 1987 and 1992. The Corpus Research Advisory Group decided that informal styles rather than formal, and dialogue rather than monologue, should form the bulk of the data, and that as large a proportion as possible of the spoken corpus should consist of casual conversation in private informal contexts. Informal conversational interaction is the most pervasive, unmarked, daily expression of New Zealand English, and should therefore be well represented.

This structure was further refined at the Eighth New Zealand Linguistic Society Conference in 1991.

By the end of 1992, all the broadcast data and almost all the formal spoken material had been collected, yet the target of 80% dialogue had not been reached. The proportion of conversation was therefore reduced to 75% and the completion date for data collection extended to 1993.


2. Broad Composition of the WSC

The collection dates for the WSC were finalised as 1 January 1988 to 31 December 1994. In practice, 99% of the data was collected between 1990 and 1994, the exception being eight private interviews. The closing date was extended because of difficulties encountered in collecting non-broadcast data, particularly transactions in the workplace (see Holmes 1994, 1996).

The proportions of speech styles were finalised as:

Formal Speech/Monologue                       12%
Semi-formal Speech/Elicited Monologue         13%
Informal Speech/Dialogue                      75%