Sources and Sampling

1. Speakers

1.1. Who counts as a New Zealander?

One of the most fundamental issues addressed by the Corpus Advisory Research Group was the problem of defining who counted as a New Zealander. Who should be allowed to contribute to the corpus? This problem has presumably faced all those involved in corpus collection, but it has received little explicit attention. It is a particularly vexatious problem for colonial societies where large sections of the community are immigrants. At what point does an immigrant become a New Zealander? (For a fuller exploration of this issue see Bauer 1991.)

We rejected the notion of selecting people who sounded as if they were New Zealanders, since this would self-evidently have pre-judged an issue which the corpus data was intended to illuminate, namely what constitutes New Zealand English. Similarly, non-linguistic criteria such as citizenship or residency are fraught with problems, since those who hold such qualifications may be very recent arrivals from elsewhere. Even longer-term residents cannot be expected to have acquired the features which distinguish New Zealand speech from other varieties if they arrived in the country after puberty. Consequently, we adopted a criterion which has been regarded by others as very stringent, but which we felt confident would ensure the integrity of the New Zealand samples included in the corpus.

A speaker of New Zealand English is defined as someone who has lived in New Zealand since before the age of 10 years.

A certain amount of overseas experience was regarded as normal within New Zealand, but, again for reasons relating to the need to establish the distinctive features of a New Zealand variety of English, people who had spent extensive periods of time overseas were excluded. More than ten years or over half their lifetime (whichever was the greater) was considered an extensive period of time, and this rendered people ineligible for inclusion in the spoken corpus. Also excluded were people who had returned from an overseas trip within the last year.


To summarise:

Eligible:

  1. Lived in NZ since before the age of 10 years
  2. 10 years or less spent overseas, or less than half of lifetime (whichever is greater)
  3. Last overseas trip over 1 year ago

Ineligible:

  1. Arrived in NZ after the age of 10 years
  2. More than 10 years spent overseas, or more than half of lifetime (whichever is greater)
  3. Last overseas trip less than 1 year ago

We also had an age restriction in that anyone under 16 years of age was not included.
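
The criteria above can be read as a simple decision rule. The following is a minimal sketch only, not the procedure the project actually used: the `Speaker` class, its field names and the function names are all hypothetical. It assumes that a speaker who arrived at exactly age 10 is ineligible, that a trip ending exactly one year ago is acceptable, and that time overseas is reported as a single total in years. The source does not settle these boundary cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Speaker:
    # Hypothetical fields; they do not mirror the actual background information sheet.
    age: float                              # age in years at time of recording
    arrival_age: float                      # age on arrival in NZ (0 if born there)
    years_overseas: float                   # total time spent outside NZ
    years_since_last_trip: Optional[float]  # None if the speaker has not been overseas


def overseas_limit(age: float) -> float:
    """Longest permitted time overseas: 10 years or half the lifetime, whichever is greater."""
    return max(10.0, age / 2.0)


def is_eligible(s: Speaker) -> bool:
    """Apply the four stated restrictions in turn."""
    if s.age < 16:
        return False  # under-16s were not included
    if s.arrival_age >= 10:
        return False  # must have lived in NZ since before age 10 (boundary is an assumption)
    if s.years_overseas > overseas_limit(s.age):
        return False  # extensive period overseas
    if s.years_since_last_trip is not None and s.years_since_last_trip < 1:
        return False  # returned from overseas within the last year
    return True
```

For example, under these assumptions a 30-year-old born in New Zealand with 12 years overseas would still qualify, because half a lifetime (15 years) exceeds the 10-year floor.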

There was also a restriction on the number of words transcribed from each person. The largest contribution from any one person is 5,642 words, and very few participants contribute more than 5,000 words.

The talkback and news extracts frequently involved one or more individuals from whom we were unable to obtain background information and permission forms. Specifically, the callers to talkback programmes and the reporters and public in the field providing an on-the-site news perspective were not generally traceable. The speech of these individuals was included if the individuals sounded like New Zealanders. They contribute less than 5% of the total words transcribed.

1.2. Ethnic and gender representation

People of any ethnicity (e.g. Dutch, Samoan, Greek, Tongan) were considered eligible for inclusion in the spoken corpus provided they satisfied the criterion for eligibility as a New Zealander. No attempt was made to include representative samples from particular ethnic groups other than Maori. It was considered important to include an appropriate proportion of the speech of the indigenous Maori people, and while this was not possible within each sub-category, it was recognised as a reasonable aim for the corpus as a whole. As seen in section 8.2, WSC Gender, Ethnicity and Age Breakdowns, Maori contribute 18% of the total words in our transcribed corpus and Pakeha 76%.

Some degree of gender balance was also considered desirable, with an ideal overall goal of 50% female speech and 50% male speech within the 1,000,000 word sample. Women contribute 52% and men 48% of the final transcribed words, reflecting the New Zealand population balance (see section 8.2, WSC Gender, Ethnicity and Age Breakdowns).

1.3. Other social factors

Recognising that it was unrealistic to attempt to collect a representative sample which took account of additional social variables such as social class, regional origin, level of education, occupation and age, no attempt was made to pre-determine the number of contributors in such categories. However, every speech sample collected is described as fully as possible in these respects for each speaker contributing to the corpus (see background information sheet, Appendix 2). No attempt was made at iwi representation and information on iwi affiliation was not collected.


1.4. Whose speech was included?

Given our stringent criteria for classification as a New Zealander (see above and Holmes 1995), there were obviously many people who did not qualify for inclusion in the corpus. Yet such people were often recorded in discussion or conversations with New Zealanders counted as eligible, especially in broadcast recordings. Indeed it was almost impossible to find television discussions involving four or more people where all participants qualified as eligible speakers of New Zealand English. In such cases, the contributions of all speakers were transcribed in order to respect the integrity of the discussion, but contributions from non-New Zealanders were clearly indicated in the transcript, and were not included in word counts.
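
The counting rule described here (transcribe every speaker, but count only eligible ones) can be sketched as follows. This is illustrative only: the function name, the representation of a transcript as `(speaker_id, text)` pairs and the whitespace-based definition of a word are all assumptions, and the corpus's actual markup and word-counting conventions are not reproduced.

```python
from typing import Iterable, Set, Tuple


def count_eligible_words(turns: Iterable[Tuple[str, str]], eligible_ids: Set[str]) -> int:
    """Count words spoken by eligible speakers only.

    All turns stay in the transcript; ineligible speakers are simply
    skipped when the total is computed.
    """
    total = 0
    for speaker_id, text in turns:
        if speaker_id in eligible_ids:
            # Assumption: a word is any whitespace-separated token.
            total += len(text.split())
    return total


# Hypothetical three-turn discussion in which only speaker "A" is eligible.
discussion = [
    ("A", "so what did you think"),
    ("B", "it was fine"),
    ("A", "good"),
]
eligible_total = count_eligible_words(discussion, {"A"})
```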

2. Background information sheets

It was essential to collect a certain amount of personal information from every contributor to the corpus for two reasons:

  1. to ensure that they were eligible for inclusion in the corpus according to the criteria set out in section 1.1, Who counts as a New Zealander?
  2. to provide information for researchers regarding social characteristics of speakers.

The background information sheet is provided in Appendix 2. The first page refers to the contributor; the second page gathers information on the context of data collection. In the light of experience a number of weaknesses were identified. The points made in this section owe a great deal to a valuable critique of the background information sheets written by Jenny O'Brien and Shelley Robertson.

(i) The background information sheet asks people to state whether they have spent time out of New Zealand, and if so for how long. It does not ask them where they have spent that time. In retrospect, it would be useful to know whether people had spent a little time in many countries, or a larger amount of time in one place. It would also be useful to know whether those places had been English-speaking or not.

(ii) Questions intended to elicit information on regional origins could usefully be more specific. The background information sheet asks only for place of birth. Questions asking where people had grown up or where they had lived for more than a certain time period (say 3 years) would have been more informative.

(iii) For Maori informants it would have been useful to ask about iwi affiliation.

(iv) The questions on language background provided only minimal information. Information on any language regularly used, not just on first language, would have been useful.

(v) The question on ethnicity should have indicated that respondents could circle more than one ethnic group if appropriate.

(vi) The question asking for highest educational qualifications caused embarrassment to some contributors. Rewording could avoid this.

(vii) Questions on employment need to be worded to distinguish between students and non-students employed in similar part-time jobs: e.g. in pubs, restaurants, or unskilled manual labour.

Any request for detailed information has to be weighed against the inconvenience it causes contributors. If too much information is requested, contributors may become less willing to be involved. In private or informal situations where the contributors are known to the collector, a detailed background information sheet may be appropriate. When background information is requested by mail (e.g. for broadcast material), from relative strangers or participants in work environments (e.g. transactions), or from large groups of contributors (e.g. meetings), long questionnaires requesting seemingly irrelevant or personal information may be filled out incorrectly, partially or not at all.

The task of obtaining accurate background information from all of those whose speech was collected turned out to be one of the most problematic aspects of the whole project. One very obvious rule was to ensure participants completed background information sheets at the time of the recording, and that they filled them in with as much detail as possible. A check at the time by the data collector saved hours of inconvenience later attempting to collect information which had been inadvertently omitted. This worked well for most of the non-broadcast data, but obtaining background information from those who had been recorded from radio and television was an on-going problem throughout the project.

In some cases interviewees on recorded radio programmes proved impossible to identify. Though their interviewers were often remarkably helpful, there were always some who proved untraceable. So, for instance, some wonderful examples of New Zealand speech were recorded at a motorbike rally and a country fair, but it was simply not possible to locate the contributors. Moreover, in some cases the excerpts from different contributors included in the broadcast programme were too short to justify the huge amount of effort which would have been involved in identifying them and obtaining their consent to use their speech.

The best advice in the light of experience in this area is that background information sheets be sent out with stamped addressed envelopes as soon as possible after a programme has been recorded. Intensive follow-up by telephone and fax can be reasonably effective, especially if sympathetic and helpful administrators within the broadcasting network can be located. Indeed, good relations with broadcasters proved essential for a number of reasons, since their assistance was so important in tracing contributors and obtaining copyright permission.

People working in the private sector were more likely to respond to a brief, focussed letter which was followed up by a phone call to the individual or their PA. Accompanying information on the project also needed to be concise and to focus on areas of interest to the general public rather than on methodological concerns.

It was also important to conserve resources by not transcribing any material until background information sheets establishing the eligibility of contributors for inclusion had been obtained for all contributors involved.