Data Collection

1. Ethical Issues - no surreptitious recording

A firm decision was taken at the start of the project that all contributors would know that they were being recorded - in other words, no surreptitious taping (see Holmes 1996). Apart from respecting the contributor's privacy, it is essential in a large-scale project such as the WSC, which relies heavily on the social networks of the collectors, to maintain the trust and confidence of the community in which the recordings are made. There is no quicker way for linguists to lose that trust than to bug people.

While the issue of surreptitious recording is not relevant to broadcast data, the problem of how to obtain genuine or unmonitored speech arises for all other speech categories. This is the "observer's paradox": how to record the way people speak when they are not being observed (Labov 1972:181).

Our general practice was to ask people to collect at least thirty minutes of conversation. In fact, a number of contributors recorded an hour or more. This allowed us to take a sample from a point at which it seemed likely the speakers had forgotten the tape recorder. In general this proved a very successful technique, and, because we were able to select samples which began well into the recorded material, the majority of recorded conversations are as natural as possible in the circumstances.

Another way in which we attempted to circumvent this problem was to collect recordings without the speaker knowing that they were being recorded on that particular occasion. This involved the data collector asking the person in advance if they would agree to be recorded at some future date without necessarily being informed at the specific time of the recording. They were told immediately afterwards and then had the right to veto the use of the tape. This strategy was used for a small number of telephone conversations and some face-to-face conversations. It is worth noting, however, that in the case of face-to-face interactions it did not always yield usable data: the quality of surreptitious recordings is often dubious since the microphone is rarely in the best position for collecting the data (cf. Labov 1984).


2. Collecting Spoken Data

As mentioned in the Preface, the WSC data collection was a collaborative effort drawing on the goodwill and generosity of many volunteers as well as a team of paid research assistants. The first step in actually collecting the data was to translate the list of text categories into small manageable data collection tasks. Each of the categories of speech that we decided to include in the WSC presented its own particular problems for data collection. Broadcast speech was easy to record, for example, but, as mentioned above, raised issues of copyright, as well as enormous problems in collecting relevant social information about the speakers. Non-broadcast speech was difficult to record, but collecting information about the speakers was relatively straightforward. This section describes how we solved some of the methodological problems which arose in just four areas: collecting lectures, transactions, telephone conversations and face-to-face relaxed conversation.


2.1. Collecting lectures

Our goal for formal lectures was a minimum of 28,000 words. Most of this data was collected at Victoria University of Wellington, for obvious practical reasons. Our first step, using departmental secretaries as sources of information, was to establish which members of the university staff were eligible contributors. A large proportion of New Zealand university staff are recruited from overseas, so the number of eligible contributors was relatively restricted.

The second step involved selecting a sample to represent a range of disciplines, and to provide appropriate Maori and gender representation. Data was recorded from lecturers in arts, sciences and the professional areas (law, commerce, architecture), and the sample included both female and male teachers, as well as four Maori lecturers. The final sample constituted 32,000 words, from which 28,000 would be selected for inclusion in the corpus. The 32,000 words were allocated as follows, in terms of gender and ethnicity. The actual number of transcribed words included in the corpus is also provided.


                 Word target     Words transcribed
Pakeha women
Pakeha men
Maori women
Maori men

The third step, the actual recording, proved relatively unproblematic for lectures, since all the staff involved were very cooperative. The range of lecturing styles, however, proved to be a further variable which we did not attempt to control, but which was noted. Some staff used a relatively formal style, staying close to their notes and to the lectern; others moved around much more and invited participation from the class. While the former were easier to record, it seemed important to include a representative range of styles rather than to select those who provided the fewest methodological problems. Consequently the excerpts in this category include a range of lecturing styles.



2.2. Collecting transactions

The goal for transactions and meetings was 100,000 words. Our original conception of a "transaction" was a canonical business transaction in which goods or services were exchanged. The Corpus Research Advisory Group generated many ideas for collecting business transactions but the reality regularly defeated us. It was impossible to obtain consent from many potential venues (e.g. travel agents, estate agents, information desks) because the management feared clients would be inhibited by the tape recorder and this would adversely affect their business. Many people felt it would be an intrusion on clients' privacy (e.g. student loans desk, banks). A number of shops were investigated, but they frequently proved too noisy and the management often had reservations about recording interactions at the complaints or order desk. At the other extreme the library provided many interactions which were totally non-verbal and thus inappropriate for a speech corpus.

We did finally collect a reasonable range of business transactions, but many involved a huge amount of work and planning for very small returns. Transactions in shops, for instance, required a great deal of setting up, including notices to customers that they were being recorded, and the end result was a very short exchange, often of very unclear quality. Transactions where our aims had been discussed with the customer in advance were much more successful. In some cases friends agreed to allow us to record a transaction in which they were involved: e.g. planning a holiday or visiting an estate agent. And some student research assistants with initiative managed to collect transactions in venues such as a hairdresser's shop and a vet's surgery. Longer transactions, such as administrators advising students at enrolment, were also more worthwhile in terms of quality and return for effort.

Contact with the personnel working on the ICE-NZ component of the ICE project indicated that they were having similar problems, and that they had decided to include formal meetings as examples of transactions. We therefore relaxed our criteria and collected data from a range of different types of meetings, from school staff meetings through university committee meetings, to the meetings of recreational clubs. With this modification, the goal of 100,000 words for this category became feasible.

Thus, our final definition of a transaction identified two crucial criteria: "a transaction consists of an interaction between two or more people (i) where the participants are acting predominantly in role (e.g. customer-shop-keeper, client-lawyer, student-adviser) or (ii) where the structure of the interaction is mainly determined by an agreed agenda".


2.3. Telephone conversations

The goal for telephone conversations was a demanding 70,000 words, which we justified by pointing to the extensive use most people make of the phone in their everyday interactions. Collecting this amount of telephone talk proved very difficult. Firstly, special equipment was needed and the first few telephone pick-up microphones used proved unsatisfactory. The microphone finally used (an Olympus Pearlcorder TP3) was very small and required one speaker to place it in their ear; it then picked up both ends of the conversation well. Secondly, the requirement that people inform their addressees that they were being recorded severely inhibited the data collection. Some collectors found this so difficult that they gained permission to record in advance, as described in section 1, Ethical Issues - no surreptitious recording. At the end of the call, they would then inform their addressees that they had been recorded.

We explored the possibility of using a variety of established help services (e.g. the student helpline at enrolment, police enquiries), but none were willing to assist because they feared that recording would discourage users. Another strategy which proved to be not worth the cost or effort involved was a free phone service (with a toll bar!) provided during the university enrolment period. This was set up in a room where students could phone their friends free of charge provided they were willing to record the conversation. Follow-up was then necessary to collect background information from the people phoned in order to eliminate any who were ineligible. This ultimately proved an expensive way of collecting telephone conversations when account was taken of the cost of a research assistant to monitor the equipment, the cost of the calls, and the cost of the follow-up in terms of time and stamped addressed envelopes.


2.4. Collecting "natural" conversations

Our ideal in collecting informal conversations was to remove ourselves from the process as far as was consistent with obtaining good quality recordings. So where possible we supplied our contributors with information about the aims of the project, with good quality equipment, and practice in operating it, and then left them to tape-record themselves. We set up the recording equipment if requested to do so, but by far the majority of recordings were successfully made by people who chose their own time with their selected co-participants and made the recording themselves. Most recordings were made in people's homes, though some were made in workplaces at tea breaks, during lunchtime or after work. The result was that contributors recorded material at times that were convenient for them and the conversations were much more relaxed and natural than if they had involved researchers as observers.

Our corpus group decided that 50% of the WSC should consist of private informal conversations, the most pervasive form of talk in any speech community (see Holmes 1995). Collecting 500,000 words of conversation proved quite a challenge. It was accomplished only as a result of the generous efforts of many friends, colleagues and acquaintances, together with contributions from students over several years of Linguistics courses. As a result, the sample is inevitably biased in favour of New Zealanders from Wellington, and young female Pakeha students constitute a larger proportion of the total in this category than would be the case in a representative sample.