Re: Corpora: e-mail corpus

William C. Spruiell (3lfyuji@cmich.edu)
Sat, 18 Apr 1998 12:47:05 -0400

I recently finished a pilot project that involved analyzing netnews
postings; I found a shareware program, Gravity, that allowed the user to
store all new messages from marked groups on a local hard drive, where they
could be subjected to standard string searches. Unfortunately, the program
does not allow the message database to be dumped as a plain ASCII file, so
using a full concordancer with it is impossible. Messages containing
searched-for strings, of course, can be cut and pasted into ASCII files (and
if you have lots of time or a phalanx of assistants, I suppose you could do
that with *all* messages). I looked at netnews because I was interested in
argumentation, but it has the added advantage of sidestepping the privacy
issue, since netnews postings are fully public.

The material itself raises a number of interesting analytical issues. For
example, what is the implication for, say, type/token ratios of a medium in
which users commonly copy the entirety of a preceding message into a current
one?
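
As a rough illustration of the type/token question (a sketch only, not the
procedure used in the pilot project): assuming quoted material is marked with
a leading ">" as most newsreaders do, one could compare the ratio computed
over a whole posting with the ratio computed over only the newly written
lines. The function names and the sample message are invented for the
example, and the tokenizer is deliberately crude.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct tokens divided by total tokens (0.0 if empty)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def strip_quoted(message):
    """Drop lines that begin with '>' (assumed to mark quoted text)."""
    kept = [line for line in message.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)


# Hypothetical posting: two quoted lines followed by one new line.
sample = (
    "> the cat sat on the mat\n"
    "> the cat sat on the mat\n"
    "I disagree: the dog sat there"
)

with_quotes = type_token_ratio(tokenize(sample))
without_quotes = type_token_ratio(tokenize(strip_quoted(sample)))
print(with_quotes, without_quotes)  # 0.5 1.0
```

In this toy case the quoted lines add tokens but few new types, so the ratio
for the full posting is half that of the new text alone.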