Re: comparisons in text corpora: keywords / CHI square

Mr A.P. Berber Sardinha (tony1@liverpool.ac.uk)
Thu, 29 Aug 1996 17:28:14 +0100 (BST)

Hi Marc,
The KeyWords program has helped me carry out several comparisons
between word lists in various situations. More recently we have
used it to identify words which are characteristic of the discourse
of British schoolchildren represented in the APU archive.
This work was presented at the recent TALC conference in Lancaster.
KeyWords has also been used to compare the styles of lawyers and
witnesses in the OJ Simpson trial (!) and to help find internal
boundaries in business texts, among other things.
You may want to have a look at a forthcoming paper
in LWPAL (Liverpool Working Papers in Applied
Linguistics) where I review some of the studies I know of that
have been carried out using KeyWords. LWPAL's page is at
http://www.liv.ac.uk/~tony1/lwpal.html but I'm afraid the
article in question isn't there yet. If you want it please let
me know. The article also reviews some other uses of KeyWords, e.g.
key word databases, and briefly mentions Clumps and Associates.
Apparently the program's author (Mike Scott) has been working on
alternatives to chi-square of the kind that Ted Dunning and Adam
Kilgarriff discuss.
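
As an illustration only: one alternative to chi-square commonly
associated with Dunning's work is the log-likelihood ratio (G2).
Whether this is the alternative meant here is an assumption, and the
function name and table layout below are hypothetical, not taken from
KeyWords itself. A minimal sketch for one word in a 2x2 table:

```python
import math


def log_likelihood_g2(a, b, c, d):
    """Log-likelihood ratio (G2) for a 2x2 frequency table.

    a = count of the word in the study list
    b = count of all other words in the study list
    c = count of the word in the reference list
    d = count of all other words in the reference list
    (This layout is an assumption made for the sketch.)
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


# Invented counts: a word seen 2 times in 7 tokens vs 0 times in 10 tokens
score = log_likelihood_g2(2, 5, 0, 10)
print(score)
```

A word with the same relative frequency in both lists scores zero;
the further the two relative frequencies diverge, the higher the score.
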
Please let me know more about your current and previous investigations
using key words.

Cheers

Tony

In the last mail Marc Weeber said:
>
> Hello corpora people,
>
> At the moment, I'm trying to isolate certain areas in a corpus to
> extract area-specific keywords. The corpus consists of abstracts of
> medical articles concerning one drug. I'm interested in extracting
> the side effects of this drug. I have located the areas concerning
> side effects, and I want to compare these areas with the rest of the
> corpus. The method I'm using is the KeyWords program of the WordSmith
> Tools package. This program compares the frequencies of words between the
> subset and the complete corpus. Words that are significantly more
> frequent in the subset than in the complete set (tested with chi-square)
> are called `keywords' of the subset.
>
> Now I have two questions:
>
> 1 What exactly should I use as the reference corpus: the complete corpus
> of abstracts, or the complete corpus minus the subset? In the former
> case, words that occur in the subset are counted twice (in the subset and
> in the reference corpus). The results will be more conservative compared
> to the latter case. However, I don't know which method to use, which
> leads to the second question:
> 2 Can someone give me more background on the use of keywords as a
> means of comparison between two sets (actually, lists of words)?
> Comments, references to books, articles, URLs, etc., would be much
> appreciated.
>
> thanks in advance,
>
> Marc Weeber
> marc@farm.rug.nl
>
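
The comparison Marc describes, and the effect of the two choices of
reference corpus in his first question, can be sketched roughly as
follows. This is a sketch under stated assumptions, not how WordSmith
Tools computes its scores: it uses a plain Pearson chi-square on a 2x2
table without continuity correction, the function names are
hypothetical, and the word counts are invented toy data.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table

                    word    other words
        subset       a          b
        reference    c          d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def keyness(word, subset_counts, reference_counts):
    """Chi-square score of one word, subset list against reference list."""
    a = subset_counts[word]
    b = sum(subset_counts.values()) - a
    c = reference_counts[word]
    d = sum(reference_counts.values()) - c
    return chi_square_2x2(a, b, c, d)


# Invented toy data: 'subset' stands for the side-effect passages,
# 'rest' for everything else in the corpus.
subset = Counter("nausea headache nausea rash the the of".split())
rest = Counter("the of dose the patients of the trial the of".split())
whole = subset + rest  # complete corpus, subset included

for word in ("nausea", "the"):
    print(word,
          "vs whole corpus:", keyness(word, subset, whole),
          "vs corpus minus subset:", keyness(word, subset, rest))
```

On this toy data the score for "nausea" is lower against the whole
corpus than against the corpus minus the subset, which matches the
remark that the former choice is the more conservative one. Note also
that a chi-square test of this kind assumes two independent samples,
which holds only when the subset is excluded from the reference list.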

-- 
---------------------------------------------------
Tony Berber Sardinha     | tony1@liverpool.ac.uk
AELSU                    | Fax 44-51-794-2739
University of Liverpool  |
PO Box 147               | http://www.liv.ac.uk/
Liverpool L69 3BX        | ~tony1/homepage.html
UK                       |
---------------------------------------------------
My karma ran over my dogma ...... `' -o-o-
Everything should be as simple as possible but no 
simpler. (A Einstein)