Corpora: Statistics in genre differences

Kristen Precht (kprecht@ruby.iupui.edu)
Fri, 19 Mar 1999 16:11:19 -0500

I have had a similar quandary about interpreting statistics when comparing
differences in PoS tags across genres. I use Doug Biber's tags, and am
interested in comparing features (PoS or other identified textual features,
such as tagged metalanguage markers) across genres or within a genre as
written by L1 and L2 speakers of English. I have been running ANOVAs or
t-tests on normed and standardized tag frequencies, and can find an array of
features which show significant differences, and have run principal
components analysis to compare which features seem most "salient" across
genres or L1 groups.
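
For concreteness, here is a minimal Python sketch of the kind of procedure
described above: norming raw tag counts to a rate per 1,000 words,
standardizing to z-scores, and computing a Welch t statistic between two
groups of texts. The function names and the counts are hypothetical,
invented purely for illustration; this is not Biber's tagger or any
particular package's API.

```python
from math import sqrt
from statistics import mean, stdev


def per_thousand(count, n_words):
    """Norm a raw tag count to a rate per 1,000 words."""
    return 1000.0 * count / n_words


def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = stdev(a) ** 2 / len(a)
    vb = stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical (hedge count, text length in words) pairs, one per text.
l1_texts = [(4, 2000), (6, 2500), (3, 1500), (5, 2000)]
l2_texts = [(10, 2000), (9, 1800), (14, 2500), (8, 2000)]

l1_rates = [per_thousand(c, n) for c, n in l1_texts]
l2_rates = [per_thousand(c, n) for c, n in l2_texts]

# Standardize over all texts pooled, then split back into the two groups.
z = z_scores(l1_rates + l2_rates)
z_l1, z_l2 = z[:len(l1_rates)], z[len(l1_rates):]

t_raw, df = welch_t(l1_rates, l2_rates)
t_std, _ = welch_t(z_l1, z_l2)
print(f"t on normed rates = {t_raw:.2f}, df = {df:.1f}")
print(f"t on z-scores     = {t_std:.2f}")
```

Standardizing is a linear rescaling, so the t statistic is the same on the
normed rates and on the z-scores; standardizing matters for putting
different features on a common scale, not for the test itself.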

But this raises questions about the relationship between statistical
significance and language data. For example, if a particular text is
significantly higher
in hedges or emphatics, does that mean that the difference would be
noticeable to a reader? Or conversely, it seems that a non-significant
feature could still be quite noticeable to the reader. This is especially
problematic, it seems, with features that have very low frequencies ... it
is not difficult to find significant differences, yet with such low overall
frequencies, it's hard to assume that the reader would notice the difference
between 2 per thousand words and 5 per thousand words.
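
One way to make this concern concrete is to separate the test statistic
from the size of the difference. The sketch below assumes two groups of
texts with mean rates of 5 and 2 per thousand words and a standard
deviation of 3 in each group; these summary figures and the function names
are hypothetical, chosen only to illustrate the point. It computes a Welch
t statistic and Cohen's d (a standard effect-size measure, not something
used in the analysis described here) for increasing numbers of texts.

```python
from math import sqrt


def welch_t_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic from group means, SDs, and sizes."""
    return (m1 - m2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d: mean difference in units of the pooled SD."""
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled


# Hypothetical summary statistics: rates per 1,000 words.
results = []
for n in (5, 50, 500):  # number of texts in each group
    t = welch_t_from_summary(5.0, 3.0, n, 2.0, 3.0, n)
    d = cohens_d(5.0, 3.0, n, 2.0, 3.0, n)
    results.append((n, t, d))
    print(f"n = {n:3d} texts per group: t = {t:5.2f}, d = {d:.2f}")
```

Under these assumptions the difference of 3 per thousand words and the
effect size stay fixed while t grows with the square root of the number of
texts, so "significant" mostly reflects how much data was collected.
Neither number says anything about what a reader would actually notice.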

Of course I could run a field experiment on texts with different degrees of
hedges, emphatics and such to rate reader sensitivity ... *sigh* ... but
that would have to be done feature by feature, and may not adequately take
into account the role of co-occurrence of features.

Has anyone else come across literature, or had thoughts on the role of
statistics in making comparisons between genres, or any other corpus
comparisons? I have often seen assumptions that significant difference can
be used to 'categorize' genres or corpora, but I'm just not comfortable with
that yet. I've been struggling with this question for a while and am not
happy with the options I've come up with.

Kristen Precht
Northern Arizona University
kprecht@iupui.edu
