But this raises questions about the relationship between statistical significance
and language data. For example, if a particular text is significantly higher
in hedges or emphatics, does that mean that the difference would be
noticeable to a reader? Conversely, it seems that a non-significant
feature could still be quite noticeable to the reader. This is especially
problematic, it seems, with features that have very low frequencies: it
is not difficult to find significant differences, yet with such low overall
frequencies, it's hard to assume that the reader would notice the difference
between 2 per thousand words and 5 per thousand words.
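
To make the arithmetic concrete, here is a minimal sketch, assuming the
two-cell log-likelihood (G2) statistic that is commonly used for corpus
frequency comparisons. The function name, the text sizes of 1,000 and 10,000
words, and the choice of G2 itself are illustrative assumptions, not the
test used on the data discussed here.

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-cell G2 for one feature counted in two texts (hypothetical helper)."""
    total = count_a + count_b
    # Expected counts if the feature were spread evenly over both texts.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


CRITICAL_P05 = 3.84  # chi-square critical value for 1 df at p < 0.05

# Same rates (2 vs 5 per thousand words), two assumed text lengths.
for words in (1_000, 10_000):
    count_a = 2 * words // 1000
    count_b = 5 * words // 1000
    g2 = log_likelihood(count_a, words, count_b, words)
    print(words, count_a, count_b, round(g2, 2), g2 > CRITICAL_P05)
```

Under these assumptions the statistic is roughly 1.33 for the 1,000-word
texts and roughly 13.28 for the 10,000-word texts, so the same pair of rates
moves from non-significant to significant purely because the texts are
longer. Nothing in the calculation says whether a reader would notice the
difference.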
Of course I could run an experiment on texts with different degrees of
hedges, emphatics and such to rate reader sensitivity ... *sigh* ... but
that would have to be done feature by feature, and may not adequately take
into account the role of co-occurrence of features.
Has anyone else come across literature, or had thoughts on the role of
statistics in making comparisons between genres, or any other corpus
comparisons? I have often seen the assumption that significant differences can
be used to 'categorize' genres or corpora, but I'm just not comfortable with
that yet. I've been struggling with this question for a while and am not
happy with the options I've come up with.
Kristen Precht
Northern Arizona University
kprecht@iupui.edu