Kristen Precht wrote:
[snip]
> But this begs questions on the relationship between statistical significance
> and language data. For example, if a particular text is significantly higher
> in hedges or emphatics, does that mean that the difference would be
> noticeable to a reader? Or conversely, it seems that a non-significant
> feature could still be quite noticeable to the reader. This is especially
>
> Of course I could run a field experiment on texts with different degrees of
> hedges, emphatics and such to rate reader sensitivity ... *sigh* ... but
> that would have to be done feature by feature, and may not adequately take
> into account the role of co-occurrence of features.
>
Our techniques would pick up the differences in emphasis and do take
into account co-occurrences. The difficulty is that one needs to
examine the results (graphs) and then use the other statistics that are
generated to go back and find out precisely what gave rise to the
differences.
> Has anyone else come across literature, or had thoughts on the role of
> statistics in making comparisons between genres, or any other corpus
> comparisons? I have often seen assumptions that significant difference can
> be used to 'categorize' genres or corpora, but I'm just not comfortable with
> that yet. I've been struggling with this question for a while and am not
> happy with the options I've come up with.
>
I ran MCCA against Adam Kilgarriff's `gold standard' Known-Similarity
Corpora to good effect and was even able to question the presumed
similarity of some of his textual materials (e.g., even though a set of
texts might have been drawn from the Guardian, they could have been
drawn from different genres such as movie reviews, gardening tips, and
straight news).
--
Ken Litkowski
CL Research
9208 Gue Road
Damascus, MD 20872-1025 USA
TEL.: 301-482-0237
EMAIL: ken@clres.com
Home Page: http://www.clres.com