I am looking for training material for the automatic categorization
of HTML documents. I am especially interested in the following
subject areas:
- Soccer
- Tennis
- Formula 1
- Heart Diseases / Cardiology
- Allergies
- Dentistry
Each of the corpora should contain at least 1,000,000 characters to
obtain reasonable results. The categorizer should work for English,
French and German documents, so I am looking for material in all
three languages. The material does not have to be HTML documents.
Does anybody know of available corpora (or WWW sites ...)? A summary
will be posted.
Noemi Preissner
noemi@coli.uni-sb.de