OPUS is an open-source parallel corpus, available from
http://logos.uio.no/opus/
Version 0.2 of the corpus contains roughly 30 million tokens
in 60 languages. OPUS is sentence-aligned (1830 language pairs),
tokenized, and partly tagged.
The following subcorpora are included:
  OpenOffice.org   ca.  2.5 million words    6 languages
  PHP manuals      ca.  3.2 million words   21 languages
  KDE messages     ca. 20.5 million words   60 languages
  KDE manuals      ca.  3.8 million words   24 languages
More information can be found on the OPUS home page.
---------------------------
Jörg Tiedemann (http://stp.ling.uu.se/~joerg/)
Lars Nygaard (http://folk.uio.no/larsnyg/)
=======================================================================
The following tools have been used (not including standard GNU tools):
* align - sentence aligner (based on Gale & Church, 1993)
* OpenNLP & Grok, Jason Baldridge and Gann Bierner
  http://grok.sourceforge.net/
* TnT - Statistical Part-of-Speech Tagging, Thorsten Brants
  http://www.coli.uni-sb.de/~thorsten/tnt/
* TreeTagger - Decision Tree Tagger, Helmut Schmid
  http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
* ChaSen - Japanese tokenizer + tagger
  http://chasen.aist-nara.ac.jp/
* recode - convert between various character encodings
  http://www.iro.umontreal.ca/contrib/recode/HTML/
* tidy - validate, correct, and pretty-print XML files
  http://www.w3.org/People/Raggett/tidy/
* Uplug - tokenizer, sentence-splitter, XML tools
  http://stp.ling.uu.se/plug/