[Corpora-List] OPUS v0.2 is available

From: Jörg Tiedemann (joerg@stp.ling.uu.se)
Date: Sat Jul 12 2003 - 12:06:13 MET DST

OPUS is an open source parallel corpus which is available from
http://logos.uio.no/opus/

Version 0.2 of the corpus contains roughly 30 million tokens
in 60 languages. OPUS is sentence-aligned (1830 language pairs),
tokenized, and partly tagged.

The following subcorpora are included:
   OpenOffice.org   ca.  2.5 million words    6 languages
   PHP manuals      ca.  3.2 million words   21 languages
   KDE messages     ca. 20.5 million words   60 languages
   KDE manuals      ca.  3.8 million words   24 languages

More information can be found on the OPUS home page.

                      ---------------------------
                      Jörg Tiedemann (http://stp.ling.uu.se/~joerg/)
                      Lars Nygaard (http://folk.uio.no/larsnyg/)

=======================================================================

The following tools have been used (not including standard GNU tools):

* align - sentence aligner (based on Gale & Church, 1993)
* OpenNLP & Grok, Jason Baldridge and Gann Bierner
  http://grok.sourceforge.net/
* TnT - Statistical Part-of-Speech Tagging, Thorsten Brants
  http://www.coli.uni-sb.de/~thorsten/tnt/
* TreeTagger - Decision Tree Tagger, Helmut Schmid
  http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
* ChaSen - Japanese tokenizer and tagger
  http://chasen.aist-nara.ac.jp/
* recode - convert between various character encodings
  http://www.iro.umontreal.ca/contrib/recode/HTML/
* tidy - validate, correct, and pretty-print XML files
  http://www.w3.org/People/Raggett/tidy/
* Uplug - tokenizer, sentence-splitter, XML-tools
  http://stp.ling.uu.se/plug/
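
For readers unfamiliar with length-based sentence alignment, the idea
behind the "align" tool listed above can be sketched roughly as follows.
This is not the code used to build OPUS. It is a minimal Python
illustration in the spirit of Gale & Church (1993), and the function
names (match_cost, align) are made up for this example. The priors and
the variance constant are the values commonly quoted from that paper and
should be treated as assumptions.

```python
import math

# Bead types (source sentences, target sentences) and their prior
# probabilities. Values as commonly quoted from Gale & Church (1993);
# treat them as assumptions, not as the settings used for OPUS.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of that ratio per character


def match_cost(len_src, len_tgt):
    """Negative log probability that two text lengths belong together."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(p, 1e-300))


def align(src, tgt):
    """Align two sentence lists; return (src_indices, tgt_indices) beads."""
    n, m = len(src), len(tgt)
    sl = [len(s) for s in src]   # lengths in characters
    tl = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programming over all prefixes of both texts.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = (cost[i][j] - math.log(prior)
                     + match_cost(sum(sl[i:ni]), sum(tl[j:nj])))
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the best path.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


if __name__ == "__main__":
    english = ["Hello world.", "How are you today?"]
    german = ["Hallo Welt.", "Wie geht es dir heute?"]
    for src_ids, tgt_ids in align(english, german):
        print(src_ids, tgt_ids)
```

The sketch only looks at character lengths, so it needs no lexicon and
works for any language pair, which is presumably why this family of
methods is attractive for a corpus with many languages.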
