Corpora: Correction: Morphologically Analyzed and Disambiguated Turkish News Text Available

From: Kemal Oflazer (ko@cs.bilkent.edu.tr)
Date: Thu Apr 27 2000 - 10:38:12 MET DST

    The previous message had a typo in the URL. Apologies.
    --------

    Dear All,

    We have made morphologically analyzed and disambiguated Turkish news text
    available for download. The disambiguation was performed with a
    statistical disambiguator; no manual corrections have been attempted.

    A morphological parse is represented as a sequence of features, with
    derivations marked by the symbol ^DB. Morphological analysis was
    performed by the Turkish analyzer developed using the XRCE Finite State
    Tools. Unknown words were analyzed with an unknown-word processor, and
    the resulting candidate parses were disambiguated as well.

    A typical sentence is tagged as follows: the first token on each line is
    the word, and the rest of the line is its disambiguated morphological
    analysis.
     
    <S> <S>+BSTag
    Eğitim eğitim+Noun+A3sg+Pnon+Nom
    hizmetlerinin hizmet+Noun+A3pl+P3sg+Gen
    ülkenin ülke+Noun+A3sg+Pnon+Gen
    her her+Det
    kişisine kişi+Noun+A3sg+P3sg+Dat
    ve ve+Conj
    köşesine köşe+Noun+A3sg+P3sg+Dat
    ulaştırılmış ulaş+Verb^DB+Verb+Caus^DB+Verb+Pass+Pos+Narr+A3sg
    olması ol+Verb+Pos^DB+Noun+Inf+A3sg+P3sg+Nom
    bunlardan bu+Pron+DemonsP+A3pl+Pnon+Abl
    birisidir biri+Pron+A3sg+P3sg+Nom^DB+Verb+Zero+Pres+Cop+A3sg
    </S> </S>+ESTag
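
    The line format above can be split mechanically. The following is a
    minimal Python sketch, not part of the distribution: the names Token and
    parse_line are hypothetical, and it assumes that the word and its
    analysis are separated by whitespace, that the root is the text before
    the first "+", and that the root itself contains no "+" or "^DB".

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str             # the word as it appears in the text
    root: str                # the part of the analysis before the first '+'
    groups: List[List[str]]  # one feature list per ^DB-delimited segment


def parse_line(line: str) -> Token:
    """Split one 'word analysis' line into surface form, root, and features."""
    # Assumption: the first whitespace run separates word and analysis.
    surface, analysis = line.split(None, 1)
    segments = analysis.strip().split("^DB")
    # The first segment begins with the root; later segments begin with '+'.
    root, *first_features = segments[0].split("+")
    groups = [first_features]
    for segment in segments[1:]:
        groups.append([feature for feature in segment.split("+") if feature])
    return Token(surface, root, groups)


if __name__ == "__main__":
    token = parse_line("olması ol+Verb+Pos^DB+Noun+Inf+A3sg+P3sg+Nom")
    print(token.root, token.groups)
```

    Keeping one feature list per ^DB segment preserves the derivation steps,
    so the final part of speech of a word is the first feature of the last
    group.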

     

     

    CAVEAT: On small test sets we have seen an accuracy of 94% (over 95% if one
    ignores some semantic markers). We expect similar accuracy on this
    corpus, but we have not measured it here. Before disambiguation, the text
    had about 2 morphological parses per token. If you notice any errors,
    please let us know and we will update the copies on the server.

    The Turkish text is encoded in ISO Latin-5 (ISO-8859-9). The text of
    about 1M words can be retrieved either as a single file or as a batch of
    shorter files. For an explanation of the morphological symbols used, and
    for download details, see

    http://www.nlp.cs.bilkent.edu.tr/Center/Corpus/
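
    Because the files use ISO Latin-5 rather than UTF-8, they need to be
    decoded explicitly. The following is a minimal Python sketch under that
    assumption; the function names and the file paths passed to them are
    hypothetical and do not refer to actual file names on the server.

```python
from typing import Iterator

# Python's codec name for ISO Latin-5.
ENCODING = "iso-8859-9"


def iter_lines(raw: bytes) -> Iterator[str]:
    """Decode raw corpus bytes and yield the non-empty lines."""
    for line in raw.decode(ENCODING).splitlines():
        if line.strip():
            yield line


def convert_file_to_utf8(src_path: str, dst_path: str) -> None:
    """Copy a Latin-5 file to a UTF-8 file, line by line."""
    with open(src_path, encoding=ENCODING) as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)


if __name__ == "__main__":
    # 0xF0 is 'ğ' and 0xFE is 'ş' in ISO-8859-9.
    sample = b"E\xf0itim e\xf0itim+Noun+A3sg+Pnon+Nom\n"
    print(list(iter_lines(sample)))
```

    Reading the files with the wrong encoding typically shows up as replaced
    or missing Turkish letters such as ğ, ş, and ı.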

    Please let us know of any problems.

    -- 
    Kemal Oflazer                   e-mail: ko@cs.bilkent.edu.tr
                                    http://www.cs.bilkent.edu.tr/~ko/ko.html
    Bilkent University              tel: (90-312) 266-4133 (Sec)
    Dept. of Computer Engineering                 290-1258 (Office)
    Bilkent, ANKARA, 06533 TURKEY        (90-532) 447-8978 (Mobile)
                                    fax: (90-312) 266-4126        
    


