Re: [Corpora-List] On tools for indexing and searching large corpora

From: Sylvain Loiseau (sylvain@toucheraveclesyeux.com)
Date: Wed Nov 20 2002 - 18:41:32 MET


    Dear Serge,

    If you have a valid XML-encoded corpus (and, more basically, if you
    want to check that it is valid XML), regexes are not the best tool:
    you could consider using a parser, and for efficiency a C parser.
    This still lets you keep Perl as your main language, since Perl
    wrappers exist for the best C libraries, such as XML::LibXML (a
    wrapper for the libxml2 library, which now seems to be a real SAX
    parser, i.e. it does not buffer the whole document in memory) and
    XML::SAX::Expat (which moves James Clark's Expat library into the
    SAX2 idiom). Both are available on CPAN.
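
    As a minimal, untested sketch of such a validity check (assuming the
    corpus sits in a single file, here hypothetically called corpus.xml,
    and that it has a DTD to validate against), something like this
    should do:

        use strict;
        use XML::LibXML;

        # validation(1) asks libxml2 to check the document against its
        # DTD while parsing; note that parse_file() builds a DOM, so for
        # a corpus of your size you would rather validate chunk by chunk
        # with the splitting approach sketched below.
        my $parser = XML::LibXML->new();
        $parser->validation(1);

        my $doc = eval { $parser->parse_file('corpus.xml') };
        die "corpus.xml is not valid: $@" if $@;

        print "Root element: ", $doc->documentElement->nodeName, "\n";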

    If you use XSLT/XPath, which is the best way to get a powerful (and
    standard) query language without reinventing the wheel, you could
    consider using Splitter and Merger SAX handlers to split your
    document into middle-sized units (like <text> in TEI), buffer each
    chunk, and process the chunks with an XSLT processor (this is easy
    with XML::LibXML and XML::LibXSLT; see XML::Filter::XSLT on CPAN for
    an example of an XSLT filter inside a SAX handler).
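
    As a rough sketch of the chunk-processing step only (the splitter
    itself is left out; assume each <text> has already been written to
    its own small file, hypothetically text-001.xml, text-002.xml, ...,
    and that query.xsl is your stylesheet):

        use strict;
        use XML::LibXML;
        use XML::LibXSLT;

        my $xslt       = XML::LibXSLT->new();
        my $stylesheet = $xslt->parse_stylesheet_file('query.xsl');

        # Process one middle-sized unit (one <text>) at a time, so only
        # the current chunk is ever held in memory.
        for my $chunk_file (glob 'text-*.xml') {
            my $chunk   = XML::LibXML->new()->parse_file($chunk_file);
            my $results = $stylesheet->transform($chunk);
            print $stylesheet->output_string($results);
        }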

    Another solution to consider is to store your TEI-XML documents in a
    native XML database. Sleepycat's Berkeley DB XML is no doubt helpful
    here (a new alpha has just been released), since it allows XPath
    queries to be run over very large corpora. But I wonder (without
    having tested this further) whether the size of the index needed for
    a deeply annotated corpus would not simply replace the
    memory-consumption problem of the buffering (XPath, XSLT) approach.

    Berkeley DB XML: http://www.sleepycat.com/xml/index.html
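
    To give an idea of the kind of XPath query involved (shown here,
    untested, with XML::LibXML's findnodes rather than the DB XML API,
    and over a hypothetical chunk file), finding all the word tokens
    that the tagger left ambiguous in your example below could look
    like:

        use strict;
        use XML::LibXML;

        my $doc = XML::LibXML->new()->parse_file('text-001.xml');

        # <w> elements carrying more than one <ana> child are the ones
        # the POS tagger could not disambiguate.
        for my $w ($doc->findnodes('//w[count(ana) > 1]')) {
            print $w->toString, "\n";
        }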

    Please let me know your choice.
    Regards,

    Sylvain Loiseau

    ----- Original Message -----
    From: "Serge Sharoff" <sharoff@aha.ru>
    To: <corpora@lists.uib.no>
    Sent: Tuesday, November 19, 2002 12:03 PM
    Subject: [Corpora-List] On tools for indexing and searching large corpora

    > Dear all,
    >
    > I'm in the process of compiling a corpus of modern Russian comparable
    > to the BNC in its size and coverage. The format of the corpus is based
    > on TEI, for instance,
    >
    > <s id="nashi.535">
    > ...
    > <w>глава
    > <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
    > <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
    > </w>
    > <w>Владивостока
    > <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
    > </w>
    > ...
    > </s>
    >
    > In the first case, the POS tagger detects but cannot resolve an
    > ambiguity between two possible readings (masc., animate, i.e. "the
    > head of", and fem., inanimate, i.e. "the chapter of"), so both
    > analyses are left in.
    >
    > Currently, for searching the corpus, I use custom tools written in
    > Perl and based on regular expressions. As the corpus gets larger
    > (currently 40 million words), the indexing scheme becomes totally
    > inefficient, and I'm reluctant to reinvent the wheel by improving it.
    >
    > What is the technology used in the BNC and other annotated corpora of
    > similar size? Can it be applied in this case (given the need to cope
    > with possible ambiguity)? The corpus uses Win-1251 encoding, but
    > eventually I plan to convert it to Unicode. Any suggestions?
    >
    > Best,
    > Serge


