Corpora: SQL Server as an option for large, fast web-based corpora

From: Mark Davies (mdavies@ilstu.edu)
Date: Mon Dec 04 2000 - 20:30:20 MET

    For the last three or four years I've been looking for a program to allow
    me to provide access to large (100+ million word) corpora via the Web, and
    provide users with fast (<10 sec) queries and KWIC-formatted
    output. Virtually all of the commercially available "web-indexing"
    packages are designed to return a list of web pages with the desired
    content, but they provide no way to extract and list just the relevant
    sections of each web page in KWIC format. I am also aware of several
    PC-based solutions that are designed to provide KWIC-format output, but
    these are for use on a local workstation, and have not yet been
    (completely) modified to provide Web access.

    Within the last three or four months I've developed a schema that does
    allow access to corpora via the Web, and this approach involves:
       -- SQL Server (including the new "full-text searching" option in 7.0)
       -- ADO (ActiveX Data Objects), and
       -- ASP scripts (Active Server Pages, using VBScript).
    This schema allows fairly fast access (<10 seconds for most queries) to
    large corpora (~180-200 million words). The output is displayed in KWIC
    format, and the results can further be sorted by left or right context words.
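    As a rough illustration of the KWIC step and the left/right context
    sorting described above, here is a minimal in-memory sketch in Python.
    It is not the ASP/ADO code used for the corpora; the function names
    (kwic, sort_by_left, sort_by_right, format_kwic) and the simple
    tokenizer are assumptions made for the example.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, node, right) tuples for every hit of `keyword`.

    `width` is the number of context words kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    node = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_by_right(hits):
    # Order hits by the words following the node, nearest word first.
    return sorted(hits, key=lambda h: h[2])

def sort_by_left(hits):
    # Order hits by the words preceding the node, nearest word first,
    # which is why the left context is reversed for the sort key.
    return sorted(hits, key=lambda h: h[0][::-1])

def format_kwic(hits, col=30):
    # Right-align the left context so the node words line up in a column.
    return [
        f"{' '.join(left):>{col}}  [{node}]  {' '.join(right)}"
        for left, node, right in hits
    ]
```

    For example, format_kwic(sort_by_right(kwic(text, "sat", width=2)))
    would list each hit on "sat" ordered by the words that follow it.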

    Examples of these corpora can be found at:

    http://mdavies.for.ilstu.edu/corpus
       3 million word corpus of historical Spanish
    http://mdavies.for.ilstu.edu/corpus/publico
       180+ million word corpus of Modern Portuguese

    The one major shortcoming of this approach is that it is limited by the
    (overly restrictive) native search syntax of the "full-text" search engine
    in SQL Server, which serves as the backbone for the corpora. While it is
    possible to do wildcard OR proximity searches (e.g. 1-3 intervening words),
    it is not possible to combine these two types of queries. In addition, it
    is not possible to do left-branching wildcard queries (*ing, *ization,
    etc). Using script-based serial queries (which would be invisible to the
    end user), however, it should be possible to replicate most of these more
    advanced queries.
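    One way the serial-query idea might work is sketched below in Python,
    purely as an assumption about the general shape: first expand the
    wildcard (including a left-branching one such as *ing) against the
    word list, then run one proximity check per expanded word and merge
    the results. The in-memory token list stands in for the full-text
    engine, and expand_wildcard, near and wildcard_near are hypothetical
    names, not part of SQL Server or ADO.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, vocabulary):
    """Step 1: list the concrete words that match a pattern such as '*ing'."""
    return sorted(w for w in vocabulary if fnmatchcase(w, pattern))

def near(tokens, first, second, max_gap=3):
    """Step 2: find (i, j) where `second` follows `first` at position i
    with at most `max_gap` intervening words."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        window = tokens[i + 1:i + 2 + max_gap]
        for offset, other in enumerate(window):
            if other == second:
                hits.append((i, i + 1 + offset))
    return hits

def wildcard_near(tokens, first, pattern, max_gap=3):
    """Combine both steps: one proximity query per expanded word."""
    results = []
    for word in expand_wildcard(pattern, set(tokens)):
        results.extend(near(tokens, first, word, max_gap))
    return sorted(results)
```

    In a real deployment each step would presumably be a separate query
    issued by the script, with only the merged result shown to the user.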

    In addition to a more robust search syntax, there are other improvements,
    such as more output options (number of words shown and sorting), that I
    could/should integrate into the corpora. For now, though, I think they
    still provide some indication of what can be done.

    At any rate, I'm sending this to CORPORA simply to get feedback from those
    who are working on similar approaches for PC/NT-based web-accessible
    corpora. I'd appreciate any comments that you might have.

    Mark Davies
    Illinois State University

    =======================================
    Mark Davies, Associate Professor, Spanish Linguistics
    Dept. of Foreign Languages, Illinois State University
    Normal, IL 61790-4300

    Voice:309/438-7975 email:mdavies@ilstu.edu
    Fax:309/438-8038 http://mdavies.for.ilstu.edu/
    =======================================