Corpora: CFP: Web-Based Language Documentation and Description

From: Steven Bird (sb@unagi.cis.upenn.edu)
Date: Thu Jul 27 2000 - 19:27:13 MET DST

                            CALL FOR PARTICIPATION

               Web-Based Language Documentation and Description

                    Philadelphia USA, 12-15 December 2000

                    http://www.ldc.upenn.edu/exploration/

                 Institute for Research in Cognitive Science
                          University of Pennsylvania

     Organizers: Steven Bird (U Penn) and Gary Simons (SIL International)

    [The full version of this abridged CFP is available from the above page.]

    This workshop will lay the foundation of an open, web-based
    infrastructure for collecting, storing and disseminating the primary
    materials which document and describe human languages, including
    wordlists, lexicons, annotated signals, interlinear texts, paradigms,
    field notes, and linguistic descriptions, as well as the metadata
    which indexes and classifies these materials. The infrastructure will
    support the modeling, creation, archiving and access of these
    materials, using centralized repositories of metadata, data, best
    practice guidelines, and open software tools.

    BACKGROUND

    Recent years have witnessed dramatic advances in mass storage and
    web delivery technologies, making it possible to house virtually
    unlimited quantities of speech data online, and to disseminate this
    data over the web. The development of XML and Unicode greatly
    facilitates the interchange and reuse of structured multimodal and
    multilingual data and the development of interoperating software
    tools. These developments are having a pervasive influence on the way
    primary linguistic data are gathered, stored, analyzed and
    disseminated, as demonstrated by the initiatives surveyed on the
    linguistic exploration page (http://www.ldc.upenn.edu/exploration/),
    and the papers presented at the Linguistic Exploration Workshop at
    the Chicago LSA Meeting (http://www.ldc.upenn.edu/exploration/LSA/).

    CHALLENGES

    These new technological opportunities bring concomitant needs and
    challenges for modeling, creating, archiving and accessing data:

    I Data Models. A diverse range of data types are required in language
        documentation and linguistic fieldwork, including word lists,
        lexicons, annotated signals, writing system documentation,
        interlinear texts, paradigms, field notes, and linguistic
        descriptions. We need flexible and general models for these data
        types (including links between them), and good ways to represent
        information which is either partial, uncertain, evolving, or
        disputed. We need to develop a consensus in the community
        regarding best practice for modeling these kinds of data, to
        ensure maximal reusability of data and software.

    II Data Archives. Whether just the private collection of a single
        researcher or a large and centralized repository, language data
        needs to be stored and reused. To support this, we need durable
        and open storage and interchange formats that embody the best
        practice consensus. We need to convert (parochial) 8-bit
        character codings to Unicode, using a general tool for character
        conversion along with a host of conversion tables for specific
        character sets. We also need to convert markup into the best
        practice formats we have defined. We need a mechanism to support
        durable citation of data, so that document authors do not need to
        duplicate all the data they reference just to be sure that the
        links will not break. More generally, we need a metadata standard
        for indexing the resources, regardless of format and availability,
        and a wide-coverage index conforming to the standard, so that
        someone interested in a particular language or region can find all
        the electronic resources that are pertinent to it, without having
        to determine how each of several different archives has named and
        classified their holdings.
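
        The table-driven character conversion mentioned above could be
        sketched roughly as follows. This is only an illustration under
        stated assumptions: the table LEGACY_TO_UNICODE and the function
        convert are hypothetical names, and the byte assignments belong
        to an invented legacy font encoding, not to any real character
        set. A real tool would load one such table per character set.

```python
# Hypothetical conversion table for an invented 8-bit legacy encoding.
# Each entry maps one legacy byte value to a Unicode character.
LEGACY_TO_UNICODE = {
    0x80: "\u014b",  # LATIN SMALL LETTER ENG
    0x81: "\u0254",  # LATIN SMALL LETTER OPEN O
    0x82: "\u025b",  # LATIN SMALL LETTER OPEN E
}


def convert(data, table, unknown="\ufffd"):
    """Convert legacy 8-bit bytes to a Unicode string using a table.

    Bytes found in the table are mapped through it, remaining bytes
    below 0x80 are treated as ASCII, and anything else becomes the
    replacement character so that unmapped codes stay visible.
    """
    out = []
    for byte in data:
        if byte in table:
            out.append(table[byte])
        elif byte < 0x80:
            out.append(chr(byte))
        else:
            out.append(unknown)
    return "".join(out)


if __name__ == "__main__":
    sample = bytes([0x80, 0x81, 0x20, 0x74, 0x82])
    print(convert(sample, LEGACY_TO_UNICODE))
```

        Keeping the tables as data, separate from the conversion
        routine, means that supporting another character set only
        requires a new table rather than new code.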

    III Data Creation. Now that mass storage is so inexpensive,
        researchers are creating large amounts of digital data covering
        the types listed above. Both the number and scale of these
        collection efforts are growing rapidly. We need software tools
        supporting data creation, conforming with best practice, and
        covering primary collection of textual data (wordlists, texts) and
        recordings (audio, video, physiological), along with transcription
        and annotation of the primary materials conforming to a broad
        range of descriptive and analytical practices.

    IV Data Access. Once data has been created and archived, there exist
        a variety of access modes. A region of data is identified by
        browsing, by launching a query, or by following a reference. The
        selection is displayed according to appropriate conventions and
        styles, or converted into some other form (e.g. for statistical
        analysis and visualization). The selection may be corrected,
        imported into a document, analyzed, and annotated, leading to the
        creation of secondary data and/or the elicitation of new primary
        data. We need to develop suitable delivery mechanisms including
        stylesheets, conversion tools, indexing methods, and query
        languages, which encompass the needs for security and privacy. We
        need standard application programming interfaces and a library of
        reusable components, to support the development of software for
        new modes of access.

    Many of the activities listed above are already underway; the lure of
    the technology is great despite the lack of infrastructure. However,
    it is beyond the capacity of any single individual or institution to
    develop this infrastructure of standards and tools on their own. There
    is a pressing need for close cooperation between these initiatives, so
    that scarce human, software and data resources are used optimally.

    WORKSHOP OBJECTIVES

    This workshop will lay the foundation of an open, web-based
    infrastructure for collecting, storing and disseminating the primary
    materials which document and describe human languages. The
    infrastructure will support the modeling, creation, archiving and
    access of these materials, using centralized repositories of
    metadata, data, best practice guidelines, and open software tools.

    To meet this goal, we have identified three main objectives which can
    be substantially achieved at the present time:

    Objective 1: to develop a comprehensive framework which identifies all
        the infrastructural needs, designates appropriate roles for
        existing results as pieces of an overall solution, and sets out a
        coordinated response to the remaining challenges.

    Objective 2: to found centralized repositories (and nominate existing
        ones) for housing components of the infrastructure, so that data,
        tools, formats and standards can be collected, indexed, and made
        available to the community.

    Objective 3: to begin construction of the repositories, by identifying
        the contribution of past and present activities by the
        participants and by other individuals and institutions, and
        by gathering the results and their documentation.

    CALL FOR PARTICIPATION

    The workshop will include paper presentations and working sessions to
    develop the infrastructure. Interested members of the community are
    invited to participate in the workshop. There is a limit on available
    places, and participants will be identified on the basis of submitted
    abstracts. Funding is available for authors of accepted papers.

    Abstracts. One-page abstracts are invited which describe substantive
    contributions to the repositories, or which discuss concrete problems
    for web-based language documentation and description, and describe
    possible solutions.

    Papers. Authors of accepted abstracts will be asked to prepare a
    2,000-3,000 word paper plus associated materials.

    Address submissions to: Steven.Bird@ldc.upenn.edu, Gary_Simons@sil.org

    Timetable.

    Friday 1 September Abstract deadline
    Friday 29 September Acceptance notification
    Friday 24 November Paper deadline
    12-15 December Workshop

    IMPORTANT: FOR FURTHER INFORMATION

    Intending authors should consult the EXTENDED CFP, available from the
    linguistic exploration page (http://www.ldc.upenn.edu/exploration/).
    To be sure of receiving future announcements, please subscribe to the
    LINGUISTIC-EXPLORATION mailing list, referenced from that page.

    --
    Steven Bird                    Gary Simons
    University of Pennsylvania     SIL International
    Steven.Bird@ldc.upenn.edu      Gary_Simons@sil.org
    http://www.ldc.upenn.edu/sb    http://www.sil.org/SIL/roster/simons.htm
    


