Appendix 3.

Lexa - Corpus Processing Software

Raymond Hickey
Anglistik
Universität GH Essen

Introduction

The present set of programmes is intended to offer a wide range of software which will carry out such tasks as lexical analysis and information retrieval required by linguists involved in the examination of text corpora. The suite has been particularly adapted to be used with the Helsinki Corpus of English Texts. The general nature of the software, however, permits its application to any set of texts, particularly those which use headers in the so-called Cocoa format, as it can access the information contained there during analysis. For any linguist using the Old and Middle English parts of the Helsinki Corpus, the font module of the Lexa suite will be essential, as with it one can see the special characters necessary for these stages of English in their actual form on the computer screen and not as difficult-to-read two-character code sequences.

The package consists of some 5.5 MB of programmes, sample data and demonstration files (4 microfloppy disks) and approx. 800 pages of documentation which is organized into three volumes as follows: Vol.1: Lexical Analysis and Information Retrieval, Vol.2: Database and Corpus Management, Vol.3: Utility Library. The software and documentation are distributed as a package from Bergen at the following address:

The HIT Centre
Allégt. 27
N-5007 Bergen
Norway.

Interested linguists are advised to write for an order form or contact Bergen via email for information on the cost of the package and the relevant postage rate.

The following brief description of the main parts of the Lexa suite offers a first orientation for those interested in corpus data processing.

Lexical analysis

The main programme, Lexa, puts at one's disposal the options required in order to process lexical data with a high degree of automation on a personal computer. Lexa allows one, via tagging, to lemmatise any text or series of texts with a minimum of effort. All that is required is that the user specify which (possible) words are to be assigned to which lemmas. The rest is taken care of by the programme. In addition, one can create frequency lists of the types and tokens occurring in any loaded text, make lexical density tables, transfer textual data in a user-defined manner to a database environment or generate one of two types of concordance. The results of all operations are stored as files and can be examined later, for instance with the text editor shipped with the package. Each item of information used by Lexa when manipulating texts can be specified in a setup file which is loaded after calling Lexa and used to initialise the programme in the manner desired by the user. Sample files can be used to begin with to gain an impression of how the programme works. There are furthermore two basic modes: an interactive one, in which the user specifies manually the steps to be taken, and a batch mode, in which no keyboard input is required, so that the user need not be present. The latter mode is useful when carrying out large tasks which can then run without taking up any of the linguist's time.
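The principle of lemmatisation by a user-supplied assignment of words to lemmas, followed by a frequency count, can be illustrated with a minimal Python sketch. It is not Lexa's own implementation: the lemma table, the tokenising rule and all function names are assumptions made for the purpose of illustration, and the type/token ratio is merely one simple measure and not necessarily the one Lexa uses for its lexical density tables.

```python
import re
from collections import Counter

# Hypothetical user-supplied table assigning word forms to lemmas.
LEMMA_TABLE = {
    "kyng": "king",
    "kynge": "king",
    "kinges": "king",
    "seide": "say",
    "sayde": "say",
}

def tokenise(text):
    """Split text into lower-case word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+", text.lower())

def lemmatise(tokens, table):
    """Replace each token by its lemma; forms not in the table stay unchanged."""
    return [table.get(tok, tok) for tok in tokens]

def frequency_list(items):
    """Return (item, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(items)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "The kyng seide to the kynge that the kinges sayde nothing"
tokens = tokenise(sample)
lemmas = lemmatise(tokens, LEMMA_TABLE)
print(frequency_list(lemmas))
print(type_token_ratio(tokens))
```

In this sketch the three spellings of the noun are counted together under the lemma "king", which is the effect the user-defined assignment is meant to achieve.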

Information retrieval

The second main goal of the first part of the Lexa set is to offer flexible and efficient means of retrieving information from text corpora. The programme Lexa Pat allows one to specify parameters for combing through text files. By determining these precisely the user can achieve a high level of correct returns which are of value when evaluating texts quantitatively. A further programme, Lexa DbPat, permits similar retrieval operations to be applied to databases, for instance those generated by Lexa from the text files of a corpus.

Ascertaining the occurrence of syntactic contexts is catered for by the programme Lexa Context with which users can specify search strings, their position in a sentence, the number of intervening items and then comb through any set of texts in search of them.
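A search of this kind, with two search strings and a limit on the number of intervening items, might be sketched in Python as follows. The function name, its parameters and the sample sentence are hypothetical and do not reflect the actual options of Lexa Context.

```python
def find_contexts(tokens, first, second, max_gap):
    """Return (i, j) index pairs where `first` at position i is followed by
    `second` at position j with at most `max_gap` tokens in between."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # Candidate positions for the second item: i+1 up to i+1+max_gap.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j] == second:
                hits.append((i, j))
    return hits

# Invented sample sentence, split into items on white space.
sentence = "he ne wolde nat come and she ne come nat".split()
print(find_contexts(sentence, "ne", "nat", max_gap=1))
```

Setting `max_gap` to 0 restricts the search to directly adjacent items.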

By means of the utility Cocoa it is possible to group the text files of a corpus on the grounds of shared parameters from the Cocoa-format header at the beginning of each file in the Helsinki Corpus. All information retrieval operations can then have as their scope those files grouped by the Cocoa utility on the basis of their contents.
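Grouping by header parameters can be pictured with the following Python sketch. It assumes that each header line has the form `<X value>`, where X is a one-letter parameter code; the codes, values and file names used here are invented for illustration, and the sketch says nothing about how the Cocoa utility itself is implemented.

```python
import re
from collections import defaultdict

# Assumption: a Cocoa reference line looks like <X value>, X being a one-letter code.
COCOA_LINE = re.compile(r"^<(\w)\s+([^>]*)>$")

def parse_header(text):
    """Collect the leading Cocoa lines of a text into a dict of code -> value."""
    header = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines inside the header
        match = COCOA_LINE.match(stripped)
        if not match:
            break  # first line that is not a Cocoa reference ends the header
        header[match.group(1)] = match.group(2).strip()
    return header

def group_by(files, code):
    """Group file names by the value of one header parameter."""
    groups = defaultdict(list)
    for name, text in files.items():
        value = parse_header(text).get(code, "(missing)")
        groups[value].append(name)
    return dict(groups)

# Invented sample files, held in memory to keep the sketch self-contained.
files = {
    "sample1.txt": "<B SAMPLE1>\n<C M3>\n<V PROSE>\nfirst line of text one\n",
    "sample2.txt": "<B SAMPLE2>\n<C M3>\n<V VERSE>\nfirst line of text two\n",
    "sample3.txt": "<B SAMPLE3>\n<C O2>\n<V PROSE>\nfirst line of text three\n",
}
print(group_by(files, "V"))
print(group_by(files, "C"))
```

A retrieval operation could then be restricted to one of the resulting groups, for instance to all files whose `V` parameter has the value `PROSE`.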

Database management

The database management software serves several purposes. The most obvious one is of course the processing of databases generated with lexical processing programmes (see above). A second and important goal of this software is to offer linguists options for carrying out the type of statistical operations on a computer which are frequently required in language studies; these are realised with the database manager which has a whole range of inbuilt statistical options. The type of statistics possible with DbStat and CalcStat is what is known as inferential statistics. Here one is concerned with computations on numerical data which allow one to say if something is true or not (for example that two sets of data are correlated) with a certain degree of confidence.

The numerical data to be processed must be available in the form of lists contained in ASCII text files. These can be used immediately with the calculator CalcStat for more direct computation or can be imported into a (dBASE compatible) database which is subsequently used for statistical purposes. The database manager DbStat is capable of statistical computations which involve two series of data thus allowing tests like the Chi-square test, or estimations of correlation like the Pearson product-moment correlation coefficient or the Spearman rank correlation coefficient. The results of computations are either stored directly in a database or can be deposited in an ASCII text file which can be examined later and even function as renewed input into a database for later statistical purposes.
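The two correlation coefficients mentioned can be computed from two series of numbers as in the following Python sketch. It uses the textbook definitions and is not the code of DbStat or CalcStat; the function names and the sample figures are assumptions, and the input is given as a string of whitespace-separated numbers to stand in for an ASCII list file.

```python
import math

def read_series(text):
    """Parse whitespace-separated numbers, as they might stand in an ASCII list file."""
    return [float(item) for item in text.split()]

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average
        pos = end + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: the Pearson coefficient of the two rank series."""
    return pearson(ranks(xs), ranks(ys))

# Invented figures for two series of equal length.
series_a = read_series("12 15 9 20 17")
series_b = read_series("30 41 22 55 39")
print(pearson(series_a, series_b))
print(spearman(series_a, series_b))
```

Computing the Spearman coefficient as the Pearson coefficient of the ranks has the advantage that tied values are handled without a separate correction formula.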

Most of the types of statistical computations which linguists will be interested in are non-parametric (so-called distribution-free tests of significance). As linguistic data are not usually of the interval type, but rather nominal or ordinal, non-parametric tests are of increased importance. Many such tests are based on the notion of ranking the data which serve as input to a computation. Several tests of this type have been included in DbStat, e.g. the Spearman rank correlation coefficient or the Mann-Whitney U-test. In addition, linguists are frequently interested in determining whether a given set of data matches an expected distribution; here the appropriate test is the Chi-square test, which is a measure of the goodness of fit between an observed set of data and an expected set.
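The two test statistics named here can be sketched in Python as follows. The sketch follows the standard textbook definitions rather than the code of DbStat, the function names and figures are invented, and it computes the statistics only: the corresponding significance levels would still have to be looked up in tables or obtained from a statistics package.

```python
def chi_square(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def mann_whitney_u(sample_a, sample_b):
    """Mann-Whitney U statistic by direct pair counting; ties count one half.
    Returns the smaller of the two U values."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return min(u_a, u_b)

# Invented counts of a feature in three text groups, against an even expected spread.
observed = [18, 22, 20]
expected = [20, 20, 20]
print(chi_square(observed, expected))

# Invented scores for two independent groups of texts.
print(mann_whitney_u([3, 5, 8], [1, 4, 5, 9]))
```

For the goodness-of-fit test the number of degrees of freedom is the number of categories minus one, here two.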

Corpus Management

The driving idea behind Corpus Manager has been to provide a means of structuring existing texts in the simplest possible way which would then allow users to retrieve information selectively from within an easy-to-grasp user interface. The texts to be used can be determined by the user or prepared in advance by some supplier of texts, e.g. a university department engaged in compiling a corpus. A set of texts can be arranged with a kind of table of contents which allows one to move through layers and of course search for user-specified contents in a variety of ways, e.g. by direct searching or by using a list of keywords to be located.

The text database used with Corpus Manager can be of any size and any contents. You can chain a whole series of existing files to a text database using the Lexa utility Lexa Chain, and you can use the automatically generated log file to check which source files were chained together. These files may also be "unlinked" later by reversing the process with Lexa Unchain. To illustrate the scope of this programme, a sample text database consisting of a section of the Helsinki Corpus has been prepared and is supplied with some additional files on the distribution diskettes.
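The idea of chaining files with a log and later reversing the process can be pictured with the following Python sketch. The layout of the log (file name, starting offset, length) is purely hypothetical, as the actual log format of Lexa Chain is not described here, and the texts are held in memory to keep the example self-contained.

```python
def chain(files):
    """Concatenate (name, text) pairs; return the chained text and a log of
    (name, start offset, length) entries, one per source file."""
    parts, log, offset = [], [], 0
    for name, text in files:
        parts.append(text)
        log.append((name, offset, len(text)))
        offset += len(text)
    return "".join(parts), log

def unchain(chained, log):
    """Reverse the chaining: cut the chained text back into (name, text) pairs."""
    return [(name, chained[start:start + length]) for name, start, length in log]

# Invented source files.
sources = [("a.txt", "alpha\n"), ("b.txt", "beta\n")]
database, log = chain(sources)
print(log)
print(unchain(database, log) == sources)
```

Because the log records where each source file begins and how long it is, the original files can be recovered exactly without any separator having to be inserted into the chained text.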

Utility Library

The programmes in the third volume of the Lexa group are those which are not primarily involved in the processing of corpus data but in the management of the latter. The set of utilities has been gathered together to form a library which is intended to fulfil common functions which users will require on the personal computer. The main programme is the Lexa file manager which provides a convenient environment for the DOS operations of copying, deleting and renaming files. Global operations through the use of directory windows are easily and swiftly executed. Directories can be created, erased and renamed. Two drives can be shown at once, files can be sorted according to name, size or date, and a date filter can be set. On-line help and a diary are available and DOS can be loaded. The programme can manage any type of storage medium, not just floppy and hard disks but also CD-ROMs such as the ICAME Collection of English Language Corpora.

Apart from file management the Utility Library offers two programmes which can be used as launchers for any of the programmes of the Lexa group, i.e. Lexa Control and Lexa Desk. Users should start here as in each case an easy-to-grasp interface to the other programmes is put at one's disposal thus minimizing the effort involved in getting acquainted with the group.

A third and major section of the Utility Library is represented by the programmes of the Font Module. This contains utilities for the display, entry and printing of special Old and Middle English characters which will be of value to those linguists using the Helsinki Corpus, thus rendering the task of reading the corresponding texts of the corpus much easier.

In the design of the current suite of programmes, flexibility has been given highest priority. This is to be seen in the number of items, in nearly all programmes, which can be determined by the user. Furthermore, techniques have been employed which render the structure of each programme as user-friendly as possible (pull-down menus, window technology, mouse support, similarity of command structure between the more than sixty programmes of the set), permitting the linguist to concentrate on essentially linguistic matters.