LEXA: Corpus Processing Software

Foreword

The intention behind the present set of programmes is to put at the disposal of the interested linguist the tools he or she would require in order to process linguistically relevant data, most probably from an available corpus, with a high degree of automation on a personal computer. The package is divided into several groups which perform typical functions. Of these the first, lexical analysis, will be of immediate concern. The main programme, Lexa, allows one to tag and lemmatise any text or series of texts with a minimum of effort. All that is required is that the user specify what (possible) words are to be assigned to what lemmas. The rest is taken care of by the programme.

In the design of the current set flexibility has been given highest priority. This is to be seen in the number of items, in nearly all programmes, which are user-determinable. Furthermore, techniques have been employed which render the structure of each programme as user-friendly as possible, permitting the linguist to concentrate on essentially linguistic matters.

To the user who has little or no previous experience of computing a word of warning is called for: all technical explanations given below assume that one is acquainted with the basics of computer hardware and software and that one has at least some experience with word processing if not with database management. Those users for whom this is not the case are strongly advised to acquire the necessary background knowledge in these relevant areas before embarking on linguistic data processing. Most of these are covered by the text database on personal computing, PC_KNOW.TDB, which can be processed with Corpus Manager and by the various utilities on the basics of personal computing supplied with the Lexa set.

The present project spans a gestation period of a number of years. As it was drawing to a close I had the rewarding experience of cooperation with many colleagues, especially from Scandinavia. Particular mention is deserved by Merja Kytö and Matti Rissanen of the University of Helsinki who have given me many a sound suggestion for improvement from the point of view of the corpus linguist. From Norway I received welcome support from Stig Johansson of the University of Oslo and above all from Knut Hofland of the Norwegian Computing Centre for the Humanities in Bergen. It is he who was involved in all the practical work connected with printing the documentation and producing the software and it is to him that my debt of gratitude is greatest.

Raymond Hickey, Munich, December 1992


Programme summary

I. Lexical analysis

1. Lexa

This is the main programme of the current set. The main procedures built into Lexa are the automatic lemmatisation of any input ASCII texts, the creation of frequency lists of the types and tokens occurring in any loaded text, the generation of lexical density tables and the transfer of textual data in a user-defined manner to a database environment. The results of all operations are stored as files and can be examined later. Each item of information used by Lexa when manipulating texts is specifiable by means of a setup file which is loaded after calling Lexa and used to initialize the programme in the manner desired by the user.
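
The two central procedures, assigning word forms to user-specified lemmas and counting types and tokens, can be illustrated with a minimal Python sketch. It is not Lexa's own code and does not reproduce its setup file format: the LEMMAS table, the function names and the crude tokeniser are all assumptions made for the purpose of illustration.

```python
import re
from collections import Counter

# Hypothetical user-supplied table: word form -> lemma.
LEMMAS = {"is": "be", "was": "be", "were": "be", "kings": "king"}


def tokenise(text):
    """Split text into lower-case word tokens (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatise(tokens, lemmas):
    """Pair each token with its lemma; forms not in the table are kept unchanged."""
    return [(token, lemmas.get(token, token)) for token in tokens]


def frequency_list(tokens):
    """Return (type, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Calling tokenise on a short text and passing the result to lemmatise and frequency_list yields a tagged token list and a type/token count respectively; the number of types is simply the number of distinct tokens.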

II. Data processing

1. Lexa Text

A compact and flexible text editor intended primarily for editing both text files from a corpus and output files generated by such programmes as Lexa. On-line help, block processing, find and replace, undo delete and a system of text macros are included. Up to 4 texts can be edited at one time. Note that it is only intended for editing ASCII files. Available in a compact version as Lexa Text Small.

2. Lexa Bat

A combined editor and dispatcher for DOS batch files (or other files if you like). From a desktop you can survey the batch files on your disk, edit them with ease and execute them at will.

3. Lexa Form

A versatile form editor which allows users to fill in data of their own into pre-determined forms which are kept on disk. An internal module allows interfacing with databases so that field contents can be imported into fields of a form as can external text. Output of form data can be to printer or to file as desired.

4. Lexa Browse

The purpose of this utility is to allow the user to leaf through the files of an entire hard disk and view and/or edit them at will. You page through a directory listing and the beginning of each file is shown in a window; should you wish you may then view the entire file or call your own editor to alter it or hexedit it or whatever. During Lexa Browse you are still in DOS and can execute any command you like.

5. Lexa View

Permits the viewing of any input file (but not of course editing). Scrolling is allowed in both directions and you can search for and count strings.

6. Lexa Hex

Dumps a specified file in hex format. Offers the information in byte and text form as well and permits the user to scroll in both directions. At any stage you can edit a stretch of text as either hex or ASCII and save the changes to disk.

7. Lexa Info

Takes an ASCII file and carries out various counts, such as the number of characters, words, sentences, etc. and calculates a series of further statistics, all of which you can save to file, if you wish.
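
The kind of count described here might look as follows in Python. This is a sketch only: which statistics Lexa Info actually computes is not specified above, and the definitions of a word (a run of word characters) and of a sentence (a stretch ending in ".", "!" or "?") are assumptions.

```python
import re


def text_statistics(text):
    """Count characters, words and sentences and derive two simple averages."""
    words = re.findall(r"\w+", text)
    # Assumption: a sentence ends at one or more of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "mean_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }
```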

8. Lexa Comp

Takes two ASCII files, compares them and reports on differences if any are found. The lines which differ are listed and the results written to file.
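
A comparison of this sort can be sketched in a few lines of Python. The version below compares the two files strictly position by position, which is an assumption; whether Lexa Comp realigns the files after an inserted or deleted line is not stated above. The function name is hypothetical.

```python
from itertools import zip_longest


def differing_lines(lines_a, lines_b):
    """Return (line_number, line_a, line_b) for every position at which the
    two line lists differ. A missing line is reported as None."""
    report = []
    for number, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            report.append((number, a, b))
    return report
```

The lines of two files can be obtained with open(name).read().splitlines() and the resulting report written to a file line by line.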

III. Information retrieval

1. Lexa Pat

A pattern matching facility to allow searches for user-specified strings in files. Searches can be made to cover groups of files by using the DOS wildcards * and ?. A whole variety of switches offer the programme flexibility, e.g. a search can include all directories on a disk, can merely look for whole words, can report non-matches, etc. Various statistics pertaining to the search are offered and the results can be written to a file. Will work with binary files as well. Includes a desktop, from where all operations can be easily controlled, and an integrated text editor.

2. Lexa DbPat

A version of the previous programme which works with databases. You may comb through any number of databases in any number of directories for user-determined contents, specifying such features as whether all fields of a database should be searched or only the field chosen by the user.

3. Lexa Context

To determine whether strings or words occur in a certain context the present programme can be used. It allows the specification of many parameters of relevance such as one or two string search, the number of ambient or separating characters, the consideration of letter case and of sentence boundaries as well as the drive to be used, directory scope, etc. The results are written to an ASCII text file.
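
A two-string search with a limit on the number of separating characters can be sketched as follows. This is an illustration under stated assumptions, not Lexa Context's algorithm: the function name and parameters are hypothetical, only the order "first string, then second string" is checked, and sentence boundaries are ignored.

```python
def find_in_context(text, first, second, max_gap, ignore_case=True):
    """Return the start offsets of `first` for which `second` begins no more
    than `max_gap` characters after the end of `first`."""
    haystack = text.lower() if ignore_case else text
    a = first.lower() if ignore_case else first
    b = second.lower() if ignore_case else second
    hits = []
    start = haystack.find(a)
    while start != -1:
        end = start + len(a)
        # `second` may start anywhere from `end` to `end + max_gap`.
        window = haystack[end:end + max_gap + len(b)]
        if b in window:
            hits.append(start)
        start = haystack.find(a, start + 1)
    return hits
```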

4. Lexa Sweep

A menu driven utility for global "find and exchange" operations over sets of files in which you can confirm replacements as they are made, edit files during replace procedures, increase scope to deal with an entire disk, just delete strings, write string finds to file, etc. Allows customization via initialization files.

5. Lexa Search

For quick searches of simple strings (without loading the files combed through) the programme Lexa Search should be useful. On finding the search string it automatically loads Lexa View and locates the string within the file and displays the section of the file where this is to be found.

6. Lexa Replace

Permits the substitution or deletion of a string or character (or several of these) in file(s) specified by the user. A switch determines the mode of operation.

7. Lexa Filter

With the help of an input table file the present programme allows you to change several bytes in a set of input files to a series of different bytes in one move.
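
The principle of changing several bytes in one move can be shown with a 256-entry translation table in Python. The format of Lexa Filter's table file is not described above, so the sketch takes the byte pairs directly as Python values; the function names are hypothetical.

```python
def build_table(pairs):
    """Build a 256-entry translation table from (source_byte, target_byte) pairs.
    Bytes not mentioned in `pairs` map to themselves."""
    table = bytearray(range(256))
    for source, target in pairs:
        table[source] = target
    return bytes(table)


def filter_bytes(data, table):
    """Apply the table to every byte of `data` in a single pass."""
    return data.translate(table)
```

Because every byte is looked up exactly once, two bytes can be exchanged for each other in the same pass, something which a sequence of separate replace operations could not do without an intermediate value.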

IV. Database and corpus management

1. DbStat

This is one of the major programmes which allows the editing of dBASE compatible databases directly. It is primarily intended for statistical calculations on data gleaned from texts processed with Lexa. To this end there are many standard statistical tests incorporated into the programme. DbStat is, however, a powerful database manager which will deal with any type of database in the dBASE format and allows the user to filter databases, append data from existing databases, copy databases to text files, import text files into databases, locate contents, selectively delete records during processing, etc. Available in compact form as DbSmall.

2. DbTrans

For such purposes as pre-translation or normalization of texts the present programme has been included in the current set. It takes a terminology database (generated with DbStat for instance) and replaces any occurrences of input terms by corresponding output terms in a user-defined set of files. Various parameters for the operation of DbTrans can be set and the programme will also run in a batch mode.

3. DbPage

A desktop with a series of options concerning the interfacing of texts and databases. The programme allows you to generate a text file which can act as an input template for database records and then to import the textual information entered into such a template into an actual database, thus permitting you to collect information for a database while actually editing texts.

4. DbLook

A reduced form of the database manager DbStat. It can be used to view databases quickly. You may scroll at will in the database. The display is on a one-record-per-screen basis; a browse mode and a field search facility are also included.

5. CompDb

In order to fulfill the frequent need of checking up on doublets in different databases this programme has been added to the set. On two halves of the screen the user can compare records, transfer from one window to the other, mark for deletion and search for contents among other options.

6. MergeDb

A means of merging any two databases with each other. The programme examines the second database and imports into the first only those records which are not already present there. The checking can cover entire records or single fields.
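
The merging rule can be sketched in Python with records represented as dictionaries rather than dBASE records. This is an assumption made for brevity, as are the function name and the key_fields parameter; reading and writing the database files themselves is not shown.

```python
def merge_records(first, second, key_fields=None):
    """Return the records of `first` followed by those records of `second`
    which are not already present.

    key_fields=None compares entire records; otherwise only the named
    fields are compared. Field values are assumed to be hashable.
    """
    def key(record):
        fields = key_fields if key_fields is not None else sorted(record)
        return tuple((name, record.get(name)) for name in fields)

    seen = {key(record) for record in first}
    merged = list(first)
    for record in second:
        k = key(record)
        if k not in seen:
            merged.append(record)
            seen.add(k)
    return merged
```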

7. ReportDb

A report form generator enabling users to determine the layout of the text which is created on exporting records from DbStat. From a comfortable interface, ReportDb allows users to place fields and extraneous text on a report form, store this to file and export any database using this form. Configurations of ReportDb can be stored as setup files on disk and be used repeatedly.

8. DbList

An interface with databases which creates user-determined lists from data contained in record fields. Lists can be generated in tabular form with several columns and output can be to file or to printer. Similar in command structure to ReportDb. Configurations of DbList can be stored as setup files on disk in an identical manner to ReportDb.

9. ClassifyField

A database utility with which you can classify the contents of a user-specified field of any database. The programme works by registering each unique word in the chosen field and writing the resulting list to disk.

10. PackMemo

A simple utility which compresses the text file created in association with a database which has a memo field, i.e. it packs a .DBT file which is structured according to the dBASE format for such files.

11. CalcStat

A desktop calculator with an array of functions. It allows complex expressions to be calculated and the results to be written to a file. Data entered can be collected internally and then used for statistical purposes, with the results displayed as a bar chart and stored to file for further processing. Numerical data can be loaded into the calculator to allow repeated processing of data from session to session.

12. Corpus Manager

Provides a means of structuring existing texts in the simplest way which then allows users to retrieve information selectively from within an easy-to-grasp user interface. The texts to be used can be determined by the user or prepared in advance by a supplier of texts, e.g. a university department compiling a corpus. The programme indexes text files for quick searching later.

13. CharacterIndex

Generates a word list from the file specified on the DOS command line, storing it as a new file. The list contains every word which is preceded by the character passed to the programme, each word being registered only once.
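
Such a word list can be produced with a single regular expression, as the following Python sketch shows. The function name is hypothetical, and both the definition of a word (a run of word characters) and the alphabetical ordering of the result are assumptions.

```python
import re


def character_index(text, marker):
    """Return the sorted, unique words immediately preceded by `marker`."""
    pattern = re.compile(re.escape(marker) + r"(\w+)")
    return sorted(set(pattern.findall(text)))
```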

14. TinyText

A simple text editor which allows the user to place (or remove) markers easily in a text which are then recognized by Corpus Manager as level identifiers when displaying a text in a structured and hierarchical form.

15. DosDemo

An online demonstration of the normal interface of the operating system MS-DOS. Typical mistakes are simulated and explained with tips as to how these can be avoided. An explanation of DOS error messages and how memory is used along with a dictionary of computer terms is included.

16. DosDict

A text-oriented dictionary database manager which comes with a file containing comprehensive definitions of the most common computer terms. The programme offers all the basic functions such as adding, deleting and searching for records. The user can specify the file to be loaded and so maintain several files at once.

17. DosHelp

Something for the beginner to help him/her on the way to proficiency in DOS. It offers on-line descriptions of all the more common operating system commands with examples on a series of 38 screens.

V. Utilities

1. Lexa Control

The intention of the present programme is to provide users of the Lexa set of programmes with a convenient platform from which to launch any of the items of these packages along with a brief explanation of what they do. Note that each time a programme is started the entire system memory used by Lexa Control is released. Thus there are no size restrictions on the programmes which can be launched. General information concerning Lexa is available via <F10> and assumes that the file lexainfo.txt is accessible.

2. Lexa Desk

Here the user is offered a single interface as entrance point for the entire set of programmes in the Lexa package. The desktop functions as a launching pad and is arranged in groups which correspond to typical functions realised by the data processing utilities and which offer quick orientation for the user at the outset.

3. Lexa Shell

A command shell which provides swift and easy access to the Lexa set. The shell is arranged such that you can use it for the internal options or for normal DOS commands and programmes. There are two main levels within the shell. You may move from level to level without effort. Most importantly, the user can configure the shell to suit his/her own needs and store the setup to disk.

4. Lexa File

The file manager for the Lexa package. The intention is to provide a convenient environment for typical management tasks. From a series of picklists the user can carry out various operations necessary when managing disks such as formatting and copying, directory control, disk scanning, etc. as well as copying, deleting and renaming files. On-line help and a diary are available and DOS commands can be processed; DOS can be loaded temporarily as well.

5. Lexa FileCat

A database generator with which one can transfer the information on the files of a hard disk to a dBASE database (created internally by Lexa FileCat). The user can specify the drive to be used, the directories to be searched, the file template to be matched, etc.

6. Lexa Backup

A utility which allows you to specify parameters from a desktop (and save these to disk) which are used when examining any number of drives and directories to copy or delete files or just create a disk log file of the latter. Size and date filters can be set and inclusive or exclusive file templates can be used during backup operations.

7. Lexa Move

A file copying utility which shows you what is happening during its operation; it can be run interactively or from the command line.

8. Lexa Dirs

Offers a list of the directories on a disk in tree form. By using the arrow keys you can move a highlight bar and, at the press of a key, change to a new directory. The tree structure can also be written to a disk file. The second function of Lexa Dirs is as a directory switcher. By typing the fragment of a path on the DOS command line, the programme switches to the next directory it finds which matches this.

9. Lexa Find

A means of retrieving file information in a comprehensive and visually effective manner. The contents of an entire disk are read by the programme, the files then being listed in a column on the left of the screen, while the directory in which each file is to be found is highlighted in a tree graph on the right. As with Lexa Browse you can edit/view/hexedit any file you like, call it with an external programme or issue any DOS command via an input line at the bottom of the screen.

10. Lexa Scan

A disk search facility which enables you to comb through your disk for a file you specify. The legal DOS wildcards * and ? are permissible to allow searches for more than one file. The output of a search can be viewed on the screen or written to a file (determined by a command line switch).

11. Lexa Print

A stand-alone pretty print programme. There is a whole series of switches which allow one to determine page layout and printer pitch, header, page numbering, etc. for the file to be printed. Several files can be specified at once as can a number of copies of the same file. Works with all 24-needle dot matrix and laser printers. Includes a desktop from where all operations can be easily controlled.

12. Lexa Sort

Sorts an input ASCII file into ascending alphabetical order. The file to be sorted should not be above 160K for safety. The sort can be made to start at a specified column or tab number. In addition, the length of time the sort takes is recorded so that the programme can be used as a benchmark for computer performance (as sorting is entirely in RAM eliminating the factor of disk access).
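
Sorting from a given column and timing the operation might be sketched as follows in Python. The function name and the 1-based column parameter are assumptions; the tab-number option and the size limit mentioned above are not modelled.

```python
import time


def sort_lines(lines, start_column=1):
    """Sort lines in ascending order, comparing from the 1-based
    `start_column` onwards, and return the result with the elapsed time."""
    began = time.perf_counter()
    ordered = sorted(lines, key=lambda line: line[start_column - 1:])
    elapsed = time.perf_counter() - began
    return ordered, elapsed
```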

13. Lexa Chain

A utility which links any series of input text files together to produce a single composite output file. This file can be processed by Lexa afterwards, e.g. to force the latter programme to analyse a very large text file. Lexa Chain also generates a log file to allow you to keep a record of what files have been linked together.

14. Lexa UnChain

The reverse of Lexa Chain so to speak. It unravels the text files contained in a composite file and allows you to edit these normally before possibly re-linking them with Lexa Chain.

15. Lexa List

A directory lister with a difference, namely that it allows you to scroll forwards and backwards in any directory which, if not the default one, is supplied as a command line parameter (DOS wild cards are permissible). You can change both directory and mask within the programme and specify the sorting order of listings as well.

16. Lexa CrLf

A small programme to extract the carriage return/line feed characters from an ASCII text file and so allow it to be imported and automatically word wrapped by any word processor.

17. Lexa LineNo

Takes any set of input files and adds line numbering to them. This programme recognizes Helsinki Corpus texts and can use separate file numbering for composite files, include them as comments and delete line end markers to achieve natural line wrapping.

18. Lexa Adjust

The idea behind the present programme is to allow users to extract a section of a large file, process it (e.g. with Lexa for grammatical analysis) and re-insert it into the original file. By these means it is possible to analyse parts of very large files without worrying about whether these will fit into system memory when another programme is loaded.

19. Cocoa

With this utility users can extract the information contained in the Cocoa-style header for each file of the Helsinki and other text corpora, depositing this in a database which can be edited with DbStat or DbSmall afterwards.

20. Lexa Byte

A much expanded version of Lexa Hex which allows you to hexedit two files at once, compare them for differences, search for strings, etc. In addition it includes file management facilities similar to Lexa File and a hex print option.

21. Lexa Kill

A global delete programme. Permits the deleting of files in all directories of a disk (the current one or another) which match a file specification. User confirmation can be disabled for those who know what they are doing.

22. Lexa MkAsc

A utility to convert any input file to an ASCII output file. You can specify how long each line in the resulting text is to be and if the upper ASCII symbols should be filtered out or not.

23. Lexa Strip

Removes formatting information from an input file. By this is meant that all characters below $32 and above $126 are filtered out (the upper ASCII area can be left untouched if the user so desires). This can be useful when transporting texts from one environment to another.
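
The filter can be sketched in Python as below. Two assumptions are made and should be noted: the bounds are read as the decimal values 32 (space) and 126 (tilde), and carriage return and line feed are retained by default via the hypothetical keep parameter, since a strictly literal reading would remove line ends as well.

```python
def strip_formatting(data, keep_upper=False, keep=b"\r\n"):
    """Drop bytes below 32 and above 126 from `data`.

    keep_upper=True leaves the upper area (128-255) untouched.
    Bytes listed in `keep` always survive (an assumption of this sketch).
    """
    def wanted(byte):
        if byte in keep:
            return True
        if 32 <= byte <= 126:
            return True
        return keep_upper and byte >= 128

    return bytes(byte for byte in data if wanted(byte))
```

Passing keep=b"" gives the literal behaviour in which every control character, line ends included, is filtered out.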

24. Lexa Mem

A simple utility to show how system memory is currently occupied. Useful when checking on memory-resident software.

25. Lexa ExtMem

Shows whether the current machine has EMS support and how much expanded memory is available.

26. Install

In order to facilitate the quick installation of the Lexa package, an install programme has been included on the first distribution diskette. This will make the necessary directories, alter your DOS path and adapt your autoexec.bat file for operation with the Lexa set.

VI. Font module files

1. Lexa Keyboard

A special keyboard driver for which users can determine the settings (with Lexa SetKey). A pre-adapted driver, key_hels.exe, is supplied with Lexa in which the special symbols of the Helsinki Corpus (for Old and Middle English) have already been assigned to combinations of the right Alt key and a letter.

2. Lexa SetKey

A keyboard manager with which you can customize the supplied keyboard driver lkeyb.exe to suit your needs. In all, five keyboard layouts per driver are permitted. This allows you to access the entire extended ASCII character set without having to resort to entering the values of symbols on the numeric keypad.

3. Lexa Perm

A utility with which users can load a video font file into the adapter of the computer permanently (until the computer is reset). Works with EGA and VGA adapters and allows special fonts to survive the video reset which so many programmes make when loaded.

4. Lexa LoadVid

This utility is similar to Lexa Perm only that fonts loaded may well be discarded by programmes started later (this may be desirable). It can also be used to load a special font for use with a Hercules Plus video adapter.

5. Make Symbols

Takes any set of input files and combs through them for occurrences of special character codes of the Helsinki Corpus type and then converts them to the actual symbols as defined in the Lexa font module. This means that a sequence such as "+t" is converted to the actual thorn character. Otherwise input files are left unaltered.
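
The conversion can be illustrated with a small Python sketch. Only the pair "+t" and thorn is taken from the description above; the remaining entries in the CODES table are illustrative assumptions which should be checked against the corpus manual. The sketch also outputs Unicode characters, whereas Make Symbols writes the symbols defined in the Lexa font module.

```python
import re

# "+t" -> thorn comes from the description above. The other entries are
# assumptions added for illustration only.
CODES = {"+t": "þ", "+T": "Þ", "+d": "ð", "+D": "Ð", "+a": "æ", "+A": "Æ"}


def make_symbols(text, codes=CODES):
    """Replace every character code found in `text` by its symbol in one pass;
    all other text is left unaltered."""
    # Longer codes are tried first so that a short code never pre-empts one.
    ordered = sorted(codes, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(code) for code in ordered))
    return pattern.sub(lambda match: codes[match.group(0)], text)
```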

CURRENT ADDRESS:

Raymond Hickey, Universität GH Essen, FB 3 Literatur- und Sprachwissenschaften, FB 3 Anglistik / Linguistik, D - 45177 ESSEN, Germany.

Tel. +49 201 183 3441 Fax. +49 201 183 3437

E-mail: lan300@vm.hrz.uni-essen.de