Users' Manual


Stig Johansson

in collaboration with

Eric Atwell

Roger Garside

Geoffrey Leech

Norwegian Computing Centre for the Humanities

Bergen, 1986


The tagged LOB Corpus is the result of cooperation among researchers at the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen. The principal members of the research team have been:

Geoffrey Leech and Roger Garside (project leaders)
Eric Atwell
Ian Marshall

Stig Johansson (project leader)
Mette-Cathrine Jahr

Knut Hofland

The project was supported by the Social Science Research Council and the Norwegian Research Council for Science and the Humanities.
The section on computational aspects (Section 4) is a revised version of: Geoffrey Leech, Roger Garside, and Eric Atwell, 'The Automatic Grammatical Tagging of the LOB Corpus,' ICAME News 7 (1983), pp. 13-33. The rest of the manual is the work of Stig Johansson, who could also draw on the information in Eric Atwell's Manual Pre-Edit Handbook (November 1981) and Manual Post-Edit Handbook (June 1982).

Stig Johansson

Eric Atwell

Roger Garside

Geoffrey Leech






1 The LOB Corpus
2 Tagged versions

2.1 Description of tape and files - vertical version
2.2 Description of records - vertical version
2.3 Description of tape and files - horizontal version
2.4 Description of records - horizontal version
2.5 Reference code
2.6 Special information in the vertical version
2.7 Number of words
2.8 Sample text extract - vertical version
2.9 Sample text extract - horizontal version

3 The LOB tag set

3.1 An overview of the LOB tag set
3.2 Some differences between the LOB and Brown tag sets
3.3 Ditto tags

4 The LOB tagging suite

4.1 Pre-editing
4.2 Tag assignment
4.3 Tag selection
4.4 Idiom tagging
4.5 Post-editing

5 Differences between the original corpus and the tagged versions

5.1 Capitalisation
5.2 Punctuation marks and sentence/paragraph division
5.3 Contractions
5.4 Codes for abbreviations and 'non-English' words
5.5 Other differences

6 Principles in post-editing
7 Problem areas

7.1 Word division
7.2 Idioms
7.3 -ed forms
7.4 -ing forms
7.5 Auxiliaries
7.6 Nouns: number and case
7.7 Proper nouns
7.8 Adjectives
7.9 Adjective vs noun
7.10 Adverbs
7.11 Adverb vs adjective
7.12 Determiners/pronouns
7.13 Prepositions
7.14 Conjunctions
7.15 Conjunction vs preposition
7.16 WH-words
7.17 Numerals
7.18 Interjections
7.19 Abbreviations
7.20 Non-standard forms
7.21 Foreign words and expressions
7.22 Formulas and scientific symbols
7.23 Cited forms
7.24 Punctuation marks
7.25 Letters

8 KWIC concordance

8.1 Tapes and files
8.2 Records
8.3 Sorting
8.4 Example
8.5 Frequencies
8.6 Index to the KWIC concordance

9 Developments
Appendix 1: General flowchart of Tag Assignment Program
Appendix 2: Tagging decisions of APPLYHYPHEN
Appendix 3: Tagging decisions of APPLYWIC
Appendix 4: List of tags

Coding key

The following codes have been taken over from the original (untagged) LOB Corpus:


degree symbol








begin quote
end quote
begin comment tag
end comment tag
begin subscript
end subscript
begin superscript
end superscript
macron on preceding character
acute accent on preceding character
grave accent on preceding character
tilde on preceding character
circumflex accent on preceding character
cedilla under preceding character
umlaut or diaeresis on preceding character


For a full list of *? codes (= uncoded character), see Johansson et al. (1978). The word-class tags are surveyed in Section 3 and Appendix 4. As regards other coding conventions, see Sections 2.6 (special information in the vertical version) and 5.2 (sentence and paragraph division).