2 Tagged versions

In the tagged corpus each word is accompanied by a word-class tag, assigned through a combination of automatic tagging programs and manual pre- and post-editing. There is no syntactic bracketing. The LOB tagging suite will be described in some detail in Section 4 below. There are two versions of the tagged corpus:

I: a horizontal format, with a running text where each word is immediately followed by its associated tag;

II: a vertical format, where each word is on a separate line together with its associated tag, some 'special information' (see 2.6), and a reference number.

All versions of the LOB Corpus text (tagged and untagged) are distributed only on tape, and only for use by academic researchers. Concordances based on the corpus are available on tape and microfiche; see Section 8.

Users of the material are asked to notify the compilers of errors and inconsistencies in tagging.

2.1 Description of tape and files - vertical version

Code:             ASCII
Tracks:           9
Density:          1600 or 6250 bpi
Label:            none
Parity:           odd
Files:            54
EOF marks:        1 after each file, 2 at end of tape
Record size:      60
Blocking factor:  100
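The record size and blocking factor above imply full physical blocks of 60 x 100 = 6000 characters. The following is a minimal sketch of splitting one block into records, assuming (the manual does not say so) that records are fixed-length, space-padded, and carry no line terminators. The names RECORD_SIZE, BLOCKING_FACTOR, BLOCK_SIZE and iter_records are illustrative only.

```python
from typing import Iterator

RECORD_SIZE = 60       # characters per record (vertical version)
BLOCKING_FACTOR = 100  # records per physical block
BLOCK_SIZE = RECORD_SIZE * BLOCKING_FACTOR  # 6000 characters per full block


def iter_records(block: bytes, record_size: int = RECORD_SIZE) -> Iterator[str]:
    """Yield the records in one block as ASCII strings.

    Assumption: records are fixed-length and space-padded, with no
    line terminators. Trailing padding is stripped.
    """
    # Step by record size so that a short final block is still handled.
    for start in range(0, len(block), record_size):
        yield block[start:start + record_size].decode("ascii").rstrip()
```

For the horizontal version (2.3) the same function could be called with record_size=80.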

2.2 Description of records - vertical version

Column   Contents
1-11     Reference number
13-17    Tag
19-43    Word
45-49    These columns may contain the following details:
           \0      abbreviated word or expression
           \1-15   'non-English' codes
           < or >  contraction
50       H  heading
51       N  descriptive name
52       T  title
53       C  cited word
54       F  foreign word
55       @  PREEDIT1 query (see below)
57       P  paragraph marker
58       I  included-sentence marker

Columns which have not been specified are blank. For a description of the codes, see 2.6.
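The column layout in 2.2 can be read with plain string slicing. The sketch below assumes each record is available as a text line of up to 60 characters, with fields at exactly the columns listed. VerticalRecord, field and parse_vertical_record are hypothetical names and not part of the LOB software.

```python
from dataclasses import dataclass
from typing import FrozenSet

# 1-based columns of the single-character markers listed in 2.2.
MARKER_COLUMNS = (50, 51, 52, 53, 54, 55, 57, 58)


@dataclass(frozen=True)
class VerticalRecord:
    reference: str           # columns 1-11
    tag: str                 # columns 13-17
    word: str                # columns 19-43
    code: str                # columns 45-49 (\0, \1-15, < or >)
    markers: FrozenSet[str]  # any of H N T C F @ P I


def field(line: str, first: int, last: int) -> str:
    """Return 1-based inclusive columns first..last, stripped of blanks."""
    return line[first - 1:last].strip()


def parse_vertical_record(line: str) -> VerticalRecord:
    # Pad to the full record length so short lines do not raise IndexError.
    line = line.rstrip("\r\n").ljust(60)
    markers = frozenset(
        line[col - 1] for col in MARKER_COLUMNS if line[col - 1] != " "
    )
    return VerticalRecord(
        reference=field(line, 1, 11),
        tag=field(line, 13, 17),
        word=field(line, 19, 43),
        code=field(line, 45, 49),
        markers=markers,
    )
```

Collecting the markers into a set keeps the record small; a caller that needs to know which column a marker came from could index the padded line directly instead.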

2.3 Description of tape and files - horizontal version

Code:             ASCII
Tracks:           9
Density:          1600 or 6250 bpi
Label:            none
Parity:           odd
Files:            54
EOF marks:        1 after each file, 2 at end of tape
Record size:      80
Blocking factor:  100

2.4 Description of records - horizontal version

Column   Contents
1-7      Reference
9-80     Tagged text (sequences of WORD_TAG)

One original text line may be divided into two lines in the tagged corpus.
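A horizontal record can be split into its reference and its word-tag pairs. This sketch assumes the fixed columns of 2.4 and the underscore separator seen in the extract in 2.9; items without an underscore, such as the sentence-initial mark ^, are returned with no tag. The name parse_horizontal_record is illustrative only.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, Optional[str]]


def parse_horizontal_record(line: str) -> Tuple[str, List[Token]]:
    """Split one 80-column record into its reference and (word, tag) pairs."""
    line = line.rstrip("\r\n")
    reference = line[0:7].strip()        # columns 1-7
    tokens: List[Token] = []
    for item in line[8:80].split():      # columns 9-80
        # Split on the last underscore, in case a word contains one.
        word, sep, tag = item.rpartition("_")
        if sep:
            tokens.append((word, tag))
        else:
            # No underscore: an untagged mark such as ^
            tokens.append((item, None))
    return reference, tokens
```

Because one original text line may occupy two records, a caller would normally concatenate the token lists of consecutive records that share a reference.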

2.5 Reference code

The reference consists of the LOB Corpus line identification, i.e. text sample code (letter A to R and two digits) and line number (1-3 digits). The line identification is the same as in the original, untagged corpus. See also 2.8 and 2.9.
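As an illustration, the line identification can be decomposed with a regular expression. The sketch assumes that blanks may separate the sample code from the line number inside the reference field; the exact padding is not stated here. LINE_REFERENCE and parse_line_reference are hypothetical names.

```python
import re
from typing import Tuple

# Letter A-R, two-digit sample number, then a 1-3 digit line number.
# Assumption: blanks may separate the sample code from the line number.
LINE_REFERENCE = re.compile(r"([A-R])(\d{2})\s*(\d{1,3})")


def parse_line_reference(reference: str) -> Tuple[str, int, int]:
    """Return (text category, sample number, line number)."""
    match = LINE_REFERENCE.fullmatch(reference.strip())
    if match is None:
        raise ValueError(f"not a LOB line reference: {reference!r}")
    category, sample, line_no = match.groups()
    return category, int(sample), int(line_no)
```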

2.6 Special information in the vertical version

The vertical version of the corpus may contain 'special information' in the columns after the word (cf. 2.2).

H, N, and T are useful where there has been a change in capitalisation; see 5.1. C (like NC, see 7.23) occurs rather sparingly and only with short cited words and phrases. The 'special information' was put in at the pre-editing stage and may contain inconsistencies, due to changes at later stages of the project. The 'special information' in the vertical version also includes a paragraph marker (P) and a marker for included sentences (I), both put in at a late stage. The markers are given after the first word of the sentence/paragraph.

Codes for abbreviations and 'non-English' words:

\      foreign word
\0     abbreviation
\1     non-current English
\2     non-standard English
\3     foreigner English
\4     science fiction
\5     miscellaneous
\6     foreign word or expression widely used
\11    Cyrillic alphabet
\15    Greek alphabet
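In the text extracts a numbered code appears as a prefix on the word itself (for example \0Mr). A minimal sketch of separating such a prefix from the word follows. It handles only the numbered codes listed above, and it assumes that \11 and \15 take precedence over \1 followed by a digit, which the manual does not state. CODE_MEANINGS and split_code are illustrative names.

```python
import re
from typing import Optional, Tuple

# Numbered codes from the list above.
CODE_MEANINGS = {
    0: "abbreviation",
    1: "non-current English",
    2: "non-standard English",
    3: "foreigner English",
    4: "science fiction",
    5: "miscellaneous",
    6: "foreign word or expression widely used",
    11: "Cyrillic alphabet",
    15: "Greek alphabet",
}

# Backslash followed by 11 or 15, otherwise by a single digit 0-6.
CODE_PREFIX = re.compile(r"\\(1[15]|[0-6])")


def split_code(word: str) -> Tuple[Optional[int], str]:
    """Return (code, bare word); code is None when there is no prefix."""
    match = CODE_PREFIX.match(word)
    if match is None:
        return None, word
    return int(match.group(1)), word[match.end():]
```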

@ flag

The query marker @ was inserted by the automatic pre-edit program PREEDIT1 in a column of the verticalised text when certain changes or additions had been made to the horizontal text and in certain other circumstances when verticalisation was problematic:

1. Full-stop was inserted because one of the following characters had been encountered without preceding terminal punctuation:
   a. Open heading (*<)
   b. Close heading (*>)
   c. Sentence-initial mark (^)
   d. Sentence-initial mark preceded by end quote mark (**' or **"). (Full-stop was inserted before end quote.)

2. One of the following characters or strings had been changed:
   a. Begin-list mark (_) to sentence-initial mark (----);
   b. [ to (;
   c. ] to );
   d. **[BEGIN QUOTE**] to *";
   e. **[END QUOTE**] to **";
   f. **[MIDDLE OF QUOTE**] to *" or **".

3. One of the following characters had been treated as a single word though this might be problematic:
   a. Prime mark (*?7-9) preceded by digit;
   b. Slash (/) followed by space.

2.7 Number of words

Because of the differences in word division, the number of words is somewhat higher than in the original, untagged corpus:

A       89,139      J        161,907
B       54,447      K         59,205
C       34,321      L         49,145
D       34,388      M         12,120
E       76,916      N         59,390
F       89,094      P         59,382
G      155,342      R         18,203
H       60,769      Tot.   1,013,768

These figures do not include the punctuation marks.
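As a quick arithmetic check, the per-category figures above can be summed and compared with the stated total. The dictionary name WORD_COUNTS is illustrative only; the numbers are copied from the table.

```python
# Word counts per text category, copied from the table in 2.7.
WORD_COUNTS = {
    "A": 89_139, "B": 54_447, "C": 34_321, "D": 34_388,
    "E": 76_916, "F": 89_094, "G": 155_342, "H": 60_769,
    "J": 161_907, "K": 59_205, "L": 49_145, "M": 12_120,
    "N": 59_390, "P": 59_382, "R": 18_203,
}

STATED_TOTAL = 1_013_768

# The category figures add up to the total given in the table.
computed_total = sum(WORD_COUNTS.values())
assert computed_total == STATED_TOTAL
```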

2.8 Sample text extract - vertical version

A012001		-----	------------------------------
A012002		*'	*'		H
A012010		VB	stop		H
A012020		VBG	electing	H
A012030		NN	life		H
A012040		NNS	peers		H
A012041		**'	**'		H
A012042		.	.		H	@
A013001		-----	-------------------------------
A013010		IN	by		H
A013020		NP	Trevor		H
A013030		NP	Williams	H
A013031		.	.		H	@
A014001		-----	-------------------------------
A014010		AT	a			P
A014020		NN	move
A014030		TO	to
A014040		VB	stop
A014050		NPT	\0Mr			\0
A014060		NP	Gaitskell
A014070		IN	from
A014080		VBG	nominating
A014090		DTI	any
A014100		AP	more
A014110		NN	labour		N
A015010		NN	life
A015020		NNS	peers		N
A015030		BEZ	is
A015040		TO	to
A015050		BE	be
A015060		VBN	made
A015070		IN	at
A015080		AT	a
A015090		NN	meeting
A015100		IN	of
A015110		NN	labour		N
A015120		NPTS	\0MPs			\0
A015140		NR	tomorrow
A015141
| |||||
| ||||Separator (1 digit)
| |||Word no. (2 digits)
| ||Line no. (1-3 digits)
| Sample no. (2 digits)
Text category (letter)
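The breakdown shown in the diagram can be applied by reading the reference number from the right: the last digit is the separator, the two digits before it the word number, and whatever remains after the two-digit sample number is the line number. The sketch below assumes any blanks inside the field are padding only and can be dropped; WordReference and parse_word_reference are hypothetical names.

```python
from typing import NamedTuple


class WordReference(NamedTuple):
    category: str   # text category (letter)
    sample: int     # sample no. (2 digits)
    line: int       # line no. (1-3 digits)
    word: int       # word no. (2 digits)
    separator: int  # separator (1 digit)


def parse_word_reference(reference: str) -> WordReference:
    # Assumption: blanks inside the field are padding and can be dropped.
    compact = "".join(reference.split())
    category, digits = compact[:1], compact[1:]
    # 2 (sample) + 1-3 (line) + 2 (word) + 1 (separator) = 6 to 8 digits
    if not category.isalpha() or not digits.isdigit() or not 6 <= len(digits) <= 8:
        raise ValueError(f"not a LOB word reference: {reference!r}")
    return WordReference(
        category=category,
        sample=int(digits[:2]),
        line=int(digits[2:-3]),
        word=int(digits[-3:-1]),
        separator=int(digits[-1]),
    )
```

In the extract above, the closing quote and full stop that follow "peers" (A012040) carry the references A012041 and A012042, so in that case they share its word number and differ only in the final digit.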

2.9 Sample text extract - horizontal version

A012	^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A013	^ by_IN Trevor_NP Williams_NP ._.
A014	^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A014	nominating_VBG any_DTI more_AP labour_NN
A015	life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A015	of_IN labour_NN \0MPs_NPTS tomorrow_NR
A016	^ \0Mr_NPT Michael_NP Foot_NP has_HVZ put_VBN down_RP a_AT
A016	resolution_NN on_IN the_ATI subject_NN and_CC
A017	he_PP3A is_BEZ to_TO be_BE backed_VBN by_IN \0Mr_NPT Will_NP
A017	Griffiths_NP ,_, \0MP_NPT for_IN Manchester_NP
A018	Exchange_NP ._.
A019	^ though_CS they_PP3AS may_MD gather_VB some_DTI left-wing_JJB
A019	support_NN ,_, a_AT large_JJ majority_NN
A010	of_IN labour_NN \0MPs_NPTS are_BER likely_JJ to_TO turn_VB down_RP
A010	the_ATI Foot-Griffiths_NP
A011	resolution_NN ._.
A012	^ *'_*' abolish_VB Lords_NPTS **'_**' ._.
A013	^ \0Mr_NPT Foot's_NP$ line_NN will_MD be_BE that_CS as_CS labour_NN
A013	\0MPs_NPTS opposed_VBD the_ATI
A014	government_NN bill_NN which_WDT brought_VBD life_NN peers_NNS into_IN
A014	existence_NN ,_, they_PP3AS should_MD
A015	not_XNOT now_RN put_VB forward_RB nominees_NNS ._.
A016	^ he_PP3A believes_VBZ that_CS the_ATI House_NPL of_IN Lords_NPT
A016	should_MD be_BE abolished_VBN and_CC that_CS
A017	labour_NN should_MD not_XNOT take_VB any_DTI steps_NNS which_WDT
A017	would_MD appear_VB to_TO *'_*' prop_VB up_RP **'_**' an_AT
A018	out-dated_JJ institution_NN
| ||Line no. (1-3 digits)
| |Sample no. (2 digits)
Text category (letter)