RE: Corpora: lemma vs lexeme

Przemyslaw Kaszubski
Wed, 10 Nov 1999 12:05:59 CET

Many thanks to all those who have replied. From what I have read,
one immediate conclusion follows regarding the status of a 'lemma':
it is very much a CONVENTIONAL unit, devised and used by
lexicographers, corpus linguists and computational lexicologists to
somehow order and arrange the world of wordforms. Which lemma
definition you settle on seems to depend on the task at hand.
This is an approach very much to my liking: it means I, too,
can define lemmatisation criteria for my purposes. Consider the
rather tentative tone of the makers of the CELEX database in the
following passages:

"So when you're interested in a word like walking, you
know that you can find lots of information about it under the
bold-type entry for the verb walk. These bold-type words in
dictionaries are called headwords or canonical forms, since
they represent what can be called the full canon or paradigm
of inflections: walk is the headword which stands for the
wordforms walk, walks, walking and walked.
The dictionary headword, as described above is one form a
lemma can take to represent a `word' in all its inflected forms.
It is possible---but probably not very helpful for humans---to
signify the `word' by some other word, or even a number;
anything will do, so long as it is understood to represent the
whole inflectional paradigm. A lemma is that `underlying'
form; it doesn't really exist, except for use in databases and
dictionaries. It looks like a real word, but in fact, it's just a
convenient way of expressing something bigger.
In an English lemma lexicon, the lemma is given in the form
of the traditional lexicographic headword. This is in contrast
to Dutch lemma lexicons, where the lemma can take the form
either of the traditional headword or of a stem, which is a
form more suitable for most linguistic research. No such
complications apply to English, however: the `underlying'
lemma always becomes the traditional headword when it
comes to the `surface'."
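
The point that "anything will do, so long as it is understood to
represent the whole inflectional paradigm" can be made concrete with
a small sketch. This is only an illustration under my own assumptions:
the names LEXICON, build_index and lemmatise are hypothetical, and the
layout is not the actual CELEX format. The lemma is identified by an
arbitrary number; the headword is just one attribute of the entry.

```python
# Hypothetical mini lemma lexicon (not the CELEX layout): the key is an
# arbitrary identifier, the value holds the headword and the wordforms
# of the inflectional paradigm that the lemma stands for.
LEXICON = {
    1: {"headword": "walk",
        "wordforms": ["walk", "walks", "walking", "walked"]},
}


def build_index(lexicon):
    """Map every wordform to the IDs of the lemmas whose paradigm contains it."""
    index = {}
    for lemma_id, entry in lexicon.items():
        for form in entry["wordforms"]:
            index.setdefault(form, []).append(lemma_id)
    return index


INDEX = build_index(LEXICON)


def lemmatise(wordform):
    """Return the headwords of all lemmas whose paradigm includes this wordform."""
    return [LEXICON[i]["headword"] for i in INDEX.get(wordform, [])]
```

Looking up "walking" goes through the numeric ID and only then reaches
the headword "walk", which is the sense in which the headword is merely
the conventional surface label for the paradigm.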

"There is one major difference between dictionary entries and
celex English lemmas, however: celex lemmas are never
distinguished solely on the basis of meaning. In a dictionary,
there might be two entries for the noun bank, one explaining
that it means the land at the side of a river, the other that it
means a financial institution. In the celex database, there
is only one lemma for the noun bank, and thus it gets only
one row in the database (which corresponds to an entry or
sub-paragraph in a dictionary). On what basis, then, does
celex differentiate between lemmas? There are five possible
criteria. If two potential lemmas are the same on all five
points, then they are considered as belonging to one lemma.
This remains true even if the two words differ in meaning.
If, however, they differ on any one criterion, and differ in
meaning, then they are treated as two separate lemmas. The
five distinguishing criteria are as follows:
1. Orthography of the wordforms.[...]
2. Syntactic class. [...]
3. Inflectional paradigm. The noun antenna (meaning radio
aerial) and the noun antenna (an anatomical feature of some
insects) are two different lemmas [...]
4. Morphological structure. The noun rubber (someone or
something that rubs -- rub + er) and the noun rubber (the
elastic substance [...]
5. Pronunciation of the wordforms.[...]"
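
The decision rule in the passage above can be sketched as follows. This
is a rough illustration, not the CELEX implementation: the Candidate
class, its field names and the sample values (plural forms, morphology
strings, pronunciation placeholders) are my own assumptions, since the
quoted text elides the details. The quote also leaves open what happens
when two candidates differ formally but not in meaning; the sketch
treats that case as a single lemma, which is again an assumption.

```python
from dataclasses import dataclass

# The five formal criteria named in the quoted passage, as field names.
CRITERIA = ("wordforms", "syntactic_class", "paradigm",
            "morphology", "pronunciation")


@dataclass(frozen=True)
class Candidate:
    """A potential lemma; all values below are illustrative assumptions."""
    wordforms: tuple       # 1. orthography of the wordforms
    syntactic_class: str   # 2. syntactic class
    paradigm: str          # 3. inflectional paradigm
    morphology: str        # 4. morphological structure
    pronunciation: str     # 5. pronunciation (placeholder notation)
    sense: str             # meaning: never sufficient on its own


def separate_lemmas(a, b):
    """True if a and b differ on at least one criterion AND in meaning."""
    differs_formally = any(getattr(a, f) != getattr(b, f) for f in CRITERIA)
    differs_in_meaning = a.sense != b.sense
    return differs_formally and differs_in_meaning


bank_river = Candidate(("bank", "banks"), "N", "-s", "bank", "bank", "riverside")
bank_money = Candidate(("bank", "banks"), "N", "-s", "bank", "bank", "institution")

antenna_radio = Candidate(("antenna", "antennas"), "N", "-s",
                          "antenna", "antenna", "radio aerial")
antenna_insect = Candidate(("antenna", "antennae"), "N", "-e",
                           "antenna", "antenna", "insect feeler")

rubber_agent = Candidate(("rubber", "rubbers"), "N", "-s",
                         "rub+er", "rubber", "one who rubs")
rubber_stuff = Candidate(("rubber", "rubbers"), "N", "-s",
                         "rubber", "rubber", "elastic substance")
```

Under these assumptions, separate_lemmas(bank_river, bank_money) is
False (one lemma despite two meanings), while the antenna and rubber
pairs both come out True.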

According to the CELEX guides, meaning WOULD have been
fancied as one of the criteria, but apparently there was (and still is,
I believe) no convincing and accurate method of sense

A broader conclusion regarding the lexeme has yet to dawn on me,
but it will. Thanks a lot to everyone.

Przemyslaw Kaszubski, M.A.


School of English
Adam Mickiewicz University
Al. Niepodleglosci 4
61-874 Poznan, POLAND
tel: +48 61 8528820
fax: +48 61 8523103