Corpora: grammar of English letter combinations

From: Geoffrey Sampson (geoffs@cogs.susx.ac.uk)
Date: Mon May 08 2000 - 13:01:13 MET DST


    This posting is primarily for entertainment, but may have some intellectual
    purpose also.

    In the first place, many thanks to all who responded to my query about a
    grammar of English letter-sequences which would generate "possible words" of
    English. From the replies it sounded as though nothing exactly like this
    is out there, though (as I thought) several linguists have done similar
    things with phoneme sequences. The solidest-looking reference here was
    from Graeme Hirst, Toronto, to pp. 220-32 of B.L. Whorf, _Language, Thought,
    and Reality_, a passage I think I remember vaguely though I have not checked
    it now (I needed written letter-sequences rather than phoneme sequences).

    Since exactly what I wanted didn't seem to be available, I tried a DIY
    solution. I wrote a prog which counts 3-gram frequencies in the words of an
    electronic dictionary (including inflected forms, and omitting any entry not
    composed entirely of lower-case alphabetic letters), with the twist
    that 48 common consonant pairs and vowel pairs (e.g. "tt", "ie") are
    treated as if they were single letters. The prog then uses the 3-gram
    frequencies to construct words probabilistically, starting from a word
    boundary and ending when it hits another word boundary. It works like a dream.
    Appended for readers' delectation is a short excerpt from its output.
    Some of the forms are real words, of course; others are BETTER than real
    words!
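
For anyone who wants to try something similar, the procedure described above might be sketched roughly as follows. This is not the original program, only a minimal illustration under stated assumptions: the names `tokenize`, `train` and `generate` are hypothetical, the `DIGRAPHS` set is a small stand-in (the posting names only "tt" and "ie" out of the 48 pairs), the left-to-right greedy pairing is a guess, and padding each word with two boundary symbols at the start is just one way of letting a 3-gram model begin from a word boundary.

```python
import random
from collections import Counter, defaultdict

BOUNDARY = "#"  # word-boundary symbol (an assumption; any non-letter would do)

# Illustrative stand-in only: the posting mentions 48 pairs but names just
# "tt" and "ie"; the remaining entries here are guesses, not the author's list.
DIGRAPHS = frozenset({"tt", "ie", "th", "sh", "ch", "ee", "oo", "ss", "ll", "ea"})


def tokenize(word, digraphs=DIGRAPHS):
    """Split a word into units, greedily treating listed pairs as one unit."""
    units, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in digraphs:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def train(words, digraphs=DIGRAPHS):
    """Count 3-grams of units, keyed as counts[(u1, u2)][u3]."""
    counts = defaultdict(Counter)
    for word in words:
        # Skip entries not made up entirely of lower-case letters a-z.
        if not (word.isascii() and word.isalpha() and word.islower()):
            continue
        units = [BOUNDARY, BOUNDARY] + tokenize(word, digraphs) + [BOUNDARY]
        for a, b, c in zip(units, units[1:], units[2:]):
            counts[(a, b)][c] += 1
    return counts


def generate(counts, rng, max_units=40):
    """Sample units in proportion to 3-gram counts until a boundary is drawn.

    max_units is a safety cap; a word that reaches it is returned truncated.
    """
    context = (BOUNDARY, BOUNDARY)
    out = []
    for _ in range(max_units):
        followers = counts.get(context)
        if not followers:  # only happens if nothing was trained
            break
        unit = rng.choices(list(followers), weights=list(followers.values()))[0]
        if unit == BOUNDARY:
            break
        out.append(unit)
        context = (context[1], unit)
    return "".join(out)
```

To use it, one would call `train` on the dictionary's word list and then call `generate(counts, random.Random())` repeatedly. Passing the random generator in explicitly makes a run repeatable when a seed is given.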

    Prof. Geoffrey Sampson

    School of Cognitive & Computing Sciences
    University of Sussex
    Falmer, Brighton BN1 9QH, GB

    e-mail geoffs@cogs.susx.ac.uk
    tel. +44 1273 678525
    fax +44 1273 671320
    Web site http://www.grs.u-net.com

    fard
    mered
    unleadcrating
    mal
    phist
    indfarkiness
    ider
    eling
    ay
    undry
    booked
    tals
    ephons
    am
    splametion
    debults
    cas
    skirdate
    nalixtrobitches
    ditch
    nuses
    red
    sitatiseregrequest
    lagarreled
    scolinted
    dent
    prisombosed
    mater
    ned
    palurprestratents
    jampass
    roofed
    hidaw
    unded
    gnees
    hindfulaturaterround
    coulikenas
    gled
    beared
    cons
    issing
    compored
    hoodoorbigging
    peskier
    ame
    cong
    aniers
    idepulated
    caring
    briguns
    de
    inters
    sorcal
    deaf
    amely
    throntolaggled
    debucklevibrant


