Corpora: Penn-Helsinki Parsed Corpus of Middle English

From: Anthony Kroch (kroch@linc.cis.upenn.edu)
Date: Fri May 19 2000 - 16:42:08 MET DST

    This might be of interest to others besides the questioner.

    ------- Forwarded Message

    Date: Fri, 19 May 2000 10:26:45 -0400
    From: kroch@change.ling.upenn.edu
    To: G.Rundblad@uea.ac.uk
    Subject: Corpora: XML programmes and tagging

    Hello Gabriella Rundblad,

    My name is Anthony Kroch and I am on the faculty of the Linguistics Department
    at the University of Pennsylvania. As it happens, my colleagues and I have
    already created a corpus of the sort that you are talking about. It is a
    parsed version of the prose text samples in the Helsinki Corpus of Historical
    English and is called the "Penn-Helsinki Parsed Corpus of Middle English." The
    first edition, which is five years old or so, has a total of 500,000 words of
    running text and marks clause and phrase structure without indicating part of
    speech. A second edition will be released at the end of the month. This new
    edition contains 1.3 million words and was created by increasing the size of
    the Helsinki samples (to a maximum of 50,000 words when the text was long
    enough). The second edition also has a richer annotation system and includes
    part-of-speech tagging. The first edition comes with Perl scripts to
    facilitate searching and the second edition comes with a specially written
    Java program for this purpose.

    You can get more information about the corpora from the PPCME web site:
    http://www.ling.upenn.edu/mideng

    The corpora are not currently in XML format but part of our plan for the
    future is to perform that conversion, which can be done automatically for the
    most part. We are currently creating a corpus of early Modern English, using
    the same annotation guidelines as those of the PPCME2. At the University of
    York in England, Prof. Anthony Warner is directing a project to create a
    corpus of Old English along the same lines.

    Please feel free to contact me if you have any questions about the corpora.

    Yours,

    Anthony Kroch
    Professor and Chair
    Department of Linguistics
    University of Pennsylvania
    Philadelphia, PA 19104-6305
    USA

    >From: Gabriella Rundblad <G.Rundblad@uea.ac.uk>
    >To: CORPORA@hd.uib.no
    >Subject: Corpora: XML programmes and tagging
    >
    >
    >Dear all,
    >
    >Despite having used language corpora for some years, I've
    >never put together my own corpus. Until now.
    >
    >I'm considering putting together a corpus of Middle English
    >using already electronically available text, but tagging it
    >to enable searches. I shall be attending the Oxford summer
    >seminars on digital resources etc. to learn more, but would
    >like to address some of the issues already now and perhaps
    >do some tests to see if my idea is plausible at all.
    >
    >
    >1) As far as I understand, it is today recommended to use
    >XML for tagging purposes. For this I'll need user-friendly
    >programme(s), the question is which. I know there are both
    >freeware, shareware and commercial products out there,
    >though I've never tried (yet) any of them and don't
    >know how user-friendly they are. I know HTML and use
    >Hotmetal Pro for this (great!) and there is obviously an
    >XML equivalent (XMetal). Could you advise what programme(s)
    >to use?! Is XMetal good for a never-before-tagger?!
    >
    >2) The tagging I would like to do (I'm reading up on TEI
    >etc) is a tagging of phrases and clauses, not parts of
    >speech. What's been done on this earlier? Any lists of tags
    >etc?
    >
    >
    >Grateful for all the advice you can offer.
    >
    >
    >Gabriella Rundblad
    >
    >
    >University of East Anglia
    >School of Language, Linguistics and Translation Studies
    >Norwich NR4 7TJ
    >UK
    >

    ------- End of Forwarded Message


