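The script below will work for most tags, but may fail in the following
more complicated cases:
1. A tag is spread out over more than one line (usual cases: comment tags,
tags with attribute/value pairs). The script reads one line at a time, and
"." does not match a newline, so such a tag is never matched as a whole.
2. A tag has an attribute value that has a ">" in it. The match stops at
that first ">", so the rest of the tag is left in the output.
3. A comment tag has a ">" embedded in it. The match is cut short in the
same way.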
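One possible way to cover the three cases above, offered as a rough sketch rather than a tested tool: work on the whole text as one string, remove comments first, and let the tag pattern skip over quoted attribute values. The name strip_tags and the sample text are made up for illustration. The sketch assumes comments have the plain `<!-- ... -->` form and that quotes inside a tag are balanced; a stray "<" in running text or an unclosed quote would still confuse it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): removes tags
# from a whole document held in a single string.
sub strip_tags {
    my ($text) = @_;

    # Case 3: remove comments first, so a ">" inside a comment cannot
    # end a match early. /s lets "." match newlines, which also covers
    # comments that span several lines.
    $text =~ s/<!--.*?-->//gs;

    # Cases 1 and 2: a tag is "<", then any mix of quoted strings and
    # characters other than ">" or a quote, then ">". The negated class
    # matches newlines, so multi-line tags are handled as well.
    $text =~ s/<(?:"[^"]*"|'[^']*'|[^>"'])*>//g;

    return $text;
}

# Made-up sample that exercises all three cases.
my $sample = <<'END';
<p class="a"
   title="x > y">Some <b>bold</b> text.</p>
<!-- a comment with > inside
     spanning two lines -->
Done.
END

print strip_tags($sample);
```

To use this on a file, the file would have to be read in one piece (for example with `local $/;` before the read) instead of line by line, otherwise case 1 still fails.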
I have encountered all three of these cases in HTML files of journal
articles downloaded from the web. Thanks.
-Alex Yeh
Danko Sipka wrote:
> Hi:
>
> This Perl script should do the job:
>
> print "What is your input file name:\n";
> chomp($infile=<STDIN>);
> open IN, $infile or die "No file, no fun!";
> open OUT, ">$infile.out" or die "No file, no fun!";
> while (<IN>) {
>     $_=~s/\<.+?\>//g;
>     print OUT "$_";
> }
> close (IN) or die "D'oh!";
> close (OUT) or die "D'oh!";
>
> Best,
> Danko Sipka
> sipkadan@main.amu.edu.pl | Danko.Sipka@asu.edu
> http://main.amu.edu.pl/~sipkadan | http://www.public.asu.edu/~dsipka
>
> ----- Original Message -----
> From: Tine & Colleen
> To: CORPORA@HD.UIB.NO
> Sent: Tuesday, April 16, 2002 8:13 PM
> Subject: Corpora: sgml detagger
> Hi
>
> I am compiling a corpus for research reasons and some of
> the texts are SGML-tagged. Does anybody know an easy way to
> remove the tags and save the texts as 'raw' .txt files? Maybe
> a Perl script?
>
> Thanks in advance
> Tine Lassen
> Copenhagen
>