It will probably be easier to use an existing SGML parser than to
write a script that can really identify _all_ possible tags and
remove them.
The (freely available) parser onsgmls puts all data content in its
output on lines of their own, each prefixed by a "-". So you can
simply run onsgmls on your SGML files and retain only those lines
that start with "-" (using 'grep -e "^-"'); then you can easily
remove the leading "-" with perl or something similar. This assumes
that all data is wanted text and not, e.g., JavaScript, which you
will probably not want to include in your corpus.
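
The filtering step could also be done in one small script instead of
grep plus perl. The following is only a sketch in Python: the function
names (extract_data, _unescape) are made up for illustration, and it
assumes the escapes used inside data lines are "\\" for a backslash,
"\n" for a record end, "\|" around SDATA entity text, and "\nnn" for a
character given in octal. Any other escape is left untouched.

```python
import re

# Escapes assumed to occur inside onsgmls data lines:
#   \\   a literal backslash
#   \n   a record end (newline in the original data)
#   \|   brackets around SDATA entity text (dropped here)
#   \nnn a character given as three octal digits
_ESCAPE = re.compile(r"\\(\\|n|\||[0-7]{3})")


def _unescape(match):
    """Translate one escape sequence into the text it stands for."""
    token = match.group(1)
    if token == "\\":
        return "\\"
    if token == "n":
        return "\n"
    if token == "|":
        return ""
    return chr(int(token, 8))


def extract_data(esis_lines):
    """Return the decoded content of every line that starts with "-".

    esis_lines is any iterable of lines from onsgmls output; all lines
    that are not data lines (elements, attributes, etc.) are skipped.
    """
    chunks = []
    for line in esis_lines:
        line = line.rstrip("\r\n")
        if line.startswith("-"):
            # Drop the leading "-" marker, then decode the escapes.
            chunks.append(_ESCAPE.sub(_unescape, line[1:]))
    return chunks
```

To use it on real files, the lines could come from the output of
onsgmls, e.g. by iterating over sys.stdin in a pipeline. Note that this
does nothing about unwanted data such as script content; that would
still have to be filtered separately.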
--
Dr. Michael Betsch privat: SFB 441, Projekt B1 Nauklerstraße 35 Rappenberghalde 27 72074 Tübingen 72070 Tübingen Tel. 07071/29-77161 Tel. 07071/51917 email: Michael.Betsch@uni-tuebingen.de