On Thu, 23 Mar 2000, Geoff Wilkins wrote:
> I'm looking for software - preferably freeware or shareware - to
> use to download text from Web sites, for use in a corpus.
I have used w3mir
http://www.math.uio.no/~janl/w3mir/
and
SiteSnagger
http://hotfiles.zdnet.com/cgi-bin/texis/swlib/hotfiles/info.html?fcode=000P7Z
Both have shortcomings, but I have downloaded gigabytes of HTML files
with them.
With w3mir (and some home-made programs) I have built a fully automatic
system that downloads all the new articles each day from 10 Norwegian
newspapers on the Web, strips the HTML tags, indexes the text (with IMS
CWB) and makes the full text searchable through a Web browser (password
protected for copyright reasons). I will present this project at LREC in
Athens later this year.
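
For anyone who wants to try the HTML-stripping step themselves, here is a
rough sketch in Python. It is not the home-made program used in the system
described above, only a minimal illustration built on the standard
library's html.parser module. The names TextExtractor and strip_html are
made up for this example, and it assumes the page has already been
downloaded and decoded to a string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <script>/<style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    page = "<p>Hello &amp; <b>world</b></p><script>var x = 1;</script>"
    print(strip_html(page))
```

Note that text from adjacent block elements is joined without any
separator here, so a real corpus pipeline would probably want to insert a
line break at tags such as <p> or <br> before indexing.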
Knut Hofland | Knut.Hofland@hit.uib.no
HIT-Centre (former NCCH) | http://www.hit.uib.no/knut/
University of Bergen, | Phone: +47 5558 9463
Allegt. 27, N-5007 Bergen, Norway | Fax: +47 5558 9470