I attach a minimalist Perl program that does the job. Alternatively, you
can find ready-made lists on my website.
Adam
Kai Noponen wrote:
> I need a tool that can make a frequency list out of the BNC. It must
> utilize the part-of-speech tags in order to separate the different cases.
> It also should read SGML.
--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow                         tel: (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                         fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ                 email: Adam.Kilgarriff@itri.bton.ac.uk
UK                          http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
==============cut here===================
$/ = "<w ";    # read the input one <w ...> element at a time
while (<>) {
    next unless /^([^>]+)>([^<]+)/;    # skip anything that is not "POS>word"
    $pos  = $1;
    $word = lc $2;    # all words normalised to lower case
                      # --delete 'lc' if you want to retain capitalisation
    $word =~ s/\n/ /g;
    $word =~ s/ +$//;
    next if $word eq '';
    $word =~ s/ +/_/g;    # multiword 'words' will have _ between items
                          # ("in_order_to") instead of spaces
    $count{$word." ".$pos}++;
}
for (keys %count) { print "$_ $count{$_}\n" }
# words which, for some reason, weren't marked up with the SGML w tag will be missed
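
The script above prints its entries in whatever order Perl's hash happens to give. If a list ordered by frequency is wanted, something along the following lines might serve; it is only a sketch under the same assumption about the markup (tokens of the form "<w POS>word "). The sub names count_tagged_words and sorted_lines are hypothetical, and the sample string is made up for illustration rather than taken from the BNC.

```perl
use strict;
use warnings;

# Hypothetical helper: count "word POS" pairs in a string of SGML in which
# each token looks like "<w POS>word ".
sub count_tagged_words {
    my ($text) = @_;
    my %count;
    # Split at the start of each <w ...> element; chunks that do not look
    # like "POS>word" (e.g. text before the first tag) are skipped.
    for my $chunk (split /<w /, $text) {
        next unless $chunk =~ /^([^>]+)>([^<]+)/;
        my ($pos, $word) = ($1, lc $2);
        $word =~ s/\s+$//;     # drop trailing whitespace
        next if $word eq '';
        $word =~ s/\s+/_/g;    # join multiword items with _
        $count{"$word $pos"}++;
    }
    return \%count;
}

# Hypothetical helper: return "word POS count" lines, most frequent first,
# with ties in alphabetical order.
sub sorted_lines {
    my ($count) = @_;
    return map  { "$_ $count->{$_}" }
           sort { $count->{$b} <=> $count->{$a} or $a cmp $b }
           keys %$count;
}

# Made-up sample input, for illustration only.
my $sample = "<w AT0>The <w NN1>cat <w PRP>in front of <w AT0>the <w NN1>mat";
print "$_\n" for sorted_lines(count_tagged_words($sample));
```

On the sample string this prints "the AT0 2" first, followed by the three entries that occur once. For real data the same two subs could be applied to the contents of a corpus file instead of $sample.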
This archive was generated by hypermail 2b29 : Thu Sep 14 2000 - 12:29:21 MET DST