On Mon, 10 Mar 2003, Joerg Schuster wrote:
> I think one of the disadvantages of your program is that it stores
> all data in main memory. You have to say something like
>
> my $sentences=get_sentences($in);
>
> Though this is very comfortable when dealing with small files, I would
> rather say something like
>
> while(<>) {
> print_sentences;
> }
>
> Then huge files could easily be sentencized, too.
The thing is that some of the decisions are made globally.
Of course the program does not need more than a reasonable
window of text to make good decisions, but the size of that
window is something the user should worry about (according
to the data available).
Given a huge file, you can first chop it into smaller chunks
(and you are free to decide how to do that) and then
feed the chunks to the Lingua::EN::Sentence module one at a time.
Taking input one line at a time will, in most cases, defeat the
effort to determine the proper locations of sentence boundaries.
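
A minimal sketch of the chunking approach, under stated assumptions: the
input separates paragraphs with blank lines, no sentence spans a blank
line, and a window of about 100,000 characters is enough context. The
helper name chunk_reader and the window size are hypothetical and not part
of Lingua::EN::Sentence; only get_sentences comes from the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::Sentence): returns a closure
# that yields successive chunks of at least $max_chars characters, always
# ending a chunk at a paragraph (blank-line) boundary.
sub chunk_reader {
    my ($fh, $max_chars) = @_;
    return sub {
        local $/ = "";    # paragraph mode: each read returns one paragraph
        my $chunk = '';
        while (defined(my $para = <$fh>)) {
            $chunk .= $para;
            last if length($chunk) >= $max_chars;
        }
        # undef signals end of input
        return length($chunk) ? $chunk : undef;
    };
}

# Runs only when a file name is given on the command line.
if (@ARGV) {
    require Lingua::EN::Sentence;
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $next_chunk = chunk_reader($in, 100_000);    # window size is a guess
    while (defined(my $chunk = $next_chunk->())) {
        # get_sentences returns a reference to an array of sentences
        my $sentences = Lingua::EN::Sentence::get_sentences($chunk);
        print "$_\n" for @{ $sentences || [] };
    }
    close $in;
}
```

Only one chunk is held in memory at a time, so the window size can be
tuned to the data without changing the rest of the loop.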
-- Shlomo Yona shlomo@cs.haifa.ac.il http://cs.haifa.ac.il/~shlomo/