Queries involving "stop words" (Was Re: Stop Word Lists)

R Chandrasekar (mickeyc@linc.cis.upenn.edu)
Mon, 16 Sep 1996 19:11:17 EDT

Hi all,

In his recent posting, Robert Amsler says (in the context of queries
which are made up substantially of what would 'normally' be stop words):

> .... consider the
> following queries...
>
> "Find all quotations to: "To be or not to be" in a corpus."
>
> "Find news stories about a former US TV show called "The A Team"
>
> "How much "down" is used in the fashion industry per year?"
>
> If one has the option to allow ALL words to be accessible if the user asks
> for them within quotes or via some other alternative means, then that would
> probably be preferable to not indexing them at all. ....

What is interesting is that some IR tools available to search across
the Web seem to have tackled this quite well. I tried the queries
above, and one additional query (Tell me all about the band "The Who")
which is similar in nature, and got some interesting results. I used my
favorite search engine, Alta Vista, which lets you group words
within quotes to indicate phrases, and prefix a word or phrase with
a + to indicate that it must be present for a document to match.
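
To make the query syntax concrete, here is a minimal sketch of how
such a query string might be split into terms. The function name
parse_query and the tuple layout are assumptions made for this
illustration; nothing here is taken from Alta Vista itself.

```python
import re

# Hypothetical parser: double quotes group words into a phrase, and a
# leading + marks the word or phrase as required. Illustration only.
TERM = re.compile(r'(\+?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Return a list of (text, is_phrase, is_required) tuples."""
    terms = []
    for plus, phrase, word in TERM.findall(query):
        if phrase:
            terms.append((phrase, True, plus == "+"))
        elif word:
            terms.append((word, False, plus == "+"))
    return terms

print(parse_query('+"The A Team" +US +TV'))
# [('The A Team', True, True), ('US', False, True), ('TV', False, True)]
```

An unquoted query such as To be or not to be simply yields six
single-word terms, none of them required.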

1. When I searched for the words

To be or not to be

I got no matches, and the system gave the following statistics
(words with their frequencies):

Ignored: To: 6705583; not: 23058338; be: 41779013;
or: 41826213; to: 184278669

But when I searched for the phrase

"To be or not to be"

I got 164 matching items. Clearly, many of these are matches to the
phrase, but not in the Shakespearian context; however, I did not
try to refine my query further.

2. For the query:
Find news stories about a former US TV show called "The A Team"
I looked just for

The A Team

and got the message:

Word count: Team:789223
Ignored : A: 52996696; The: 59193911

I got no matches out of this. But when I looked for the phrase:

"The A Team"

I got about 700 matches. With the query:

+"The A Team" +US +TV

this came down to 73 matches.

3. When I tried the query (the words from the query):

How much down is used in the fashion industry per year

I got:
Word count: fashion:329074; industry:2393072; How:2781093;
down:3134349; much:3258984; per:5065604
Ignored : year: 6478168; used: 6568163; is: 78496895;
in: 129623029; the: 387572745

About 200000 documents matched the query!
When I changed the query to:

+"fashion industry" +down

I got the following statistics:
Word count: fashion industry: about 3000;
down:3134349

About 300 documents matched the query. When I further refined (?)
the query to:

"use of down" "fashion industry"

I got one match, which looked good, except that it was in part about the
"use of down-time" (of computers), which was far from the context we
had in mind!

4. When I looked just for

The Who

I got the following numbers:
Word count: Who: 843185
Ignored : The: 59193911

And Alta Vista was all set to show me documents 1-10 of
about 900000 matching the query. This included items
with the segments "Who's who?" and "Who is ....".
When I looked for the phrase:

"The Who"

I got:
Word count: The Who: about 4000

The approx. 4000 items matching the query included items such as:
Applied Mathematics Center, The Who's Who
which are really not relevant. Finally, when I tried:

+"The Who" concert

I got:
Word count: The Who: about 6000; concert:333835

About 800 documents matched this query, which
is still a large number.
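
The word statistics from the four experiments are at least consistent
with a simple frequency cutoff: every ignored word is more frequent
than every counted word. The sketch below checks this using only the
numbers reported above. The cutoff value and the function name are
assumptions chosen for illustration; this is an inference from the
data, not a documented property of Alta Vista.

```python
# Frequencies exactly as reported in experiments 1 to 4.
counted = {"Team": 789223, "fashion": 329074, "industry": 2393072,
           "How": 2781093, "down": 3134349, "much": 3258984,
           "per": 5065604, "Who": 843185, "concert": 333835}
ignored = {"To": 6705583, "not": 23058338, "be": 41779013,
           "or": 41826213, "to": 184278669, "A": 52996696,
           "The": 59193911, "year": 6478168, "used": 6568163,
           "is": 78496895, "in": 129623029, "the": 387572745}

highest_counted = max(counted.values())   # "per"
lowest_ignored = min(ignored.values())    # "year"

# Assumed cutoff: any value in the gap between the two would do.
HYPOTHETICAL_CUTOFF = 6_000_000

def would_be_ignored(frequency, cutoff=HYPOTHETICAL_CUTOFF):
    """Treat a word as a stop word if it is more frequent than the cutoff."""
    return frequency > cutoff

consistent = (all(not would_be_ignored(f) for f in counted.values())
              and all(would_be_ignored(f) for f in ignored.values()))
print(highest_counted, lowest_ignored, consistent)
# 5065604 6478168 True
```

If this reading is right, the stop list would be derived from the
collection itself rather than fixed in advance, but the sample here
is far too small to settle the question.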

-----------------------------------------------------

What can we say about all this?

A. Alta Vista seems to have some useful mechanisms to handle
phrases which include stop words. Presumably other search engines
have something similar. These search engines seem to be indexing
*all* words, ignoring the very frequent ones only when they appear as
free-standing query terms. This changes the notion of stop words somewhat.
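
One way such behaviour could arise is a positional index that records
every word, with the stop list applied only to unquoted query words.
The following is a minimal sketch under that assumption. The stop
list, the sample documents, and all function names are made up for
the illustration and do not describe how Alta Vista actually works.

```python
import re
from collections import defaultdict

STOP_WORDS = {"to", "be", "or", "not", "the", "a"}  # illustrative list only

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every word (stop words included) to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index

def search_words(index, query):
    """Unquoted query: stop words are dropped before lookup."""
    hits = set()
    for word in tokenize(query):
        if word not in STOP_WORDS:
            hits |= set(index.get(word, {}))
    return hits

def search_phrase(index, phrase):
    """Quoted query: every word counts and positions must be consecutive."""
    words = tokenize(phrase)
    if not words:
        return set()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, ())
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

docs = {
    1: "To be or not to be, that is the question",
    2: "The A Team was a television show",
    3: "He wants to be on a team",
}
index = build_index(docs)
print(search_words(index, "To be or not to be"))   # set()
print(search_phrase(index, "To be or not to be"))  # {1}
print(search_phrase(index, "The A Team"))          # {2}
```

The unquoted query finds nothing because every word in it is on the
stop list, while the same words in quotes match document 1, which
mirrors the pattern seen in experiments 1 and 2.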

B. There still seems to be a substantial need for search intermediaries,
who can refine queries to work well with a particular information source.

C. There are some fundamental problems, such as tokenization
(should "use of down-time" be treated differently from "use of down"?)
and spelling variations (if we use "met*wave", will it cover
"metrewave", "meterwave" and "meter-wave"?)
which need to be understood better.
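
Both problems can be made concrete with a small sketch. It compares
two hypothetical tokenizers, one that keeps hyphenated words whole
and one that splits at hyphens, and uses Python's standard
fnmatch.fnmatchcase as a stand-in for the wildcard matching. How any
real search engine tokenizes or expands wildcards is not known here.

```python
import re
from fnmatch import fnmatchcase

def tokenize_keep_hyphens(text):
    """'down-time' stays one token."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def tokenize_split_hyphens(text):
    """'down-time' becomes 'down' and 'time'."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a consecutive run in tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))

def wildcard_hits(pattern, tokens):
    """Tokens matching a shell-style pattern, where * matches any run."""
    return [t for t in tokens if fnmatchcase(t, pattern)]

phrase = ["use", "of", "down"]
text = "the use of down-time of computers"
print(contains_phrase(tokenize_keep_hyphens(text), phrase))   # False
print(contains_phrase(tokenize_split_hyphens(text), phrase))  # True

spellings = "metrewave meterwave meter-wave"
print(wildcard_hits("met*wave", tokenize_keep_hyphens(spellings)))
# ['metrewave', 'meterwave', 'meter-wave']
print(wildcard_hits("met*wave", tokenize_split_hyphens(spellings)))
# ['metrewave', 'meterwave']
```

Under the splitting tokenizer the phrase "use of down" matches text
about "down-time", as happened in experiment 3, and "meter-wave" is
lost to the wildcard because no single token spans the hyphen. So
the answer to both questions depends on the tokenization chosen at
indexing time.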

Regards,

-- Chandrasekar
mickeyc@linc.cis.upenn.edu

______________________________________________________________________
R Chandrasekar Voice Mail: +1-215-898-0332
CASI/Instt for Research in Cognitive Science Fax : +1-215-573-9247
University of Pennsylvania Home : +1-610-352-5512
3401 Walnut St, Suite 400 C mickeyc@linc.cis.upenn.edu
Philadelphia,PA 19104-6228, USA http://www.cis.upenn.edu/~mickeyc
______________________________________________________________________