Re: Posted via E-mail (w/t vs. t/w)

Henry S. Thompson (ht@cogsci.ed.ac.uk)
Fri, 19 Apr 96 14:08:59 BST

Another way of putting what the others have said is that it's because
we approach the tagging problem as an instance of the noisy channel
model: The original data was a sequence of tags, which has been
corrupted by the noisy channel to give a sequence of words, from which
we have the job of recovering the original tag sequence.

So we want to recover the tag sequence T which is most likely given the
observed word sequence W, i.e. the T which maximises p(T|W). There's no
obvious way to estimate p(T|W) directly, but Bayes' rule tells us that
p(T|W) is equal to

p(T) * p(W|T)
-------------
p(W)

So we want to find the sequence T which maximises THAT value, and
that's not too hard: p(W) doesn't change as we change T, so we
just need to maximise the numerator, p(T) * p(W|T). A simple tagger
might do this by using observed tag bigram frequencies to estimate
p(T) as the product of the individual p(t[i]|t[i-1]), and similarly
[finally he gets to the answer to your question!] by using the product
of the frequency-based estimates of the individual p(w[i]|t[i]) to
estimate p(W|T).
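
To make the argmax concrete, here is a minimal sketch in Python of such a
simple tagger. Everything in it is an assumption made for illustration: the
toy corpus, the tag names, the "<s>" start pseudo-tag and the function names
are all invented, and the estimates are raw unsmoothed frequencies. It scores
every candidate tag sequence by p(T) * p(W|T) and keeps the best one.

```python
from collections import Counter
from itertools import product

START = "<s>"  # hypothetical sentence-start pseudo-tag

# Hypothetical toy tagged corpus, invented for illustration only.
CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("barks", "NOUN"), ("stopped", "VERB")],
]


def train(corpus):
    """Collect tag-bigram counts and word/tag pair counts."""
    bigrams, prev_counts = Counter(), Counter()
    pairs, tag_counts = Counter(), Counter()
    for sentence in corpus:
        prev = START
        for word, tag in sentence:
            bigrams[(prev, tag)] += 1   # count of tag following prev
            prev_counts[prev] += 1      # count of prev as a left context
            pairs[(word, tag)] += 1     # count of the word/tag pair
            tag_counts[tag] += 1        # count of the tag overall
            prev = tag
    return bigrams, prev_counts, pairs, tag_counts


def ratio(numerator, denominator):
    """Relative frequency, taken to be 0 when nothing was observed."""
    return numerator / denominator if denominator else 0.0


def score(words, tags, model):
    """Unsmoothed estimate of p(T) * p(W|T) for one candidate T."""
    bigrams, prev_counts, pairs, tag_counts = model
    p, prev = 1.0, START
    for word, tag in zip(words, tags):
        p *= ratio(bigrams[(prev, tag)], prev_counts[prev])  # p(t[i]|t[i-1])
        p *= ratio(pairs[(word, tag)], tag_counts[tag])      # p(w[i]|t[i])
        prev = tag
    return p


def best_tags(words, model):
    """Brute-force argmax over every possible tag sequence."""
    tagset = sorted(model[3])
    candidates = product(tagset, repeat=len(words))
    return max(candidates, key=lambda tags: score(words, tags, model))


MODEL = train(CORPUS)
print(best_tags(["the", "dog", "barks"], MODEL))
```

Enumerating every sequence is exponential in the sentence length, so this is
only workable for toy input. A practical tagger would find the same maximum
with dynamic programming (the Viterbi algorithm) and would smooth the
estimates so that unseen pairs don't force the whole product to zero.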

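Needless to say, any table of word/tag pair frequencies will allow you
to derive estimates of either p(w|t) or p(t|w). It just depends on
what you sum over what: divide the pair count by the total for the tag
(summed over all words) to get p(w|t), or by the total for the word
(summed over all tags) to get p(t|w).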
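
As a small illustration of that last point, the sketch below derives both
conditionals from one table of pair counts. The counts and the function names
are made up for the example, not taken from any real corpus.

```python
from collections import Counter

# Hypothetical word/tag pair frequencies, invented for illustration.
PAIR_COUNTS = Counter({
    ("barks", "NOUN"): 1,
    ("barks", "VERB"): 3,
    ("dog", "NOUN"): 4,
    ("runs", "VERB"): 2,
})


def p_word_given_tag(word, tag, counts=PAIR_COUNTS):
    """p(w|t): pair count over the tag's total, summed over all words."""
    tag_total = sum(n for (_, t), n in counts.items() if t == tag)
    return counts[(word, tag)] / tag_total if tag_total else 0.0


def p_tag_given_word(tag, word, counts=PAIR_COUNTS):
    """p(t|w): pair count over the word's total, summed over all tags."""
    word_total = sum(n for (w, _), n in counts.items() if w == word)
    return counts[(word, tag)] / word_total if word_total else 0.0


print(p_word_given_tag("barks", "VERB"))  # 3 / (3 + 2)
print(p_tag_given_word("VERB", "barks"))  # 3 / (1 + 3)
```

The numerator is the same pair count in both cases; only the denominator,
i.e. which margin of the table you sum along, changes.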

Hope this helps,

ht