Okay. Let me now talk about how we evaluate text categorization. In doing
this, I'm going to come back to the concepts of precision and recall that we introduced
informally before, define them formally, and show how they get put
together into a combined measure, the F measure, which gets applied to text
classification. But it isn't only applied to text classification. You'll see these
concepts coming back again and again as ways of evaluating tasks in natural
language processing. The starting point for understanding these measures is the
following two-by-two contingency table. For any particular piece of data
that we're evaluating, there are essentially four states. On one axis we're
looking at whether this piece of data correctly belongs to a class, or whether
it doesn't. So, for example, if we're wanting to
decide whether a piece of email is spam, it either is spam, which is the
correct class, or it's not, which is the not-correct class. So on this axis
we're describing the truth. Now, what we've done is built a system that tries to
detect the truth, and so secondly we're going to look at the status of our system:
our system could be saying that this piece of data is spam, or it could be
saying that it's not spam. And so we then take these two distinctions and look at
the four possibilities that occur, and here they are. So if we're looking for
spam, one possibility is that the truth is that it is spam, and that we
said it was spam. That's then referred to as a true positive. Another
possibility is that our system thought it wasn't spam even though actually it was.
That's then a false negative: it was treated as negative, falsely. On the
other side of the table, it's possible that the piece of email wasn't spam. In that
case there are two possibilities. Our system mistakenly thought it was spam, so
that's a false positive, where we're classifying something positively, wrongly.
But the other possibility is that it wasn't a piece of spam and our system said it
wasn't a piece of spam, and then we're dealing with a true negative.
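To make those four outcomes concrete, here is a minimal sketch, not from the lecture itself, that tallies the four cells of the table for a hypothetical spam detector; the label True stands in for "belongs to the class we're looking for".

```python
# Minimal sketch (not from the lecture): tallying the four cells of the
# 2x2 contingency table for a spam detector. True means "is spam".
def contingency_counts(gold, predicted):
    tp = fp = fn = tn = 0
    for truth, guess in zip(gold, predicted):
        if truth and guess:
            tp += 1   # spam, and we said spam: true positive
        elif truth and not guess:
            fn += 1   # spam, but we said not spam: false negative
        elif not truth and guess:
            fp += 1   # not spam, but we said spam: false positive
        else:
            tn += 1   # not spam, and we said not spam: true negative
    return tp, fp, fn, tn

# Hypothetical example: 5 emails, 2 of which really are spam.
gold      = [True, False, True, False, False]
predicted = [True, False, False, True, False]
print(contingency_counts(gold, predicted))  # (1, 1, 1, 2)
```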
Okay. So, if we're doing this kind of classification with two classes
that are, perhaps, about equally common in your email, spam and non-spam, it seems
the reasonable thing to do is just to look at accuracy. For accuracy, the ones
that count as getting the answer correct are the true positives and the true
negatives. So if you want to work out the accuracy, the accuracy equals the true
positives plus the true negatives, over all four cells: the true positives plus
the false positives plus the false negatives plus the true negatives.
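As a quick illustration of that formula, here is a small sketch with made-up counts, so a hedged example rather than the lecture's numbers:

```python
# Sketch of the accuracy formula: correct decisions over all decisions.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical balanced case: spam and non-spam about equally common.
print(accuracy(tp=450, fp=50, fn=60, tn=440))  # 0.89
```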
In many applications, accuracy is a useful measure for systems. But there's a
particular scenario in which it's not a useful measure, and that scenario is
when you're dealing with things that are uncommon. So, for example, suppose that
we're wanting to detect mentions of shoe brands in web pages. The correct class
is that something is a shoe brand, and the not class is anything else. Well, if
you're just looking through random web pages or tweets or something, looking for
mentions of shoe brands, probably, I don't know, 99.9 percent of the words you
encounter are not going to be the name of a shoe brand. And so what does that
mean? What that is likely to mean is that, perhaps, in total, let's say there are
ten mentions of a shoe brand in a 100,000-word sample, and 99,990 words that
are not shoe brands. Well, if we build a classifier, presumably even if it's a
mediocre classifier, it's going to say most words are not a shoe brand. And if we
go straight to the limit case, one possibility is a classifier that
says that no word is a shoe brand. So it doesn't select any words as shoe brands,
and it says that all 100,000 words are not shoe brands. And so that means the number
of true negatives is 99,990, and the few false negatives, the words that really
were shoe brands, number ten. So if we then
actually work out the accuracy of this system, you'll see that the accuracy of
the system is 99.99 percent. So we have a 99.99 percent accurate system, but the
system has done precisely nothing. This is a very easy system to write: you just write
one line of code that says return false for any token, and you're done, and yet
its accuracy is amazing.
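To see just how easy that degenerate system is, here is a sketch of the one-line classifier and its accuracy on the hypothetical 100,000-word sample described above:

```python
# The do-nothing shoe-brand "classifier" from the example: it selects nothing.
def is_shoe_brand(token):
    return False  # every token is labelled "not a shoe brand"

# Hypothetical sample: 100,000 tokens, 10 of which really are shoe brands.
tp, fp = 0, 0        # we never say "shoe brand", so no positives at all
fn, tn = 10, 99_990  # the 10 real mentions are missed; everything else is right

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"{accuracy:.2%}")  # 99.99%
```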
So what's clearly missing here is that we're just not dealing with what we wanted
to do. What we wanted to do was detect shoe brands, so what's all-important to us
is the ten tokens which are instances of shoe brands. And so what we want to do is
come up with an evaluation metric that is much more focused on whether we are
detecting those very few words that are the names of shoe brands.
And so that's what the metrics of precision and recall do: they tell you, of the
things that you are selecting, whether they are the correct things. So precision
is saying, of the things that you are selecting, which is the selected row of the
table, what percentage of them are correct things, that is, fall in the correct
column. So precision is the number of true positives out of all
the things that you selected: the true positives plus the false positives. And
then recall is the opposite measure. For recall, we say: of the things that
are correct, what percentage of them did you find? So for recall, the numerator
is again the true positives, but this time the denominator is the true
positives plus the false negatives, the correct things that were not selected.
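Written out as code, under the definitions just given, precision and recall are simply the following; the small demo uses the do-nothing shoe-brand classifier's counts.

```python
# Precision: of the things we selected, what fraction were correct?
def precision(tp, fp):
    return tp / (tp + fp)

# Recall: of the correct things, what fraction did we find?
def recall(tp, fn):
    return tp / (tp + fn)

# Note that the 99,990 true negatives appear in neither formula.
print(recall(tp=0, fn=10))  # 0.0 for the classifier that selects nothing
```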
And so note how looking at these measures solves the problem that we had just now
with the shoe brands. The fact that there are 99,990 tokens that are true negatives
now has no effect on these measures at all. And so in our previous case, where
there were ten tokens that really were shoe brands and zero tokens selected, since
our classifier said every word wasn't a shoe brand, what we'd find is that the
recall of that system is zero. It's finding none of what we wanted
to find. And basically, we can say a system with zero recall isn't interesting.
And so this suggests that we'd actually do better by returning some stuff. What
we'd like to do is label some things as shoe brands. So we could revise the
system, and when we revise the system, maybe what we'll do is have it
return 40 things as names of shoe brands. And it will, perhaps, find eight of the
shoe brands, but it'll make some mistakes and claim some things as shoe brands that
aren't, so there'll be 32 tokens in the false positive cell. Then there'll be two
tokens in the false negative cell, and the remainder of the tokens, 99,958, will
be true negatives. So at that point we can say, well, the recall of our system is
pretty good: it's now finding eight out of the ten instances of shoe brands, so
its recall is 80 percent. But the problem is that recall came at a cost, because
it's now also claiming lots of other things as shoe brands that aren't. So the
precision of the system is now eight over eight plus 32, which equals eight out
of 40, which equals 20 percent. One in five things it returns is correct.
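Plugging the revised system's numbers into those definitions, as a quick check:

```python
# Revised shoe-brand detector from the example: it returns 40 tokens,
# 8 of which really are shoe brands (so 32 false positives, 2 misses).
tp, fp, fn = 8, 32, 2

recall    = tp / (tp + fn)  # 8 / 10 = 0.8 -> 80% of the real mentions found
precision = tp / (tp + fp)  # 8 / 40 = 0.2 -> 1 in 5 returned tokens is right

print(recall, precision)    # 0.8 0.2
```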
And so for a lot of practical applications this
kind of precision is going to be too low. It depends on whether you're really
interested in finding references to shoe brands and are prepared to have a human
being go through and check them all individually. But if you want to do
something more automatic, twenty percent is too low a precision. But in this, we
see the tradeoff that balances recall against precision. Because
almost inevitably, if you're going to increase your recall and find more instances
of something, you're going to make some mistakes, and so your precision is going
to go down. And the more you try to boost recall, the more your precision is going
to drop. And so people are always trading off precision and recall. That's a good
thing about having these two measures, because you can discuss the
tradeoff and say how important to someone is the precision of what's returned versus
how important is it to find all of the stuff, to have high recall.
And that's a tradeoff that is played out differently in different applications.
So in some applications, such as legal applications where you want to find
all of the appropriate evidence, such as in discovery procedures, what you
really, really want is a system with high recall, one that finds as
much of the relevant stuff as possible. Whereas in other contexts, where what you're
going to do is just show a sampling of stuff to the user, you might be more
interested in the stuff that's shown to the user being stuff that is correct, that
looks good, so that your precision is high. And it doesn't really matter that you're
only showing the user one-tenth or one-twentieth of the things that do satisfy their query.
So sometimes that explicit tradeoff is really useful, and it's a really good
concept to remember when you're building these kinds of systems. Because in practice,
whenever you're building one of these systems, you're choosing some tradeoff
point as to how much you're emphasizing precision or recall, and different
tradeoff points are appropriate in different applications. But sometimes it's
a slightly annoying thing, having these two measures, because if people are
wanting to compare systems and say which one is better, then you need some way to
combine these measures, and so that's been thought about as well.
And so the standard way that's been proposed to combine these measures, in
disciplines like information retrieval and named entity recognition, is
something that's called the F measure. So what is the F measure? The F measure is
neither more nor less than a weighted harmonic mean. So at this point we have to
revise a little bit of math on means. You doubtless remember the
arithmetic mean, which is just the average of two things, and you perhaps also
remember the geometric mean. But in addition to both of those, there's a
harmonic mean. In a harmonic mean, what you do is take the reciprocals of the
two quantities, average them, and then take the reciprocal of that average. And if
you work through in more detail what that does, the harmonic mean ends up as a very
conservative average. If you put two numbers into a harmonic mean, the result is
fairly close to the minimum of the two numbers: not completely at the minimum, but
nearer to the minimum than either the arithmetic or geometric means. Then, in
particular, for the F measure we're allowed to put weights on the two terms,
and that's this alpha factor, so the weighting is how much to pay more
attention to the precision or how much to pay more attention to the recall. So, if
you are in an application and you know the precision is much more important to you,
you can actually express that utility tradeoff by putting in a particular value of
alpha that expresses it. This formulation very clearly shows that the F measure
is a weighted harmonic mean.
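The formulation being pointed at on the slide isn't reproduced in this transcript, but the standard weighted harmonic mean form with an alpha weight on precision looks like this; the example values reuse the shoe-brand detector's precision of 0.2 and recall of 0.8, so this is a sketch from standard references rather than a quote of the slide.

```python
# Weighted F measure as a weighted harmonic mean of precision P and recall R:
#   F = 1 / (alpha * (1/P) + (1 - alpha) * (1/R))
# alpha near 1 emphasizes precision; alpha near 0 emphasizes recall.
def f_measure(precision, recall, alpha=0.5):
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

print(f_measure(0.2, 0.8, alpha=0.1))  # ~0.615: mostly cares about recall
print(f_measure(0.2, 0.8, alpha=0.9))  # ~0.216: mostly cares about precision
```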
But actually, that's not the formulation that
you normally see in books. The formulation that you normally see in books is
this one, which is, I guess, just a little bit tidier to write. These two
formulations are related, and you can work through that in one of the exercises,
working out the relationship between alpha and beta. But effectively, this
formula also expresses a weighted harmonic mean, and it also has a control
parameter, beta, for how much emphasis is being put on precision versus recall.
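The textbook form with beta isn't shown in this transcript either; the standard version is sketched below, again as a reconstruction from standard references rather than the slide, with the usual identity relating the two parameterizations noted in a comment.

```python
# F_beta, the form usually seen in books:
#   F_beta = (1 + beta**2) * P * R / (beta**2 * P + R)
# beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
def f_beta(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# The two parameterizations agree when beta**2 = (1 - alpha) / alpha,
# i.e. alpha = 1 / (1 + beta**2).
print(f_beta(0.2, 0.8, beta=1.0))  # 0.32
print(f_beta(0.2, 0.8, beta=2.0))  # 0.5: beta > 1 favours recall
```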
While this control parameter is useful, in the absence of other evidence the most
commonly used thing is the balanced F measure. The balanced F measure
gets referred to as the F1 measure, and what that means is that you're
setting beta equal to one. That then gives you equal balance between precision
and recall, which corresponds in turn to alpha being a half. And if you do the
balanced F measure, the formula simplifies, and the simplified form is really the
thing that you'll see most commonly. So if you don't remember any
of the rest of this stuff, what you should do is remember this little formula for the
F measure: two times the precision times the recall, divided by the sum of the
precision and the recall.
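As one last worked example, using the shoe-brand detector's precision of 0.2 and recall of 0.8 from earlier (my own check, not a number from the slides):

```python
# Balanced F measure (F1): the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.2, 0.8))     # 0.32: pulled toward the weaker of the two numbers
print((0.2 + 0.8) / 2)  # 0.5: the arithmetic mean, for comparison
```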
So that should give you a good sense of what these
measures of precision and recall are, how they are combined into the F measure, why
they're useful, and how we use them to evaluate text classification.