Precision, recall, sensitivity and specificity

Nowadays I work for a medical device company, where the big indicators of a medical test’s success are sensitivity and specificity. Every medical test strives to reach 100% in both criteria. Imagine my surprise today when I found out that other fields use different metrics for the exact same problem. To analyze this I present to you the confusion matrix:

                           classified positive     classified negative
actually positive          true positive (tp)      false negative (fn)
actually negative          false positive (fp)     true negative (tn)

E.g. we have a pregnancy test that classifies people as pregnant (positive) or not pregnant (negative); a small counting sketch follows the list.

  • True positive – a person we told is pregnant who really was.
  • True negative – a person we told is not pregnant who really wasn’t.
  • False negative – a person we told is not pregnant, though they really were. Ooops.
  • False positive – a person we told is pregnant, though they weren’t. Oh snap.
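
To make the four cells concrete, here is a minimal Python sketch that tallies them from actual vs. predicted labels (the tiny example data is made up for illustration):

    # Tally the four confusion-matrix cells; "pregnant" is the positive class.
    actual    = ["pregnant", "pregnant", "not", "not", "not"]
    predicted = ["pregnant", "not", "not", "pregnant", "not"]

    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == "pregnant" and p == "pregnant":
            tp += 1  # told pregnant, really was
        elif a == "pregnant":
            fn += 1  # told not pregnant, really was - ooops
        elif p == "pregnant":
            fp += 1  # told pregnant, really wasn't - oh snap
        else:
            tn += 1  # told not pregnant, really wasn't

    print(tp, fn, fp, tn)  # -> 1 1 1 2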

And now some equations…

Sensitivity and specificity are statistical measures of the performance of a binary classification test:

sensitivity = (number of true positives) / (number of true positives + number of false negatives)

specificity = (number of true negatives) / (number of true negatives + number of false positives)

Sensitivity in yellow, specificity in red

 

In pattern recognition and information retrieval:

precision = (number of relevant documents retrieved) / (number of retrieved documents)

recall = (number of relevant documents retrieved) / (number of relevant documents)

Let’s translate (there’s a short code check below):

  • Relevant documents are the positives
  • Retrieved documents are the ones classified as positive
  • Relevant and retrieved are the true positives.

Precision in red, recall in yellow
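
To see the translation in action, here is a small Python sketch that computes precision and recall straight from sets of relevant and retrieved documents (the document IDs are made up):

    # Precision and recall in information-retrieval terms.
    relevant  = {"doc1", "doc2", "doc3", "doc4", "doc5"}   # the positives
    retrieved = {"doc1", "doc2", "doc6", "doc7"}           # classified as positive

    true_positives = relevant & retrieved                  # relevant and retrieved

    precision = len(true_positives) / len(retrieved)       # 2 / 4 = 0.5
    recall    = len(true_positives) / len(relevant)        # 2 / 5 = 0.4
    print(precision, recall)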

Standardized equations

  • sensitivity = recall = tp / t = tp / (tp + fn), where t is everything that is actually positive
  • specificity = tn / n = tn / (tn + fp), where n is everything that is actually negative
  • precision = tp / p = tp / (tp + fp), where p is everything classified as positive (a code sketch of all three follows)
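
A minimal Python sketch of the three formulas, handy for checking any confusion matrix by hand (the function names are my own):

    # Sensitivity/recall, specificity and precision from confusion-matrix counts.
    def sensitivity(tp, fn):          # a.k.a. recall
        return tp / (tp + fn)

    def specificity(tn, fp):
        return tn / (tn + fp)

    def precision(tp, fp):
        return tp / (tp + fp)

    # The pregnancy-test matrix used below: tp=8, fn=2, fp=10, tn=80.
    print(sensitivity(8, 2))    # 0.8
    print(specificity(80, 10))  # 0.888...
    print(precision(8, 10))     # 0.444...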

Equations explained

  • Sensitivity/recall – how good a test is at detecting the positives. A test can cheat and maximize this by always returning “positive”.
  • Specificity – how good a test is at avoiding false alarms. A test can cheat and maximize this by always returning “negative”.
  • Precision – how many of the positively classified were relevant. A test can cheat and maximize this by only returning “positive” on the one result it’s most confident in.
  • The cheating is resolved by looking at both relevant metrics instead of just one. E.g. the cheating test that always says “positive” has 100% sensitivity but 0% specificity, as the sketch below shows.
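
A quick Python check of the always-“positive” cheat (the counts are made up for illustration):

    # A test that labels every single patient "positive":
    # every actual positive becomes a tp, every actual negative becomes an fp.
    actual_positives, actual_negatives = 10, 90

    tp, fn = actual_positives, 0    # no positives are ever missed
    fp, tn = actual_negatives, 0    # every negative raises a false alarm

    print(tp / (tp + fn))           # sensitivity = 1.0 - a perfect-looking 100%
    print(tn / (tn + fp))           # specificity = 0.0 - the cheat is exposed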

More ways to cheat

A Specificity buff – let’s continue with our pregnancy test where our experiments resulted in the following confusion matrix:

                           told “pregnant”   told “not pregnant”
actually pregnant               8 (tp)             2 (fn)
actually not pregnant          10 (fp)            80 (tn)

Our specificity is only 88.9% (80 / 90) and we need 97% for our FDA approval. We can tell our patients to run the test twice and count only double positives (e.g. two red lines). A healthy patient is now misclassified only when both runs come up falsely positive, which happens with probability (10/90)², so we suddenly have 98.7% specificity. Magic. This is only kosher if the test results are proven to be independent, and most tests probably aren’t (e.g. blood parasite tests that are triggered by antibodies may repeatedly give false positives for the same patient).
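
The double-test arithmetic as a short Python check (assuming, as above, that the two runs are independent):

    fp, tn = 10, 80
    false_positive_rate = fp / (fp + tn)     # 10/90, about 0.111 per single test

    # Two independent runs must both come up falsely positive
    # for a healthy patient to be misclassified.
    double_test_specificity = 1 - false_positive_rate ** 2
    print(double_test_specificity)           # about 0.9877, the ~98.7% claimed above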

A less ethical (though IANAL) approach would be to add 300 men to our pregnancy test experiment. Of course, part of our test is to ask “are you male?” and mark these patients as “not pregnant”. Thus we get a lot of easy true negatives, and this is the resulting confusion matrix:

                           told “pregnant”   told “not pregnant”
actually pregnant               8 (tp)             2 (fn)
actually not pregnant          10 (fp)           380 (tn)

Voila! 97.4% specificity with a single test. Have fun trying to get that FDA approval though, I doubt they’ll overlook the 300 red herrings.
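
A short Python check of the dilution trick; note that precision and recall, which never look at true negatives, don’t budge:

    tp, fn, fp = 8, 2, 10

    for tn in (80, 380):                     # before and after adding the 300 men
        print(tn / (tn + fp),                # specificity: 0.889 -> 0.974
              tp / (tp + fp),                # precision:   0.444 either way
              tp / (tp + fn))                # recall:      0.800 either way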

What does it mean, who won?

Finally the punchline:

  • A search engine only cares about the results it shows you. Are they relevant (tp) or are they spam (fp)? Did it miss any relevant results (fn)? The ocean of ignored (tn) results shouldn’t affect how good or bad a search algorithm is. That’s why true negatives can be ignored.
  • A doctor tells a patient whether they’re pregnant or not, or whether they have cancer. Each decision may have grave consequences and thus true negatives are crucial. That’s why all the cells in the confusion matrix must be taken into account.

References

http://en.wikipedia.org/wiki/Confusion_matrix

http://en.wikipedia.org/wiki/Sensitivity_and_specificity

http://en.wikipedia.org/wiki/Precision_and_recall

http://en.wikipedia.org/wiki/Accuracy_and_precision


8 thoughts on “Precision, recall, sensitivity and specificity”

  1. I disagree that true negatives can be ignored by search engines, and perhaps that’s why they suck. Unless you compare your listed results against the true negatives, I don’t think you have a fair idea of how well the search is actually performing. You don’t know whether your false positives are reasonable or excessive.

    • Sorry for the delay. Here are 2 thoughts:

      1. When comparing search algorithms you’ll most likely compare them on a static database, so you can assume FP = Const − TN (the total number of actual negatives is fixed). So TN is in there, though not directly.

      2. The users of a search engine will never see the true negatives. They are not affected by their abundance or scarcity. A search engine designer should only be worried about what he shows the users. Another example would be a criminal finger print matching search algorithm where every result returned means more police work. Every positive classified costs actual tax money to address. So now you have a few search algorithms and you want to know which one gives you the best bang/$ and that’s precision and recall. Specificity is a red herring in that case, not to mention it may be incomprehensibly over 99.99% in a giant database.

  2. It’s a very interesting subject, one I’ve come to appreciate more over the years. I remember the first time I saw this (back in high school) and had to work with sensitivity and specificity – what a discovery that was for me!

    Nowadays I actually work with even more interesting factors; you can do some really unexpected stuff with classifiers, it seems. Let’s meet for lunch and I’ll tell you about it :)

  3. Hi there
    I am not a statistician, but I am doing some modeling of invasive plants. I am wondering: if in a test sensitivity is more important than specificity, which method should we use instead of the F-score to give more weight to sensitivity than to specificity? Thanks

    • Hi Hank, I just now noticed this comment, so sorry for taking so long to answer. Could you describe the statistical problem you’re trying to solve?

  4. Nice explanation! I just have a question: do you think we can eventually depend only on sensitivity and specificity to assess a classifier’s performance? The aim would be to maximize both values. This is very helpful in cases where we judge a classifier’s performance by how good it is at correctly identifying both classes, because by definition precision, recall, and even the F-score that depends on them tend to focus on only one class and sort of ignore the other.

    • I think you’re going to need to understand the problem your classifier is trying to solve and then decide what the right benchmark is. E.g. for a search engine, precision and recall are better benchmarks than sensitivity and specificity.
