Testing binary classifiers

24 Oct 2018

Introduction to binary classifiers

Classification is a classic problem in machine learning and statistics, where we have some data and wish to choose a single category for each data point. The simplest form of this is binary classification, where each data point belongs to one of two states. For example, is an email spam or not spam? Does a medical test mean a patient does or doesn’t have a particular disease? Does a picture contain a cat or a dog? Consider a series of emails that have been hand-labelled as either spam or not spam. A simple (and probably pretty poor) binary classifier would be to check whether the email contains the word “spam” itself. In this way we could automatically classify each email and then compare against the hand-labelled category.
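
As a minimal sketch of that toy rule (the example emails and labels below are invented for illustration):

```python
# Toy classifier: flag an email as spam if it contains the word "spam".
def predict_spam(email: str) -> bool:
    return "spam" in email.lower()

# Hand-labelled examples (invented for illustration).
emails = ["Win money now, this is not spam we promise!", "Meeting moved to 3pm"]
labels = [True, False]  # True = actually spam
predictions = [predict_spam(e) for e in emails]
```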

No classifier is perfect. Sometimes spam is missed, or a medical test gives a wrong result. In order to determine how well a classifier performs, a number of different performance statistics have been developed. This article briefly goes through some of the main ones, with interactive graphs to demonstrate how they’re applied and some of their consequences.

The statistics

For a population of data points (e.g. emails, patients, or images) and a binary classifier (e.g. checking whether the email contains the word “spam”), each data point falls into one of four categories. These are given by whether the data point actually is positive or negative (e.g. is spam or has a disease) against whether the point tested positive or negative (e.g. the email contains the word “spam”). The four groupings (tallied in the short code sketch after the list) are then

  • True Positive (TP). A point that tested positive and is actually positive.
  • True Negative (TN). A point that tested negative and is actually negative.
  • False Positive (FP). A point that tested positive, but is actually negative.
  • False Negative (FN). A point that tested negative, but is actually positive.

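A minimal sketch of tallying these four counts in code, assuming the actual labels and test outcomes are available as lists of booleans (the values below are made up):

```python
# Actual labels and test outcomes (True = positive); values invented for illustration.
actual = [True, True, False, False, False, True, False, False]
tested = [True, False, True, False, False, True, False, True]

tp = sum(a and t for a, t in zip(actual, tested))              # tested positive, actually positive
tn = sum((not a) and (not t) for a, t in zip(actual, tested))  # tested negative, actually negative
fp = sum((not a) and t for a, t in zip(actual, tested))        # tested positive, actually negative
fn = sum(a and (not t) for a, t in zip(actual, tested))        # tested negative, actually positive
print(tp, tn, fp, fn)
```
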
The diagram below splits the population into positive (in orange) and negative (in blue), with those that tested positive shown darker than those that tested negative. The area of each region represents the total number in that category.



The above diagram shows some of the main classifier statistics used:

  • Sensitivity (or recall). The proportion of actually positive points that tested positive.
  • Specificity. The proportion of actually negative points that tested negative.
  • Precision (or positive predictive value). The probability that a point that tested positive is actually positive.

Each button above shows how to calculate these in practice.
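
In code, these come directly from the four counts; here is a rough sketch with made-up counts:

```python
def classifier_stats(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the three headline statistics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall: actual positives that tested positive
        "specificity": tn / (tn + fp),  # actual negatives that tested negative
        "precision":   tp / (tp + fp),  # tested-positive points that are actually positive
    }

# Example with made-up counts.
print(classifier_stats(tp=2, tn=3, fp=2, fn=1))
```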

Putting it into practice

Thinking about these statistics is important, especially when the prevalence of a positive case is low. If only one in a thousand emails is spam and our test is not perfectly specific, then most points that test positive actually won’t be spam: the few true positives are swamped by the false positives coming from the much larger negative population.
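
To make this concrete with Bayes’ theorem (the figures here are illustrative and not taken from the diagram below): suppose the prevalence is 0.001 and the test has a sensitivity and specificity of 0.99. Then

\(\begin{align} \text{precision} &= \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity})(1 - \text{prevalence})}, \\ &= \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} \approx 0.09, \end{align}\)

so under these assumptions fewer than one in ten points that test positive is actually positive, despite the apparently accurate test.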

The interactive diagram below simulates the consequences of varying prevalence, sensitivity, and specificity for a population of points (shown as circles). First use the sliders to set how prevalent a positive case is, along with the properties of the test. Then simulate the positives using the set positives button. Next apply the test with the specified sensitivity and specificity. Finally sort the data points to calculate the statistics.



The data points are sorted into a confusion matrix, with the test result (negative or positive) arranged in columns and the actual state (negative or positive) arranged in rows. Once sorted, the numbers of true positives, false positives, true negatives and false negatives can be read off as the entries of the matrix. A number of statistics can then be calculated, including the sensitivity, specificity (now estimated from the sample, as opposed to set for the underlying test) and precision. Others include the negative predictive value (the probability that a point that tested negative is actually negative), the accuracy (the proportion of points correctly classified) and the F1 score (the harmonic mean of precision and sensitivity).
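
A rough sketch of this simulation in code, with made-up slider values, drawing the positives and the test outcomes at random and then tallying the matrix and the extra statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
n, prevalence, sensitivity, specificity = 1000, 0.1, 0.9, 0.8  # made-up slider values

# Simulate which points are actually positive, then apply the imperfect test.
actual = rng.random(n) < prevalence
tested = np.where(actual,
                  rng.random(n) < sensitivity,        # positives test positive at rate = sensitivity
                  rng.random(n) < (1 - specificity))  # negatives test positive at rate = 1 - specificity

# Confusion matrix counts.
tp = np.sum(actual & tested)
fn = np.sum(actual & ~tested)
fp = np.sum(~actual & tested)
tn = np.sum(~actual & ~tested)

# Sample statistics.
precision = tp / (tp + fp)
npv = tn / (tn + fn)              # negative predictive value
accuracy = (tp + tn) / n          # proportion correctly classified
f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and sensitivity
print(precision, npv, accuracy, f1)
```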

The receiver operating characteristic

For some purposes a higher number of false positives or false negatives can be tolerated. Often a binary classifier will report a continuous value rather than a straight positive or negative (such as a probability, or the concentration of a species in a blood test). In order to classify each point as negative or positive, we are free to set a threshold on these values wherever we like. For example, if there is a high cost to missing a positive point then we might set this threshold low, so that anything above it is classified as positive. In general, we want to be able to determine how well a classifier performs over a range of these thresholds. This can be done using the Receiver Operating Characteristic (ROC) curve.

The diagram below demonstrates the ROC for varying classifiers. We can imagine that a classifier assigns a value to each point. If the point is positive, this value is drawn from one distribution (we’re using a normal distribution here, but the exact shape doesn’t matter), and if the point is negative, the value is drawn from a different distribution. The sliders below change the mean and variance of the positive distribution.



The diagram on the left shows how the classifier translates each point into its corresponding value, depending on whether it’s positive or negative. The diagram on the right maps out, for each possible threshold, the corresponding false positive rate (1 - specificity) and true positive rate (sensitivity). By experimenting, you can see that a good classifier maximizes the true positive rate whilst minimizing the false positive rate. The area under the whole curve (AUC) then summarizes how well the classifier performs over the whole range of possible thresholds. It turns out that, if you pick a positive point and a negative point at random, the probability that the value of the positive point is higher than that of the negative point is equal to the AUC (see below for more of an explanation).
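
As a rough sketch of this construction, assuming normally distributed classifier values as in the diagram (the means, variances and sample sizes below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Classifier values: negatives ~ N(0, 1), positives ~ N(1.5, 1) (made-up parameters).
neg = rng.normal(0.0, 1.0, 5000)
pos = rng.normal(1.5, 1.0, 5000)

# Sweep thresholds from high to low; points above the threshold test positive.
thresholds = np.linspace(6, -6, 200)
tpr = np.array([np.mean(pos > t) for t in thresholds])  # sensitivity at each threshold
fpr = np.array([np.mean(neg > t) for t in thresholds])  # 1 - specificity at each threshold

# AUC via trapezoidal integration of the ROC curve ...
auc_curve = np.trapz(tpr, fpr)
# ... and via the probability that a random positive scores above a random negative.
auc_pairs = np.mean(rng.choice(pos, 20000) > rng.choice(neg, 20000))
print(auc_curve, auc_pairs)  # the two estimates should roughly agree
```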

You can also use the diagram above to explore how the shape of the ROC relates to where the classifier may be failing. For example, if the ROC curve dips below the diagonal at any point, the false positive rate exceeds the true positive rate there: a larger fraction of negative cases than positive cases lies above that threshold, so the classifier is doing worse than chance in that region.

Wrapping up

There’s a plethora of statistics for binary classifiers. Each has its subtleties, and they can be used in conjunction to understand issues and assess how well the classifier is performing.

Understanding the math behind the AUC

I mentioned briefly above that the probability that the value of a positive case is greater than that of a negative case is equal to the AUC. In order to understand this, we can simply integrate over the ROC curve, writing the true positive rate $tpr(x) = \int_x^\infty f_1(y)\,dy$ and the false positive rate $fpr(x) = \int_x^\infty f_0(y)\,dy$ at threshold $x$ in terms of the probability densities $f_1$ and $f_0$ of the classifier value for a positive and a negative point respectively,

\(\begin{align} AUC &= \int_0^1 tpr \, d(fpr), \\ &= \int_{\infty}^{-\infty} tpr(x) \, fpr'(x) \, dx, \\ &= \int_{-\infty}^\infty tpr(x) \, f_0(x) \, dx, \\ &= \int_{-\infty}^\infty \int_{x}^\infty f_1(y) \, dy \, f_0(x) \, dx, \\ &= \int_{-\infty}^\infty \int_{-\infty}^\infty I(y > x) \, f_1(y) \, f_0(x) \, dx \, dy, \\ &= P(Y > X), \end{align}\)

where going from the second to the third line uses $fpr'(x) = -f_0(x)$ and flips the limits of integration, and $Y$ and $X$ are the classifier values of a randomly chosen positive and negative point respectively.

Acknowledgements

All the interactive examples were coded in d3, which has many fantastic examples for data visualization.