In many binary classification problems, especially in domains with highly unbalanced problems (such as the medical domain and rare event detection), we need to make sure our model does not become too biased for the more predominant class. 

Thus, you may have heard that accuracy is not a good metric to validate classifiers on unbalanced settings. Instead, people tend to use other performance metrics that are robust to imbalance, such as the ROC AUC and the F1-score. So, why don’t we train models that optimize these metrics directly? Well, in some cases it is not possible/efficient since they are not differentiable. This is a series of blog posts in which we’ll explain how to optimize your model for such metrics (or soft versions of them), starting with the ROC AUC. 

What is the ROC AUC?

A ROC Curve is a plot which illustrates the diagnostic ability of a binary classifier for varying discriminative thresholds in its probabilistic output.  It can be built by thresholding the model’s predicted probabilities at several values between 0 to 1 and calculating the True Positive Rate (Proportion of positive samples correctly predicted as positive) and the False Positive Rate (proportion of negative samples incorrectly predicted as positive). The ROC curve is the plot of all these determined points, as shown below.

In ML, we typically want to achieve the curve with the highest possible area. Why? Maybe you didn’t know this, but the area under the ROC curve equals the probability of the classifier ranking a randomly chosen positive instance ahead of a randomly chosen negative one (the proof of such a theorem is available here). Namely, what’s the probability of assigning a higher priority to a sick patient than to a healthy patient.

So, we can think of the ROC AUC as the accuracy of a ranking model when exposed to a pair of samples with opposite classes (e.g, one sick and one healthy). Being the cross-entropy loss the de facto soft approach for learning binary classifiers when we have accuracy in mind, the cross-entropy of a pairwise ranking model (e.g., a siamese neural network) would be a soft way of learning that tends to maximize the ROC AUC.

How to optimize the ROC AUC?

Let’s say we have a model (e.g., a deep neural network) such as the following one that, given the input features, predicts a continuous score.

As we discussed before, maximizing the ROC AUC is equivalent to maximizing the accuracy of the score difference sign for a positive-negative pair:

Thus, we can simply use a Siamese architecture, where each stream will contain our target model, trained on positive-negative pairs. In our architecture, the scores will be subtracted and passed through a sigmoid activation in order to approximate the probability of the positive sample having a higher score than the negative sample.

For generating our training batches, each pair will have a sample from each class, one on the negative stream and one on the positive stream, meaning that the ground truth will always be 1, as the probability from the positive stream should always be higher than the probability from the negative stream. The model is trained by minimizing the cross-entropy loss of this pairwise target. We won’t converge to a naive solution here given that the weights on each stream of a siamese network are shared.

This means the model is penalized whenever the negative side has a greater score than the positive side. As such, we are optimizing the model so that it always gives a higher score to the positive class (input_pos) when compared to a negative class (input_neg) – which is essentially the definition of optimizing ROC AUC! 

So, how do we turn this network into an actionable model, which returns the binary classes? We need to reduce our network to a single stream, with the pre-trained weights, and determine a threshold value for the predicted score above which the model classifies the class as positive.

Validation

This architecture was tested on the CIFAR10 dataset in Keras, by creating an artificially unbalanced problem. The positive class was considered to be “airplanes”, and the negative class was all the other classes in the dataset. The positive class was then subsampled to 5% so we created an artificially unbalanced problem. 

Afterward, we used a feedforward network with internal dropout layers and compared the performance of the simple cross-entropy based strategy to train classifiers with the siamese-based models that we discussed in this post. The experiment was repeated 5 times with different random seeds to get a mean value more independent of the image selection process.

The mean ROC AUC value for the Siamese Network was (86 ± 1.3)%, while for the one-stream network the value was reduced to (72.2 ±  7.2)%, showing that optimizing the model with the siamese architecture was beneficial for the ROC AUC. 

Conclusion

This blogpost explained how to optimize your model for a different metric, based on the probabilistic interpretation of ROC AUC.

At NILG.AI, we have worked on a lot of medical and marketing applications, where targets tend to be extremely unbalanced. We have used this strategy in several projects, achieving on each case higher performance with this learning strategy than with traditional learning approaches. If you are facing a similar problem, let’s discuss how we can collaborate with these types of learning approaches!