Difficult Targets to Optimize: the ROC AUC

A practical trick for training ML models on imbalanced data

In many binary classification problems, especially in domains where classes are highly unbalanced (such as the medical domain and rare event detection), we need to make sure our model does not become too biased toward the predominant class.

You may have heard that accuracy is not a good metric for validating classifiers in unbalanced settings. Instead, people tend to use performance metrics that are more robust to imbalance, such as the ROC AUC and the F1-score. So, why don’t we train models that optimize these metrics directly? Well, in some cases it is not possible or efficient, since these metrics are not differentiable. This is a series of blog posts in which we’ll explain how to optimize your model for such metrics (or soft versions of them), starting with the ROC AUC.

What is the ROC AUC?

A ROC curve is a plot that illustrates the diagnostic ability of a binary classifier as the discrimination threshold applied to its probabilistic output varies. It can be built by thresholding the model’s predicted probabilities at several values between 0 and 1 and calculating, for each threshold, the True Positive Rate (proportion of positive samples correctly predicted as positive) and the False Positive Rate (proportion of negative samples incorrectly predicted as positive). The ROC curve is the plot of all these points.
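The thresholding procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the original experiments: the function name `roc_points`, the toy labels and scores, and the list of thresholds are all made up for the example.

```python
import numpy as np

def roc_points(y_true, y_score, thresholds):
    """Return (FPR, TPR) arrays with one ROC point per threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    fpr, tpr = [], []
    for t in thresholds:
        pred_pos = y_score >= t  # predicted positive at this threshold
        tpr.append((pred_pos & y_true).sum() / n_pos)
        fpr.append((pred_pos & ~y_true).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

# Toy data (hypothetical): two negatives and two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_score, thresholds=[1.0, 0.8, 0.4, 0.35, 0.1, 0.0])
```

Lowering the threshold can only add predicted positives, so both rates grow monotonically from (0, 0) towards (1, 1).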

In ML, we typically want the curve with the largest possible area under it. Why? Maybe you didn’t know this, but the area under the ROC curve equals the probability of the classifier ranking a randomly chosen positive instance ahead of a randomly chosen negative one (the proof of such a theorem is available here). In other words, it is the probability of assigning a higher priority to a sick patient than to a healthy patient.

So, we can think of the ROC AUC as the accuracy of a ranking model when exposed to a pair of samples with opposite classes (e.g., one sick and one healthy). Since the cross-entropy loss is the de facto soft approach for learning binary classifiers when we have accuracy in mind, the cross-entropy of a pairwise ranking model (e.g., a siamese neural network) is a soft way of learning that tends to maximize the ROC AUC.
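To make the ranking interpretation concrete, the sketch below computes the ROC AUC directly as the fraction of positive-negative pairs in which the positive sample gets the higher score. The function name `pairwise_auc` and the toy data are hypothetical; counting ties as half a correct pair is the usual convention and is an assumption here, since the post does not discuss ties.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of correctly ordered positive-negative pairs."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]
    neg = y_score[~y_true]
    # Score difference for every (positive, negative) pair.
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a correct pair (assumed convention).
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy example three of the four positive-negative pairs are ordered correctly, so the result is 0.75.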

How to optimize the ROC AUC?

Let’s say we have a model (e.g., a deep neural network) that, given the input features, predicts a continuous score.

As we discussed before, maximizing the ROC AUC is equivalent to maximizing the accuracy of the sign of the score difference for a positive-negative pair, i.e., how often the positive sample receives the higher score.
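In symbols, this equivalence can be written as follows. The notation is introduced here for illustration and is not from the original post: f is the model’s score function, and x⁺ and x⁻ are a randomly drawn positive and negative sample (ties are ignored).

```latex
\mathrm{AUC}
  = P\big(f(x^{+}) > f(x^{-})\big)
  = P\big(f(x^{+}) - f(x^{-}) > 0\big)
```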

Thus, we can simply use a Siamese architecture, where each stream will contain our target model, trained on positive-negative pairs. In our architecture, the scores will be subtracted and passed through a sigmoid activation in order to approximate the probability of the positive sample having a higher score than the negative sample.
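As a rough sketch of the output layer just described, the snippet below subtracts the two stream scores, applies a sigmoid, and evaluates the cross-entropy against a target of 1. It uses plain NumPy rather than Keras so it stays self-contained; the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_probability(score_pos, score_neg):
    """Approximate P(positive sample scores higher than negative sample)."""
    return sigmoid(np.asarray(score_pos, float) - np.asarray(score_neg, float))

def pairwise_loss(score_pos, score_neg):
    """Mean cross-entropy with target 1, i.e. -log(sigmoid(s_pos - s_neg))."""
    diff = np.asarray(score_pos, float) - np.asarray(score_neg, float)
    # log(1 + exp(-diff)) computed in a numerically stable way.
    return np.logaddexp(0.0, -diff).mean()

good = pairwise_loss([2.0], [-1.0])   # positive ranked above negative
bad = pairwise_loss([-1.0], [2.0])    # positive ranked below negative
```

The loss is small when the positive stream scores well above the negative one and grows roughly linearly when the order is reversed.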

To generate our training batches, each pair takes one sample from each class, one on the negative stream and one on the positive stream. The ground truth is therefore always 1, since the score from the positive stream should always be higher than the score from the negative stream. The model is trained by minimizing the cross-entropy loss of this pairwise target. We won’t converge to a naive solution here, given that the weights on each stream of a siamese network are shared.

This means the model is penalized whenever the negative side has a greater score than the positive side. As such, we are optimizing the model so that it always gives a higher score to a positive sample (input_pos) than to a negative sample (input_neg), which is essentially the definition of optimizing ROC AUC!
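The training procedure can be illustrated end to end on toy data. The sketch below is not the Keras model from the experiments: a linear scorer stands in for the deep network, the 2-D Gaussian data is synthetic, and all names and hyperparameters (batch size, learning rate, number of steps) are assumptions chosen only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, unbalanced toy data (hypothetical): 5% positives, shifted mean.
n_neg, n_pos = 950, 50
x_neg = rng.normal(0.0, 1.0, size=(n_neg, 2))
x_pos = rng.normal(1.5, 1.0, size=(n_pos, 2))

def score(x, w):
    """Shared scorer: the same weights w are used on both streams."""
    return x @ w

def sample_pairs(batch_size):
    """Each pair has one positive and one negative; the target is always 1."""
    i = rng.integers(0, n_pos, size=batch_size)
    j = rng.integers(0, n_neg, size=batch_size)
    return x_pos[i], x_neg[j]

w = np.zeros(2)
lr = 0.1
for step in range(500):
    xp, xn = sample_pairs(64)
    diff = score(xp, w) - score(xn, w)
    # Derivative of -log(sigmoid(diff)) with respect to diff.
    grad_diff = -1.0 / (1.0 + np.exp(diff))
    # Chain rule through the shared linear scorer.
    grad_w = (grad_diff[:, None] * (xp - xn)).mean(axis=0)
    w -= lr * grad_w

# Fraction of all positive-negative pairs ranked correctly (the ROC AUC).
all_diff = score(x_pos, w)[:, None] - score(x_neg, w)[None, :]
auc = (all_diff > 0).mean()
```

Because both streams share `w`, the only way to lower the loss is to learn a scorer that separates the classes; a stream-specific bias cannot inflate the positive score.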

So, how do we turn this network into an actionable model that returns binary classes? We reduce our network to a single stream, keeping the pre-trained weights, and determine a threshold on the predicted score above which the model classifies a sample as positive.
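The post does not say how the threshold is chosen. One possible approach, shown here purely as an assumption, is to pick the score threshold that maximizes TPR minus FPR (Youden's J statistic) on a validation set scored by the single-stream model. The validation scores and labels below are made up for the example.

```python
import numpy as np

def pick_threshold(y_true, scores):
    """Return the score threshold that maximizes TPR - FPR (Youden's J)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):  # candidate thresholds, in ascending order
        pred = scores >= t
        tpr = (pred & y_true).sum() / y_true.sum()
        fpr = (pred & ~y_true).sum() / (~y_true).sum()
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t

# Hypothetical single-stream scores on a small validation set.
val_scores = np.array([-2.0, -1.0, -0.5, 0.3, 1.2, 2.5])
val_labels = np.array([0, 0, 0, 1, 0, 1])
threshold = pick_threshold(val_labels, val_scores)
predictions = (val_scores >= threshold).astype(int)
```

Other criteria (a target recall, a cost-weighted rule) would fit the same single-stream setup; the right choice depends on the application.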


This architecture was tested on the CIFAR10 dataset in Keras. The positive class was considered to be “airplanes”, and the negative class was all the other classes in the dataset. The positive class was then subsampled to 5%, creating an artificially unbalanced problem.

Afterward, we used a feedforward network with internal dropout layers and compared the standard cross-entropy-based strategy for training classifiers against the siamese-based models discussed in this post. The experiment was repeated 5 times with different random seeds, so that the mean value depends less on the image selection process.

The mean ROC AUC for the siamese network was (86 ± 1.3)%, while for the one-stream network it was (72.2 ± 7.2)%, showing that optimizing the model with the siamese architecture was beneficial for the ROC AUC.
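A subsampling step of this kind might look like the sketch below. It reads “subsampled to 5%” as keeping 5% of the positive samples, which is an assumption; the original code is not shown in the post. Random labels stand in for the real CIFAR10 labels so the example runs without downloading the dataset, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for CIFAR10 labels (10 classes); the real labels would come from Keras.
labels = rng.integers(0, 10, size=10_000)
POSITIVE_CLASS = 0     # "airplane" is class 0 in CIFAR10's label order
KEEP_FRACTION = 0.05   # assumed meaning of "subsampled to 5%"

is_pos = labels == POSITIVE_CLASS
pos_idx = np.flatnonzero(is_pos)
neg_idx = np.flatnonzero(~is_pos)

# Keep a random 5% of the positives and all of the negatives.
n_keep = max(1, int(round(KEEP_FRACTION * len(pos_idx))))
kept_pos = rng.choice(pos_idx, size=n_keep, replace=False)
keep_idx = np.sort(np.concatenate([kept_pos, neg_idx]))

# Binary target: 1 for airplanes, 0 for every other class.
y_binary = (labels[keep_idx] == POSITIVE_CLASS).astype(int)
```

The same `keep_idx` would be used to index the image array so that images and labels stay aligned.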


This blog post explained how to optimize your model for a different metric, the ROC AUC, based on its probabilistic interpretation.

At NILG.AI, we have worked on a lot of medical and marketing applications, where targets tend to be extremely unbalanced. We have used this strategy in several projects, achieving in each case higher performance than with traditional learning approaches. If you are facing a similar problem, let’s discuss how we can collaborate using these types of learning approaches!
