Machine Learning and Artificial Intelligence algorithms are now applied in almost every industry, embedded in value chains that depend on their decisions. However, despite continuous advances in the state of the art, these algorithms are still not perfect and can make mistakes in critical situations. Each mistake might be caused by several factors, for example:
- Data points that are too close to a decision boundary – In real-world datasets, decision boundaries might be hard to define. For data points close to the boundary, the model returns predictions with low confidence, which might lead to misclassification.
- Outliers – If a data point doesn’t belong to any population seen in the training set, it’s hard for the model to make an inference about it.
- Missing data – There are several ways to deal with missing data, either by imputation or by adding “missing data” flags. In both cases, we are making assumptions about the data that might not be true, so the inferences about those data points should have a lower confidence level.
Low confidence might lead to misclassifications, but is that really the model’s fault? When we ask an algorithm for its prediction about a data point, we force it to return an answer, even if it doesn’t know it. For several use cases (see some examples in “Applications”), it’s beneficial to give the model the option to remain silent: if the algorithm is not confident enough, it can reject the data point and avoid making a mistake. This is Classification with Reject Option.
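As a minimal sketch of the idea, assuming the model outputs one probability per class, a prediction can be rejected whenever its highest probability falls below a threshold. The function name, the `REJECT` marker, and the threshold value are illustrative assumptions, not part of any specific library.

```python
import numpy as np

REJECT = -1  # hypothetical label used here to mark a rejected data point


def predict_with_reject(probs, threshold):
    """Return the most likely class per row, or REJECT when the
    highest class probability is below the threshold."""
    probs = np.asarray(probs, dtype=float)
    labels = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    return np.where(confidence >= threshold, labels, REJECT)


# Three data points of a binary problem; the second one is uncertain.
probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]])
decisions = predict_with_reject(probs, threshold=0.8)
print(decisions.tolist())  # [0, -1, 1]
```

The rejected point (marked -1) would then be passed on to another decision system or simply left without a prediction.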
This approach is only applicable when it is possible to pass on the decision to another available decision system (e.g. another algorithm, a specialist, exams, or tests) or when there’s no need to return predictions for the entire dataset. In other words, apply the Reject Option when the cost of rejecting an instance is lower than the cost of an error. Here are a couple of examples of applications that can benefit from Classification with Reject Option:
- Decision Support Systems for Medical Diagnosis – there are few things riskier than a misdiagnosis, especially for lethal diseases. Therefore, delegating that task to an algorithm and making it responsible for the decision is hardly accepted by the medical community, since it raises many reliability concerns. For that reason, it is difficult to integrate AI algorithms into the screening workflow of diseases. However, the intention of including AI in healthcare is to help specialists, not to replace them. Using a Classification with Reject Option approach, the algorithm returns its predictions for the cases where it is highly confident and passes on the decision to a specialist when the confidence levels are lower. This way, the algorithm helps the specialist by relieving them of a significant workload.
- Image-based Classification in Videos – In some applications, such as object detection, action recognition, video summarization, or face recognition, not every frame is relevant: videos typically run at 30 FPS and, most of the time, one good frame is enough to trigger a decision. In these cases, instead of returning frame-wise predictions with a lot of noise and uncertainty, the models could use the Reject Option and return only the predictions with high confidence.
Now that we have seen how Classification with Reject Option can help us in critical use cases, let’s explore how we can integrate it into our model implementations.
Method 0 – Threshold Optimization
The simplest way to integrate the Reject Option in a Decision Support System is to post-process the model results, considering the confidence level and the performance goal. For example, if the acceptable performance is an average accuracy above 95%, you can optimize the confidence threshold for each class. To do so, follow these steps:
- Compute the predictions for your validation dataset.
- For each class, iterate through the predictions corresponding to that class, from the lowest to the highest.
- Use each prediction in turn as the value of your threshold.
- After applying the threshold, compute the metric of interest (in this case, the average accuracy).
- Once you achieve the average accuracy of 95%, you have found the optimized threshold.
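The steps above can be sketched roughly as follows. This is one possible reading, not a definitive implementation: it assumes the model returns class probabilities, and it uses the accuracy of the accepted predictions of each class as the metric of interest. The function name and the toy data are hypothetical.

```python
import numpy as np


def find_class_thresholds(probs, y_true, target=0.95):
    """For each class, return the lowest confidence threshold whose
    accepted predictions reach the target accuracy (None if none does)."""
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true)
    y_pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    thresholds = {}
    for c in range(probs.shape[1]):
        predicted_c = y_pred == c
        thresholds[c] = None
        # Try each confidence score of this class as a threshold, lowest first.
        for t in np.sort(conf[predicted_c]):
            accepted = predicted_c & (conf >= t)
            accuracy = np.mean(y_true[accepted] == c)
            if accuracy >= target:
                thresholds[c] = float(t)
                break
    return thresholds


# Toy validation set (made-up numbers, for illustration only).
val_probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4],
             [0.3, 0.7], [0.45, 0.55], [0.2, 0.8]]
val_labels = [0, 0, 1, 1, 0, 1]
thresholds = find_class_thresholds(val_probs, val_labels, target=0.95)
print(thresholds)  # {0: 0.8, 1: 0.7}
```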
To avoid overfitting to the validation set, apply cross-validation and aggregate the thresholds found on each fold with a sample statistic such as the mean, median, or mode.
Despite being easy to implement, this method has some limitations. First, it’s hard to control the amount of data that is ignored by the post-processing. In the limit, this method can reach perfect metrics by ignoring all the data, so you will need extra mechanisms to prevent that from happening in your optimization. Second, since this method is applied after getting the predictions, the model doesn’t learn how the feature space is related to data rejection. To overcome these limitations, we present the next three methods found in the literature.
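One possible guard against the degenerate "reject everything" solution is to track coverage (the fraction of data that is not rejected) next to accuracy and to require a minimum coverage. The sketch below assumes you already have a confidence score and a correct/incorrect flag per validation point; `min_coverage` and the function names are assumptions introduced for illustration.

```python
import numpy as np


def accuracy_coverage_curve(conf, correct, thresholds):
    """Return (threshold, coverage, accuracy on accepted points) per threshold."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    rows = []
    for t in thresholds:
        accepted = conf >= t
        coverage = accepted.mean()
        # Accuracy is undefined when nothing is accepted.
        accuracy = correct[accepted].mean() if accepted.any() else float("nan")
        rows.append((float(t), float(coverage), float(accuracy)))
    return rows


def pick_threshold(rows, target_accuracy, min_coverage):
    """Lowest threshold that meets the accuracy target while keeping
    at least min_coverage of the data; None if no threshold qualifies."""
    for t, coverage, accuracy in sorted(rows):
        if coverage >= min_coverage and accuracy >= target_accuracy:
            return t
    return None


# Toy validation results (made-up numbers, for illustration only).
conf = [0.9, 0.8, 0.6, 0.7, 0.55, 0.8]
correct = [1, 1, 0, 1, 0, 1]
rows = accuracy_coverage_curve(conf, correct, thresholds=[0.5, 0.7, 0.85])
print(pick_threshold(rows, target_accuracy=0.95, min_coverage=0.5))  # 0.7
print(pick_threshold(rows, target_accuracy=0.95, min_coverage=0.9))  # None
```

In the second call no threshold satisfies both constraints, which is a useful signal that the accuracy target cannot be met without discarding more data than you are willing to.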
Method 1 – Adding a Rejection Class
This method was explored by Sousa et al. for a binary problem. Their solution included the following steps:
- Define a value (random or not) as the initial threshold.
- Compute the ratio of rejected instances (R = number of rejected instances / total number of instances in the dataset) and the ratio of misclassified data points (E = number of misclassified instances / total number of instances in the dataset).
- Compute the empirical risk Ȓ using the equation Ȓ = ωR + E, where ω is the rejection cost, R is the ratio of rejected instances, and E is the ratio of misclassified instances.
- Repeat steps 1, 2, and 3 for a set of thresholds.
- Select the threshold that minimizes Ȓ.
- Create the Rejection Class, and re-label with that class every data point whose prediction is under the threshold value.
- Train a new model for the three-class problem.
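The threshold selection and re-labeling steps above can be sketched as follows. This is only an illustration under stated assumptions: confidence scores and correctness flags come from an already trained first model, the value of the rejection cost is made up, and `REJECTION_CLASS = 2` is a hypothetical label for the new class of a binary problem. Training the second model is left out.

```python
import numpy as np

REJECTION_CLASS = 2  # hypothetical label for the added class (binary problem)


def empirical_risk(conf, correct, threshold, omega):
    """R_hat = omega * R + E, with R and E measured over the whole dataset."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    rejected = conf < threshold
    R = rejected.mean()                  # ratio of rejected instances
    E = (~rejected & ~correct).mean()    # ratio of accepted but misclassified
    return omega * R + E


def best_threshold(conf, correct, candidates, omega):
    """Candidate threshold with the lowest empirical risk."""
    risks = [empirical_risk(conf, correct, t, omega) for t in candidates]
    return candidates[int(np.argmin(risks))]


# Toy outputs of the first model (made-up numbers, for illustration only).
conf = np.array([0.9, 0.8, 0.6, 0.7, 0.55, 0.8])
correct = [1, 1, 0, 1, 0, 1]
labels = np.array([0, 0, 1, 1, 0, 1])

best = best_threshold(conf, correct, candidates=[0.5, 0.7, 0.85], omega=0.3)
relabeled = np.where(conf < best, REJECTION_CLASS, labels)
print(best, relabeled.tolist())  # 0.7 [0, 0, 2, 1, 2, 1]
```

A higher rejection cost pushes the selected threshold down, since rejecting becomes less attractive than risking an error.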
A weakness of this method is that it needs two different training sets: one for the first model and a second one to be re-labeled and used to train the second model. If you’re dealing with small amounts of data, you might compromise the model performance by using only half of it for each model.
Method 2 – Class-Specialized Models
This method was also presented by Sousa et al. for a binary problem. However, like the previous method, it can be adapted to the multi-class problem.
The implementation of this solution integrated the following steps:
- Define the Rejection Cost for each model, considering the context of your use case and the real-life costs. For example, if rejecting a sample implies that a specialist has to analyze it later, consider the duration of the task and the man-hour value.
- Train the first model to become specialized in class 0, i.e. maximizing the precision for class 0.
- Train the second model to become specialized in class 1, i.e. maximizing the precision for class 1.
- Compute the predictions for the test dataset, for model 1 and for model 2.
- For each data point, if the predictions match, classify the instance with the corresponding class; otherwise, classify it as “rejected”.
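The final combination step can be sketched as below, assuming the class-specialized models are already trained and each one returns a class label per data point. The `REJECTED` marker and the function name are hypothetical. The same check works for any number of models, since it only asks whether all of them agree.

```python
import numpy as np

REJECTED = -1  # hypothetical marker for a rejected data point


def combine_by_unanimity(predictions):
    """predictions has shape (n_models, n_samples); keep a label only
    when every model predicts the same class for that data point."""
    predictions = np.asarray(predictions)
    unanimous = (predictions == predictions[0]).all(axis=0)
    return np.where(unanimous, predictions[0], REJECTED)


# Predictions of the two specialized models on four test points (made up).
model_0_preds = [0, 1, 0, 1]
model_1_preds = [0, 1, 1, 1]
combined = combine_by_unanimity([model_0_preds, model_1_preds])
print(combined.tolist())  # [0, 1, -1, 1]
```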
To extend this method to a multi-class problem, you must train a different model for each class and then combine the predictions of all the models to check whether they are unanimous; if they are not, the data point is rejected. This means the computation scales with the number of classes in the problem, which makes it impractical for datasets with a large number of classes, such as ImageNet (1000 classes).
Method 3 – Regularization Through Loss Function
The fourth and last method was proposed by Geifman and El-Yaniv with the novel Deep Neural Network (DNN) architecture “SelectiveNet”. SelectiveNet can be adapted to any DNN by adding an extra task to the model for data selection. The selection task is self-supervised: there is no ground truth for it, but its output is still shaped by the loss function.
The loss function then has two terms: one that penalizes misclassifications on the data points that were not rejected by the model, and a second that penalizes the rejection itself, to avoid rejecting most of the data.
Additionally, the authors suggest adding an auxiliary task, which can be the same as the classification task or a different one, as long as it doesn’t ignore any data point. The purpose of this task is to force the model to learn the entire feature space represented by the available data and to learn the relation between the feature space and the rejection. Adding the auxiliary task adds a third term to the loss function, whose impact is weighted by a parameter.
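To make the three terms concrete, here is a small NumPy sketch of a loss with this structure. It follows my reading of the objective described above and should be treated as an assumption rather than the authors' reference implementation: the selective risk is the cross-entropy weighted by the selection output and normalized by coverage, the rejection penalty is a squared hinge on the gap to a target coverage, and the auxiliary term is a plain cross-entropy over all points. The parameter names (`target_coverage`, `lam`, `alpha`) and their default values are illustrative.

```python
import numpy as np


def cross_entropy(probs, y):
    """Per-sample cross-entropy for class probabilities and integer labels."""
    probs = np.asarray(probs, dtype=float)
    y = np.asarray(y)
    return -np.log(probs[np.arange(len(y)), y] + 1e-12)


def selective_loss(pred_probs, aux_probs, select, y,
                   target_coverage=0.8, lam=32.0, alpha=0.5):
    """Loss with three terms: selective risk, coverage penalty, auxiliary risk.
    `select` holds the selection output per sample, in [0, 1] (1 = accept)."""
    select = np.asarray(select, dtype=float)
    coverage = select.mean()
    # Term 1: errors count only on (softly) accepted points.
    selective_risk = (cross_entropy(pred_probs, y) * select).mean() / max(coverage, 1e-12)
    # Term 2: penalize coverage that drops below the target.
    coverage_penalty = lam * max(0.0, target_coverage - coverage) ** 2
    # Term 3: auxiliary task sees every data point, accepted or not.
    auxiliary_risk = cross_entropy(aux_probs, y).mean()
    return alpha * (selective_risk + coverage_penalty) + (1 - alpha) * auxiliary_risk


# Toy batch of two points (made-up numbers, for illustration only).
probs = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1]
loss_accept_all = selective_loss(probs, probs, select=[1.0, 1.0], y=labels)
loss_reject_all = selective_loss(probs, probs, select=[0.0, 0.0], y=labels)
print(loss_accept_all < loss_reject_all)  # True
```

Rejecting the whole batch drives the first term to zero but is dominated by the coverage penalty, which is how this kind of loss discourages massive rejection.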
Of all the methods, this seems the most practical: it is easy to implement, it doesn’t require an extra data partition to optimize the thresholds, and it doesn’t cause a significant increase in the computation cost.
Reject Option methods are useful to increase the trustworthiness of Machine Learning methods and to avoid mismanagement in critical situations. However, they are not applicable to every use case. When a data point is rejected by the model and it can’t be ignored, someone or something has to handle it, and that option might not be available. Once again, the key to a successful AI system lies in understanding the problem, identifying the strengths and limitations of each possible method, and designing a solution that fits the problem and its context.
Sousa, Ricardo Gamelas, et al. “Robust classification with reject option using the self-organizing map.” Neural Computing and Applications 26.7 (2015): 1603-1619.
Geifman, Yonatan, and Ran El-Yaniv. “SelectiveNet: A deep neural network with an integrated reject option.” International Conference on Machine Learning. PMLR, 2019.