Introduction and Motivation

Health plans use insurance codes to decide how much your doctor and other healthcare providers should be paid. Several coding systems are currently in use [1]:

  • Current Procedural Terminology (CPT) codes, used by physicians to describe the services they provide.
  • Healthcare Common Procedure Coding System (HCPCS), used by Medicare. It is subdivided into level I codes (identical to CPT codes) and level II codes. The latter identify products, supplies, and services not included in the CPT codes (e.g. prosthetics and ambulance services).
  • International Classification of Diseases (ICD), developed by the World Health Organization (WHO) to identify the patient’s health condition or diagnosis. These codes are typically combined with CPT codes to make sure that the patient’s health condition and the services received match (i.e., that billing is consistent with the documented diagnosis).

After a given procedure, healthcare professionals list the procedure’s code in an insurance claim form so that the hospital is partially or fully reimbursed for that procedure.

However, not all submitted claims correspond to the procedures that were actually performed, for instance because of fraud or submission errors.

Here’s an example (adapted from [1]): if you fall, sprain your ankle and go to the emergency room, you might end up getting an X-ray of the ankle. If the healthcare professionals mistakenly label the ankle X-ray as an elbow X-ray but still give you a diagnosis of sprained ankle, the procedure and the diagnosis are not consistent, and the insurance claim might end up being rejected.

Errors like this can have monetary consequences for several stakeholders. What are some of the resulting issues, from the point of view of each negatively affected party?

So, how can we use AI to assist in this area?

Modeling Anomaly Detection in Insurance Claims

We will show you how to solve this problem using multiple techniques, including supervised, unsupervised and weakly supervised approaches. To learn more, enroll in our online course, where we discuss all of these concepts in depth.

The Machine Learning Spectrum course

What to do?

We can identify problems in insurance claim submissions at a certain level of granularity:

  • A group of claim codes does not make sense.
  • A group of claim codes does not make sense because the healthcare professional clicked the adjacent code in the interface or used a code from the wrong category (since the codes have a certain hierarchy, e.g. local and general anesthesia).

This use case of insurance claim error detection applies to both hospitals and insurance companies, as both are interested in knowing which claims are incorrect. The possible options are to identify a claim as wrong, correct an error in its codes, and/or explain why or where the error occurred.

We will now start using letters to refer to the claim codes, as a simplification. The following figure represents possible inputs and outputs of an insurance claim model.


Regarding the third case, the claim sent by the hospital can contain superfluous codes. An insurance company will want the claim to contain only the codes that are strictly necessary, so the model should delete the unnecessary ones.

For the fourth case, a claim sent by a hospital can be missing codes for some procedures (e.g. by mistake or for lack of knowledge of a specific code). A model applied in a hospital should be able to add codes when they are missing.

What are the restrictions? 

A model used to detect errors in insurance claims should be invariant to the order in which the healthcare professional places the codes (e.g. A-B-C or C-A-B). We can add this invariance in different ways:

  • Augmenting the input data with random shuffling
  • Ordering both input and outputs in alphabetical/numerical order. 
  • Using a position-invariant representation. For instance, since claim sequences can be considered as text, we could use Bag-of-Words for counting the presence of claims regardless of their order. 
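The third option can be sketched in a few lines. This is only an illustration: the function names and the three-letter vocabulary are assumptions made for the example, and single letters stand in for real claim codes.

```python
from collections import Counter

# Hypothetical vocabulary of claim codes (letters stand in for real codes).
VOCABULARY = ["A", "B", "C"]

def bag_of_codes(claim):
    """Count how often each code appears, ignoring the order of the codes."""
    return Counter(claim)

def to_vector(claim, vocabulary=VOCABULARY):
    """Turn a claim into a fixed-length count vector, one entry per known code."""
    counts = bag_of_codes(claim)
    return [counts[code] for code in vocabulary]

# The same codes in a different order give the same representation.
print(to_vector("ABC") == to_vector("CAB"))
```

A model that only sees these count vectors cannot depend on the order in which the codes were entered, at the price of discarding any information the order might carry.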

How can we do this? 

There are several approaches we can take to this problem, depending on the number of labels and the amount of data available. We will give some examples of how to approach it in a supervised and an unsupervised way, with a more detailed focus on the unsupervised approach.

Available data

  • Claim Code
  • Date of claim
  • Possibly: Result (used as a label)

Assumed labels

  • Positive: Wrong claim. The insurance company reported issues with a certain hospital’s claim, and the hospital conceded the error.
  • Negative: Cases in which the insurance company found no errors in the claim, did not want to spend time and money on legal proceedings for that claim, or failed to detect an existing error. As such, the negative labels are actually a mix of positive and negative cases.
  • None: Remaining claims, which have not been evaluated yet.

Two major learning mechanisms can be used: 

Supervised Approaches

If we have both positive and negative labels, this is a classical supervised learning problem framed as binary classification. We can then manually extract features from the codes, such as the co-occurrence of code pairs, or use a deep model (e.g. an RNN) to infer the relationships between codes directly from the input.
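As a rough sketch of the hand-crafted option, code-pair co-occurrence features could be extracted as follows. The function name is an assumption for this example, and single letters again stand in for real claim codes.

```python
from collections import Counter
from itertools import combinations

def pair_features(claim):
    """Count every unordered pair of codes that appear together in a claim.

    Sorting first makes the features independent of the order in which
    the codes were entered.
    """
    codes = sorted(claim)
    return Counter(combinations(codes, 2))

features = pair_features("AAGKM")
print(features[("A", "G")])  # how often A and G co-occur in this claim
```

These counts can then be fed to any standard binary classifier together with the positive and negative labels.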

However, as the negative labels can also contain positive cases, we can instead treat this problem as weakly supervised and use Positive-Unlabeled learning (PU learning), in which the class that is not positive is considered to contain both negative and positive examples (a mixed set). Several PU learning algorithms can be used, some of which are described in the literature [2].

Unsupervised approaches

If there are no labels available at all, we then need to follow an unsupervised approach. We will describe a few examples next:

     (i) Code embeddings

We can train a word2vec model that, given two codes, estimates the most likely adjacent code. Note that this adds position invariance.

This way, we can train code embeddings (similar to word embeddings) which learn the relationships between different codes. Then, we check whether a given code is the one whose embedding has the smallest distance to its neighbours. If not, we replace it with the code that does.

This approach is more error-prone, as codes for common operations can have similar embedding distances.
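The neighbour check can be illustrated with a small sketch. The 2-D vectors below are hand-written for illustration only and are not the output of a trained word2vec model; the function names are also assumptions.

```python
import math

# Hypothetical embeddings, invented for this example (not trained).
EMBEDDINGS = {
    "A": (1.0, 0.0),
    "G": (0.9, 0.1),
    "K": (0.8, 0.2),
    "D": (0.85, 0.15),
    "M": (0.0, 1.0),
}

def mean_distance(code, context):
    """Average Euclidean distance between a code and the other codes in the claim."""
    return sum(math.dist(EMBEDDINGS[code], EMBEDDINGS[c]) for c in context) / len(context)

def suggest_code(claim, position):
    """Return the known code that lies closest, on average, to the rest of the claim."""
    context = [c for i, c in enumerate(claim) if i != position]
    return min(EMBEDDINGS, key=lambda code: mean_distance(code, context))

claim = "AAGKM"
suggestion = suggest_code(claim, 4)
if suggestion != claim[4]:
    print(f"Position 4: consider replacing {claim[4]} with {suggestion}")
```

With these toy vectors several codes sit almost equally close to the rest of the claim, so the suggestion is not necessarily the intended correction, which is the weakness mentioned above.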

     (ii) Generative model

Using a generative model, we can fit a density function to our claims. The most common claims will be close to each other in the resulting probability space. Examples of such models are variational autoencoders and Gaussian mixture models. We can then estimate the probability of each claim being an outlier.

     (iii) Seq2Seq inspired: reconstructing correct sequences

With a denoising autoencoder, we try to reconstruct a given claim sequence: we add noise to it, and the model learns to identify what is wrong and to correct it. We can then calculate a reconstruction error, which tells us that the claim should contain more of a certain code and less of another. This gives us an informative explanation of what is wrong in the claim.

     (iv) Seq2Seq inspired: probability of a sequence being wrong

Alternatively, we can have a single model which tells us the probability of each element being wrong (and therefore, we know the probability of the whole claim being wrong).

To do this, we randomly add label noise by adding, removing and swapping codes in the claims. We then train an autoencoder with a sigmoid output layer that gives the probability of each code being wrong.
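The noise injection step could look roughly like the sketch below. It assumes a non-empty claim and a vocabulary with at least two codes, and it interprets "swapping" as replacing one code with a different one; the function name and the returned operation name are illustrative choices, not part of the original method.

```python
import random

def corrupt(claim, vocabulary, rng):
    """Apply one random edit to a claim and label the codes the edit introduced.

    Returns (noisy_claim, labels, operation); labels[i] == 1 marks a code that
    was inserted or substituted. A removal leaves no position to mark, so its
    labels are all 0.
    """
    codes = list(claim)
    labels = [0] * len(codes)
    # Removing the only code would leave an empty claim, so skip "remove" then.
    operations = ["add", "swap"] if len(codes) < 2 else ["add", "remove", "swap"]
    operation = rng.choice(operations)
    if operation == "add":
        position = rng.randrange(len(codes) + 1)
        codes.insert(position, rng.choice(vocabulary))
        labels.insert(position, 1)
    elif operation == "remove":
        position = rng.randrange(len(codes))
        del codes[position]
        del labels[position]
    else:  # swap: substitute one code with a different one
        position = rng.randrange(len(codes))
        alternatives = [c for c in vocabulary if c != codes[position]]
        codes[position] = rng.choice(alternatives)
        labels[position] = 1
    return "".join(codes), labels, operation

rng = random.Random(0)  # fixed seed so the example is reproducible
print(corrupt("AAGKD", list("ABDGKM"), rng))
```

The pairs of noisy claims and labels produced this way are the training targets for the per-code sigmoid output.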

Since the model outputs probabilities, we have a higher degree of confidence in it (and can measure its uncertainty), and we can decide better which claims to analyze manually. On the other hand, we know that a certain code has a high probability of being wrong, but we don’t know whether it should be deleted or whether something should be added.

To solve this issue, we could extend the network with three extra tasks: the probability that a claim code is wrong because it needs to be edited, deleted or added.

    (v) Mixed generative and reconstructive model

We can also have a generative model which shares weights with a denoising autoencoder (or another reconstructive model). This way, the generative model tells us which claim is wrong, and the reconstructive model tells us why it is wrong (i.e., which part of the sequence is wrong) and also returns the corrected sequence.

What to do with model results?

So, how can we act on the output of our model? Let us assume we are an insurance company with these two tools:

  • Probability of the claim being wrong
  • Suggestions of what is wrong

If we want to select N cases to manually evaluate, how could we optimize this to determine which are the most cost-effective claims?

The insurance company incurs certain costs in this procedure. Consider an example claim with the codes AAGKM, which should have been AAGKD. Each code is a procedure/item with a certain cost.

Code   Cost (€)
A      5
B      50
K      1000
M      200
G      300
D      100

AAGKM = 5*2 + 300 + 1000 + 200 = 1510 €

AAGKD = 5*2 + 300 + 1000 + 100 = 1410 €
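The same sums can be checked with a short sketch that uses the cost table above; the function name is an assumption for this example.

```python
# Cost per code in euros, copied from the table above.
PRICES = {"A": 5, "B": 50, "K": 1000, "M": 200, "G": 300, "D": 100}

def claim_price(claim):
    """Total price of a claim: the sum of the costs of its codes."""
    return sum(PRICES[code] for code in claim)

original = claim_price("AAGKM")
corrected = claim_price("AAGKD")
print(original, corrected, original - corrected)
```

The difference between the two totals is the amount the insurance company would overpay if the error went unnoticed.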

Positive cases are fraud cases, which we want to manually evaluate. 

If we are applying this model in an insurance company, we want to maximize both True Positives (TP) and True Negatives (TN). By maximizing True Negatives, we save analysis time, and by maximizing True Positives, we are reducing the number of cases which the insurance company should not be paying, but actually is. 

On the other hand, if we apply this in a hospital, we want to minimize FP (cases which are flagged as fraud but are not, costing man-hours to evaluate manually) and FN (cases which are flagged as negative but are actually fraud, costing money due to errors).

How can we optimize this for an insurance company?

There is a certain cost associated with correcting something in a claim, and a price difference between the reconstructed claim and the original claim. 

For each claim X, we can calculate a score and choose the claims with the N highest scores as the ones to evaluate manually.

This score is composed of two terms. The first term is the expected value in case fraud is detected: we multiply the probability of fraud by the money the insurance company saves if fraud is detected. This saving is the original claim price minus the price of the corrected claim, minus the man-hour cost required for correcting that claim manually.

In the second term, we multiply the probability of no fraud by the man-hour cost required for analyzing that claim, because even if there is no fraud, analyzing the claim manually has a cost.

     \begin{eqnarray*} Score(X) &=& Prob(Fraud) \times \left( Price(X) - PriceCorrectedClaim(X) - ManHourRate(Corrected(X) - X) \right) \\ && - (1 - Prob(Fraud)) \times ManHourRate(Corrected(X) - X) \end{eqnarray*}

which simplifies to:

(1)    \begin{equation*} Score(X) = Prob(Fraud) \times \left( Price(X) - PriceCorrectedClaim(X) \right) - ManHourRate(Corrected(X) - X) \end{equation*}

So, for the above example, suppose the model corrects the sequence AAGKM (1510 €) to AAGKD (5*2 + 300 + 1000 + 100 = 1410 €) with 90% confidence that the original is anomalous, and assume a fixed cost of 5 € per claim analysis:

Score(AAGKM) = 0.9 x (1510 - 1410) - 5 = 85
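To make the selection of the N claims concrete, here is a minimal sketch of the score in its simplified form (probability of fraud times the price difference between the original and the corrected claim, minus the cost of the manual review) followed by a ranking. The function names, the dictionary layout and the second claim are assumptions invented for illustration.

```python
def review_score(prob_fraud, price_original, price_corrected, review_cost):
    """Expected gain of manually reviewing a claim, in euros."""
    return prob_fraud * (price_original - price_corrected) - review_cost

def top_n(claims, n):
    """Return the n claims with the highest review score."""
    def score(claim):
        return review_score(
            claim["prob_fraud"],
            claim["price_original"],
            claim["price_corrected"],
            claim["review_cost"],
        )
    return sorted(claims, key=score, reverse=True)[:n]

# The first entry is the example from the text; the second one is made up.
CLAIMS = [
    {"codes": "AAGKM", "prob_fraud": 0.9, "price_original": 1510,
     "price_corrected": 1410, "review_cost": 5},
    {"codes": "ABK", "prob_fraud": 0.2, "price_original": 1055,
     "price_corrected": 1005, "review_cost": 5},
]

for claim in top_n(CLAIMS, 1):
    print(claim["codes"])
```

A claim with a negative score is expected to cost more to review than it would recover, so it is a natural candidate to skip.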


In this blog post, we presented the problem of automatically detecting errors/anomalies in insurance claims, a use case which can affect several stakeholders: patients, hospitals and insurance companies.

This problem can be tackled in a supervised or unsupervised way, depending on the available data. Even with no labels available, it is possible to create an interpretable and actionable model for optimizing the process of manually reviewing claims.

Let us know if you have any more ideas for solving this issue!


[2] Sansone, E., De Natale, F. G., & Zhou, Z. H. (2018). Efficient training for positive unlabeled learning. IEEE transactions on pattern analysis and machine intelligence, 41(11), 2584-2598.

[3] Photo by Andrea Piacquadio from Pexels
