An Introduction to Multiple Instance Learning

Multiple Instance Learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag rather than for the individual instances. This makes it possible to leverage weakly labeled data, which is present in many business problems because labeling data is often costly:

  • Medical Imaging: Computer-aided diagnosis systems can be trained on medical images for which only the patient-level diagnosis is available, rather than local annotations of the diseased regions.
  • Video/Audio: Tags are often only available for the whole video or audio recording, while it is relevant to know when the tagged event happens (e.g. this video contains a cat and a human).
  • Text: Document classification, where you want to know, for instance, whether a certain website (composed of several web pages) is about one specific topic. The website will contain multiple pages with irrelevant information where that topic is not present.
  • Marketing: Marketing campaigns are often sent to a group of people, and it’s not clear which person was impacted by them.
  • Time Series: In some industry cases, such as gas or water metering, you know the total amount consumed per month and might want to estimate it at a more granular level (e.g. per day).


Would you like to know more about it?

The literature mostly focuses on applications of MIL to classification. There are also applications of MIL to regression, ranking, and clustering, which are not covered here. For resources on them, please refer to this review paper.

Besides this blog post, we also have an online course where we discuss Multiple Instance Learning in depth: how to implement it, common errors and how to avoid them, and some practical examples from our consulting practice. You will also learn about other techniques, such as Semi-Supervised Learning and Self-Supervised Learning.


The Machine Learning Spectrum

Master Multiple Instance Learning and several other techniques in our course.

Learn More


Under the standard MIL assumption, negative bags contain only negative instances, while positive bags contain at least one positive instance. Positive instances are referred to in the literature as witnesses.

An intuitive example of MIL is a situation where several people each have a keychain holding several keys. Some of these people are able to enter a certain room, and some aren’t. The task is then to predict whether a certain key or a certain keychain can get you into that room.

To solve this, we need to find the exact key that is common to all the “positive” keychains (the green key). We can then correctly classify an entire keychain: positive if it contains the required key, negative if it doesn’t.
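The keychain example can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the names and key colours are made up, there is exactly one shared key, and the bag labels contain no noise, so a plain set intersection is enough to find the witness.

```python
# Each keychain (bag) is a set of keys (instances). All names are hypothetical.
keychains = {
    "alice": {"red", "green", "blue"},
    "bob": {"yellow", "green"},
    "carol": {"red", "blue"},
    "dave": {"yellow"},
}
# Bag-level labels: True if the person can enter the room.
can_enter = {"alice": True, "bob": True, "carol": False, "dave": False}


def find_witness_keys(bags, labels):
    """Keys present in every positive bag and absent from every negative bag."""
    positive = [keys for name, keys in bags.items() if labels[name]]
    negative = [keys for name, keys in bags.items() if not labels[name]]
    candidates = set.intersection(*positive)
    for keys in negative:
        candidates = candidates - keys
    return candidates


def classify_bag(keys, witnesses):
    """Standard MIL assumption: positive if the bag holds at least one witness."""
    return any(key in witnesses for key in keys)


witness_keys = find_witness_keys(keychains, can_enter)
print(witness_keys)                                     # {'green'}
print(classify_bag({"green", "purple"}, witness_keys))  # True
print(classify_bag({"red", "blue"}, witness_keys))      # False
```

Real MIL data is rarely this clean, which is why the methods discussed later learn a scoring function over instances instead of intersecting sets.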

This standard assumption can be relaxed to accommodate problems where a positive bag cannot be identified by a single instance, but only by a combination of several instances. For example, in the classification of desert, sea and beach images, images of beaches contain both sand and water segments. Several positive instances are required to distinguish a “beach” from a “desert” or a “sea”.
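As a toy illustration of this relaxed assumption, the rule below labels an image from the set of segment types it contains. The segment names and the rule itself are assumptions made for the example, not a trained model.

```python
def classify_scene(segments):
    """Label an image (bag) from the set of segment types (instances) it contains."""
    has_sand = "sand" in segments
    has_water = "water" in segments
    if has_sand and has_water:
        return "beach"   # needs both kinds of instance at once
    if has_sand:
        return "desert"
    if has_water:
        return "sea"
    return "other"


print(classify_scene({"sand", "sky"}))           # desert
print(classify_scene({"water", "sky"}))          # sea
print(classify_scene({"sand", "water", "sky"}))  # beach
```

No single segment is a witness for “beach”: the label only follows from the co-presence of sand and water.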

Characteristics of MIL Problems

There are some common characteristics of MIL problems, as defined in the literature, which will be discussed next.

Task/Prediction: Instance level vs Bag Level

In some applications, such as object localization in images (in content retrieval, for instance), the objective is not to classify bags but to classify individual instances. The bag label only indicates the presence of that entity somewhere in the image.

Note that the bag classification performance of a method is often not representative of its instance classification performance. For example, in a negative bag, a single false positive instance causes the whole bag to be misclassified. In a positive bag, on the other hand, a false positive does not change the bag label, so it does not affect the bag-level loss.
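The gap between the two levels can be checked on a small made-up example. The instance labels and predictions below are invented for illustration, and the bag prediction follows the standard assumption (positive if any instance is predicted positive).

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive if any instance is positive."""
    return int(any(instance_labels))


# (true instance labels, predicted instance labels) for two hypothetical bags.
bags = [
    ([0, 0, 0, 0], [0, 1, 0, 0]),  # negative bag with one false positive
    ([1, 0, 0, 0], [1, 1, 1, 0]),  # positive bag with two false positives
]

results = []
for true, pred in bags:
    instance_accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
    bag_correct = bag_label(true) == bag_label(pred)
    results.append((instance_accuracy, bag_correct))

print(results)  # [(0.75, False), (0.5, True)]
```

The first bag has the higher instance accuracy but is misclassified at bag level, while the second has more instance errors and is still classified correctly.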

Bag Composition

Most existing MIL methods assume that positive and negative instances are sampled independently from a positive and a negative distribution. This is often not the case, because several kinds of relations between instances can occur:

Intra-Bag Similarities

The instances belonging to the same bag share similarities that instances from other bags do not. In Computer Vision applications, it is likely that all segments of an image share some similarities related to the capture conditions (e.g. illumination). Another source of similarity is overlapping patches produced by the extraction process, as represented below.



(image reference here, adapted from an external source)

Instance Co-Occurrence

Instances co-occur in bags when they share a semantic relation. This type of correlation happens when the subject of a picture is more likely to be seen in some environment than in another, or when some objects are often found together.

(image reference here, adapted from an external source)

Instance and Bag Structure

In some problems, there is an underlying structure (spatial, temporal, relational, causal) between instances in bags or even between bags. For example, when a bag represents a video sequence (say, the task is to identify the frames where a cat appears, knowing only that there is a cat somewhere in the video), all frames or patches are temporally and spatially ordered.

Label Ambiguity

Label Noise

Some MIL algorithms, especially those working under the standard MIL assumption, rely heavily on the correctness of bag labels. In practice, there are many situations where positive instances may be found in negative bags, due to labeling errors or inherent noise. For example, in computer vision applications, it is difficult to guarantee that negative images contain no positive patches: an image showing a house may contain flowers, but it is unlikely to be annotated as a flower image.

Label noise also arises when bags have different densities of positive instances. For instance, consider a 10-second audio recording (R1) in which the tagged event lasts only 1 second in total, and another recording (R2) of the same duration in which the tagged event is present for a total of 5 seconds. R1 is a weaker representation of the event than R2.
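The difference between R1 and R2 can be expressed as the fraction of positive instances in each bag, often called the witness rate. The sketch below assumes each recording is split into ten one-second frames with known frame labels, which is only for illustration since those labels are exactly what MIL does not have.

```python
def witness_rate(frame_labels):
    """Fraction of instances (here one-second frames) that contain the event."""
    return sum(frame_labels) / len(frame_labels)


r1 = [1] + [0] * 9      # event present during 1 of 10 seconds
r2 = [1] * 5 + [0] * 5  # event present during 5 of 10 seconds

print(witness_rate(r1))  # 0.1
print(witness_rate(r2))  # 0.5
```

Both recordings carry the same positive bag label, although the evidence for it is five times denser in R2.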

Different Label Spaces

The label space of the instances can also differ from that of the bags: it is possible to extract patches from negative images that fall into the positive concept region. In the example shown below, some patches extracted from the image of a white tiger fall into another concept region because they are visually similar to it.


Several models can be used for MIL, for either instance-level or bag-level classification. A few examples are shown next:

Bag-Level Classification

Bag of Words approach

A bag can be represented by the distribution of its instances: each instance is embedded (e.g. with an image embedding) and assigned to one of a set of prototype “words”, and the frequency of each word in the bag is counted. A classifier is then trained on this histogram to determine whether a bag is positive or not.
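A minimal sketch of the histogram step is shown below. It assumes instances are already embedded as 2-D points and that a codebook of prototype “words” is given; in practice the codebook is usually learned, for example by clustering the instance embeddings. All values are made up.

```python
import math

# Hypothetical 2-D codebook: each entry is the centre of one "word".
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]


def nearest_word(instance, words):
    """Index of the codebook entry closest to the instance embedding."""
    distances = [math.dist(instance, word) for word in words]
    return distances.index(min(distances))


def bag_histogram(bag, words):
    """Normalised frequency of each word among the instances of a bag."""
    counts = [0] * len(words)
    for instance in bag:
        counts[nearest_word(instance, words)] += 1
    return [count / len(bag) for count in counts]


bag = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.1, 0.8)]
histogram = bag_histogram(bag, codebook)
print(histogram)  # [0.25, 0.5, 0.25]
```

Every bag becomes a fixed-length vector regardless of how many instances it holds, so any standard classifier can be trained on these histograms with the bag labels.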

Earth Mover Distance Support Vector Machine (EMD-SVM)

The Earth Mover’s Distance (EMD) is a measure of the dissimilarity between two distributions. Each bag is treated as a distribution of instances (e.g. in an image embedding space), and the EMD between bags is used to build the kernel of an SVM.
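To make the idea concrete, here is a simplified sketch that assumes scalar instance features, uniform instance weights and bags of equal size. In that special case the EMD reduces to the mean absolute difference between the sorted values. The exponential kernel form and the `gamma` parameter are assumptions chosen for the example, and a general EMD requires solving a transport problem.

```python
import math


def emd_1d(bag_a, bag_b):
    """EMD between two equally sized bags of scalar features (uniform weights)."""
    if len(bag_a) != len(bag_b):
        raise ValueError("this simplified version needs bags of equal size")
    pairs = zip(sorted(bag_a), sorted(bag_b))
    return sum(abs(a - b) for a, b in pairs) / len(bag_a)


def emd_kernel(bag_a, bag_b, gamma=1.0):
    """Similarity between bags: close to 1 when the distributions match."""
    return math.exp(-gamma * emd_1d(bag_a, bag_b))


bags = [[0.0, 1.0, 2.0], [2.0, 0.0, 1.0], [3.0, 4.0, 5.0]]
gram = [[emd_kernel(a, b) for b in bags] for a in bags]

print(emd_1d(bags[0], bags[1]))  # 0.0 (same values, different order)
print(emd_1d(bags[0], bags[2]))  # 3.0
```

A Gram matrix like `gram` can then be given to an SVM implementation that accepts precomputed kernels, together with the bag labels.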

Instance-Space Methods

(image reference here)

Alternative formulations of SVMs (mi-SVM and MI-SVM) were developed for multiple instance learning. Classically, an SVM looks for the hyperplane that separates the classes with the maximum margin. For MIL, since at least one instance in a positive bag must be positive, the margin is redefined so that this condition holds: at least one instance in each positive bag should have a large positive margin.

Once the decision function has been determined, the class of each instance can be recovered from it.
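The sketch below illustrates the bag-level margin idea in the spirit of MI-SVM: the bag is scored by its highest-scoring instance, and a hinge loss is applied to that score. The linear weights are hypothetical and no training is shown. Scoring negative bags by their maximum instance is a simplification assumed here for brevity.

```python
def decision(instance, weights, bias):
    """Linear decision function f(x) = w . x + b."""
    return sum(w * x for w, x in zip(weights, instance)) + bias


def bag_score(bag, weights, bias):
    """Index and score of the highest-scoring instance in the bag."""
    scores = [decision(instance, weights, bias) for instance in bag]
    witness = max(range(len(bag)), key=scores.__getitem__)
    return witness, scores[witness]


def bag_hinge_loss(bag, label, weights, bias):
    """Hinge loss on the bag score; label is +1 or -1."""
    _, score = bag_score(bag, weights, bias)
    return max(0.0, 1.0 - label * score)


weights, bias = [1.0, -1.0], -0.5  # hypothetical, not learned here
positive_bag = [(0.0, 1.0), (3.0, 0.5), (0.5, 0.5)]

witness, score = bag_score(positive_bag, weights, bias)
print(witness, score)                                    # 1 2.0
print(bag_hinge_loss(positive_bag, +1, weights, bias))   # 0.0

# Recovering instance classes from the same decision function.
instance_labels = [1 if decision(x, weights, bias) > 0 else -1 for x in positive_bag]
print(instance_labels)                                   # [-1, 1, -1]
```

Only the second instance needs a large positive margin for the positive bag to incur no loss, which is the condition described above.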


Neural Network with pooling

With only a bag-level label, the network can still produce a latent representation containing a probability for each segment (using a sequence-based input). Applying a pooling operator (max or average pooling) reduces these probabilities to a single score for the bag, which is compared with the bag label during training. After training, if you want instance-level predictions, the final pooling layer can be removed.

Usually, max pooling is used for classification problems, while average pooling is applied to regression problems.
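The two pooling operators can be compared on a made-up vector of per-segment probabilities, standing in for the output of the network just before the pooling layer:

```python
def max_pool(instance_probs):
    """Bag score driven by the single most confident segment."""
    return max(instance_probs)


def mean_pool(instance_probs):
    """Bag score averaged over all segments."""
    return sum(instance_probs) / len(instance_probs)


# Hypothetical per-segment probabilities for one bag.
instance_probs = [0.1, 0.05, 0.9, 0.2]

print(max_pool(instance_probs))
print(mean_pool(instance_probs))
```

With max pooling a single confident segment is enough to make the bag positive, which matches the standard MIL assumption. With average pooling the same segment is diluted by the others.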

Neural Networks with Attention Mechanisms

Attention mechanisms can also be applied to these kinds of problems. Consider the image below, for audio event detection, which uses two separate, symmetric models, a detector and a classifier, trained with just the video-level label. The output of the classifier indicates how likely it is that a certain block has tag k. The output of the detector indicates how informative that block is when classifying the k-th tag, so the model can weigh each block accordingly when producing the bag-level prediction.

(image reference here)
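One common way to combine the two outputs, assumed here for illustration, is to normalise the detector outputs into weights and take the weighted average of the classifier outputs. The numbers are made up and refer to a single tag k. The sketch assumes the detector outputs are non-negative and not all zero.

```python
def attention_pool(classifier_out, detector_out):
    """Weighted average of classifier outputs, weighted by detector outputs."""
    total = sum(detector_out)
    weights = [d / total for d in detector_out]
    bag_score = sum(w * c for w, c in zip(weights, classifier_out))
    return bag_score, weights


# Hypothetical outputs for three blocks of one recording and one tag k.
classifier_out = [0.9, 0.2, 0.1]  # how likely each block has tag k
detector_out = [0.8, 0.1, 0.1]    # how informative each block is for tag k

bag_score, weights = attention_pool(classifier_out, detector_out)
print(round(bag_score, 3))  # 0.75
```

The normalised weights also give a per-block indication of where the tag occurs, even though only the recording-level label was used.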

Do you want to further discuss this idea?

Book a meeting with Paulo Maia



This blog post has described the concept of Multiple Instance Learning, its major challenges, and some examples of algorithms that can be used. Although training with only bag-level labels is not ideal, and it often seems impossible to train models with sparse annotations, there are tools designed specifically to tackle this barrier and obtain satisfactory results.

These are just some of the tools that can be used for this purpose. Hopefully, this post has given you some new ideas to apply to your projects. Enroll in our online course for more information about Multiple Instance Learning and other learning paradigms.


