Privacy Preserving Machine Learning

This article reports my work at NILG.AI during a curricular internship on privacy-preserving Machine Learning. Trip data is any type of data that connects the origin and destination of a person’s travel, and it is generated in countless ways as we move about our day and interact with systems connected to the internet. But why is trip data sensitive? The trips we take are unique to us. Researchers have found that while it takes 12 points to identify an individual from a fingerprint, only 4 spatio-temporal points are enough to uniquely identify 95% of the population (link).

Of course, we can generalize this to any type of sensitive data. Businesses and organizations hold more personal data than ever to serve their customers better and to run operations more effectively. Moreover, since privacy goes hand in hand with security, preserving users’ data privacy benefits both the users and the companies processing potentially sensitive information. However, with plenty of malicious parties eager to access personal data and trace sensitive information back to its source, finding a way to maintain the data’s utility while adequately reducing the risk of leakage has become a central concern.

This raises an important question – how do we perform machine learning on data to enable technical advances without creating excessive risks to users’ privacy? In fact, data privacy is a central issue in training and testing artificial intelligence models, especially those that train on and infer from sensitive data.

The four pillars of perfectly privacy-preserving Machine Learning

What it means to have perfectly privacy-preserving Machine Learning has not yet been precisely defined.

However, it is possible to define four crucial pillars for achieving it (link):

  • Training data privacy: it should not be possible for a malicious actor to reverse-engineer the training data.
  • Input privacy: one’s input data should not be observable by other parties, including the model creator.
  • Output privacy: the model’s output should not be visible to anyone except the user whose data is being inferred upon.
  • Model privacy: the model itself should not be susceptible to theft by a malicious party.

Privacy-preserving Machine Learning Techniques

Although there are many privacy-preserving techniques for Artificial Intelligence and Machine Learning, only four were selected to be explored and implemented in this curricular internship – K-anonymity, Differential Privacy with a focus on the Laplace Mechanism, Homomorphic Encryption, and Extreme Learning Machines.

K-anonymity

K-anonymization is often referred to as the power of “hiding in the crowd” and is essentially used to generalize some identifying attributes and remove others entirely from the data set, without compromising the utility and effectiveness of the data (link).

The k in k-anonymity refers to the minimum number of times each combination of quasi-identifying values appears in a data set. If k = 2, then for any record there must be at least one other record that is indistinguishable from it; in general, every record must be indistinguishable from at least k − 1 others. Various k-anonymization techniques keep data safe and anonymous, from generalization to suppression.

Unfortunately, k-anonymity isn’t sufficient for anything but very large data sets with only small numbers of simple fields for each record. Intuitively, the more fields and the more possible entries there are in those fields, the more unique a record can be and the harder it is to ensure that there are k equivalent records.
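As a minimal sketch of these ideas (toy records and a hypothetical `generalize_age` helper, not the internship’s actual pipeline), one can measure k over a set of quasi-identifiers and watch generalization raise it:

```python
from collections import Counter

def k_of(records, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(groups.values())

def generalize_age(record):
    """Generalize an exact age into a decade band (e.g. 34 -> '30-39')."""
    band = (record["age"] // 10) * 10
    return {**record, "age": f"{band}-{band + 9}"}

records = [
    {"age": 31, "zip": "4000", "duration": 12},
    {"age": 34, "zip": "4000", "duration": 18},
    {"age": 39, "zip": "4000", "duration": 7},
    {"age": 52, "zip": "4100", "duration": 25},
    {"age": 57, "zip": "4100", "duration": 30},
]

quasi_ids = ["age", "zip"]
print(k_of(records, quasi_ids))      # exact ages: every record is unique, k = 1
generalized = [generalize_age(r) for r in records]
print(k_of(generalized, quasi_ids))  # decade bands: k = 2
```

Note that the `duration` column (the data we want to analyze) is untouched; only the quasi-identifiers are generalized.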

Differential Privacy

Differential Privacy is a formal guarantee – and a family of algorithms that satisfy it – for describing the patterns of groups in a data set while withholding information about the individuals in it (link).

Informally, Differential Privacy guarantees the following for each individual who contributes data for analysis: the output of a differentially private analysis will be roughly the same whether or not one contributes their data. That is, the risk to one’s privacy should not substantially increase as a result of participating in a statistical database. Thus, an attacker should not be able to learn any information about any participant that they could not learn if the participant had opted out of the database.

Absolute privacy is inherently impossible, but what Differential Privacy provides is a small chance of a privacy violation.

It builds conceptually on a prior method known as randomized response, where the key idea is to introduce a randomization mechanism that provides plausible deniability. Suppose the responses recorded in a survey were randomized: with probability p, the true response would be recorded, and with probability 1 − p, a random one would be recorded instead. Whatever the recorded answer, every individual could argue that it was false and thereby keep their privacy.
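A small simulation of this idea (hypothetical function names, not code from the internship) shows that although each individual answer is deniable, the population-level rate can still be recovered by debiasing:

```python
import random

def randomized_response(truth, p, rng):
    """With probability p record the true answer; otherwise record a coin flip."""
    if rng.random() < p:
        return truth
    return rng.random() < 0.5

def debias(responses, p):
    """Recover the true 'yes' rate: E[observed] = p*rate + (1-p)*0.5."""
    observed = sum(responses) / len(responses)
    return (observed - (1 - p) * 0.5) / p

rng = random.Random(0)
true_answers = [rng.random() < 0.3 for _ in range(100_000)]  # 30% true 'yes' rate
recorded = [randomized_response(t, p=0.7, rng=rng) for t in true_answers]
print(debias(recorded, p=0.7))  # close to 0.30, yet every single record is deniable
```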

The Laplace Mechanism is one of the most comprehensible methods for Differential Privacy: it adds noise drawn from a Laplace distribution, calibrated to the query’s sensitivity and the privacy budget ε, giving theoretical bounds on the privacy of users in the data set while still enabling scientists to mine useful insights from it. The Laplace Mechanism, however, is unsuitable for categorical data (the Exponential Mechanism would be more suitable in that case).
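To make this concrete, here is a minimal sketch of the Laplace Mechanism for a counting query; a count has sensitivity 1 (adding or removing one person changes it by at most 1), so noise with scale 1/ε suffices. The toy trip data and the function name are illustrative, not the internship’s code:

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Release a count query under epsilon-differential privacy.
    Sensitivity of a count is 1, so Laplace noise with scale 1/epsilon is enough."""
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
trip_durations = [12, 35, 7, 48, 22, 15, 31]  # toy trip durations (minutes)
noisy = laplace_count(trip_durations, lambda d: d > 20, epsilon=0.5, rng=rng)
print(noisy)  # the true answer (4) plus Laplace noise with scale 2
```

Smaller ε means stronger privacy but noisier answers; this trade-off is the privacy budget.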

Homomorphic Encryption

Homomorphic encryption is a public key cryptographic scheme that involves four main steps:

  • The owner generates a private and public key pair
  • The owner encrypts the data: using the public key, the plaintext is converted into ciphertext, which is unreadable until the proper private key is used to decrypt it.
  • Computation on encrypted data is performed: computations are performed directly on the ciphertext, meaning that the data and results remain encrypted during the process.
  • Owner decrypts data and results: the encrypted results are then sent back to the data owner and because of the homomorphic properties of the encryption and decryption, data and results are decrypted by the data owner without having compromised security at any point during the procedure.
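The steps above can be illustrated with a deliberately insecure toy: textbook (unpadded) RSA is multiplicatively homomorphic, so a product can be computed on ciphertexts alone. Real homomorphic encryption uses lattice-based schemes (e.g. BFV or CKKS, available in libraries such as Microsoft SEAL); this sketch only demonstrates the homomorphic property itself:

```python
# Textbook (unpadded) RSA satisfies Enc(a) * Enc(b) mod n == Enc(a * b).
# Tiny parameters, totally insecure -- for demonstration only.
p, q = 61, 53
n = p * q                            # public modulus
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (key generation)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 6
product_ct = (encrypt(a) * encrypt(b)) % n  # computed on ciphertexts only
print(decrypt(product_ct))                  # 42 == 7 * 6, never decrypted in between
```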

While it is standard for data to be encrypted in transit, it becomes vulnerable once decrypted for processing. As a result, the perfect scenario is to be able to process the encrypted data directly. This is when Homomorphic Encryption becomes particularly interesting, such as when one wishes to provide a prediction service based on a user’s data but without the user needing to trust another party with their data, thus only supplying encrypted data.

Despite being relatively simple and having few dependencies compared to the other methods, Homomorphic Encryption only supports addition and multiplication on ciphertexts, and long chains of multiplications are expensive to evaluate, which can make some computations difficult to express. Besides, it is still slow and computationally expensive, so it might not yet be practical, mainly when compared with traditional mechanisms.

Extreme Learning Machines

Extreme Learning Machines are neural networks with a single layer or multiple layers of hidden nodes where the parameters of the hidden nodes don’t require training.

The idea is that, by keeping the model’s weights randomized (but saving them), we are applying multiple additions and multiplications on the data that have no information about the data yet.

The “anonymized” data is the compressed version – which can have more, less, or the same number of features (depending on the number of neurons in the hidden layers).
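A minimal numpy sketch of this idea (toy data, not the internship’s model): the hidden layer is a fixed random projection, and only the output weights are fit, by least squares on the projected representation rather than on the raw features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: 200 samples, 3 features, a noisy linear target.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Random hidden layer: weights are drawn once and never trained.
n_hidden = 50
W = rng.normal(size=(3, n_hidden))
b = rng.normal(size=n_hidden)
H = np.tanh(X @ W + b)  # the "anonymized" representation that gets shared

# Only the output weights are fit, via least squares on H (not on X).
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
pred = H @ beta
print(np.mean((pred - y) ** 2))  # low MSE relative to the variance of y
```

Whoever receives H can still fit a useful model, but never sees the raw features X.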

Internship – in practice

For the purpose of this internship, the New York City Taxi Trip Duration data set was used to predict a trip’s duration given the location and time of the day.

The importance of comprehending and analyzing the data cannot be neglected. For this reason, a thorough exploratory analysis was performed before pre-processing the data.

After getting to know the data set properly, the feature engineering stage began, which, very briefly, involved handling missing values through imputation, handling non-numerical features (for example, by binarizing them), and extracting new relevant features from existing ones.

Having defined what would be performed in terms of feature engineering, the data set was sampled to increase efficiency and speed up processing, ordered temporally, and split into training and test samples. The machine learning models chosen were Linear Regression, the XGBoost Regressor, and the Multi-layer Perceptron (MLP) Regressor. In terms of metrics, mean squared error, root mean squared error, and mean absolute error were evaluated, along with each model’s train and test scores and execution time.
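As a small illustration of the evaluation step (a hypothetical helper, not the internship’s code), the three error metrics can be computed as:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Mean squared error, root mean squared error and mean absolute error."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    mse = np.mean(err ** 2)
    return {"mse": mse, "rmse": np.sqrt(mse), "mae": np.mean(np.abs(err))}

# Toy trip durations (minutes): true values vs. a model's predictions.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
print(regression_metrics(y_true, y_pred))  # mse 6.5, rmse ~2.55, mae 2.5
```

The same metrics, computed on anonymized and non-anonymized data, make the utility cost of each mechanism directly comparable.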

A more comprehensive and detailed explanation of the exploratory analysis, the feature engineering, and the aforementioned anonymization mechanisms, both in practical and theoretical terms, can be found in this internship’s final report.

All the results obtained in machine learning with anonymized and non-anonymized data have been compiled into an interactive dashboard, developed using the Streamlit library.

Moreover, if you are interested in the 5-minute video I submitted as my final presentation, where I go through a quick dashboard exploration, follow this link.

Conclusions from my Privacy-preserving Machine Learning internship

Regarding the internship’s specific goals, I believe that a longer internship would have allowed me to explore topics of increased complexity in greater detail. Regarding Homomorphic Encryption, even though it is one of the strongest methods for anonymizing data, the outcomes were far below expectations – but there was a lack of time for further exploration. As for the Extreme Learning Machines mechanism, whose anonymization power is not far behind the previous one, it could have been explored in greater detail, although the results obtained were good when compared with the other mechanisms.

I acquired a wide collection of new skills and experiences throughout the internship. Despite all the challenges faced, not only did I expand my knowledge in machine learning, which turned out to be very useful in one of my curricular units, but I also had the unique opportunity to be part of a project of immeasurable importance nowadays.

