Embedding Domain Knowledge for Estimating Customer Lifetime Value

How we designed an interpretable neural network to predict Customer Lifetime Value

With the rise of Deep Neural Networks in the ML community, we have observed an increasingly common fit-predict approach, where AI practitioners don’t take the time to think about the domain knowledge that is already available and how to embed that knowledge in their models. In this blog post, we will cover how we created custom deep neural networks that embed domain knowledge to estimate customer lifetime value over multiple timesteps, in a project developed together with a major Telco in Portugal. We will start by explaining the problem and the different ways it has been approached in the literature, followed by our solution and the way we incrementally built it.

While this post reflects the ideas applied in the telecommunications industry, they can easily be extrapolated to any other subscription-based industry.

What is Customer Lifetime Value and how to estimate it?

The customer lifetime value is the net present value of a customer’s projected profit over a certain number of months [1]. In the telecommunications industry specifically, where price transitions are limited, the customer’s monthly margin and survival curve are the two major components of this value.
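Using this definition, one common formulation is the following (the notation here is ours, with $m_t$ the monthly margin, $S(t)$ the survival probability at month $t$, and $d$ a monthly discount rate; the exact discounting scheme varies by company):

     \begin{eqnarray*} CLTV_N & = & \sum_{t=1}^{N} \frac{m_t \times S(t)}{(1+d)^t} \end{eqnarray*}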

There are several approaches in the literature for estimating Customer Lifetime Value (CLTV). The most classical approaches are based on statistical models (e.g., Buy ’Til You Die models, the Pareto/NBD model, Recency-Frequency-Monetary value [2]) and consider features such as purchase frequency and most recent purchases, fitting them to a certain statistical distribution. More recently, machine-learning-based approaches have been reported in the literature, which rely on hand-crafted features [3, 4].

Based on a previously presented idea of a model for recommending the best package for a given customer, we set out to build a model that could estimate customer lifetime value in a time window of N months, given such a recommendation. We intended to build a tool for understanding whether CLTV could help our stakeholders decide on the best package to offer in an outbound call. By maximizing CLTV over a set of possible package transitions, it is expected that client satisfaction and loyalty will also increase.

Business Rules and important dataset features

After accepting an offer, clients agree to a fidelization (contract) period of 24 months. If the offer is rejected, the fidelization period remains unchanged, and the risk of the client leaving the company (churning) increases.

Our dataset contains two groups of features:

  • Behavioral features: consumption patterns, interactions with the company’s channels, etc.
  • Proposal features: features of the offer, such as internet speed, package type, number of TV channels, etc.

During this article, we will use the following notation: the offer month is month 0, and the subsequent months are months 1, …, N. If the target variables are presented in curly brackets, the input is a dictionary of arrays; otherwise, it is a single array.

From the available data, we can calculate three different possible targets, all of which are helpful for business decision-making in different use cases:

$yAlive_N$: Client has not left the company in month N (is alive/did not churn)

$yPrice_N$: Monthly Revenue (Price), simplified as subscription value

$yTaker$: Client accepted the offered package

After considering several approaches for estimating CLTV based on optimization functions or on machine learning models that used pre-trained models’ scores as features (for more, check the appendix), we opted to develop an approach based on Deep Neural Networks.

Deep Neural Networks Approach

Deep learning has promoted a black-box approach where people do not think about what they’re trying to do, and just plug their input features into a model and get an output value. At NILG.AI, we always think about business impact and how to create explainable models to help communicate with people in charge of making business decisions. 

How can we create a block-based model that we could quickly manipulate to test different things? How can we learn holistic representations of the client that cover more than one signal of interest (e.g., CLTV, churn, upselling)? At a very high level, we planned to build a model that could be single- or multi-output, and single- or multi-task, with embedded domain knowledge.

The next sections explain the building blocks of this architecture.

We improved the Mean Absolute Error by 50% using our custom architecture when compared with an off-the-shelf regression model

DNN User

DNN User is a loop of N Dense layers, each followed by a Dropout layer, with a final Dense layer that creates a common latent space for the downstream tasks. Essentially, this is a feature-extraction step that learns relevant user-specific representations from the input features.
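A minimal sketch of this block, assuming Keras as the framework (the layer count and sizes below are illustrative, not the values used in the project):

```python
import tensorflow as tf
from tensorflow.keras import layers

def dnn_user(input_dim: int, n_blocks: int = 3, hidden: int = 64,
             latent_dim: int = 32, dropout: float = 0.2) -> tf.keras.Model:
    """Loop of Dense + Dropout layers, ending in a common latent space."""
    inputs = tf.keras.Input(shape=(input_dim,), name="user_features")
    x = inputs
    for _ in range(n_blocks):
        x = layers.Dense(hidden, activation="relu")(x)
        x = layers.Dropout(dropout)(x)
    # Final Dense layer: shared latent representation for all downstream tasks
    latent = layers.Dense(latent_dim, activation="relu", name="user_latent")(x)
    return tf.keras.Model(inputs, latent, name="dnn_user")

model = dnn_user(input_dim=10)
```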

For now, let’s consider that we are outputting only the Price for a given month after the proposal month ($yPrice_N$). 

Churn + Regression Task – Single Output

If we were building a simple regression model, all we would need to do to finish this architecture is add a final non-negative activation (e.g., ReLU/ELU) predicting $yPrice_N$. Instead, we decided to add an extra task: estimating the client survival probability with a sigmoid layer that predicts $yAlive_N$.

To enforce that a decrease in client survival probability leads to a decrease in CLTV, we added the following business rule to our neural network:

     \begin{eqnarray*} PriceDest_N & = & ProbAlive_N(X) \times PriceAlive_N \\ & & +\ (1-ProbAlive_N(X)) \times PriceChurned_N \end{eqnarray*}

where the second term is zero, as the price when the client has churned ($PriceChurned_N$) is zero. The equation therefore reduces to a multiplication between the survival probability and $PriceAlive_N$, a latent variable that can be interpreted as the potential value the client is willing to pay for the service, ignoring customer satisfaction and competition.
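This head can be sketched in Keras (assumed framework; the latent size is illustrative) with the business rule expressed as a Multiply layer gating the latent price by the survival probability:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

latent = tf.keras.Input(shape=(32,), name="user_latent")  # output of DNN User
prob_alive = layers.Dense(1, activation="sigmoid", name="prob_alive")(latent)
# Latent "potential value" the client would pay while alive (non-negative)
price_alive = layers.Dense(1, activation="relu", name="price_alive")(latent)
# Business rule: PriceDest_N = ProbAlive_N * PriceAlive_N (churned price is 0)
price_dest = layers.Multiply(name="price_dest")([prob_alive, price_alive])
model = tf.keras.Model(latent, [prob_alive, price_dest])
```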

The network we have built so far is summarized below, where the red-box outputs are learned in a multitask supervised fashion.

The loss function is calculated according to the following equation:

     \begin{eqnarray*} Loss_N & = & XEntropy(yAlive_N, ProbAlive_N(X)) \\ & & +\ MSE(yPrice_N, PriceDest_N) \end{eqnarray*}
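As a sanity check, this per-month loss can be computed by hand in numpy (the values below are made-up examples, not project data):

```python
import numpy as np

def month_loss(y_alive, prob_alive, y_price, price_pred, eps=1e-7):
    """Binary cross-entropy on the alive flag plus MSE on the price."""
    p = np.clip(prob_alive, eps, 1 - eps)
    xentropy = -np.mean(y_alive * np.log(p) + (1 - y_alive) * np.log(1 - p))
    mse = np.mean((y_price - price_pred) ** 2)
    return xentropy + mse

y_alive = np.array([1.0, 0.0])      # churn labels for two clients
prob_alive = np.array([0.9, 0.2])   # predicted survival probabilities
y_price = np.array([30.0, 0.0])     # observed monthly prices
price_pred = np.array([28.0, 1.0])  # predicted monthly prices
loss = month_loss(y_alive, prob_alive, y_price, price_pred)
```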

Churn + Regression Task – Multiple Output

The above network predicts $yPrice_N$. However, CLTV is estimated by summing the Price over several months.

To create a multi-output model, all that is required is to repeat the above blocks, shifting the target by one month for each block.
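Repeating the heads for each month could look like the following sketch (again assuming Keras; names and sizes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def multi_month_model(latent_dim: int = 32, n_months: int = 12) -> tf.keras.Model:
    """One (prob_alive, price) head per month, sharing the same latent space."""
    latent = tf.keras.Input(shape=(latent_dim,), name="user_latent")
    outputs = []
    for n in range(n_months):
        prob_alive = layers.Dense(1, activation="sigmoid", name=f"yAlive_{n}")(latent)
        price_alive = layers.Dense(1, activation="relu", name=f"price_alive_{n}")(latent)
        # Each output is trained against its month-shifted target
        price = layers.Multiply(name=f"yPrice_{n}")([prob_alive, price_alive])
        outputs += [prob_alive, price]
    return tf.keras.Model(latent, outputs)

model = multi_month_model()
```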

The loss function is then calculated as:

     \begin{eqnarray*} Loss & = & \sum_{n=0}^{N} Loss_n \end{eqnarray*}

Churn + Regression Task + Taker Task

We can also add an extra task for further explainability: the probability of the client accepting the offer ($yTaker$), which can be modeled as a business rule in the network through the following equation:

     \begin{eqnarray*} ExpectedValueTaker_N(User) & = & ProbTaker_N(User) \times LatPrice_N \\ & & +\ (1 - ProbTaker_N(User)) \times PriceOrigin \end{eqnarray*}

This is the expected subscription value, weighting the cases where the client accepts the offer and where they do not. This expected value is then multiplied by the survival probability at that timestep, as in the previous architecture.

The final equation (which can be followed by a non-negative activation, such as ReLU) is then written as:

     \begin{eqnarray*} PriceDest_N(User) & = & ProbAlive_N(User) \times (ProbTaker_N(User) \times LatPrice_N \\ & & +\ (1-ProbTaker_N(User)) \times PriceOrigin) \end{eqnarray*}
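With illustrative numbers (not project data), the full business rule evaluates as follows:

```python
import numpy as np

prob_alive = np.array([0.95, 0.40])    # ProbAlive_N(User)
prob_taker = np.array([0.70, 0.10])    # ProbTaker_N(User)
lat_price = np.array([45.0, 45.0])     # LatPrice_N: latent price if the offer is taken
price_origin = np.array([30.0, 30.0])  # current subscription value

# Expected subscription value, weighting taker and non-taker cases
expected_taker = prob_taker * lat_price + (1 - prob_taker) * price_origin
# Gate by survival probability and apply a non-negative (ReLU-like) activation
price_dest = np.maximum(prob_alive * expected_taker, 0.0)
```

The second client, with low acceptance propensity and low survival probability, ends up with a much lower expected value, which is exactly the interpretable behavior we want from the model.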



This architecture was compared to a baseline regression model by computing the mean absolute error (MAE) between the predicted and the target Price. We improved the MAE by 50% using our custom architecture when compared with an off-the-shelf regression model.

While in some cases this may not have a direct impact on model results (and may even reduce performance if some of the labels are noisy and contradictory), adding these priors can increase model regularization and reduce overfitting to noisy values, outliers, or sporadic events in time.


We have managed to innovate in this area of Customer Lifetime Value estimation by creating a multitask model that predicts customer churn, lifetime value, and propensity to accept an offer. Although DNNs are typically considered black boxes, we show here that, by thinking of DNNs as Differentiable Programs (as encouraged by Yann LeCun), we can add different tasks to improve interpretability and decision making.

With this, we can understand which offer is best for improving customer satisfaction and retention. For instance, an offer that leads to a low propensity and high probability of churn is probably not the best offer for that customer.

There are several more ways to add domain knowledge to this model, such as a loss function that penalizes errors more heavily for customers in a certain risk group, or ensuring that monotonic features have a monotonic impact on the output.



  1. Lu, J., & Park, O. (2003). Modeling customer lifetime value using survival analysis—an application in the telecommunications industry. Data Mining Techniques, 120-128.
  2. Peter S. Fader, Bruce G. S. Hardie, and Ka Lok Lee. 2005. RFM and CLV: Using Iso-Value Curves for Customer Base Analysis. Journal of Marketing Research XLII, November (2005), 415–430
  3. Chamberlain, B. P., Cardoso, A., Liu, C. H., Pagliari, R., & Deisenroth, M. P. (2017, August). Customer lifetime value prediction using embeddings. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1753-1762). ACM.
  4. Vanderveld, A., Pandey, A., Han, A., & Parekh, R. (2016, August). An engagement-based customer lifetime value system for e-commerce. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 293-302). ACM.

