As Deep Neural Networks have risen in the ML community, we have seen a growing fit-predict mindset, where AI practitioners don’t take the time to think about the domain knowledge that is already available and how to embed it in their models. In this blog post, we cover how we built custom deep neural networks that embed domain knowledge to estimate customer lifetime value over multiple timesteps, in a project developed together with NOS, a major telco in Portugal. We start by explaining the problem and the different ways it has been approached in the literature, followed by our solution and the way we incrementally built it.
What is Customer Lifetime Value and how to estimate it?
The customer lifetime value is the net present value of a customer’s calculated profit over a certain number of months [1]. Specifically, in the telecommunications industry, where price transitions are limited, the customer’s monthly margin and survival curve are the two major components of this term.
There are several approaches in the literature for estimating Customer Lifetime Value (CLTV). The most classical approaches are based on statistical models (e.g. Buy ’Til You Die models, the Pareto/NBD model, Recency-Frequency-Monetary value [2]) and consider features such as purchase frequency and most recent purchases, fitting them to a certain statistical distribution. More recently, machine-learning-based approaches have been reported in the literature, which consider hand-crafted features [3, 4].
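As a minimal sketch of this definition, the function below sums the monthly margin weighted by the survival probability and discounted back to the offer month. The function name, the argument layout and the `monthly_discount_rate` parameter are assumptions made for illustration; they are not taken from the project.

```python
def cltv(monthly_margin, survival, monthly_discount_rate=0.0):
    """Discounted sum of the expected monthly margin (illustrative only).

    monthly_margin[t-1] : margin earned in month t if the customer is still active
    survival[t-1]       : probability that the customer is still active in month t
    """
    if len(monthly_margin) != len(survival):
        raise ValueError("monthly_margin and survival must have the same length")
    total = 0.0
    for t, (margin, p_alive) in enumerate(zip(monthly_margin, survival), start=1):
        # Expected margin for month t, discounted back to month 0.
        total += margin * p_alive / (1.0 + monthly_discount_rate) ** t
    return total


# Three months at a flat margin with a decaying survival curve, no discounting.
print(cltv([30.0, 30.0, 30.0], [1.0, 0.9, 0.8]))
```

With a flat margin, the result is driven entirely by the survival curve, which is why churn plays such a central role in the rest of the post.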
At NILG.AI, we have developed several projects with the internal Data Science team at NOS. NOS is rich in data on customer interactions and sales, with years of history. While this post reflects the ideas we applied at NOS, they can easily be extrapolated to any other subscription-based industry.
Building on a previously presented idea of a model that recommends the best package for a given customer, we set out to build a model that could estimate customer lifetime value over a time window of N months, given such a recommendation. We wanted a tool for understanding whether CLTV could help our stakeholders decide which package to offer in an outbound call. By maximizing CLTV over a set of possible package transitions, client satisfaction and loyalty are also expected to increase.
Business Rules and important dataset features
After accepting an offer, clients agree to a contract (lock-in) period of 24 months. If the offer is rejected, the contract period remains the same, and the risk of the client leaving the company (churn) increases.
Our dataset contains two groups of features:
- Behavioral features: consumption patterns, interactions with the company’s channels, etc.
- Proposal features: internet speed, package type, number of TV channels, etc.
Throughout this article, we use the following notation: the offer month is month 0, and the subsequent months are named month 1, …, N. If the target variables are presented in curly brackets, the input is a dictionary of arrays; otherwise, it is a single array.
From the available data, we can calculate three different possible targets, which are all helpful for business decision making, in different use cases:
- y_alive_N: Client has not left the company in month N (is alive/did not churn)
- y_PVP_N: Monthly Revenue (PVP), simplified as subscription value
- y_taker: Client accepted the offered package
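To make the notation concrete, here is a small sketch of how such targets could be assembled into a dictionary of arrays. The input layout (`pvp_history`, `churn_month`, `accepted_offer`) is hypothetical, and so is the convention that a client who churns in month c already counts as not alive in month c.

```python
def build_targets(pvp_history, churn_month, accepted_offer, n_months):
    """Build the three targets for a batch of clients (illustrative layout).

    pvp_history[i][n-1] : subscription value of client i in month n
    churn_month[i]      : month in which client i left, or None if still active
    accepted_offer[i]   : True if client i accepted the offered package
    """
    targets = {"y_taker": [1 if a else 0 for a in accepted_offer]}
    for n in range(1, n_months + 1):
        alive = [1 if (c is None or c > n) else 0 for c in churn_month]
        targets[f"y_alive_{n}"] = alive
        # A churned client generates no revenue from the churn month onwards.
        targets[f"y_PVP_{n}"] = [
            history[n - 1] if a else 0.0 for history, a in zip(pvp_history, alive)
        ]
    return targets


targets = build_targets(
    pvp_history=[[30.0, 30.0, 30.0], [25.0, 25.0, 25.0]],
    churn_month=[None, 2],          # second client leaves in month 2
    accepted_offer=[True, False],
    n_months=3,
)
print(targets["y_alive_2"], targets["y_PVP_3"])
```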
After considering several approaches for estimating CLTV, based on optimization functions or on machine learning models that used pre-trained models’ scores as features (for more, see APPENDIX: Other ways of estimating CLTV we considered), we opted to develop an approach based on Deep Neural Networks.
Deep Neural Networks Approach
Deep learning has promoted a black-box approach where people do not think about what they’re trying to do, and just plug their input features into a model and get an output value. At NILG.AI, we always think about business impact and how to create explainable models to help communicate with people in charge of making business decisions.
How can we create a block-based model that we can quickly manipulate when we want to test different things? How can we learn holistic representations of the client that cover more than one signal of interest (e.g., CLTV, churn, upselling)? At a very high level, we planned on building a model that could be single or multi-output, and single or multi-task, with embedded domain knowledge.
The next sections explain the building blocks of this architecture.
(i) DNN User
DNN User is a stack of N Dense layers, each followed by a Dropout layer, with a final Dense layer that creates a common latent space for the downstream tasks. In essence, this is a feature extraction step that derives relevant user-specific features from the input features.
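The block can be sketched without any framework as follows. This is only an illustration of the Dense, Dropout, Dense structure: the ReLU activation in the hidden layers, the layer sizes, the dropout rate and the initialization are all assumptions, and in practice the equivalent layers of a deep learning library would be used.

```python
import random


def dense(x, weights, bias, activation=None):
    """Fully connected layer: one output per row of `weights`."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    if activation == "relu":
        out = [max(0.0, v) for v in out]
    return out


def dropout(x, rate, rng, training):
    """Inverted dropout: active only during training, identity at inference."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dnn_user(x, hidden_layers, latent_layer, rate, rng, training=False):
    """N x (Dense -> Dropout), then a final Dense producing the shared latent vector."""
    h = x
    for weights, bias in hidden_layers:
        h = dense(h, weights, bias, activation="relu")
        h = dropout(h, rate, rng, training)
    weights, bias = latent_layer
    return dense(h, weights, bias)


rng = random.Random(0)
hidden = [init_layer(4, 8, rng), init_layer(8, 8, rng)]  # N = 2 hidden blocks
latent = init_layer(8, 3, rng)                           # 3-dimensional latent space
features = [0.5, 1.0, 0.0, 2.0]                          # one client's input features
embedding = dnn_user(features, hidden, latent, rate=0.2, rng=rng)
print(len(embedding))
```

Every task-specific head described next would take `embedding` as its input, which is what makes the representation shared across tasks.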
For now, let’s consider that we are outputting only the PVP for a given month after the proposal month (y_PVP_N).
(ii) Churn + Regression Task – Single Output
If we were building a simple regression model, all we would need to do to finish this architecture is add a final non-negative activation (e.g., ReLU/ELU) that predicts y_PVP_N. We decided to add an extra task: estimating the client survival probability, with a sigmoid layer that predicts y_alive_N.
To enforce that a decrease in client survival probability leads to a decrease in CLTV, we added the following business rule to our neural network:
PVP_N = P(alive_N) × PVPAlive_N + (1 − P(alive_N)) × PVPChurned_N
where the second term is equal to 0, as the PVP when the client has churned (PVPChurned_N) is zero. Therefore, this equation reduces to a multiplication between the survival probability and PVPAlive_N, a latent variable that can be interpreted as the potential value the client is willing to pay for the service, without considering customer satisfaction and competition.
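A minimal sketch of this rule for a single client and month: a sigmoid head gives the survival probability, a non-negative head gives PVPAlive_N, and the prediction is their expected value. The function and argument names are hypothetical, and ReLU is just one of the non-negative activations mentioned above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def expected_pvp(alive_logit, pvp_alive_raw, pvp_churned=0.0):
    """Combine the survival head and the revenue head with the business rule."""
    p_alive = sigmoid(alive_logit)     # survival probability in (0, 1)
    pvp_alive = relu(pvp_alive_raw)    # non-negative revenue if the client stays
    # Expected revenue over the two outcomes; the churned term vanishes
    # because pvp_churned is 0.
    pvp = p_alive * pvp_alive + (1.0 - p_alive) * pvp_churned
    return p_alive, pvp_alive, pvp


p_alive, pvp_alive, pvp = expected_pvp(alive_logit=0.0, pvp_alive_raw=40.0)
print(p_alive, pvp_alive, pvp)
```

Because the prediction is a product, lowering the survival logit can only lower the predicted PVP for a fixed PVPAlive_N, which is exactly the constraint the rule is meant to enforce.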
The network we have built until now is summarized below:
Where the red-box outputs are learned in a multitask supervised fashion.
The loss function is calculated according to the following equation:
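As one plausible form of such a multitask loss, the sketch below adds a regression term on y_PVP_N to a classification term on y_alive_N. The choice of mean absolute error and binary cross-entropy, as well as the weights `w_pvp` and `w_alive`, are assumptions made for illustration rather than the exact loss used in the project.

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def multitask_loss(y_pvp, pvp_pred, y_alive, p_alive_pred, w_pvp=1.0, w_alive=1.0):
    """Weighted sum of the revenue loss and the survival loss (assumed form)."""
    return (w_pvp * mean_absolute_error(y_pvp, pvp_pred)
            + w_alive * binary_cross_entropy(y_alive, p_alive_pred))


# Two clients: one still active paying 30.0, one churned (PVP of 0).
loss = multitask_loss(
    y_pvp=[30.0, 0.0], pvp_pred=[28.0, 2.0],
    y_alive=[1, 0], p_alive_pred=[0.5, 0.5],
)
print(loss)
```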
(iii) Churn + Regression Task – Multiple Output
The network above predicts y_PVP_N. However, CLTV is estimated by summing the PVP over several months.
To create a multi-output model, all that is required is to repeat the blocks above, shifting the target by one month for each block.
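The repetition can be sketched as one pair of heads per month, each applying the same survival-times-revenue rule, with CLTV obtained by summing the monthly predictions. The raw head outputs are passed in directly here to keep the example short; the function names and the dictionary keys are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_months(head_outputs):
    """head_outputs: one (alive_logit, pvp_alive_raw) pair per month 1..N."""
    predictions = {}
    for n, (alive_logit, pvp_alive_raw) in enumerate(head_outputs, start=1):
        p_alive = sigmoid(alive_logit)
        pvp_alive = max(0.0, pvp_alive_raw)   # non-negative revenue head
        predictions[f"y_alive_{n}"] = p_alive
        predictions[f"y_PVP_{n}"] = p_alive * pvp_alive
    return predictions


def cltv_from_predictions(predictions, n_months):
    """CLTV over the window as the sum of the predicted monthly PVP."""
    return sum(predictions[f"y_PVP_{n}"] for n in range(1, n_months + 1))


head_outputs = [(0.0, 40.0), (0.0, 40.0), (0.0, 40.0)]  # months 1, 2, 3
predictions = predict_months(head_outputs)
print(cltv_from_predictions(predictions, n_months=3))
```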
The loss function is then calculated as:
(iv) Churn + Regression Task + Taker Task
We can also add an extra task for further explainability: the probability of the client accepting the offer (y_taker), which can be modeled as a business rule in the network by the following equation:
This is the sum of the subscription value if the client accepts the offer and the subscription value if they do not, each weighted by its probability. This expected value is then multiplied by the survival probability at that timestep, as in the previous architecture.
The final equation (which can be followed by a non-negative activation, such as ReLU) is then written as:
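A hedged sketch of how the taker task can enter the rule: the subscription value is first averaged over the accept and reject outcomes, then weighted by the survival probability and passed through ReLU. The names `pvp_if_taker` and `pvp_if_not_taker` are hypothetical labels for the two latent values, not identifiers from the original model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def expected_pvp_with_taker(alive_logit, taker_logit, pvp_if_taker, pvp_if_not_taker):
    p_alive = sigmoid(alive_logit)   # survival probability for this month
    p_taker = sigmoid(taker_logit)   # probability of accepting the offer
    # Expected subscription value over the accept / reject outcomes.
    pvp_offer = p_taker * pvp_if_taker + (1.0 - p_taker) * pvp_if_not_taker
    # Weight by survival, then keep the output non-negative (ReLU).
    return max(0.0, p_alive * pvp_offer)


print(expected_pvp_with_taker(0.0, 0.0, pvp_if_taker=50.0, pvp_if_not_taker=30.0))
```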
This architecture was compared to a base regression model, using the mean absolute error (MAE) between the predicted and actual PVP. Compared to the base regression model, the MAE improved by 50% with the auxiliary tasks of Churn and Taker.
While in some cases this may not have a direct impact on model results (and may even reduce performance, if some of the labels are noisy and contradictory), adding these priors can increase model regularization and reduce overfitting to noisy values, outliers, or sporadic events in time.
We have managed to innovate in the estimation of Customer Lifetime Value by creating a multitask model that can predict customer churn, lifetime value, and propensity to accept an offer. Although DNNs are typically considered black boxes, we show here that, by thinking of DNNs as differentiable programs (as encouraged by Yann LeCun), we can add different tasks to improve interpretability and support decision making.
With this, we can understand which offer is best for improving customer satisfaction and retention. For instance, an offer that leads to a low acceptance propensity and a high probability of churn is probably not the best offer for that customer.
There are several more ways to add domain knowledge to this model, such as a loss function that penalizes errors more heavily for customers in a certain risk group, or constraints ensuring that monotonic features have a monotonic impact, among others.
[1] Lu, J., & Park, O. (2003). Modeling customer lifetime value using survival analysis: an application in the telecommunications industry. Data Mining Techniques, 120-128.
[2] Fader, P. S., Hardie, B. G. S., & Lee, K. L. (2005). RFM and CLV: Using iso-value curves for customer base analysis. Journal of Marketing Research, XLII (November), 415-430.
[3] Chamberlain, B. P., Cardoso, A., Liu, C. H., Pagliari, R., & Deisenroth, M. P. (2017, August). Customer lifetime value prediction using embeddings. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1753-1762). ACM.
[4] Vanderveld, A., Pandey, A., Han, A., & Parekh, R. (2016, August). An engagement-based customer lifetime value system for e-commerce. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 293-302). ACM.