With COVID-19, many people were affected by the economic crisis and lost their jobs. In Portugal alone, between February and September, unemployment increased by 30%! AI can be a powerful tool for allocating scarce resources more efficiently. Inspired by the DSSG Fellowship’s project in partnership with IEFP (Instituto do Emprego e Formação Profissional, Portugal’s employment and vocational training institute), we started to think about how we would help reduce unemployment using AI.
Similar to DSSG’s project, the goals that will be discussed here are:
- Better identify individuals at high risk of long-term unemployment;
- Support more efficient allocation of the employment/training institute’s resources to respond to the needs of unemployed individuals.
This is a summary of an internal, non-exhaustive discussion held purely for learning purposes: there are multiple solutions, all of which depend on the available development time and the data you can access.
Data and Relevant Patterns
Let’s assume that we can gather data when the unemployed person registers on the website platform, and that all job-related interactions are stored until the candidate finds a job. For simplification purposes, we can build our dataset from a list of monthly snapshots of the characteristics of all platform users, each of whom has a unique identifier.
What raw information would be interesting to have?
- Demographic (Age, Gender, Civil State, Address)
- List of previous courses and education level
- Curriculum Vitae
- Years of Experience
- Maximum job distance to home
- Schedule in which the candidate can work
- Number of dependents
How can we encode, in a comparable way, the information from a person with 10 previous courses and from one with only 2?
Encoding Options for List of Previous Courses
i) Label Encoding: We can represent each course as a number. But this means that the course number k is at a higher distance from course k – 5 than course k – 1, which might not be the best way of representing this.
ii) One Hot Encoding: We can represent each course as a new column. With too high dimensionality, this becomes very sparse and there’s no way to know the
iii) Content Based: This approach considers a domain-specific representation, where we can encode the courses by several areas of expertise (e.g. Maths, Chemistry, Programming, etc). As values, we can represent it either as binary (course covers that area or not) or represent it by the average grade of the candidate in those areas.
iv) Embedding Based: From a list of courses, train a vector representation (embedding) from scratch or based on content found in external sources so that similar courses have lower distances between them. We can then use a model that’s able to deal with varying input shapes (e.g. RNN, CNN).
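As a minimal sketch of the content-based option (iii), the snippet below maps a course list of any length onto a fixed-length binary vector of areas. The area list `AREAS`, the mapping `COURSE_AREAS` and the course names are hypothetical and only serve as an illustration; in practice they would come from domain knowledge.

```python
from typing import Dict, List

# Hypothetical areas of expertise and course-to-area mapping (illustration only).
AREAS: List[str] = ["maths", "chemistry", "programming"]
COURSE_AREAS: Dict[str, List[str]] = {
    "Linear Algebra": ["maths"],
    "Intro to Python": ["programming"],
    "Computational Chemistry": ["chemistry", "programming"],
}


def encode_courses(courses: List[str]) -> List[int]:
    """Binary content-based encoding: 1 if any course covers the area, else 0."""
    covered = {area for course in courses for area in COURSE_AREAS.get(course, [])}
    return [1 if area in covered else 0 for area in AREAS]


# Candidates with different numbers of courses get vectors of the same length.
few = encode_courses(["Linear Algebra"])
many = encode_courses(["Linear Algebra", "Intro to Python", "Computational Chemistry"])
print(few)   # [1, 0, 0]
print(many)  # [1, 1, 1]
```

Courses missing from the mapping are silently ignored here; a real pipeline would probably want to log them instead.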
Encoding Options for Curriculum Vitae (CV)
The candidate’s CV is also rich with data – we could use several NLP techniques to extract data from here, such as a Bag of Words or the average word embedding value (which has the advantage of having a semantic representation of words).
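To make the Bag of Words idea concrete, here is a small standard-library sketch. The tokenizer is deliberately naive and the two example CVs are made up; a real pipeline would more likely rely on an NLP library and a proper tokenizer.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word-character tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents: List[str]) -> Dict[str, int]:
    """Assign a column index to every distinct token, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text: str, vocabulary: Dict[str, int]) -> List[int]:
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for token, index in vocabulary.items():
        vector[index] = counts[token]
    return vector


# Made-up CV snippets, for illustration only.
cvs = ["Python developer, Python teacher", "Chemistry teacher"]
vocab = build_vocabulary(cvs)
vectors = [bag_of_words(cv, vocab) for cv in cvs]
print(vocab)
print(vectors)
```

Unlike averaged word embeddings, this representation treats "teacher" and "instructor" as unrelated columns, which is the semantic limitation mentioned above.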
Note on fairness: We could have added a feature related to the candidate’s monthly expenses, as a way to estimate how much money they would need per month. But this could create a harmful feedback loop: if their expenses are low because they cannot afford a better standard of living, the model could keep allocating them to offers with low salaries.
- Unemployment History (e.g. total number of months since last job, statistics of unemployment time in the past – max, mean, etc)
- Search Efforts (such as the total number of interviews, number of interviews above the minimum required threshold, percentage of recommended interviews attended)
- Search Consistency (e.g. standard deviation of days between interviews and of days between applications). Note that these features might be influenced by job offer scarcity, so they can end up introducing unwanted biases into our model!
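The search effort and consistency features listed above could be derived from raw interview dates roughly as follows. This is only a sketch: the function name, its parameters and the example dates are assumptions, and the population standard deviation is one of several reasonable choices.

```python
from datetime import date
from statistics import pstdev
from typing import Dict, List


def search_features(
    interview_dates: List[date],
    minimum_required: int,
    recommended_total: int,
    recommended_attended: int,
) -> Dict[str, float]:
    """Summarise search effort and consistency for one candidate snapshot."""
    ordered = sorted(interview_dates)
    # Days elapsed between consecutive interviews.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    total = len(ordered)
    return {
        "total_interviews": total,
        "interviews_above_minimum": max(0, total - minimum_required),
        "pct_recommended_attended": (
            recommended_attended / recommended_total if recommended_total else 0.0
        ),
        # With fewer than two interviews there are no gaps to measure.
        "std_days_between_interviews": pstdev(gaps) if gaps else 0.0,
    }


# Hypothetical candidate with three interviews in January 2020.
features = search_features(
    [date(2020, 1, 22), date(2020, 1, 1), date(2020, 1, 8)],
    minimum_required=2,
    recommended_total=4,
    recommended_attended=2,
)
print(features)
```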
- Offer Name
- Offer Areas (e.g. Maths, Science, Programming, …)
- Offer Remuneration
- Offered shifts
Ideally, we would represent the offer list in a comparable domain to the candidate’s expertise, so our features could represent the similarity between the offer and the candidate skills. As such, offer areas would also be represented in a Content Based approach.
If we were to use a model that cannot learn this relationship directly (e.g. tree-based models), we could calculate pairwise features, such as the difference between the offer remuneration and the candidate’s monthly expenses, or the intersection between the candidate’s and the offer’s areas.
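A possible shape for such pairwise features is sketched below. The dictionary keys and example values are hypothetical, and the Jaccard index is just one way of turning the "intersection" of areas into a number. Note that the expenses-based feature is subject to the fairness concern raised earlier in the post.

```python
from typing import Any, Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_features(candidate: Dict[str, Any], offer: Dict[str, Any]) -> Dict[str, float]:
    """Candidate-offer features for models that cannot learn the match directly."""
    return {
        "area_overlap": jaccard(candidate["areas"], offer["areas"]),
        "shift_overlap": jaccard(candidate["shifts"], offer["shifts"]),
        # Caution: monthly expenses can encode socio-economic bias.
        "remuneration_minus_expenses": offer["remuneration"] - candidate["monthly_expenses"],
    }


# Hypothetical candidate and offer.
candidate = {
    "areas": {"maths", "programming"},
    "shifts": {"morning", "afternoon"},
    "monthly_expenses": 900.0,
}
offer = {
    "areas": {"programming", "science"},
    "shifts": {"morning"},
    "remuneration": 1200.0,
}
pair = pairwise_features(candidate, offer)
print(pair)
```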
After building all these features, we can create a model that, given a candidate snapshot in a month and a job offer, returns an employability score. There are several ways to model this employability score; all of them can be tested, and the choice should depend on how the model is intended to be used in production:
- Months until a job is found: This tells us how many more months the candidate will need to receive a monthly allowance from the government, so we can prioritize cases according to this prediction.
- Number of interviews until the candidate is accepted: This doesn’t tell us anything about the timeframe, since the candidate can do several interviews in a month or no interviews for several months.
- Multiclass risk classification, from Low to High, depending on a threshold (this is what was done in the DSSG Fellowship): This is not as actionable as the previous options. However, a model might learn these patterns more easily, as the problem is reduced to an ordinal classification.
Let’s assume, for now, the model gives us the months until the candidate finds a job.
Reliable estimation of the model performance
We can evaluate the model performance using regression metrics such as the Mean Absolute/Squared Error and the Spearman/Pearson correlation coefficients between the target and the predicted value.
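For reference, here is a dependency-free sketch of two of these metrics: the Mean Absolute Error and the Spearman correlation, computed as the Pearson correlation of average ranks. The example targets and predictions are made up; in practice a library such as SciPy or scikit-learn would typically be used instead.

```python
from math import sqrt
from typing import List, Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, where tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up "months until employment" targets and predictions.
y_true = [1, 3, 6, 12]
y_pred = [2, 3, 5, 9]
print(mean_absolute_error(y_true, y_pred))  # 1.25
print(spearman(y_true, y_pred))
```

The two metrics answer different questions: the error tells us how far off the predicted months are, while the rank correlation tells us whether candidates are ordered correctly for prioritization.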
Time is an important component in our learning system, and, as such, we must pay attention to the way we split our data for performance estimation. Random splits might “leak” information in training, giving overestimations of the model performance.
As we’re building a dataset with several rows for the same candidate in different months, having the same candidate in both train and test, even in different months, would let the model know whether that candidate found a job or not.
An initial approach is grouping the train/test splits by candidate (A is in train, B is in test, in the above example).
However, this approach still suffers from time leakage. Imagine that you were training the model with data from October 2020. Because a big shift in employment started in March 2020, the model already knows that the average value of “months until employment” has increased, so you could also overestimate the model performance on predictions for earlier months.
A possible solution is to do both temporal and grouped stratification: for instance, train with a list of applicants from the previous year, and test with a list of other applicants from the current year. In the above example, you could train with data from 2019, with a candidate list not considering A and B, and test it on 2020, in candidates A and B.
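The combined temporal and grouped split described above can be sketched as follows. The `Snapshot` class, the candidate identifiers and the dates are hypothetical; the point is only that the train and test sets share neither candidates nor time periods.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One monthly snapshot of a candidate (hypothetical minimal schema)."""
    candidate_id: str
    year: int
    month: int


def temporal_group_split(
    snapshots: List[Snapshot], test_year: int, test_candidates: Set[str]
) -> Tuple[List[Snapshot], List[Snapshot]]:
    """Train on earlier years and other candidates; test on held-out candidates in test_year."""
    train = [
        s for s in snapshots
        if s.year < test_year and s.candidate_id not in test_candidates
    ]
    test = [
        s for s in snapshots
        if s.year == test_year and s.candidate_id in test_candidates
    ]
    return train, test


snapshots = [
    Snapshot("A", 2019, 11), Snapshot("A", 2020, 4),
    Snapshot("B", 2020, 5),
    Snapshot("C", 2019, 3), Snapshot("C", 2020, 6),
    Snapshot("D", 2019, 7),
]
train, test = temporal_group_split(snapshots, test_year=2020, test_candidates={"A", "B"})
print(train)  # snapshots of C and D from 2019
print(test)   # snapshots of A and B from 2020
```

Rows that satisfy only one of the two conditions (A in 2019, C in 2020) are dropped from both sets, which is the price paid for avoiding both kinds of leakage.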
We can optimize the employment institute’s resources by calculating the model prediction for a list of available job/training offers, and having a cost function that tells us how good each offer is for a given candidate. We will not discuss this extensively in this blog post.
Imagine, for instance, we want to create an email marketing campaign where we send a list of K offers to each candidate. We can choose between (at least) two types of actions:
- Improving the candidate’s skills (e.g., courses).
- Improving the candidate’s exposure to jobs (e.g., interviews).
This is an assignment problem with a huge number of possible combinations and budget/time constraints.
Two possible cost functions for this problem are:
- argmax [ f(Candidate) – f(Candidate + Offer) ], which gives us the reduction in months until the candidate finds a job, for a given offer. We then want to find the offer that maximizes this reduction.
- argmax [ (f(Candidate) – f(Candidate + Offer)) × UnemploymentAllowance – Costs(Offer) ], which weighs the unemployment allowance saved against the processing costs of a given offer, thereby incorporating business metrics such as the money required to support that candidate.
This can then be optimized efficiently, for instance, with the use of metaheuristics, if the list of possible options is very large.
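As a rough illustration of the second cost function, the sketch below ranks offers for a single candidate by net saving, assuming the offer cost is subtracted from the allowance saved. `toy_predict_months` is a hypothetical stand-in for the trained model f, and all names and numbers are made up. It scores every offer exhaustively and ignores budget constraints, so it is not a metaheuristic; it only shows what such a search would be optimizing.

```python
from typing import Any, Callable, Dict, List, Optional

Candidate = Dict[str, Any]
Offer = Dict[str, Any]
Predictor = Callable[[Candidate, Optional[Offer]], float]


def toy_predict_months(candidate: Candidate, offer: Optional[Offer] = None) -> float:
    """Hypothetical stand-in for f: predicted months until employment."""
    months = float(candidate["baseline_months"])
    if offer is not None:
        shared = len(set(candidate["areas"]) & set(offer["areas"]))
        months = max(0.0, months - 2.0 * shared)  # toy rule, not a real model
    return months


def net_saving(candidate: Candidate, offer: Offer, predict: Predictor, allowance: float) -> float:
    """(f(Candidate) - f(Candidate + Offer)) * allowance - Costs(Offer)."""
    months_saved = predict(candidate, None) - predict(candidate, offer)
    return months_saved * allowance - float(offer["cost"])


def best_offers(
    candidate: Candidate, offers: List[Offer], predict: Predictor, allowance: float, k: int = 1
) -> List[Offer]:
    """Return the k offers with the highest net saving for this candidate."""
    ranked = sorted(
        offers, key=lambda o: net_saving(candidate, o, predict, allowance), reverse=True
    )
    return ranked[:k]


candidate = {"baseline_months": 10, "areas": ["programming", "maths"]}
offers = [
    {"name": "Python course", "areas": ["programming"], "cost": 300},
    {"name": "Data analyst interview", "areas": ["programming", "maths"], "cost": 1500},
    {"name": "Lab technician interview", "areas": ["chemistry"], "cost": 0},
]
top = best_offers(candidate, offers, toy_predict_months, allowance=500.0, k=2)
print([o["name"] for o in top])  # ['Python course', 'Data analyst interview']
```

With many candidates, limited slots per offer and a shared budget, the per-candidate ranking no longer suffices, which is where assignment solvers or metaheuristics come in.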
There are some ethical concerns with such cost functions:
- Are we minimizing the average time for employment at the expense of offering jobs with lower salaries?
- Will optimizing this leave some people in “starvation” of job offers just because it’s harder to find good offers for them?
To account for this, we can include extra terms in the cost function, for example through a multi-objective search that enforces time constraints for finding jobs for everyone while, at the same time, reducing the average time people spend without a job.
This blog post was written after an internal discussion on this topic. Naturally, it covers only a small fraction of what could be done. If you’re interested in having such discussions about a specific business problem of yours, make sure to contact us!