This blog post describes the key points of my participation in the 2021 edition of World Data League. The Tech Moguls team – myself, Tiago Gonçalves, Tomé Albuquerque and Joana Morgado, from INESC TEC – finished in second place.

World Data League (WDL) is a Data Science competition where teams of data scientists work to solve social problems using data. There were four main topics – Public Transportation, Traffic, Cycling and Environment. Each one was broken down into four smaller sub-topics, each sub-topic originating a two-week stage.

The finals were a 3-day event with the top 10 finalists, focused on Noise Pollution. All the code, data and challenge descriptions are available on the official WDL GitLab. For more detail than provided here, please see the notebooks linked below.

The way we think about problem solving at NILG.AI shaped my participation in this challenge. The whole Lean Data Science pipeline – creating a baseline solution, thinking about how the end user could use it, and calculating business metrics alongside technical metrics – is very relevant when developing solutions that generate value. In this blog post, we share our way of reasoning.

Stage 1 – Public Transportation

In this stage, we covered churn models in public transportation. The dataset contained two periods of time (including the COVID-19 lockdown periods in Portugal) with the average number of bus users per day, aggregated by location, gender and age group. The goal was to identify churn profiles and propose measures to reduce churn.

For this, we used a Decision Tree, trained on variables we considered relevant, to predict the probability of a given segment increasing or decreasing its usage of public transport across the two periods, creating groups that could explain churn. The tree's branches give us information about each segment and its size.
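As an illustration, the segmentation step can be sketched with scikit-learn, using hypothetical feature names and synthetic data (the actual dataset columns differ):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical segment-level dataset: one row per (location, gender, age group)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "population_density": rng.uniform(50, 5000, 200),
    "unemployment_change": rng.normal(0, 0.1, 200),
    "age_group_25_34": rng.integers(0, 2, 200),
})
# Target: did the segment's average daily bus usage decrease between the two periods?
df["churned"] = (df["unemployment_change"] + rng.normal(0, 0.05, 200) > 0).astype(int)

# A shallow tree keeps the segments interpretable
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=0)
tree.fit(df.drop(columns="churned"), df["churned"])

# Each leaf is a segment; its class distribution gives the churn propensity
print(export_text(tree, feature_names=list(df.columns[:-1])))
```

Each leaf of the fitted tree corresponds to one candidate segment, and the number of samples in the leaf gives its size.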

(Built using DTreeViz)

We discovered two segments with a high propensity to churn:

  • Users from the South of Portugal whose age group is neither 65+ nor 25–34;
  • Users aged 25–34, spread across the whole country.

The most relevant variables that explained this decrease were:

  • Population Density in the County and District
  • Relative Change in Unemployment
  • Variability in demand, extracted from the origin-destination matrix, which can serve as a proxy for how easily flow moves out of a county into its parishes
  • Age Group

Stage 2 – Traffic

For stage 2, we worked on predicting traffic flow in the city of Porto using induction loop sensors. The available data came from traffic, weather and air quality sensors spread throughout the city, covering three years.

The plot below shows a typical example of the average traffic flow in the city of Porto: it decreases during the night and starts to increase around 04:00/05:00, when people start their working routines. It keeps increasing until 10:00/11:00, remains approximately stable until the end of working hours (18:00/19:00), and decreases afterwards.

Our solution focused on forecasting traffic 24 hours ahead for the city of Porto. We used an XGBoost classifier with weather features (current and forecast), historical intensity features (e.g. average intensity in the past), date features (whether it's a holiday/weekend at prediction time, hour, weekday, …) and sensor position features (distance to the sensor centroid).
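The feature engineering can be sketched as follows, with hypothetical column names (the model itself is a standard XGBoost classifier, omitted here):

```python
import pandas as pd

# Hypothetical hourly readings from one induction loop sensor
df = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=24 * 14, freq="h"),
    "intensity": range(24 * 14),
})

# Date features at prediction time (24 hours ahead of the current reading)
target_time = df["timestamp"] + pd.Timedelta(hours=24)
df["hour"] = target_time.dt.hour
df["weekday"] = target_time.dt.dayofweek
df["is_weekend"] = (df["weekday"] >= 5).astype(int)

# Historical intensity features: rolling mean over the past 24 hours,
# and the intensity observed at the same hour one week earlier
df["mean_last_24h"] = df["intensity"].rolling(24).mean()
df["same_hour_last_week"] = df["intensity"].shift(24 * 7)

print(df.tail(3)[["hour", "is_weekend", "mean_last_24h", "same_hour_last_week"]])
```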

This could be used, for instance, for dynamically changing traffic light frequency depending on the area and time of the day.

This stage had another outcome – the acceptance of a paper for SoGood 2021 – The 6th Workshop on Data Science for Social Good (Paulo Maia, Joana Morgado, Tiago Gonçalves and Tomé Albuquerque – Applying Machine Learning for Traffic Forecasting in Porto, Portugal)!

Stage 3 – Cycling

In the third stage, we worked on (literally) paving the way towards safer cities. In this challenge, we had access to Google Street View images of Lisbon, taken at four different angles (0–360º), and the goal was to estimate a perceived road-safety score based on the objects in each image.

We labeled a subset of images with the following classes:

  • Irrelevant view: the street is not fully visible or the image is just pointing at a wall;
  • Street width: a single car could fit in the street vs. more than one car could fit there;
  • Pavement Type: parallels (paralelo), tar (alcatrão) or dirt (terra batida);
  • Pavement quality: low, mid or high.

A pre-trained car detection model was also used to count the number of cars in each image, which allowed us to use traffic intensity as a proxy for danger.
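Assuming a generic detector that returns (label, confidence) pairs per image, the counting step could look like this (the output format and threshold below are illustrative, not the actual detector interface we used):

```python
def count_cars(detections, min_confidence=0.5):
    """Count car detections above a confidence threshold.

    detections: list of (label, confidence) tuples, the assumed
    output format of a pre-trained object detector.
    """
    return sum(
        1 for label, confidence in detections
        if label == "car" and confidence >= min_confidence
    )

# Example detector output for one image
detections = [("car", 0.92), ("car", 0.41), ("person", 0.88), ("car", 0.76)]
print(count_cars(detections))  # → 2 (the 0.41 detection is below the threshold)
```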

As an example, here’s a subset of images that were labeled as irrelevant:

Afterwards, we associated a risk score with the presence of each of these, and averaged the score for each angle – which could be used for creating a street-level risk map.
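A minimal sketch of this scoring rule, with made-up risk weights (the weights and label names we actually used differed):

```python
# Hypothetical risk weight per label
RISK = {
    "pavement_paralelo": 0.6,
    "pavement_tar": 0.1,
    "pavement_dirt": 0.8,
    "narrow_street": 0.5,
    "low_pavement_quality": 0.7,
}

def image_risk(labels, n_cars, car_weight=0.1):
    """Risk score of one image: summed label risks plus a car-count term."""
    return sum(RISK.get(label, 0.0) for label in labels) + car_weight * n_cars

def location_risk(per_angle):
    """Average the image risk over the viewing angles of one location."""
    return sum(image_risk(labels, n_cars) for labels, n_cars in per_angle) / len(per_angle)

# Four angles of the same location: (labels, number of cars detected)
angles = [
    (["pavement_tar"], 0),
    (["pavement_tar"], 1),
    (["pavement_paralelo", "narrow_street"], 3),
    (["pavement_paralelo"], 2),
]
print(round(location_risk(angles), 3))  # → 0.625
```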

The groups of 4 images below show low risk and high risk scores, respectively, based on our established rules.

Images with the lowest risk scores have pavement type "alcatrão"/tar, high pavement quality (no visible cracks) and no cars present.

Images with the highest risk scores have pavement type "paralelo" and visible cars.

Stage 4 – Environment

For the fourth stage, we worked on the optimisation of outdoor advertisements in cities. Cities are flooded with countless outdoor advertising panels, often poorly distributed. Visual aspects are crucial in the urban planning process, since each planning choice can obstruct urban elements, producing adverse effects on the city's image.

The dataset contained the coordinates of several billboards, as well as the average number of visitors.

Our approach considered that we could not create billboards at arbitrary coordinates: since we do not know which coordinates are valid billboard locations, a billboard removed from one location had to be placed at another existing location.

We developed a metaheuristics-based algorithm (local/neighbourhood search) that reduces the outdoor-billboard density (the number of billboards within a given radius) while increasing the total number of views. We create neighbour solutions through swap operations, changing the coordinates of a given billboard and assessing the impact on a fitness function that takes both objectives into account.
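The neighbourhood search can be sketched as follows; the locations, view counts, radius and penalty weight are all made up, and the real fitness function differs:

```python
import random

random.seed(0)

# Hypothetical billboard locations (x, y) and their average number of views
locations = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (10, 10)]
views = [100, 80, 60, 200, 150, 90]
RADIUS = 2.0

def density_penalty(active):
    """Number of pairs of active billboards closer than RADIUS (to be reduced)."""
    pts = [locations[i] for i in active]
    return sum(
        1
        for i in range(len(pts))
        for j in range(i + 1, len(pts))
        if (pts[i][0] - pts[j][0]) ** 2 + (pts[i][1] - pts[j][1]) ** 2 < RADIUS ** 2
    )

def fitness(active):
    """Total views minus a density penalty (the weight is illustrative)."""
    return sum(views[i] for i in active) - 50 * density_penalty(active)

# Local search: keep the number of billboards fixed, swap one active
# location with an unused one, and accept the swap if fitness improves
active = set(range(4))
for _ in range(200):
    removed = random.choice(sorted(active))
    added = random.choice([i for i in range(len(locations)) if i not in active])
    candidate = (active - {removed}) | {added}
    if fitness(candidate) > fitness(active):
        active = candidate

print(sorted(active), fitness(active))
```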

Finals – Noise Pollution

The WDL finals had a single challenge – Improving quality of life by reducing city noise levels. 

The provided data was about noise sensors in the city of Torino, Italy, as well as points of interest and police complaints.

We developed an explainable XGBoost classifier capable of predicting, per neighbourhood, the probability of noise levels exceeding the legal limit at the same time on the next day. A second model was used to predict the volume of complaints. Finally, both models were combined into an expected annoyance value: the probability of the noise exceeding the threshold level AND causing annoyance (according to tabulated values in the literature) AND causing a complaint. This makes the decision very actionable, as it combines the reasoning behind the negative consequences.
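Under an independence assumption between the three events, the combination step reduces to a product of probabilities (a sketch; the inputs below are made up):

```python
def expected_annoyance(p_exceed, p_annoy, p_complaint):
    """Combine the three probabilities into one actionable score.

    p_exceed: predicted probability that noise exceeds the legal limit tomorrow
    p_annoy: annoyance probability at that noise level (tabulated in the literature)
    p_complaint: predicted probability of a complaint being filed
    Treats the events as independent, which is a simplification.
    """
    return p_exceed * p_annoy * p_complaint

print(round(expected_annoyance(0.8, 0.6, 0.5), 2))  # → 0.24 with these made-up inputs
```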

The users of this solution could be the local police forces: knowing that the probability of the noise level exceeding the limit in a given area the next day is high, they can optimise patrols and organise their teams. By presenting the likely cause of a high probability, the model also lets the police know in advance what to expect on the spot.


This blog post described my participation in the 2021 WDL competition – an insights report with a summary of all the project outcomes is now available for you to check out.

As always, you can contact us if you have a use case similar to what we covered here and would like to know how you could generate value from it!