I recently had the humbling experience of working with the non-profit organization Bridges to Prosperity, where I learned one of the most valuable lessons of my career thus far: sometimes the most valuable deliverable you can create is not a machine learning model, but an explanation of why the client’s data does not contain enough signal to create one.
Bridges to Prosperity is an organization that builds bridges, primarily in East Africa, to ensure safe, reliable access to markets, hospitals, and educational facilities. The main problem with the lack of a reliable river crossing is how it affects people’s decision-making. For example, it would be dangerous for a girl to cross the river to go to school if her parents couldn’t be certain she would be able to get back when the school day ends. If the river flooded while she was at school, she would have to spend the night on the opposite side instead of going home, an unacceptable risk for many families.
Bridges are also important to the farmers who make up a large portion of the local economy in these regions. When farmers can’t reliably cross the river to bring their crops to market, it hurts them financially, especially when their crops rot before they’re able to cross the river. Being able to reliably bring their crops to market gives them the peace of mind to plan ahead and plant their crops at the optimal time, knowing they will be able to go to market as soon as they’re harvested regardless of river levels.
But Bridges to Prosperity faces an efficiency problem when deciding where to build their bridges. Once a site is identified, they send someone out to do a quick-and-dirty preliminary technical site evaluation. In theory, this should screen out sites that are technically infeasible so they don’t waste time sending their senior engineers out to those sites for a week of intensive data collection. In practice, they found the preliminary data collection created a lot of false negatives.
They wanted either a model that could predict which sites were incorrectly flagged for rejection, or the knowledge that there is simply not enough signal contained in the preliminary data to make those site evaluations worth doing. We delivered the latter.
My role in this project was to act as the machine learning engineer. I began by using our group’s Trello board to describe the problem and break it down into tasks, such as:
Explore the data.
Write a function for wrangling the data.
Explore options for machine learning models.
Create visualizations to explain the models.
The largest technical challenge was the lack of labeled data. Though we had over 1,400 rows of data, we had only 89 rows of labeled data to work with. I started out with several supervised machine learning approaches such as random forest, logistic regression, and passive-aggressive classifiers before settling on a semi-supervised approach using scikit-learn’s LabelSpreading.
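For readers unfamiliar with label spreading, here is a minimal sketch of the kind of setup this implies, assuming the features have already been wrangled into a numeric matrix X and a label vector y in which unlabeled rows are marked -1 (scikit-learn’s convention). The kernel and neighbor count are illustrative, not necessarily the values I ended up with.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# X: numeric feature matrix for all ~1,400 sites (NumPy array)
# y: 1 = suitable, 0 = unsuitable, -1 = unlabeled

labeled = y != -1
X_train, X_test, y_train, y_test = train_test_split(
    X[labeled], y[labeled], test_size=0.25, stratify=y[labeled], random_state=42
)

# The unlabeled rows go into training alongside the labeled ones,
# still marked -1 so LabelSpreading knows to propagate labels onto them.
X_semi = np.vstack([X_train, X[~labeled]])
y_semi = np.concatenate([y_train, y[~labeled]])

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X_semi, y_semi)

# Evaluate only on held-out labeled rows
print(classification_report(y_test, model.predict(X_test)))
```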
I also tried several approaches to feature engineering. Quite a few features contained a lot of text data or very high-cardinality categorical variables, so one approach I took in my wrangle function was to concatenate all of those into one large string per row, use a TfidfVectorizer to create a document-term matrix, and then merge that back onto the original data frame of numerical features I was using.
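A simplified sketch of that approach looks something like this; the function and column names are placeholders, not the exact code from my wrangle function.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def wrangle_text_features(df, text_cols, numeric_cols, max_features=300):
    """Concatenate the text / high-cardinality columns into one string per row,
    build a TF-IDF document-term matrix, and join it onto the numeric features."""
    combined = df[text_cols].fillna("").astype(str).agg(" ".join, axis=1)

    vectorizer = TfidfVectorizer(max_features=max_features, stop_words="english")
    dtm = vectorizer.fit_transform(combined)

    dtm_df = pd.DataFrame(
        dtm.toarray(),
        columns=vectorizer.get_feature_names_out(),
        index=df.index,
    )
    return df[numeric_cols].join(dtm_df)
```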
While the TF-IDF approach did improve model accuracy, it improved it in the wrong way. Like the supervised learning techniques I’d started out with, it created an overfit model that gave a predicted probability of ~73% to every single observation, turning it into nothing more than a simple majority classifier.
In the end, I scrapped that approach and went with something much simpler that used just 12 features. Though the model had lower accuracy, its judgments were at least more sophisticated than a simple majority classifier.
One of the problems with every model I made was that it was not using features that indicate whether a site is technically feasible, but rather whether it is a high-priority site. This is because our labeled data was less a record of technical feasibility than an indication of whether a site had already been chosen, and that choice was highly influenced by factors besides technical feasibility.
For example, here are the top 20 feature importances for one of the random forest models I experimented with.
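A sketch of the kind of code that produces such a chart, assuming the 89 labeled rows live in a DataFrame X_labeled with labels y_labeled (hypothetical names); the hyperparameters shown are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# X_labeled, y_labeled: the labeled rows as a DataFrame and a label series
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_labeled, y_labeled)

# Rank impurity-based importances and plot the top 20 as a horizontal bar chart
importances = pd.Series(rf.feature_importances_, index=X_labeled.columns)
importances.nlargest(20).sort_values().plot(kind="barh", figsize=(8, 6))
plt.title("Top 20 feature importances (mean decrease in impurity)")
plt.tight_layout()
plt.show()
```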
While latitude, longitude, and days per year the river is flooded might be good indicators of whether a bridge was likely to have already been built, they are not good indicators of technical feasibility. They show that Bridges to Prosperity started in a certain region and is gradually expanding outward, so bridges further east are less likely to be completed already.
Additionally, the top features indicate that Bridges to Prosperity is likely to prioritize sites that are comparatively wider and thus harder to cross by alternative means, are flooded more days of the year, and result in more injuries and deaths. The more technical features, such as the height differential between banks, bridge type, and 4WD accessibility, matter much less to the models, showing that the models are predicting not technical feasibility so much as whether Bridges to Prosperity is likely to make a site a priority.
In the end, I decided to deliver results from the semi-supervised model in the form of both predictions and predicted probabilities, though the model was objectively not very good.
Model stats:
Test accuracy: 57% (a majority classifier would be ~73%).
Suitable bridge site: recall 70%, precision 70%.
Unsuitable bridge site: recall 25%, precision 25%.
Though it is important for the client to know the model cannot really predict whether a bridge is technically feasible and produces many false positives, looking at the sites it classifies as suitable with high probability (>95%) may still help them create a shortlist of sites to prioritize: ones within range of their normal operations that are the most dangerous to cross.
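As a rough sketch of how that shortlist could be pulled from the predicted probabilities, reusing the fitted model from the earlier snippet; the sites DataFrame and its site_name column are hypothetical, and I’m assuming class 1 corresponds to “suitable.”

```python
# sites: DataFrame of candidate sites, aligned row-for-row with the feature matrix X
idx_suitable = list(model.classes_).index(1)          # column for the "suitable" class
p_suitable = model.predict_proba(X)[:, idx_suitable]  # probability of suitability per site

shortlist = (
    sites.assign(p_suitable=p_suitable)
         .query("p_suitable > 0.95")
         .sort_values("p_suitable", ascending=False)
)
print(shortlist[["site_name", "p_suitable"]].head(10))
```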
This was a humbling experience for me; I am not used to being unable to deliver what the client originally wanted, and I was not comfortable with it. I learned that sometimes there is simply not enough signal in the data, and there’s nothing to be done about that.
I even tried my hand at creating synthetic data for this project using an LSTM model. Though I was able to blow up the data from fewer than 90 labeled instances to over 10,000, model performance remained exactly the same. When data has enough signal but there aren’t enough instances of that signal for the model to pick up on it, synthetic data works very well. When there is simply not enough signal contained in the data, no amount of amplification is going to help.
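The check itself was simple: hold the labeled test set fixed, train once on the real data and once on the real data plus the synthetic rows, and compare. A rough sketch, reusing the names from the earlier snippet, with generate_synthetic_rows standing in for the LSTM-based generator (not shown here):

```python
from sklearn.metrics import accuracy_score

# generate_synthetic_rows() is a hypothetical stand-in for the LSTM-based generator
X_synth, y_synth = generate_synthetic_rows(X_train, y_train, n_rows=10_000)

baseline = LabelSpreading(kernel="knn", n_neighbors=7).fit(X_semi, y_semi)
augmented = LabelSpreading(kernel="knn", n_neighbors=7).fit(
    np.vstack([X_semi, X_synth]), np.concatenate([y_semi, y_synth])
)

# Same untouched test set for both, so any difference comes from the synthetic rows
print("baseline: ", accuracy_score(y_test, baseline.predict(X_test)))
print("augmented:", accuracy_score(y_test, augmented.predict(X_test)))
```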
I came into this project with a find-a-solution-or-bust mindset. I came out of it knowing that sometimes what a client wants is simply not possible, and that the most responsible thing to do is to be honest and upfront with them, reset their expectations, and find other ways to use your skills to deliver them as much value as possible.