Imbalanced Data In Machine Learning

Semo Edam
Nov 19, 2020

This past month I had the chance to work on a real-world problem with Bridges To Prosperity, a non-profit organization that helps build footbridges in East African communities. The bridges connect people who are isolated by a lack of transportation infrastructure to local necessities like schools, hospitals, and markets. This initiative has already changed millions of people’s lives, and the organization aims to build an estimated 100,000 bridges around the world to serve 1 billion people.

What is imbalanced data?
A dataset is imbalanced if at least one of its classes makes up only a small minority of the samples. Imbalanced data is common in finance, banking, insurance, health sciences, engineering, and many other fields.

Why do we need to deal with imbalanced data?
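Simply because a model trained on imbalanced data tends to be biased toward the dominant target class: it can score well overall while predicting the minority class poorly. Dealing with the imbalance makes the predictions on the minority class more reliable.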
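
To see how overall accuracy can hide this bias, here is a minimal sketch with made-up labels (950 of one class and 50 of the other; these numbers are illustrative assumptions, not the Bridges To Prosperity data). A "model" that always predicts the dominant class looks accurate while never finding a single minority case.

```python
# Illustrative labels only: 0 = dominant class, 1 = minority class.
y_true = [0] * 950 + [1] * 50

# A degenerate "model" that always predicts the dominant class.
y_pred = [0] * len(y_true)

# Share of all predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# How many true minority cases the model actually identified.
minority_total = y_true.count(1)
minority_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {accuracy:.2f}")
print(f"minority cases found: {minority_found} of {minority_total}")
```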

How to deal with imbalanced data?
Most machine learning algorithms work best when the number of instances in each class is roughly equal. That is the ideal condition, but real datasets rarely meet it.
In my example, the stakeholder provided an imbalanced dataset, and my job was to clean it and build a model to predict whether a bridge would be approved or rejected by the engineer. Besides cleaning the data, I had to split it into three sets: two labeled sets used for model training, and one unlabeled set for which the model predicted the labels.

Link to the dataset

After the split, the data is clearly imbalanced as shown below.
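
A quick way to check the balance of a labeled set is to count the rows per class. The sketch below uses a hypothetical list of labels (90 of class 0 and 10 of class 1) purely for illustration; the helper name `class_distribution` and the counts are assumptions, not values from the real dataset.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, share)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in counts.items()}

# Hypothetical labels for illustration; not the real dataset.
labels = [0] * 90 + [1] * 10

dist = class_distribution(labels)
for label, (n, share) in dist.items():
    print(f"class {label}: {n} rows ({share:.0%})")

# Ratio of the largest class to the smallest class.
sizes = [n for n, _ in dist.values()]
imbalance_ratio = max(sizes) / min(sizes)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```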

The technique that worked best for this case was SMOTE (Synthetic Minority Oversampling Technique). SMOTE is not a model itself but a resampling step applied before training: it synthesizes new minority-class examples by interpolating between existing minority samples and their nearest neighbors, which balances the data. To validate how the model was performing, I used cross-validation.
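
The core idea of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the code used in this project: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical, and it only handles numeric features.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from minority-class feature vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Sort the other minority samples by distance to the chosen one.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Interpolate at a random position between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical 2-feature minority samples, for illustration only.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))
```

In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn library (`imblearn.over_sampling`) is the safer choice. When combining oversampling with cross-validation, the resampling should be applied only to the training portion of each fold, so that synthetic points never end up in the data used for validation.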

The precision score was 92%.
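
For reference, precision is the share of predicted positives that are truly positive, TP / (TP + FP). On imbalanced data it is worth reading it alongside recall, TP / (TP + FN), which shows how many of the true positives were found. The sketch below computes both on hypothetical labels and predictions; the values are illustrative assumptions and not the results from this project.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```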

Conclusion
The challenge of working with imbalanced datasets is that most machine learning techniques will largely ignore the minority class and, in turn, perform poorly on it, even though performance on the minority class is typically what matters most.
