7 Ways to Handle Class Imbalance in Machine Learning
Class imbalance is one of the biggest challenges in classification tasks. It occurs when one class of the target variable has far fewer data points than the other.
If such imbalanced data is fed to a learning model, the model will usually generalize well on the majority class but fail to learn the pattern of the minority class, so the resulting model is effectively overfit.
It will perform well on the training data, but when new data arrives it will most likely predict minority-class samples as the majority class. Accuracy is therefore a misleading metric for imbalanced classification; we should instead check the ROC AUC score, recall, precision, the confusion matrix, and so on.
Class ratios that are typically considered imbalanced include:
- 80:20
- 70:30
- 90:10
- 60:40
There are multiple ways to handle this problem. Let us explore each of them.
Collecting more data
This is the most overlooked method, but it genuinely helps. In many cases, simply collecting more data for the minority class solves the problem. If that is not possible, the following techniques can help balance the data.
Changing the metrics
Data science enthusiasts are delighted when they achieve 99% accuracy on a classification task. But by definition, accuracy is simply the number of correctly predicted points divided by the total number of points; it says nothing about where the mispredictions fall. In such cases we need to know which metrics will lead us to a generalized model, and that choice comes from domain analysis.
Consider a bank that wants to predict whether a customer will default on their credit terms.
Now suppose the model has 99% accuracy but predicts "no default" for every customer, and at a later stage some of those customers do default. The model has clearly mispredicted. Our aim should be a model that neither predicts a defaulter as a non-defaulter nor a non-defaulter as a defaulter, so both precision and recall must be high here.
Consider another case where the bank wants to sell a new product and the task is to predict whether a customer will buy it. Here we should build a model with high precision: if we predict that a non-buyer is going to buy, marketing money will be spent on that customer and the bank will suffer a loss.
In another case we want to predict whether a person has heart disease. Here the model should not predict that a person with heart disease does not have it (that would risk a life), so recall should be high.
So the evaluation metric should be selected based on the domain you are working in.
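As a minimal sketch of how these metrics might be checked with scikit-learn, here the synthetic dataset, the train/test split, and the logistic regression model are assumptions chosen purely for illustration:

```python
# Minimal sketch: evaluating an imbalanced classifier with metrics beyond accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic dataset with roughly a 90:10 class ratio (illustrative assumption)
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the minority class

print("Accuracy :", accuracy_score(y_test, y_pred))   # can look deceptively high
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```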
Resampling the data
Resampling can help prevent overfitting, especially on imbalanced data. Stratified K-fold cross-validation, which preserves the class ratio in every fold, is a good choice in such cases.
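A minimal sketch of stratified K-fold cross-validation with scikit-learn; the synthetic dataset, the recall scoring choice, and the estimator below are illustrative assumptions:

```python
# Stratified K-fold keeps the class ratio roughly the same in every fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=skf, scoring="recall")
print("Recall per fold:", scores)
```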
Try to Generate Synthetic Samples
In the imbalanced-learn (imblearn) package, which extends scikit-learn, we have multiple resamplers such as RandomOverSampler, RandomUnderSampler, and the synthetic-sample generator SMOTE that can rebalance the classes in the dataset. Each has its own advantages and disadvantages.
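A minimal sketch of generating synthetic minority samples with SMOTE from the imbalanced-learn package (pip install imbalanced-learn); the synthetic dataset and its 90:10 imbalance are assumptions for illustration:

```python
# SMOTE synthesizes new minority-class points by interpolating between
# existing minority-class neighbours.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("Before resampling:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After SMOTE:      ", Counter(y_res))
```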
Use a machine learning algorithm that internally handles imbalanced data.
Bagging and boosting algorithms are good choices here.
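As a hedged sketch, the example below compares imbalanced-learn's BalancedBaggingClassifier (bagging with per-bootstrap resampling) against plain gradient boosting from scikit-learn; the dataset, split, and default hyperparameters are illustrative assumptions:

```python
# Ensemble learners on an imbalanced dataset.
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("Balanced bagging ", BalancedBaggingClassifier(random_state=0)),
                  ("Gradient boosting", GradientBoostingClassifier(random_state=0))]:
    clf.fit(X_train, y_train)
    print(name, "recall:", recall_score(y_test, clf.predict(X_test)))
```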
Penalize the model
Penalized classification imposes an additional cost on the model for classification mistakes on the minority class during training. By penalizing these mistakes we force the model to pay more attention to the minority class. Explore the class-weight and regularization parameters of the model you are using for this kind of penalization.
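A minimal sketch of penalizing minority-class mistakes through class weights, which many scikit-learn estimators expose via the class_weight parameter; the dataset and the logistic regression model are illustrative assumptions:

```python
# class_weight="balanced" weights each class inversely to its frequency,
# so errors on the minority class cost more during training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
print("Minority-class recall:", recall_score(y_test, clf.predict(X_test)))
```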
Use anomaly detection algorithms.
Isolation Forest and Local Outlier Factor are readily available methods for detecting and removing anomalies from the data, making the resulting model more generalized.
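A minimal sketch of flagging anomalies with Isolation Forest in scikit-learn; the synthetic dataset and the contamination value are assumptions that would need tuning on real data:

```python
# Isolation Forest assigns -1 to points it considers anomalies and 1 otherwise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

iso = IsolationForest(contamination=0.05, random_state=0)
flags = iso.fit_predict(X)          # -1 = anomaly, 1 = normal
print("Points flagged as anomalies:", np.sum(flags == -1))
```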
Most real-world data is imbalanced, and as data scientists we should treat imbalanced data accordingly so that the model we create is as generalized as possible.