Class Imbalance in Supervised Machine Learning: What is it and How can we Handle it?
Introduction
Classification is a type of supervised machine learning where observations with their associated attributes are assigned to a unique class. This is done by using an algorithm to study the relationship between the input data and the target class and then using that information to predict the class an observation belongs to. The number of unique classes could be two (binary) or more (multiclass classification).
During classification, we may encounter a situation where the frequency of samples in one class is overwhelmingly higher than in the other. We fit a classifier, check its accuracy on the test set, and excitedly discover that the accuracy is 98%. But when we inspect the confusion matrix, we find that the model performs very poorly on, or cannot detect at all, our class of interest. In this situation, we are faced with a class imbalance problem.
In this article, we will discuss class imbalance and how we can solve this problem. The topic will be covered in two parts: theoretical and practical. Here, we will focus on the theoretical part; in further articles in this series, we will delve into the practical part, where we will apply each of the techniques for handling class imbalance discussed here.
I apologise in advance: this will be a bit lengthy, but I promise to keep it as concise as possible. So, grab a cup of coffee and some snacks!
Note:
It is assumed that the reader has an intermediate knowledge of classification in machine learning.
Class Imbalance
Class imbalance describes a scenario where the frequency of instances in one class is higher than in the other. The class with more samples is called the majority class, while the one with fewer samples is called the minority class. A mild imbalance is often harmless; it becomes a serious problem when the proportion of samples in the majority class is strikingly larger than in the minority class.
Class imbalance is usually encountered where the occurrence of an event is rare: for example, in fraud cases, network breaches/failures, cyber-attacks, customer churn, email spam, and some health diagnoses and diseases. This is a problem in machine learning because the classifier tends to overlook the minority class and focus on the majority class; in effect, it is biased towards the majority class. Typically, the minority class is the class of interest (what we care about), so this problem results in classifiers with poor predictive performance for precisely the class that matters.
The question you may ask is this: How then can we solve this challenge? Various techniques have been applied to solve the class imbalance problem. Some of them include:
· Resampling method (downsampling and upsampling techniques)
· Applying weights/cost (penalty)
· Threshold adjustment
· The use of ensemble methods
· Using a different evaluation metric.
Techniques for Handling Class Imbalance
Resampling Methods
Here, the minority or majority class is resampled with the aim of balancing the class distribution. The two major methods are downsampling the majority class by randomly selecting a fraction of it, and upsampling the minority class by generating duplicate copies of its samples. Variants of these methods exist. One is generating synthetic copies of the minority class, called the Synthetic Minority Oversampling Technique (SMOTE). In SMOTE, each minority-class sample and a predefined number of its nearest neighbours (k-nearest neighbours) are used to synthetically generate new minority-class samples by interpolation.
No one method is superior to the others; each has its strengths and weaknesses. For example, downsampling the majority class may lead to a loss of information; duplicating copies of the minority class may introduce redundant information; and generating synthetic data from the minority class may introduce ambiguities. Therefore, depending on the resampling method chosen, the fraction of the majority or minority class can be tuned iteratively to find the value that gives the best performance.
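As a minimal sketch of the upsampling idea, scikit-learn's `resample` utility can duplicate minority-class samples (with replacement) until the classes are balanced. The toy arrays below are illustrative, not from this article; for SMOTE, the imbalanced-learn library provides a ready-made implementation.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data: 8 majority (class 0) vs 2 minority (class 1) samples
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_maj, X_min = X[y == 0], X[y == 1]

# Upsample the minority class with replacement to match the majority count
X_min_up, y_min_up = resample(
    X_min, np.ones(len(X_min), dtype=int),
    replace=True, n_samples=len(X_maj), random_state=42,
)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate([np.zeros(len(X_maj), dtype=int), y_min_up])
print(np.bincount(y_balanced))  # both classes now have 8 samples
```

Downsampling is the mirror image: call `resample` on the majority class with `replace=False` and `n_samples=len(X_min)`.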
Applying weights/cost
Here, a weight (cost/penalty) is applied to the minority and majority classes to penalize the classifier for misclassifications: a higher cost for every misclassification of the minority class, and a lower cost for the majority class. This technique aims to make the classifier more sensitive to the minority class.
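Many scikit-learn classifiers expose this directly through the `class_weight` parameter. A quick sketch on synthetic data (the dataset and model choice here are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 95/5 imbalanced binary dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# 'balanced' sets each class weight inversely proportional to its frequency;
# an explicit dict such as {0: 1, 1: 10} also works.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.score(X, y))
```

The weight dict (or the ratio implied by `'balanced'`) is itself a hyperparameter worth tuning.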
Ensemble Methods
In this technique, an ensemble of weak learners (classifiers) is used. It is a form of bagging, where subsets of the training data are sampled and a model is fit on each sampled subset. After the models are fit, their predictions are aggregated, either by averaging their predicted probabilities or by taking a majority vote. The weak learner could be a decision tree, a neural network or a linear model.
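A minimal bagging sketch with scikit-learn, using shallow decision trees as the weak learner (the data is synthetic; imbalanced-learn's `BalancedBaggingClassifier` combines this idea with resampling of each subset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 90/10 imbalanced binary dataset
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# 25 shallow trees, each fit on a bootstrap sample of the training data;
# predictions are aggregated across the ensemble
bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=25,
    random_state=0,
).fit(X, y)
print(bag.predict(X[:5]))
```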
Threshold Adjustment
In binary classification, an instance (data point) is typically assigned to a class by thresholding the predicted probability at 0.5: values greater than 0.5 are labelled 1 and values below it 0. For an imbalanced class problem, this threshold may need to be adjusted. The best threshold is found by searching for the value that balances the classifier's recall and precision scores, which can be investigated using the precision/recall curve or the receiver operating characteristic (ROC) curve.
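One common recipe, sketched below on synthetic data, is to sweep the precision/recall curve and pick the threshold that maximizes the F1 score (using F1 as the balancing criterion is one reasonable choice, not the only one):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

probs = clf.predict_proba(X)[:, 1]  # predicted probability of class 1
precision, recall, thresholds = precision_recall_curve(y, probs)

# F1 at each candidate threshold (the final precision/recall point
# has no associated threshold, hence the [:-1] slices)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = thresholds[np.argmax(f1)]

# Classify with the tuned threshold instead of the default 0.5
y_pred = (probs >= best).astype(int)
print(f"best threshold: {best:.2f}")
```

In practice the threshold should be chosen on a validation set, not the training data as in this toy sketch.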
Choosing a different performance Metric
Sometimes evaluating the performance of a classifier using accuracy may not be appropriate. For imbalanced datasets, a phenomenon called the accuracy paradox exists where the classifier may have the best accuracy overall but fails woefully at accurately predicting the positive/minority class. Choosing a different metric, in this case, would be ideal. Metrics such as the Area under the ROC curve (AUC), recall, precision, F1 score or balanced accuracy can be adopted.
Recall (or sensitivity) gives the fraction of the class of interest that a classifier can detect. It is given by Recall = TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives.
Precision evaluates how accurate the classifier's predictions are for the class of interest: of all the instances the classifier predicts as positive, the fraction that truly belong to the positive class. It is given by Precision = TP / (TP + FP), where FP is the number of false positives.
Most practitioners use the F1 score for imbalanced datasets because it combines the precision and recall of a classification model as their harmonic mean: F1 = 2 × (Precision × Recall) / (Precision + Recall).
Balanced accuracy, on the other hand, looks at accuracy in both classes: it is the average of sensitivity and specificity.
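All of these metrics are available in scikit-learn. A quick sketch on a hypothetical set of true labels and predictions (4 positives, 6 negatives):

```python
from sklearn.metrics import (
    recall_score, precision_score, f1_score, balanced_accuracy_score
)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # TP=2, FN=2, FP=1, TN=5

print(recall_score(y_true, y_pred))             # TP / (TP + FN) = 2/4 = 0.5
print(precision_score(y_true, y_pred))          # TP / (TP + FP) = 2/3
print(f1_score(y_true, y_pred))                 # harmonic mean of the two
print(balanced_accuracy_score(y_true, y_pred))  # (sensitivity + specificity) / 2
```

`roc_auc_score` works the same way but expects predicted probabilities rather than hard labels.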
Conclusion
In this first article in the series, we looked at the class imbalance problem in machine learning and the different techniques used to solve it. Unfortunately, there's no one-size-fits-all technique; you need to try several of them and find the one that best suits your imbalanced data. Also, depending on the technique chosen, the values (weight, sampling fraction, threshold) that give optimal performance can be found iteratively.
In the next articles, we will delve into the practical ways of utilizing these techniques. Firstly, we will demonstrate the accuracy paradox where we will fit a classifier without applying any of the techniques for handling class imbalance.
Thanks for taking the time to read this lengthy article; I hope you enjoyed it. Your suggestions are highly welcome.
Next Article:
Handling Class Imbalance in Machine Learning: The Accuracy Paradox (Part 1)
Resources
8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
5 Techniques to Handle Imbalanced Data for a Classification Problem