Handling Class Imbalance in Machine Learning: Resampling Methods (Part 2)
In part 1, we observed that the NHANES survey dataset has an unequal ratio of majority and minority classes. We also illustrated the accuracy paradox by fitting a logistic regression model to the data without applying any technique for handling class imbalance.
In part 2, we will apply resampling methods and see how they perform on held-out test data. We will limit this to downsampling (or undersampling) and upsampling (or oversampling), and we will also see how to apply SMOTE.
Resampling Methods
1. Downsampling Resampling Method
To determine the fraction of the majority class to sample so that it equals the number of samples in the minority class, we divided the number of samples in the minority class by that of the majority class (a short sketch of this calculation follows the note below). 12% of the majority class in the training data was randomly selected and then merged with the samples of the minority class.
Note: Resampling should be done on the training data alone, while the performance of the model should be evaluated on the test data.
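As a rough sketch, the sampling fraction can be computed from the class counts (assuming the train.data data frame and Diabetes column carried over from part 1):
# fraction of the majority class needed to match the minority class
n_minority <- sum(train.data$Diabetes == 'Yes')
n_majority <- sum(train.data$Diabetes == 'No')
down_frac <- n_minority / n_majority # about 0.12 for this data
down_frac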
# downsampling train data
set.seed(0) # to maintain reproducibility
train_samp <- train.data |>
  filter(Diabetes == 'No') |>
  sample_frac(size = 0.12) |>
  # merge to minority class
  bind_rows(
    train.data |>
      filter(Diabetes == 'Yes')
  )
# fitting a logistic regression model
logit <- train(
  Diabetes ~ .,
  data = train_samp,
  method = 'glmnet',
  preProcess = c('center', 'scale'),
  trControl = trainControl(method = 'none'),
  tuneGrid = expand.grid(alpha = 0, lambda = 0)
)
In the code above, we set a random seed to maintain the reproducibility of the results. We then filter the majority class from the training data, randomly select 12% of its samples, and merge them with the minority class using the bind_rows function. After that, we fit a logistic regression model on the resampled training data, centring and scaling the inputs via the preProcess argument of the train function from the caret library. Because we are not doing hyperparameter tuning, we set the method in the trainControl function to none.
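As a sketch, the evaluation reported below can be reproduced with caret's predict and confusionMatrix functions (assuming the held-out set is called test.data and 'Yes' is the positive class, as in part 1):
# predict on the held-out test data and summarise performance
preds <- predict(logit, newdata = test.data)
confusionMatrix(preds, test.data$Diabetes, positive = 'Yes', mode = 'everything')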
When we check the performance on the test data, we see that a considerably larger number of the minority class is correctly predicted, from 16 to 276 (Figure 1). However, there is more misclassification of non-diabetic participants (false positives), with the number of misclassifications jumping from 12 to over 750. Similarly, the recall and F1 scores jumped from less than 10% to 86% and 41% respectively. On the other hand, the overall accuracy dropped from 89.7% to 74%.
2. Upsampling Resampling Method
To determine the fraction of the minority class to sample so that it equals the number of samples in the majority class, we divided the number of samples in the majority class by that of the minority class (see the sketch below). The number of samples randomly selected from the training data was 8.57 times the number of instances in the minority class. Sampling was done with replacement, and the result was merged with the majority class.
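A quick sketch of where the 8.57 factor comes from (again assuming the train.data and Diabetes names from part 1):
# how many times the minority class must be oversampled to match the majority
up_frac <- sum(train.data$Diabetes == 'No') / sum(train.data$Diabetes == 'Yes')
up_frac # about 8.57 for this data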
# upsample minority class
set.seed(0)
train_samp <- train.data |>
  filter(Diabetes == 'Yes') |>
  # sample with replacement, 8.57 times the number of instances
  sample_frac(8.57, replace = TRUE) |>
  # merge to majority class
  bind_rows(
    train.data |>
      filter(Diabetes == 'No')
  )
# fitting a logistic regression model
logit <- train(
  Diabetes ~ .,
  data = train_samp,
  method = 'glmnet',
  preProcess = c('center', 'scale'),
  trControl = trainControl(method = 'none'),
  tuneGrid = expand.grid(alpha = 0, lambda = 0)
)
When we check its performance on the test data, we see that this method does slightly worse than downsampling (Figure 2). Only 274 out of 321 people in the minority class were correctly detected as having diabetes.
3. Synthetic Minority Oversampling Technique (SMOTE)
SMOTE is a form of oversampling (upsampling) in which new synthetic instances are generated within the k-nearest neighbours of instances in the minority class. To implement this method, we used the SMOTE function from the smotefamily library.
Note: SMOTE throws an error if any of the predictor variables is categorical. Therefore, make sure that all predictor columns are numeric.
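As a minimal sketch of one way to satisfy this requirement (assuming some predictors in train.data are still stored as factors; the train_X name is purely illustrative):
# convert any factor predictors to numeric codes before calling SMOTE
train_X <- train.data |>
  select(-Diabetes) |>
  mutate(across(where(is.factor), as.numeric))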
# SMOTE
set.seed(0)
train_samp <- SMOTE(
  X = select(train.data, -Diabetes),
  target = train.data$Diabetes
)
# selecting balanced data
train_samp <- train_samp$data
# rename
train_samp <- train_samp |>
  # rename class (Diabetes) to original name
  rename(Diabetes = class) |>
  # convert the class (Diabetes) to factor
  mutate(Diabetes = as.factor(Diabetes))
# fitting a logistic regression model
logit <- train(
  Diabetes ~ .,
  data = train_samp,
  method = 'glmnet',
  preProcess = c('center', 'scale'),
  trControl = trainControl(method = 'none'),
  tuneGrid = expand.grid(alpha = 0, lambda = 0)
)
In the code, we call the SMOTE function and pass it the independent and target variables through the X and target arguments. The returned object contains a data element where the balanced data is stored; we extract it and assign it back to train_samp. In the next step, we rename the target column to its original name and convert it to a factor so that it matches the test data. We then fit a logistic regression model and check its performance on the test data.
Note: By default, SMOTE generates instances of the minority class within its 5 nearest neighbours. The number of nearest neighbours and the number of synthetic instances to generate per minority instance can be changed through the K and dup_size arguments of the SMOTE function.
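For illustration, a call with non-default values might look like this (K = 3 and dup_size = 8 are arbitrary values chosen for this sketch, not the settings used for the results below):
# SMOTE with 3 nearest neighbours, generating ~8 synthetic copies per minority instance
set.seed(0)
train_samp <- SMOTE(
  X = select(train.data, -Diabetes),
  target = train.data$Diabetes,
  K = 3,
  dup_size = 8
)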
From the result below (Figure 3), we see that SMOTE performs better than the other two methods above in overall accuracy and F1 score; however, it does worse at accurately recognising our class of interest (participants with diabetes).
Figure 4 displays the confusion matrices and metrics for the resampling methods used here. Given our aim of identifying diabetic patients (recall), the random downsampling method wins the battle since it accurately identified 86% of diabetic participants.
Conclusion
From our results, we can observe that resampling methods are effective techniques for handling class imbalance in machine learning classification tasks. We looked at downsampling, upsampling and SMOTE. Of these, downsampling performed better than the other two methods on some of the metrics evaluated, such as the recall score. Other resampling methods exist in the smotefamily package that may also be used to address class imbalance. Similarly, the number or percentage of samples to select from either the majority or minority class, and the number of nearest neighbours for SMOTE, can be tuned for improved performance.
In the next article in this series, we will see how we can change the class threshold/cut-off point to solve class imbalance in machine learning classification tasks.
Thanks for reading!