Handling Class Imbalance in Machine Learning: The Accuracy Paradox (Part 1)
Introduction
In the previous article, we began a series on the class imbalance problem in supervised machine learning and outlined a few techniques used to handle it. In this article, we will illustrate the class imbalance problem by fitting a logistic regression model in R and evaluating its performance on the test data.
Data
To implement this, we will use the unmodified NHANES survey data collected between 2009 and 2012. This data set can be accessed from the NHANES library in R. The survey data is collected by the US National Center for Health Statistics (NCHS). It consists of 78 variables of health information collected from 20,293 participants. To keep things simple, we will use only a subset of the variables: Participant ID, Survey year, Age, Poverty ratio, Weight, Height, BMI, Pulse rate, Average systolic blood pressure, Average diastolic blood pressure, Total HDL cholesterol level, Urine volume and Diabetes (a variable that indicates whether the participant is diabetic or not). We will use these variables to predict which participants have diabetes and which do not.
EDA and Data Preparation
Before we start, we will perform some data preparation and exploratory data analysis. We will import the necessary R libraries, select the variables, check the data types, and check for duplicate rows and missing values, dropping them where necessary (a quick sketch of these checks is shown below). After dropping all rows with missing values, 12,321 observations remained.
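The checks mentioned above are not all shown in the main code block, so here is a minimal sketch of what they might look like. It assumes the data object created in the code block further down and standard tidyverse functions:
# quick exploratory checks (assumes `data` as built in the code below)
glimpse(data)            # variable types and a preview of values
sum(duplicated(data))    # number of duplicated rows
colSums(is.na(data))     # missing values per variable
nrow(drop_na(data))      # observations remaining after dropping NAs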
When we look at the class distribution, over 9,000 participants do not have diabetes while fewer than 1,500 do. This is about a 9:1 class imbalance ratio (Figure 1). We are interested in identifying the participants who are diabetic.
# importing libraries
# install.packages('NHANES')
library(tidyverse)
library(caret)
library(NHANES)

# selecting variables
selected_features <- c('ID', 'SurveyYr', 'Age', 'Poverty', 'Weight',
                       'Height', 'BMI', 'Pulse', 'BPSysAve', 'BPDiaAve',
                       'TotChol', 'UrineVol1', 'Diabetes')
data <- NHANESraw |>
  select(all_of(selected_features))

# dropping rows with missing values
data <- data |>
  drop_na()

# dropping survey year and participant ID
data <- data |>
  select(-ID, -SurveyYr)

# target distribution (class percentages)
class.prob <- round(100 * prop.table(table(data$Diabetes)), 2)

ggplot(data, aes(Diabetes)) +
  geom_bar(fill = 'steelblue', alpha = 0.9, width = 0.7) +
  geom_text(stat = 'count', label = paste0(class.prob, '%'),
            vjust = -0.2, size = 3.2) +
  labs(title = 'Diabetes Occurrence Distribution') +
  theme_minimal() +
  theme(axis.ticks.y = element_blank())
In the first part of the R code we import the necessary libraries. Next, we select the variables we’ll be working with using the select function, with the all_of helper indicating that we want the columns named in the selected_features vector.
The third part drops all rows that contain missing values. In the fourth part, we drop the SurveyYr and ID variables; placing a “-” sign in front of a variable name tells the select function to exclude it. In the last part we plot the distribution of the target (dependent) variable using the ggplot library. The class.prob variable stores the class proportions as percentages.
Diabetes Classification
Fitting a classifier without handling class imbalance
To illustrate the majority-class bias seen on an imbalanced dataset, a logistic regression model will be fit after randomly splitting the cleaned data into train (75%) and test (25%) data sets.
To fit the logistic regression model, we will use the glmnet method through the caret package in R. The method argument of the trainControl function is set to 'none' so that no resampling (such as cross-validation) is performed. The lambda argument is set to 0 so that no penalised logistic regression is fit. Because the numerical variables are on very different scales, the preProcess argument is set to center and scale them.
The test data is then used to evaluate the model’s performance.
# splitting into train and test
set.seed(4)
train_idx <- createDataPartition(data$Diabetes, times = 1,
                                 p = 0.75, list = FALSE)
train.data <- data[train_idx, ]
test.data  <- data[-train_idx, ]

# fitting a logistic regression model
logit <- train(
  Diabetes ~ .,
  data = train.data,
  method = 'glmnet',
  preProcess = c('center', 'scale'),
  trControl = trainControl(method = 'none'),
  tuneGrid = expand.grid(alpha = 0, lambda = 0)
)
# performance on the test data
predictions <- predict(logit, test.data)

# confusion matrix
table(Actual = test.data$Diabetes, Prediction = predictions)

# accuracy
acc <- mean(test.data$Diabetes == predictions)

# sensitivity (recall)
rec <- sensitivity(predictions, test.data$Diabetes, positive = 'Yes')

# F1
f1 <- F_meas(predictions, test.data$Diabetes, relevant = 'Yes')
Figure 2 shows the confusion matrix and performance metrics evaluated on the test data. The performance metrics are the recall and F1 scores.
Depending on the nature of the problem and the cost of misclassifying the positive class as negative, reducing the number of false negatives is often the more desirable goal.
From Figure 2, we can observe that without handling class imbalance, our logistic regression model is heavily biased towards the majority class. First, there is a huge misclassification of our class of interest (305 of 321 positives misclassified, about a 95% error rate). Yet when we look at the overall accuracy (89.7%), the model appears to be performing “absolutely well.” As we explained in the last article, this is the accuracy paradox: the model performs well overall but fails abysmally at correctly detecting participants who are diabetic (as the recall and F1 scores show); only 16 were correctly detected. Conversely, about 99% of the negative class (non-diabetic) were correctly classified, with only 12 misclassifications.
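For completeness, caret’s confusionMatrix function can report the confusion matrix together with accuracy, recall and related statistics in a single call. This is a minimal sketch that assumes the predictions and test.data objects created above and treats 'Yes' as the positive class:
# one-call summary of the confusion matrix and related statistics
# (assumes the `predictions` and `test.data` objects created above)
cm <- confusionMatrix(predictions, test.data$Diabetes,
                      positive = 'Yes', mode = 'everything')
cm    # prints the confusion matrix, accuracy, recall (sensitivity), F1, etc.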
A confusion matrix (Figure 3) presents the correct and incorrect classifications of the model for each class. The number of samples of the class of interest, typically represented as the positive class, that the model correctly detects is called the True Positives (TP), while the number of samples in the negative class correctly predicted is called the True Negatives (TN). The number of instances in the negative class incorrectly classified as positive is called the False Positives (FP), while misclassifications of the positive class (predicted as belonging to the negative class) are called the False Negatives (FN).
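To make the link between these counts and the metrics reported above concrete, here is a small sketch that recomputes recall, precision and F1 by hand from the counts discussed above (16 true positives, 305 false negatives and 12 false positives); the formulas, rather than the exact numbers, are the point:
# recomputing the metrics from the confusion-matrix counts (Figure 2)
tp <- 16    # diabetics correctly detected
fn <- 305   # diabetics missed by the model
fp <- 12    # non-diabetics flagged as diabetic

recall    <- tp / (tp + fn)                                   # ~0.05
precision <- tp / (tp + fp)                                   # ~0.57
f1        <- 2 * precision * recall / (precision + recall)    # ~0.09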
Next Steps
Is there any way we can improve?
In Part 2, we will look at the performance of the model on the test data after applying resampling methods. We shall discuss random downsampling, random upsampling and SMOTE, and implement them in R.
In conclusion, not handling a huge class imbalance in a machine learning classification task can cause our model to perform badly on the class of interest. As we saw above, a model can easily become biased toward the majority class of an imbalanced dataset. Therefore, finding an appropriate method of handling this classification problem is recommended.
Thanks for reading and I hope you enjoyed it. Feel free to leave a clap or message me if you have any queries.
Follow me for more content.