
Handling Imbalanced Data — Intuition to Implementation



Have you ever found yourself in a situation where, upon training your machine learning model, you obtain an accuracy above 90%, but then realize that the model is predicting everything as the class with the majority of records?


That is the smell of imbalanced data, and now you know that your model is practically useless!


What is Imbalanced Data?


Imbalance means that the number of data points available for the different classes differs. For example, consider the famous dataset of ‘normal’ and ‘fraudulent’ credit card transactions, where the number of records for ‘normal’ transactions is 284315, while that of ‘fraud’ transactions is only 492.
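To put a number on the skew, here is a quick back-of-the-envelope check using the counts quoted above (a small illustrative sketch, not from the original walkthrough):

normal_count, fraud_count = 284315, 492
# Fraction of all transactions that are fraudulent (~0.17%)
fraud_share = fraud_count / (normal_count + fraud_count)
print(f"Fraud transactions make up only {fraud_share:.4%} of the data")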


When you deal with such a problem, you are working with imbalanced data. If the data is not balanced properly before being fitted with your preferred machine/deep learning algorithm, the resulting model can be biased and inaccurate, leading to wrong predictions that can be disastrous.


This blog will give you a clear intuition of, and insight into, how to handle imbalanced data using preferred best practices, and it will compare them so you can decide on the best technique to reshape your data into a balanced one before training your model!


Here, we will discuss -

  1. Undersampling and Oversampling techniques to handle imbalanced data.

  2. NearMiss, SMOTETomek and RandomOverSampler Algorithms to balance the imbalanced data.

  3. A comparison of the algorithms used in step 2, choosing the best one with proper reasoning.


LET’S BEGIN!

We will use the famous credit card dataset available on Kaggle for our task. You can download the dataset from here.


Let’s first get our hands dirty by making the data ready for work! Look at the code below, and try it yourself in your preferred IDE!

import numpy as np
import pandas as pd

LABELS = ["Normal", "Fraud"]

data = pd.read_csv('creditcard.csv', sep=',')
data.head()
data.info()

# Create independent and dependent features
columns = data.columns.tolist()
# Filter the columns to remove the target we do not want as a feature
columns = [c for c in columns if c not in ["Class"]]
# Store the variable we are predicting
target = "Class"
# Define a random state
state = np.random.RandomState(42)
X = data[columns]
Y = data[target]
X_outliers = state.uniform(low=0, high=1, size=(X.shape[0], X.shape[1]))  # not used further in this walkthrough
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)

When you execute this, you will get the shapes of your independent (X) and dependent/target (Y) variables.


You are likely to get output like this -

(284807, 30)
(284807,)

Now let’s do some Exploratory Data Analysis to figure out the nature of the data!


import matplotlib.pyplot as plt

# Check for missing values (the output will be False, as there are none in this dataset)
data.isnull().values.any()

# Count the records per class and plot the class distribution
count_classes = data['Class'].value_counts(sort=True)
count_classes.plot(kind='bar', rot=0)
plt.title("Transaction Class Distribution")
plt.xticks(range(2), LABELS)
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

The generated visualization is shown below -


The data is so highly imbalanced that the bar for fraudulent transactions sits near 0, whereas the bar for normal transactions rises well above the 250000 mark. So that is what the class distribution of an imbalanced dataset looks like!


Let’s count the number of fraud and normal transactions to get the exact figures, and then we will try out the different methods to balance our data.

## Get the fraud and the normal subsets
fraud = data[data['Class'] == 1]
normal = data[data['Class'] == 0]
# Print the shapes
print(fraud.shape, normal.shape)

You are likely to get an output like this-

(492, 31) (284315, 31)

So we have only 492 fraudulent transactions and 284315 normal transactions.


Let’s now discover the various techniques to balance our data!


Undersampling

This technique involves randomly removing samples from the majority class, with or without replacement. This is one of the earliest techniques used to alleviate imbalance in the dataset.
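The simplest form of this is random undersampling. As a minimal sketch (not used in the rest of this walkthrough, and assuming the X and Y prepared earlier), it could look like this -

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes have 492 samples each
rand_under = RandomUnderSampler(random_state=42)
X_under, y_under = rand_under.fit_resample(X, Y)
print(X_under.shape, y_under.shape)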


We will use the NearMiss undersampling algorithm here, which selects majority class examples based on their distance to minority class examples.


Look at the code here and try it yourself!

from imblearn.under_sampling import NearMiss

# Implementing undersampling to handle the imbalanced dataset
# (NearMiss is distance-based and deterministic, so no random_state is needed)
near_miss = NearMiss()
X_res, y_res = near_miss.fit_resample(X, Y)

X_res.shape, y_res.shape

The output will be similar to the one shown below :

((984, 30), (984,))

Yay! Our data is balanced now!


But wait. Did you just realize that we lost a huge amount of data, or rather, a huge amount of valuable information?!!!


YES. WE DID. AND THAT’S WHERE OVERSAMPLING WINS OVER UNDERSAMPLING TECHNIQUES!


Let’s now look at how Oversampling works -


Oversampling

Oversampling involves supplementing the training data with multiple copies of minority class samples, and it can be done more than once (2x, 3x, 5x, 10x, etc.). This is one of the earliest proposed methods and has proven to be robust. It is often preferable to undersampling because no data is lost; rather, new data is created!


And you know, the more data you have, the more robust your model tends to be!


So let’s now see the practical implementation of oversampling using the RandomOverSampler algorithm.


Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset.


Look at the code again, and try it out in your own IDE -

# RandomOverSampler algorithm to handle imbalanced data
from imblearn.over_sampling import RandomOverSampler

# sampling_strategy=0.5 makes the minority class half the size of the majority class
over_sampl = RandomOverSampler(sampling_strategy=0.5)
X_train_res, y_train_res = over_sampl.fit_resample(X, Y)

X_train_res.shape, y_train_res.shape

Your data will likely be reshaped like the output below -

((426472, 30), (426472,))

Did you just observe the difference in the outputs of using undersampling and oversampling?!

Such a huge amount of data gets created when we use oversampling! Pretty awesome, right?

But again, everything has its advantages and disadvantages, right?


Random oversampling is known to increase the likelihood of overfitting, since it makes exact copies of the minority class examples.
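You can check this for yourself with a quick, rough count of duplicates (assuming the X_train_res and y_train_res produced in the step above):

import numpy as np
import pandas as pd

# Count how many of the resampled minority rows are exact copies of one another
res_df = pd.DataFrame(X_train_res)
minority_rows = res_df[np.asarray(y_train_res) == 1]
print(minority_rows.duplicated().sum(), "of", len(minority_rows), "minority rows are duplicates")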


By now, you must be thinking,


What is the best and most preferred method for balancing our data?


Okay, so let’s introduce you to the smart hybrid algorithm SMOTETomek, which outperforms most other approaches to fixing imbalanced data!


Let’s welcome the HOLY GRAIL!


SMOTETomek

SMOTETomek (a method from imblearn.combine) is a hybrid technique that couples an oversampling method (SMOTE) with an undersampling method (Tomek links). Tomek links are pairs of examples that are each other’s nearest neighbours but belong to opposite classes, and SMOTETomek removes them after oversampling.

Hence, the resulting dataset is free of that crowded overlap between the classes, so that the nearest neighbours of any sample belong to a single class!
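If you want to see the Tomek half in isolation, you could run the TomekLinks cleaner on its own (a side sketch, not part of the original walkthrough, again assuming the X and Y from earlier) -

from imblearn.under_sampling import TomekLinks

# Drop the majority-class member of each Tomek link; the counts barely change,
# but the overlap region between the two classes gets cleaned up
tomek = TomekLinks()
X_tl, y_tl = tomek.fit_resample(X, Y)
print(X_tl.shape, y_tl.shape)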


GREAT INDEED!

Let’s now look at the practical implementation of the SMOTETomek algorithm!

from imblearn.combine import SMOTETomek

# sampling_strategy=0.5 targets a minority/majority ratio of 0.5 before the Tomek-link cleaning
oversamp_undersamp = SMOTETomek(sampling_strategy=0.5)
X_train_res1, y_train_res1 = oversamp_undersamp.fit_resample(X, Y)

X_train_res1.shape, y_train_res1.shape

Your data will likely be resampled like the output below -

((424788, 30), (424788,))

Now let’s look at the original and reshaped data, and compare them -

from collections import Counter
print('Original dataset shape {}'.format(Counter(Y)))
print('Resampled dataset shape {}'.format(Counter(y_train_res1)))

Output-

Original dataset shape Counter({0: 284315, 1: 492})
Resampled dataset shape Counter({0: 283473, 1: 141315})

You can clearly see that the original dataset had 284315 records belonging to class 0 and only 492 belonging to class 1, whereas SMOTETomek resampled it to 283473 records for class 0 and 141315 records for class 1.
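To visualise this, you can redraw the earlier bar chart for the resampled labels (a small sketch that reuses LABELS and the y_train_res1 from above) -

import pandas as pd
import matplotlib.pyplot as plt

# The same bar chart as before, but for the SMOTETomek-resampled labels
pd.Series(y_train_res1).value_counts().sort_index().plot(kind='bar', rot=0)
plt.title("Resampled Transaction Class Distribution")
plt.xticks(range(2), LABELS)
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()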


Conclusion


To conclude, SMOTETomek is a method that tends to improve accuracy for both the majority and the minority class, since it preserves the completeness of both.
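One way to sanity-check such a claim on your own data is to split first, resample only the training portion, and score on the untouched test set. The sketch below is illustrative only; the estimator, split size, and metric are assumptions rather than something prescribed by this article -

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.combine import SMOTETomek

# Hold out a stratified test set first, then resample ONLY the training split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify=Y, random_state=42)
X_bal, y_bal = SMOTETomek(sampling_strategy=0.5).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_bal, y_bal)
print(classification_report(y_test, clf.predict(X_test), target_names=LABELS))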

Imbalanced learning is a critical and challenging problem in the field of data engineering and knowledge discovery, where the cost of misclassification can be very high.

Also, each of these methods is bounded by certain conditions under which it performs best. In other words, there is a need for rules of thumb about when to use each method, so that it is applied to the kinds of datasets where it performs well.

So, always be curious and careful, and study your data well so that you can apply the best method to it!


Source: Medium - by Sukanya Bag

