Undersampling & Oversampling

Why do we need undersampling/oversampling

Undersampling and oversampling are used when a dataset has an extreme class imbalance. Here, imbalance refers to drastic differences in sample counts between classes.

Several examples are shown below:

fraud detection
phishing detection
product defects in a manufacturing factory
…

For example, in a fraud detection scenario with 1% fraudulent and 99% legitimate transactions, a model that naively predicts every transaction as legitimate would achieve 99% accuracy while catching no fraud at all. This demonstrates how extreme class imbalance can make accuracy a misleading metric.
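To make the 99% figure concrete, here is a minimal sketch in plain Python. The label lists are hypothetical toy data (10 fraudulent and 990 legitimate transactions), not a real dataset, and the "model" is simply a constant prediction.

```python
# Hypothetical toy labels: 1 = fraudulent (1%), 0 = legitimate (99%).
y_true = [1] * 10 + [0] * 990

# A naive "model" that predicts every transaction as legitimate.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: fraction of actual frauds that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy looks excellent, yet recall on the fraud class is zero, which is why metrics other than accuracy are usually reported for imbalanced data.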

To address such imbalances, there are several different approaches:

Change the metrics (ROC AUC, precision, recall, etc.)
Sampling method
Cost sensitive learning
Novelty detection

In this page, we focus on the second one, sampling. Sampling methods fall into two categories:

Undersampling (reducing the majority class samples)
Oversampling (increasing the minority class samples)
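As a rough illustration of the second category, the sketch below grows a minority class by drawing existing samples again with replacement. This is only the simplest possible form of oversampling, and the function name `random_oversample`, the `target_size` parameter, and the toy data are assumptions made for this example.

```python
import random

def random_oversample(minority, target_size, seed=0):
    """Grow `minority` to `target_size` by re-drawing existing samples.

    Hypothetical helper for illustration: the extra samples are exact
    duplicates chosen uniformly at random with replacement.
    """
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(minority))
    extra = rng.choices(minority, k=n_extra)
    return minority + extra

# Toy minority class with 100 samples, grown to match a majority of 1000.
minority = list(range(100))
oversampled = random_oversample(minority, target_size=1000)

print(len(oversampled))                   # 1000
print(set(oversampled) <= set(minority))  # True: no new values are created
```

Because the added points are duplicates, this balances the class counts without adding any new information about the minority class.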

1. Undersampling

Undersampling is the approach that reduces the number of majority class samples. For example, say a dataset has 1000 data points in the majority class and 100 data points in the minority class. We then remove 900 data points from the majority class so that both classes have 100 data points and the dataset becomes balanced.
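The 1000-versus-100 example can be sketched in a few lines of Python. This version picks which majority points to keep uniformly at random, which is just one possible choice. The function name `random_undersample` and the toy data are assumptions for illustration, and the helper assumes the majority class really is at least as large as the minority class.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of `majority` the same size as `minority`.

    Hypothetical helper for illustration. Sampling is without replacement,
    so it raises ValueError if `majority` is smaller than `minority`.
    """
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, k=len(minority))
    return kept_majority, minority

# Toy dataset: 1000 majority points and 100 minority points.
majority = [("majority", i) for i in range(1000)]
minority = [("minority", i) for i in range(100)]

kept_majority, kept_minority = random_undersample(majority, minority)

print(len(kept_majority), len(kept_minority))  # 100 100
```

The 900 discarded majority points are simply thrown away, so any information they carried is lost.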

In this page, I would like to introduce 4 ways of undersampling: