
Why do we need undersampling/oversampling

Undersampling and oversampling are used when the classes in a dataset are extremely imbalanced. Here, imbalance refers to a drastic difference in the number of samples per class.

Several examples are shown below:

In a fraud detection scenario with 1% fraudulent and 99% legitimate transactions, a model that naively predicts every transaction as legitimate would achieve 99% accuracy while detecting no fraud at all. This shows how extreme class imbalance can make accuracy a misleading performance metric.
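The fraud example can be checked with a few lines of Python. This is only a minimal sketch: the dataset size of 10,000 and the label encoding (1 = fraud, 0 = legitimate) are assumptions made for illustration, not values from a real dataset.

```python
# Hypothetical data: 10,000 transactions, 1% of them fraudulent (label 1).
n_total = 10_000
n_fraud = n_total // 100
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A naive "model" that predicts every transaction as legitimate (label 0).
y_pred = [0] * n_total

# Accuracy: fraction of all predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall on the fraud class: fraction of fraudulent transactions that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_fraud

print(f"accuracy     = {accuracy:.2f}")  # 0.99
print(f"fraud recall = {recall:.2f}")    # 0.00
```

The accuracy looks excellent, but the recall on the fraud class is zero, which is why a per-class metric is usually more informative than accuracy on imbalanced data.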

To address such imbalances, there are two different approaches:

On this page, we focus on the second one, sampling. Sampling can be classified into two categories:

1. Undersampling

Undersampling reduces the number of majority-class samples. For example, suppose a dataset has 1000 data points in the majority class and 100 data points in the minority class. We then remove 900 data points from the majority class, leaving 100 in each class, so that the dataset becomes balanced.
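The 1000-versus-100 example can be sketched in plain Python. The text above does not say which 900 points to remove, so this sketch assumes the simplest choice, removing them uniformly at random. The function name `random_undersample`, the `seed` parameter, and the toy data are all hypothetical and used only for illustration.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the smallest class.

    Assumption: points to keep are chosen uniformly at random within each class.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class is reduced to the size of the smallest (minority) class.
    target = min(len(idxs) for idxs in by_class.values())

    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    kept.sort()  # preserve the original ordering of the kept samples

    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical dataset: 1000 majority points (label 0), 100 minority points (label 1).
labels = [0] * 1000 + [1] * 100
samples = list(range(len(labels)))

X_bal, y_bal = random_undersample(samples, labels)
print(Counter(labels))  # class counts before undersampling
print(Counter(y_bal))   # class counts after undersampling
```

After the call, both classes contain 100 samples: all minority points are kept, and 900 majority points are discarded. The discarded points may carry useful information, which is the main cost of this approach.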

On this page, I would like to introduce four ways of undersampling: