
A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning

MACHINE LEARNING (2023)

Journal

MACHINE LEARNING


Publisher

SPRINGER

DOI: 10.1007/s10994-022-06296-4

Keywords

SMOTE; Class imbalance; Distribution density; Over-sampling; Minority class

Automated Summary

Class imbalance refers to an unequal class distribution, with one class being under-represented (minority class) and the other class having significantly more samples (majority class). The synthetic minority over-sampling technique (SMOTE) is a prominent method for handling imbalanced data. However, the patterns generated by SMOTE may not accurately represent the original minority class distribution. This paper presents a novel theoretical analysis of SMOTE by deriving the probability distribution of the generated samples, allowing for an assessment of their representativeness.

Abstract

Class imbalance occurs when the class distribution is not equal: one class is under-represented (the minority class), and the other class has significantly more samples in the data (the majority class). The class imbalance problem is prevalent in many real-world applications, and the under-represented minority class is generally the class of interest. The synthetic minority over-sampling technique (SMOTE) is considered the most prominent method for handling imbalanced data. SMOTE generates new synthetic data patterns by performing linear interpolation between minority class samples and their K nearest neighbors. However, the SMOTE-generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of the SMOTE method by deriving the probability distribution of the SMOTE-generated samples. To the best of our knowledge, this is the first work deriving a mathematical formulation for the probability distribution of the SMOTE patterns. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by evaluating it on a number of densities and comparing the results against densities estimated empirically.
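The interpolation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The function name `smote_sample`, its parameters, the brute-force neighbor search, and the choice to pick the base sample and its neighbor uniformly at random are all assumptions made for illustration. Each synthetic point is computed as x + gap * (z - x), where x is a minority sample, z is one of its K nearest minority neighbors, and gap is drawn uniformly from [0, 1).

```python
import math
import random


def smote_sample(minority, n_new, k=5, seed=None):
    """Illustrative SMOTE-style oversampling (hypothetical helper).

    minority: list of equal-length numeric tuples (minority class samples).
    n_new:    number of synthetic samples to generate.
    k:        number of nearest neighbors considered for each sample.
    """
    rng = random.Random(seed)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Brute-force K nearest neighbors (Euclidean), excluding the point itself.
    neighbors = []
    for i, x in enumerate(minority):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(x, minority[j]),
        )
        neighbors.append(others[:k])

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)          # assumption: base sample chosen uniformly
        j = rng.choice(neighbors[i])  # one of its K nearest neighbors
        gap = rng.random()            # interpolation factor in [0, 1)
        x, z = minority[i], minority[j]
        # Linear interpolation on the segment between x and z.
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, z)))
    return synthetic


if __name__ == "__main__":
    data_rng = random.Random(0)
    minority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(30)]
    new_points = smote_sample(minority, n_new=5, k=5, seed=1)
    for p in new_points:
        print(p)
```

Because every synthetic point lies on a segment between two existing minority samples, the generated set stays inside the convex hull of the minority data. This is one intuitive reason why its density can differ from the true class-conditional density, which is the question the paper analyzes.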
