4.7 Article

Gaussian Distribution Based Oversampling for Imbalanced Data Classification

Journal

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2020.2985965

Keywords

Gaussian distribution; Data models; Adaptation models; Probabilistic logic; Internet; Cancer; Machine learning; Imbalanced learning; oversampling; probabilistic anchor selection; Gaussian resampling

Funding

  1. Shandong Provincial Key R&D Program [2018CXGC0706, 2017CXZC1206]
  2. National Natural Science Foundation of China [61972176, 61472164, 61672262, 61572230, 61573166]

This study proposes Gaussian Distribution based Oversampling (GDO), a new data resampling technique for imbalanced data classification. In experiments, GDO outperforms the compared methods in AUC, G-mean, and memory usage, at the cost of increased running time. Its effectiveness is further validated on two real imbalanced data classification problems.
Imbalanced data classification arises in many real-world applications. Data resampling, through either oversampling or undersampling, is a promising technique for dealing with imbalanced data. However, traditional resampling approaches consider only local neighbor information and generate new instances linearly, which leads to incorrect and unnecessary instances. In this study, we propose a new data resampling technique, Gaussian Distribution based Oversampling (GDO), to handle imbalanced data for classification. In GDO, anchor instances are selected probabilistically from the minority class, taking into account the density and distance information carried by the minority instances. New minority instances are then generated following a Gaussian distribution model. We validate the proposed method experimentally by comparing it with seven imbalanced learning approaches on 40 data sets from the KEEL repository and 10 large data sets from the UCI repository. The results show that our method outperforms the compared methods in AUC, G-mean, and memory usage, at the cost of increased running time. We also apply GDO to two real imbalanced data classification problems: Internet video traffic identification and metastasis detection of esophageal cancer. These experiments again confirm the effectiveness of our approach.
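The two steps named in the abstract, probabilistic anchor selection followed by Gaussian generation, can be illustrated with a minimal sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration and not the paper's algorithm: the function name `gaussian_oversample`, the parameters `k` and `sigma_scale`, the way the density and distance factors are computed and combined, and the use of an isotropic Gaussian centered on each anchor.

```python
import numpy as np

def gaussian_oversample(X_min, X_maj, n_new, k=5, sigma_scale=0.5, seed=0):
    """Illustrative sketch only; weighting and spread rules are assumptions."""
    rng = np.random.default_rng(seed)
    n_min = len(X_min)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.r_[np.zeros(n_min, dtype=bool), np.ones(len(X_maj), dtype=bool)]

    # Distances from every minority instance to every instance.
    d = np.linalg.norm(X_min[:, None, :] - X_all[None, :, :], axis=2)
    d[np.arange(n_min), np.arange(n_min)] = np.inf  # exclude self-matches

    # Indices and distances of the k nearest neighbors of each minority instance.
    nn = np.argsort(d, axis=1)[:, :k]
    nn_dist = np.take_along_axis(d, nn, axis=1)

    # Assumed density factor: share of majority instances among the k neighbors.
    density = is_maj[nn].mean(axis=1)
    # Assumed distance factor: normalized mean distance to the k neighbors.
    distance = nn_dist.mean(axis=1)
    distance = distance / distance.sum()

    # Combine both factors into anchor selection probabilities.
    weights = density + distance + 1e-12
    probs = weights / weights.sum()
    anchors = rng.choice(n_min, size=n_new, p=probs)

    # Assumed spread: a fraction of the distance to the nearest minority neighbor.
    spread = sigma_scale * d[:, :n_min].min(axis=1)

    # Draw each new instance from an isotropic Gaussian centered on its anchor.
    noise = rng.normal(size=(n_new, X_min.shape[1]))
    return X_min[anchors] + noise * spread[anchors, None]
```

Calling `gaussian_oversample(X_min, X_maj, n_new=len(X_maj) - len(X_min))` would return enough synthetic minority instances to balance the two classes. The sketch assumes at least two minority instances and more than `k` instances in total.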
