Article

TDMO: Dynamic multi-dimensional oversampling for exploring data distribution based on extreme gradient boosting learning

Journal

INFORMATION SCIENCES
Volume 649

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.ins.2023.119621

Keywords

Class imbalance learning; Data distribution; Oversampling; k-nearest neighbors; SMOTE

This paper proposes TDMO, a novel oversampling approach based on XGBoost, to better exploit the data distribution in imbalanced learning. By training on multiple balanced subsets, filtering noise, and combining multiple samples in a multi-dimensional feature space, TDMO expands the diversity of the minority class and achieves superior classification results compared with other methods.
The synthetic minority oversampling technique (SMOTE) is the most general and popular solution for imbalanced data. Although SMOTE is effective at solving the class imbalance problem in most cases, it insufficiently exploits the prior data distribution. Additionally, most existing SMOTE variants randomly produce new instances between a minority sample and its nearest neighbors, which carries the risk of noise propagation. To address this, this paper proposes local distribution trust estimation based on extreme gradient boosting (XGBoost) and dynamic multi-dimensional oversampling (TDMO) as a novel approach to exploring data distributions. First, undersampling and XGBoost are used to train multiple balanced subsets, identifying the internal structure of the original data and yielding a per-instance classification prediction accuracy, called the confidence level (CL). Then, instances with low CL (i.e., noise) are filtered out, and the densities of the two classes in the neighborhood of each non-noise instance are evaluated to create candidate samples that expand the diversity of the minority class. Finally, the minority class is enhanced by combining multiple samples in a multi-dimensional feature space. Extensive experimental results demonstrate that TDMO clearly outperforms the comparative oversampling methods and obtains the best classification results.
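The confidence-level (CL) step described above — training classifiers on multiple balanced, undersampled subsets and averaging each instance's prediction accuracy — can be sketched as follows. This is an illustrative sketch, not the authors' implementation: a simple nearest-centroid classifier stands in for XGBoost so the example is self-contained, and all function names and parameters are hypothetical.

```python
import random
from statistics import mean

def nearest_centroid_fit(X, y):
    # Stand-in for XGBoost: represent each class by its feature-wise centroid.
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    return {c: tuple(mean(col) for col in zip(*pts)) for c, pts in by_class.items()}

def nearest_centroid_predict(model, x):
    # Predict the class whose centroid is closest (squared Euclidean distance).
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))

def confidence_levels(X, y, n_subsets=10, seed=0):
    """Average per-instance prediction accuracy over classifiers trained on
    balanced subsets (majority class undersampled to the minority size).
    Labels are assumed binary: 1 = minority, 0 = majority."""
    rng = random.Random(seed)
    minority = [i for i, yi in enumerate(y) if yi == 1]
    majority = [i for i, yi in enumerate(y) if yi == 0]
    hits = [0] * len(y)
    for _ in range(n_subsets):
        # Balanced subset: all minority instances + an equal-size majority sample.
        idx = minority + rng.sample(majority, len(minority))
        model = nearest_centroid_fit([X[i] for i in idx], [y[i] for i in idx])
        for i in range(len(y)):
            hits[i] += int(nearest_centroid_predict(model, X[i]) == y[i])
    return [h / n_subsets for h in hits]  # CL in [0, 1]; low CL suggests noise
```

In the pipeline sketched by the abstract, instances whose CL falls below a threshold would be treated as noise and excluded before candidate minority samples are generated.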

