Article

A hybrid data-level ensemble to enable learning from highly imbalanced dataset

INFORMATION SCIENCES (2021)

Journal

INFORMATION SCIENCES

Volume 554, Pages 157-176

Publisher

ELSEVIER SCIENCE INC

DOI: 10.1016/j.ins.2020.12.023

Keywords

Ensemble learning; Highly imbalanced learning; Hybrid ensemble; Oversampling; Undersampling

Funding

Humanities and Social Science Fund of Ministry of Education of China [19XJC880001]
Key Research and Development Program of Chengdu Science and Technology Bureau [2019-YF05-02106-GX]

Automated Summary

A hybrid data-level ensemble method was developed to address the performance degradation caused by highly imbalanced class distribution. By integrating undersampling and oversampling, the method aims to balance the data distribution and optimize the fundamental properties of the ensemble. Experimental results on 42 highly imbalanced datasets demonstrated significant performance advantages of the proposed HD-Ensemble over other ensemble solutions.

Abstract

Highly imbalanced class distribution is well recognized as a major cause of performance degradation for most supervised learning algorithms. Unfortunately, such detrimental distributions occur inherently in various real-world applications. In this work, we developed a hybrid data-level ensemble (HD-Ensemble), which integrates ensemble learning with the union of margin-based undersampling and diversity-enhancing oversampling. The proposed undersampling method filters out a certain number of unrepresentative majority instances based on an unsupervised margin definition, while the proposed oversampling method generates diverse minority instances according to the behavior of ensemble learning. The combination of the two data-level approaches serves the twofold purpose of balancing the data distribution and optimizing the fundamental properties (e.g., margin distribution and diversity) of the ensemble; the inferior performance caused by adopting a single data-level approach can therefore be better addressed. Targeting the binary classification task, we evaluated HD-Ensemble on 42 highly imbalanced datasets, which exhibited considerable variety in sample number (ranging from 129 to 20,034), feature number (ranging from 3 to 5,000) and imbalance ratio (ranging from 9.08 to 970.6). Experimental results demonstrated the performance advantages of the proposed HD-Ensemble over ten other ensemble solutions. (C) 2020 Elsevier Inc. All rights reserved.
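The general shape of a hybrid data-level ensemble can be illustrated with a minimal sketch. It is not the HD-Ensemble algorithm: the abstract does not specify the margin definition or the diversity-driven generation rule, so plain random undersampling and random interpolation between minority pairs are used as stand-ins. The imbalance ratio is assumed here to be the majority count divided by the minority count, and the function names, the geometric-mean target size, and the nearest-centroid base learner are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(y):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())


def hybrid_resample(X, y, minority, rng):
    """Build one balanced subset from binary-labelled data.

    Stand-ins (assumptions, not the paper's methods): random undersampling
    of the majority class and random interpolation between minority pairs.
    """
    maj = [(x, label) for x, label in zip(X, y) if label != minority]
    mino = [x for x, label in zip(X, y) if label == minority]
    # Hypothetical target size: geometric mean of the two class sizes.
    target = min(len(maj), max(len(mino), round((len(maj) * len(mino)) ** 0.5)))
    kept = rng.sample(maj, target)  # undersample the majority class
    synthetic = []
    while len(mino) + len(synthetic) < target:  # oversample the minority class
        a, b = rng.choice(mino), rng.choice(mino)
        t = rng.random()
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    X_new = [x for x, _ in kept] + mino + synthetic
    y_new = [label for _, label in kept] + [minority] * (len(mino) + len(synthetic))
    return X_new, y_new


def fit_centroids(X, y):
    """Base learner: per-class mean vector (nearest-centroid classifier)."""
    sums, counts = {}, Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_one(centroids, x):
    """Label of the closest centroid by squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)),
    )


def fit_ensemble(X, y, minority, n_members=5, seed=0):
    """Train each member on its own independently resampled subset."""
    members = []
    for k in range(n_members):
        rng = random.Random(seed + k)
        X_k, y_k = hybrid_resample(X, y, minority, rng)
        members.append(fit_centroids(X_k, y_k))
    return members


def predict_ensemble(members, x):
    """Majority vote over the members' predictions."""
    votes = Counter(predict_one(m, x) for m in members)
    return votes.most_common(1)[0][0]
```

Each member sees a different resampled subset, which is where a data-level ensemble gets its diversity. A method such as the one described in the abstract would replace the two random steps with its margin-based filter and its ensemble-aware generator.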
