Article

A hybrid data-level ensemble to enable learning from highly imbalanced dataset

Journal

INFORMATION SCIENCES
Volume 554, Pages 157-176

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.ins.2020.12.023

Keywords

Ensemble learning; Highly imbalanced learning; Hybrid ensemble; Oversampling; Undersampling

Funding

  1. Humanities and Social Science Fund of Ministry of Education of China [19XJC880001]
  2. Key Research and Development Program of Chengdu Science and Technology Bureau [2019-YF05-02106-GX]


A hybrid data-level ensemble method (HD-Ensemble) was developed to address the performance degradation caused by highly imbalanced class distributions. By integrating undersampling and oversampling, the method aims to balance the data distribution and to optimize fundamental properties of the ensemble, such as margin distribution and diversity. Experiments on 42 highly imbalanced datasets showed performance advantages of HD-Ensemble over ten other ensemble solutions.
Highly imbalanced class distribution is well recognized as a major cause of performance degradation for most supervised learning algorithms. Unfortunately, such detrimental distributions occur inherently in various real-world applications. In this work, we developed a hybrid data-level ensemble (HD-Ensemble), which integrates ensemble learning with the union of margin-based undersampling and diversity-enhancing oversampling. The proposed undersampling method filters out a certain number of unrepresentative majority instances based on an unsupervised margin definition, while the proposed oversampling method generates diverse minority instances according to the behavior of ensemble learning. The combination of the two data-level approaches serves a twofold purpose: balancing the data distribution and optimizing the fundamental properties (e.g., margin distribution and diversity) of the ensemble. As a result, the inferior performance caused by adopting a single data-level approach can be better addressed. Targeting binary classification tasks, we evaluated HD-Ensemble on 42 highly imbalanced datasets, which exhibited considerable variety in sample number (ranging from 129 to 20,034), feature number (ranging from 3 to 5,000) and imbalance ratio (ranging from 9.08 to 970.6). Experimental results demonstrated the performance advantages of the proposed HD-Ensemble over ten other ensemble solutions. (C) 2020 Elsevier Inc. All rights reserved.
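The general idea of pairing undersampling with oversampling inside an ensemble can be illustrated with a minimal sketch. This is not the authors' HD-Ensemble: the abstract does not specify the margin definition or the diversity-driven generation rule, so the sketch swaps in plain random undersampling and pairwise interpolation between minority instances, and uses a nearest-centroid base learner. All function names, the geometric-mean target size, and the default `n_members=11` are assumptions made for illustration. The imbalance ratio is computed here as majority count divided by minority count, which is the common convention and is assumed to match the abstract's usage.

```python
import math
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def hybrid_resample(X, y, rng, minority=1):
    """Undersample the majority class and oversample the minority class.

    Assumption: random undersampling and pairwise interpolation stand in for
    the paper's margin-based and diversity-enhancing methods. Binary labels only.
    """
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    majority_rows = [x for x, label in zip(X, y) if label != minority]
    majority_label = next(label for label in y if label != minority)
    # Hypothetical target size: geometric mean of the two class sizes,
    # so both classes move toward each other instead of one doing all the work.
    target = round(math.sqrt(len(minority_rows) * len(majority_rows)))
    target = min(max(target, len(minority_rows)), len(majority_rows))
    kept_majority = rng.sample(majority_rows, target)
    synthetic = []
    while len(minority_rows) + len(synthetic) < target:
        a, b = rng.choice(minority_rows), rng.choice(minority_rows)
        t = rng.random()
        # New point on the segment between two real minority instances.
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    new_minority = minority_rows + synthetic
    X_new = kept_majority + new_minority
    y_new = [majority_label] * len(kept_majority) + [minority] * len(new_minority)
    return X_new, y_new


def fit_centroids(X, y):
    """Toy base learner: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, row_label in zip(X, y) if row_label == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroids(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


def fit_ensemble(X, y, n_members=11, seed=0):
    """Train each member on its own hybrid resample of the data."""
    rng = random.Random(seed)
    return [fit_centroids(*hybrid_resample(X, y, rng)) for _ in range(n_members)]


def predict_ensemble(members, x):
    """Majority vote over the members' predictions."""
    votes = Counter(predict_centroids(member, x) for member in members)
    return votes.most_common(1)[0][0]
```

Calling `fit_ensemble(X, y)` on a list of feature vectors `X` with binary labels `y` (minority labeled `1`) returns the trained members, and `predict_ensemble(members, x)` classifies a single point. Because each member sees a different resample, the members differ from one another, which is the property that the paper's oversampling step is described as targeting more deliberately.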
