Article

Adaptive weighted over-sampling for imbalanced datasets based on density peaks clustering with heuristic filtering

Journal

INFORMATION SCIENCES
Volume 519, Pages 43-73

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.ins.2020.01.032

Keywords

Imbalanced datasets; Classification; Clustering; Over-sampling; Within-class imbalance

Funding

  1. Fundamental Research Funds for the Central Universities [2572017EB02, 2572017CB07]
  2. Innovative talent fund of Harbin science and technology Bureau [2017RAXXJ018]
  3. Double first-class scientific research foundation of Northeast Forestry University [411112438]

Abstract

Learning from imbalanced datasets poses a major challenge in the data mining community. Conventional classification algorithms generally perform poorly on imbalanced datasets because they were originally designed for balanced class distributions. Although different methods exist for addressing this issue, sampling methods, especially over-sampling techniques, have shown great potential: they improve the dataset itself rather than the classifier, so they can be used with any classifier. In this paper, we propose a novel adaptive weighted over-sampling method for imbalanced datasets based on density peaks clustering with heuristic filtering. Unlike other clustering-based over-sampling methods, the proposed approach clusters the minority instances with a modified density peaks clustering rather than traditional k-means, because density peaks clustering can accurately identify sub-clusters of different sizes and densities. This allows the proposed method to address between-class and within-class imbalance, arising from various causes, at the same time. Subsequently, the over-sampling size for each identified sub-cluster is determined adaptively from its own size and density. The minority instances within each sub-cluster are then oversampled with probabilities inversely proportional to their distances to the majority class and to their densities, so that more synthetic minority instances are generated for borderline and sparser instances. Finally, to avoid generating overlapping instances, a heuristic filtering strategy iteratively moves possibly overlapping minority instances away from the majority class. Extensive experimental results on different imbalanced datasets demonstrate that the proposed approach achieves better classification performance on most datasets than other existing over-sampling techniques.
(C) 2020 Elsevier Inc. All rights reserved.
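
The weighting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the Gaussian-kernel density, the cutoff parameter `d_c`, the product form of the weight, and the function names `density_peak_scores` and `sampling_probabilities` are all assumptions made for illustration. The quantities `rho` and `delta` follow the standard density peaks definitions, not the modified variant the paper proposes, and the sketch treats all minority instances as one sub-cluster.

```python
import numpy as np


def density_peak_scores(X, d_c):
    """Standard density peaks quantities: local density rho and separation delta."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    kernel = np.exp(-(dist / d_c) ** 2)
    np.fill_diagonal(kernel, 0.0)  # a point does not contribute to its own density
    rho = kernel.sum(axis=1)
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = rho > rho[i]
        # Distance to the nearest denser point; the densest point
        # conventionally takes its largest distance to any point.
        delta[i] = dist[i, denser].min() if denser.any() else dist[i].max()
    return rho, delta


def sampling_probabilities(X_min, X_maj, d_c):
    """Assumed weight: inversely proportional to distance-to-majority times density."""
    rho, _ = density_peak_scores(X_min, d_c)
    # Distance from each minority instance to its nearest majority instance.
    d_maj = np.linalg.norm(
        X_min[:, None, :] - X_maj[None, :, :], axis=2
    ).min(axis=1)
    eps = 1e-12  # guards against division by zero
    weights = 1.0 / ((d_maj + eps) * (rho + eps))
    return weights / weights.sum()


# Toy data (hypothetical): a dense minority group far from the majority
# class and one sparse minority instance close to it.
X_min = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 0.0]])
X_maj = np.array([[4.0, 0.0], [4.0, 1.0]])
p = sampling_probabilities(X_min, X_maj, d_c=1.0)

# Draw seed instances for synthetic generation according to the weights.
rng = np.random.default_rng(0)
seeds = rng.choice(len(X_min), size=10, p=p)
```

In this toy setting the isolated instance at (3, 0) is both sparse and close to the majority class, so it receives the largest sampling probability, which matches the stated aim of generating more synthetic instances for borderline and sparser minority instances. How the synthetic instances are then generated and filtered is not specified in the abstract and is left out of the sketch.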
