4.7 Article

An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 182

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2021.115297

Keywords

Classification; Class imbalance; Class overlapping; Data intrinsic characteristics; Noise; Small disjuncts

Funding

  1. European Regional Development Fund [KK.01.1.1.01.0009]

The study identified noise as the characteristic that most impairs classifier performance on imbalanced datasets, followed by class overlapping and class imbalance. To mitigate these issues, oversampling and undersampling procedures were tested, and guidance is provided for selecting appropriate techniques.
Learning from real-world data is inherently difficult because of the numerous intrinsic characteristics present in datasets. Class imbalance is known to significantly impair classification performance and has attracted increasing attention from researchers. Some studies, however, suggest that the detrimental effects of class imbalance occur only when the dataset also exhibits other intrinsic characteristics such as small disjuncts, class overlapping, noise, or data rarity. The literature is often ambiguous about how each of these characteristics influences the behaviour of standard classification algorithms. This paper provides a contemporary empirical study of the behaviour and performance of five well-known classifiers on a large number of imbalanced datasets exhibiting numerous combinations of the stated characteristics. The aim of the study is to identify and rank difficulty factors when learning from imbalanced data, depending on the type of classification algorithm used. In general, the results suggest that if classifiers conceptually have no problem with class separation into sub-concepts, noise is the characteristic that most impairs their performance, closely followed by class overlapping and class imbalance. To alleviate these problems, oversampling and undersampling procedures were tested, and directions are given for selecting appropriate techniques when dealing with class imbalance.
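
The abstract does not say which oversampling and undersampling procedures were tested, so the following is only a minimal sketch of the two ideas in their simplest random form, not the authors' method. The function names `random_oversample` and `random_undersample` and the toy data are hypothetical and chosen for illustration.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of the smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so that every class
    is as small as the smallest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy dataset: 8 majority samples (class 0), 2 minority samples (class 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

print(Counter(y_over))   # class sizes after oversampling
print(Counter(y_under))  # class sizes after undersampling
```

Oversampling keeps every original sample but repeats minority samples, while undersampling discards majority samples. Neither variant here addresses noise or class overlapping, which the study ranks as more harmful than imbalance itself.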
