4.7 Article

On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed

Journal

INFORMATION SCIENCES
Volume 257, Issue -, Pages 1-13

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.ins.2013.09.038

Keywords

Classification; Imbalanced dataset; Covariate shift; Dataset shift; Validation technique; Partitioning

Funding

  1. Spanish Ministry of Science and Technology [TIN2011-28488]
  2. Andalusian Research Plans [P11-TIC-7765, P10-TIC-6858]
  3. Spanish Ministry of Education

In Data Mining, estimating the quality of the learned models is a key step in selecting the most appropriate tool for the problem to be solved. Traditionally, k-fold validation has been used so that the results for the different partitions are independent to a certain degree. In this way, the highest average performance will be obtained by the most robust approach. However, dividing the instances randomly over the folds may cause a problem known as dataset shift, in which the training and test folds follow different data distributions. In classification with imbalanced datasets, in which one class has far fewer instances than the other, this problem is more severe: minority class instances are misclassified because the real boundaries are learned incorrectly from a poorly fitted data distribution, and this strongly affects the performance measures in this scenario. For this reason, we propose using a specific validation technique for partitioning the data, known as Distribution optimally balanced stratified cross-validation, to avoid this harmful situation in the presence of imbalance. This methodology places close-by samples in different folds, so that each partition ends up with enough representatives of every region. We have selected a large number of imbalanced datasets from the KEEL dataset repository for our study and used several learning techniques from different paradigms, so that the conclusions are independent of the underlying classifier. The results have been analyzed by means of a proper statistical study, which shows the benefits of this approach for dealing with imbalanced data. (C) 2013 Elsevier Inc. All rights reserved.
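The partitioning idea in the abstract (placing close-by samples in different folds) can be illustrated with a minimal sketch. The abstract does not give the procedure's details, so the following are assumptions made for illustration only: samples are handled one class at a time, a random unassigned sample is grouped with its nearest unassigned same-class neighbours, closeness is Euclidean distance, and each member of the group goes to a different fold. The names `dob_scv_folds`, `X`, `y`, `k`, and `seed` are hypothetical and not taken from the paper.

```python
import math
import random


def dob_scv_folds(X, y, k=5, seed=0):
    """Assign every sample to one of k folds (hypothetical helper).

    Near neighbours of the same class are sent to different folds, so each
    fold keeps representatives of every region of every class.
    Returns a list fold_of where fold_of[i] is the fold index of sample i.
    """
    rng = random.Random(seed)
    fold_of = [None] * len(X)
    for label in sorted(set(y)):
        # Indices of this class that have not been placed in a fold yet.
        unassigned = [i for i, lab in enumerate(y) if lab == label]
        while unassigned:
            # Pick a random starting sample and remove it from the pool.
            anchor = unassigned.pop(rng.randrange(len(unassigned)))
            # Order the remaining pool by Euclidean distance to the anchor
            # (assumed metric) and take at most k - 1 nearest neighbours.
            unassigned.sort(key=lambda i: math.dist(X[anchor], X[i]))
            group = [anchor] + unassigned[: k - 1]
            del unassigned[: k - 1]
            # Spread the group: one sample per fold.
            for fold, idx in enumerate(group):
                fold_of[idx] = fold
    return fold_of


if __name__ == "__main__":
    data_rng = random.Random(1)
    X = [(data_rng.random(), data_rng.random()) for _ in range(47)]
    y = [1] * 7 + [0] * 40  # 7 minority samples, 40 majority samples
    folds = dob_scv_folds(X, y, k=5)
    for j in range(5):
        members = [i for i, f in enumerate(folds) if f == j]
        print(j, len(members), sum(y[i] for i in members))
```

In this sketch, fold `j` serves as the test set and the remaining folds as the training set. Because every full group contributes exactly one sample to each fold, the per-class counts of any two folds differ by at most one, which keeps the scarce minority class represented in every partition.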
