Article

On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed

INFORMATION SCIENCES (2014)

Journal

INFORMATION SCIENCES

Volume 257, Pages 1-13

Publisher

ELSEVIER SCIENCE INC

DOI: 10.1016/j.ins.2013.09.038

Keywords

Classification; Imbalanced dataset; Covariate shift; Dataset shift; Validation technique; Partitioning

Categories

Computer Science, Information Systems

Funding

Spanish Ministry of Science and Technology [TIN2011-28488]
Andalusian Research Plans [P11-TIC-7765, P10-TIC-6858]
Spanish Ministry of Education

Abstract

In data mining, estimating the quality of learned models is a key step in selecting the most appropriate tool for the problem at hand. Traditionally, k-fold cross-validation is used so that the results for the different partitions are, to a certain degree, independent; the most robust approach then obtains the highest average performance. However, dividing the instances randomly over the folds may cause a problem known as dataset shift, in which the training and test folds follow different data distributions. This problem is more severe in classification with imbalanced datasets, where one class has far fewer instances than the other. Misclassifying minority class instances, because the real boundaries were learned incorrectly from a poorly fitted data distribution, strongly affects the performance measures in this scenario. We therefore propose using a specific validation technique for partitioning the data, known as distribution optimally balanced stratified cross-validation, to avoid this harmful situation in the presence of imbalance. This methodology places close-by samples in different folds, so that each partition ends up with enough representatives of every region. For our study we selected a large number of imbalanced datasets from the KEEL dataset repository and used several learning techniques from different paradigms, so that the conclusions are independent of the underlying classifier. The results were analyzed with an appropriate statistical study, which shows the suitability of this approach for dealing with imbalanced data. (C) 2013 Elsevier Inc. All rights reserved.
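
The partitioning idea in the abstract, placing close-by samples of the same class in different folds, can be illustrated with a short Python sketch. This is only one plausible reading of that sentence, not the authors' implementation: the function name `dob_scv_folds`, the use of Euclidean distance, the `seed` parameter, and the rule of sending the randomly chosen sample to the first fold are all assumptions made for illustration.

```python
import numpy as np


def dob_scv_folds(X, y, k=5, seed=0):
    """Return a fold index in [0, k) for every sample.

    Illustrative sketch (hypothetical helper): within each class, a random
    unassigned sample and its nearest unassigned same-class neighbours are
    spread over distinct folds.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.full(len(y), -1, dtype=int)

    for label in np.unique(y):
        # Samples of this class that have not been assigned to a fold yet.
        remaining = [int(i) for i in np.flatnonzero(y == label)]
        while remaining:
            # Pick a random unassigned sample to start a neighbourhood.
            anchor = remaining.pop(int(rng.integers(len(remaining))))
            group = [anchor]
            if remaining:
                # Up to k - 1 nearest unassigned neighbours (Euclidean, assumed).
                dists = np.linalg.norm(X[remaining] - X[anchor], axis=1)
                nearest = np.argsort(dists)[: k - 1]
                group += [remaining[i] for i in nearest]
                taken = set(group)
                remaining = [i for i in remaining if i not in taken]
            # Each member of the neighbourhood goes to a different fold.
            for fold, idx in enumerate(group):
                folds[idx] = fold
    return folds
```

Calling `dob_scv_folds(X, y, k=5)` yields one fold label per sample; fold `f` then serves as the test set and the remaining folds as the training set. Because each class is processed separately, the per-class counts in any two folds differ by at most one, so the minority class is represented in every fold whenever it has at least `k` instances.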
