4.7 Article

Analysis of sampling techniques for imbalanced data: An n=648 ADNI study

Journal

NEUROIMAGE
Volume 87, Pages 220-241

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.neuroimage.2013.10.005

Keywords

Alzheimer's disease; Classification; Imbalanced data; Undersampling; Oversampling; Feature selection

Funding

  1. Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health) [U01 AG024904]
  2. National Institute on Aging
  3. National Institute of Biomedical Imaging and Bioengineering
  4. Canadian Institutes of Health Research
  5. NIH [P30 AG010129, K01 AG030514]
  6. Dana Foundation
  7. National Institute on Aging [AGO 16570, R21AG043760]
  8. National Library of Medicine
  9. National Institute of Biomedical Imaging and Bioengineering
  10. National Center for Research Resources [LM05639, EB01651, RR019771]
  11. US National Science Foundation (NSF) [IIS0812551, IIS-0953662]
  12. National Library of Medicine [R01 LM010730]

Abstract

Many neuroimaging applications deal with imbalanced imaging data. For example, in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the number of mild cognitive impairment (MCI) cases eligible for the study is nearly two times the number of Alzheimer's disease (AD) patients for the structural magnetic resonance imaging (MRI) modality and six times the number of control cases for the proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task. Traditional classifiers that aim to maximize overall prediction accuracy tend to classify all data into the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and combinations of over- and undersampling approaches. We thoroughly examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two different classifiers, Random Forest and Support Vector Machines, based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Our extensive experimental results show that, for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids-based undersampling gives the best overall performance among the data sampling techniques and the no-sampling approach; and (2) sparse logistic regression with stability selection achieves competitive performance among the feature selection algorithms. Comprehensive experiments with various settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results. (C) 2013 Elsevier Inc. All rights reserved.
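
The K-Medoids-based undersampling highlighted in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes a NumPy feature matrix with binary labels and uses the KMedoids implementation from the third-party scikit-learn-extra package, keeping one medoid of the majority class per minority sample so that the training set becomes balanced.

import numpy as np
from sklearn_extra.cluster import KMedoids  # third-party package; an assumption, not the paper's toolkit

def kmedoids_undersample(X, y, majority_label, random_state=0):
    # Cluster the majority class into as many groups as there are minority samples,
    # then keep only the medoid (an actual sample) of each cluster.
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    km = KMedoids(n_clusters=len(min_idx), random_state=random_state).fit(X[maj_idx])
    keep = np.concatenate([maj_idx[km.medoid_indices_], min_idx])
    return X[keep], y[keep]  # balanced training set

An ensemble in the spirit of the abstract would repeat such undersampling several times and average the predictions of classifiers (e.g., Random Forest or SVM) trained on each balanced subset.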

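Likewise, here is a minimal sketch of sparse logistic regression with stability selection, assuming scikit-learn and binary 0/1 labels; the number of subsamples, the subsample fraction, and the regularization strength C below are illustrative choices, not the paper's settings. Features are scored by how often the L1-penalized fit assigns them a nonzero coefficient across random subsamples.

import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_selection(X, y, n_subsamples=100, frac=0.5, C=0.1, seed=0):
    # Fit an L1-penalized logistic regression on many random subsamples and
    # record how frequently each feature receives a nonzero coefficient.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros(d)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X[idx], y[idx])
        counts += np.abs(model.coef_.ravel()) > 1e-8
    return counts / n_subsamples  # per-feature selection frequency in [0, 1]

Features whose selection frequency exceeds a chosen threshold (e.g., 0.6) would then be retained as candidate biomarkers.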