☆ 4.6 Article

A Comparative Performance Analysis of Data Resampling Methods on Imbalance Medical Data

IEEE ACCESS (2021)

Journal

IEEE ACCESS

Volume 9, Issue -, Pages 109960-109975

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/ACCESS.2021.3102399

Keywords

Training; Lung cancer; Lung; Support vector machines; Classification algorithms; Standards; Bagging; Class imbalance; data resampling; healthcare; lung cancer; machine learning

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

This study utilized 23 class imbalance methods and three classical classifiers to identify the best imbalance techniques for medical datasets. The results show that class imbalance learning can improve the classification ability of the model, with random forest performing the best in predictive ability among the over-sampling techniques used.

Medical datasets are usually imbalanced, where negative cases severely outnumber positive cases. Therefore, it is essential to deal with this data skew problem when training machine learning algorithms. This study uses two representative lung cancer datasets, PLCO and NLST, with imbalance ratios (the proportion of samples in the majority class to those in the minority class) of 24.7 and 25.0, respectively, to predict lung cancer incidence. This research uses the performance of 23 class imbalance methods (resampling and hybrid systems) with three classical classifiers (logistic regression, random forest, and LinearSVC) to identify the best imbalance techniques suitable for medical datasets. Resampling includes ten under-sampling methods (RUS, etc.), seven over-sampling methods (SMOTE, etc.), and two integrated sampling methods (SMOTEENN, SMOTE-Tomek). Hybrid systems include (Balanced Bagging, etc.). The results show that class imbalance learning can improve the classification ability of the model. Compared with other imbalanced techniques, under-sampling techniques have the highest standard deviation (SD), and over-sampling techniques have the lowest SD. Over-sampling is a stable method, and the AUC in the model is generally higher than in other ways. Using ROS, the random forest performs the best predictive ability and is more suitable for the lung cancer datasets used in this study. The code is available at https://mkhushi.github.io/

A Comparative Performance Analysis of Data Resampling Methods on Imbalance Medical Data

Journal

IEEE ACCESS

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

A Comparative Performance Analysis of Data Resampling Methods on Imbalance Medical Data

Journal

IEEE ACCESS

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper