Article

A score-based preprocessing technique for class imbalance problems

Journal

PATTERN ANALYSIS AND APPLICATIONS
Volume 25, Issue 4, Pages 913-931

Publisher

SPRINGER
DOI: 10.1007/s10044-022-01084-1

Keywords

Imbalanced data classification; Hybrid sampling; Score sharing; Sparse samples; Binary tournament selection

This paper proposes a score-based preprocessing technique that combines under-sampling and over-sampling to overcome the weakness of classifiers on class imbalance problems. The technique selects suitable samples according to their importance in the feature space and balances the class distribution. Experiments on 44 standard imbalanced datasets show that the proposed method is more effective than competing methods.
In classification, one of the common problems is the class imbalance problem. This phenomenon, whose significance keeps growing, arises in most real-world domains and occurs when data samples are distributed unevenly among the classes: most of the data belong to the larger class, while far fewer belong to the smaller one. Since standard classifiers do not take the imbalanced class distribution into account, they behave poorly when faced with such data. Many techniques have been proposed to address the class imbalance problem. Among them, a group known as preprocessing techniques balances the training set before learning, either by removing redundant samples from the larger class or by creating new samples for the smaller one. The first approach is known as under-sampling and the second as over-sampling. In this paper, we propose a score-based preprocessing technique that combines under-sampling and over-sampling to overcome the weakness of classifiers on class imbalance problems. For this purpose, we apply a sharing strategy in both stages to identify the more suitable samples according to their importance in the feature space. In the over-sampling stage, synthetic samples of the smaller class are generated by interpolating between its sparser samples. In the under-sampling stage, denser samples of the larger class are then selected for removal. In both stages, the binary tournament selection operator is used to perform over-sampling and under-sampling probabilistically. In the experiments, a support vector machine (SVM) is trained on the balanced training sets obtained by the different preprocessing methods, with F-measure and AUC as evaluation measures. Finally, all methods are compared in terms of the complexity of the resulting classification model. The results obtained on 44 standard imbalanced datasets reveal the superiority and effectiveness of the proposed method compared to the other methods.
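As a rough illustration of the pipeline the abstract describes, the sketch below approximates the sample score with a k-nearest-neighbour density estimate (the paper itself uses a sharing strategy, whose exact form is not given here), picks sparse minority samples by binary tournament and interpolates between them to create synthetic samples, and removes dense majority samples by the same tournament mechanism. The function names, the density proxy, and the balancing target are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_scores(X, k=5):
    """Score each sample by local density: the inverse of its mean distance
    to its k nearest neighbours (a stand-in for the paper's sharing-based
    importance score)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    return 1.0 / (dists[:, 1:].mean(axis=1) + 1e-12)  # column 0 is the self-distance

def tournament(candidates, scores, rng, prefer="low"):
    """Binary tournament: draw two candidates and return the sparser one
    (prefer='low') or the denser one (prefer='high')."""
    i, j = rng.choice(candidates, size=2, replace=False)
    if prefer == "low":
        return i if scores[i] <= scores[j] else j
    return i if scores[i] >= scores[j] else j

def hybrid_resample(X_min, X_maj, k=5, seed=0):
    """Over-sample the minority class by interpolating between sparse samples,
    then under-sample the majority class by removing dense samples, until both
    classes reach the mean of the two original class sizes (an assumed target)."""
    rng = np.random.default_rng(seed)
    target = (len(X_min) + len(X_maj)) // 2

    # Over-sampling stage: synthetic minority samples between sparse pairs.
    min_scores = density_scores(X_min, k)
    min_idx = np.arange(len(X_min))
    synthetic = []
    while len(X_min) + len(synthetic) < target:
        a = tournament(min_idx, min_scores, rng, prefer="low")
        b = tournament(min_idx, min_scores, rng, prefer="low")
        lam = rng.random()
        synthetic.append(X_min[a] + lam * (X_min[b] - X_min[a]))
    X_min_bal = np.vstack([X_min, np.asarray(synthetic)]) if synthetic else X_min

    # Under-sampling stage: repeatedly remove a dense majority sample.
    maj_scores = density_scores(X_maj, k)
    kept = np.arange(len(X_maj))
    while len(kept) > target:
        loser = tournament(kept, maj_scores, rng, prefer="high")
        kept = kept[kept != loser]
    return X_min_bal, X_maj[kept]
```

A typical use, assuming pre-split X_train/y_train/X_test/y_test arrays with the minority class labelled 1, would be to balance the training set and then train an SVM and report F-measure and AUC, as in the paper's experimental setup:

```python
from sklearn.svm import SVC
from sklearn.metrics import f1_score, roc_auc_score

X_min_bal, X_maj_bal = hybrid_resample(X_train[y_train == 1], X_train[y_train == 0])
X_bal = np.vstack([X_min_bal, X_maj_bal])
y_bal = np.hstack([np.ones(len(X_min_bal)), np.zeros(len(X_maj_bal))])

clf = SVC().fit(X_bal, y_bal)
print("F-measure:", f1_score(y_test, clf.predict(X_test)))
print("AUC:", roc_auc_score(y_test, clf.decision_function(X_test)))
```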
