Article

Handling missing values and imbalanced classes in machine learning to predict consumer preference: Demonstrations and comparisons to prominent methods

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 237

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2023.121694

Keywords

Missing values; Imbalanced classes; Machine learning; Consumer preference


Abstract
Consumer preference prediction aims to predict consumers' future purchases from their historical behavior-level data. Using machine learning algorithms, the prediction results provide evidence for conducting commercial activities and further improving consumer experiences. However, missing values and imbalanced classes in consumer behavioral data often make machine learning algorithms ineffective. While several methods have been proposed to address missing data or imbalanced class problems, few works have considered the relationships among missing mechanisms, imputation algorithms, imbalanced class methods, and the effectiveness of classification algorithms that use imputed data. In this study, we propose an adaptive process for selecting the optimal combination of amputation, imputation, imbalance treatment, and classification based on classification performance. Our research extends the literature by showing significant interaction effects between 1) the amputation mechanism and imputation algorithms, 2) imputation and imbalance treatments, and 3) imbalance treatments and classification algorithms. Using three consumer behavioral datasets from the UCI Machine Learning Repository, we empirically show that, among the classification methods compared, the overall performance of Random Forest exceeds that of Logit, SVM, and Decision Tree. Moreover, Logit, the most widely used classification method, suffers most from imbalance issues in real-world datasets. Furthermore, MetaCost is consistently the best imbalance treatment across imputation techniques and missing value mechanisms.
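The abstract's core idea of searching over combinations of imputation, imbalance treatment, and classifier, and keeping the best-performing combination, can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the synthetic data, the MCAR amputation rate, the imputer/classifier choices, and the use of `class_weight="balanced"` in place of MetaCost are all simplifying assumptions for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced "consumer behavior" data (90% / 10% classes).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Amputation step: knock out ~20% of values completely at random (MCAR).
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.2] = np.nan

# Candidate imputation algorithms (stand-ins for those in the paper).
imputers = {"mean": SimpleImputer(strategy="mean"),
            "median": SimpleImputer(strategy="median")}

# Candidate classifiers; class_weight="balanced" is a simple cost-sensitive
# proxy for an imbalance treatment such as MetaCost.
classifiers = {"logit": LogisticRegression(max_iter=1000, class_weight="balanced"),
               "rf": RandomForestClassifier(class_weight="balanced", random_state=0)}

# Evaluate every (imputer, classifier) combination by cross-validated F1
# and keep the best one, mirroring the adaptive selection idea.
best = max(
    ((imp_name, clf_name,
      cross_val_score(Pipeline([("imp", imp), ("clf", clf)]),
                      X, y, cv=5, scoring="f1").mean())
     for imp_name, imp in imputers.items()
     for clf_name, clf in classifiers.items()),
    key=lambda t: t[2],
)
print(best)  # (imputer name, classifier name, mean F1 of the winner)
```

In a fuller treatment one would also vary the amputation mechanism (MCAR, MAR, MNAR) and the imbalance treatment as separate pipeline stages, which is what makes the interaction effects reported in the abstract measurable.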
