Article

Handling missing values and imbalanced classes in machine learning to predict consumer preference: Demonstrations and comparisons to prominent methods

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 237

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2023.121694

Keywords

Missing values; Imbalanced classes; Machine learning; Consumer preference

Consumer preference prediction aims to predict consumers' future purchases based on their historical behavior-level data. Using machine learning algorithms, the prediction results provide evidence for conducting commercial activities and further improving consumer experiences. However, missing values and imbalanced classes in consumer behavioral data often make machine learning algorithms ineffective. While several methods have been proposed to address missing data or imbalanced class problems, few works have considered the relationships among missing mechanisms, imputation algorithms, imbalanced class methods, and the effectiveness of classification algorithms that use imputed data. In this study, we propose an adaptive process for selecting the optimal combination of amputation, imputation, imbalance treatment, and classification based on classification performance. Our research extends the literature by showing significant interaction effects between 1) the amputation mechanism and imputation algorithms, 2) imputation and imbalance treatments, and 3) imbalance treatments and classification algorithms. Using three consumer behavioral datasets from the UCI Machine Learning Repository, we empirically show that, among different classification methods, the overall performance of Random Forest is better than that of Logit, SVM, or Decision Tree. Moreover, Logit, the most widely used classification method, suffers most from imbalance issues in real-world datasets. Furthermore, MetaCost is consistently the best imbalance treatment across different imputation techniques and missing value mechanisms.
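The adaptive process described in the abstract can be sketched as a grid search over imputation, imbalance treatment, and classifier combinations, scored by cross-validated classification performance. This is a minimal illustration, not the paper's exact pipeline: MetaCost is not available in scikit-learn, so cost-sensitive class weighting stands in for the imbalance treatment, and the amputation step is simulated here as MCAR (missing completely at random) cell deletion on synthetic data.

```python
# Sketch of combination selection: imputation x imbalance treatment x
# classifier, picking the best by mean cross-validated F1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


def select_best_combination(X, y, seed=0):
    """Return the (imputer, treatment, classifier) triple with the best F1."""
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "knn": KNNImputer(n_neighbors=5),
    }
    # class_weight="balanced" approximates a cost-sensitive treatment;
    # the paper's MetaCost wrapper would relabel training data instead.
    classifiers = {
        ("logit", "none"): LogisticRegression(max_iter=1000),
        ("logit", "weighted"): LogisticRegression(max_iter=1000,
                                                  class_weight="balanced"),
        ("rf", "none"): RandomForestClassifier(random_state=seed),
        ("rf", "weighted"): RandomForestClassifier(random_state=seed,
                                                   class_weight="balanced"),
    }
    best, best_f1 = None, -1.0
    for imp_name, imputer in imputers.items():
        for (clf_name, treatment), clf in classifiers.items():
            pipe = Pipeline([("impute", imputer), ("clf", clf)])
            f1 = cross_val_score(pipe, X, y, cv=3, scoring="f1").mean()
            if f1 > best_f1:
                best, best_f1 = (imp_name, treatment, clf_name), f1
    return best, best_f1


# Imbalanced synthetic data (~10% positives) with MCAR amputation:
# roughly 10% of cells are deleted uniformly at random.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

combo, f1 = select_best_combination(X, y)
```

Because imputation sits inside the `Pipeline`, it is refit on each training fold, so the selected combination is not biased by imputing with held-out information.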

