4.7 Article

RESI: A Region-Splitting Imputation method for different types of missing data

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 168

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2020.114425

Keywords

Data mining; Missing data imputation; Region-splitting; k-fold cross validation

Funding

  1. National Natural Science Foundation of China [61772342, 61703278]


This paper introduces a novel tuple-based imputation model RESI, which defines the mean integrity rate to measure the missing degree of a dataset, and utilizes the entropy weight method to select features and assign weights to attributes for improved imputation accuracy and generalization capability.
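The entropy weight method mentioned above can be illustrated with a minimal sketch. This is the textbook formulation applied to complete numeric tuples, not necessarily the exact variant used in RESI; the function name `entropy_weights`, the min-max scaling step, and the handling of constant attributes are assumptions made for illustration.

```python
import math

def entropy_weights(rows):
    """Textbook entropy weight method for complete numeric tuples.

    rows: list of equal-length lists of numbers (at least two tuples).
    Returns one weight per attribute; the weights sum to 1.
    """
    n, m = len(rows), len(rows[0])
    divergence = []
    for j in range(m):
        col = [row[j] for row in rows]
        lo, hi = min(col), max(col)
        # Min-max scale the attribute to [0, 1].
        scaled = [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col]
        total = sum(scaled)
        # Share of each tuple in the attribute; a constant attribute is
        # treated as uniform (assumption), so it carries no information.
        p = [s / total if total > 0 else 1.0 / n for s in scaled]
        # Normalised Shannon entropy, with 0 * log(0) taken as 0.
        e = -sum(q * math.log(q) for q in p if q > 0) / math.log(n)
        divergence.append(max(0.0, 1.0 - e))
    d_sum = sum(divergence)
    if d_sum == 0:  # every attribute is constant: fall back to equal weights
        return [1.0 / m] * m
    return [d / d_sum for d in divergence]
```

Attributes whose values are spread unevenly across tuples have lower entropy and therefore receive larger weights; an attribute with weight near zero is a natural candidate to drop during feature selection.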
Data loss degrades the accuracy and availability of data and, in particular, undermines subsequent in-depth data analysis and mining. A data imputation model that can complete different types of missing data (numerical only, categorical only, and mixed-type) and that generalizes well is therefore of great practical value. To address this issue, this paper defines a new metric, the mean integrity rate, to measure the missing degree of a dataset, and proposes RESI, a novel tuple-based REgion-Splitting Imputation model, to impute different types of missing data. We first select features and assign a weight to each attribute using the entropy weight method, and then partition the tuples into one subset of complete tuples and several subsets of incomplete tuples based on their integrity rate, which is formulated from the attribute weights and the missing degree of each tuple. The model is trained iteratively on the complete tuple subset. In each iteration, the trained model imputes the next incomplete subset, and the imputed subset is merged into the complete subset to train the next model. To improve imputation accuracy, we leverage k-fold cross validation to correct errors. Extensive experimental results show that RESI not only imputes diverse types of missing data but also significantly outperforms state-of-the-art methods in sensitivity to the missing rate and in the accuracy of the imputed data.
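The region-splitting loop described in the abstract can be sketched roughly as follows. The abstract does not give the integrity-rate formula or the underlying model, so everything here is an assumption for illustration: the integrity rate is taken as the weighted share of observed attributes, regions are equal-sized slices ordered from most to least complete, a weighted nearest-neighbour donor stands in for the trained model, numeric attributes are assumed to be pre-scaled, and the k-fold cross-validation correction step is omitted. All function names are hypothetical.

```python
def integrity_rate(row, weights):
    """Weighted share of observed attributes in one tuple (None = missing)."""
    observed = sum(w for v, w in zip(row, weights) if v is not None)
    return observed / sum(weights)

def mean_integrity_rate(rows, weights):
    """Average integrity rate over all tuples of the dataset."""
    return sum(integrity_rate(r, weights) for r in rows) / len(rows)

def split_regions(rows, weights, n_regions=3):
    """Return (complete, regions); regions run from most to least complete."""
    complete = [list(r) for r in rows if None not in r]
    incomplete = sorted((r for r in rows if None in r),
                        key=lambda r: integrity_rate(r, weights),
                        reverse=True)
    size = max(1, -(-len(incomplete) // n_regions))  # ceiling division
    regions = [incomplete[i:i + size] for i in range(0, len(incomplete), size)]
    return complete, regions

def distance(row, candidate, weights):
    """Weighted mismatch over the attributes observed in `row`."""
    total = 0.0
    for x, y, w in zip(row, candidate, weights):
        if x is None:
            continue
        if isinstance(x, (int, float)):
            total += w * abs(x - y)   # numeric: absolute difference
        else:
            total += w * (x != y)     # categorical: 0/1 mismatch
    return total

def impute_region(region, complete, weights):
    """Fill each tuple from its nearest complete tuple (stand-in model)."""
    filled = []
    for row in region:
        donor = min(complete, key=lambda c: distance(row, c, weights))
        filled.append([d if v is None else v for v, d in zip(row, donor)])
    return filled

def region_splitting_impute(rows, weights, n_regions=3):
    """Impute region by region, merging each imputed region into the complete set."""
    complete, regions = split_regions(rows, weights, n_regions)
    if not complete:
        raise ValueError("at least one complete tuple is required")
    for region in regions:
        # The model for this iteration only sees tuples merged so far.
        complete.extend(impute_region(region, complete, weights))
    return complete
```

A small mixed-type usage example, with weights chosen by hand rather than computed:

```python
rows = [
    [1.0, "a", 0.2],
    [0.9, "a", 0.1],
    [0.1, "b", 0.8],
    [0.0, "b", None],
    [1.0, None, None],
]
weights = [0.5, 0.3, 0.2]
imputed = region_splitting_impute(rows, weights, n_regions=2)
```

Note that the returned list places the originally complete tuples first, followed by the imputed regions, so the input order is not preserved.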
