Article

Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 131, Pages 299-307

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2019.04.049

Keywords

Multiple imputation; High missingness; Model selection; Machine learning; Data preprocessing; Water quality


In the current era of ubiquitous information, extracting knowledge from large amounts of data is increasingly acknowledged as a promising way to provide relevant insights to decision makers. One key issue is the poor quality of the raw data, particularly high missingness, which may affect the quality and relevance of the interpretation of results. Automating the exploration of the underlying data with powerful methods that handle missingness and then perform a learning process to discover relevant knowledge can therefore be considered a successful strategy for system monitoring. Within the context of water quality analysis, the aim of the present study is to propose a robust method for selecting the best algorithm to combine with MICE (Multivariate Imputations by Chained Equations) in order to handle multiple relationships among a large number of features of interest (more than 200) affected by a high rate of missingness (more than 80%). The main contribution is to improve MICE by taking advantage of the ability of Machine Learning algorithms to address complex relationships among a large number of parameters. The competing methods implemented are Random Forest (RF), Boosted Regression Trees (BRT), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR). The results show that hybridizing MICE with SVR, KNN, RF or BRT performs better than the original MICE alone. Furthermore, MICE-SVR gives a good trade-off between performance and computing time. © 2019 Elsevier Ltd. All rights reserved.
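
The idea of pairing chained-equations imputation with a pluggable learner can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`knn_regressor`, `chained_imputation`), the toy data, and the choice of a plain k-nearest-neighbours mean as the learner are all assumptions made for illustration. The sketch runs a single deterministic chain, whereas multiple imputation would produce several completed datasets with some randomness in each. It also assumes every column has at least one observed value.

```python
import math


def knn_regressor(k=2):
    """Hypothetical learner: returns fit(X, y) -> predict(x) using a k-NN mean."""
    def fit(X, y):
        def predict(x):
            # Keep the k training rows closest to x (Euclidean distance).
            nearest = sorted(zip(X, y), key=lambda pair: math.dist(pair[0], x))[:k]
            return sum(target for _, target in nearest) / len(nearest)
        return predict
    return fit


def chained_imputation(rows, fit_model, n_iter=5):
    """Single chained-equations run; None marks a missing value."""
    n_cols = len(rows[0])
    missing = [[value is None for value in row] for row in rows]

    # Step 1: start every missing cell at the mean of its column's observed values.
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    data = [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

    def other_columns(i, j):
        return [data[i][c] for c in range(n_cols) if c != j]

    # Step 2: cycle over columns, re-predicting missing cells from the other columns.
    for _ in range(n_iter):
        for j in range(n_cols):
            train = [i for i in range(len(data)) if not missing[i][j]]
            target = [i for i in range(len(data)) if missing[i][j]]
            if not train or not target:
                continue
            predict = fit_model(
                [other_columns(i, j) for i in train],
                [data[i][j] for i in train],
            )
            for i in target:
                data[i][j] = predict(other_columns(i, j))
    return data


# Toy data (made up): the second column is roughly twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0], [None, 10.0]]
completed = chained_imputation(rows, knn_regressor(k=2))
print(completed)
```

Because the learner is passed in as `fit_model`, any other regressor exposing the same fit-then-predict shape could be substituted in this sketch, which is the kind of model swapping the abstract compares.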
