
Evaluating the state of the art in missing data imputation for clinical data

Journal

BRIEFINGS IN BIOINFORMATICS
Volume 23, Issue 1

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/bib/bbab489

Keywords

missing data imputation; machine learning; clinical laboratory test; time series

Funding

  1. National Library of Medicine [R01LM013337]


Clinical data often have missing entries, posing a challenge to deriving optimal knowledge from the data. The Data Analytics Challenge on Missing data Imputation (DACMI) provides a benchmark dataset for evaluating and advancing imputation techniques for clinical time series. Competitive machine learning and statistical models coupled with carefully engineered features show strong performance in imputation.
Clinical data are increasingly being mined to derive new medical knowledge with a goal of enabling greater diagnostic precision, better-personalized therapeutic regimens, improved clinical outcomes and more efficient utilization of health-care resources. However, clinical data are often only available at irregular intervals that vary between patients and types of data, with entries often being unmeasured or unknown. As a result, missing data often represent one of the major impediments to optimal knowledge derivation from clinical data. The Data Analytics Challenge on Missing data Imputation (DACMI) presented a shared clinical dataset with ground truth for evaluating and advancing the state of the art in imputing missing data for clinical time series. We extracted 13 commonly measured blood laboratory tests. To evaluate imputation performance, we randomly removed one recorded result per laboratory test per patient admission and used the removed values as the ground truth. To the best of our knowledge, DACMI is the first shared-task challenge on clinical time series imputation. The challenge attracted 12 international teams spanning three continents across multiple industries and academia. The evaluation outcome suggests that competitive machine learning and statistical models (e.g. LightGBM, MICE and XGBoost) coupled with carefully engineered temporal and cross-sectional features can achieve strong imputation performance. However, care needs to be taken to prevent excessive model complexity. The participating systems collectively experimented with a wide range of machine learning and probabilistic algorithms to combine temporal imputation and cross-sectional imputation, and their design principles will inform future efforts to better model clinical missing data.
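
The masking step described in the abstract (removing one recorded result per laboratory test per patient admission and keeping it as ground truth) can be illustrated with a minimal Python sketch. This is not the challenge's actual code: the data layout, the test names `test_a` and `test_b`, the helper names, the nearest-value baseline and the mean absolute error used for scoring are all assumptions made for illustration.

```python
import copy
import random


def mask_one_per_test(admissions, tests, seed=0):
    """Hide one recorded value per test per admission; return (masked, truth)."""
    rng = random.Random(seed)
    masked = copy.deepcopy(admissions)  # leave the input untouched
    truth = {}
    for adm_id, rows in admissions.items():
        for test in tests:
            # Time points at which this test was actually recorded.
            recorded = [i for i, row in enumerate(rows) if row.get(test) is not None]
            if not recorded:
                continue  # nothing to hide for this test in this admission
            i = rng.choice(recorded)
            truth[(adm_id, test, i)] = rows[i][test]
            masked[adm_id][i][test] = None
    return masked, truth


def nearest_recorded(rows, test, i):
    """Toy temporal baseline: nearest earlier recorded value, else nearest later one."""
    for j in list(range(i - 1, -1, -1)) + list(range(i + 1, len(rows))):
        if rows[j].get(test) is not None:
            return rows[j][test]
    return None


# Hypothetical admission: two tests observed at three time points.
admissions = {
    "adm1": [
        {"test_a": 140.0, "test_b": 4.1},
        {"test_a": 138.0, "test_b": None},  # test_b was not measured here
        {"test_a": 139.0, "test_b": 4.3},
    ]
}

masked, truth = mask_one_per_test(admissions, ["test_a", "test_b"], seed=0)

# Score the baseline against the held-out values (assumed metric: mean absolute error).
errors = []
for (adm_id, test, i), actual in truth.items():
    guess = nearest_recorded(masked[adm_id], test, i)
    if guess is not None:
        errors.append(abs(guess - actual))
mean_abs_error = sum(errors) / len(errors)
```

Because only values that were actually recorded are hidden, every held-out entry has a known answer, so any imputation method can be scored by substituting it for `nearest_recorded`.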
