
Evaluating the state of the art in missing data imputation for clinical data

BRIEFINGS IN BIOINFORMATICS (2022)

Journal

BRIEFINGS IN BIOINFORMATICS

Volume 23, Issue 1

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/bib/bbab489

Keywords

missing data imputation; machine learning; clinical laboratory test; time series

Funding

National Library of Medicine [R01LM013337]

Automated Summary

Clinical data often have missing entries, posing a challenge to deriving optimal knowledge from the data. The Data Analytics Challenge on Missing data Imputation (DACMI) provides a benchmark dataset for evaluating and advancing imputation techniques for clinical time series. Competitive machine learning and statistical models coupled with carefully engineered features show strong performance in imputation.

Abstract

Clinical data are increasingly being mined to derive new medical knowledge with the goal of enabling greater diagnostic precision, better personalized therapeutic regimens, improved clinical outcomes and more efficient utilization of health-care resources. However, clinical data are often only available at irregular intervals that vary between patients and types of data, with entries often being unmeasured or unknown. As a result, missing data often represent one of the major impediments to optimal knowledge derivation from clinical data. The Data Analytics Challenge on Missing data Imputation (DACMI) presented a shared clinical dataset with ground truth for evaluating and advancing the state of the art in imputing missing data for clinical time series. We extracted 13 commonly measured blood laboratory tests. To evaluate the imputation performance, we randomly removed one recorded result per laboratory test per patient admission and used the removed values as the ground truth. To the best of our knowledge, DACMI is the first shared-task challenge on clinical time series imputation. The challenge attracted 12 international teams spanning three continents across multiple industries and academia. The evaluation outcome suggests that competitive machine learning and statistical models (e.g. LightGBM, MICE and XGBoost) coupled with carefully engineered temporal and cross-sectional features can achieve strong imputation performance. However, care needs to be taken to prevent excessive model complexity. The participating systems collectively experimented with a wide range of machine learning and probabilistic algorithms to combine temporal imputation and cross-sectional imputation, and their design principles will inform future efforts to better model clinical missing data.
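
The evaluation protocol described in the abstract (hide one recorded result per laboratory test per patient admission, then score imputations against the hidden values) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record layout, the function names `mask_one_per_test`, `mean_impute` and `rmse`, the toy values, the per-test mean baseline and the plain RMSE metric are not taken from the challenge itself, whose data format and official metric are not given in this abstract.

```python
import math
import random
from collections import defaultdict


def mask_one_per_test(records, seed=0):
    """Hide one recorded value per (admission, test) pair.

    records: list of dicts with keys 'admission', 'test', 'time', 'value'
    (a hypothetical layout). Returns (masked_records, ground_truth), where
    ground_truth maps a record index to the value that was hidden.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, r in enumerate(records):
        if r["value"] is not None:
            groups[(r["admission"], r["test"])].append(i)

    masked = [dict(r) for r in records]  # copy so the input stays intact
    truth = {}
    for key in sorted(groups):  # sorted for reproducibility
        i = rng.choice(groups[key])
        truth[i] = masked[i]["value"]
        masked[i]["value"] = None
    return masked, truth


def mean_impute(masked):
    """Naive baseline: fill each gap with the mean of the observed values
    of the same test. Assumes at least one observed value per test remains."""
    observed = defaultdict(list)
    for r in masked:
        if r["value"] is not None:
            observed[r["test"]].append(r["value"])
    return {
        i: sum(observed[r["test"]]) / len(observed[r["test"]])
        for i, r in enumerate(masked)
        if r["value"] is None
    }


def rmse(truth, imputed):
    """Root mean squared error over the hidden entries."""
    errs = [(imputed[i] - v) ** 2 for i, v in truth.items()]
    return math.sqrt(sum(errs) / len(errs))


# Toy data: two admissions, two illustrative tests, times in hours.
records = [
    {"admission": 1, "test": "sodium", "time": 0, "value": 140.0},
    {"admission": 1, "test": "sodium", "time": 6, "value": 138.0},
    {"admission": 1, "test": "glucose", "time": 0, "value": 95.0},
    {"admission": 1, "test": "glucose", "time": 6, "value": 110.0},
    {"admission": 2, "test": "sodium", "time": 0, "value": 135.0},
    {"admission": 2, "test": "sodium", "time": 4, "value": 137.0},
]

masked, truth = mask_one_per_test(records, seed=0)
score = rmse(truth, mean_impute(masked))
print(f"hidden entries: {len(truth)}, baseline RMSE: {score:.3f}")
```

The baseline ignores both the time stamps and the other tests measured at the same time point. The models named in the abstract draw on exactly those temporal and cross-sectional signals, so a sketch like this serves only as a floor against which such models could be compared.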
