Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

AMERICAN JOURNAL OF EPIDEMIOLOGY (2014)

Journal

AMERICAN JOURNAL OF EPIDEMIOLOGY

Volume 179, Issue 6, Pages 764-774

Publisher

OXFORD UNIV PRESS INC

DOI: 10.1093/aje/kwt312

Keywords

angina, stable; imputation; missing data; missingness at random; regression trees; simulation; survival

Category

Public, Environmental & Occupational Health

Funding

United Kingdom National Institute for Health Research [RP-PG-0407-10314]
Wellcome Trust [086091/Z/08/Z, 0938/30/Z/10/Z]
Medical Research Council [MR/K006584/1, G0902393, G0900724]
United Kingdom Biobank
Farr Institute of Health Informatics Research (Health eResearch Centre Network)
Medical Research Council
Arthritis Research UK
British Heart Foundation
Cancer Research UK
Economic and Social Research Council
Engineering and Physical Sciences Research Council
National Institute for Health Research
National Institute for Social Care and Health Research (Welsh Assembly Government)
Chief Scientist Office (Scottish Government Health Directorates)
Wellcome Trust
ESRC [ES/H022252/1, ES/G026300/1] Funding Source: UKRI
MRC [MC_EX_G0800814, G0900724, G0902393, MR/K02180X/1] Funding Source: UKRI
Economic and Social Research Council [ES/H022252/1, ES/G026300/1] Funding Source: researchfish
Medical Research Council [MR/K006584/1, MC_EX_G0800814, G0902393, G0900724, MR/K02180X/1] Funding Source: researchfish

Abstract

Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The true imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made missing at random, and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
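The setting of the second simulation study (a partially observed variable that depends nonlinearly on a fully observed one, with values made missing at random) can be illustrated with a minimal sketch. This is not the authors' code or their simulation design. The quadratic relationship, sample size, missingness mechanism, and all function names (`fit_tree`, `predict_tree`, `simulate`) are assumptions chosen for illustration, and a small ensemble of bagged one-predictor regression trees stands in for a full random forest. The sketch compares only the conditional-mean predictions of a linear imputation model and the tree ensemble. A proper MICE procedure would also add random draws and repeat the process across multiple imputed data sets.

```python
import math
import random


def fit_tree(pairs, depth=4, min_leaf=10):
    """Fit a one-predictor regression tree; a leaf is the mean outcome."""
    pairs = sorted(pairs)
    n = len(pairs)
    total = sum(y for _, y in pairs)
    if depth == 0 or n < 2 * min_leaf:
        return total / n
    best_score, best_i, left_sum = -math.inf, None, 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if i < min_leaf or n - i < min_leaf or pairs[i - 1][0] == pairs[i][0]:
            continue
        # Maximizing this score is equivalent to minimizing squared error.
        score = left_sum ** 2 / i + (total - left_sum) ** 2 / (n - i)
        if score > best_score:
            best_score, best_i = score, i
    if best_i is None:
        return total / n
    cut = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2
    return (cut,
            fit_tree(pairs[:best_i], depth - 1, min_leaf),
            fit_tree(pairs[best_i:], depth - 1, min_leaf))


def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        cut, left, right = tree
        tree = left if x <= cut else right
    return tree


def simulate(seed=1, n=600, n_trees=25):
    """Return (rmse_linear, rmse_forest) on the values made missing."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2, 2) for _ in range(n)]
    # Assumed nonlinear dependence of y on the fully observed x.
    ys = [x * x + rng.gauss(0, 0.3) for x in xs]
    # Missing at random: the chance that y is missing depends only on x.
    missing = [rng.random() < 1 / (1 + math.exp(-x)) for x in xs]
    observed = [(x, y) for x, y, m in zip(xs, ys, missing) if not m]
    held_out = [(x, y) for x, y, m in zip(xs, ys, missing) if m]

    # Parametric imputation model: least-squares line fitted to observed rows.
    mx = sum(x for x, _ in observed) / len(observed)
    my = sum(y for _, y in observed) / len(observed)
    slope = (sum((x - mx) * (y - my) for x, y in observed)
             / sum((x - mx) ** 2 for x, _ in observed))
    intercept = my - slope * mx

    # Tree ensemble: each tree is fitted to a bootstrap sample of observed rows.
    forest = [fit_tree(rng.choices(observed, k=len(observed)))
              for _ in range(n_trees)]

    def rmse(predict):
        return math.sqrt(sum((predict(x) - y) ** 2 for x, y in held_out)
                         / len(held_out))

    rmse_linear = rmse(lambda x: intercept + slope * x)
    rmse_forest = rmse(lambda x: sum(predict_tree(t, x) for t in forest) / n_trees)
    return rmse_linear, rmse_forest


if __name__ == "__main__":
    linear_error, forest_error = simulate()
    print("linear:", round(linear_error, 3), "trees:", round(forest_error, 3))
```

Under these assumptions the straight-line model cannot represent the curvature, so its predictions for the missing values are farther from the truth than those of the tree ensemble. This matches the direction of the result reported in the abstract, although it does not reproduce it.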
