Article

How to Apply Variable Selection Machine Learning Algorithms With Multiply Imputed Data: A Missing Discussion

Journal

PSYCHOLOGICAL METHODS
Volume 28, Issue 2, Pages 452-471

Publisher

AMER PSYCHOLOGICAL ASSOC
DOI: 10.1037/met0000478

Keywords

LASSO; missing data; multiple imputation; regularization; regression


Psychological researchers often use standard linear regression to identify relevant predictors of an outcome of interest, but challenges emerge with incomplete data and growing numbers of candidate predictors. Regularization methods like the LASSO can reduce the risk of overfitting, increase model interpretability, and improve prediction in future samples; however, handling missing data when using regularization-based variable selection methods is complicated. Using listwise deletion or an ad hoc imputation strategy to deal with missing data when using regularization methods can lead to loss of precision, substantial bias, and a reduction in predictive ability. In this tutorial, we describe three approaches for fitting a LASSO when using multiple imputation to handle missing data and illustrate how to implement these approaches in practice with an applied example. We discuss implications of each approach and describe additional research that would help solidify recommendations for best practices.

Translational Abstract

Standard linear regression is a commonly used model in psychological research that tests the relationships between hypothesized predictors and an outcome of interest; however, the estimated regression coefficients representing such associations are highly variable from sample to sample, making the conclusions less generalizable. Regularization methods like the LASSO reduce the variance of the estimates, increase model interpretability, and improve prediction in future samples. Until recently, regularization methods were primarily applied to data sets without missing values. Missing data are prevalent in psychological research and need to be handled appropriately to avoid substantial bias. Multiple imputation has gained currency as a principled approach to dealing with missing data. This tutorial describes three approaches for fitting a LASSO for variable selection when using multiple imputation to handle missing data, highlighting the additional research needed to solidify recommendations for best practices.
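The abstract does not spell out the three approaches, but one commonly discussed workflow for combining multiple imputation with the LASSO is to fit the model separately on each imputed data set and then aggregate the variable selections (e.g., retain predictors selected in a majority of imputations). The sketch below illustrates that idea only; it is an assumption for illustration, not necessarily one of the article's three approaches, and it uses scikit-learn's `IterativeImputer` (with `sample_posterior=True` to vary the imputations) and `LassoCV` as stand-ins for a full multiple-imputation pipeline.

```python
# Hedged sketch: LASSO variable selection across m multiply imputed
# data sets, combined by selection frequency. Illustrative only.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
# Only the first two predictors truly relate to the outcome.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

# Introduce ~15% missingness completely at random in the predictors.
X_miss = X.copy()
X_miss[rng.random((n, p)) < 0.15] = np.nan

m = 5  # number of imputations
selection_counts = np.zeros(p)
for i in range(m):
    # sample_posterior=True draws imputations stochastically, so each
    # random_state yields a different completed data set.
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_imp = imputer.fit_transform(X_miss)
    lasso = LassoCV(cv=5, random_state=i).fit(X_imp, y)
    selection_counts += (lasso.coef_ != 0)

# Retain predictors selected in a majority of the imputed data sets.
selected = np.where(selection_counts / m > 0.5)[0]
print("selection frequencies:", selection_counts / m)
print("selected predictors:", selected)
```

A majority-vote rule is only one way to pool selections; pooling coefficients, stacking the imputed data sets before fitting, or penalizing a predictor's coefficients jointly across imputations are alternatives discussed in this literature, and the tutorial's point is precisely that best practices among such options are not yet settled.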

