☆ 4.6 Article

Concentrations of criteria pollutants in the contiguous US, 1979-2015: Role of prediction model parsimony in integrated empirical geographic regression

PLOS ONE (2020)

Journal

PLOS ONE

Volume 15, Issue 2, Pages -

Publisher

PUBLIC LIBRARY SCIENCE

DOI: 10.1371/journal.pone.0228535

Keywords

Funding

Center for Clean Air Climate Solution (CACES) - U.S. Environmental Protection Agency (EPA) [R835873]
Ministry of Science and ICT of South Korea [2013R1A6A3A04059017, 2018R1A2B6004608]
National Cancer Center of South Korea [NCC-1810220-01]
U.S. National Institutes of Health [NIH/NIEHS R01 ES026246, R01 ES026187]
National Research Foundation of Korea - Ministry of Education
Korea Health Promotion Institute [1810220-3] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Abstract

National-scale empirical models for air pollution can include hundreds of geographic variables. The impact of model parsimony (i.e., how model performance differs for a large versus small number of covariates) has not been systematically explored. We aim to (1) build annual-average integrated empirical geographic (IEG) regression models for the contiguous U.S. for six criteria pollutants during 1979-2015; (2) explore systematically the impact on model performance of the number of variables selected for inclusion in a model; and (3) provide publicly available model predictions. We compute annual-average concentrations from regulatory monitoring data for PM10, PM2.5, NO2, SO2, CO, and ozone at all monitoring sites for 1979-2015. We also use similar to 350 geographic characteristics at each location including measures of traffic, land use, land cover, and satellite-based estimates of air pollution. We then develop IEG models, employing universal kriging and summary factors estimated by partial least squares (PLS) of geographic variables. For all pollutants and years, we compare three approaches for choosing variables to include in the PLS model: (1) no variables, (2) a limited number of variables selected from the full set by forward selection, and (3) all variables. We evaluate model performance using 10-fold cross-validation (CV) using conventional and spatially-clustered test data. Models using 3 to 30 variables selected from the full set generally have the best performance across all pollutants and years (median R-2 conventional [clustered] CV: 0.66 [0.47]) compared to models with no (0.37 [0]) or all variables (0.64 [0.27]). Concentration estimates for all Census Blocks reveal generally decreasing concentrations over several decades with local heterogeneity. Our findings suggest that national prediction models can be built by empirically selecting only a small number of important variables to provide robust concentration estimates. Model estimates are freely available online.

Concentrations of criteria pollutants in the contiguous US, 1979-2015: Role of prediction model parsimony in integrated empirical geographic regression

Journal

PLOS ONE

Publisher

PUBLIC LIBRARY SCIENCE

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Concentrations of criteria pollutants in the contiguous US, 1979-2015: Role of prediction model parsimony in integrated empirical geographic regression

Journal

PLOS ONE

Publisher

PUBLIC LIBRARY SCIENCE

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper