Article

Correcting for Optimistic Prediction in Small Data Sets

Journal

American Journal of Epidemiology
Volume 180, Issue 3, Pages 318-324

Publisher

Oxford University Press
DOI: 10.1093/aje/kwu140

Keywords

logistic models; statistical models; multivariate analysis; receiver operating characteristic curve

Funding

  1. National Institute for Health Research United Kingdom (Cambridge Comprehensive Biomedical Research Centre)
  2. MRC [MR/L003120/1, MC_EX_G0800814, G0701619, MC_U105260558] Funding Source: UKRI
  3. British Heart Foundation [RG/08/014/24067] Funding Source: researchfish
  4. Medical Research Council [MC_U105260558, MC_EX_G0800814, G0701619, MR/L003120/1] Funding Source: researchfish
  5. National Institute for Health Research [NF-SI-0512-10165] Funding Source: researchfish

Abstract

The C statistic is a commonly reported measure of screening test performance. Optimistic estimation of the C statistic is a frequent problem because of overfitting of statistical models in small data sets, and methods exist to correct for this issue. However, many studies do not use such methods, and those that do correct for optimism use diverse methods, some of which are known to be biased. We used clinical data sets (United Kingdom Down syndrome screening data from Glasgow (1991-2003), Edinburgh (1999-2003), and Cambridge (1990-2006), as well as Scottish national pregnancy discharge data (2004-2007)) to evaluate different approaches to adjustment for optimism. We found that sample splitting, cross-validation without replication, and leave-1-out cross-validation produced optimism-adjusted estimates of the C statistic that were biased and/or associated with greater absolute error than other available methods. Cross-validation with replication, bootstrapping, and a new method (leave-pair-out cross-validation) all generated unbiased optimism-adjusted estimates of the C statistic and had similar absolute errors in the clinical data set. Larger simulation studies confirmed that all 3 methods performed similarly with 10 or more events per variable, or when the C statistic was 0.9 or greater. However, with lower events per variable or lower C statistics, bootstrapping tended to be optimistic but with lower absolute and mean squared errors than both methods of cross-validation.
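The bootstrap optimism correction evaluated in the abstract can be illustrated with a short, self-contained sketch. This is not the paper's code: the function names, the toy 1-nearest-neighbour scorer (chosen because memorisation makes its apparent C statistic maximally optimistic), and the example data are all illustrative. The correction follows the standard Harrell-style recipe: refit the model on each bootstrap resample, measure the drop in C between the resample and the original data, and subtract the average drop from the apparent C statistic.

```python
import random

def c_statistic(scores, labels):
    """C statistic (equivalently, AUC): the probability that a randomly
    chosen event scores higher than a randomly chosen non-event; ties 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    conc = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return conc / (len(pos) * len(neg))

def fit_1nn(xs, ys):
    """Toy scorer: return the outcome of the nearest training point.
    It memorises the training data, so its apparent C statistic is 1.0."""
    def score(x):
        i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[i]
    return score

def bootstrap_corrected_c(xs, ys, fit, n_boot=200, seed=1):
    """Bootstrap optimism correction:
    corrected C = apparent C - mean(C on resample - C on original data),
    where the model is refitted on each bootstrap resample."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = c_statistic([fit(xs, ys)(x) for x in xs], ys)
    optimism = 0.0
    for _ in range(n_boot):
        # resample with replacement until both outcome classes are present
        while True:
            idx = [rng.randrange(n) for _ in range(n)]
            by = [ys[i] for i in idx]
            if 0 < sum(by) < n:
                break
        bx = [xs[i] for i in idx]
        model = fit(bx, by)
        c_boot = c_statistic([model(x) for x in bx], by)  # apparent on resample
        c_orig = c_statistic([model(x) for x in xs], ys)  # tested on original
        optimism += (c_boot - c_orig) / n_boot
    return apparent - optimism

# Illustrative interleaved data: the memorising model looks perfect in-sample,
# and the bootstrap correction pulls the estimate well below 1.0.
xs = [0.1, 0.3, 0.35, 0.5, 0.55, 0.7, 0.8, 0.9, 1.0, 1.2]
ys = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
apparent = c_statistic([fit_1nn(xs, ys)(x) for x in xs], ys)  # 1.0 (memorised)
corrected = bootstrap_corrected_c(xs, ys, fit_1nn)            # below 1.0
```

In real use the toy scorer would be replaced by the actual model-fitting routine (e.g. logistic regression), refitted from scratch inside each bootstrap replicate so that every modelling step, including variable selection, is repeated on each resample.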
