Proceedings Paper

Test data reuse for evaluation of adaptive machine learning algorithms: Overfitting to a fixed test dataset and a potential solution

MEDICAL IMAGING 2018: IMAGE PERCEPTION, OBSERVER PERFORMANCE, AND TECHNOLOGY ASSESSMENT (2018)

Journal

MEDICAL IMAGING 2018: IMAGE PERCEPTION, OBSERVER PERFORMANCE, AND TECHNOLOGY ASSESSMENT

Volume 10577

Publisher

SPIE-INT SOC OPTICAL ENGINEERING

DOI: 10.1117/12.2293818

Keywords

Adaptive data analysis; continuous machine learning; data reuse; receiver operating characteristic curve (ROC); area under the ROC curve (AUC); classification performance; Thresholdout; reusable holdout

Funding

Research Participation Program at the U.S. Food and Drug Administration
U.S. Food and Drug Administration

Abstract

After the initial release of a machine learning algorithm, subsequently gathered data can be used to augment the training dataset in order to modify or fine-tune the algorithm. For a performance evaluation that generalizes to a targeted population of cases, test datasets should ideally be drawn at random from that population. To ensure that test results generalize to new data, the algorithm needs to be evaluated on new and independent test data each time a new performance evaluation is required. However, medical test datasets of sufficient quality are often hard to acquire, and it is tempting to reuse a previously used test dataset for a new performance evaluation. With extensive simulation studies, we illustrate how such a naive approach to test data reuse can inadvertently result in overfitting the algorithm to the test data, even when only a global performance metric is reported back from the test dataset. This overfitting leads to a loss in generalization and to overly optimistic conclusions about the algorithm's performance. We investigate the use of the Thresholdout method of Dwork et al. (Ref. 1) to tackle this problem. Thresholdout allows repeated reuse of the same test dataset. It essentially reports a noisy version of the performance metric on the test data, and it provides theoretical guarantees on how many times the test dataset can be accessed while the reported answers still generalize to the underlying distribution. Our simulations show that Thresholdout substantially reduces the problem of overfitting to the test data under the simulation conditions, at the cost of a mild additional uncertainty in the reported test performance. We also extend some of the theoretical guarantees to the area under the ROC curve as the reported performance metric.
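The reporting rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration of the general Thresholdout idea, not the authors' implementation: the class name, the parameter names (`threshold`, `sigma`, `budget`) and their default values are assumptions chosen for the example, and the performance metric (for instance AUC) is assumed to have been computed already on the training and test data. The mechanism returns the training value when the two values agree within a noisy threshold, and otherwise returns the test value with Laplace noise added while consuming one unit of the access budget.

```python
import random


def laplace(rng, scale):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class Thresholdout:
    """Simplified reusable-holdout mechanism (illustrative sketch only)."""

    def __init__(self, threshold=0.02, sigma=0.01, budget=100, seed=0):
        # All defaults are hypothetical values picked for this example.
        self.threshold = threshold  # tolerated train/test disagreement
        self.sigma = sigma          # base scale of the Laplace noise
        self.budget = budget        # number of noisy test answers allowed
        self.rng = random.Random(seed)

    def query(self, train_value, test_value):
        """Return the metric value reported back for one evaluation request."""
        if self.budget <= 0:
            raise RuntimeError("test-set access budget exhausted")
        noisy_threshold = self.threshold + laplace(self.rng, 2 * self.sigma)
        gap = abs(test_value - train_value) + laplace(self.rng, 4 * self.sigma)
        if gap <= noisy_threshold:
            # Train and test agree: the test data reveals nothing new.
            return train_value
        # Disagreement suggests overfitting: report a noisy test value
        # and spend one unit of the budget.
        self.budget -= 1
        return test_value + laplace(self.rng, self.sigma)


if __name__ == "__main__":
    mechanism = Thresholdout()
    # Hypothetical AUC values computed on training and test data.
    print(mechanism.query(train_value=0.95, test_value=0.80))
```

The analyst only ever sees the returned value, never the raw test metric, which is what limits how much information about the test set leaks through repeated evaluations.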
