4.6 Article

A strategy for validation of variables derived from large-scale electronic health record data

Journal

JOURNAL OF BIOMEDICAL INFORMATICS
Volume 121

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.jbi.2021.103879

Keywords

Electronic phenotyping; Large-scale electronic health records; Data abstraction validation; Sample size; Positive predictive value; Negative predictive value

Funding

  1. VA Office of Research and Development of the Department of Veterans Affairs [5 I01 HX 001574-05]
  2. National Cancer Institute of the National Institutes of Health [5 R37 CA 222866-04, 5 F32 CA 239360-03]

This paper proposes the San Diego Approach to Variable Validation (SDAVV), a methodology for rigorously validating phenotyping from large-scale EHR data. Applied to two case studies, the approach produced phenotyping algorithms with high performance, suggesting that a structured validation process can improve the quality of big data research.
Purpose: Standardized approaches for rigorous validation of phenotyping from large-scale electronic health record (EHR) data have not been widely reported. We propose a methodologically rigorous and efficient approach to guide such validation, the San Diego Approach to Variable Validation (SDAVV), which includes strategies for sampling cases and controls, determining sample sizes, estimating algorithm performance, and terminating the validation process.
Methods: We propose sample size formulae to be applied before chart review, based on pre-specified critical lower bounds for positive predictive value (PPV) and negative predictive value (NPV). We also propose a stepwise strategy for iterative algorithm development/validation cycles, in which sample sizes for data abstraction are updated until both PPV and NPV achieve target performance.
Results: We applied the SDAVV to a Department of Veterans Affairs study in which we created two phenotyping algorithms, one for distinguishing normal colonoscopy cases from abnormal colonoscopy controls and one for identifying aspirin exposure. For identifying normal colonoscopy cases, estimated PPV and NPV both reached 0.970 (95% confidence lower bound of 0.915), with an estimated sensitivity of 0.963 and specificity of 0.975. The phenotyping algorithm for identifying aspirin exposure reached a PPV of 0.990 (95% lower bound of 0.950) and an NPV of 0.980 (95% lower bound of 0.930), with sensitivity and specificity of 0.960 and 1.000, respectively.
Conclusions: A structured approach for prospectively developing and validating phenotyping algorithms from large-scale EHR data can be successfully implemented, and should be considered to improve the quality of big data research.
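The abstract does not state the SDAVV formulae themselves, so the following is only a minimal Python sketch of the general idea, under stated assumptions: a textbook normal-approximation sample size for showing that a proportion such as PPV exceeds a critical lower bound, and an exact (Clopper-Pearson) lower confidence bound used as a stopping check after chart review. The function names, the default `alpha` of 0.025 (the lower limit of a two-sided 95% interval), and the 80% power are all hypothetical choices, not taken from the paper.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist


def sample_size_for_lower_bound(p0, p1, alpha=0.025, power=0.80):
    """Charts needed to show a proportion exceeds p0 when its true value is p1.

    Textbook normal-approximation formula for a one-sided, one-sample test
    of a proportion. This is an assumption, not the SDAVV formula.
    """
    if not 0 < p0 < p1 < 1:
        raise ValueError("require 0 < p0 < p1 < 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_power * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


def binom_tail_ge(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p); intended for modest n (a few hundred)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def exact_lower_bound(x, n, alpha=0.025):
    """Clopper-Pearson lower bound: the p at which P(X >= x) equals alpha."""
    if not 0 <= x <= n or n == 0:
        raise ValueError("require 0 <= x <= n and n > 0")
    if x == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; the tail probability increases with p
        mid = (lo + hi) / 2
        if binom_tail_ge(x, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def meets_target(correct, reviewed, critical, alpha=0.025):
    """Stopping check: does the lower bound clear the pre-specified critical value?"""
    return exact_lower_bound(correct, reviewed, alpha) >= critical


if __name__ == "__main__":
    # Hypothetical planning inputs: critical PPV 0.90, anticipated PPV 0.97.
    print("charts to review:", sample_size_for_lower_bound(0.90, 0.97))
    # Hypothetical review outcome: 97 of 100 algorithm-positive charts confirmed.
    print("lower bound:", round(exact_lower_bound(97, 100), 3))
    print("stop validating PPV:", meets_target(97, 100, critical=0.90))
```

In an iterative cycle, the same check would be run separately for PPV (on sampled algorithm-positive charts) and NPV (on sampled algorithm-negative charts), and the algorithm would be revised and re-sampled until both bounds clear their critical values.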
