Article

Development and validation of a pancreatic cancer risk model for the general population using electronic health records: An observational study

Journal

EUROPEAN JOURNAL OF CANCER
Volume 143, Pages 19-30

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.ejca.2020.10.019

Keywords

Pancreatic carcinoma; Adenocarcinoma; Electronic health records; Logistic regression models; AUC

This study used electronic health record databases and machine learning models to identify individuals at high risk of pancreatic ductal adenocarcinoma (PDAC). The L2-regularised logistic regression (LR) model performed best and could flag high-risk patients up to 365 days before diagnosis. Risk stratification showed that the cancer rate in the high-risk group was 3 to 5 times the prevalence in the data set.
Aim: Pancreatic ductal adenocarcinoma (PDAC) is often diagnosed at a late, incurable stage. We sought to determine whether individuals at high risk of developing PDAC could be identified early using routinely collected data.
Methods: Electronic health record (EHR) databases from two independent hospitals in Boston, Massachusetts, providing inpatient, outpatient, and emergency care, from 1979 through 2017, were used with case-control matching. PDAC cases were selected using International Classification of Diseases 9/10 codes and validated with tumour registries. A data-driven feature selection approach was used to develop neural networks and L2-regularised logistic regression (LR) models on training data (594 cases, 100,787 controls) and compared with a published model based on hand-selected diagnoses ('baseline'). Model performance was validated on an external database (408 cases, 160,185 controls). Three prediction lead times (180, 270 and 365 days) were considered.
Results: The LR model had the best performance, with an area under the curve (AUC) of 0.71 (confidence interval [CI]: 0.67-0.76) for the training set, and AUC 0.68 (CI: 0.65-0.71) for the validation set, 365 days before diagnosis. Data-driven feature selection improved results over 'baseline' (AUC = 0.55; CI: 0.52-0.58). The LR model flags 2692 (CI: 2592-2791) of 156,485 as high risk, 365 days in advance, identifying 25 (CI: 16-36) cancer patients. Risk stratification showed that the high-risk group presented a cancer rate 3 to 5 times the prevalence in our data set.
Conclusion: A simple EHR model, based on diagnoses, can identify high-risk individuals for PDAC up to one year in advance. This inexpensive, systematic approach may serve as the first sieve for selection of individuals for PDAC screening programs.
© 2020 Elsevier Ltd. All rights reserved.
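The enrichment reported in the Results can be roughly checked from the counts quoted in the abstract. The following is a minimal sketch, not the authors' calculation: the variable names and the `rate` helper are illustrative, and the baseline prevalence is an assumption, approximated here from the external validation counts (408 cases, 160,185 controls) because the abstract does not state how many cancer patients were among the 156,485 individuals scored at the 365-day lead time.

```python
# Counts quoted in the abstract (LR model, 365-day lead time).
FLAGGED_HIGH_RISK = 2692        # individuals flagged as high risk
CANCERS_AMONG_FLAGGED = 25      # cancer patients among those flagged
VALIDATION_CASES = 408          # cases in the external validation database
VALIDATION_CONTROLS = 160_185   # controls in the external validation database


def rate(events: int, total: int) -> float:
    """Return events / total, rejecting a non-positive denominator."""
    if total <= 0:
        raise ValueError("total must be positive")
    return events / total


# Cancer rate inside the flagged (high-risk) group.
flagged_rate = rate(CANCERS_AMONG_FLAGGED, FLAGGED_HIGH_RISK)

# ASSUMPTION: baseline prevalence is approximated from the validation
# case/control counts; the abstract does not give the exact denominator
# the authors used for their "3 to 5 times" figure.
baseline_prevalence = rate(
    VALIDATION_CASES, VALIDATION_CASES + VALIDATION_CONTROLS
)

# How many times more common cancer is among flagged individuals.
enrichment = flagged_rate / baseline_prevalence

print(f"Cancer rate among flagged: {flagged_rate:.4%}")
print(f"Approximate baseline prevalence: {baseline_prevalence:.4%}")
print(f"Approximate enrichment: {enrichment:.2f}x")
```

Under this assumption the enrichment falls between 3 and 5, which is consistent with the range the abstract reports. The figure depends on the denominator chosen, so it should be read as a plausibility check only.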
