Development and validation of a pancreatic cancer risk model for the general population using electronic health records: An observational study

EUROPEAN JOURNAL OF CANCER (2021)

Journal

EUROPEAN JOURNAL OF CANCER

Volume 143, Pages 19-30

Publisher

ELSEVIER SCI LTD

DOI: 10.1016/j.ejca.2020.10.019

Keywords

Pancreatic carcinoma; Adenocarcinoma; Electronic health records; Logistic regression models; AUC

Automated Summary

This study used electronic health record databases and machine learning models to identify individuals at high risk of pancreatic ductal adenocarcinoma (PDAC). The logistic regression (LR) model performed best and could identify high-risk patients 365 days before diagnosis. In risk stratification, the cancer rate in the high-risk group was 3 to 5 times the prevalence in the data set.

Abstract

Aim: Pancreatic ductal adenocarcinoma (PDAC) is often diagnosed at a late, incurable stage. We sought to determine whether individuals at high risk of developing PDAC could be identified early using routinely collected data.

Methods: Electronic health record (EHR) databases from two independent hospitals in Boston, Massachusetts, providing inpatient, outpatient, and emergency care, from 1979 through 2017, were used with case-control matching. PDAC cases were selected using International Classification of Diseases 9/10 codes and validated with tumour registries. A data-driven feature selection approach was used to develop neural networks and L2-regularised logistic regression (LR) models on training data (594 cases, 100,787 controls) and compared with a published model based on hand-selected diagnoses ('baseline'). Model performance was validated on an external database (408 cases, 160,185 controls). Three prediction lead times (180, 270 and 365 days) were considered.

Results: The LR model had the best performance, with an area under the curve (AUC) of 0.71 (confidence interval [CI]: 0.67-0.76) for the training set, and AUC 0.68 (CI: 0.65-0.71) for the validation set, 365 days before diagnosis. Data-driven feature selection improved results over 'baseline' (AUC = 0.55; CI: 0.52-0.58). The LR model flags 2692 (CI: 2592-2791) of 156,485 as high risk, 365 days in advance, identifying 25 (CI: 16-36) cancer patients. Risk stratification showed that the high-risk group presented a cancer rate 3 to 5 times the prevalence in our data set.

Conclusion: A simple EHR model, based on diagnoses, can identify high-risk individuals for PDAC up to one year in advance. This inexpensive, systematic approach may serve as the first sieve for selection of individuals for PDAC screening programs.

© 2020 Elsevier Ltd. All rights reserved.
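The risk-stratification figures in the Results can be checked with simple arithmetic. The sketch below is illustrative only and is not the authors' code. It uses the counts quoted in the abstract and makes one assumption: the baseline prevalence is approximated from the full validation database (408 cases, 160,185 controls), because the abstract does not state how many cases were among the 156,485 individuals scored at the 365-day lead time. All variable names are hypothetical.

```python
# Counts quoted in the abstract (LR model, 365-day lead time).
flagged = 2692          # individuals flagged as high risk
screened = 156_485      # individuals scored by the model
cases_in_flagged = 25   # cancer patients among those flagged

# ASSUMPTION: baseline prevalence is taken from the full validation
# database, since the abstract gives no case count for the 156,485 scored.
val_cases, val_controls = 408, 160_185

flag_rate = flagged / screened                  # share of people flagged
rate_in_flagged = cases_in_flagged / flagged    # cancer rate in high-risk group
baseline_prevalence = val_cases / (val_cases + val_controls)
enrichment = rate_in_flagged / baseline_prevalence

print(f"share flagged as high risk:   {flag_rate:.2%}")
print(f"cancer rate in flagged group: {rate_in_flagged:.2%}")
print(f"approx. baseline prevalence:  {baseline_prevalence:.2%}")
print(f"approx. enrichment factor:    {enrichment:.1f}x")
```

Under this assumption the enrichment factor falls inside the 3 to 5 times range reported in the abstract. The point estimate of 25 identified patients carries a wide interval (16-36), so the factor is itself only approximate.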
