Article

Scalable Feature Engineering from Electronic Free Text Notes to Supplement Confounding Adjustment of Claims-Based Pharmacoepidemiologic Studies

Journal

CLINICAL PHARMACOLOGY & THERAPEUTICS
Volume 113, Issue 4, Pages 832-838

Publisher

WILEY
DOI: 10.1002/cpt.2826

Natural language processing (NLP) tools convert free-text notes (FTNs) from electronic health records (EHRs) into data features that can supplement confounding adjustment in pharmacoepidemiologic studies. In this study, unsupervised NLP was used to generate high-dimensional feature spaces from FTNs. Compared with claims-based analyses, the NLP-generated features improved prediction of treatment choice but added little or nothing to prediction of clinical outcomes. These findings have implications for strategies to improve confounding adjustment in pharmacoepidemiologic studies using EHR data.
Natural language processing (NLP) tools turn free-text notes (FTNs) from electronic health records (EHRs) into data features that can supplement confounding adjustment in pharmacoepidemiologic studies. However, current applications are difficult to scale. We used unsupervised NLP to generate high-dimensional feature spaces from FTNs to improve prediction of drug exposure and outcomes compared with claims-based analyses. We linked Medicare claims with EHR data to generate three cohort studies comparing different classes of medications on the risk of various clinical outcomes. We used bag-of-words to generate features for the 20,000 most prevalent terms from FTNs. We compared machine learning (ML) prediction algorithms using four sets of candidate predictors: Set1 (39 researcher-specified variables), Set2 (Set1 + ML-selected claims codes), Set3 (Set1 + ML-selected NLP-generated features), and Set4 (Set1 + Set2 + Set3 combined). When modeling treatment choice, we observed a consistent pattern across the examples: ML models utilizing Set4 performed best, followed by Set2, Set3, and then Set1. When modeling the outcome risk, there was little to no improvement beyond models based on Set1. Supplementing claims data with NLP-generated features from free-text notes improved prediction of prescribing choices but yielded little or no improvement in clinical risk prediction. These findings have implications for strategies to improve confounding adjustment using EHR data in pharmacoepidemiologic studies.
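
The bag-of-words step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (tokenize, top_terms, bag_of_words), the regex tokenizer, the toy notes, and the choice to rank terms by the number of patients whose notes contain them are all assumptions made for illustration. The study kept the 20,000 most prevalent terms; the toy example keeps 3.

```python
import re
from collections import Counter


def tokenize(note):
    """Lowercase a note and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", note.lower())


def top_terms(notes_by_patient, k):
    """Return the k terms that appear in the notes of the most patients."""
    patient_freq = Counter()
    for notes in notes_by_patient.values():
        terms = set()
        for note in notes:
            terms.update(tokenize(note))
        # Count each term at most once per patient.
        patient_freq.update(terms)
    # Rank by descending prevalence; break ties alphabetically for determinism.
    ranked = sorted(patient_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def bag_of_words(notes_by_patient, vocabulary):
    """Build one term-count vector per patient, aligned with the vocabulary."""
    features = {}
    for patient_id, notes in notes_by_patient.items():
        counts = Counter(tok for note in notes for tok in tokenize(note))
        features[patient_id] = [counts[term] for term in vocabulary]
    return features


# Toy, invented notes keyed by hypothetical patient IDs.
notes = {
    "p1": ["Patient reports chest pain.", "Chest pain resolved."],
    "p2": ["No chest pain. Mild edema noted."],
    "p3": ["Edema improved; patient stable."],
}

vocab = top_terms(notes, k=3)   # the study used k = 20,000
X = bag_of_words(notes, vocab)
print(vocab)
print(X)
```

In a setup like the one the abstract describes, vectors of this kind would serve as the pool of NLP-generated candidate features from which an ML procedure selects predictors (Set3), alongside the researcher-specified variables (Set1).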
