Article

Assessment of Natural Language Processing of Electronic Health Records to Measure Goals-of-Care Discussions as a Clinical Trial Outcome

Journal

JAMA Network Open
Volume 6, Issue 3, Pages: -

Publisher

American Medical Association
DOI: 10.1001/jamanetworkopen.2023.1204

Keywords

-


This study evaluates the use of natural language processing (NLP) to measure outcomes in a randomized clinical trial of a communication intervention for adults with serious illness. The findings suggest that NLP can measure trial outcomes effectively while saving substantial resources compared with manual data collection, and that incorporating misclassification-adjusted power calculations into the design of studies using NLP may be beneficial.
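As a rough sketch of what a misclassification-adjusted power calculation involves (standard algebra for non-differential misclassification of a binary outcome, not a formula quoted from the paper): a classifier with sensitivity $Se$ and specificity $Sp$ observes a true outcome prevalence $p$ as

$$\tilde{p} = Se \cdot p + (1 - Sp)(1 - p),$$

so a true risk difference $\Delta$ between trial arms is attenuated to

$$\tilde{\Delta} = (Se + Sp - 1)\,\Delta.$$

Substituting $\tilde{p}$ and $\tilde{\Delta}$ into a conventional two-proportion sample-size formula gives the misclassification-adjusted detectable effect; perfect measurement ($Se = Sp = 1$) recovers the unadjusted calculation.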
This diagnostic study evaluates the performance, feasibility, and power implications of using natural language processing to measure outcomes in a randomized clinical trial of a communication intervention among adults with serious illness.

Key Points

Question: Can natural language processing (NLP) be used to measure clinical trial outcomes?

Findings: In this diagnostic study evaluating the performance, feasibility, and power implications of using deep-learning NLP to measure the outcome of documented goals-of-care discussions in a 2512-patient pragmatic trial, NLP-screened human abstraction measured the outcome with 92.6% sensitivity, substantial savings in abstractor-hours, and minimal loss of power compared with manual abstraction.

Meaning: The findings suggest that NLP may facilitate measurement of previously inaccessible outcomes in clinical trials and that incorporating misclassification-adjusted power calculations into the design of studies using NLP may be beneficial.

Abstract

Importance: Many clinical trial outcomes are documented in free-text electronic health records (EHRs), making manual data collection costly and infeasible at scale. Natural language processing (NLP) is a promising approach for measuring such outcomes efficiently, but ignoring NLP-related misclassification may lead to underpowered studies.

Objective: To evaluate the performance, feasibility, and power implications of using NLP to measure the primary outcome of EHR-documented goals-of-care discussions in a pragmatic randomized clinical trial of a communication intervention.

Design, Setting, and Participants: This diagnostic study compared 3 approaches to measuring EHR-documented goals-of-care discussions: (1) deep-learning NLP, (2) NLP-screened human abstraction (manual verification of NLP-positive records), and (3) conventional manual abstraction. The study included hospitalized patients aged 55 years or older with serious illness enrolled between April 23, 2020, and March 26, 2021, in a pragmatic randomized clinical trial of a communication intervention in a multihospital US academic health system.

Main Outcomes and Measures: Main outcomes were NLP performance characteristics, human abstractor-hours, and misclassification-adjusted statistical power of methods of measuring clinician-documented goals-of-care discussions. NLP performance was evaluated with receiver operating characteristic (ROC) curves and precision-recall (PR) analyses, and the effects of misclassification on power were examined using mathematical substitution and Monte Carlo simulation.

Results: A total of 2512 trial participants (mean [SD] age, 71.7 [10.8] years; 1456 [58%] female) amassed 44,324 clinical notes during 30-day follow-up. In a validation sample of 159 participants, deep-learning NLP trained on a separate training data set identified patients with documented goals-of-care discussions with moderate accuracy (maximal F1 score, 0.82; area under the ROC curve, 0.924; area under the PR curve, 0.879). Manual abstraction of the outcome from the trial data set would require an estimated 2000 abstractor-hours and would power the trial to detect a risk difference of 5.4% (assuming 33.5% control-arm prevalence, 80% power, and 2-sided α = .05). Measuring the outcome by NLP alone would power the trial to detect a risk difference of 7.6%. Measuring the outcome by NLP-screened human abstraction would require 34.3 abstractor-hours to achieve an estimated sensitivity of 92.6% and would power the trial to detect a risk difference of 5.7%. Monte Carlo simulations corroborated the misclassification-adjusted power calculations.

Conclusions and Relevance: In this diagnostic study, deep-learning NLP and NLP-screened human abstraction had favorable characteristics for measuring an EHR outcome at scale. Misclassification-adjusted power calculations accurately quantified the power loss from NLP-related misclassification, suggesting that incorporating this approach into the design of studies using NLP would be beneficial.
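To make the power argument concrete, here is a minimal Monte Carlo sketch in the spirit of the study's simulations. The 33.5% control-arm prevalence and 2512 participants (1256 per arm) are taken from the abstract; the sensitivity, specificity, and 6% risk difference are illustrative assumptions, not the trial's measured operating points.

```python
# Minimal Monte Carlo sketch of a misclassification-adjusted power check
# for a two-arm trial whose binary outcome is measured by an imperfect
# classifier. Prevalence and enrollment mirror the abstract; the error
# rates and effect size below are illustrative assumptions only.
import math
import random

def observed_prevalence(p, sens, spec):
    """Prevalence seen through a classifier with non-differential error."""
    return sens * p + (1.0 - spec) * (1.0 - p)

def simulated_power(p_control, risk_diff, sens, spec, n_per_arm,
                    n_sims=2000, seed=0):
    """Fraction of simulated trials in which a pooled two-proportion
    z-test on the classifier-measured outcome rejects at 2-sided .05."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided alpha = .05
    q0 = observed_prevalence(p_control, sens, spec)
    q1 = observed_prevalence(p_control + risk_diff, sens, spec)
    rejections = 0
    for _ in range(n_sims):
        x0 = sum(rng.random() < q0 for _ in range(n_per_arm))
        x1 = sum(rng.random() < q1 for _ in range(n_per_arm))
        pooled = (x0 + x1) / (2 * n_per_arm)
        se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
        if se > 0 and abs(x1 - x0) / n_per_arm / se > z_crit:
            rejections += 1
    return rejections / n_sims

if __name__ == "__main__":
    # Perfect measurement vs. a hypothetical imperfect classifier.
    for sens, spec in [(1.00, 1.00), (0.90, 0.95)]:
        power = simulated_power(p_control=0.335, risk_diff=0.06,
                                sens=sens, spec=spec, n_per_arm=1256)
        print(f"sens={sens:.2f} spec={spec:.2f} -> power ~ {power:.2f}")
```

Comparing the two runs shows how even modest classifier error attenuates the observed risk difference by the factor (Se + Sp − 1) and erodes power, which is exactly the loss that misclassification-adjusted power calculations are meant to quantify at the design stage.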

Authors

