Article

Diagnostic accuracy of a large language model in rheumatology: comparison of physician and ChatGPT-4

Journal

RHEUMATOLOGY INTERNATIONAL
Volume -, Issue -, Pages -

Publisher

SPRINGER HEIDELBERG
DOI: 10.1007/s00296-023-05464-6

Keywords

Large language models; ChatGPT; Rheumatology; Triage; Diagnostic process; Artificial intelligence


Abstract

Pre-clinical studies suggest that large language models (e.g., ChatGPT) could be used in the diagnostic process to distinguish inflammatory rheumatic diseases (IRD) from other diseases. We therefore aimed to assess the diagnostic accuracy of ChatGPT-4 in comparison to rheumatologists. For the analysis, the data set of Graf et al. (2022) was used: previous patient assessments were analyzed using ChatGPT-4 and compared to the rheumatologists' assessments. ChatGPT-4 listed the correct diagnosis as the top diagnosis comparably often to rheumatologists (35% vs 39%, p = 0.30), as well as among the top 3 diagnoses (60% vs 55%, p = 0.38). In IRD-positive cases, ChatGPT-4 provided the correct top diagnosis in 71% vs 62% for the rheumatologists, and the correct diagnosis was among the top 3 in 86% (ChatGPT-4) vs 74% (rheumatologists). In non-IRD cases, ChatGPT-4 provided the correct top diagnosis in 15% vs 27% for the rheumatologists; the correct diagnosis was among the top 3 in 46% of the ChatGPT-4 group vs 45% of the rheumatologists' group. If only the first diagnostic suggestion was considered, ChatGPT-4 correctly classified 58% of cases as IRD, compared to 56% for the rheumatologists (p = 0.52). ChatGPT-4 showed slightly higher accuracy for the top 3 overall diagnoses than the rheumatologists' assessment. ChatGPT-4 was able to provide the correct differential diagnosis in a relevant number of cases and achieved better sensitivity for detecting IRDs than rheumatologists, at the cost of lower specificity. These pilot results highlight the potential of this new technology as a triage tool for the diagnosis of IRD.
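The sensitivity/specificity trade-off described in the abstract can be made concrete with a small sketch. This is not the study's code, and the labels below are invented toy data for illustration only; it simply shows how sensitivity (share of IRD cases correctly flagged) and specificity (share of non-IRD cases correctly excluded) are computed for a binary IRD / non-IRD triage decision.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity and specificity for a binary classifier.

    y_true, y_pred: sequences of booleans (True = IRD).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)          # IRD correctly flagged
    tn = sum(not t and not p for t, p in pairs)  # non-IRD correctly excluded
    fp = sum(not t and p for t, p in pairs)      # non-IRD wrongly flagged as IRD
    fn = sum(t and not p for t, p in pairs)      # IRD missed
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    return sensitivity, specificity

# Hypothetical toy data: 4 IRD cases, 4 non-IRD cases (not from the study)
y_true = [True, True, True, True, False, False, False, False]
y_pred = [True, True, True, False, False, False, True, True]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# sensitivity=0.75, specificity=0.50
```

A triage tool with the profile the abstract describes, higher sensitivity but lower specificity, would correspond to few false negatives at the cost of more false positives in this confusion-matrix view.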

