Article

Performance of Generative Large Language Models on Ophthalmology Board-Style Questions

Journal

AMERICAN JOURNAL OF OPHTHALMOLOGY
Volume 254, Issue -, Pages 141-149

Publisher

ELSEVIER SCIENCE INC
DOI: 10.1016/j.ajo.2023.05.024

Keywords

-


This study investigates the ability of generative artificial intelligence models to answer ophthalmology board-style questions. The results show that ChatGPT-4.0 and Bing Chat perform comparably to human respondents but struggle with image interpretation and are prone to hallucinations and nonlogical reasoning.
• PURPOSE: To investigate the ability of generative artificial intelligence models to answer ophthalmology board-style questions.
• DESIGN: Experimental study.
• METHODS: This study evaluated 3 large language models (LLMs) with chat interfaces, Bing Chat (Microsoft) and ChatGPT 3.5 and 4.0 (OpenAI), using 250 questions from the Basic Science and Clinical Science Self-Assessment Program. Although ChatGPT is trained on information last updated in 2021, Bing Chat incorporates a more recently indexed internet search to generate its answers. Performance was compared with human respondents. Questions were categorized by complexity and patient care phase, and instances of information fabrication or nonlogical reasoning were documented.
• MAIN OUTCOME MEASURES: The primary outcome was response accuracy. Secondary outcomes were performance in question subcategories and hallucination frequency.
• RESULTS: Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), whereas ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled in workup-type questions (odds ratio [OR] 3.89, 95% CI 1.19-14.73, P = .03) compared with diagnostic questions, but struggled with image interpretation (OR 0.14, 95% CI 0.05-0.33, P < .01) compared with single-step reasoning questions. Against single-step questions, Bing Chat also faced difficulties with image interpretation (OR 0.18, 95% CI 0.08-0.44, P < .01) and multi-step reasoning (OR 0.30, 95% CI 0.11-0.84, P = .02). ChatGPT-3.5 had the highest rate of hallucinations and nonlogical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%).
• CONCLUSIONS: LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform similarly to human respondents in answering questions from the Basic Science and Clinical Science Self-Assessment Program. The frequency of hallucinations and nonlogical reasoning suggests room for improvement in the performance of conversational agents in the medical domain. (Am J Ophthalmol 2023;254:141-149. © 2023 Elsevier Inc. All rights reserved.)
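For readers unfamiliar with the statistic cited in the RESULTS bullet, the minimal Python sketch below shows how an odds ratio and a Wald 95% confidence interval can be derived from a 2x2 table of correct versus incorrect answers in two question categories. The counts used here are hypothetical placeholders, not the study's data, and the paper's own statistical modeling may differ.

import math

# Hypothetical illustration (not the study's data): odds ratio and Wald 95% CI
# from a 2x2 table of correct/incorrect answers in two question categories.
def odds_ratio_ci(a, b, c, d, z=1.96):
    """a, b: correct/incorrect in category 1; c, d: correct/incorrect in category 2."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of log(OR)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, (lower, upper)

# Made-up counts: 18/20 correct on workup questions vs 30/45 on diagnostic questions.
print(odds_ratio_ci(18, 2, 30, 15))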
