Article

Large language models encode clinical knowledge

Journal

NATURE
Volume 620, Issue 7972, Pages 172+

Publisher

NATURE PORTFOLIO
DOI: 10.1038/s41586-023-06291-2

Keywords

-

Abstract

This paper introduces a multi-domain benchmark for medical question answering, which evaluates the performance of models in terms of factuality, comprehension, reasoning, possible harm, and bias through human evaluation. In addition, it proposes instruction prompt tuning to align language models to new domains. The experimental results suggest the potential value of model scale and instruction prompt tuning in improving comprehension, knowledge recall, and reasoning abilities. The human evaluations reveal the limitations of current models and emphasize the importance of evaluation frameworks and method development in creating safe and helpful large language models for clinical applications.
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model(1) (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM(2), on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA(3), MedMCQA(4), PubMedQA(5) and Measuring Massive Multitask Language Understanding (MMLU) clinical topics(6)), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
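
The abstract describes instruction prompt tuning only at a high level: a small set of soft prompt vectors is learned from a handful of clinician-curated exemplars while the underlying LLM stays frozen. The following is a minimal sketch of that general idea in PyTorch, assuming a hypothetical frozen causal language model that accepts precomputed embeddings through a Hugging Face-style inputs_embeds argument; the class name, tensor shapes and hyperparameters are illustrative and do not reproduce the paper's Flan-PaLM/Med-PaLM implementation.

import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepends a small block of learnable 'soft prompt' embeddings to the
    input while keeping the pretrained LLM frozen, so only
    num_prompt_tokens x embed_dim parameters are trained."""

    def __init__(self, base_model: nn.Module, num_prompt_tokens: int = 40,
                 embed_dim: int = 4096):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # the LLM itself is never updated
        # The only trainable parameters: one vector per soft-prompt token.
        self.soft_prompt = nn.Parameter(
            0.02 * torch.randn(num_prompt_tokens, embed_dim))

    def forward(self, input_embeds: torch.Tensor):
        # input_embeds: (batch, seq_len, embed_dim) embeddings of the
        # instruction, exemplars and question, produced by the frozen embedder.
        batch_size = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        extended = torch.cat([prompt, input_embeds], dim=1)
        # Assumes the wrapped model accepts precomputed embeddings via
        # `inputs_embeds`, as Hugging Face causal-LM models do.
        return self.base_model(inputs_embeds=extended)

Because only soft_prompt requires gradients, training reduces to optimizing that single tensor (for example, torch.optim.Adam([wrapper.soft_prompt], lr=1e-3)), which is what makes this style of domain alignment parameter-efficient relative to full fine-tuning.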
