Article

A compressed large language model embedding dataset of ICD 10 CM descriptions

BMC BIOINFORMATICS (2023)

Journal

BMC BIOINFORMATICS

Volume 24, Issue 1, Pages -

Publisher

BMC

DOI: 10.1186/s12859-023-05597-2

Keywords

Large language model; Autoencoder; ICD-10-CM; Electronic health records; EHR; NLP

Categories

Biochemical Research Methods; Biotechnology & Applied Microbiology; Mathematical & Computational Biology

Summary

This paper presents novel datasets of numerical representations for ICD-10-CM codes, generated using a large language model and an autoencoder. The datasets can enable more advanced analyses in the biomedical domain and have the potential to significantly improve the utility of ICD-10-CM codes.

Abstract

This paper presents novel datasets providing numerical representations of ICD-10-CM codes by generating description embeddings using a large language model, followed by dimension reduction via an autoencoder. The embeddings serve as informative input features for machine learning models by capturing relationships among categories and preserving inherent context information. The model generating the data was validated in two ways: first, the dimension reduction was validated using an autoencoder; second, a supervised model was created to estimate the ICD-10-CM hierarchical categories. Results show that the data can be reduced to as few as 10 dimensions while maintaining the ability to reproduce the original embeddings, with fidelity decreasing as the dimension of the reduced representation decreases. Multiple compression levels are provided, so users can choose the level that fits their requirements, then download and use the data without any further setup. The readily available datasets of ICD-10-CM codes are anticipated to be highly valuable for researchers in biomedical informatics, enabling more advanced analyses in the field. This approach has the potential to significantly improve the utility of ICD-10-CM codes in the biomedical domain.
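The trade-off between compression level and reconstruction fidelity described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the embeddings are synthetic (random low-rank data plus noise, not real ICD-10-CM description embeddings), and a rank-k linear projection computed by truncated SVD (PCA) stands in for the trained autoencoder, since it is the best linear encoder/decoder pair under squared error. All function names, sizes, and the list of bottleneck dimensions are assumptions made for illustration.

```python
import numpy as np


def make_synthetic_embeddings(n_codes=200, dim=64, latent_dim=8, noise=0.05, seed=0):
    """Hypothetical stand-in for LLM description embeddings: low-rank signal plus noise."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_codes, latent_dim))
    mixing = rng.normal(size=(latent_dim, dim))
    return latent @ mixing + noise * rng.normal(size=(n_codes, dim))


def fit_linear_compressor(X, k):
    """Fit a rank-k linear encoder/decoder (PCA via SVD); returns (mean, components)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def encode(X, mean, components):
    """Project embeddings onto the k retained directions (the compressed representation)."""
    return (X - mean) @ components.T


def decode(Z, mean, components):
    """Map compressed vectors back to the original embedding space."""
    return Z @ components + mean


def reconstruction_mse(X, k):
    """Mean squared error between X and its rank-k reconstruction."""
    mean, components = fit_linear_compressor(X, k)
    X_hat = decode(encode(X, mean, components), mean, components)
    return float(np.mean((X - X_hat) ** 2))


X = make_synthetic_embeddings()
# Bottleneck sizes are illustrative only; they are not the levels published with the paper.
errors = {k: reconstruction_mse(X, k) for k in (2, 4, 8, 16, 32)}
for k, err in errors.items():
    print(f"k={k:>2}  reconstruction MSE={err:.5f}")
```

In this sketch the reconstruction error never increases as the bottleneck dimension k grows, which mirrors the qualitative behaviour the abstract reports. A nonlinear autoencoder trained on real embeddings would give different numbers and would need a held-out set to measure fidelity fairly.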
