Domain specific word embeddings for natural language processing in radiology

JOURNAL OF BIOMEDICAL INFORMATICS (2021)

Journal

JOURNAL OF BIOMEDICAL INFORMATICS

Volume 113

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE

DOI: 10.1016/j.jbi.2020.103665

Keywords

Natural language processing; Word embeddings; Analogy completion; Multi-label classification

Funding

National Institute of Biomedical Imaging and Bioengineering (NIBIB) [5T32EB001631]

Automated Summary

This study used Radiopaedia as a general radiology corpus to train radiology-specific word embeddings and showed that they can improve performance on NLP tasks involving radiological text.

Abstract

Background: There has been increasing interest in machine-learning-based natural language processing (NLP) methods in radiology; however, models have often used word embeddings trained on general web corpora because no radiology-specific corpus has been available.

Purpose: We examined the potential of Radiopaedia to serve as a general radiology corpus for producing radiology-specific word embeddings that could enhance performance on an NLP task on radiological text.

Materials and methods: Embeddings of dimension 50, 100, 200, and 300 were trained on articles collected from Radiopaedia using the GloVe algorithm and evaluated on analogy completion. A shallow neural network using input from either our trained embeddings or pre-trained Wikipedia 2014 + Gigaword 5 (WG) embeddings was used to label the Radiopaedia articles. Labeling performance was evaluated based on exact match accuracy and Hamming loss. McNemar's test with continuity correction and Benjamini-Hochberg correction, and a 5x2 cross-validation paired two-tailed t-test, were used to assess statistical significance.

Results: For accuracy in the analogy task, 50-dimensional (50-D) Radiopaedia embeddings outperformed WG embeddings on tumor origin analogies (p < 0.05) and organ adjectives (p < 0.01), whereas WG embeddings tended to outperform on inflammation location and bone vs. muscle analogies (p < 0.01). The two embeddings had comparable performance on other subcategories. In the labeling task, the Radiopaedia-based model outperformed the WG-based model at 50, 100, 200, and 300-D for exact match accuracy (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively) and Hamming loss (p < 0.001, p < 0.001, p < 0.01, and p < 0.05, respectively).

Conclusion: We have developed a set of word embeddings from Radiopaedia and shown that they can preserve relevant medical semantics and augment performance on a radiology NLP task. Our results suggest that the cultivation of a radiology-specific corpus can benefit radiology NLP models in the future.
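
The evaluation measures named in the abstract can be illustrated with a short sketch. The Python below is not the authors' code: the function names, the toy label matrices, and the discordant-pair counts are all assumptions made up for illustration. It shows exact match accuracy and Hamming loss for multi-label predictions, plus McNemar's test with continuity correction for comparing two models on the same samples. The Benjamini-Hochberg correction and the 5x2 cross-validation t-test are not covered.

```python
import math
from typing import Sequence, Tuple


def exact_match_accuracy(y_true: Sequence[Sequence[int]],
                         y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of samples whose full label vector is predicted exactly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return hits / len(y_true)


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = 0
    total = 0
    for t, p in zip(y_true, y_pred):
        if len(t) != len(p):
            raise ValueError("label vectors must have equal length")
        wrong += sum(1 for a, b in zip(t, p) if a != b)
        total += len(t)
    return wrong / total


def mcnemar_continuity(b: int, c: int) -> Tuple[float, float]:
    """McNemar's test with continuity correction.

    b: samples model A got right and model B got wrong.
    c: samples model A got wrong and model B got right.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data (hypothetical): 4 articles, 3 possible labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]

print(exact_match_accuracy(y_true, y_pred))  # 0.5
print(hamming_loss(y_true, y_pred))          # 0.25
print(mcnemar_continuity(10, 2))             # hypothetical discordant counts
```

Exact match accuracy is the stricter of the two measures, since a single wrong label makes the whole sample count as a miss, while Hamming loss gives partial credit per label. That is why the abstract reports both.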
