Article

Localizing in-domain adaptation of transformer-based biomedical language models

Journal

Journal of Biomedical Informatics
Volume 144, Article 104431

Publisher

Academic Press Inc. (Elsevier Science)
DOI: 10.1016/j.jbi.2023.104431

Keywords

Natural language processing; Deep learning; Language model; Biomedical text mining; Transformer

Abstract

In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that models stemming from broad-coverage checkpoints can benefit greatly from additional training rounds over large-scale in-domain resources before task-specific fine-tuning. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. To reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but that concatenating high-quality data can improve model performance even with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the lessons learned from this study constitute valuable insights towards building biomedical language models that generalize to other less-resourced languages and domain settings.
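The in-domain adaptation the abstract refers to is the standard continued-pretraining recipe: take a broad-coverage checkpoint and resume its masked-language-modeling objective on a biomedical corpus before any task-specific fine-tuning. The sketch below illustrates this recipe with the Hugging Face transformers library; the starting checkpoint, corpus file name, and all hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of continued in-domain pretraining (masked language modeling)
# from a broad-coverage Italian checkpoint. Checkpoint name, corpus path, and
# hyperparameters are assumptions for illustration only.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed general-domain Italian starting point (not necessarily the paper's).
checkpoint = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# In-domain corpus, one document per line; the file name is hypothetical.
dataset = load_dataset("text", data_files={"train": "biomedical_it.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style MLM objective: randomly mask 15% of input tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-italian-biomedical",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

Under this recipe, the two approaches the paper contrasts differ only in what the training corpus contains: a large machine-translated collection of English biomedical text (quantity over quality) or a smaller corpus natively written in Italian (quality over quantity).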
