Journal
APPLIED SCIENCES-BASEL
Volume 11, Issue 5
Publisher
MDPI
DOI: 10.3390/app11051974
Keywords
cross-lingual; pretraining; language model; transfer learning; deep learning; RoBERTa
Funding
- Institute for Information and Communications Technology Planning and Evaluation (IITP) - Korea government (MSIT) [2020-0-00368]
- MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program [IITP-2020-2018-0-01405]
- Institute for Information & Communication Technology Planning & Evaluation (IITP), Republic of Korea [2020-0-00368-002] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)
This study improves data efficiency by pretraining language models on high-resource languages and treating language modeling of low-resource languages as a domain adaptation task. By selectively reusing parameters from high-resource language models and post-training them while learning language-specific parameters for the low-resource language, the method outperforms monolingual training in both intrinsic and extrinsic evaluations.
Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Although language modeling is unsupervised, and collecting data for it is therefore relatively inexpensive, it remains a challenging process for languages with limited resources. This results in a great technological disparity between high- and low-resource languages across numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. We achieve this by formulating language modeling of low-resource languages as a domain adaptation task, using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at the sequence level. To evaluate our method, we post-train a RoBERTa model pretrained on English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves data efficiency by a factor of up to 32 compared to monolingual training.
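The core idea in the abstract — reuse the transformer body of a high-resource-language model while learning new language-specific parameters (embeddings) and a small translation layer for the low-resource language — can be sketched as follows. This is a minimal toy sketch in PyTorch; the class names, dimensions, and the single-linear stand-in for the "implicit translation layer" are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a transformer LM pretrained on a high-resource language."""
    def __init__(self, vocab_size, d_model=32, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ids):
        return self.encoder(self.embed(ids))

def adapt_to_low_resource(src_model, tgt_vocab_size, d_model=32):
    """Reuse the source encoder body; learn target-language embeddings and a
    sequence-level adapter (a rough stand-in for an implicit translation
    layer) from scratch."""
    tgt_model = TinyEncoder(tgt_vocab_size, d_model)
    # Selectively reuse: copy the encoder-body parameters from the source.
    tgt_model.encoder.load_state_dict(src_model.encoder.state_dict())
    # Language-specific parameters (the new embeddings) stay randomly
    # initialized. Freeze the reused body initially; it can be unfrozen and
    # post-trained once the new embeddings have settled.
    for p in tgt_model.encoder.parameters():
        p.requires_grad = False
    adapter = nn.Linear(d_model, d_model)  # illustrative translation layer
    return tgt_model, adapter

# Toy vocabulary sizes; in practice these would be the RoBERTa vocabulary
# and a Korean tokenizer's vocabulary.
src = TinyEncoder(vocab_size=100)
tgt, adapter = adapt_to_low_resource(src, tgt_vocab_size=200)
```

In a realistic setup one would start from actual pretrained RoBERTa weights, train the new embeddings and adapter first, and then post-train the reused body on the low-resource corpus, which is what makes the approach far more data-efficient than training monolingually from scratch.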