Article

Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models

Journal

APPLIED SCIENCES-BASEL
Volume 11, Issue 5, Article 1974

Publisher

MDPI
DOI: 10.3390/app11051974

Keywords

cross-lingual; pretraining; language model; transfer learning; deep learning; RoBERTa

Funding

  1. Institute for Information and Communications Technology Planning and Evaluation (IITP) - Korea government (MSIT) [2020-0-00368]
  2. MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program [IITP-2020-2018-0-01405]
  3. Institute for Information & Communication Technology Planning & Evaluation (IITP), Republic of Korea [2020-0-00368-002]

This study improves data efficiency by pretraining language models on high-resource languages and treating language modeling of low-resource languages as a domain adaptation task. By selectively reusing parameters from the high-resource language model and post-training them while learning language-specific parameters for the low-resource language, the method outperforms monolingual training in both intrinsic and extrinsic evaluations.

Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Although language modeling is unsupervised, and collecting data for it is therefore relatively inexpensive, it remains challenging for languages with limited resources. This results in a great technological disparity between high- and low-resource languages across numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. We achieve this by formulating language modeling of low-resource languages as a domain adaptation task, using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at the sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves data efficiency by a factor of up to 32 compared to monolingual training.
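As a rough illustration of the idea described in the abstract, the sketch below shows one way to reuse the transformer body of an English RoBERTa while learning new Korean lexical parameters, using the HuggingFace transformers and PyTorch APIs. It is a minimal sketch, not the paper's implementation: the Korean tokenizer path, the learning rates, the training loop, and the single copied layer standing in for the proposed implicit translation layers are all illustrative assumptions.

```python
import copy

import torch
from transformers import AutoTokenizer, RobertaForMaskedLM

# Start from a model pretrained on the high-resource language (English).
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Load a tokenizer for the low-resource language (Korean).
# "path/to/korean-tokenizer" is a placeholder for any Korean subword tokenizer.
tokenizer = AutoTokenizer.from_pretrained("path/to/korean-tokenizer")

# Replace the lexical (language-specific) parameters: resize the embedding
# matrix to the Korean vocabulary and re-initialize it from scratch.
model.resize_token_embeddings(len(tokenizer))
model.roberta.embeddings.word_embeddings.weight.data.normal_(mean=0.0, std=0.02)

# Rough stand-in for the paper's implicit translation layers: insert one extra
# transformer layer at the bottom of the encoder and train it from its copied
# initialization (the paper's actual design and placement may differ).
extra_layer = copy.deepcopy(model.roberta.encoder.layer[0])
model.roberta.encoder.layer.insert(0, extra_layer)
model.config.num_hidden_layers += 1

# Selectively reuse the pretrained body: post-train it with a small learning
# rate, while the freshly initialized language-specific parameters get a
# larger one.
new_params = list(model.roberta.embeddings.parameters()) + list(extra_layer.parameters())
new_ids = {id(p) for p in new_params}
reused_params = [p for p in model.parameters() if id(p) not in new_ids]

optimizer = torch.optim.AdamW(
    [
        {"params": new_params, "lr": 5e-4},     # learned from scratch
        {"params": reused_params, "lr": 5e-5},  # reused and gently post-trained
    ]
)

def mlm_step(batch):
    """One masked-language-model update on a batch from a Korean corpus."""
    loss = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Separating the parameter groups this way is just one plausible reading of "selectively reusing" the high-resource model; the paper may instead freeze, stage, or schedule the reused parameters differently.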
