Journal
APPLIED SCIENCES-BASEL
Volume 11, Issue 5
Publisher
MDPI
DOI: 10.3390/app11051974
Keywords
cross-lingual; pretraining; language model; transfer learning; deep learning; RoBERTa
Funding
- Institute for Information and Communications Technology Planning and Evaluation (IITP) - Korea government (MSIT) [2020-0-00368]
- MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program [IITP-2020-2018-0-01405]
- Institute for Information & Communication Technology Planning & Evaluation (IITP), Republic of Korea [2020-0-00368-002] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)
This study improves data efficiency by pretraining language models on high-resource languages and treating language modeling of low-resource languages as a domain adaptation task. By selectively reusing parameters from high-resource language models and post-training them while learning language-specific parameters for the low-resource language, the method outperforms monolingual training in both intrinsic and extrinsic evaluations.
Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Although language modeling is unsupervised, and collecting data for it is therefore relatively inexpensive, it remains a challenging process for languages with limited resources. This results in a great technological disparity between high- and low-resource languages across numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. We achieve this by formulating language modeling of low-resource languages as a domain adaptation task, using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at the sequence level. To evaluate our method, we post-train a RoBERTa model pretrained on English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves data efficiency by a factor of up to 32 compared to monolingual training.
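The core idea in the abstract — reuse the transformer body of a high-resource-language model while learning new language-specific parameters (embeddings) and a small translation layer for the low-resource language — can be sketched as follows. This is a minimal toy sketch in PyTorch; the class names, dimensions, and the single-linear stand-in for the "implicit translation layer" are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a transformer LM pretrained on a high-resource language."""
    def __init__(self, vocab_size, d_model=32, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ids):
        return self.encoder(self.embed(ids))

def adapt_to_low_resource(src_model, tgt_vocab_size, d_model=32):
    """Reuse the source encoder body; learn target-language embeddings and a
    sequence-level adapter (a rough stand-in for an implicit translation
    layer) from scratch."""
    tgt_model = TinyEncoder(tgt_vocab_size, d_model)
    # Selectively reuse: copy the encoder-body parameters from the source.
    tgt_model.encoder.load_state_dict(src_model.encoder.state_dict())
    # Language-specific parameters (the new embeddings) stay randomly
    # initialized. Freeze the reused body initially; it can be unfrozen and
    # post-trained once the new embeddings have settled.
    for p in tgt_model.encoder.parameters():
        p.requires_grad = False
    adapter = nn.Linear(d_model, d_model)  # illustrative translation layer
    return tgt_model, adapter

# Toy vocabulary sizes; in practice these would be the RoBERTa vocabulary
# and a Korean tokenizer's vocabulary.
src = TinyEncoder(vocab_size=100)
tgt, adapter = adapt_to_low_resource(src, tgt_vocab_size=200)
```

In a realistic setup one would start from actual pretrained RoBERTa weights, train the new embeddings and adapter first, and then post-train the reused body on the low-resource corpus, which is what makes the approach far more data-efficient than training monolingually from scratch.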