Article

MUSEDA: Multilingual Unsupervised and Supervised Embedding for Domain Adaption

Journal

KNOWLEDGE-BASED SYSTEMS
Volume 273

Publisher

ELSEVIER
DOI: 10.1016/j.knosys.2023.110560

Keywords

Multilingual word embeddings; Multilingual Natural Language Processing; Domain adaption; Isomorphism


Based on the assumption of isomorphism, approaches for generating high-quality, low-cost multilingual word embeddings (MLWEs) have been critical mechanisms for facilitating knowledge transfer between languages, notably between resource-rich and resource-lean languages. However, recent studies have discovered that the isomorphism assumption is not widely applicable, and that approaches based on this assumption face significant limitations, leading to stagnation in multilingual natural language processing (MNLP). Instead of pursuing higher-quality MLWEs, we propose MUSEDA (Multilingual Unsupervised and Supervised Embedding for Domain Adaption), a framework for building multilingual word embeddings for domain transfer learning that comprises pivot-feature (domain-keyword) mining and weighted embedding alignment in supervised or unsupervised settings. Without using any additional knowledge or parallel data, pivot words can be mined from the current data and introduced into the embedding alignment as a weight factor. We applied our framework to real-world tasks from different domains: sentiment classification, news categorization, and named entity recognition on resource-lean languages, including Arabic, Vietnamese, etc., and the experimental results demonstrate that the proposed method surpasses the baseline approaches. In addition, to alleviate insufficient corpora in certain domains of a single source language, we propose a many-to-many (M2M) mode that further enhances experimental performance by integrating multiple source languages into a large common space. (c) 2023 Elsevier B.V. All rights reserved.
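The abstract describes introducing mined pivot words into the embedding alignment as a weight factor. The paper's exact formulation is not given here, but a common supervised alignment technique that admits per-word weights is orthogonal Procrustes: each seed dictionary pair is scaled by its weight before solving for the rotation. The sketch below illustrates that idea; the function name `weighted_procrustes` and the use of plain Procrustes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def weighted_procrustes(X, Y, w):
    """Weighted orthogonal alignment of two embedding spaces.

    X, Y : (n, d) arrays of source/target embeddings for n seed
           dictionary pairs.
    w    : (n,) positive per-pair weights, e.g. larger weights for
           mined pivot / domain-keyword pairs (an assumption here).

    Solves min_W || diag(sqrt(w)) (X W - Y) ||_F  s.t. W orthogonal,
    via the SVD-based Procrustes solution on the re-weighted rows.
    """
    sw = np.sqrt(np.asarray(w, dtype=float))[:, None]
    # Scaling each row by sqrt(w_i) turns the weighted objective
    # into an ordinary Procrustes problem.
    U, _, Vt = np.linalg.svd((X * sw).T @ (Y * sw))
    return U @ Vt
```

With a mapping `W` in hand, a source-language embedding `x` is projected into the target space as `x @ W`; upweighting domain keywords biases the rotation toward aligning the vocabulary that matters for the downstream domain task.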
