Article

MUSEDA: Multilingual Unsupervised and Supervised Embedding for Domain Adaption

Journal

KNOWLEDGE-BASED SYSTEMS
Volume 273

Publisher

ELSEVIER
DOI: 10.1016/j.knosys.2023.110560

Keywords

Multilingual word embeddings; Multilingual Natural Language Processing; Domain adaption; Isomorphism


Based on the assumption of isomorphism, approaches for generating high-quality, low-cost multilingual word embeddings (MLWEs) have been critical mechanisms for facilitating knowledge transfer between languages, notably between resource-rich and resource-lean languages. However, recent studies have discovered that the isomorphism assumption is not widely applicable, and that approaches based on this assumption face significant limitations, leading to stagnation in multilingual natural language processing (MNLP). Instead of pursuing higher-quality MLWEs, we propose MUSEDA (Multilingual Unsupervised and Supervised Embedding for Domain Adaption), a framework for building multilingual word embeddings for domain transfer learning that comprises pivot-feature (domain-keyword) mining and weighted embedding alignment in supervised or unsupervised settings. Without using any additional knowledge or parallel data, pivot words can be mined from the current data and introduced into the embedding alignment as a weight factor. We applied our framework to real-world tasks from different domains: sentiment classification, news categorization, and named entity recognition on resource-lean languages, including Arabic, Vietnamese, etc., and the experimental results demonstrate that the proposed method surpasses the baseline approaches. In addition, to alleviate insufficient corpora in certain domains of a single source language, we propose a many-to-many (M2M) mode that further enhances experimental performance by integrating multiple source languages into a large common space. (c) 2023 Elsevier B.V. All rights reserved.
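The abstract describes introducing mined pivot words into the embedding alignment as a weight factor. The paper's exact formulation is not given here, but a common supervised alignment technique that admits per-word weights is orthogonal Procrustes: each seed dictionary pair is scaled by its weight before solving for the rotation. The sketch below illustrates that idea; the function name `weighted_procrustes` and the use of plain Procrustes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def weighted_procrustes(X, Y, w):
    """Weighted orthogonal alignment of two embedding spaces.

    X, Y : (n, d) arrays of source/target embeddings for n seed
           dictionary pairs.
    w    : (n,) positive per-pair weights, e.g. larger weights for
           mined pivot / domain-keyword pairs (an assumption here).

    Solves min_W || diag(sqrt(w)) (X W - Y) ||_F  s.t. W orthogonal,
    via the SVD-based Procrustes solution on the re-weighted rows.
    """
    sw = np.sqrt(np.asarray(w, dtype=float))[:, None]
    # Scaling each row by sqrt(w_i) turns the weighted objective
    # into an ordinary Procrustes problem.
    U, _, Vt = np.linalg.svd((X * sw).T @ (Y * sw))
    return U @ Vt
```

With a mapping `W` in hand, a source-language embedding `x` is projected into the target space as `x @ W`; upweighting domain keywords biases the rotation toward aligning the vocabulary that matters for the downstream domain task.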
