4.7 Article

MUSEDA: Multilingual Unsupervised and Supervised Embedding for Domain Adaption

Journal

KNOWLEDGE-BASED SYSTEMS
Volume 273, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.knosys.2023.110560

Keywords

Multilingual word embeddings; Multilingual Natural Language Processing; Domain adaption; Isomorphism

Abstract

Based on the assumption of isomorphism, approaches for generating high-quality, low-cost multilingual word embeddings (MLWEs) have been critical mechanisms for facilitating knowledge transfer between languages, notably between resource-rich and resource-lean languages. However, recent studies have discovered that the isomorphism assumption is not widely applicable and that approaches based on this assumption face significant limitations, leading to stagnation in multilingual natural language processing (MNLP). Instead of pursuing higher-quality MLWEs, we propose MUSEDA (Multilingual Unsupervised and Supervised Embedding for Domain Adaption), a framework for building multilingual word embeddings for domain transfer learning that includes pivot-feature (domain-keyword) mining and weighted embedding alignment in supervised or unsupervised settings. Without using any additional knowledge or parallel data, pivot words can be mined from the current data and introduced into the embedding alignment as a weight factor. We applied our framework to real-world tasks from different domains, namely sentiment classification, news categorization, and named entity recognition, on resource-lean languages including Arabic and Vietnamese, and the experimental results demonstrate that the proposed method surpasses the baseline approaches. In addition, to alleviate the shortage of in-domain corpora for a single source language, we propose a many-to-many (M2M) mode that further enhances performance by integrating multiple source languages into a large common space. (c) 2023 Elsevier B.V. All rights reserved.
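The abstract does not spell out the alignment procedure, but one common way to realize "weighted embedding alignment" is a weighted orthogonal Procrustes fit in which mined pivot (domain-keyword) pairs receive larger weights. The sketch below is a minimal illustration under that assumption, not the paper's exact method; the function name weighted_procrustes and the weighting scheme are hypothetical.

```python
# Hedged sketch: weighted orthogonal Procrustes alignment of two monolingual
# embedding spaces. Per-pair weights could come from pivot-word (domain-keyword)
# scores, as suggested by the abstract; the exact scheme here is an assumption.
import numpy as np

def weighted_procrustes(X, Y, w):
    """Find an orthogonal map W minimizing sum_i w_i * ||x_i W - y_i||^2.

    X: (n, d) source-language embeddings for dictionary pairs
    Y: (n, d) target-language embeddings for the same pairs
    w: (n,) non-negative weights (e.g., larger for mined pivot words)
    """
    # Closed-form solution via SVD of the weighted cross-covariance matrix.
    M = X.T @ (w[:, None] * Y)          # d x d weighted cross-covariance
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt                        # orthogonal mapping W

# Toy usage: align a random source space to a rotated copy of itself,
# up-weighting a hypothetical set of pivot-word rows.
rng = np.random.default_rng(0)
d, n = 50, 200
X = rng.normal(size=(n, d))
true_rot, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = X @ true_rot
w = np.ones(n)
w[:20] = 5.0                             # pretend the first 20 pairs are pivot words
W = weighted_procrustes(X, Y, w)
print(np.allclose(X @ W, Y, atol=1e-6))  # True: the rotation is recovered
```

In an M2M-style setting, several source spaces could each be mapped into one shared target space by repeating such a fit per language pair, which is one plausible reading of "integrating multiple source languages into a large common space".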

