4.7 Article

MUSEDA: Multilingual Unsupervised and Supervised Embedding for Domain Adaption

Journal

KNOWLEDGE-BASED SYSTEMS
Volume 273, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.knosys.2023.110560

Keywords

Multilingual word embeddings; Multilingual Natural Language Processing; Domain adaption; Isomorphism

Abstract

Based on the assumption of isomorphism, approaches for generating high-quality, low-cost multilingual word embeddings (MLWEs) have been critical mechanisms for facilitating knowledge transfer between languages, notably between resource-rich and resource-lean languages. However, recent studies have discovered that the isomorphism assumption is not widely applicable and that approaches based on this assumption face significant limitations, leading to stagnation in multilingual natural language processing (MNLP). Instead of pursuing higher-quality MLWEs, we propose MUSEDA (Multilingual Unsupervised and Supervised Embedding for Domain Adaption), a framework for building multilingual word embeddings for domain transfer learning that includes pivot-feature (domain-keyword) mining and weighted embedding alignment in supervised or unsupervised settings. Without using any additional knowledge or parallel data, pivot words can be mined from the current data and introduced into the embedding alignment as a weight factor. We applied our framework to real-world tasks from different domains, namely sentiment classification, news categorization, and named entity recognition, on resource-lean languages including Arabic and Vietnamese, and the experimental results demonstrate that the proposed method surpasses the baseline approaches. In addition, to alleviate the shortage of in-domain corpora for a single source language, we propose a many-to-many (M2M) mode that further enhances performance by integrating multiple source languages into a large common space. (c) 2023 Elsevier B.V. All rights reserved.
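The abstract does not spell out the alignment procedure, but one common way to realize "weighted embedding alignment" is a weighted orthogonal Procrustes fit in which mined pivot (domain-keyword) pairs receive larger weights. The sketch below is a minimal illustration under that assumption, not the paper's exact method; the function name weighted_procrustes and the weighting scheme are hypothetical.

```python
# Hedged sketch: weighted orthogonal Procrustes alignment of two monolingual
# embedding spaces. Per-pair weights could come from pivot-word (domain-keyword)
# scores, as suggested by the abstract; the exact scheme here is an assumption.
import numpy as np

def weighted_procrustes(X, Y, w):
    """Find an orthogonal map W minimizing sum_i w_i * ||x_i W - y_i||^2.

    X: (n, d) source-language embeddings for dictionary pairs
    Y: (n, d) target-language embeddings for the same pairs
    w: (n,) non-negative weights (e.g., larger for mined pivot words)
    """
    # Closed-form solution via SVD of the weighted cross-covariance matrix.
    M = X.T @ (w[:, None] * Y)          # d x d weighted cross-covariance
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt                        # orthogonal mapping W

# Toy usage: align a random source space to a rotated copy of itself,
# up-weighting a hypothetical set of pivot-word rows.
rng = np.random.default_rng(0)
d, n = 50, 200
X = rng.normal(size=(n, d))
true_rot, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = X @ true_rot
w = np.ones(n)
w[:20] = 5.0                             # pretend the first 20 pairs are pivot words
W = weighted_procrustes(X, Y, w)
print(np.allclose(X @ W, Y, atol=1e-6))  # True: the rotation is recovered
```

In an M2M-style setting, several source spaces could each be mapped into one shared target space by repeating such a fit per language pair, which is one plausible reading of "integrating multiple source languages into a large common space".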

