Proceedings Paper

Building a First Language Model for Code-switch Arabic-English

ARABIC COMPUTATIONAL LINGUISTICS (ACLING 2017) (2017)

Journal

ARABIC COMPUTATIONAL LINGUISTICS (ACLING 2017)

Volume 117, Pages 208-216

Publisher

ELSEVIER SCIENCE BV

DOI: 10.1016/j.procs.2017.10.111

Keywords

Automatic Speech Recognition; language model; code-mixing; code-switching; Arabic-English corpus; web corpus; web crawling

Categories

Computer Science, Interdisciplinary Applications; Linguistics; Language & Linguistics

Abstract

The use of mixed languages in daily conversations, referred to as code-switching, has become a common linguistic phenomenon among bilingual/multilingual communities. Code-switching involves the alternating use of distinct languages or codes at sentence boundaries or within the same sentence. With the rise of globalization, code-switching has become prevalent in daily conversations, especially among urban youth. This has led to an increasing demand for automatic speech recognition systems that can handle such mixed speech. In this paper, we present the first steps towards building a multilingual language model (LM) for code-switched Arabic-English. One of the main challenges faced when building a multilingual LM is the need for an explicit mixed-text corpus. Since code-switching is more common in spoken than in written form, text corpora with code-switching are usually scarce. Therefore, the first aim of this paper is to introduce a code-switch Arabic-English text corpus that is collected by automatically downloading relevant documents from the web. The text is then extracted from the documents and processed to be usable for NLP tasks. For language modeling, a baseline LM was built from existing monolingual corpora. The baseline LM gave a perplexity of 11841.9 and an out-of-vocabulary (OOV) rate of 4.07%. The gathered code-switch Arabic-English corpus, along with the existing monolingual corpora, was then used to construct several LMs. The best LM achieved a large improvement over the baseline LM, with a perplexity of 275.41 and an OOV rate of 0.71%. © 2017 The Authors. Published by Elsevier B.V.
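The abstract reports two evaluation metrics, perplexity and OOV rate, without defining them. As a rough illustration only, the sketch below computes both for a toy unigram model with add-one smoothing. It is not the authors' method: the abstract does not state the model order, smoothing, or toolkit, and the function names and the romanized tokens are invented for the example.

```python
import math
from collections import Counter


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose word type never appears in training."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on the test tokens.

    Assumption: all unseen words share a single extra vocabulary slot,
    so the smoothed probabilities sum to one.
    """
    counts = Counter(train_tokens)
    total = len(train_tokens)
    vocab_size = len(counts) + 1  # known word types plus one unknown slot
    log_prob = 0.0
    for tok in test_tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(test_tokens))


# Invented toy data mixing romanized Arabic and English tokens.
train = "ana rayeh el meeting bokra ana busy".split()
test = "ana rayeh el gym".split()

print(oov_rate(train, test))
print(unigram_perplexity(train, test))
```

Lower values are better for both metrics: a lower OOV rate means the training vocabulary covers more of the test text, and a lower perplexity means the model assigns higher probability to it.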