☆ 4.1 Article

Mandarin-English code-switching speech corpus in South-East Asia: SEAME

LANGUAGE RESOURCES AND EVALUATION (2015)

Journal

LANGUAGE RESOURCES AND EVALUATION

Volume 49, Issue 3, Pages 581-600

Publisher

SPRINGER

DOI: 10.1007/s10579-015-9303-x

Keywords

Code-switching speech; Spontaneous spoken corpus development; Mandarin-English; Speech recognition; Language recognition

Category

Computer Science, Interdisciplinary Applications

Abstract

This paper introduces the South East Asia Mandarin-English corpus, a 63-hour spontaneous Mandarin-English code-switching transcribed speech corpus suitable for LVCSR and language change detection/identification research. The corpus was recorded in unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who spoke a mixture of Mandarin and English within a single sentence. About 82% of the transcribed utterances are intra-sentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switch utterances. We also summarize the development effort, including details such as the processing time for transcription, validation and language boundary labelling. Lastly, we present textual analyses of code-switch segments, examining the word length of monolingual segments in code-switch utterances and the most common single words and two-word phrases in such segments.
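
To make the statistics named in the abstract concrete, the following is a minimal Python sketch of how monolingual segments and language turns could be counted in one transcribed utterance. It is not the authors' procedure. The tokenizer (one token per CJK character, one per run of Latin letters), the function names, and the example sentence are all assumptions made for illustration, and the sentence is invented rather than taken from the corpus.

```python
import re
from itertools import groupby

# Hypothetical tokenizer (an assumption, not the corpus segmentation):
# each CJK character is one token; each run of Latin letters, optionally
# with an internal apostrophe, is one token.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+(?:'[A-Za-z]+)?")


def language_of(token):
    """Label a token 'zh' if it starts with a CJK character, else 'en'."""
    return "zh" if "\u4e00" <= token[0] <= "\u9fff" else "en"


def monolingual_segments(utterance):
    """Split an utterance into maximal runs of same-language tokens."""
    tokens = TOKEN_RE.findall(utterance)
    return [(lang, list(run)) for lang, run in groupby(tokens, key=language_of)]


def language_turns(utterance):
    """Count language switches: one fewer than the number of segments."""
    return max(len(monolingual_segments(utterance)) - 1, 0)


if __name__ == "__main__":
    example = "我觉得 this one 比较好"  # invented example, not from SEAME
    for lang, tokens in monolingual_segments(example):
        print(lang, len(tokens), tokens)
    print("language turns:", language_turns(example))
```

Under these assumptions, an utterance counts as intra-sentential code-switching when `language_turns` returns at least 1. Segment length here is measured in tokens; measuring segment duration, as the paper does, would additionally require the time-aligned language boundary labels.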