Article

Towards Malay named entity recognition: an open-source dataset and a multi-task framework

Journal

CONNECTION SCIENCE
Volume 35, Issue 1

Publisher

TAYLOR & FRANCIS LTD
DOI: 10.1080/09540091.2022.2159014

Keywords

Malay; named entity recognition; dataset; multi-task learning; Bi-revision


Named entity recognition (NER) plays a crucial role in various natural language processing (NLP) applications. However, applying advanced NER research to low-resource languages like Malay has been challenging due to the lack of sufficient data. This paper presents a system for building a Malay NER dataset (MS-NER) with 20,146 sentences through labeled datasets in related languages and iterative optimization. Additionally, a Multi-Task framework (MTBR) is proposed to effectively integrate boundary information for improved NER performance.
Named entity recognition (NER) is a key component of many natural language processing (NLP) applications. Most advanced NER research, however, has not been widely applied to low-resource languages such as Malay, because current models are data-hungry and little labelled data exists. In this paper, we present a system for building a Malay NER dataset (MS-NER) of 20,146 sentences from labelled datasets of homologous languages, refined through iterative optimisation. We also propose a multi-task framework, MTBR, that integrates boundary information more effectively for NER. Specifically, boundary detection is treated as an auxiliary task, and an enhanced Bidirectional Revision module with a gated ignoring mechanism performs conditional label transfer, which reduces error propagation from the auxiliary task. We conduct extensive experiments on Malay, Indonesian, and English. The results show that MTBR achieves competitive performance and tends to outperform multiple baselines. The constructed dataset and model will be made publicly available as a new, reliable benchmark for Malay NER.
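
Treating boundary detection as an auxiliary task means each sentence needs boundary labels alongside its entity labels. The following is a minimal sketch of how such labels could be derived, assuming BIO-style entity tags; the abstract does not state the tagging scheme actually used in MS-NER, and the function name, the tag set, and the example sentence are all hypothetical.

```python
from typing import List


def bio_to_boundary(tags: List[str]) -> List[str]:
    """Strip the entity type from BIO tags, keeping only the span position.

    Assumes tags look like "B-PER", "I-LOC", or "O" (an illustrative
    scheme, not necessarily the one used in MS-NER).
    """
    # "B-PER" -> "B", "I-LOC" -> "I", "O" -> "O"
    return [tag.split("-", 1)[0] for tag in tags]


# Hypothetical Malay example: "Najib Razak melawat Kuala Lumpur"
tokens = ["Najib", "Razak", "melawat", "Kuala", "Lumpur"]
entity_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
boundary_tags = bio_to_boundary(entity_tags)

for token, entity, boundary in zip(tokens, entity_tags, boundary_tags):
    print(f"{token}\t{entity}\t{boundary}")
```

Because the boundary labels come directly from the entity labels, the auxiliary task needs no extra annotation; how MTBR exchanges predictions between the two tasks is not specified in the abstract and is not modelled here.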
