Article

Developing Multi-Labelled Corpus of Twitter Short Texts: A Semi-Automatic Method

SYSTEMS (2023)

Journal

SYSTEMS

Volume 11, Issue 8, Pages -

Publisher

MDPI

DOI: 10.3390/systems11080390

Keywords

emotion; sentiment corpus; annotation; multi-labelled; Twitter corpus

Category

Social Sciences, Interdisciplinary

Summary

With the growing need to extract textual features from online texts for better communication in the Digital Media Age, sentiment classification, built on corpora annotated with emotions, is considered the key method for capturing the emotions of online communication. However, manual annotation is labour-intensive and costly, which has left corpora of emotional words scarce. Improved methods of automatic emotion tagging with multiple emotion labels are therefore urgently needed to construct new semantic corpora.

Abstract

As electronic documents multiply in the Digital Media Age, the need to extract textual features from online texts for better communication is growing. Sentiment classification may be the key method for capturing the emotions of online communication, and developing emotion-annotated corpora is the first step towards it. However, because manual annotation is labour-intensive and costly, corpora of emotional words remain scarce. Furthermore, single-label semantic corpora can hardly meet the requirements of modern analysis of users' complicated emotions, yet tagging emotional words with multiple labels is even more difficult than single-label tagging. Improved methods of automatic emotion tagging with multiple emotion labels are therefore urgently needed to construct new semantic corpora. Taking Twitter short texts as a case, this study proposes a new semi-automatic method to annotate Internet short texts with multiple labels and form a multi-labelled corpus for further algorithm training. Each sentence is tagged with both its emotional tendency and its polarity, and each tweet, which generally contains several sentences, is tagged with its top two emotional tendencies. The semi-automatic multi-labelled annotation proceeds in four steps: selecting the base corpus and emotional tags, data preprocessing, automatic annotation through word matching and weight calculation, and manual correction when multiple emotional tendencies are found. Experiments on the published Sentiment140 Twitter corpus demonstrate the effectiveness of the proposed approach and show consistency between the results of semi-automatic and manual annotation. By applying this method, this study summarises the annotation specification and constructs a multi-labelled emotion corpus of 6500 tweets for further algorithm training.
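The annotation pipeline described in the abstract (word matching, weight calculation, and manual correction when several emotional tendencies are found) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, its weights, the emotion labels, the polarity mapping, and every function name below are hypothetical choices made only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon (word -> (emotion label, weight)); not from the paper.
LEXICON = {
    "happy": ("joy", 1.0),
    "love": ("joy", 0.8),
    "great": ("joy", 0.6),
    "sad": ("sadness", 1.0),
    "miss": ("sadness", 0.6),
    "angry": ("anger", 1.0),
    "hate": ("anger", 0.9),
}
# Assumed polarity mapping: any emotion not listed here counts as negative.
POSITIVE_EMOTIONS = {"joy"}


def split_sentences(tweet):
    """Naive sentence split on ., ! and ? (an assumption; real tweets need more care)."""
    return [s.strip() for s in re.split(r"[.!?]+", tweet) if s.strip()]


def score_sentence(sentence):
    """Word matching: sum the lexicon weights of matched words per emotion."""
    scores = defaultdict(float)
    for token in re.findall(r"[a-z']+", sentence.lower()):
        if token in LEXICON:
            emotion, weight = LEXICON[token]
            scores[emotion] += weight
    return dict(scores)


def rank(scores):
    """Sort emotions by descending weight, breaking ties alphabetically."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def annotate_sentence(sentence):
    """Tag one sentence with an emotional tendency and a polarity."""
    ranked = rank(score_sentence(sentence))
    if not ranked:
        return {"sentence": sentence, "emotion": None,
                "polarity": "neutral", "needs_review": False}
    emotion = ranked[0][0]
    polarity = "positive" if emotion in POSITIVE_EMOTIONS else "negative"
    # More than one matched emotion: flag the sentence for manual correction.
    return {"sentence": sentence, "emotion": emotion,
            "polarity": polarity, "needs_review": len(ranked) > 1}


def annotate_tweet(tweet):
    """Tag each sentence, then tag the tweet with its top two emotions."""
    sentences = split_sentences(tweet)
    totals = defaultdict(float)
    for sentence in sentences:
        for emotion, weight in score_sentence(sentence).items():
            totals[emotion] += weight
    return {
        "sentences": [annotate_sentence(s) for s in sentences],
        "top_emotions": [emotion for emotion, _ in rank(totals)[:2]],
    }


result = annotate_tweet("I love this great phone! But I hate the battery.")
```

In this sketch, a sentence that matches more than one emotion is flagged with `needs_review` rather than resolved automatically, which mirrors the semi-automatic design: the automatic pass handles unambiguous sentences and leaves ambiguous ones to a human annotator.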
