Journal
IEEE ACCESS
Volume 10, Issue -, Pages 105328-105351
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2022.3211396
Keywords
Social networking (online); Data models; Blogs; Data integrity; Sentiment analysis; Information integrity; Big Data; Coherence; Social media; big data; microblogging platforms; topic modeling; data cleansing; data quality; topic coherence; purity
Funding
- Department of Computer Science, University of Amran, Amran, Yemen
- ENT
This study addresses the poor quality of microblog data and proposes a Social Media Data Cleansing Model (SMDCM) to improve data quality for Short-Text Topic Modelling (STTM). An evaluation with six topic modelling methods found that GLTM and WNTM were the most effective when the SMDCM techniques were applied, achieving optimum topic coherence and high accuracy values.
With the emergence of microblogging platforms and social media applications, large amounts of user-generated data in the form of comments, reviews, and brief text messages are produced every day. Microblog data is typically of poor quality; hence, improving its quality is a significant scientific and practical challenge. Despite the relevance of the problem, little work has been done so far, especially on microblog data quality for Short-Text Topic Modelling (STTM) purposes. This paper addresses this problem and proposes an approach called the Social Media Data Cleansing Model (SMDCM) to improve data quality for STTM. We evaluate SMDCM using six topic modelling methods, namely Latent Dirichlet Allocation (LDA), Word-Network Topic Model (WNTM), Pseudo-document-based Topic Modelling (PTM), Biterm Topic Model (BTM), Global and Local word embedding-based Topic Modelling (GLTM), and Fuzzy Topic Modelling (FTM). We used the Real-world Cyberbullying Twitter (RW-CB-Twitter) and the Cyberbullying Mendeley (CB-MNDLY) datasets in the evaluation. The results show that GLTM and WNTM are more effective than the other STTM models when the SMDCM techniques are applied, achieving optimum topic coherence and high accuracy values on both the RW-CB-Twitter and CB-MNDLY datasets.
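To make the idea of cleansing microblog text before topic modelling concrete, the following is a minimal Python sketch of generic short-text cleaning. It is not the SMDCM pipeline, whose steps are not described in this record. The function names `clean_post` and `clean_corpus`, the regular expressions, and the `min_tokens` threshold are all assumptions chosen for illustration.

```python
import re

# Illustrative patterns only; these are assumptions, not rules taken from SMDCM.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_post(text: str) -> str:
    """Normalize one microblog post: lowercase, drop URLs and mentions,
    keep hashtag words without the '#', and remove non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")
    text = NON_ALPHA_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


def clean_corpus(posts, min_tokens=2):
    """Clean every post, then drop near-empty posts and exact duplicates.

    min_tokens is a hypothetical threshold: very short posts carry little
    word co-occurrence information for a topic model.
    """
    seen = set()
    cleaned = []
    for post in posts:
        normalized = clean_post(post)
        if len(normalized.split()) < min_tokens or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned


if __name__ == "__main__":
    sample = ["Hello world!", "hello WORLD", "ok", "RT @a: stop it now"]
    print(clean_corpus(sample))
```

The cleaned list could then be passed to any of the topic models named above. Which cleaning steps actually help a given model is an empirical question of the kind the paper evaluates.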