Article

Enhancing Big Social Media Data Quality for Use in Short-Text Topic Modeling

Journal

IEEE ACCESS
Volume 10, Issue -, Pages 105328-105351

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2022.3211396

Keywords

Social networking (online); Data models; Blogs; Data integrity; Sentiment analysis; Information integrity; Big Data; Coherence; Social media; big data; microblogging platforms; topic modeling; data cleansing; data quality; topic coherence; purity

Funding

  1. Department of Computer Science, University of Amran, Amran, Yemen
  2. ENT


This study addresses the problem of poor-quality microblog data and proposes a Social Media Data Cleansing Model (SMDCM) to improve data quality for Short-Text Topic Modelling (STTM). An evaluation with six topic modelling methods found that GLTM and WNTM were the most effective when the SMDCM techniques were applied, achieving optimum topic coherence and high accuracy values.
With the emergence of microblogging platforms and social media applications, large amounts of user-generated data in the form of comments, reviews, and brief text messages are produced every day. Microblog data is typically of poor quality; hence, improving its quality is a significant scientific and practical challenge. Despite the relevance of the problem, little work has been done so far, especially on microblog data quality for Short-Text Topic Modelling (STTM) purposes. This paper addresses this problem and proposes an approach called the Social Media Data Cleansing Model (SMDCM) to improve data quality for STTM. We evaluate SMDCM using six topic modelling methods, namely Latent Dirichlet Allocation (LDA), Word-Network Topic Model (WNTM), Pseudo-document-based Topic Modelling (PTM), Biterm Topic Model (BTM), Global and Local word embedding-based Topic Modeling (GLTM), and Fuzzy Topic Modelling (FTM). We used the Real-world Cyberbullying Twitter (RW-CB-Twitter) and the Cyberbullying Mendeley (CB-MNDLY) datasets in the evaluation. The results show that GLTM and WNTM outperform the other STTM models when the SMDCM techniques are applied, achieving optimum topic coherence and high accuracy values on both the RW-CB-Twitter and CB-MNDLY datasets.
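
The abstract does not list the individual SMDCM steps, so the following is only a minimal sketch of the kind of generic microblog cleansing that is commonly applied before short-text topic modelling. It is not the authors' model. The function names (clean_tweet, clean_corpus), the regular expressions, and the min_token_len threshold are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; these are assumptions, not SMDCM's actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text, min_token_len=2):
    """Return a list of lowercase alphabetic tokens from one short text."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @user mentions
    text = text.replace("#", " ")       # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text)  # drop digits, punctuation, emoji
    return [t for t in text.split() if len(t) >= min_token_len]


def clean_corpus(tweets):
    """Clean every text, then drop empty results and exact duplicates."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        tokens = clean_tweet(tweet)
        key = " ".join(tokens)
        if tokens and key not in seen:
            seen.add(key)
            cleaned.append(tokens)
    return cleaned
```

The token lists returned by clean_corpus could then be passed to any of the topic models named above. Which cleansing steps actually matter for topic coherence is the question the paper evaluates, so this sketch should not be read as a recommended pipeline.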
