Article

Traditional and context-specific spam detection in low resource settings

MACHINE LEARNING (2022)

Journal

MACHINE LEARNING

Volume 111, Issue 7, Pages 2515-2536

Publisher

SPRINGER

DOI: 10.1007/s10994-022-06176-x

Keywords

Context-specific spam; Low-resource learning; Content-based spam detection; Cross-domain learning

Category

Computer Science, Artificial Intelligence

Funding

National Science Foundation [1934925, 1934494]
Massive Data Institute (MDI) at Georgetown University

Summary

Social media data contains a mix of high- and low-quality content. Using several Twitter data sets, the study shows that context-specific spam exists and is identifiable, then compares traditional machine learning models with a BERT-based neural network model for detecting spam from content-based features alone. The neural network model outperforms the traditional models, reaching an F1 score of 0.91. The study also examines data imbalance: simple Bag-of-Words models perform best under extreme imbalance, while a neural model fine-tuned using language models from other domains significantly improves the F1 score, though not to the level of domain-specific neural models.

Abstract

Social media data has a mix of high- and low-quality content. One commonly studied form of low-quality content is spam. Most studies assume that spam is context-neutral. We show on different Twitter data sets that context-specific spam exists and is identifiable. We then compare multiple traditional machine learning models and a neural network model that uses a pre-trained BERT language model to capture contextual features for identifying spam, both traditional and context-specific, using only content-based features. The neural network model outperforms the traditional models with an F1 score of 0.91. Because spam training data sets are notoriously imbalanced, we also investigate the impact of this imbalance and show that simple Bag-of-Words models are best with extreme imbalance, while a neural model that fine-tunes using language models from other domains significantly improves the F1 score, although not to the levels of domain-specific neural models. This suggests that the best strategy may vary with the level of imbalance in the data set, the amount of data available in a low resource setting, and the prevalence of context-specific spam vs. traditional spam. Finally, we make our data sets available for use by the research community.
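
The abstract reports results as F1 scores and stresses that spam data sets are heavily imbalanced. As a minimal sketch of why F1 rather than accuracy is the natural metric in that setting, the following Python computes both on a made-up label set. The helper names (`bag_of_words`, `f1_score`, `accuracy`) and the toy labels are assumptions introduced here for illustration; they are not the authors' code or data.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to token counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (spam) class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero/undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, extremely imbalanced labels: 2 spam (1) among 100 posts.
y_true = [1, 1] + [0] * 98
always_ham = [0] * 100            # never predicts spam
one_caught = [1, 0] + [0] * 98    # catches one spam, no false positives

print(accuracy(y_true, always_ham))   # high accuracy despite finding no spam
print(f1_score(y_true, always_ham))   # zero F1 for the spam class
print(f1_score(y_true, one_caught))   # precision 1.0, recall 0.5
print(bag_of_words("Win a FREE prize win"))
```

A classifier that never flags spam still scores 98% accuracy on these labels while its spam-class F1 is zero, which is why imbalance-sensitive comparisons such as the one in the abstract are reported in F1.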
