Article

A weighted feature enhanced Hidden Markov Model for spam SMS filtering

Journal

NEUROCOMPUTING
Volume 444, Issue -, Pages 48-58

Publisher

ELSEVIER
DOI: 10.1016/j.neucom.2021.02.075

Keywords

Hidden Markov Model (HMM); Short Messaging Service (SMS); Spam filtering; Weighted features; Text classification

Funding

  1. Soft Engineering of Key Subjects Construction at Shanghai Polytechnic University [xxkzd1604]

Short Message Service (SMS) is commonly used by people in daily life, but it is also misused by spammers. Researchers have developed rule-based and content-based filtering techniques, as well as machine learning methods, to combat spam messages. The weighted feature enhanced Hidden Markov Model (HMM) has shown significant improvement in filtering accuracy and speed.
Short Message Service (SMS) is one of the most widely used communication services in daily life. However, the service is misused by spammers. Rule-based systems (RBS) and content-based filtering (CBF) techniques have been developed to filter out spam messages. New rules can easily be added to an RBS, but throughput usually drops as the number of rules grows. CBF techniques use machine learning methods to extract features from the SMS message body according to word frequency and distribution; those built on the bag-of-words (BoW) assumption ignore word order. Striving to improve performance, researchers have developed hybrid models that make the algorithms ever more complex. In addition, because time-consuming model training and deployment must be repeated frequently, the anti-spam industry still relies mainly on rule-based systems, leaving the throughput issue unsolved. A discrete Hidden Markov Model (HMM) was proposed in our previous study to address these issues, and the HMM method achieved performance comparable to deep learning methods. To further improve the performance of the HMM method, we propose a new approach that weights and labels the words in an SMS message to form the observation sequence for the HMM. The weighted feature enhanced HMM achieves higher accuracy and much faster training and filtering, meeting the requirements of the anti-spam industry. Its performance is compared with other machine learning methods on the same open repository data set maintained by the University of California, Irvine (UCI). Experimental results show that the weighted feature enhanced HMM outperforms the LSTM (long short-term memory) model and comes close to the CNN (convolutional neural network) in classification accuracy. In addition, a Chinese SMS data set is used to further validate filtering accuracy and filtering speed. (c) 2021 Elsevier B.V. All rights reserved.
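
The abstract does not give the weighting formula or the label set, so the following is only a minimal sketch of what "weighting and labeling words to form an observation sequence" could look like. It assumes a smoothed log-odds weight per word and three discrete symbols; the function names `word_weights` and `to_observations`, the `threshold` parameter, and the toy messages are all hypothetical and are not taken from the paper. The point it illustrates is that, unlike a bag-of-words vector, the resulting sequence keeps word order, which is what a discrete HMM consumes.

```python
import math
from collections import Counter


def word_weights(spam_msgs, ham_msgs):
    """Smoothed log-odds of each word occurring in spam vs. ham.

    Assumption: this weighting is illustrative, not the paper's formula.
    """
    spam = Counter(w for m in spam_msgs for w in m.lower().split())
    ham = Counter(w for m in ham_msgs for w in m.lower().split())
    vocab = set(spam) | set(ham)
    n_spam, n_ham = sum(spam.values()), sum(ham.values())
    size = len(vocab)
    return {
        w: math.log((spam[w] + 1) / (n_spam + size))
        - math.log((ham[w] + 1) / (n_ham + size))
        for w in vocab
    }


def to_observations(message, weights, threshold=0.5):
    """Map each word, in order, to a symbol: 0 = ham-like, 1 = neutral, 2 = spam-like.

    Words not seen in training get weight 0.0 and are labeled neutral.
    """
    obs = []
    for w in message.lower().split():
        score = weights.get(w, 0.0)
        if score > threshold:
            obs.append(2)
        elif score < -threshold:
            obs.append(0)
        else:
            obs.append(1)
    return obs


# Toy training messages (made up for illustration only).
spam_examples = ["win free prize now", "free prize call now"]
ham_examples = ["see you at lunch", "call me at lunch"]

weights = word_weights(spam_examples, ham_examples)
print(to_observations("free lunch call tomorrow", weights))  # [2, 0, 1, 1]
```

Swapping two words changes the sequence (`"free lunch"` gives `[2, 0]`, `"lunch free"` gives `[0, 2]`), whereas a bag-of-words count vector would be identical for both.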
