4.6 Article

Supervised classification of spam emails with natural language stylometry

Journal

NEURAL COMPUTING & APPLICATIONS
Volume 27, Issue 8, Pages 2315-2331

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s00521-015-2069-7

Keywords

Spam classification; Natural language processing; Stylometry; Supervised machine learning; Text classification; Computational linguistics; Text mining; Performance evaluation

Funding

  1. Natural Sciences and Engineering Research Council of Canada (NSERC) [36853-2010 RGPIN]

Abstract

Email spam is one of the biggest threats to today's Internet. To deal with this threat, there are long-established measures like supervised anti-spam filters. In this paper, we report the development and evaluation of SENTINEL, an anti-spam filter based on natural language and stylometry attributes. The performance of the filter is evaluated not only on non-personalized emails (i.e., emails collected randomly) but also on personalized emails (i.e., emails collected from particular individuals). Among the non-personalized datasets are CSDMC2010, SpamAssassin, and LingSpam, while the Enron-Spam collection comprises personalized emails. The proposed filter extracts natural language attributes from email text that are closely related to writer stylometry and generates classifiers using multiple learning algorithms. Experimental outcomes show that classifiers generated by meta-learning algorithms such as ADABOOSTM1 and BAGGING are the best, performing equally well and surpassing the performance of a number of filters proposed in previous studies, while a classifier generated by random forest is a close second. On the other hand, the performance of classifiers using support vector machine and Naive Bayes is not satisfactory. In addition, we find much improved results on personalized emails and mixed results on non-personalized emails.
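To make the idea of stylometry attributes concrete, the following is a minimal Python sketch of the kind of features such a filter might extract from an email body. It is an illustration only: the function name `stylometric_features` and the seven attributes it computes are assumptions chosen for this example and are not SENTINEL's actual attribute set, which the abstract does not enumerate.

```python
import re
import string


def stylometric_features(text: str) -> dict:
    """Compute a few illustrative stylometric attributes for one email body.

    Hypothetical feature set for illustration; not the attributes used by SENTINEL.
    """
    # Words are runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    n_chars = len(text)
    return {
        "word_count": n_words,
        "sentence_count": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Vocabulary richness: distinct lowercase words over total words.
        "type_token_ratio": len({w.lower() for w in words}) / n_words if n_words else 0.0,
        "uppercase_ratio": sum(c.isupper() for c in text) / n_chars if n_chars else 0.0,
        "punctuation_ratio": sum(c in string.punctuation for c in text) / n_chars if n_chars else 0.0,
    }


if __name__ == "__main__":
    sample = "WIN cash now! Click here."
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value}")
```

In a supervised setting, one such feature vector per email, paired with a spam or non-spam label, would form the training input for whichever learning algorithm is being evaluated.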
