Article

Online active multi-field learning for efficient email spam filtering

Journal

KNOWLEDGE AND INFORMATION SYSTEMS
Volume 33, Issue 1, Pages 117-136

Publisher

SPRINGER LONDON LTD
DOI: 10.1007/s10115-011-0461-x

Keywords

Online learning; Multi-field learning; Active learning; Email spam filtering; TREC spam track

Funding

  1. National Natural Science Foundation of China [60873097, 60933005]
  2. Program for New Century Excellent Talents in University [NCET-06-0926]
  3. Fund of Innovation of NUDT [B080605]

Abstract

Email spam causes a serious waste of time and resources. This paper addresses the email spam filtering problem and proposes an online active multi-field learning approach based on three observations: (1) email spam filtering is an online application, which suggests online learning; (2) an email document has a multi-field text structure, which suggests multi-field learning; and (3) obtaining labels for a real-world email spam filter is costly, which suggests active learning. The online learner treats email spam filtering as incremental, supervised, binary classification of streaming text. The multi-field learner combines the predictions of the field classifiers using a novel compound weight schema; each field classifier computes the arithmetic mean of the conditional probabilities of the email's feature strings, which are looked up in a string-frequency index. The active learner estimates classification confidence by comparing the current variance of the field classifiers' results with the historical variance, and treats the more uncertain emails as the more informative samples for which to request labels. Experimental results show that the proposed approach achieves state-of-the-art performance with greatly reduced label requirements and very low space and time costs. Measured by the standard (1-ROCA)% metric, the performance of online active multi-field learning even exceeds the full-feedback performance of some advanced individual text classification algorithms.
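
The pipeline described in the abstract can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the abstract does not say how feature strings are extracted (overlapping character 4-grams are assumed here), it does not define the compound weight schema (a plain average of field scores stands in for it), and it does not give the exact variance comparison (a label is requested here when the current variance is at least the running mean of past variances). All class and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def ngrams(text, n=4):
    """Overlapping character n-grams (an assumed choice of feature string)."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 0))]


class FieldClassifier:
    """Scores one email field by averaging per-string spam probabilities."""

    def __init__(self):
        # String-frequency index: feature string -> [ham count, spam count].
        self.index = defaultdict(lambda: [0, 0])

    def score(self, text):
        probs = []
        for s in ngrams(text):
            if s in self.index:  # skip strings never seen in training
                ham, spam = self.index[s]
                # Smoothed estimate of P(spam | s); the smoothing is an assumption.
                probs.append((spam + 1) / (ham + spam + 2))
        return mean(probs) if probs else 0.5  # 0.5 means "no evidence"

    def update(self, text, is_spam):
        for s in set(ngrams(text)):
            self.index[s][int(is_spam)] += 1


class OnlineActiveMultiFieldFilter:
    """One classifier per field, plus a variance-based label request rule."""

    def __init__(self, fields):
        self.classifiers = {f: FieldClassifier() for f in fields}
        self.variance_sum = 0.0
        self.seen = 0

    def predict(self, email):
        scores = [c.score(email.get(f, "")) for f, c in self.classifiers.items()]
        combined = mean(scores)  # stand-in for the compound weight schema
        current = pvariance(scores)  # disagreement among field classifiers
        historical = self.variance_sum / self.seen if self.seen else 0.0
        wants_label = current >= historical  # assumed uncertainty rule
        self.variance_sum += current
        self.seen += 1
        return combined, wants_label

    def learn(self, email, is_spam):
        for f, c in self.classifiers.items():
            c.update(email.get(f, ""), is_spam)


if __name__ == "__main__":
    spam_filter = OnlineActiveMultiFieldFilter(["subject", "body"])
    stream = [
        ({"subject": "cheap pills", "body": "buy cheap pills now"}, True),
        ({"subject": "meeting notes", "body": "see agenda for the meeting"}, False),
    ]
    for email, is_spam in stream:
        score, wants_label = spam_filter.predict(email)
        if wants_label:  # only labeled emails update the indexes
            spam_filter.learn(email, is_spam)
        print(round(score, 3), wants_label)
```

In this sketch, a score above 0.5 leans toward spam. The filter only pays the labeling cost when the field classifiers disagree more than they have historically, which is the intuition behind the reduced label requirements reported in the abstract.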
