Spam Filtering with Naive Bayes -- Which Naive Bayes? (2006)
Abstract:
Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of Naive Bayes, and compare them on six new, non-encoded datasets that contain ham messages of particular Enron users and fresh spam messages. The new datasets, which we make publicly available, are more realistic than previous comparable benchmarks, because they maintain the temporal order of the messages in the two categories, and they emulate the varying proportion of spam and ham messages that users receive over time. We adopt an experimental procedure that emulates the incremental training of personalized spam filters, and we plot ROC curves that allow us to compare the different versions of NB over the entire tradeoff between true positives and true negatives.

