Abstract:
Essentially all data mining algorithms assume that the data-generating process is independent of the data miner's activities. However, in many domains, including spam detection, intrusion detection, fraud detection, surveillance and counter-terrorism, this is far from the case: the data is actively manipulated by an adversary seeking to make the classifier produce false negatives. In these domains, the performance of a classifier can degrade rapidly after it is deployed, as the adversary learns to defeat it. Currently the only solution to this is repeated, manual, ad hoc reconstruction of the classifier. In this paper we develop a formal framework and algorithms for this problem. We view classification as a game between the classifier and the adversary, and produce a classifier that is optimal given the adversary's optimal strategy. Experiments in a spam detection domain show that this approach can greatly outperform a classifier learned in the standard way, and (within the parameters of the problem) automatically adapt the classifier to the adversary's evolving manipulations.