## Online passive-aggressive algorithms (2006)

### Download Links

- [webee.technion.ac.il]
- [engr.case.edu]
- [books.nips.cc]
- [www.jmlr.org]
- [www.cs.huji.ac.il]
- [ttic.uchicago.edu]
- [static.googleusercontent.com]
- [eprints.pascal-network.org]
- [www.lirmm.fr]
- [jmlr.org]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Machine Learning Research

Citations: 307 (22 self)

### BibTeX

@ARTICLE{Crammer06onlinepassive-aggressive,
  author  = {Koby Crammer and Ofer Dekel and Joseph Keshet and Shai Shalev-Shwartz and Yoram Singer},
  title   = {Online passive-aggressive algorithms},
  journal = {Journal of Machine Learning Research},
  year    = {2006},
  volume  = {7},
  pages   = {551--585}
}

### Abstract

We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss.
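The update at the heart of the passive-aggressive framework has a simple closed form: on each round the weights stay unchanged when the hinge loss is zero (passive), and otherwise move just far enough to satisfy the margin constraint (aggressive), i.e. w_{t+1} = w_t + τ_t y_t x_t with τ_t = ℓ_t/‖x_t‖². A minimal pure-Python sketch of this binary-classification step (variable names are illustrative, not the paper's code):

```python
def pa_update(w, x, y):
    """One passive-aggressive step for binary classification.

    Sketch of the closed-form update w_{t+1} = w_t + tau_t * y_t * x_t,
    where tau_t = loss_t / ||x_t||^2 and loss_t is the hinge loss
    max(0, 1 - y_t <w_t, x_t>).
    """
    dot = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * dot)
    if loss == 0.0:                        # passive: margin already satisfied
        return w
    tau = loss / sum(xi * xi for xi in x)  # aggressive: project onto constraint
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# Toy run on two linearly separable points.
w = [0.0, 0.0]
for x, y in [([1.0, 0.0], 1), ([0.0, 1.0], -1)]:
    w = pa_update(w, x, y)
# w is now [1.0, -1.0], which classifies both points correctly.
```

Note how a correctly-margined example leaves the weights untouched, which is what distinguishes PA from a plain gradient step on the hinge loss.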

### Citations

9462 | Statistical Learning Theory - Vapnik - 1998
Citation Context: ...Due to the lack of space we mention here only a handful of related papers. The updates we derive are based on an optimization problem directly related to the one employed by Support Vector Machines [11]. Li and Long [10] were among the first to suggest the idea of converting a batch optimization problem into an online task. Our work borrows ideas from the work of Warmuth and colleagues [8]. In parti...

3927 | Convex Optimization - Boyd, Vandenberghe - 2004
Citation Context: ...n problem in Eq. (2) has a simple closed form solution, wt+1 = wt + τt yt xt where τt = ℓt/‖xt‖². (3) We now show how this update is derived using standard tools from convex analysis (see for instance (Boyd and Vandenberghe, 2004)). If ℓt = 0 then wt itself satisfies the constraint in Eq. (2) and is clearly the optimal solution. We therefore concentrate on the case where ℓt > 0. First, we define the Lagrangian of the optimiza...

2132 | Learning with kernels - Schölkopf, Smola - 2002
Citation Context: ...and colleagues [8]. In particular, Gentile and Warmuth [4] generalized and adapted techniques from [8] to the hinge loss which is closely related to the losses defined in Eqs. (1)-(3). Kivinen et al. [7] discussed a general framework for gradient-based online learning where some of their bounds bear similarities to the bounds presented in this paper. Our work also generalizes and greatly improves onl...

1634 | Term weighting approaches in automatic text retrieval - Salton, Buckley - 1988
Citation Context: ...our construction with an example from the domain of text categorization. We describe a variant of the Term Frequency - Inverse Document Frequency (TF-IDF) representation of documents (Rocchio, 1971; Salton and Buckley, 1988). Each feature φj corresponds to a different word, denoted µj. Given a corpus of documents S, for every x ∈ S and for every potential topic y, the feature φj(x,y) is defined to be...
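The quoted context describes a TF-IDF style feature map. As an aside, standard TF-IDF weighting (term frequency times log inverse document frequency) can be computed as follows; the helper name and toy counts are illustrative, not the paper's exact φj definition:

```python
import math

def tf_idf(term_counts, doc_freq, n_docs):
    """Weight each term by tf * log(N / df): terms frequent in the
    document but rare in the corpus score highest. Illustrative helper."""
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()}

# Toy corpus of 10 documents: "margin" appears in 2 of them,
# "loss" in all 10, so "loss" carries no discriminative weight.
weights = tf_idf({"margin": 3, "loss": 1},
                 {"margin": 2, "loss": 10}, n_docs=10)
# weights["loss"] == 0.0; weights["margin"] == 3 * log(5)
```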

980 | An Introduction to Support Vector Machines - Cristianini, Shawe-Taylor - 2000
Citation Context: ...formal properties and simplicity of linear predictors. For concreteness, our presentation and analysis are confined to the linear case which is often referred to as the primal version (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002). As in other constructions of linear kernel machines, our paradigm also builds on the notion of margin. Binary classification is the first setting we discuss in the paper...

893 | Relevance feedback in information retrieval - Rocchio - 1971
Citation Context: ...We motivate our construction with an example from the domain of text categorization. We describe a variant of the Term Frequency - Inverse Document Frequency (TF-IDF) representation of documents (Rocchio, 1971; Salton and Buckley, 1988). Each feature φj corresponds to a different word, denoted µj. Given a corpus of documents S, for every x ∈ S and for every potential topic y, the feature φj(x,y) is defin...

849 | The perceptron: a probabilistic model for information storage and organization in the brain - Rosenblatt - 1958
Citation Context: ...(Littlestone, 1989; Kivinen and Warmuth, 1997). Online margin-based prediction algorithms are also quite prevalent. The roots of many of the papers date back to the Perceptron algorithm (Agmon, 1954; Rosenblatt, 1958; Novikoff, 1962). More modern examples include the ROMMA algorithm of Li and Long (2002), Gentile's ALMA algorithm (Gentile, 2001), the MIRA algorithm (Crammer and Singer, 2003b), and the NORMA algor...

631 | Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond - Schölkopf, Smola - 2002
Citation Context: ...linear predictors. For concreteness, our presentation and analysis are confined to the linear case which is often referred to as the primal version (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002). As in other constructions of linear kernel machines, our paradigm also builds on the notion of margin. Binary classification is the first setting we discuss in the paper. In this setting each insta...

534 | Distance metric learning with application to clustering with side-information - Xing, Ng, et al. - 2003

511 | BoosTexter: A boosting-based system for text categorization - Schapire, Singer - 2000

459 | Max-margin Markov networks - Taskar, Guestrin, et al. - 2003
Citation Context: ...s a key role in constructing efficient learning and inference procedures. Notable examples for structured label sets are graphs (in particular trees) and sequences (Collins, 2000; Altun et al., 2003; Taskar et al., 2003; Tsochantaridis et al., 2004). We now overview how the cost-sensitive learning algorithms described in the previous section can be adapted to structured output settings. For concreteness, we focus on...

420 | Large margin classification using the Perceptron algorithm - Freund, Schapire - 1999
Citation Context: ...C, setting the remaining T − 1 new coordinates to zero, and then using the simple PA update. This technique was previously used to derive noise-tolerant online algorithms in (Klasner and Simon, 1995; Freund and Schapire, 1999). We do not use this observation explicitly in this paper, since it does not lead to a tighter analysis. Up until now, we have restricted our discussion to linear predictors of the form sign(w · x)...

380 | On the algorithmic implementation of multiclass kernel-based vector machines - Crammer, Singer

324 | Support vector machine learning for interdependent and structured output spaces - Tsochantaridis, Hofmann, et al. - 2004
Citation Context: ...ructing efficient learning and inference procedures. Notable examples for structured label sets are graphs (in particular trees) and sequences (Collins, 2000; Altun et al., 2003; Taskar et al., 2003; Tsochantaridis et al., 2004). We now overview how the cost-sensitive learning algorithms described in the previous section can be adapted to structured output settings. For concreteness, we focus on an adaptation for sequence p...

265 | Parallel optimization: theory, algorithms and applications - Censor, Zenios - 1997
Citation Context: ...generalizes and greatly improves online loss bounds for classification given in [2]. Finally, we would like to note that similar algorithms have been devised in the convex optimization community (see [1]). The main difference between these algorithms and the online algorithms presented in this paper lies in the analysis: while we derive worst case, finite horizon, loss bounds, the optimization commun...

256 | Ultraconservative online algorithms for multiclass problems - Crammer, Singer - 2003
Citation Context: ...based online learning where some of their bounds bear similarities to the bounds presented in this paper. Our work also generalizes and greatly improves online loss bounds for classification given in [2]. Finally, we would like to note that similar algorithms have been devised in the convex optimization community (see [1]). The main difference between these algorithms and the online algorithms presen...

253 | Exponentiated gradient versus gradient descent for linear predictors - Kivinen, Warmuth - 1997
Citation Context: ...Machines [11]. Li and Long [10] were among the first to suggest the idea of converting a batch optimization problem into an online task. Our work borrows ideas from the work of Warmuth and colleagues [8]. In particular, Gentile and Warmuth [4] generalized and adapted techniques from [8] to the hinge loss which is closely related to the losses defined in Eqs. (1)-(3). Kivinen et al. [7] discussed a ge...

197 | Hidden Markov support vector machines - Altun, Tsochantaridis, et al. - 2003
Citation Context: ...d the structure plays a key role in constructing efficient learning and inference procedures. Notable examples for structured label sets are graphs (in particular trees) and sequences (Collins, 2000; Altun et al., 2003; Taskar et al., 2003; Tsochantaridis et al., 2004). We now overview how the cost-sensitive learning algorithms described in the previous section can be adapted to structured output settings. For conc...

189 | Online convex programming and generalized infinitesimal gradient ascent - Zinkevich - 2003

142 | A kernel method for multi-labelled classification - Elisseeff, Weston - 2002
Citation Context: ...alizes the definition of margin for binary classification and was employed by both single-label and multilabel learning algorithms for support vector machines (Vapnik, 1998; Weston and Watkins, 1999; Elisseeff and Weston, 2001; Crammer and Singer, 2003a). In words, the margin is the difference between the score of the lowest ranked relevant label and the score of the highest ranked irrelevant label. The margin is positive...

134 | On convergence proofs on perceptrons - Novikoff - 1962
Citation Context: ...; Kivinen and Warmuth, 1997). Online margin-based prediction algorithms are also quite prevalent. The roots of many of the papers date back to the Perceptron algorithm (Agmon, 1954; Rosenblatt, 1958; Novikoff, 1962). More modern examples include the ROMMA algorithm of Li and Long (2002), Gentile's ALMA algorithm (Gentile, 2001), the MIRA algorithm (Crammer and Singer, 2003b), and the NORMA algorithm (Kivinen et...

109 | Mistake bounds and logarithmic linear-threshold learning algorithms. Doctoral dissertation - Littlestone - 1989
Citation Context: ...The mere idea of deriving an update as a result of a constrained optimization problem comprising two opposing terms has been largely advocated by Littlestone, Warmuth, Kivinen and colleagues (Littlestone, 1989; Kivinen and Warmuth, 1997). Online margin-based prediction algorithms are also quite prevalent. The roots of many of the papers date back to the Perceptron algorithm (Agmon, 1954; Rosenblatt, 1958;...

90 | A new approximate maximal margin classification algorithm - Gentile - 2001
Citation Context: ..., 8, 6]). An interesting question is whether the unified view of classification, regression, and uniclass can be exported and used with other algorithms for classification such as ROMMA [10] and ALMA [3]. Another, rather general direction for possible extension surfaces when replacing the Euclidean distance between wt+1 and wt with other distances and divergences such as the Bregman divergence. The r...

76 | Relative loss bounds for multidimensional regression problems - Kivinen, Warmuth

74 | The relaxed online maximum margin algorithm - Li, Long - 2002
Citation Context: ...of space we mention here only a handful of related papers. The updates we derive are based on an optimization problem directly related to the one employed by Support Vector Machines [11]. Li and Long [10] were among the first to suggest the idea of converting a batch optimization problem into an online task. Our work borrows ideas from the work of Warmuth and colleagues [8]. In particular, Gentile and...

74 | The relaxation method for linear inequalities - Agmon - 1954
Citation Context: ...d colleagues (Littlestone, 1989; Kivinen and Warmuth, 1997). Online margin-based prediction algorithms are also quite prevalent. The roots of many of the papers date back to the Perceptron algorithm (Agmon, 1954; Rosenblatt, 1958; Novikoff, 1962). More modern examples include the ROMMA algorithm of Li and Long (2002), Gentile's ALMA algorithm (Gentile, 2001), the MIRA algorithm (Crammer and Singer, 2003b), a...

69 | Large margin hierarchical classification - Dekel, Keshet, et al.

64 | The robustness of the p-norm algorithms - Gentile

61 | Online and batch learning of pseudo-metrics - Shalev-Shwartz, Singer, et al. - 2004

59 | A new family of online algorithms for category ranking - Crammer, Singer - 2002

38 | Linear hinge loss and average margin - Gentile, Warmuth - 1998
Citation Context: ...ng the first to suggest the idea of converting a batch optimization problem into an online task. Our work borrows ideas from the work of Warmuth and colleagues [8]. In particular, Gentile and Warmuth [4] generalized and adapted techniques from [8] to the hinge loss which is closely related to the losses defined in Eqs. (1)-(3). Kivinen et al. [7] discussed a general framework for gradient-based onlin...

36 | Relative loss bounds for single neurons - Helmbold, Kivinen, et al. - 1999
Citation Context: ...: B ≥ ‖xt‖∞²) and B = 16 for uniclass. The proof of the theorem is rather technical and uses the proof technique of Thm. 1 in conjunction with inequalities on the logarithm of Zt (see for instance [5, 8, 6]). An interesting question is whether the unified view of classification, regression, and uniclass can be exported and used with other algorithms for classification such as ROMMA [10] and ALMA [3]. An...

34 | A comparison of new and old algorithms for a mixture estimation problem - Helmbold, Singer, et al. - 1995
Citation Context: ...: B ≥ ‖xt‖∞²) and B = 16 for uniclass. The proof of the theorem is rather technical and uses the proof technique of Thm. 1 in conjunction with inequalities on the logarithm of Zt (see for instance [5, 8, 6]). An interesting question is whether the unified view of classification, regression, and uniclass can be exported and used with other algorithms for classification such as ROMMA [10] and ALMA [3]. An...

31 | Learning additive models online with fast evaluating kernels - Herbster - 2001
Citation Context: ...nalyze several online learning tasks through the same algorithmic prism. We first introduce a simple online algorithm which we call Passive-Aggressive (PA) for online binary classification (see also (Herbster, 2001)). We then propose two alternative modifications to the PA algorithm which improve the algorithm's ability to cope with noise. We provide a unified analysis for the three variants. Building on this u...
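The two noise-tolerant modifications mentioned in this context are the paper's PA-I and PA-II variants. PA-I caps the step size at an aggressiveness parameter C, so a single noisy example cannot move the weights arbitrarily far. A pure-Python sketch of that capped step (names illustrative):

```python
def pa1_update(w, x, y, C=1.0):
    """One PA-I step: like the basic PA update, but the step size tau
    is capped at C so noisy examples cause only bounded weight changes."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * dot)         # hinge loss on this round
    if loss == 0.0:                        # passive: margin satisfied
        return w
    tau = min(C, loss / sum(xi * xi for xi in x))  # PA-I cap on the step
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# With C = 0.1 the uncapped step loss/||x||^2 = 0.25 is clipped to 0.1,
# so the first coordinate moves by 0.1 * y * x[0] = 0.2.
w = pa1_update([0.0, 0.0], [2.0, 0.0], 1, C=0.1)
```

As C grows, PA-I approaches the basic PA update; smaller C trades per-round aggressiveness for robustness to label noise.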

18 | Learning to align polyphonic music - Shalev-Shwartz, Keshet, et al. - 2004

8 | The power of selective memory: Self-bounded learning of prediction suffix trees - Dekel, Shalev-Shwartz, et al. - 2004

1 | Discriminative reranking for natural language parsing - Crammer, Keshet, et al. - 2000

1 | From noise-free to noise-tolerant and from on-line to batch learning - Klasner, Simon - 1997