## A New Approximate Maximal Margin Classification Algorithm (2001)

Venue: | JOURNAL OF MACHINE LEARNING RESEARCH |

Citations: | 90 - 5 self |

### BibTeX

@MISC{Gentile01anew,

author = {Claudio Gentile},

title = {A New Approximate Maximal Margin Classification Algorithm},

year = {2001}

}

### Abstract

A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMA_p (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1) / (α² γ²)) corrections to separate the data with p-norm margin larger than (1 − α) γ, where γ is the (normalized) p-norm margin of the data. ALMA_p avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's Perceptron algorithm. We performed extensive experiments on both real-world and artificial datasets. We compared ALMA_2 (i.e., ALMA_p with p = 2) to standard Support Vector Machines (SVM) and to two incremental algorithms: the Perceptron algorithm and Li and Long's ROMMA. The accuracy levels achieved by ALMA_2 are superior to those achieved by the Perceptron algorithm and ROMMA, but slightly inferior to SVM's. On the other hand, ALMA_2 is much faster and easier to implement than standard SVM training algorithms. When learning sparse target vectors, ALMA_p with p > 2 largely outperforms Perceptron-like algorithms, such as ALMA_2.
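The correction-driven scheme summarized in the abstract can be illustrated with a minimal Python sketch for the Euclidean case (p = 2). This is not the paper's reference implementation. The function name `alma2_train`, the schedules γ_k = B/√k and η_k = C/√k, the constants B = √8/α and C = √2, the projection of the weight vector onto the unit ball, and the toy data are all assumptions made for illustration; none of them is stated in this record.

```python
import math


def alma2_train(examples, alpha=0.9, epochs=50):
    """Hypothetical sketch of an ALMA-style learner for p = 2.

    examples: list of (x, y) with x a list of floats and y in {-1, +1}.
    alpha:    accuracy parameter in (0, 1]; smaller values demand a larger margin.
    Returns the final weight vector and the number of corrections made.
    """
    B = math.sqrt(8.0) / alpha  # assumed margin-threshold constant
    C = math.sqrt(2.0)          # assumed learning-rate constant
    w = [0.0] * len(examples[0][0])
    k = 1                       # index of the next correction
    corrections = 0
    for _ in range(epochs):
        changed = False
        for x, y in examples:
            norm_x = math.sqrt(sum(v * v for v in x))
            if norm_x == 0.0:
                continue        # a zero instance carries no direction
            x_hat = [v / norm_x for v in x]
            gamma_k = B / math.sqrt(k)  # threshold shrinks as k grows
            margin = y * sum(wi * xi for wi, xi in zip(w, x_hat))
            if margin <= (1.0 - alpha) * gamma_k:
                eta_k = C / math.sqrt(k)
                # Additive, Perceptron-like step on the normalized instance.
                w = [wi + eta_k * y * xi for wi, xi in zip(w, x_hat)]
                # Project back onto the unit ball if the step left it.
                norm_w = math.sqrt(sum(wi * wi for wi in w))
                if norm_w > 1.0:
                    w = [wi / norm_w for wi in w]
                k += 1
                corrections += 1
                changed = True
        if not changed:
            break               # a full pass without corrections
    return w, corrections


# Toy linearly separable data (hypothetical).
toy = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([3.0, 3.0], 1),
       ([-1.0, -2.0], -1), ([-2.0, -1.0], -1), ([-3.0, -1.0], -1)]
weights, n_corrections = alma2_train(toy, alpha=0.9)
```

Because every operation on the instances is a dot product, a kernelized variant would keep the corrected examples and their coefficients instead of an explicit weight vector; that variant is not shown here.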

### Citations

4052 | Pattern Classification and Scene Analysis - Duda, Hart - 1973 |
Citation Context: ...l evidence. 4. The way we handle multiclass classification problems is to reduce to a set of binary problems. As a matter of fact, natural multiclass versions of Perceptron-like algorithms do exist (e.g., Duda and Hart, 1973, Chap. 5). As in the one-versus-rest scheme, these algorithms associate one weight vector classifier with each class and predict according to the maximum output of these classifiers. Again, margin is ...

2958 | UCI repository of machine learning databases - Blake, Merz - 1998 |

2300 | Support-vector networks - Cortes, Vapnik - 1995 |
Citation Context: ...nd the hyperplane. If Euclidean norm is used to measure the distance then computing the maximal margin hyperplane corresponds to the, by now classical, Support Vector Machines (SVMs) training problem [3]. This task is naturally formulated as a quadratic programming problem. If an arbitrary norm p is used then such a task turns to a more general mathematical programming problem (see, e.g., [15, 16]) to ...

1641 | An Introduction to Support Vector Machines and Other Kernel-based Learning Methods - Cristianini, Shawe-Taylor - 2000 |

1065 | Fast Training of Support Vector Machines using Sequential Minimal Optimization - Platt - 1999 |
Citation Context: ...lightly worse than standard SVMs. On the other hand ALMA is much faster and easier to implement than standard SVMs training algorithms. For related work on SVMs (with p = 2), see Friess et al. [5], Platt [17] and references therein. The next section defines our major notation and recalls some basic preliminaries. In Section 3 we describe ALMA_p and claim its theoretical propert...

746 | Boosting the margin: a new explanation for the effectiveness of voting methods - Schapire, Freund, et al. - 1998 |
Citation Context: ...n the training phase for the N labels (recall that for Perceptron and ALMA_p with α = 1 a correction is the same as a mistaken trial). In Figures 3–5 we plotted a number of margin distribution graphs (Schapire et al., 1998) yielded when running ALMA_p on various datasets. For binary classification tasks the margin distribution of a (binary) classifier w with ||w||_q ≤ 1 is the fraction of examples (x, y) ∈ X × {−1, +1} in the...

691 | The Weighted Majority Algorithm - Littlestone, Warmuth - 1994 |
Citation Context: ...n of the instance space (Gentile and Littlestone, 1999) ALMA_p yields results similar to multiplicative algorithms, such as Littlestone’s Winnow (Littlestone, 1988) and the Weighted Majority algorithm (Littlestone and Warmuth, 1994; Grove et al., 2001). The associated margin-dependent generalization bounds are very close to those obtained by estimators based on linear programming (e.g., Mangasarian, 1968; Anthony and Bartlett, ...

681 | Learning Quickly When Irrelevant Attributes Abound. Machine Learning 2(4):285–318 - Littlestone - 1988 |
Citation Context: ...neral norm p seems to require numerical methods. 2 We assume that w · x = 0 yields a wrong classification, independent of y. ... learning literature. We focus on an on-line learning model introduced by Littlestone [14]. An on-line learning algorithm processes the examples one at a time in trials. In each trial, the algorithm observes an instance x and is required to predict the label y associated with x. We denote the ...

661 | Queries and concept learning - Angluin - 1988 |

588 | Solving multi-class learning problems via error-correcting output codes - Dietterich, Bakiri - 1995 |

442 | Reducing multiclass to binary: A unifying approach for margin classifiers - Allwein, Schapire, et al. |

419 | Large Margin Classification using the Perceptron Algorithm - Freund, Schapire - 1999 |
Citation Context: ...cond contribution of this paper is an experimental investigation of ALMA on the problem of handwritten digit recognition. For the sake of comparison, we followed the experimental setting described in [3, 4, 12]. We ran ALMA with polynomial kernels, using both the last and the voted hypotheses (as in [4]), and we compared our results to those described in [3, 4, 12]. We found that voted ALMA generalizes quit...

317 | Backpropagation applied to handwritten Zip code recognition - LeCun, Boser, et al. - 1989 |

292 | Theoretical foundations of the potential function method in pattern recognition learning - Aizerman, Braverman, et al. - 1964 |
Citation Context: ...ider all examples at once. ALMA_2 (i.e., ALMA_p with p = 2) is a perceptron-like algorithm; the operations it performs can be expressed as dot products, so that we can replace them by kernel functions (Aizerman et al., 1964). ALMA_2 approximately solves the SVM training problem, avoiding quadratic programming. Unlike previous approaches (Cortes and Vapnik, 1995; Osuna et al., 1997; Joachims, 1998; Friess et al., 1998; Pl...

276 | Large margin DAGs for multiclass classification - Platt, Cristianini, et al. |

266 | An improved training algorithm for support vector machines. Neural networks for signal processing VII - Osuna, Freund, et al. - 1997 |
Citation Context: ...ing problems (e.g., Golding and Roth, 1996; Dagan et al., 1997). A fair amount of recent work on SVM centers on finding simple and efficient methods to solve maximal margin hyperplane problems (e.g., Osuna et al., 1997; Joachims, 1998; Friess et al., 1998; Platt, 1998; Kowalczyk, 1999; Keerthi et al., 1999; Li and Long, 1999). This paper follows that trend, giving two main contributions. The first contribution is a...

256 | Structural risk minimization over data-dependent hierarchies - Shawe-Taylor, et al. - 1998 |
Citation Context: ... and have spurred voluminous work in Machine Learning, both theoretical and experimental. The remarkable generalization ability exhibited by SVM can be explained through margin-based VC theory (e.g., Shawe-Taylor et al., 1998; Anthony and Bartlett, 1999; Vapnik, 1998; Cristianini and Shawe-Taylor, 2000, and references therein). At the core of SVM lies the problem of finding the so-called maximal margin hyperplane. Briefly, ...

252 | Principles of neurodynamics: perceptrons and the theory of brain mechanisms - Rosenblatt - 1962 |
Citation Context: ... The (unique) inverse f⁻¹ of f is [6] ..., where ...; namely, f⁻¹ is obtained from f by replacing q with p. 3 The p-norm Perceptron algorithm is a generalization of the classical Perceptron algorithm [18]: p-norm Perceptron is actually Perceptron when p = 2. ... Algorithm ALMA_p(α; B, C) with α ∈ (0, 1], B, C > 0. Initialization: Initial weight vector w1 = 0; k = 1. ...

242 | Efficient Pattern Recognition Using a New Transformation Distance - Simard, Le Cun, et al. - 1993 |

169 | Pattern Classification and Scene Analysis - Duda, Hart - 1973 |

135 | Additive versus exponentiated gradient updates for linear prediction - Kivinen, Warmuth - 1995 |
Citation Context: ...α < 1, one can prove a bound on the expected generalization error of “avg” by first proving a bound on the expected hinge loss (Gentile and Warmuth, 2001) and then applying a simple convexity argument (Kivinen and Warmuth, 1997). ... binary classifier on a new instance x is output_i(x) = w^(i)_{m^(i)+1} · x, where w^(i)_{m^(i)+1} is the last weight vector produced during training by the i-...

117 | Input space vs. feature space in kernel-based methods - Schölkopf, Mika, et al. - 1999 |

101 | Mistake-driven learning in text categorization - Dagan, Karov, et al. - 1997 |
Citation Context: ...when the target to be learned is sparse (i.e., when the target has many irrelevant features). This is often the case in a number of natural language processing problems (e.g., Golding and Roth, 1996; Dagan et al., 1997). A fair amount of recent work on SVM centers on finding simple and efficient methods to solve maximal margin hyperplane problems (e.g., Osuna et al., 1997; Joachims, 1998; Friess et al., 1998; Platt...

97 | The kernel adatron algorithm: a fast and simple learning procedure for support vector machines - Friess, Cristianini, et al. - 1998 |
Citation Context: ...way. The accuracy of our algorithm is slightly worse than SVMs’. On the other hand, our algorithm is much faster and easier to implement than previous implementations of SVMs, such as those given in [17, 5]. An interesting feature of ALMA is that its approximate solution relies on fewer support vectors than the SVM solution. We found the accuracy of 1.77 for ALMA (1.0) fairly remarkable, considering th...

91 | Comparison of learning algorithms for handwritten digit recognition - LeCun, Jackel, et al. - 1995 |
Citation Context: ...value in {0, 1, ..., 255}, representing a grey level. The database has 60000 training examples and 10000 test examples. The best accuracy results for this dataset are those obtained by LeCun et al. [11] through boosting on top of the neural net LeNet4. They reported a test error rate of 0.7%. A soft margin SVM achieved an error rate of 1.1% [3]. In our experiments we used ALMA ... with different val...

83 | General convergence results for linear discriminant updates - Grove, Littlestone, et al. |
Citation Context: ...a finite number of steps to (an approximation of) the maximal margin hyperplane for S. 3 The approximate large margin algorithm ALMA_p. ALMA_p is a large margin variant of the p-norm Perceptron algorithm 3 [8, 6], and is similar in spirit to the variable learning rate algorithms introduced in [2]. We analyze ALMA_p by giving upper bounds on the number of corrections. The main theoretical result of this paper is...

79 | The perceptron: A model for brain functioning - Block - 1962 |
Citation Context: ...when p = 2 this bound is very similar to the one proven by Li and Long for a version of ... 2. The p-norm Perceptron algorithm is a generalization of the classical Perceptron algorithm (Rosenblatt, 1962; Block, 1962; Novikov, 1962): p-norm Perceptron is actually Perceptron when p = 2. ... Algorithm ALMA_p(α; B, C) with α ∈ (0, 1], B, C > 0. Initialization: Initial weight vector w1 = 0; k = 1...

76 | Boosting the Margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics - Schapire, Freund, et al. - 1998 |

74 | The relaxed online maximum margin algorithm - Li, Long - 2002 |
Citation Context: ...ic programming. As far as theoretical performance is concerned, ALMA achieves essentially the same bound on the number of corrections as the one obtained by a version of Li and Long’s ROMMA algorithm [12], though the two algorithms are different. 1 In the case that p is logarithmic in the dimension of the instance space (as in [6]) ALMA_p yields results which are similar to those obtained by estimators ba...

69 | A fast iterative nearest point algorithm for support vector machine classifier design - Keerthi, Shevade, et al. - 1999 |
Citation Context: ...t work on SVM centers on finding simple and efficient methods to solve maximal margin hyperplane problems (e.g., Osuna et al., 1997; Joachims, 1998; Friess et al., 1998; Platt, 1998; Kowalczyk, 1999; Keerthi et al., 1999; Li and Long, 1999). This paper follows that trend, giving two main contributions. The first contribution is a new efficient algorithm which approximates the maximal margin hyperplane w.r.t. norm p t...

65 | Adaptive and self-confident online learning algorithms - Auer, Gentile - 2000 |
Citation Context: ...g feature of ALMA_p is that its relevant parameters (such as the learning rate) are dynamically adjusted over time. In this sense, ALMA_p is a refinement of the on-line algorithms recently introduced in [2]. Moreover, ALMA_2 (i.e., ALMA_p with p = 2) is a perceptron-like algorithm; the operations it performs can be expressed as dot products, so that we can replace them by kernel function evaluations. ALMA appr...

64 | The robustness of the p-norm algorithms - Gentile |

55 | The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant - Kivinen, Warmuth, et al. - 1997 |

50 | On weak learning - Helmbold, Warmuth - 1995 |
Citation Context: ...ults are summarized in Table 1. As in [4], the output of a binary classifier is based on either the last hypothesis produced by the algorithms (denoted by “last” in Table 1) or Helmbold and Warmuth’s [9] leave-one-out voted hypothesis (denoted by “voted”). We refer the reader to [4] for details. We trained the algorithms by cycling up to 3 times (“epochs”) over the training set. All the results shown...

45 | Convex neural networks - Bengio, Roux, et al. - 2005 |

39 | Multisurface Method for Pattern Separation - Mangasarian - 1968 |
Citation Context: ... algorithm (Littlestone and Warmuth, 1994; Grove et al., 2001). The associated margin-dependent generalization bounds are very close to those obtained by estimators based on linear programming (e.g., Mangasarian, 1968; Anthony and Bartlett, 1999, Chap. 14). The second contribution of this paper is an experimental investigation of ALMA_p on both real-world and artificial datasets. In our experiments we emphasized the ...

38 | Linear hinge loss and average margin - Gentile, Warmuth - 1998 |
Citation Context: ... proven in the general (nonseparable) case. The bound of part 2 is in terms of the margin-based ... (Here a p-indexing for ... is understood). ... is called deviation in [4] and linear hinge loss in [7]. Notice that B and C in part 1 do not meet the requirements given in part 2. On the other hand, in the separable case B and C c...

28 | Mathematical programming in data mining - Mangasarian - 1997 |
Citation Context: ...g problem [3]. This task is naturally formulated as a quadratic programming problem. If an arbitrary norm p is used then such a task turns to a more general mathematical programming problem (see, e.g., [15, 16]) to be solved by general purpose (and computationally intensive) optimization methods. This more general task arises in feature selection problems when the target to be learned is sparse. A major the...

17 | Maximal margin perceptron - Kowalczyk - 1999 |
Citation Context: ... hyperplane for a set of examples (the training set). For this purpose, we use terminology and analytical tools from the on-line ... 1 In fact, algorithms such as ROMMA and the one contained in Kowalczyk [10] have been specifically designed for Euclidean norm. Any straightforward extension of these algorithms to a general norm p seems to require numerical methods. 2 We assume that w · x = 0 yields a wrong classifi...

12 | On convergence proofs on Perceptrons - Novikov - 1962 |
Citation Context: ...is bound is very similar to the one proven by Li and Long for a version of ... 2. The p-norm Perceptron algorithm is a generalization of the classical Perceptron algorithm (Rosenblatt, 1962; Block, 1962; Novikov, 1962): p-norm Perceptron is actually Perceptron when p = 2. ... Algorithm ALMA_p(α; B, C) with α ∈ (0, 1], B, C > 0. Initialization: Initial weight vector w1 = 0; k = 1...

9 | Large Margin DAGs for Multiclass Classification - Platt, Cristianini, et al. |

8 | Mathematical programming in data mining. Data Mining and Knowledge Discovery, 42(1):183–201 - Mangasarian - 1997 |

5 | Adaptive and self-confident on-line learning algorithms - Auer, Cesa-Bianchi, et al. - 2002 |

4 | UCI repository of machine learning databases - Blake, Merz - 1998 |
Citation Context: ...2, 6, 10 without kernels on the artificial ones. The real-world datasets are well-known OCR benchmarks: the USPS dataset (e.g., Le Cun et al., 1995), the MNIST dataset 7, and the UCI Letter dataset (Blake et al., 1998). The artificial datasets consist of examples generated by some random process according to the rules described in Section 4.4. For the sake of comparison, we tended to follow previous experimental set...

4 | Primitivism," by the Editors - v - 1982 |

4 | A new approximate maximal margin classification algorithm - Gentile - 2001 |

1 | The robustness of the p-norm algorithms - Gentile, Littlestone - 1999 |
Citation Context: ...ctions as the one obtained by a version of Li and Long’s ROMMA algorithm [12], though the two algorithms are different. 1 In the case that p is logarithmic in the dimension of the instance space (as in [6]) ALMA_p yields results which are similar to those obtained by estimators based on linear programming (see [1, Chapter 14]). The second contribution of this paper is an experimental investigation of ALM...

1 | From support vector machines to large margin classifiers - Li - 2000 |
Citation Context: ...n’s home page: http://www.research.att.com/~yann/ocr/mnist/. ... “Corr’s” give the total number of corrections made in the training phase for the 10 labels. The first three rows of Table 1 are taken from [4, 12, 13]. The first two rows refer to the Perceptron algorithm, 8 while the third one refers to the best 9 noise-controlled (NC) version of ROMMA, called “aggressive ROMMA”. Our own experimental results are g...

1 | The generalized adatron algorithm - Nachbar, Nossek, et al. - 1993 |
Citation Context: ...g problem [3]. This task is naturally formulated as a quadratic programming problem. If an arbitrary norm p is used then such a task turns to a more general mathematical programming problem (see, e.g., [15, 16]) to be solved by general purpose (and computationally intensive) optimization methods. This more general task arises in feature selection problems when the target to be learned is sparse. A major the...

1 | The robustness of the p-norm algorithms - Gentile, Littlestone - 1999 |
Citation Context: ... ALMA_2 achieves essentially the same bound on the number of corrections as the one obtained by a version of Li and Long’s ROMMA. In the case when p is logarithmic in the dimension of the instance space (Gentile and Littlestone, 1999) ALMA_p yields results similar to multiplicative algorithms, such as Littlestone’s Winnow (Littlestone, 1988) and the Weighted Majority algorithm (Littlestone and Warmuth, 1994; Grove et al., 2001). T...