## The Relaxed Online Maximum Margin Algorithm (2000)

### Download Links

- [www.comp.nus.edu.sg]
- [www.cs.cmu.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 73 (1 self)

### BibTeX

```bibtex
@ARTICLE{Li00therelaxed,
    author  = {Yi Li and Philip M. Long},
    title   = {The Relaxed Online Maximum Margin Algorithm},
    journal = {Machine Learning},
    year    = {2000}
}
```

### Abstract

We describe a new incremental algorithm for training linear threshold functions: the Relaxed Online Maximum Margin Algorithm, or ROMMA. ROMMA can be viewed as an approximation to the algorithm that repeatedly chooses the hyperplane that classifies previously seen examples correctly with the maximum margin. It is known that such a maximum-margin hypothesis can be computed by minimizing the length of the weight vector subject to a number of linear constraints. ROMMA works by maintaining a relatively simple relaxation of these constraints that can be efficiently updated. We prove a mistake bound for ROMMA that is the same as that proved for the perceptron algorithm. Our analysis implies that the more computationally intensive maximum-margin algorithm also satisfies this mistake bound; this is the first worst-case performance guarantee for this algorithm. We describe some experiments using ROMMA and a variant that updates its hypothesis more aggressively as batch algorithms to recognize handwr...
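As a concrete companion to the abstract, here is a small, self-contained sketch of the (aggressive) ROMMA update in the linearly separable, zero-threshold setting. The closed-form coefficients are our reconstruction of the projection step the abstract alludes to (project the current weight vector onto the intersection of the new example's margin constraint and a half-space relaxation of the old constraints); the variable names, the initialization, and the degenerate-case fallback are ours, not quoted from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def romma_update(w, x, y):
    """One (aggressive) ROMMA step. If y*(w.x) >= 1 the constraint holds
    and w is kept; otherwise w is replaced by the closest vector w' (in
    Euclidean norm) satisfying both y*(w'.x) >= 1 and (w'.w) >= ||w||^2.
    The closed form w' = c*w + d*x below is our reconstruction."""
    wx, ww, xx = dot(w, x), dot(w, w), dot(x, x)
    if y * wx >= 1:
        return w                        # constraint already satisfied
    denom = xx * ww - wx * wx           # > 0 unless x is parallel to w
    if denom == 0:
        # degenerate case: fall back to a plain perceptron step
        return [wi + y * xi for wi, xi in zip(w, x)]
    c = (xx * ww - y * wx) / denom
    d = ww * (y - wx) / denom
    return [c * wi + d * xi for wi, xi in zip(w, x)]

def train_romma(examples, epochs=20):
    """Run the online update repeatedly over a (separable) sample."""
    x0, y0 = examples[0]
    # initialize so the first constraint holds with equality: w = y0*x0/||x0||^2
    w = [y0 * xi / dot(x0, x0) for xi in x0]
    for _ in range(epochs):
        for x, y in examples:
            w = romma_update(w, x, y)
    return w
```

On a separable sample, repeated passes push every margin `y*dot(w, x)` toward at least 1; one can check algebraically that after an update the touched example satisfies its constraint with equality.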

### Citations

9002 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context: ...and standard QP routines have substantial memory requirements. One direction of the development of simple solutions to SVMs is centered on splitting the problem into a series of smaller-size subtasks [30, 37, 17, 16]. Another direction is focused on an iterative algorithm for training SVMs, which will be discussed below. SMO (Sequential Minimal Optimization) proposed by Platt [32] works on the Wolfe dual problem ...

2175 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ...ve-one-out method to vote different prediction vectors produced by ROMMA is analyzed and discussed in the companion paper [23]. We conducted experiments similar to those performed by Cortes and Vapnik [7] and Freund and Schapire [9] on the problem of handwritten digit recognition. We tested the standard perceptron algorithm, the voted perceptron algorithm (for details, see [9]) and our new algorithm, ...

1295 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
- 1992
Citation Context: ...the threshold is fixed at 0. Keywords: Online Learning, Large Margin Classifiers, Perceptrons, Support Vector Machines. 1. Introduction The perceptron algorithm [33, 34] and the maximum-margin classifier [3] have similar theoretical bases, but different strengths. In the case of linearly separable data, Block [2], Novikoff [29] and Minsky and Papert [27] showed that the number of mistakes made by the perce...

1082 | Practical Methods of Optimization
- Fletcher
- 1987
Citation Context: ...the weighted sum of examples on which mistakes occur during the training. In particular, each $\vec{w}_t$ is represented as $$\vec{w}_t = \Bigl(\prod_{j=1}^{t-1} c_j\Bigr)\vec{w}_1 + \sum_{j=1}^{t-1}\Bigl(\prod_{n=j+1}^{t-1} c_n\Bigr)\, d_j\, \vec{x}_j, \qquad (8)$$ where $\vec{w}_1$ is the initial weight vector. If we let $$\lambda_0 = \prod_{j=1}^{t-1} c_j \qquad (9)$$ and $$\lambda_j = \Bigl(\prod_{n=j+1}^{t-1} c_n\Bigr)\, d_j, \quad 1 \le j \le t-1, \qquad (10)$$ then (8) can be written as $\vec{w}_t = \lambda_0\,\vec{w}_1 + \sum_{j=1}^{t-1} \lambda_j\,\vec{x}_j$ ...
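The expansion in (8)–(10) is exactly what makes a kernelized implementation of such updates feasible: one never stores $\vec{w}_t$ explicitly, only the coefficients of the stored examples, and every inner product goes through the kernel. A minimal bookkeeping sketch (class and method names are ours; it assumes $\vec{w}_1 = 0$, so the $\lambda_0$ term drops out of predictions):

```python
class KernelExpansion:
    """Implicitly maintain w_t = lam0*w1 + sum_j lam_j * x_j through updates
    of the form w_{t+1} = c*w_t + d*x_t, so that inner products (w_t . x)
    can be computed from kernel evaluations alone."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.lam0 = 1.0        # coefficient of the initial vector w1 (assumed 0)
        self.support = []      # stored examples x_j
        self.lam = []          # their coefficients lam_j

    def inner(self, x):
        # (w_t . x) = sum_j lam_j * K(x_j, x); the w1 term vanishes for w1 = 0
        return sum(l * self.kernel(s, x) for l, s in zip(self.lam, self.support))

    def update(self, c, d, x):
        # w_{t+1} = c*w_t + d*x: rescale every old coefficient, append one new
        self.lam0 *= c
        self.lam = [c * l for l in self.lam]
        self.support.append(list(x))
        self.lam.append(d)
```

With the linear kernel this agrees with updating an explicit weight vector; with a polynomial kernel the same code runs the update in the expanded feature space.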

1007 | Fast training of support vector machines using sequential minimal optimization
- Platt
- 1999
Citation Context: ...the experimental results of ROMMA and an aggressive variant of ROMMA with the perceptron and the voted perceptron algorithms. We also discuss scaling of the features in this section. Some related work [32, 11, 18, 21] is discussed in Section 5. We conclude with Section 6. 2. A mistake-bound analysis 2.1. The online algorithms For concreteness, our analysis will concern the case in which instances (also called patt...

804 | Estimation of Dependences Based on Empirical Data
- Vapnik
- 1982
Citation Context: ...as isn't considered. 1 Introduction The perceptron algorithm [10, 11] is well-known for its simplicity and effectiveness in the case of linearly separable data. Vapnik's support vector machines (SVM) [13] use quadratic programming to find the weight vector that classifies all the training data correctly and maximizes the margin, i.e. the minimal distance between the separating hyperplane and the insta...

790 | The perceptron: A probabilistic model for information storage and organization in the brain
- Rosenblatt
- 1958
Citation Context: ...ance of ROMMA converges to that of SVM if the threshold is fixed at 0. Keywords: Online Learning, Large Margin Classifiers, Perceptrons, Support Vector Machines. 1. Introduction The perceptron algorithm [33, 34] and the maximum-margin classifier [3] have similar theoretical bases, but different strengths. In the case of linearly separable data, Block [2], Novikoff [29] and Minsky and Papert [27] showed that the...

674 | Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm
- Littlestone
- 1988
Citation Context: ...but it is easy to see that our analysis generalizes to arbitrary inner product spaces, and therefore that our results also apply when kernel functions are used. In the standard online learning model [24], learning proceeds in trials. In the t-th trial, the algorithm is first presented with an instance $\vec{x}_t \in \mathbb{R}^n$. Next, the algorithm outputs a prediction $\hat{y}_t$ of the classification of $\vec{x}_t$. Finally, t...

467 | Making large-scale support vector machine learning practical
- Joachims
- 1999
Citation Context: ...and standard QP routines have substantial memory requirements. One direction of the development of simple solutions to SVMs is centered on splitting the problem into a series of smaller-size subtasks [30, 37, 17, 16]. Another direction is focused on an iterative algorithm for training SVMs, which will be discussed below. SMO (Sequential Minimal Optimization) proposed by Platt [32] works on the Wolfe dual problem ...

412 | Large margin classification using the perceptron algorithm
- Freund, Schapire
- 1999
Citation Context: ...n our experiments is much better on average. Moreover, ROMMA can be applied with kernel functions. We conducted experiments similar to those performed by Cortes and Vapnik [2] and Freund and Schapire [3] on the problem of handwritten digit recognition. We tested the standard perceptron algorithm, the voted perceptron algorithm (for details, see [3]) and our new algorithm, using the polynomial kernel ...

254 | An improved training algorithm for support vector machines
- Osuna, Freund, et al.
- 1997
Citation Context: ...and standard QP routines have substantial memory requirements. One direction of the development of simple solutions to SVMs is centered on splitting the problem into a series of smaller-size subtasks [30, 37, 17, 16]. Another direction is focused on an iterative algorithm for training SVMs, which will be discussed below. SMO (Sequential Minimal Optimization) proposed by Platt [32] works on the Wolfe dual problem ...

242 | Principles of neurodynamics; perceptrons and the theory of brain mechanisms
- Rosenblatt
- 1962
Citation Context: ...ance of ROMMA converges to that of SVM if the threshold is fixed at 0. Keywords: Online Learning, Large Margin Classifiers, Perceptrons, Support Vector Machines. 1. Introduction The perceptron algorithm [33, 34] and the maximum-margin classifier [3] have similar theoretical bases, but different strengths. In the case of linearly separable data, Block [2], Novikoff [29] and Minsky and Papert [27] showed that the...

107 | Mistake bounds and logarithmic linear-threshold learning algorithms. Unpublished doctoral dissertation
- Littlestone
- 1989
Citation Context: ...$y_t(\vec{w}_t \cdot \vec{x}_t) < 1$, not just after mistakes. 2.2. Upper bound on the number of mistakes made Now we prove a bound on the number of mistakes made by ROMMA. As in previous mistake bound proofs (e.g. [26]), we will show that mistakes result in an increase in a "measure of progress", and then appeal to a bound on the total possible progress. Our proof will use the squared length of $\vec{w}_t$ as its measure...

95 | The kernel-Adatron algorithm: a fast and simple learning procedure for support vector machines
- Frieß, Cristianini, et al.
- 1998
Citation Context: ...the experimental results of ROMMA and an aggressive variant of ROMMA with the perceptron and the voted perceptron algorithms. We also discuss scaling of the features in this section. Some related work [32, 11, 18, 21] is discussed in Section 5. We conclude with Section 6. 2. A mistake-bound analysis 2.1. The online algorithms For concreteness, our analysis will concern the case in which instances (also called patt...

87 | Comparison of learning algorithms for handwritten digit recognition
- LeCun, Jackel, et al.
- 1995
Citation Context: ...4. Experiments We did some experiments using the perceptron algorithm, ROMMA and aggressive ROMMA as batch algorithms on the MNIST OCR database. LeCun et al. [22] have published a detailed comparison of the performance of some of the best algorithms on this dataset. The best test error rate they achieve is 0.7%, through boosting on top of the neural net LeNet4...

86 | From on-line to batch learning
- Littlestone
- 1989
Citation Context: ...ctor is used to predict labels outside the training set. There are also several other ways to decide on the best prediction rule given the sequence of different classifiers that the algorithm generates [12, 25, 15, 9]. The majority voting method proposed by Freund and Schapire [9], applying the leave-one-out method of Helmbold and Warmuth [15], has the effect of improving the distribution of margins of the training...

82 | Model selection for support vector machines
- Chapelle, Vapnik
- 2000
Citation Context: ...the entropy number of this operator which serves as capacity control may be minimized over the different choices of scaling factors on the corresponding components of the feature space [14, 6, 39]. There is no efficient method to obtain the optimal scaling for polynomial kernel functions. In the experiment we only tried no scaling, a scaling factor of 255, and a scaling factor of 1100 for the per...

78 | The Perceptron: A model for brain functioning
- Block
- 1962
Citation Context: ...Machines. 1. Introduction The perceptron algorithm [33, 34] and the maximum-margin classifier [3] have similar theoretical bases, but different strengths. In the case of linearly separable data, Block [2], Novikoff [29] and Minsky and Papert [27] showed that the number of mistakes made by the perceptron algorithm is upper bounded by a function of the margin, i.e. the minimal distance from any instance ...

75 | Boosting the margin: A new explanation for the effectiveness of voting methods
- Schapire, Freund, et al.
- 1998
Citation Context: ...by Freund and Schapire [9], applying the leave-one-out method of Helmbold and Warmuth [15], has the effect of improving the distribution of margins of the training examples. For detailed analysis, see [35]. Experiments show that the voted perceptron algorithm has better performance than the standard perceptron algorithm [9]. In this paper, the final prediction vector of ROMMA is used to predict labels of...

70 | Single-layer learning revisited: a stepwise procedure for building and training a neural network
- Knerr, Personnaz, et al.
- 1990
Citation Context: ...se $b_i = 1$. Classification of a test pattern is done according to the maximum output of these ten classifiers. There are some other ways to combine many two-class classifiers into a multiclass classifier [31, 10, 20]. To produce output given a test instance $\vec{x}$, besides using the final hypothesis, we also tried the "voting" method to convert the standard perceptron algorithm to a batch learning. The "voting" method ...

67 | A fast iterative nearest point algorithm for support vector machine design
- Keerthi, Shevde, et al.
- 2000
Citation Context: ...algorithm performed better than the standard perceptron algorithm, had slightly better performance than the voted perceptron. For some other research with aims similar to ours, we refer the reader to [9, 4, 5, 6]. The paper is organized as follows. In Section 2, we describe ROMMA in enough detail to determine its predictions, and prove a mistake bound for it. In Section 3, we describe ROMMA in more detail. In...

50 | On weak learning
- Helmbold, Warmuth
- 1995
Citation Context: ...ctor is used to predict labels outside the training set. There are also several other ways to decide on the best prediction rule given the sequence of different classifiers that the algorithm generates [12, 25, 15, 9]. The majority voting method proposed by Freund and Schapire [9], applying the leave-one-out method of Helmbold and Warmuth [15], has the effect of improving the distribution of margins of the training...

46 | Solving the quadratic programming problem arising in support vector classification
- Kaufman
- 1998

37 | Optimal linear discriminants
- Gallant
- 1986
Citation Context: ...ctor is used to predict labels outside the training set. There are also several other ways to decide on the best prediction rule given the sequence of different classifiers that the algorithm generates [12, 25, 15, 9]. The majority voting method proposed by Freund and Schapire [9], applying the leave-one-out method of Helmbold and Warmuth [15], has the effect of improving the distribution of margins of the training...

29 | Uniqueness of the SVM solution
- Burges, Crisp
- 1999
Citation Context: ...$\frac{1 - y_t(\vec{w}_t \cdot \vec{x}_t)}{\|\vec{x}_t\|} \ge \frac{1}{\|\vec{x}_t\|} \ge \frac{1}{R}$, (3) since the fact that there was a mistake in trial $t$ implies $y_t(\vec{x}_t \cdot \vec{w}_t) \le 0$. As shown in Figure 2, since $\vec{w}_{t+1} \in A_t$, $\|\vec{w}_{t+1} - \vec{w}_t\| \ge \delta_t$. (4) Because $\vec{w}_t$ is the normal vector of $B_t$ and $\vec{w}_{t+1} \in B_t$, we have $\|\vec{w}_{t+1}\|^2 = \|\vec{w}_t\|^2 + \|\vec{w}_{t+1} - \vec{w}_t\|^2$. Thus, applying (3) and (4), we h...
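Chaining (3) and (4) with the Pythagorean identity in this excerpt yields the perceptron-style mistake bound claimed in the abstract. The following completion is our reconstruction of the standard argument, not a quotation from the paper:

```latex
\|\vec{w}_{t+1}\|^2 \;=\; \|\vec{w}_t\|^2 + \|\vec{w}_{t+1}-\vec{w}_t\|^2
\;\ge\; \|\vec{w}_t\|^2 + \frac{1}{R^2},
```

so each mistake raises $\|\vec{w}_t\|^2$ by at least $1/R^2$, and after $m$ mistakes $\|\vec{w}\|^2 \ge m/R^2$. If some $\vec{u}$ satisfies $y_t(\vec{u}\cdot\vec{x}_t) \ge 1$ for all $t$, then $\vec{u}$ meets every constraint the algorithm maintains, and since $\vec{w}_t$ is the shortest vector meeting those constraints, $\|\vec{w}_t\| \le \|\vec{u}\|$. Combining the two bounds gives $m/R^2 \le \|\vec{u}\|^2$, i.e. $m \le R^2\|\vec{u}\|^2$.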

27 | A Framework for Structural Risk Minimization
- Shawe-Taylor, Bartlett, et al.
- 1996
Citation Context: ...completes the proof. Since, as is easily proved by induction, for all $t$, $P_t \subseteq H_t$, we have the following, which complements analyses of the maximum margin algorithm using independence assumptions [3, 38, 36]. THEOREM 4. Choose $m \in \mathbb{N}$, and a sequence $(\vec{x}_1, y_1), \ldots, (\vec{x}_m, y_m)$ of pattern-classification pairs in $\mathbb{R}^n \times \{-1, +1\}$. Let $R = \max_t \|\vec{x}_t\|$. If there is a weight vector $\vec{u}$ such that $y_t(\vec{u}$ ...

19 | Covering numbers for support vector machines
- Guo, Bartlett, et al.
- 2002
Citation Context: ...the entropy number of this operator which serves as capacity control may be minimized over the different choices of scaling factors on the corresponding components of the feature space [14, 6, 39]. There is no efficient method to obtain the optimal scaling for polynomial kernel functions. In the experiment we only tried no scaling, a scaling factor of 255, and a scaling factor of 1100 for the per...

17 | From noise-free to noise-tolerant and from on-line to batch learning
- Klasner, Simon
- 1995
Citation Context: ...et al. [11] suggested the use of quadratic penalty in the cost function, which can be implemented using a slightly different kernel function [11, 18] (see also [19]): $\tilde{K}(x_k, x_j) = K(x_k, x_j) + \lambda\,\delta_{kj}$, where $\delta_{kj}$ is the Kronecker delta function and $\lambda$ is a predefined parameter. The last rows in Table I and Table II are the results of aggressive ROMMA using this me...
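The modified kernel quoted above amounts to adding a constant to the diagonal of the Gram matrix. A one-function sketch (the penalty symbol did not survive extraction in the excerpt, so `lam` is our name for it):

```python
def gram_with_penalty(kernel, X, lam):
    """Gram matrix of K~(x_k, x_j) = K(x_k, x_j) + lam * delta_kj,
    i.e. the quadratic slack penalty seen as a ridge on the diagonal."""
    n = len(X)
    return [[kernel(X[k], X[j]) + (lam if k == j else 0.0)
             for j in range(n)]
            for k in range(n)]
```

Training any kernel algorithm on the penalized Gram matrix instead of the original one implements the quadratic-penalty soft margin without changing the algorithm itself.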

17 | Maximal margin perceptron
- Kowalczyk
- 1999
Citation Context: ...the experimental results of ROMMA and an aggressive variant of ROMMA with the perceptron and the voted perceptron algorithms. We also discuss scaling of the features in this section. Some related work [32, 11, 18, 21] is discussed in Section 5. We conclude with Section 6. 2. A mistake-bound analysis 2.1. The online algorithms For concreteness, our analysis will concern the case in which instances (also called patt...

13 | On convergence proofs for perceptrons
- Novikoff
- 1962
Citation Context: ...Introduction The perceptron algorithm [33, 34] and the maximum-margin classifier [3] have similar theoretical bases, but different strengths. In the case of linearly separable data, Block [2], Novikoff [29] and Minsky and Papert [27] showed that the number of mistakes made by the perceptron algorithm is upper bounded by a function of the margin, i.e. the minimal distance from any instance to the separat...

9 | Another approach to polychotomous classification
- Friedman
- 1996
Citation Context: ...$$\vec{w}_t = \Bigl(\prod_{j=1}^{t-1} c_j\Bigr)\vec{w}_1 + \sum_{j=1}^{t-1}\Bigl(\prod_{n=j+1}^{t-1} c_n\Bigr)\, d_j\, \vec{x}_j, \qquad (8)$$ where $\vec{w}_1$ is the initial weight vector. If we let $$\lambda_0 = \prod_{j=1}^{t-1} c_j \qquad (9)$$ and $$\lambda_j = \Bigl(\prod_{n=j+1}^{t-1} c_n\Bigr)\, d_j, \quad 1 \le j \le t-1, \qquad (10)$$ then (8) can be written as $\vec{w}_t = \lambda_0\,\vec{w}_1 + \sum_{j=1}^{t-1} \lambda_j\,\vec{x}_j$. Formula (8) may seem daunting; however, making use of the recurrence $(\vec{w}_{t+1} \cdot \vec{x}) = c_t(\vec{w}_t \cdot \vec{x}) + d_t(\vec{x}_t \cdot \vec{x})$, it is o...

8 | Minimizing the quadratic form on a convex set
- Gilbert
- 1966
Citation Context: ...to compute the nearest point and proved a convergence rate of $\frac{2R^2\|u\|^2}{\delta^2}\ln\frac{R\|u\|}{\delta}$, where $R$, $u$ and $\delta$ represent the same as in Theorem 5. Keerthi et al. combined and modified two known algorithms [13, 28] of solving the nearest point problem and only proved its convergence. Both of their algorithms implement the corresponding soft-margin SVMs by introducing a quadr...

8 | Large margin DAGs for multiclass classification
- Platt, Cristianini, et al.
- 2000
Citation Context: ...se $b_i = 1$. Classification of a test pattern is done according to the maximum output of these ten classifiers. There are some other ways to combine many two-class classifiers into a multiclass classifier [31, 10, 20]. To produce output given a test instance $\vec{x}$, besides using the final hypothesis, we also tried the "voting" method to convert the standard perceptron algorithm to a batch learning. The "voting" method ...

7 | Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control
- Aizerman, Braverman, Rozonoer
- 1964
Citation Context: ...ived and the algorithm must update its hypothesis before making the next prediction. Both the perceptron algorithm and the maximum-margin algorithm can be applied in conjunction with kernel functions [1, 3] to enable the efficient use of large collections of features that are functions of a problem's raw features. After the patterns are embedded into the expanded feature space, the data is often linearly s...

4 | Simple Learning Algorithms for Training Support Vector Machines
- 1998
Citation Context: ...s as in [7], that is, they solve the dual problem (17), (18) and (19). The training times of SMO and KA are shown to be subquadratic in the number of training examples in the experiments conducted in [5], but there is no theoretical analysis of the convergence rate yet. Recently a method of training SVMs based on computing the nearest point between two convex polytopes was independently proposed by K...

3 | Selective voting for perceptron-like online learning
- 2000
Citation Context: ...vector of ROMMA is used to predict labels of the test set, and how to apply the leave-one-out method to vote different prediction vectors produced by ROMMA is analyzed and discussed in the companion paper [23]. We conducted experiments similar to those performed by Cortes and Vapnik [7] and Freund and Schapire [9] on the problem of handwritten digit recognition. We tested the standard perceptron algorithm,...

1 | Large margin classification using the perceptron algorithm
- Freund, Schapire
- 1998
Citation Context: ...hat the number of mistakes made by the perceptron algorithm is upper bounded by a function of the margin, i.e. the minimal distance from any instance to the separating hyperplane. Freund and Schapire [9] generalized this result to the inseparable case. The maximum-margin algorithm uses quadratic programming to find the weight vector that classifies all the training data correctly and maximizes the margi...

1 | A fast iterative nearest point algorithm for support vector machine classifier design
- Keerthi, Shevade, et al.
- 1999

1 | Perceptrons (expanded edition, 1988)
- Minsky, Papert
- 1969
Citation Context: ...n algorithm [33, 34] and the maximum-margin classifier [3] have similar theoretical bases, but different strengths. In the case of linearly separable data, Block [2], Novikoff [29] and Minsky and Papert [27] showed that the number of mistakes made by the perceptron algorithm is upper bounded by a function of the margin, i.e. the minimal distance from any instance to the separating hyperplane. Freund and ...

1 | Finding the point of a polyhedron closest to the origin
- Mitchell, Dem'yanov, Malozemov
- 1974
Citation Context: ...to compute the nearest point and proved a convergence rate of $\frac{2R^2\|u\|^2}{\delta^2}\ln\frac{R\|u\|}{\delta}$, where $R$, $u$ and $\delta$ represent the same as in Theorem 5. Keerthi et al. combined and modified two known algorithms [13, 28] of solving the nearest point problem and only proved its convergence. Both of their algorithms implement the corresponding soft-margin SVMs by introducing a quadr...

1 | Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (in press)
- Williamson, Smola, Schölkopf
Citation Context: ...the entropy number of this operator which serves as capacity control may be minimized over the different choices of scaling factors on the corresponding components of the feature space [14, 6, 39]. There is no efficient method to obtain the optimal scaling for polynomial kernel functions. In the experiment we only tried no scaling, a scaling factor of 255, and a scaling factor of 1100 for the per...