## General convergence results for linear discriminant updates (1997)

### Download Links

- [cs.ualberta.ca]
- [webdocs.cs.ualberta.ca]
- [papersdb.cs.ualberta.ca]
- [logos.uwaterloo.ca]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 83 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Grove97generalconvergence,
  author    = {Adam J. Grove and Nick Littlestone and Dale Schuurmans},
  title     = {General convergence results for linear discriminant updates},
  booktitle = {Machine Learning},
  year      = {1997},
  pages     = {171--183},
  publisher = {ACM Press}
}
```

### Abstract

The problem of learning linear-discriminant concepts can be solved by various mistake-driven update procedures, including the Winnow family of algorithms and the well-known Perceptron algorithm. In this paper we define the general class of “quasi-additive” algorithms, which includes Perceptron and Winnow as special cases. We give a single proof of convergence that covers a broad subset of algorithms in this class, including both Perceptron and Winnow, but also many new algorithms. Our proof hinges on analyzing a generic measure-of-progress construction that gives insight as to when and how such algorithms converge. Our measure-of-progress construction also permits us to obtain good mistake bounds for individual algorithms. We apply our unified analysis to new algorithms as well as existing algorithms. When applied to known algorithms, our method “automatically” produces close variants of existing proofs (recovering similar bounds)—thus showing that, in a certain sense, these seemingly diverse results are fundamentally isomorphic. However, we also demonstrate that the unifying principles are more broadly applicable, and analyze a new class of algorithms that smoothly interpolate between the additive-update behavior of Perceptron and the multiplicative-update behavior of Winnow.
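The abstract's “quasi-additive” family can be made concrete with a short sketch. The following is an illustrative reading, not the paper's own code: the learner keeps an accumulator z that is updated additively on each mistake, and the effective weights are obtained by applying a transfer function f componentwise to z. Choosing f(z) = z behaves like the Perceptron's additive update, while f(z) = e^z makes the effective weights change multiplicatively, in the style of Winnow. Function and variable names here are assumptions.

```python
import math

def quasi_additive_train(examples, f, lr=1.0, max_epochs=100):
    """Mistake-driven training with a quasi-additive update (sketch).

    examples: list of (x, y) pairs with x a list of floats, y in {-1, +1}.
    f: transfer function applied componentwise to the accumulator z;
       f(z) = z gives a Perceptron-style additive rule, f(z) = exp(z)
       a Winnow-style multiplicative one.
    """
    n = len(examples[0][0])
    z = [0.0] * n                       # additively updated accumulator
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            w = [f(zi) for zi in z]     # effective weights are f applied to z
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:          # mistake-driven: update only on error
                mistakes += 1
                for i in range(n):
                    z[i] += lr * y * x[i]
        if mistakes == 0:               # no mistakes in a pass: sample separated
            return z
    return z
```

With `f = math.exp`, a mistake multiplies each effective weight exp(z_i) by exp(±lr·x_i), which is the multiplicative behavior the abstract attributes to Winnow; intermediate choices of f are what allow interpolation between the two regimes.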

### Citations

3928 | Pattern Classification and Scene Analysis - Duda, Hart - 1973

Citation Context: ...Many iterative mistake-driven algorithms have been proposed for learning linear-discriminant concepts from examples, including the famous Perceptron algorithm (Rosenblatt, 1962; Minsky & Papert, 1969; Duda & Hart, 1973) and Littlestone’s Winnow family of algorithms (Littlestone, 1988, 1989, 1991; Kivinen, Warmuth & Auer, 1997). This is an important, well-studied collection of algorithms with interesting properties...

3280 | Variational Analysis - Rockafellar, Wets - 1998

Citation Context: ...is not so innocuous: for instance, even ‖z‖₂ is not differentiable at z = 0. The theory and language of Legendre-Fenchel transforms (Ellis, 1985) (also called conjugate functions in convex analysis (Rockafellar, 1970)) can simplify such issues. Briefly, the Legendre-Fenchel transform of H(z) is the function H∗(u) equal to the smallest value such that H(z) − u · z + H∗(u) is non-negative for all z. In particul...
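The verbal definition quoted in this context is the standard convex conjugate. As a sketch in conventional notation (symbols as in the quoted text):

```latex
H^{*}(u) \;=\; \sup_{z}\,\bigl(u \cdot z - H(z)\bigr),
```

equivalently, \(H^{*}(u)\) is the smallest value making \(H(z) - u \cdot z + H^{*}(u) \ge 0\) for every \(z\), exactly as the snippet states.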

674 | Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm - Littlestone - 1988

Citation Context: ...arning linear-discriminant concepts from examples, including the famous Perceptron algorithm (Rosenblatt, 1962; Minsky & Papert, 1969; Duda & Hart, 1973) and Littlestone’s Winnow family of algorithms (Littlestone, 1988, 1989, 1991; Kivinen, Warmuth & Auer, 1997). This is an important, well-studied collection of algorithms with interesting properties and practical applications (Blum, 1997; Dagan, Karov & Roth, 1997...

671 | The weighted majority algorithm - Littlestone, Warmuth - 1994

Citation Context: ...r results are surely very relevant and a step in the direction of an answer. Aside from the regression case, another different but seemingly related learning model is the so-called expert case (e.g., Littlestone & Warmuth, 1989; Vovk, 1990; Cesa-Bianchi et al., 1997; Cesa-Bianchi, Helmbold & Panizza, 1996), where there is only a single relevant attribute (that is, perfect classification can be accomplished by a discriminant...

314 | How to use expert advice - Cesa-Bianchi, Freund, et al. - 1997

Citation Context: ...step in the direction of an answer. Aside from the regression case, another different but seemingly related learning model is the so-called expert case (e.g., Littlestone & Warmuth, 1989; Vovk, 1990; Cesa-Bianchi et al., 1997; Cesa-Bianchi, Helmbold & Panizza, 1996), where there is only a single relevant attribute (that is, perfect classification can be accomplished by a discriminant with a single non-zero weight). Algori...

260 | The relaxation method for finding the common point of convex sets and its application to the solution of problems in convex programming - Bregman - 1967

Citation Context: ...ative, it is potentially useful as a measure of progress. The form of DH—namely, the difference between a convex function and the tangent plane at a chosen point—is often called a Bregman distance (Bregman, 1967; Censor & Zenios, 1997). Thus we call this the Bregman construction for measures of progress. The Bregman construction depends on a parameter η; in general, we get different measures of progress for...
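The “difference between a convex function and the tangent plane at a chosen point” described in this context is, in standard notation (the symbols u and z here are illustrative, not necessarily the paper's):

```latex
D_{H}(u, z) \;=\; H(u) - H(z) - \nabla H(z) \cdot (u - z) \;\ge\; 0
\qquad \text{for convex, differentiable } H,
```

which is non-negative precisely because a convex function lies above its tangent planes, and is therefore a candidate measure of progress, as the snippet says.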

256 | Parallel Optimization: Theory, Algorithms, and Applications - Censor, Zenios - 1997

Citation Context: ...tentially useful as a measure of progress. The form of DH—namely, the difference between a convex function and the tangent plane at a chosen point—is often called a Bregman distance (Bregman, 1967; Censor & Zenios, 1997). Thus we call this the Bregman construction for measures of progress. The Bregman construction depends on a parameter η; in general, we get different measures of progress for different choices of η,...

249 | Aggregating strategies - Vovk - 1990

Citation Context: ...evant and a step in the direction of an answer. Aside from the regression case, another different but seemingly related learning model is the so-called expert case (e.g., Littlestone & Warmuth, 1989; Vovk, 1990; Cesa-Bianchi et al., 1997; Cesa-Bianchi, Helmbold & Panizza, 1996), where there is only a single relevant attribute (that is, perfect classification can be accomplished by a discriminant with a sing...

247 | Exponentiated gradient versus gradient descent for linear predictors - Kivinen, Warmuth - 1997

243 | Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms - Rosenblatt - 1962

Citation Context: ...unds, Bregman divergence 1. Introduction Many iterative mistake-driven algorithms have been proposed for learning linear-discriminant concepts from examples, including the famous Perceptron algorithm (Rosenblatt, 1962; Minsky & Papert, 1969; Duda & Hart, 1973) and Littlestone’s Winnow family of algorithms (Littlestone, 1988, 1989, 1991; Kivinen, Warmuth & Auer, 1997). This is an important, well-studied collection...

151 | Learning Machines - Nilsson - 1965

Citation Context: ...a mistake bound of O(‖u‖₂²‖S‖₂²/δ²), where we use the fact that by the Cauchy-Schwarz inequality, δ ≤ ‖u‖₂‖S‖₂. Up to a constant factor, this is just the classic result (see Papert, 1961; Block, 1962; Nilsson, 1965; Minsky & Papert, 1969; Duda & Hart, 1973). The similarity is even deeper: our measure of progress is in fact very closely related to that used in Papert (1961), Minsky & Papert (1969). The main tech...
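For reference, the “classic result” this context alludes to is the standard Perceptron convergence (mistake) bound. In conventional notation — with the caveat that the roles of δ as the separation margin of a target vector u and ‖S‖₂ as a bound on the instance norms are inferred from the snippet, not stated in it — it reads:

```latex
\text{mistakes} \;\le\; \frac{\lVert u \rVert_2^{2} \, \lVert S \rVert_2^{2}}{\delta^{2}}.
```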

126 | Empirical support for winnow and weighted-majority algorithms: results on a calendar scheduling domain - Blum - 1997

Citation Context: ...mily of algorithms (Littlestone, 1988, 1989, 1991; Kivinen, Warmuth & Auer, 1997). This is an important, well-studied collection of algorithms with interesting properties and practical applications (Blum, 1997; Dagan, Karov & Roth, 1997; Golding & Roth, 1999; Khardon, Roth & Valiant, 1999). In this paper we define a general class of algorithms and provide a unified theoretical analysis which covers not onl...

115 | Relative loss bounds for on-line density estimation with the exponential family of distributions - Azoury, Warmuth - 2001

Citation Context: ...rms—are current in related literature, notably work on general theories of on-line regression learning (as opposed to classification learning, which we are considering here) (Kivinen & Warmuth, 1998; Azoury & Warmuth, 1999). We say somewhat more about this in Section 9. For these alternative constructions, we start with some candidate function H(z) = ψ(G(z)), where ψ is a monotonically increasing function such that H i...

107 | Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow - Littlestone - 1991

99 | Entropy, Large Deviations and Statistical Mechanics - Ellis - 1985

Citation Context: ...number of regularity assumptions, notably differentiability. This is not so innocuous: for instance, even ‖z‖₂ is not differentiable at z = 0. The theory and language of Legendre-Fenchel transforms (Ellis, 1985) (also called conjugate functions in convex analysis (Rockafellar, 1970)) can simplify such issues. Briefly, the Legendre-Fenchel transform of H(z) is the function H∗(u) equal to the smallest value...

98 | Mistake-driven learning in text categorization - Dagan, Karov, et al. - 1997

78 | The Perceptron: A model for brain functioning - Block - 1962

Citation Context: ...a mistake bound of O(‖u‖₂²‖S‖₂²/δ²), where we use the fact that by the Cauchy-Schwarz inequality, δ ≤ ‖u‖₂‖S‖₂. Up to a constant factor, this is just the classic result (see Papert, 1961; Block, 1962; Nilsson, 1965; Minsky & Papert, 1969; Duda & Hart, 1973). The similarity is even deeper: our measure of progress is in fact very closely related to that used in Papert (1961), Minsky & Papert (1969)...

72 | Relative loss bounds for multidimensional regression problems - Kivinen, Warmuth - 2001

Citation Context: ...ces, and Legendre transforms—are current in related literature, notably work on general theories of on-line regression learning (as opposed to classification learning, which we are considering here) (Kivinen & Warmuth, 1998; Azoury & Warmuth, 1999). We say somewhat more about this in Section 9. For these alternative constructions, we start with some candidate function H(z) = ψ(G(z)), where ψ is a monotonically increasin...

62 | The robustness of the p-norm algorithms - Gentile - 2003

54 | The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant - Kivinen, Warmuth, et al. - 1997

38 | Linear hinge loss and average margin - Gentile, Warmuth - 1998

35 | Relative loss bounds for single neurons - Helmbold, Kivinen, et al. - 1999

28 | Relational learning for NLP using linear threshold elements - Khardon, Roth, et al. - 1999

19 | Comparing several linear-threshold learning algorithms on tasks involving superfluous attributes - Littlestone - 1995

12 | A winnow-based approach to spelling correction - Golding, Roth - 1999

Citation Context: ...1989, 1991; Kivinen, Warmuth & Auer, 1997). This is an important, well-studied collection of algorithms with interesting properties and practical applications (Blum, 1997; Dagan, Karov & Roth, 1997; Golding & Roth, 1999; Khardon, Roth & Valiant, 1999). In this paper we define a general class of algorithms and provide a unified theoretical analysis which covers not only Perceptron and Winnow, but also many new algori...

8 | Mistake Bounds and Linear-threshold Learning Algorithms - Littlestone - 1989

Citation Context: ...(1989) is equivalent to another quasi-additive procedure defined by choosing f(z) = e^z and setting a = (1/2) log(1/β). Finally, the Fixed Threshold variant of the Winnow algorithm (Littlestone, 1988; Littlestone, 1989) can also be expressed in quasi-additive form, if we make one minor extension. In general, we can consider quasi-additive functions that use a different function fᵢ for each component. The Fixed Thre...

8 | Continuous and discrete-time nonlinear gradient descent: Relative loss bounds and convergence - Warmuth, Jagota - 1997

Citation Context: ...ere appears to be a much closer connection with the work of Warmuth and others, developed in Kivinen and Warmuth (1997) and several subsequent papers (Kivinen & Warmuth, 1998; Azoury & Warmuth, 1999; Warmuth & Jagota, 1998). In an extensive ongoing body of research, they apply related techniques to a variety of tasks. Their algorithms, like ours, base predictions on w · x for some weight vector w and instance x, but ty...

7 | On Bayes methods for on-line boolean prediction - Cesa-Bianchi, Helmbold, et al. - 1996

7 | An apobayesian relative of Winnow - Littlestone, Mesterharm - 1997

4 | Relative loss bounds and the exponential family of distributions - Azoury, Warmuth - 1998 (unpublished manuscript)

3 | Some mathematical models of learning - Papert - 1961

Citation Context: ...xample, for Perceptron, our method yields the same measure of progress used in one of the most famous proofs of Perceptron convergence (Papert, 1961; Minsky & Papert, 1969). When applied to the Winnow family, our construction leads to almost exactly the same measures of progress used by Littlestone in (1989). Thus, we show that, in a certain sens...

2 | Relative loss bounds for multiclass regression problems - Kivinen, Warmuth - 1997 (unpublished manuscript)

2 | Continuous time non-linear gradient descent: Convergence and relative loss bounds - Warmuth, Jagota - 1997 (unpublished manuscript)

1 | Perceptrons - Minsky, Papert - 1969