Results 1 - 10
of
29
Training Algorithms for Linear Text Classifiers
, 1996
"... Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used in training linear text classifiers. In contrast to most IR methods, theoretical analysis provides pe ..."
Abstract
-
Cited by 216 (12 self)
- Add to MetaCart
Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used in training linear text classifiers. In contrast to most IR methods, theoretical analysis provides performance guarantees and guidance on parameter settings for these algorithms. Experimental data is presented showing Widrow-Hoff and EG to be more effective than the widely used Rocchio algorithm on several categorization and routing tasks. 1 Introduction Document retrieval, categorization, routing, and filtering systems often are based on classification. That is, the IR system decides for each document which of two or more classes it belongs to, or how strongly it belongs to a class, in order to accomplish the IR task of interest. For instance, the two classes may be the documents relevant to and not relevant to a particular user, and the system may rank documents based on how likely it i...
Online Passive-Aggressive Algorithms
- JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
"... We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based ..."
Abstract
-
Cited by 181 (14 self)
- Add to MetaCart
We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
An introduction to boosting and leveraging
- Advanced Lectures on Machine Learning, LNCS
, 2003
"... ..."
General convergence results for linear discriminant updates
- Machine Learning
, 1997
"... Abstract. The problem of learning linear-discriminant concepts can be solved by various mistake-driven update procedures, including the Winnow family of algorithms and the well-known Perceptron algorithm. In this paper we define the general class of “quasi-additive ” algorithms, which includes Perce ..."
Abstract
-
Cited by 73 (0 self)
- Add to MetaCart
Abstract. The problem of learning linear-discriminant concepts can be solved by various mistake-driven update procedures, including the Winnow family of algorithms and the well-known Perceptron algorithm. In this paper we define the general class of “quasi-additive ” algorithms, which includes Perceptron and Winnow as special cases. We give a single proof of convergence that covers a broad subset of algorithms in this class, including both Perceptron and Winnow, but also many new algorithms. Our proof hinges on analyzing a generic measure of progress construction that gives insight as to when and how such algorithms converge. Our measure of progress construction also permits us to obtain good mistake bounds for individual algorithms. We apply our unified analysis to new algorithms as well as existing algorithms. When applied to known algorithms, our method “automatically ” produces close variants of existing proofs (recovering similar bounds)—thus showing that, in a certain sense, these seemingly diverse results are fundamentally isomorphic. However, we also demonstrate that the unifying principles are more broadly applicable, and analyze a new class of algorithms that smoothly interpolate between the additive-update behavior of Perceptron and the multiplicative-update behavior of Winnow.
Relative Loss Bounds for Multidimensional Regression Problems
- MACHINE LEARNING
, 2001
"... We study on-line generalized linear regression with multidimensional outputs, i.e., neural networks with multiple output nodes but no hidden nodes. We allow at the final layer transfer functions such as the softmax function that need to consider the linear activations to all the output neurons. The ..."
Abstract
-
Cited by 55 (11 self)
- Add to MetaCart
We study on-line generalized linear regression with multidimensional outputs, i.e., neural networks with multiple output nodes but no hidden nodes. We allow at the final layer transfer functions such as the softmax function that need to consider the linear activations to all the output neurons. The weight vectors used to produce the linear activations are represented indirectly by maintaining separate parameter vectors. We get the weight vector by applying a particular parameterization function to the parameter vector. Updating the parameter vectors upon seeing new examples is done additively, as in the usual gradient descent update. However, by using a nonlinear parameterization function between the parameter vectors and the weight vectors, we can make the resulting update of the weight vector quite different from a true gradient descent update. To analyse such updates, we define a notion of a matching loss function and apply it both to the transfer function and to the parameterization function. The loss function that matches the transfer function is used to measure the goodness of the predictions of the algorithm. The loss function that matches the parameterization function can be used both as a measure of divergence between models in motivating the update rule of the algorithm and as a measure of progress in analyzing its relative performance compared to an arbitrary fixed model. As a result, we have a unified treatment that generalizes earlier results for the gradient descent and exponentiated gradient algorithms to multidimensional outputs, including multiclass logistic regression.
Competing in the dark: An efficient algorithm for bandit linear optimization
- In Proceedings of the 21st Annual Conference on Learning Theory (COLT
, 2008
"... We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal O ∗ ( √ T) regret. The setting is a natural generalization of the nonstochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has bee ..."
Abstract
-
Cited by 36 (7 self)
- Add to MetaCart
We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal O ∗ ( √ T) regret. The setting is a natural generalization of the nonstochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods. 1
Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent
- Neural Computation
, 2002
"... We propose a generic method for iteratively approximating various second-order gradient steps -- Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient -- in linear time per iteration, using special curvature matrix-vector products that can be computed in O(n). Two recent acceleration techn ..."
Abstract
-
Cited by 25 (11 self)
- Add to MetaCart
We propose a generic method for iteratively approximating various second-order gradient steps -- Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient -- in linear time per iteration, using special curvature matrix-vector products that can be computed in O(n). Two recent acceleration techniques for online learning, matrix momentum and stochastic meta-descent (SMD), in fact implement this approach. Since both were originally derived by very different routes, this o ers fresh insight into their operation, resulting in further improvements to SMD.
Learning permutations with exponential weights
- In 20th Annual Conference on Learning Theory
, 2007
"... Abstract. We give an algorithm for learning a permutation on-line. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic matrix. This matrix is updated by multiplying the current matrix entries by exponential factors. These factors destroy the doubly stochastic ..."
Abstract
-
Cited by 21 (5 self)
- Add to MetaCart
Abstract. We give an algorithm for learning a permutation on-line. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic matrix. This matrix is updated by multiplying the current matrix entries by exponential factors. These factors destroy the doubly stochastic property of the matrix and an iterative procedure is needed to re-normalize the rows and columns. Even though the result of the normalization procedure does not have a closed form, we can still bound the additional loss of our algorithm over the loss of the best permutation chosen in hindsight. 1
THE FORGETRON: A KERNEL-BASED PERCEPTRON ON A BUDGET
, 2008
"... The Perceptron algorithm, despite its simplicity, often performs well in online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernel functions. However, a common difficulty encountered when implementing kernel-based online algorithms is the am ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
The Perceptron algorithm, despite its simplicity, often performs well in online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernel functions. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly as the algorithm progresses. Moreover, the running time of each online round grows linearly with the amount of memory used to store the hypothesis. In this paper, we present the Forgetron family of kernel-based online classification algorithms, which overcome this problem by restricting themselves to a predefined memory budget. We obtain different members of this family by modifying the kernel-based Perceptron in various ways. We also prove a unified mistake bound for all of the Forgetron algorithms. To our knowledge, this is the first online kernel-based learning paradigm which, on one hand, maintains a strict limit on the amount of memory it uses and, on the other hand, entertains a relative mistake bound. We conclude with experiments using real datasets, which underscore the merits of our approach.

