Results 1 - 10
of
16
Pegasos: Primal Estimated sub-gradient solver for SVM
"... We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a singl ..."
Abstract
-
Cited by 131 (10 self)
- Add to MetaCart
We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ɛ2) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λɛ)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
Relational Learning via Collective Matrix Factorization
, 2008
"... Relational learning is concerned with predicting unknown values of a relation, given a database of entities and observed relations among entities. An example of relational learning is movie rating prediction, where entities could include users, movies, genres, and actors. Relations would then encode ..."
Abstract
-
Cited by 34 (3 self)
- Add to MetaCart
Relational learning is concerned with predicting unknown values of a relation, given a database of entities and observed relations among entities. An example of relational learning is movie rating prediction, where entities could include users, movies, genres, and actors. Relations would then encode users ’ ratings of movies, movies ’ genres, and actors ’ roles in movies. A common prediction technique given one pairwise relation, for example a #users × #movies ratings matrix, is low-rank matrix factorization. In domains with multiple relations, represented as multiple matrices, we may improve predictive accuracy by exploiting information from one relation while predicting another. To this end, we propose a collective matrix factorization model: we simultaneously factor several matrices, sharing parameters among factors when an entity participates in multiple relations. Each relation can have a different value type and error distribution; so, we allow nonlinear relationships between the parameters and outputs, using Bregman divergences to measure error. We extend standard alternating projection algorithms to our model, and derive an efficient Newton update for the projection. Furthermore, we propose stochastic optimization methods to deal with large, sparse matrices. Our model generalizes several existing matrix factorization methods, and therefore yields new large-scale optimization algorithms for these problems. Our model can handle any pairwise relational schema and a
SVM Optimization: Inverse Dependence on Training Set Size
"... We discuss how the runtime of SVM optimization should decrease as the size of the training data increases. We present theoretical and empirical results demonstrating how a simple subgradient descent approach indeed displays such behavior, at least for linear kernels. 1. ..."
Abstract
-
Cited by 31 (8 self)
- Add to MetaCart
We discuss how the runtime of SVM optimization should decrease as the size of the training data increases. We present theoretical and empirical results demonstrating how a simple subgradient descent approach indeed displays such behavior, at least for linear kernels. 1.
Dual averaging methods for regularized stochastic learning and online optimization
- In Advances in Neural Information Processing Systems 23
, 2009
"... We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1-norm for promoting sparsity. We develop extensions of Nes ..."
Abstract
-
Cited by 28 (3 self)
- Add to MetaCart
We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1-norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.
Identifying Suspicious URLs: An Application of Large-Scale Online Learning
- In Proc. of the International Conference on Machine Learning (ICML
, 2009
"... This paper explores online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms as the size of the training data is larger ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
This paper explores online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms as the size of the training data is larger than can be efficiently processed in batch and because the distribution of features that typify malicious URLs is changing continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recentlydeveloped online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99 % over a balanced data set. 1.
Recall systems: Efficient learning and use of category indices
- In International Conference on Artificial Intelligence and Statistics (AISTATS
, 2007
"... We introduce the framework of recall systems for efficient learning and retrieval of categories when the number of categories is large. A recallsystem here is a simple feature-based intermediate filtering step which reduces the potential categories for an instance to a small manageable set. The corr ..."
Abstract
-
Cited by 6 (5 self)
- Add to MetaCart
We introduce the framework of recall systems for efficient learning and retrieval of categories when the number of categories is large. A recallsystem here is a simple feature-based intermediate filtering step which reduces the potential categories for an instance to a small manageable set. The correct categories from this set can then be determined using traditional classifiers. We present a formalization of the index learning problem and establish NP-hardness and approximation hardness. We proceed to give an efficient heuristic for learning indices, and evaluate it on several large data sets. In our experiments, the index is learned within minutes, and reduces the number of categories by several orders of magnitude, without affecting the quality of classification overall. 1
Matrix Updates for Perceptron Training of Continuous Density Hidden Markov Models
"... In this paper, we investigate a simple, mistakedriven learning algorithm for discriminative training of continuous density hidden Markov models (CD-HMMs). Most CD-HMMs for automatic speech recognition use multivariate Gaussian emission densities (or mixtures thereof) parameterized in terms of their ..."
Abstract
-
Cited by 6 (5 self)
- Add to MetaCart
In this paper, we investigate a simple, mistakedriven learning algorithm for discriminative training of continuous density hidden Markov models (CD-HMMs). Most CD-HMMs for automatic speech recognition use multivariate Gaussian emission densities (or mixtures thereof) parameterized in terms of their means and covariance matrices. For discriminative training of CD-HMMs, we reparameterize these Gaussian distributions in terms of positive semidefinite matrices that jointly encode their mean and covariance statistics. We show how to explore the resulting parameter space in CD-HMMs with perceptron-style updates that minimize the distance between Viterbi decodings and target transcriptions. We experiment with several forms of updates, systematically comparing the effects of different matrix factorizations, initializations, and averaging schemes on phone accuracies and convergence rates. We present experimental results for context-independent CD-HMMs trained in this way on the TIMIT speech corpus. Our results show that certain types of perceptron training yield consistently significant and rapid reductions in phone error rates. 1.
Prediction games in infinitely rich worlds
- In Utility Based Data Mining Workshop (UBDM at KDD
, 2006
"... categories, every experience would be new, and one couldn’t make sense of one’s world. Furthermore, higher intelligence requires large numbers of categories, perhaps millions and beyond. Acquiring and robust detection of categories appears to be a complex task as categories inter-relate in complex w ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
categories, every experience would be new, and one couldn’t make sense of one’s world. Furthermore, higher intelligence requires large numbers of categories, perhaps millions and beyond. Acquiring and robust detection of categories appears to be a complex task as categories inter-relate in complex ways and occur in diverse conditions. We may then ask: how can a system learn so many complex inter-related categories? We propose and explore an avenue that we call prediction games in infinitely rich worlds. In these games, the world is a source of an unlimited stream of information. The games are played by a prediction system that in effect repeatedly experiments with its world and learns from its experiments. The system converts its input stream from the world into a sequence of learning episodes for itself. Each learning episode consists of the system hiding parts of the input, guessing (predicting) them using the remainder of the input (the local context), and updating itself based on comparing its observations with its predictions. The goal of the system is to improve its
A fast online algorithm for large margin training of continuous density hidden markov models
- in Proceedings of Interspeech-2009
, 2009
"... We propose an online learning algorithm for large margin training of continuous density hidden Markov models. The online algorithm updates the model parameters incrementally after the decoding of each training utterance. For large margin training, the algorithm attempts to separate the log-likelihoo ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
We propose an online learning algorithm for large margin training of continuous density hidden Markov models. The online algorithm updates the model parameters incrementally after the decoding of each training utterance. For large margin training, the algorithm attempts to separate the log-likelihoods of correct and incorrect transcriptions by an amount proportional to their Hamming distance. We evaluate this approach to hidden Markov modeling on the TIMIT speech database. We find that the algorithm yields significantly lower phone error rates than other approaches—both online and batch—that do not attempt to enforce a large margin. We also find that the algorithm converges much more quickly than analogous batch optimizations for large margin training. Index Terms: hidden Markov models, online learning, large margin classification, discriminative training, automatic speech recognition 1.

