Results 1  10
of
957
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
, 2010
"... Stochastic subgradient methods are widely used, well analyzed, and constitute effective tools for optimization and online learning. Stochastic gradient methods ’ popularity and appeal are largely due to their simplicity, as they largely follow predetermined procedural schemes. However, most common s ..."
Abstract

Cited by 287 (3 self)
 Add to MetaCart
(Show Context)
Stochastic subgradient methods are widely used, well analyzed, and constitute effective tools for optimization and online learning. Stochastic gradient methods ’ popularity and appeal are largely due to their simplicity, as they largely follow predetermined procedural schemes. However, most common subgradient approaches are oblivious to the characteristics of the data being observed. We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradientbased learning. The adaptation, in essence, allows us to find needles in haystacks in the form of very predictive but rarely seenfeatures. Ourparadigmstemsfromrecentadvancesinstochasticoptimizationandonlinelearning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. In a companion paper, we validate experimentally our theoretical analysis and show that the adaptive subgradient approach outperforms stateoftheart, but nonadaptive, subgradient algorithms. 1
An extension on ―statistical comparisons of classifiers over multiple data sets‖ for all pairwise comparisons
 Journal of Machine Learning Research
"... In a recently published paper in JMLR, Demˇsar (2006) recommends a set of nonparametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic ..."
Abstract

Cited by 158 (37 self)
 Add to MetaCart
(Show Context)
In a recently published paper in JMLR, Demˇsar (2006) recommends a set of nonparametric statistical tests and procedures which can be safely used for comparing the performance of classifiers over multiple data sets. After studying the paper, we realize that the paper correctly introduces the basic procedures and some of the most advanced ones when comparing a control method. However, it does not deal with some advanced topics in depth. Regarding these topics, we focus on more powerful proposals of statistical procedures for comparing n×n classifiers. Moreover, we illustrate an easy way of obtaining adjusted and comparable pvalues in multiple comparison procedures.
On smoothing and inference for topic models
 In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence
, 2009
"... Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling highdimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and ..."
Abstract

Cited by 112 (8 self)
 Add to MetaCart
(Show Context)
Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling highdimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can be learned in several seconds on text corpora with thousands of documents. 1
Sparse Online Learning via Truncated Gradient
"... We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to ..."
Abstract

Cited by 105 (4 self)
 Add to MetaCart
(Show Context)
We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical onlinelearning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
Trust region Newton method for largescale logistic regression
 In Proceedings of the 24th International Conference on Machine Learning (ICML
, 2007
"... Largescale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the loglikelihood of the logistic regression model. The proposed method uses only approximate Newton steps in ..."
Abstract

Cited by 96 (22 self)
 Add to MetaCart
(Show Context)
Largescale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the loglikelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also compare it with existing linear SVM implementations. 1
Distributed algorithms for topic models
 In: The Journal of Machine Learning Research
"... We describe distributed algorithms for two widelyused topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distribute ..."
Abstract

Cited by 92 (3 self)
 Add to MetaCart
(Show Context)
We describe distributed algorithms for two widelyused topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbssampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newlycreated topics either by matching or by topicid. Using five realworld text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged testdata log probability for distributed learning is indistinguishable from that obtained with singleprocessor learning. Our extensive experimental results include learning topic models for two multimillion document collections using a 1024processor parallel computer.
Bolasso: model consistent lasso estimation through the bootstrap
 In Proceedings of the Twentyfifth International Conference on Machine Learning (ICML
, 2008
"... We consider the leastsquare linear regression problem with regularization by the ℓ1norm, a problem usually referred to as the Lasso. In this paper, we present a detailed asymptotic analysis of model consistency of the Lasso. For various decays of the regularization parameter, we compute asymptotic ..."
Abstract

Cited by 84 (15 self)
 Add to MetaCart
(Show Context)
We consider the leastsquare linear regression problem with regularization by the ℓ1norm, a problem usually referred to as the Lasso. In this paper, we present a detailed asymptotic analysis of model consistency of the Lasso. For various decays of the regularization parameter, we compute asymptotic equivalents of the probability of correct model selection (i.e., variable selection). For a specific rate decay, we show that the Lasso selects all the variables that should enter the model with probability tending to one exponentially fast, while it selects all other variables with strictly positive probability. We show that this property implies that if we run the Lasso for several bootstrapped replications of a given sample, then intersecting the supports of the Lasso bootstrap estimates leads to consistent model selection. This novel variable selection algorithm, referred to as the Bolasso, is compared favorably to other linear regression methods on synthetic data and datasets from the UCI machine learning repository. 1.
Nonlinear causal discovery with additive noise models
"... The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuousvalued data linear acyclic causal models with additive noise are often used because these models are well understood and there are wellknown methods to fit them to data. In ..."
Abstract

Cited by 78 (30 self)
 Add to MetaCart
(Show Context)
The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuousvalued data linear acyclic causal models with additive noise are often used because these models are well understood and there are wellknown methods to fit them to data. In reality, of course, many causal relationships are more or less nonlinear, raising some doubts as to the applicability and usefulness of purely linear methods. In this contribution we show that in fact the basic linear framework can be generalized to nonlinear models. In this extended framework, nonlinearities in the datagenerating process are in fact a blessing rather than a curse, as they typically provide information on the underlying causal system and allow more aspects of the true datagenerating mechanisms to be identified. In addition to theoretical results we show simulations and some simple real data experiments illustrating the identification power provided by nonlinearities. 1
Fast Support Vector Machine Training and Classification
 on Graphics Processors, Proc. 25th Int. Conf. Machine Learning
, 2008
"... Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training running on a GPU, using the Sequential Minimal Optimization algorithm and an ad ..."
Abstract

Cited by 77 (2 self)
 Add to MetaCart
(Show Context)
Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training running on a GPU, using the Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic, which achieves speedups of 935 × over LIBSVM running on a traditional processor. We also present a GPUbased system for SVM classification which achieves speedups of 81138 × over LIBSVM (524 × over our own CPU based SVM classifier). 1.
Conditional random fields for activity recognition
 In Proceedings of the Sixth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007
, 2007
"... of any sponsoring institution, the U.S. government or any other entity. ..."
Abstract

Cited by 76 (0 self)
 Add to MetaCart
(Show Context)
of any sponsoring institution, the U.S. government or any other entity.