Results 1-10 of 90
The tradeoffs of large scale learning
 In: Advances in Neural Information Processing Systems 20, 2008
Cited by 138 (4 self)
This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in nontrivial ways.
Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks
 2008
Cited by 59 (1 self)
Log-linear and maximum-margin models are two commonly used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be
Efficient weight learning for Markov logic networks
 In Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases, 2007
Cited by 57 (7 self)
Markov logic networks (MLNs) combine Markov networks and first-order logic, and are a powerful and increasingly popular representation for statistical relational learning. The state-of-the-art method for discriminative learning of MLN weights is the voted perceptron algorithm, which is essentially gradient descent with an MPE approximation to the expected sufficient statistics (true clause counts). Unfortunately, these can vary widely between clauses, causing the learning problem to be highly ill-conditioned, and making gradient descent very slow. In this paper, we explore several alternatives, from per-weight learning rates to second-order methods. In particular, we focus on two approaches that avoid computing the partition function: diagonal Newton and scaled conjugate gradient. In experiments on standard SRL datasets, we obtain order-of-magnitude speedups, or more accurate models given comparable learning times.
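A minimal sketch of why ill-conditioning slows gradient descent and how per-coordinate curvature scaling helps. This is not the MLN weight-learning procedure from the paper: the quadratic objective, the curvature vector `h`, and the function names are all assumptions chosen for illustration, and on this toy problem the diagonal of the Hessian is known exactly.

```python
def gradient(w, h, target):
    # Gradient of the toy objective f(w) = 0.5 * sum_i h[i] * (w[i] - target[i])**2
    return [hi * (wi - ti) for wi, hi, ti in zip(w, h, target)]

def plain_gd(w, h, target, lr, steps):
    # One global learning rate: it must be small enough for the steepest coordinate.
    for _ in range(steps):
        g = gradient(w, h, target)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def diagonal_newton(w, h, target, steps):
    # Each coordinate is scaled by the inverse of its own curvature h[i].
    for _ in range(steps):
        g = gradient(w, h, target)
        w = [wi - gi / hi for wi, gi, hi in zip(w, g, h)]
    return w

h = [1.0, 1000.0]        # hypothetical curvatures differing by three orders of magnitude
target = [1.0, 1.0]      # minimizer of the toy objective
w_gd = plain_gd([0.0, 0.0], h, target, lr=1.0 / max(h), steps=50)
w_dn = diagonal_newton([0.0, 0.0], h, target, steps=50)
print("plain gradient descent:", w_gd)
print("diagonal Newton:       ", w_dn)
```

With the single learning rate tied to the largest curvature, the flat coordinate barely moves in 50 steps, while the per-coordinate scaling reaches the minimizer on this separable problem.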
Information extraction
 FnT Databases
Cited by 53 (2 self)
The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community, where the primary impetus came from competitions centered around the recognition of named entities like people's names and organizations from news articles. As society became more data-oriented with easy online access to both structured and unstructured data, new applications of structure extraction came around. Now, there is interest in converting our personal desktops to structured databases, the knowledge in scientific publications to structured records, and harnessing the Internet for structured fact-finding queries. Consequently, there are many different communities of researchers bringing in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem. This review is a survey of information extraction research of over two decades from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case we highlight the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities and handling uncertainty in the extraction process.
A tutorial on energy-based learning
 Predicting Structured Data, 2006
Cited by 42 (6 self)
Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods. Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.
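The inference step described in this abstract (clamp the observed variables, then search the remaining ones for the lowest energy) can be sketched for a small discrete answer set. The energy function, the label set, and the names `infer` and `toy_energy` are hypothetical and only meant to illustrate the idea; realistic models need a smarter search than exhaustive enumeration.

```python
def infer(energy, x, candidates):
    # Clamp the observed variable x and return the candidate y with minimal energy.
    # No normalization over candidates is needed, only a comparison of energies.
    return min(candidates, key=lambda y: energy(x, y))

def toy_energy(x, y):
    # Hypothetical energy: low when the label y has the same sign as the input x.
    return -x * y

labels = [-1, +1]
prediction_pos = infer(toy_energy, 2.5, labels)
prediction_neg = infer(toy_energy, -0.3, labels)
print(prediction_pos, prediction_neg)
```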
Slow learners are fast
 In NIPS, 2009
Cited by 35 (2 self)
Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design, which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning.
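A rough sketch of what "delayed updates" means: the gradient applied at step t was computed on the parameters from `delay` steps earlier, as would happen if several workers computed gradients in parallel. The scalar objective, the step-size schedule `eta0 / sqrt(t)`, and the name `delayed_sgd` are assumptions for illustration and are not taken from the paper.

```python
from collections import deque

def delayed_sgd(grad_fn, w0, delay, steps, eta0=0.3):
    w = w0
    pending = deque()  # gradients that have been computed but not yet applied
    for t in range(1, steps + 1):
        pending.append(grad_fn(w))      # gradient at the current iterate
        if len(pending) > delay:
            g = pending.popleft()       # gradient computed `delay` steps ago
            w = w - (eta0 / t ** 0.5) * g
    return w

# Toy convex objective (hypothetical): f(w) = 0.5 * (w - 3)**2, minimized at w = 3.
grad_fn = lambda w: w - 3.0
w_no_delay = delayed_sgd(grad_fn, w0=0.0, delay=0, steps=2000)
w_delayed = delayed_sgd(grad_fn, w0=0.0, delay=4, steps=2000)
print(w_no_delay, w_delayed)
```

Setting `delay=0` recovers ordinary sequential gradient descent, so the two runs can be compared directly on this toy problem.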
Exponentiated gradient algorithms for log-linear structured prediction
 In Proc. ICML, 2007
Cited by 29 (5 self)
Conditional log-linear models are a commonly used method for structured prediction. Efficient learning of parameters in these models is therefore an important problem. This paper describes an exponentiated gradient (EG) algorithm for training such models. EG is applied to the convex dual of the maximum likelihood objective; this results in both sequential and parallel update algorithms, where in the sequential algorithm parameters are updated in an online fashion. We provide a convergence proof for both algorithms. Our analysis also simplifies previous results on EG for max-margin models, and leads to a tighter bound on convergence rates. Experiments on a large-scale parsing task show that the proposed algorithm converges much faster than conjugate-gradient and L-BFGS approaches both in terms of optimization objective and test error.
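For orientation, a generic exponentiated gradient step on the probability simplex looks roughly like the sketch below: each coordinate is multiplied by the exponential of its negative scaled gradient, and the vector is renormalized so that it stays a distribution. This is not the structured dual algorithm from the paper; the quadratic objective, `target`, the step size, and the function names are assumptions made for illustration.

```python
import math

def eg_step(u, grad, eta):
    # Multiplicative update u_i <- u_i * exp(-eta * grad_i), followed by renormalization.
    exps = [-eta * g for g in grad]
    shift = max(exps)  # subtracting the largest exponent avoids overflow in exp()
    w = [ui * math.exp(e - shift) for ui, e in zip(u, exps)]
    z = sum(w)
    return [wi / z for wi in w]

def minimize_on_simplex(grad_fn, dim, eta=1.0, steps=300):
    u = [1.0 / dim] * dim  # start from the uniform distribution
    for _ in range(steps):
        u = eg_step(u, grad_fn(u), eta)
    return u

# Toy objective (hypothetical): f(u) = 0.5 * ||u - target||^2 with target inside the simplex.
target = [0.7, 0.2, 0.1]
grad_fn = lambda u: [ui - ti for ui, ti in zip(u, target)]
u_star = minimize_on_simplex(grad_fn, dim=3)
print(u_star)
```

Because the update is multiplicative and renormalized, every iterate stays non-negative and sums to one without an explicit projection step.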
An Introduction to Conditional Random Fields
 Foundations and Trends in Machine Learning, 2012
Minimizing and learning energy functions for side-chain prediction
 In RECOMB 2007
Cited by 23 (1 self)
Side-chain prediction is an important subproblem of the general protein folding problem. Despite much progress in side-chain prediction, performance is far from satisfactory. As an example, the ROSETTA protocol, which uses simulated annealing to select the minimum energy conformations, correctly predicts the first two side-chain angles for approximately 72% of the buried residues in a standard data set. Is further improvement more likely to come from better search methods, or from better energy functions? Given that exact minimization of the energy is NP-hard, it is difficult to get a systematic answer to this question. In this paper, we present a novel search method and a novel method for learning energy functions from training data that are both based on Tree-Reweighted Belief Propagation (TRBP). We find that TRBP can find the global optimum of the ROSETTA energy function in a few minutes of computation for approximately 85% of the proteins in a standard benchmark set. TRBP can also effectively bound the partition function, which enables using the Conditional Random Fields (CRF) framework for learning. Interestingly, finding the global minimum does not significantly improve side-chain prediction for
Speech Recognition Using Augmented Conditional Random Fields
Cited by 22 (0 self)
Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data-driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT