Results 1-10 of 40
Support vector machines for speech recognition
Proceedings of the International Conference on Spoken Language Processing, 1998
Cited by 74 (2 self)
Statistical techniques based on hidden Markov Models (HMMs) with Gaussian emission densities have dominated signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative information and are prone to overfitting and overparameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition, and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Alphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.
Kernel matching pursuit
Machine Learning, 2002
Cited by 62 (0 self)
Matching Pursuit algorithms learn a function that is a weighted sum of basis functions, by sequentially appending functions to an initially empty basis, to approximate a target function in the least-squares sense. We show how matching pursuit can be extended to use non-squared error loss functions, and how it can be used to build kernel-based solutions to machine-learning problems, while keeping control of the sparsity of the solution. We also derive MDL motivated generalization bounds for this type of algorithm, and compare them to related SVM (Support Vector Machine) bounds. Finally, links to boosting algorithms and RBF training procedures, as well as an extensive experimental comparison with SVMs for classification are given, showing comparable results with typically sparser models.
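The greedy procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration of basic matching pursuit with squared-error loss only, not the paper's implementation: the Gaussian kernel, the scalar inputs, the choice of one basis function per training point, and all function names are assumptions made for the example. The non-squared loss extensions mentioned in the abstract are not covered.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two scalars (an assumed kernel choice)."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

def kernel_matching_pursuit(xs, ys, n_basis, sigma=1.0):
    """Greedy basic matching pursuit with squared-error loss.

    The dictionary is the set of kernel functions centered on the training
    inputs. Returns a list of (center, weight) pairs.
    """
    # Column i holds basis function i evaluated at every training point.
    columns = [[gaussian_kernel(x, c, sigma) for x in xs] for c in xs]
    residual = list(ys)
    model = []
    for _ in range(n_basis):
        best_i, best_alpha, best_gain = None, 0.0, 0.0
        for i, col in enumerate(columns):
            dot = sum(g * r for g, r in zip(col, residual))
            norm_sq = sum(g * g for g in col)
            gain = dot * dot / norm_sq  # reduction in squared residual norm
            if gain > best_gain:
                best_i, best_alpha, best_gain = i, dot / norm_sq, gain
        if best_i is None:  # no basis function reduces the residual further
            break
        # Append the chosen basis function and update the residual.
        model.append((xs[best_i], best_alpha))
        residual = [r - best_alpha * g
                    for r, g in zip(residual, columns[best_i])]
    return model

def predict(model, x, sigma=1.0):
    """Evaluate the weighted sum of selected kernel functions at x."""
    return sum(w * gaussian_kernel(x, c, sigma) for c, w in model)

def squared_error(model, xs, ys, sigma=1.0):
    """Sum of squared errors of the model on the given points."""
    return sum((predict(model, x, sigma) - y) ** 2 for x, y in zip(xs, ys))
```

The `n_basis` argument is what keeps the solution sparse here: the model never holds more than that many (center, weight) pairs, regardless of the training set size.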
Misclassification Minimization
Journal of Global Optimization, 1994
Cited by 40 (13 self)
The problem of minimizing the number of misclassified points by a plane, attempting to separate two point sets with intersecting convex hulls in n-dimensional real space, is formulated as a linear program with equilibrium constraints (LPEC). This general LPEC can be converted to an exact penalty problem with a quadratic objective and linear constraints. A Frank-Wolfe-type algorithm is proposed for the penalty problem that terminates at a stationary point or a global solution. Novel aspects of the approach include: (i) A linear complementarity formulation of the step function that "counts" misclassifications, (ii) Exact penalty formulation without boundedness, nondegeneracy or constraint qualification assumptions, (iii) An exact solution extraction from the sequence of minimizers of the penalty function for a finite value of the penalty parameter for the general LPEC and an explicitly exact solution for the LPEC with uncoupled constraints, and (iv) A parametric quadratic programming form...
Mathematical Programming in Neural Networks
ORSA Journal on Computing, 1993
Cited by 40 (13 self)
This paper highlights the role of mathematical programming, particularly linear programming, in training neural networks. A neural network description is given in terms of separating planes in the input space that suggests the use of linear programming for determining these planes. A more standard description in terms of a mean square error in the output space is also given, which leads to the use of unconstrained minimization techniques for training a neural network. The linear programming approach is demonstrated by a brief description of a system for breast cancer diagnosis that has been in use for the last four years at a major medical facility.
1 What is a Neural Network?
A neural network is a representation of a map between an input space and an output space. A principal aim of such a map is to discriminate between the elements of a finite number of disjoint sets in the input space. Typically one wishes to discriminate between the elements of two disjoint point sets in the n-dim...
Large-scale machine learning with stochastic gradient descent
In COMPSTAT, 2010
Cited by 30 (0 self)
During the last decade, the data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in nontrivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second-order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.
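As a rough illustration of the averaged stochastic gradient idea mentioned in this abstract, the sketch below runs a single pass of SGD over a synthetic one-dimensional regression stream and keeps a running mean of the iterates. The squared loss, the step-size schedule, the synthetic data, and all names are assumptions chosen for the example; the sketch does not reproduce the paper's experiments and does not by itself demonstrate asymptotic efficiency.

```python
import random

def averaged_sgd(samples, eps0=0.5, decay=0.01, power=0.75):
    """One pass of SGD on 0.5 * (w*x + b - y)**2.

    Returns the last iterate and the average of all iterates.
    """
    w, b = 0.0, 0.0
    w_avg, b_avg = 0.0, 0.0
    for t, (x, y) in enumerate(samples):
        eps = eps0 / (1.0 + decay * t) ** power  # slowly decreasing step size
        err = w * x + b - y
        w -= eps * err * x
        b -= eps * err
        # Running mean of the iterates (Polyak-Ruppert style averaging).
        w_avg += (w - w_avg) / (t + 1)
        b_avg += (b - b_avg) / (t + 1)
    return (w, b), (w_avg, b_avg)

def make_samples(n, seed=0):
    """Hypothetical data stream: y = 2x + 1 plus Gaussian noise."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        out.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return out
```

Calling `averaged_sgd(make_samples(5000))` touches each example exactly once, which is the single-pass regime the abstract refers to.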
Stochastic Gradient Learning in Neural Networks
Cited by 28 (1 self)
Many connectionist learning algorithms consist of minimizing a cost of the form $C(w) = E(J(z; w)) = \int J(z; w)\,dP(z)$, where $dP$ is an unknown probability distribution that characterizes the problem to learn, and $J$, the loss function, defines the learning system itself. This popular statistical formulation has led to many theoretical results. The minimization of such a cost may be achieved with a stochastic gradient descent algorithm, e.g.: $w_{t+1} = w_t - \epsilon_t \nabla_w J(z; w_t)$. With some restrictions on $J$ and $C$, this algorithm converges, even if $J$ is non-differentiable on a set of measure 0. Links with simulated annealing are depicted. Résumé: Many connectionist algorithms consist of minimizing a cost of the form $C(w) = E(J(z; w)) = \int J(z; w)\,dP(z)$, where $dP$ is an unknown probability distribution that characterizes the problem, and $J$, the local criterion, describes the learning system itself. This well-known statistical formulation has given rise to...
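The update rule quoted in this abstract can be tried on a loss that is non-differentiable on a set of measure 0. The sketch below is an assumed toy setting, not taken from the paper: it uses J(z; w) = |z - w|, whose expected value is minimized at the median of the distribution of z, and a step size of eps0 / (t + 1). The function names and the uniform data distribution are hypothetical.

```python
import random

def sgd(grad, draw, w0, n_steps, eps0):
    """Iterate w_{t+1} = w_t - eps_t * grad(z, w_t), with eps_t = eps0 / (t + 1)."""
    w = w0
    for t in range(n_steps):
        z = draw()  # one sample from the unknown distribution dP
        w -= eps0 / (t + 1) * grad(z, w)
    return w

def abs_loss_grad(z, w):
    """Gradient in w of J(z; w) = |z - w|; it is undefined only at w == z."""
    if w > z:
        return 1.0
    if w < z:
        return -1.0
    return 0.0  # arbitrary choice on the measure-zero set w == z

def estimate_median(n_steps=20000, seed=0):
    """Run SGD on samples drawn uniformly from [0, 10], whose median is 5."""
    rng = random.Random(seed)
    return sgd(abs_loss_grad, lambda: rng.uniform(0.0, 10.0), 0.0, n_steps, 5.0)
```

The algorithm only ever sees individual draws of z, never the distribution itself, which matches the setting in which $dP$ is unknown.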
A Global Optimization Technique for Statistical Classifier Design
IEEE Transactions on Signal Processing
Cited by 25 (9 self)
A global optimization method is introduced for the design of statistical classifiers that minimize the rate of misclassification. We first derive the theoretical basis for the method, based on which we develop a novel design algorithm and demonstrate its effectiveness and superior performance in the design of practical classifiers for some of the most popular structures currently in use. The method, grounded in ideas from statistical physics and information theory, extends the deterministic annealing approach for optimization, both to incorporate structural constraints on data assignments to classes and to minimize the probability of error as the cost objective. During the design, data are assigned to classes in probability, so as to minimize the expected classification error given a specified level of randomness, as measured by Shannon's entropy. The constrained optimization is equivalent to a free energy minimization, motivating a deterministic annealing approach in which the entropy...
Links between Perceptrons, MLPs and SVMs
In: Proceedings of ICML, 2004
Cited by 21 (4 self)
We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show it can be computationally expensive in time to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin.
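One way to picture the relationship described in this abstract is a single SGD loop on a hinge loss with an optional L2 term. This is a sketch under assumptions, not the paper's formulation: the cost lam/2 * ||w||^2 + max(0, 1 - y * (w.x + b)), the fixed learning rate, and the names `sgd_hinge` and `classify` are all chosen for illustration. Setting `lam` to 0 drops the margin maximization term and leaves a Perceptron-with-margin style update.

```python
def sgd_hinge(data, lam, lr=0.1, epochs=100):
    """SGD on lam/2 * ||w||^2 + max(0, 1 - y * (w.x + b)) over (x, y) pairs.

    Labels y are +1 or -1; x is a tuple of floats.
    """
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Gradient step on the L2 (margin maximization) term;
            # this is a no-op when lam == 0.
            w = [wi - lr * lam * wi for wi in w]
            if margin < 1.0:  # hinge loss is active for this example
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(w, b, x):
    """Return +1 or -1 according to the sign of w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

Note that with `lam > 0` every example triggers a weight update (the decay step), whereas with `lam == 0` only margin violations do, which hints at why the regularized version costs more per pass.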
Online learning and stochastic approximations
In Online Learning in Neural Networks, 1998
Cited by 20 (0 self)
The convergence of online learning algorithms is analyzed using the tools of the stochastic approximation theory, and proved under very weak conditions. A general framework for online learning algorithms is first presented. This framework encompasses the most common online learning algorithms in use today, as illustrated by several examples. The stochastic approximation theory then provides general results describing the convergence of all these learning algorithms at once.
Geometry in Learning
In Geometry at Work, 1997
Cited by 19 (6 self)
One of the fundamental problems in learning is identifying members of two different classes. For example, to diagnose cancer, one must learn to discriminate between benign and malignant tumors. Through examination of tumors with previously determined diagnosis, one learns some function for distinguishing the benign and malignant tumors. Then the acquired knowledge is used to diagnose new tumors. The perceptron is a simple biologically inspired model for this two-class learning problem. The perceptron is trained or constructed using examples from the two classes. Then the perceptron is used to classify new examples. We describe geometrically what a perceptron is capable of learning. Using duality, we develop a framework for investigating different methods of training a perceptron. Depending on how we define the "best" perceptron, different minimization problems are developed for training the perceptron. The effectiveness of these methods is evaluated empirically on four practical applic...
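The train-then-classify workflow described in this abstract can be sketched with the classical perceptron update rule. This is a generic textbook-style illustration rather than any of the training methods developed in the paper, and the two-feature examples below are made-up numbers, not real tumor measurements.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def train_perceptron(examples, max_epochs=100):
    """Classical perceptron rule on (x, y) pairs with y in {+1, -1}.

    Stops early once a full pass makes no mistakes, which happens after a
    finite number of updates when the two classes are linearly separable.
    """
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            if y * (dot(w, x) + b) <= 0:  # misclassified or on the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def classify(w, b, x):
    """Return +1 if x lies on the positive side of the plane, else -1."""
    return 1 if dot(w, x) + b > 0 else -1

# Hypothetical two-feature examples: -1 for one class, +1 for the other.
EXAMPLES = [((1.0, 1.0), -1), ((2.0, 1.5), -1), ((1.5, 0.5), -1),
            ((4.0, 4.5), 1), ((5.0, 4.0), 1), ((4.5, 5.0), 1)]
```

After `w, b = train_perceptron(EXAMPLES)`, a call such as `classify(w, b, (4.5, 4.25))` applies the learned separating plane to a new example.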