Results 1 - 10
of
30
Kernel matching pursuit
- Machine Learning
, 2002
"... Matching Pursuit algorithms learn a function that is a weighted sum of basis functions, by sequentially appending functions to an initially empty basis, to approximate a target function in the leastsquares sense. We show how matching pursuit can be extended to use non-squared error loss functions, a ..."
Abstract
-
Cited by 48 (0 self)
- Add to MetaCart
Matching Pursuit algorithms learn a function that is a weighted sum of basis functions, by sequentially appending functions to an initially empty basis, to approximate a target function in the leastsquares sense. We show how matching pursuit can be extended to use non-squared error loss functions, and how it can be used to build kernel-based solutions to machine-learning problems, while keeping control of the sparsity of the solution. We also derive MDL motivated generalization bounds for this type of algorithm, and compare them to related SVM (Support Vector Machine) bounds. Finally, links to boosting algorithms and RBF training procedures, as well as an extensive experimental comparison with SVMs for classification are given, showing comparable results with typically sparser models. 1
Support vector machines for speech recognition
- Proceedings of the International Conference on Spoken Language Processing
, 1998
"... Statistical techniques based on hidden Markov Models (HMMs) with Gaussian emission densities have dominated signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative informati ..."
Abstract
-
Cited by 47 (2 self)
- Add to MetaCart
Statistical techniques based on hidden Markov Models (HMMs) with Gaussian emission densities have dominated signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition, and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Aphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.
Mathematical Programming in Neural Networks
- ORSA Journal on Computing
, 1993
"... This paper highlights the role of mathematical programming, particularly linear programming, in training neural networks. A neural network description is given in terms of separating planes in the input space that suggests the use of linear programming for determining these planes. A more standard d ..."
Abstract
-
Cited by 39 (13 self)
- Add to MetaCart
This paper highlights the role of mathematical programming, particularly linear programming, in training neural networks. A neural network description is given in terms of separating planes in the input space that suggests the use of linear programming for determining these planes. A more standard description in terms of a mean square error in the output space is also given, which leads to the use of unconstrained minimization techniques for training a neural network. The linear programming approach is demonstrated by a brief description of a system for breast cancer diagnosis that has been in use for the last four years at a major medical facility. 1 What is a Neural Network? A neural network is a representation of a map between an input space and an output space. A principal aim of such a map is to discriminate between the elements of a finite number of disjoint sets in the input space. Typically one wishes to discriminate between the elements of two disjoint point sets in the n-dim...
Misclassification Minimization
- JOURNAL OF GLOBAL OPTIMIZATION
, 1994
"... The problem of minimizing the number of misclassified points by a plane, attempting to separate two point sets with intersecting convex hulls in n-dimensional real space, is formulated as a linear program with equilibrium constraints (LPEC). This general LPEC can be converted to an exact penalty pro ..."
Abstract
-
Cited by 38 (13 self)
- Add to MetaCart
The problem of minimizing the number of misclassified points by a plane, attempting to separate two point sets with intersecting convex hulls in n-dimensional real space, is formulated as a linear program with equilibrium constraints (LPEC). This general LPEC can be converted to an exact penalty problem with a quadratic objective and linear constraints. A Frank-Wolfe-type algorithm is proposed for the penalty problem that terminates at a stationary point or a global solution. Novel aspects of the approach include: (i) A linear complementarity formulation of the step function that "counts" misclassifications, (ii) Exact penalty formulation without boundedness, nondegeneracy or constraint qualification assumptions, (iii) An exact solution extraction from the sequence of minimizers of the penalty function for a finite value of the penalty parameter for the general LPEC and an explicitly exact solution for the LPEC with uncoupled constraints, and (iv) A parametric quadratic programming form...
A Global Optimization Technique for Statistical Classifier Design
- IEEE Transactions on Signal Processing
"... A global optimization method is introduced for the design of statistical classifiers that minimize the rate of misclassification. We first derive the theoretical basis for the method, based on which we develop a novel design algorithm and demonstrate its effectiveness and superior performance in the ..."
Abstract
-
Cited by 22 (8 self)
- Add to MetaCart
A global optimization method is introduced for the design of statistical classifiers that minimize the rate of misclassification. We first derive the theoretical basis for the method, based on which we develop a novel design algorithm and demonstrate its effectiveness and superior performance in the design of practical classifiers for some of the most popular structures currently in use. The method, grounded in ideas from statistical physics and information theory, extends the deterministic annealing approach for optimization, both to incorporate structural constraints on data assignments to classes and to minimize the probability of error as the cost objective. During the design, data are assigned to classes in probability, so as to minimize the expected classification error given a specified level of randomness, as measured by Shannon's entropy. The constrained optimization is equivalent to a free energy minimization, motivating a deterministic annealing approach in which the entropy...
Geometry in Learning
- In Geometry at Work
, 1997
"... One of the fundamental problems in learning is identifying members of two different classes. For example, to diagnose cancer, one must learn to discriminate between benign and malignant tumors. Through examination of tumors with previously determined diagnosis, one learns some function for distingui ..."
Abstract
-
Cited by 18 (6 self)
- Add to MetaCart
One of the fundamental problems in learning is identifying members of two different classes. For example, to diagnose cancer, one must learn to discriminate between benign and malignant tumors. Through examination of tumors with previously determined diagnosis, one learns some function for distinguishing the benign and malignant tumors. Then the acquired knowledge is used to diagnose new tumors. The perceptron is a simple biologically inspired model for this two-class learning problem. The perceptron is trained or constructed using examples from the two classes. Then the perceptron is used to classify new examples. We describe geometrically what a perceptron is capable of learning. Using duality, we develop a framework for investigating different methods of training a perceptron. Depending on how we define the "best" perceptron, different minimization problems are developed for training the perceptron. The effectiveness of these methods is evaluated empirically on four practical applic...
A Theory Of Classifier Combination: The Neural Network Approach
, 1995
"... There is a trend in recent OCR development to improve system performance by combining recognition results of several complementary algorithms. This thesis examines the classifier combination problem under strict separation of the classifier and combinator design. None other than the fact that every ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
There is a trend in recent OCR development to improve system performance by combining recognition results of several complementary algorithms. This thesis examines the classifier combination problem under strict separation of the classifier and combinator design. None other than the fact that every classifier has the same input and output specification is assumed about the training, design or implementation of the classifiers. A general theory of combination should possess the following properties. It must be able to combine anytype of classifiers regardless of the level of information contents in the outputs. In addition, a general combinator must be able to combine any mixture of classifier types and utilize all information available. Since classifier independence is difficult to achieve and to detect, it is essential for a combinator to handle correlated classifiers robustly. Although the performance of a robust (against correlation) combinator can be improved by adding classifiers indiscriminantly, it is generally of interest to achieve comparable performance with the minimum number of classifiers. Therefore, the combinator should have the ability to eliminate redundant classifiers. Furthermore, it is desirable to have a complexity control mechanism for the combinator. In the past, simplifications come from assumptions and constraints imposed by the system designers. In the general theory, there should be a mechanism to reduce solution complexity by exercising non-classifier-specific constraints. Finally, a combinator should capture classifier/image dependencies. Nearly all combination methods have ignored the fact that classifier performances (and outputs) depend on various image characteristics, and this dependency is manifested in classifier output patterns in relation to input imag...
Online Learning and Stochastic Approximations
, 1998
"... The convergence of online learning algorithms is analyzed using the tools of the stochastic approximation theory, and proved under very weak conditions. A general framework for online learning algorithms is first presented. This framework encompasses the most common online learning algorithms in use ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
The convergence of online learning algorithms is analyzed using the tools of the stochastic approximation theory, and proved under very weak conditions. A general framework for online learning algorithms is first presented. This framework encompasses the most common online learning algorithms in use today, as illustrated by several examples. The stochastic approximation theory then provides general results describing the convergence of all these learning algorithms at once. 1 Introduction Almost all of the early work on Learning Systems focused on online algorithms (Hebb, 1949) (Rosenblatt, 1957) (Widrow and Hoff, 1960) (Amari, 1967) (Kohonen, 1982). In these early days, the algorithmic simplicity of online algorithms was a requirement. This is still the case when it comes to handling large, real-life training sets (LeCun et al., 1989) (Muller, Gunzinger and Guggenbuhl, 1995). The early Recursive Adaptive Algorithms were introduced during the same years (Robbins and Monro, 1951) and v...
Stochastic Gradient Learning in Neural Networks
"... Many connectionist learning algorithms consists of minimizing a cost of the form C(w) = E(J(z; w)) = Z J(z; w)dP (z) where dP is an unknown probability distribution that characterizes the problem to learn, and J , the loss function, defines the learning system itself. This popular statistical fo ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
Many connectionist learning algorithms consists of minimizing a cost of the form C(w) = E(J(z; w)) = Z J(z; w)dP (z) where dP is an unknown probability distribution that characterizes the problem to learn, and J , the loss function, defines the learning system itself. This popular statistical formulation has led to many theoretical results. The minimization of such a cost may be achieved with a stochastic gradient descent algorithm, e.g.: w t+1 = w t \Gamma ffl t rwJ(z; w t ) With some restrictions on J and C, this algorithm converges, even if J is non differentiable on a set of measure 0. Links with simulated annealing are depicted. R'esum'e De nombreux algorithmes connexionnistes consistent `a minimiser un cout de la forme C(w) = E(J(z; w)) = Z J(z; w)dP (z) o`u dP est une distribution de probabilit'e inconnue qui caract'erise le probl`eme, et J , le crit`ere local, d'ecrit le syst`eme d'apprentissage lui meme. Cette formulation statistique bien connue a donn'e lieu `a de nom...
Links between Perceptrons, MLPs and SVMs
- In: Proceedings of ICML. (2004
, 2004
"... We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea i ..."
Abstract
-
Cited by 10 (3 self)
- Add to MetaCart
We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show it can be computationally expensive in time to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin.

