Solving multiclass learning problems via error-correcting output codes
Journal of Artificial Intelligence Research, 1995
Cited by 564 (9 self)
Multiclass learning problems involve finding a definition for an unknown function f(x) whose range is a discrete set containing k > 2 values (i.e., k "classes"). The definition is acquired by studying collections of training examples of the form ⟨x_i, f(x_i)⟩. Existing approaches to multiclass learning problems include direct application of multiclass algorithms such as the decision-tree algorithms C4.5 and CART, application of binary concept learning algorithms to learn individual binary functions for each of the k classes, and application of binary concept learning algorithms with distributed output representations. This paper compares these three approaches to a new technique in which error-correcting codes are employed as a distributed output representation. We show that these output representations improve the generalization performance of both C4.5 and backpropagation on a wide range of multiclass learning tasks. We also demonstrate that this approach is robust with respect to changes in the size of the training sample, the assignment of distributed representations to particular classes, and the application of overfitting avoidance techniques such as decision-tree pruning. Finally, we show that, like the other methods, the error-correcting code technique can provide reliable class probability estimates. Taken together, these results demonstrate that error-correcting output codes provide a general-purpose method for improving the performance of inductive learning programs on multiclass problems.
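To make the idea of an error-correcting distributed output representation concrete, here is a minimal Python sketch. It assumes a hypothetical 4-class problem with one 7-bit codeword per class (the class names `c1` to `c4` and the code matrix are illustrative, not taken from the paper) and assumes decoding by nearest codeword in Hamming distance. The binary learners themselves are left out; only the encoding of training targets and the decoding of predicted bits are shown.

```python
# Hypothetical code matrix: one 7-bit codeword per class.
# Every pair of rows differs in 4 positions, so any single
# wrong bit can still be decoded to the right class.
CODEWORDS = {
    "c1": (1, 1, 1, 1, 1, 1, 1),
    "c2": (0, 0, 0, 0, 1, 1, 1),
    "c3": (0, 0, 1, 1, 0, 0, 1),
    "c4": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions at which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def binary_targets(label):
    """Targets for the 7 binary learners for one training example."""
    return CODEWORDS[label]


def decode(bits):
    """Class whose codeword is closest to the predicted bits."""
    return min(CODEWORDS, key=lambda c: hamming(CODEWORDS[c], bits))


# Codeword for c3 with its sixth bit flipped by a (simulated) learner error.
noisy = (0, 0, 1, 1, 0, 1, 1)
print(decode(noisy))
```

With this particular matrix a single erroneous binary prediction leaves the output closer to the correct codeword than to any other, which is the sense in which the code is error-correcting.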
Error-Correcting Output Coding Corrects Bias and Variance
In Proceedings of the Twelfth International Conference on Machine Learning, 1995
Cited by 148 (5 self)
Previous research has shown that a technique called error-correcting output coding (ECOC) can dramatically improve the classification accuracy of supervised learning algorithms that learn to classify data points into one of k ≫ 2 classes. This paper presents an investigation of why the ECOC technique works, particularly when employed with decision-tree learning algorithms. It shows that the ECOC method, like any form of voting or committee, can reduce the variance of the learning algorithm. Furthermore, unlike methods that simply combine multiple runs of the same learning algorithm, ECOC can correct for errors caused by the bias of the learning algorithm. Experiments show that this bias correction ability relies on the non-local behavior of C4.5. 1 Introduction Error-correcting output coding (ECOC) is a method for applying binary (two-class) learning algorithms to solve k-class supervised learning problems. It works by converting the k-class supervised learning problem into a la...
Using Generative Models for Handwritten Digit Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996
Cited by 69 (8 self)
We describe a method of recognizing handwritten digits by fitting generative models that are built from deformable B-splines with Gaussian "ink generators" spaced along the length of the spline. The splines are adjusted using a novel elastic matching procedure based on the Expectation Maximization (EM) algorithm that maximizes the likelihood of the model generating the data. This approach has many advantages. (1) After identifying the model most likely to have generated the data, the system not only produces a classification of the digit but also a rich description of the instantiation parameters which can yield information such as the writing style. (2) During the process of explaining the image, generative models can perform recognition-driven segmentation. (3) The method involves a relatively small number of parameters and hence training is relatively easy and fast. (4) Unlike many other recognition schemes it does not rely on some form of pre-normalization of input images, but can ...
Ensembling Neural Networks: Many Could Be Better Than All
2002
Cited by 67 (12 self)
A neural network ensemble is a learning paradigm in which many neural networks are jointly used to solve a problem. In this paper, the relationship between the ensemble and its component neural networks is analyzed in the context of both regression and classification, which reveals that it may be better to ensemble many instead of all of the neural networks at hand. This result is interesting because at present, most approaches ensemble all the available neural networks for prediction. Then, in order to show that the appropriate neural networks for composing an ensemble can be effectively selected from a set of available neural networks, an approach named GASEN is presented. GASEN first trains a number of neural networks. Then it assigns random weights to those networks and employs a genetic algorithm to evolve the weights so that they can characterize to some extent the fitness of the neural networks in constituting an ensemble. Finally it selects some neural networks based on the evolved weights to make up the ensemble. A large empirical study shows that, compared with some popular ensemble approaches such as Bagging and Boosting, GASEN can generate neural network ensembles with far smaller sizes but stronger generalization ability. Furthermore, in order to understand the working mechanism of GASEN, the bias-variance decomposition of the error is provided in this paper, which shows that the success of GASEN may lie in that it can significantly reduce the bias as well as the variance.
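The final selection step described above can be sketched roughly as follows. This is not the GASEN implementation: the genetic algorithm that evolves the weights is omitted, the weights and predictions are made-up numbers, and the selection rule (keep networks whose weight exceeds a threshold of 1/N) is an assumption chosen for illustration. The sketch only shows how averaging a selected subset can beat averaging every network.

```python
def select_by_weight(weights, threshold):
    """Indices of networks whose (already evolved) weight exceeds the threshold."""
    return [i for i, w in enumerate(weights) if w > threshold]


def ensemble_average(predictions, members):
    """Average the predictions of the chosen members, point by point."""
    n_points = len(predictions[0])
    return [
        sum(predictions[m][j] for m in members) / len(members)
        for j in range(n_points)
    ]


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Made-up regression outputs of three networks on three test points.
target = [1.0, 2.0, 3.0]
predictions = [
    [1.1, 2.1, 2.9],
    [0.9, 1.9, 3.1],
    [2.0, 3.0, 4.0],  # a consistently biased network
]
weights = [0.45, 0.45, 0.10]  # stand-in for GA-evolved weights

chosen = select_by_weight(weights, threshold=1.0 / len(weights))
print("selected:", chosen)
print("MSE, all networks:", mse(ensemble_average(predictions, [0, 1, 2]), target))
print("MSE, selected    :", mse(ensemble_average(predictions, chosen), target))
```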
Method Combination For Document Filtering
1996
Cited by 53 (1 self)
There is strong empirical and theoretic evidence that combination of retrieval methods can improve performance. In this paper, we systematically compare combination strategies in the context of document filtering, using queries from the Tipster reference corpus. We find that simple averaging strategies do indeed improve performance, but that direct averaging of probability estimates is not the correct approach. Instead, the probability estimates must be renormalized using logistic regression on the known relevance judgements. We examine more complex combination strategies but find them less successful due to the high correlations among our filtering methods which are optimized over the same training data and employ similar document representations. 1 Introduction A text filtering system monitors an incoming document stream and selects documents identified as relevant to one or more of its query profiles. If profile interactions are ignored, this reduces to a number of independent bina...
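As a rough illustration of renormalizing scores with logistic regression before averaging, the sketch below fits a one-dimensional logistic model per filtering method on known relevance judgements and then averages the calibrated probabilities. The scores, labels, function names, and the plain gradient-descent fit are all assumptions made for this example; the abstract does not specify how the regression was set up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(scores, labels, lr=0.1, steps=5000):
    """Fit p(relevant | s) = sigmoid(a * s + b) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def combine(raw_scores, models):
    """Average the calibrated probabilities of several methods for one document."""
    probs = [sigmoid(a * s + b) for s, (a, b) in zip(raw_scores, models)]
    return sum(probs) / len(probs)


# Made-up training data: raw scores from two methods and relevance judgements.
labels = [0, 0, 1, 0, 1, 1]
method_a = [0.10, 0.20, 0.30, 0.60, 0.70, 0.90]
method_b = [0.40, 0.45, 0.55, 0.50, 0.60, 0.65]  # compressed score range

models = [fit_logistic(method_a, labels), fit_logistic(method_b, labels)]
print(combine([0.8, 0.62], models))
```

Each method gets its own fitted slope and intercept, so scores that live on different scales are mapped to comparable probabilities before they are averaged.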
Equivalence Proofs for Multi-Layer Perceptron Classifiers and the Bayesian Discriminant Function
1990
Cited by 53 (1 self)
This paper presents a number of proofs that equate the outputs of a Multi-Layer Perceptron (MLP) classifier and the optimal Bayesian discriminant function for asymptotically large sets of statistically independent training samples. Two broad classes of objective functions are shown to yield Bayesian discriminant performance. The first class are "reasonable error measures," which achieve Bayesian discriminant performance by engendering classifier outputs that asymptotically equate to a posteriori probabilities. This class includes the mean-squared error (MSE) objective function as well as a number of information-theoretic objective functions. The second class are classification figures of merit (CFM_mono), which yield a qualified approximation to Bayesian discriminant performance by engendering classifier outputs that asymptotically identify the maximum a posteriori probability for a given input. Conditions and relationships for Bayesian discriminant functional equivalence are given f...
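For the MSE case, the link between the error-minimizing output and the a posteriori probability can be seen from a standard textbook argument, sketched here for a single binary target. This is not the paper's proof; the symbols $o$ (classifier output at a fixed input $x$), $y \in \{0, 1\}$ (target), and $P = P(y = 1 \mid x)$ are notation introduced for this sketch.

```latex
\begin{align*}
  \mathbb{E}\big[(y - o)^2 \mid x\big]
    &= P\,(1 - o)^2 + (1 - P)\,o^2, \\
  \frac{\partial}{\partial o}\,\mathbb{E}\big[(y - o)^2 \mid x\big]
    &= -2P\,(1 - o) + 2\,(1 - P)\,o = 2\,(o - P), \\
  \frac{\partial}{\partial o}\,\mathbb{E}\big[(y - o)^2 \mid x\big] = 0
    &\iff o = P = P(y = 1 \mid x).
\end{align*}
```

Since the second derivative is $2 > 0$, the stationary point is a minimum: with enough data and a sufficiently flexible model, the output that minimizes expected squared error at $x$ is the posterior probability of the class.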
Combining the Predictions of Multiple Classifiers: Using Competitive Learning to Initialize Neural Networks
In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1995
Cited by 38 (6 self)
The primary goal of inductive learning is to generalize well, that is, to induce a function that accurately produces the correct output for future inputs. Hansen and Salamon showed that, under certain assumptions, combining the predictions of several separately trained neural networks will improve generalization. One of their key assumptions is that the individual networks should be independent in the errors they produce. In the standard way of performing backpropagation this assumption may be violated, because the standard procedure is to initialize network weights in the region of weight space near the origin. This means that backpropagation's gradient-descent search may only reach a small subset of the possible local minima. In this paper we present an approach to initializing neural networks that uses competitive learning to intelligently create networks that are originally located far from the origin of weight space, thereby potentially increasing the set of reachable local minima...
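Why independence of errors matters can be illustrated with a short calculation. The sketch below assumes an idealized setting that is not spelled out in the abstract: an odd number n of classifiers, each wrong with the same probability p, erring independently, and combined by simple majority vote on a two-class problem. The function name is hypothetical.

```python
from math import comb


def majority_error(n, p):
    """Probability that more than half of n independent voters are wrong.

    Assumes n is odd and each voter errs independently with probability p.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


for n in (1, 3, 11, 21):
    print(n, majority_error(n, 0.3))
```

Under these assumptions the vote's error shrinks as n grows whenever p < 0.5. If the networks make correlated errors, for example because they all start near the origin of weight space and reach similar minima, the binomial model no longer applies and the gain can be much smaller.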
A Global Optimization Technique for Statistical Classifier Design
IEEE Transactions on Signal Processing
Cited by 25 (9 self)
A global optimization method is introduced for the design of statistical classifiers that minimize the rate of misclassification. We first derive the theoretical basis for the method, based on which we develop a novel design algorithm and demonstrate its effectiveness and superior performance in the design of practical classifiers for some of the most popular structures currently in use. The method, grounded in ideas from statistical physics and information theory, extends the deterministic annealing approach for optimization, both to incorporate structural constraints on data assignments to classes and to minimize the probability of error as the cost objective. During the design, data are assigned to classes in probability, so as to minimize the expected classification error given a specified level of randomness, as measured by Shannon's entropy. The constrained optimization is equivalent to a free energy minimization, motivating a deterministic annealing approach in which the entropy...
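The idea of assigning data to classes in probability at a controlled level of randomness can be sketched with a Gibbs (softmax) distribution over per-class costs, which is the standard form that minimizes expected cost at a fixed Shannon entropy. The cost values, the temperature schedule, and the function names below are assumptions for illustration; the paper's actual cost function and constraints are not given in the abstract.

```python
import math


def gibbs_assignment(costs, temperature):
    """Probabilities p_j proportional to exp(-cost_j / T) over classes j."""
    m = min(costs)  # shift by the minimum for numerical stability
    weights = [math.exp(-(c - m) / temperature) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical costs of assigning one data point to each of three classes.
costs = [0.2, 0.5, 1.0]
for t in (10.0, 1.0, 0.1, 0.01):
    p = gibbs_assignment(costs, t)
    print(t, [round(x, 4) for x in p], round(entropy(p), 4))
```

At high temperature the assignment is nearly uniform (high entropy); as the temperature is lowered it concentrates on the lowest-cost class, which is the annealing behavior the abstract alludes to.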
Connected Letter Recognition with a Multi-State Time Delay Neural Network
In 3rd European Conference on Speech, Communication and Technology (EUROSPEECH) 93, 1993
Cited by 23 (13 self)
The Multi-State Time Delay Neural Network (MS-TDNN) integrates a nonlinear time alignment procedure (DTW) and the high-accuracy phoneme spotting capabilities of a TDNN into a connectionist speech recognition system with word-level classification and error backpropagation. We present an MS-TDNN for recognizing continuously spelled letters, a task characterized by a small but highly confusable vocabulary. Our MS-TDNN achieves 98.5/92.0% word accuracy on speaker dependent/independent tasks, outperforming previously reported results on the same databases. We propose training techniques aimed at improving sentence-level performance, including free alignment across word boundaries, word duration modeling and error backpropagation on the sentence rather than the word level. Architectures integrating submodules specialized on a subset of speakers achieved further improvements. 1 INTRODUCTION The recognition of spelled strings of letters is essential for all applications involving proper names,...
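For readers unfamiliar with the nonlinear time alignment (DTW) component, here is a generic textbook dynamic time warping distance on scalar sequences. It is only meant to show the alignment recurrence; it is not the MS-TDNN's alignment layer, which operates on network outputs rather than raw numbers, and the function name and absolute-difference cost are assumptions for this sketch.

```python
def dtw_distance(a, b):
    """Minimum cumulative |a_i - b_j| cost over monotonic alignments of a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = best cost of aligning the first i items of a with the first j of b
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(
                d[i - 1][j],      # a advances, b stays
                d[i][j - 1],      # b advances, a stays
                d[i - 1][j - 1],  # both advance
            )
    return d[n][m]


# The second sequence holds the value 2 for one extra frame.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because one element may be matched to several consecutive elements of the other sequence, a stretched version of the same pattern aligns at zero cost, which is what makes the procedure useful for speech of varying duration.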
Discriminative Training of Hidden Markov Models
1998
Cited by 20 (0 self)
Contents: Abbreviations; Notation; 1 Introduction; 2 Hidden Markov Models (2.1 Definition; 2.2 HMM Modelling Assumptions; 2.3 HMM Topology; 2.4 Finding the Best Transcription; 2.5 Setting the Parameters; 2.6 Summary); 3 Objective Functions (3.1 Properties of Maximum Likelihood Estimators; 3.2 Maximum Likelihood; 3.3 Maximum Mutual Information; 3.4 Frame Discrimination; ...)