Results 1  10
of
51
Gradientbased learning applied to document recognition
 Proceedings of the IEEE
, 1998
"... Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradientbased learning technique. Given an appropriate network architecture, gradientbased learning algorithms can be used to synthesize a complex decision surface that can classify hi ..."
Abstract

Cited by 1468 (84 self)
 Add to MetaCart
Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradientbased learning technique. Given an appropriate network architecture, gradientbased learning algorithms can be used to synthesize a complex decision surface that can classify highdimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two dimensional (2D) shapes, are shown to outperform all other techniques. Reallife document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN’s), allows such multimodule systems to be trained globally using gradientbased methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
Query by Committee
, 1992
"... We propose an algorithm called query by committee, in which a committee of students is trained on the same data set. The next query is chosen according to the principle of maximal disagreement. The algorithm is studied for two toy models: the highlow game and perceptron learning of another perceptr ..."
Abstract

Cited by 428 (3 self)
 Add to MetaCart
We propose an algorithm called query by committee, in which a committee of students is trained on the same data set. The next query is chosen according to the principle of maximal disagreement. The algorithm is studied for two toy models: the highlow game and perceptron learning of another perceptron. As the number of queries goes to infinity, the committee algorithm yields asymptotically finite information gain. This leads to generalization error that decreases exponentially with the number of examples. This in marked contrast to learning from randomly chosen inputs, for which the information gain approaches zero and the generalization error decreases with a relatively slow inverse power law. We suggest that asymptotically finite information gain may be an important characteristic of good query algorithms.
How to Use Expert Advice
 JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY
, 1997
"... We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worstcase situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the ..."
Abstract

Cited by 378 (74 self)
 Add to MetaCart
We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worstcase situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show howthis leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
Efficient noisetolerant learning from statistical queries
 JOURNAL OF THE ACM
, 1998
"... In this paper, we study the problem of learning in the presence of classification noise in the probabilistic learning model of Valiant and its variants. In order to identify the class of “robust” learning algorithms in the most general way, we formalize a new but related model of learning from stat ..."
Abstract

Cited by 363 (5 self)
 Add to MetaCart
(Show Context)
In this paper, we study the problem of learning in the presence of classification noise in the probabilistic learning model of Valiant and its variants. In order to identify the class of “robust” learning algorithms in the most general way, we formalize a new but related model of learning from statistical queries. Intuitively, in this model, a learning algorithm is forbidden to examine individual examples of the unknown target function, but is given access to an oracle providing estimates of probabilities over the sample space of random examples. One of our main results shows that any class of functions learnable from statistical queries is in fact learnable with classification noise in Valiant’s model, with a noise rate approaching the informationtheoretic barrier of 1/2. We then demonstrate the generality of the statistical query model, showing that practically every class learnable in Valiant’s model and its variants can also be learned in the new model (and thus can be learned in the presence of noise). A notable exception to this statement is the class of parity functions, which we prove is not learnable from statistical queries, and for which no noisetolerant algorithm is known.
Operations for Learning with Graphical Models
 Journal of Artificial Intelligence Research
, 1994
"... This paper is a multidisciplinary review of empirical, statistical learning from a graphical model perspective. Wellknown examples of graphical models include Bayesian networks, directed graphs representing a Markov chain, and undirected networks representing a Markov field. These graphical models ..."
Abstract

Cited by 277 (13 self)
 Add to MetaCart
This paper is a multidisciplinary review of empirical, statistical learning from a graphical model perspective. Wellknown examples of graphical models include Bayesian networks, directed graphs representing a Markov chain, and undirected networks representing a Markov field. These graphical models are extended to model data analysis and empirical learning using the notation of plates. Graphical operations for simplifying and manipulating a problem are provided including decomposition, differentiation, and the manipulation of probability models from the exponential family. Two standard algorithm schemas for learning are reviewed in a graphical framework: Gibbs sampling and the expectation maximization algorithm. Using these operations and schemas, some popular algorithms can be synthesized from their graphical specification. This includes versions of linear regression, techniques for feedforward networks, and learning Gaussian and discrete Bayesian networks from data. The paper conclu...
The Evidence Framework Applied to Classification Networks
, 1992
"... Three Bayesian ideas are presented for supervised adaptive classifiers. First, it is argued that the output of a classifier should be obtained by marginalizing over the posterior distribution of the parameters; a simple approximation to this integral is proposed and demonstrated. This involves a &qu ..."
Abstract

Cited by 189 (13 self)
 Add to MetaCart
Three Bayesian ideas are presented for supervised adaptive classifiers. First, it is argued that the output of a classifier should be obtained by marginalizing over the posterior distribution of the parameters; a simple approximation to this integral is proposed and demonstrated. This involves a "moderation" of the most probable classifier's outputs, and yields improved performance. Second, it is demonstrated that the Bayesian framework for model comparison described for regression models in MacKay (1992a,b) can also be applied to classification problems. This framework successfully chooses the magnitude of weight decay terms, and ranks solutions found using different numbers of hidden units. Third, an informationbased data selection criterion is derived and demonstrated within this framework.
Algorithmic Stability and SanityCheck Bounds for LeaveOneOut CrossValidation
 Neural Computation
, 1997
"... In this paper we prove sanitycheck bounds for the error of the leaveoneout crossvalidation estimate of the generalization error: that is, bounds showing that the worstcase error of this estimate is not much worse than that of the training error estimate. The name sanitycheck refers to the fact ..."
Abstract

Cited by 128 (1 self)
 Add to MetaCart
In this paper we prove sanitycheck bounds for the error of the leaveoneout crossvalidation estimate of the generalization error: that is, bounds showing that the worstcase error of this estimate is not much worse than that of the training error estimate. The name sanitycheck refers to the fact that although we often expect the leaveoneout estimate to perform considerably better than the training error estimate, we are here only seeking assurance that its performance will not be considerably worse. Perhaps surprisingly, such assurance has been given only for limited cases in the prior literature on crossvalidation. Any nontrivial bound on the error of leaveoneout must rely on some notion of algorithmic stability. Previous bounds relied on the rather strong notion of hypothesis stability, whose application was primarily limited to nearestneighbor and other local algorithms. Here we introduce the new and weaker notion of error stability, and apply it to obtain sanitycheck b...
An experimental and theoretical comparison of model selection methods. Machine Learning 27
, 1997
"... In the model selection problem, we must balance the complexity of a statistical model with its goodness of fit to the training data. This problem arises repeatedly in statistical estimation, machine learning, and scientific inquiry in general. ..."
Abstract

Cited by 117 (5 self)
 Add to MetaCart
(Show Context)
In the model selection problem, we must balance the complexity of a statistical model with its goodness of fit to the training data. This problem arises repeatedly in statistical estimation, machine learning, and scientific inquiry in general.
Rigorous learning curve bounds from statistical mechanics
 Machine Learning
, 1994
"... Abstract In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics. The advantage of our theory over the wellestablished VapnikChervonenkis theory is that our bounds can be considerably tighter in many cases, an ..."
Abstract

Cited by 59 (10 self)
 Add to MetaCart
(Show Context)
Abstract In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics. The advantage of our theory over the wellestablished VapnikChervonenkis theory is that our bounds can be considerably tighter in many cases, and are also more reflective of the true behavior (functional form) of learning curves. This behavior can often exhibit dramatic properties such as phase transitions, as well as power law asymptotics not explained by the VC theory. The disadvantages of our theory are that its application requires knowledge of the input distribution, and it is limited so far to finite cardinality function classes. We illustrate our results with many concrete examples of learning curve bounds derived from our theory. 1 Introduction According to the VapnikChervonenkis (VC) theory of learning curves [27, 26], minimizing empirical error within a function class F on a random sample of m examples leads to generalization error bounded by ~O(d=m) (in the case that the target function is contained in F) or ~O(pd=m) plus the optimal generalization error achievable within F (in the general case). 1 These bounds are universal: they hold for any class of hypothesis functions F, for any input distribution, and for any target function. The only problemspecific quantity remaining in these bounds is the VC dimension d, a measure of the complexity of the function class F. It has been shown that these bounds are essentially the best distributionindependent bounds possible, in the sense that for any function class, there exists an input distribution for which matching lower bounds on the generalization error can be given [5, 7, 22].
The Relationship between PAC, the Statistical Physics framework, the Bayesian framework, and the VC framework
"... This paper discusses the intimate relationships between the supervised learning frameworks mentioned in the title. In particular, it shows how all those frameworks can be viewed as particular instances of a single overarching formalism. In doing this many commonly misunderstood aspects of those fram ..."
Abstract

Cited by 45 (8 self)
 Add to MetaCart
This paper discusses the intimate relationships between the supervised learning frameworks mentioned in the title. In particular, it shows how all those frameworks can be viewed as particular instances of a single overarching formalism. In doing this many commonly misunderstood aspects of those frameworks are explored. In addition the strengths and weaknesses of those frameworks are compared, and some novel frameworks are suggested (resulting, for example, in a "correction" to the familiar biasplusvariance formula).