Gradient-based learning applied to document recognition
Proceedings of the IEEE, 1998
Cited by 487 (38 self)
Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two-dimensional (2-D) shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTNs), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
Rigorous learning curve bounds from statistical mechanics
Machine Learning, 1994
Cited by 52 (9 self)
In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics. The advantage of our theory over the well-established Vapnik-Chervonenkis theory is that our bounds can be considerably tighter in many cases, and are also more reflective of the true behavior (functional form) of learning curves. This behavior can often exhibit dramatic properties such as phase transitions, as well as power law asymptotics not explained by the VC theory. The disadvantages of our theory are that its application requires knowledge of the input distribution, and it is limited so far to finite cardinality function classes. We illustrate our results with many concrete examples of learning curve bounds derived from our theory.
1 Introduction
According to the Vapnik-Chervonenkis (VC) theory of learning curves [27, 26], minimizing empirical error within a function class F on a random sample of m examples leads to generalization error bounded by $\tilde{O}(d/m)$ (in the case that the target function is contained in F) or $\tilde{O}(\sqrt{d/m})$ plus the optimal generalization error achievable within F (in the general case). These bounds are universal: they hold for any class of hypothesis functions F, for any input distribution, and for any target function. The only problem-specific quantity remaining in these bounds is the VC dimension d, a measure of the complexity of the function class F. It has been shown that these bounds are essentially the best distribution-independent bounds possible, in the sense that for any function class, there exists an input distribution for which matching lower bounds on the generalization error can be given [5, 7, 22].
Discovering Informative Patterns and Data Cleaning
1996
Cited by 43 (0 self)
We present a method for discovering informative patterns from data. With this method, large databases can be reduced to only a few representative data entries. Our framework also encompasses methods for cleaning databases containing corrupted data. Both on-line and off-line algorithms are proposed and experimentally checked on databases of handwritten images. The generality of the framework makes it an attractive candidate for new applications in knowledge discovery. Keywords: knowledge discovery, machine learning, informative patterns, data cleaning, information gain.
Bounds on Sample Size for Policy Evaluation in Markov Environments
In Fourteenth Annual Conference on Computational Learning Theory, 2001
Cited by 11 (2 self)
Reinforcement learning means finding the optimal course of action in Markovian environments without knowledge of the environment's dynamics. Stochastic optimization algorithms used in the field rely on estimates of the value of a policy. Typically, the value of a policy is estimated from the results of simulating that very policy in the environment.
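The simulation-based value estimate mentioned in this abstract can be illustrated with a minimal Monte Carlo sketch. Everything here is an assumption made for illustration, not taken from the paper: the names `estimate_policy_value` and `toy_step`, the undiscounted finite-horizon return, and the one-state toy environment.

```python
import random


def estimate_policy_value(policy, step, start_state, horizon, episodes, seed=0):
    """Average the total reward over `episodes` simulated rollouts of `policy`."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = 0.0
    for _ in range(episodes):
        state = start_state
        for _ in range(horizon):
            action = policy(state)
            state, reward = step(state, action, rng)
            total += reward
    return total / episodes


def toy_step(state, action, rng):
    """Hypothetical one-state environment: action 1 pays 1 with probability 0.8, action 0 with 0.2."""
    p_reward = 0.8 if action == 1 else 0.2
    reward = 1.0 if rng.random() < p_reward else 0.0
    return state, reward


def always_one(state):
    return 1


def always_zero(state):
    return 0


value_one = estimate_policy_value(always_one, toy_step, start_state=0, horizon=10, episodes=2000)
value_zero = estimate_policy_value(always_zero, toy_step, start_state=0, horizon=10, episodes=2000)
print(value_one, value_zero)
```

In this toy setting the true values are 8 and 2 (ten steps times the per-step reward probability), so the two estimates should land close to those numbers. How many episodes are needed for a given accuracy is the kind of sample-size question the paper's title refers to.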
An Empirical Quest for Optimal Rule Learning Heuristics
2008
Cited by 11 (7 self)
The primary goal of the research reported in this paper is to identify what criteria are responsible for the good performance of a heuristic rule evaluation function in a greedy top-down covering algorithm. We first argue that search heuristics for inductive rule learning algorithms typically trade off consistency and coverage, and we investigate this trade-off by determining optimal parameter settings for five different parametrized heuristics. In order to avoid biasing our study by known functional families, we also investigate the potential of using meta-learning for obtaining alternative rule learning heuristics. The key results of this experimental study are not only practical default values for commonly used heuristics and a broad comparative evaluation of known and novel rule learning heuristics, but we also gain theoretical insights into factors that are responsible for good performance. For example, we observe that consistency should be weighed more heavily than coverage, presumably because a lack of coverage can later be corrected by learning additional rules.
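The consistency/coverage trade-off described in this abstract can be made concrete with one well-known parametrized heuristic, the m-estimate. This is a sketch for illustration only: the abstract does not name its five heuristics, and the rule counts and the value m = 10 below are made up.

```python
def m_estimate(p, n, P, N, m):
    """m-estimate of a rule's precision.

    p, n: positive / negative examples covered by the rule
    P, N: total positive / negative examples in the training set
    m:    trade-off parameter; m = 0 gives plain precision (pure consistency),
          larger m pulls the score toward the class prior and so penalizes
          rules with low coverage
    """
    covered = p + n
    if covered + m == 0:
        return 0.0  # rule covers nothing and no smoothing is applied
    prior = P / (P + N)
    return (p + m * prior) / (covered + m)


# Hypothetical training set with 50 positives and 50 negatives.
P, N = 50, 50
rule_a = (5, 0)    # perfectly consistent, but covers only 5 examples
rule_b = (40, 5)   # slightly inconsistent, but covers 45 examples

for m in (0, 10):
    score_a = m_estimate(*rule_a, P, N, m)
    score_b = m_estimate(*rule_b, P, N, m)
    print(m, round(score_a, 3), round(score_b, 3))
```

With m = 0 the small pure rule scores higher, and with m = 10 the broader rule overtakes it. Tuning m therefore shifts the balance between consistency and coverage.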
Finding Optimal Neural Networks for Land Use Classification
IEEE Transactions on Geoscience and Remote Sensing, 1998
Cited by 9 (0 self)
In this letter we present a fully automatic and computationally efficient algorithm based on the Minimum Description Length Principle (MDL) for optimizing multilayer perceptron classifiers. We demonstrate our method on the problem of multispectral Landsat image classification. We compare our results with a hand-designed multilayer perceptron and a Gaussian maximum likelihood classifier; our method produces better classification accuracy with a smaller number of hidden units.
1 Introduction
The number of applications of neural networks to remote sensing problems (especially classification) has been constantly increasing in the last few years (e.g., see [1, 2, 3, 4]). It has been demonstrated that in many cases neural networks perform considerably better than classical methods, e.g., [1]. However, to achieve this superior performance, the neural networks need to be carefully designed. This includes both the design of the network topology as well as the input/output representati...
Learning Coherent Concepts
2001
Cited by 4 (4 self)
This paper develops a theory for learning scenarios where multiple learners co-exist but there are mutual coherency constraints on their outcomes. This is natural in cognitive learning situations, where "natural" constraints are imposed on the outcomes of classifiers so that a valid sentence, image or any other domain representation is produced. We formalize these learning situations, after a model suggested in (Roth & Zelenko, 2000), and study generalization abilities of learning algorithms under these conditions in several frameworks. We show that the mere existence of coherency constraints, even without the learner's awareness of them, renders the learning problem easier than predicted by general theories and explains the ability to generalize well from a fairly small number of examples. In particular, it is shown that within this model one can develop an understanding of several realistic learning situations such as highly biased training sets and low dimensional data that is embedded in high dimensional instance spaces.
Sources of Success for Boosted Wrapper Induction
Journal of Machine Learning Research, 2004
Cited by 4 (0 self)
In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI) by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We investigate systematically how each algorithmic component of BWI, in particular boosting, contributes to its success. We show that the benefit of boosting arises from the ability to reweight examples to learn specific rules (resulting in high precision) combined with the ability to continue learning rules after all positive examples have been covered (resulting in high recall). As a quantitative indicator of the regularity of an extraction task, we propose a new measure that we call the SWI ratio. We show that this measure is a good predictor of IE success and a useful tool for analyzing IE tasks. Based on these results, we analyze the strengths and limitations of BWI. Specifically, we explain limitations in the information made available, and in the representations used. We also investigate the consequences of the fact that confidence values returned during extraction are not true probabilities. Next, we investigate the benefits of including grammatical and semantic information for natural text documents, as well as parse tree and attribute-value information for XML and

