Results 1–10 of 28
Information-Theoretic Determination of Minimax Rates of Convergence
 Ann. Statist.
, 1997
"... In this paper, we present some general results determining minimax bounds on statistical risk for density estimation based on certain informationtheoretic considerations. These bounds depend only on metric entropy conditions and are used to identify the minimax rates of convergence. ..."
Abstract

Cited by 98 (18 self)
In this paper, we present some general results determining minimax bounds on statistical risk for density estimation based on certain information-theoretic considerations. These bounds depend only on metric entropy conditions and are used to identify the minimax rates of convergence.
Efficient Agnostic Learning of Neural Networks with Bounded Fan-in
, 1996
"... We show that the class of two layer neural networks with bounded fanin is efficiently learnable in a realistic extension to the Probably Approximately Correct (PAC) learning model. In this model, a joint probability distribution is assumed to exist on the observations and the learner is required to ..."
Abstract

Cited by 68 (18 self)
We show that the class of two-layer neural networks with bounded fan-in is efficiently learnable in a realistic extension to the Probably Approximately Correct (PAC) learning model. In this model, a joint probability distribution is assumed to exist on the observations and the learner is required to approximate the neural network which minimizes the expected quadratic error. As special cases, the model allows learning real-valued functions with bounded noise, learning probabilistic concepts and learning the best approximation to a target function that cannot be well approximated by the neural network. The networks we consider have real-valued inputs and outputs, an unlimited number of threshold hidden units with bounded fan-in, and a bound on the sum of the absolute values of the output weights. The number of computation ... This work was supported by the Australian Research Council and the Australian Telecommunications and Electronics Research Board. The material in this paper was pres...
On the Rate of Convergence of Regularized Boosting Classifiers
 Journal of Machine Learning Research
, 2003
"... A regularized boosting method is introduced, for which regularization is obtained through a penalization function. It is shown through oracle inequalities that this method is model adaptive. The rate of convergence of the probability of misclassification is investigated. It is shown that for quite ..."
Abstract

Cited by 46 (10 self)
A regularized boosting method is introduced, for which regularization is obtained through a penalization function. It is shown through oracle inequalities that this method is model adaptive. The rate of convergence of the probability of misclassification is investigated. It is shown that for quite a large class of distributions, the probability of error converges to the Bayes risk at a rate faster than n^{-(V+2)/(4(V+1))}, where V is the VC dimension of the "base" class whose elements are combined by boosting methods to obtain an aggregated classifier. The dimension-independent nature of the rates may partially explain the good behavior of these methods in practical problems. Under Tsybakov's noise condition the rate of convergence is even faster. We investigate the conditions necessary to obtain such rates for different base classes. The special case of boosting using decision stumps is studied in detail. We characterize the class of classifiers realizable by aggregating decision stumps.
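The aggregation of decision stumps that this analysis covers can be illustrated with a plain AdaBoost loop. This is a minimal sketch of aggregating stumps, not the paper's penalized, regularized variant; all function names are illustrative.

```python
import math

def stump_predict(x, feat, thresh, sign):
    """Decision stump: predict `sign` if x[feat] > thresh, else -sign."""
    return sign if x[feat] > thresh else -sign

def train_adaboost(X, y, rounds):
    """Aggregate decision stumps by AdaBoost; returns a list of (alpha, stump)."""
    n = len(X)
    w = [1.0 / n] * n                      # example weights
    d = len(X[0])
    ensemble = []
    for _ in range(rounds):
        # exhaustive search for the stump minimizing weighted error
        best = None
        for feat in range(d):
            for thresh in sorted({x[feat] for x in X}):
                for sign in (1, -1):
                    err = sum(wi for wi, xi, yi in zip(w, X, y)
                              if stump_predict(xi, feat, thresh, sign) != yi)
                    if best is None or err < best[0]:
                        best = (err, (feat, thresh, sign))
        err, stump = best
        err = max(err, 1e-12)                      # avoid log(1/0)
        alpha = 0.5 * math.log((1 - err) / err)    # weight of this stump
        ensemble.append((alpha, stump))
        # reweight: increase weight on misclassified examples
        w = [wi * math.exp(-alpha * y[i] * stump_predict(X[i], *stump))
             for i, wi in enumerate(w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote of the aggregated stumps."""
    total = sum(a * stump_predict(x, *s) for a, s in ensemble)
    return 1 if total >= 0 else -1
```

The weighted vote `sum(alpha_t * stump_t(x))` is exactly the kind of aggregated classifier whose misclassification rate the paper bounds.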
Approximation theory of the MLP model in neural networks
 Acta Numerica
, 1999
"... In this survey we discuss various approximationtheoretic problems that arise in the multilayer feedforward perceptron (MLP) model in neural networks. Mathematically it is one of the simpler models. Nonetheless the mathematics of this model is not well understood, and many of these problems are appr ..."
Abstract

Cited by 39 (3 self)
In this survey we discuss various approximation-theoretic problems that arise in the multilayer feedforward perceptron (MLP) model in neural networks. Mathematically it is one of the simpler models. Nonetheless the mathematics of this model is not well understood, and many of these problems are approximation-theoretic in character. Most of the research we will discuss is of very recent vintage. We will report on what has been done and on various unanswered questions. We will not be presenting practical (algorithmic) methods. We will, however, be exploring the capabilities and limitations of this model. In the first ...
Neural networks for control
 in Essays on Control: Perspectives in the Theory and its Applications (H.L. Trentelman and
, 1993
"... This paper starts by placing neural net techniques in a general nonlinear control framework. After that, several basic theoretical results on networks are surveyed. 1 ..."
Abstract

Cited by 26 (8 self)
This paper starts by placing neural net techniques in a general nonlinear control framework. After that, several basic theoretical results on networks are surveyed.
Approximation and learning by greedy algorithms
 Ann. Statist
, 2008
"... We consider the problem of approximating a given element f from a Hilbert space H by means of greedy algorithms and the application of such procedures to the regression problem in statistical learning theory. We improve on the existing theory of convergence rates for both the orthogonal greedy algor ..."
Abstract

Cited by 25 (4 self)
We consider the problem of approximating a given element f from a Hilbert space H by means of greedy algorithms and the application of such procedures to the regression problem in statistical learning theory. We improve on the existing theory of convergence rates for both the orthogonal greedy algorithm and the relaxed greedy algorithm, as well as for the forward stepwise projection algorithm. For all these algorithms, we prove convergence results for a variety of function classes and not simply those that are related to the convex hull of the dictionary. We then show how these bounds for convergence rates lead to a new theory for the performance of greedy algorithms in learning. In particular, we build upon the results in [18] to construct learning algorithms based on greedy approximations which are universally consistent and provide provable convergence rates for large classes of functions. The use of greedy algorithms in the context of learning is very appealing since it greatly reduces the computational burden when compared with standard model selection using general dictionaries.
Key Words: Orthogonal, relaxed greedy algorithm, convergence estimates for a scale of interpolation spaces, universal consistency, applications to learning, neural networks.
AMS Subject Classification: 41A25, 41A46, 41A63, 62F12, 62G08
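For orientation, the simplest member of this family of greedy schemes is the pure greedy algorithm (matching pursuit); the orthogonal and relaxed variants studied in the paper refine its update step. A minimal finite-dimensional sketch, taking H = R^n and a list of dictionary vectors (names are illustrative):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def greedy_approx(f, dictionary, steps):
    """Pure greedy algorithm (matching pursuit): at each step pick the
    unit-norm dictionary element g most correlated with the residual r,
    then subtract the projection <r, g> g from the residual."""
    # normalize the dictionary elements to unit norm
    norm_dict = []
    for g in dictionary:
        n = dot(g, g) ** 0.5
        norm_dict.append([gi / n for gi in g])
    r = list(f)                      # residual starts at f itself
    approx = [0.0] * len(f)
    for _ in range(steps):
        # element with the largest |<r, g>|
        g = max(norm_dict, key=lambda g: abs(dot(r, g)))
        c = dot(r, g)
        approx = [a + c * gi for a, gi in zip(approx, g)]
        r = [ri - c * gi for ri, gi in zip(r, g)]
    return approx, r
```

The orthogonal variant would instead re-project f onto the span of all selected elements at every step, which is what yields the improved rates the paper analyzes.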
What size neural network gives optimal generalization? Convergence properties of backpropagation
, 1996
"... One of the most important aspects of any machine learning paradigm is how it scales according to problem size and complexity. Using a task with known optimal training error, and a prespecified maximum number of training updates, we investigate the convergence of the backpropagation algorithm with r ..."
Abstract

Cited by 23 (2 self)
One of the most important aspects of any machine learning paradigm is how it scales according to problem size and complexity. Using a task with known optimal training error, and a prespecified maximum number of training updates, we investigate the convergence of the backpropagation algorithm with respect to a) the complexity of the required function approximation, b) the size of the network in relation to the size required for an optimal solution, and c) the degree of noise in the training data. In general, for a) the solution found is worse when the function to be approximated is more complex, for b) oversized networks can result in lower training and generalization error in certain cases, and for c) the use of committee or ensemble techniques can be more beneficial as the level of noise in the training data is increased. For the experiments we performed, we do not obtain the optimal solution in any case. We further support the observation that larger networks can produce better training and generalization error using a face recognition example where a network with many more parameters than training points generalizes better than smaller networks.
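The backpropagation updates whose convergence these experiments probe can be written down in a few lines for a one-hidden-layer network. This is a toy sketch of batch gradient descent on squared error, not the paper's experimental setup; all names are illustrative.

```python
import math
import random

def train_mlp(data, hidden, epochs, lr, seed=0):
    """One-hidden-layer MLP (tanh hidden units, linear output) trained by
    plain batch gradient descent on squared error; returns per-epoch losses."""
    rng = random.Random(seed)
    d = len(data[0][0])
    # small random initial weights
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b = [0.0] * hidden
    v = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    c = 0.0
    losses = []
    for _ in range(epochs):
        gW = [[0.0] * d for _ in range(hidden)]
        gb = [0.0] * hidden
        gv = [0.0] * hidden
        gc = 0.0
        loss = 0.0
        for x, y in data:
            # forward pass
            h = [math.tanh(sum(W[j][i] * x[i] for i in range(d)) + b[j])
                 for j in range(hidden)]
            out = sum(v[j] * h[j] for j in range(hidden)) + c
            err = out - y
            loss += 0.5 * err * err
            # backpropagate the squared-error gradient
            gc += err
            for j in range(hidden):
                gv[j] += err * h[j]
                dh = err * v[j] * (1 - h[j] ** 2)   # through tanh
                gb[j] += dh
                for i in range(d):
                    gW[j][i] += dh * x[i]
        losses.append(loss)
        # gradient descent step
        c -= lr * gc
        for j in range(hidden):
            v[j] -= lr * gv[j]
            b[j] -= lr * gb[j]
            for i in range(d):
                W[j][i] -= lr * gW[j][i]
    return losses
```

Varying `hidden` relative to the size needed for an optimal solution is exactly the experiment dimension (b) above; with small learning rates the loss typically decreases but need not reach the known optimal training error.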
Greedy Algorithms for Classification: Consistency, Convergence Rates, and Adaptivity
 Journal of Machine Learning Research
, 2002
"... Many regression and classification algorithms proposed over the years can described as greedy procedures for the stagewise minimization of an appropriate cost function. Some examples include additive models, matching pursuit, and Boosting. In this work we focus on the classification problem, for ..."
Abstract

Cited by 18 (5 self)
Many regression and classification algorithms proposed over the years can be described as greedy procedures for the stagewise minimization of an appropriate cost function. Some examples include additive models, matching pursuit, and boosting. In this work we focus on the classification problem, for which many recent algorithms have been proposed and applied successfully. For a specific regularized form of greedy stagewise optimization, we prove consistency of the approach under rather general conditions. Focusing on specific classes of problems we provide conditions under which our greedy procedure achieves the (nearly) minimax rate of convergence, implying that the procedure cannot be improved in a worst case setting. We also construct a fully adaptive procedure, which, without knowing the smoothness parameter of the decision boundary, converges at the same rate as if the smoothness parameter were known.
Minimax nonparametric classification, Part I: Rates of convergence

, 1998
"... This paper studies minimax aspects of nonparametric classification. We first study minimax estimation of the conditional probability of a class label, given the feature variable. This function, say f � is assumed to be in a general nonparametric class. We show the minimax rate of convergence under ..."
Abstract

Cited by 16 (0 self)
This paper studies minimax aspects of nonparametric classification. We first study minimax estimation of the conditional probability of a class label, given the feature variable. This function, say f, is assumed to be in a general nonparametric class. We show the minimax rate of convergence under square L2 loss is determined by the massiveness of the class as measured by metric entropy. The second part of the paper studies minimax classification. The loss of interest is the difference between the probability of misclassification of a classifier and that of the Bayes decision. As is well-known, an upper bound on risk for estimating f gives an upper bound on the risk for classification, but the rate is known to be suboptimal for the class of monotone functions. This suggests that one does not have to estimate f well in order to classify well. However, we show that the two problems are in fact of the same difficulty in terms of rates of convergence under a sufficient condition, which is satisfied by many function classes including Besov (Sobolev), Lipschitz, and bounded variation. This is somewhat surprising in view of a result of Devroye, Györfi, and Lugosi (1996).
Foundations Of Recurrent Neural Networks
, 1993
"... "Artificial neural networks" provide an appealing model of computation. Such networks consist of an interconnection of a number of parallel agents, or "neurons." Each of these receives certain signals as inputs, computes some simple function, and produces a signal as output, which is in turn broadca ..."
Abstract

Cited by 14 (6 self)
"Artificial neural networks" provide an appealing model of computation. Such networks consist of an interconnection of a number of parallel agents, or "neurons." Each of these receives certain signals as inputs, computes some simple function, and produces a signal as output, which is in turn broadcast to the successive neurons involved in a given computation. Some of the signals originate from outside the network, and act as inputs to the whole system, while some of the output signals are communicated back to the environment and are used to encode the end result of computation. In this dissertation we focus on the "recurrent network" model, in which the underlying graph is not subject to any constraints. We investigate the computational power of neural nets, taking a classical computer science point of view. We characterize the language re...