Results 1–10 of 47
Approximation theory of the MLP model in neural networks
 ACTA NUMERICA
, 1999
Abstract

Cited by 59 (3 self)
In this survey we discuss various approximation-theoretic problems that arise in the multilayer feedforward perceptron (MLP) model in neural networks. Mathematically it is one of the simpler models. Nonetheless the mathematics of this model is not well understood, and many of these problems are approximation-theoretic in character. Most of the research we will discuss is of very recent vintage. We will report on what has been done and on various unanswered questions. We will not be presenting practical (algorithmic) methods. We will, however, be exploring the capabilities and limitations of this model. In the first ...
Algebraic analysis for nonidentifiable learning machines
 Neural Computation
Abstract

Cited by 59 (18 self)
This paper clarifies the relation between the learning curve and the algebraic geometrical structure of a nonidentifiable learning machine, such as a multilayer neural network, whose true parameter set is an analytic set with singular points. Using a concept from algebraic analysis, we rigorously prove that the Bayesian stochastic complexity or free energy is asymptotically equal to λ1 log n − (m1 − 1) log log n + constant, where n is the number of training samples and λ1 and m1 are the rational number and the natural number determined as the birational invariant values of the singularities in the parameter space. We also present an algorithm to calculate λ1 and m1 based on the resolution of singularities in algebraic geometry. In regular statistical models, 2λ1 is equal to the number of parameters and m1 = 1, whereas in nonregular models such as multilayer networks, 2λ1 is not larger than the number of parameters and m1 ≥ 1. Since the increase of the stochastic complexity equals the learning curve or the generalization error, nonidentifiable learning machines are better models than regular ones when Bayesian ensemble learning is applied.
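The asymptotic formula in this abstract can be evaluated numerically. Below is a minimal sketch comparing a regular model (2λ1 = number of parameters, m1 = 1) with a singular one (2λ1 smaller, m1 ≥ 1); the specific λ1, m1, and constant values are illustrative, not computed from any actual network:

```python
import math

def stochastic_complexity(n, lam, m, const=0.0):
    """Asymptotic Bayesian stochastic complexity (free energy):
    F(n) ~ lam * log(n) - (m - 1) * log(log(n)) + const."""
    return lam * math.log(n) - (m - 1) * math.log(math.log(n)) + const

# Regular model with d = 10 parameters: 2*lam = d, m = 1.
f_regular = stochastic_complexity(100000, lam=5.0, m=1)

# Singular model (e.g. a multilayer network): 2*lam <= d, m >= 1.
f_singular = stochastic_complexity(100000, lam=3.0, m=2)

# The singular model accumulates free energy more slowly; since the
# increase of F(n) is the generalization error, it generalizes better
# under Bayesian ensemble learning.
print(f_regular > f_singular)
```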
Nonparametric time series prediction through adaptive model selection
 Machine Learning
, 2000
Abstract

Cited by 58 (0 self)
We consider the problem of one-step-ahead prediction for time series generated by an underlying stationary stochastic process obeying the condition of absolute regularity, which describes the mixing nature of the process. We make use of recent results from the theory of empirical processes, and adapt the uniform convergence framework of Vapnik and Chervonenkis to the problem of time series prediction, obtaining finite-sample bounds. Furthermore, by allowing both the model complexity and memory size to be adaptively determined by the data, we derive nonparametric rates of convergence through an extension of the method of structural risk minimization suggested by Vapnik. All our results are derived for general Lp error measures, and apply to both exponentially and algebraically mixing processes.
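The adaptive determination of model complexity described here follows the structural-risk-minimization pattern: among candidate model classes, pick the one minimizing empirical risk plus a complexity penalty. A minimal sketch, where the sqrt(h log n / n) penalty is a generic VC-style term and the candidate values are made up for illustration, not the paper's actual finite-sample bound:

```python
import math

def srm_select(candidates, n):
    """Return the index of the model minimizing
    empirical_risk + sqrt(h * log(n) / n),
    where `candidates` is a list of (empirical_risk, complexity h) pairs."""
    def penalized(item):
        risk, h = item
        return risk + math.sqrt(h * math.log(n) / n)
    return min(range(len(candidates)), key=lambda i: penalized(candidates[i]))

# Three nested models: richer classes fit the sample better
# but pay a larger complexity penalty.
models = [(0.30, 2), (0.10, 10), (0.08, 200)]
print(srm_select(models, n=500))
```

With 500 samples the middle model wins the trade-off: the richest class barely improves the fit yet its penalty dominates.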
Generalization Bounds for Function Approximation from Scattered Noisy Data
, 1998
Abstract

Cited by 37 (1 self)
In this paper we investigate the problem of providing error bounds for the approximation of an unknown function from scattered, noisy data. This problem has particular relevance in the field of machine learning, where the unknown function represents the task to be learned and the scattered data represent the examples of this task. An obvious quantity of interest is the generalization error, a measure of how much the result of the approximation scheme differs from the unknown function, typically studied as a function of the number of data points. Since the data are randomly generated and noisy, the analysis of the generalization error necessarily involves statistical considerations in addition to the traditional ...
Approximation by ridge functions and neural networks
 SIAM J. Math. Anal
, 1999
Abstract

Cited by 36 (2 self)
We investigate the efficiency of approximation by linear combinations of ridge functions in the metric of L2(Bd), with Bd the unit ball in Rd. If Xn is an n-dimensional linear space of univariate functions in L2(I), I = [−1, 1], and Ω is a subset of the unit sphere Sd−1 in Rd of cardinality m, then the space Yn := span{r(x·ξ) : r ∈ Xn, ξ ∈ Ω} is a linear space of ridge functions of dimension ≤ mn. We show that if Xn provides order of approximation O(n−r) for univariate functions with r derivatives in L2(I), and Ω is a properly chosen set of cardinality O(nd−1), then Yn will provide approximation of order O(n−r−d/2+1/2) for every function f ∈ L2(Bd) with smoothness of order r + d/2 − 1/2 in L2(Bd). Thus the theorems we obtain show that this form of ridge approximation has the same efficiency as other, more traditional methods of multivariate approximation such as polynomials, splines, or wavelets. The theorems can be applied to show that a feedforward neural network with one hidden layer of computational nodes given by certain sigmoidal functions σ will also have this approximation efficiency. Minimal requirements are made of the sigmoidal functions; in particular our results hold for the unit-impulse function σ = χ[0,∞).
Hierarchical Mixtures-of-Experts for Exponential Family Regression Models: Approximation and Maximum Likelihood Estimation
 Ann. Statistics
, 1999
Abstract

Cited by 24 (2 self)
In this paper we consider the denseness and consistency of these models in the generalized linear model context. Before proceeding we present some notation regarding mixtures and hierarchical mixtures of generalized linear models and one-parameter exponential family regression models. Generalized linear models are widely used in statistical practice [McCullagh and Nelder (1989)]. One-parameter exponential family regression models [see Bickel and Doksum (1977), page 67] with generalized linear mean functions (GLM1) are special examples of generalized linear models, where the probability distribution can be parameterized by the mean function. In the regression context, a GLM1 model proposes that the conditional expectation μ(x) of a real response variable y (the output) is related to a vector of predictors (or inputs) ...
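The conditional mean such a mixture model proposes can be pictured concretely. Below is a minimal one-level mixture-of-experts sketch in which softmax gates blend the mean predictions of linear experts; the scalar input and the particular gate/expert parameterization are illustrative assumptions, not the paper's exact HME specification:

```python
import math

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_mean(x, gates, experts):
    """Conditional mean of a one-level mixture of linear experts:
    mu(x) = sum_i g_i(x) * mu_i(x), with softmax gating.
    gates:   list of (v_i, b_i) giving gate scores v_i*x + b_i
    experts: list of (w_i, c_i) giving expert means w_i*x + c_i"""
    g = softmax([v * x + b for v, b in gates])
    mus = [w * x + c for w, c in experts]
    return sum(gi * mi for gi, mi in zip(g, mus))

# Two experts: the first dominates for small x, the second for large x.
gates = [(-2.0, 0.0), (2.0, 0.0)]
experts = [(0.5, 0.0), (2.0, 1.0)]
print(moe_mean(-3.0, gates, experts))  # close to the first expert's mean
print(moe_mean(3.0, gates, experts))   # close to the second expert's mean
```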
Almost Linear VC Dimension Bounds for Piecewise Polynomial Networks
 Neural Computation
, 1998
Abstract

Cited by 12 (1 self)
We compute upper and lower bounds on the VC dimension of feedforward networks of units with piecewise polynomial activation functions. We show that if the number of layers is fixed, then the VC dimension grows as W log W, where W is the number of parameters in the network. This stands in contrast to the case where the number of layers is unbounded, in which the VC dimension grows as W^2.
1 MOTIVATION
The VC dimension is an important measure of the complexity of a class of binary-valued functions, since it characterizes the amount of data required for learning in the PAC setting (see [BEHW89, Vap82]). In this paper, we establish upper and lower bounds on the VC dimension of a specific class of multilayered feedforward neural networks. Let F be the class of binary-valued functions computed by a feedforward neural network with W weights and k computational (non-input) units, each with a piecewise polynomial activation function. Goldberg and Jerrum [GJ95] have shown that ...
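The gap between the two growth regimes can be made tangible with a numeric sketch; the constant c below is purely illustrative, not a constant from the paper:

```python
import math

def vc_fixed_depth(W, c=1.0):
    """Illustrative growth rate c * W * log(W) for fixed-depth
    piecewise-polynomial networks (c is an assumed constant)."""
    return c * W * math.log(W)

def vc_unbounded_depth(W, c=1.0):
    """Illustrative growth rate c * W**2 when depth is unbounded."""
    return c * W ** 2

# The ratio W / log W between the two regimes widens with network size:
for W in (100, 10000, 1000000):
    print(W, round(vc_unbounded_depth(W) / vc_fixed_depth(W)))
```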
Towards Robust Model Selection using Estimation and Approximation Error Bounds
 Proc. 9 th Annual Conference on Computational Learning Theory, p.57, ACM
, 1996
Abstract

Cited by 11 (8 self)
In this paper we extend previous work [17] and introduce a novel model selection criterion based on combining two recent chains of thought. In particular we make use of the powerful framework of uniform convergence of empirical processes pioneered by Vapnik and Chervonenkis [23], combined with recent results concerning the approximation ability of nonlinear manifolds of functions, focusing in particular on feedforward neural networks. The main contributions of this work are twofold: (i) conceptual: elucidating a coherent and robust framework for model selection; (ii) technical: the main contribution here is a lower bound on the approximation error (Theorem 10), which holds in a well-specified sense for most functions of interest. As far as we are aware, this result is new in the field of function approximation. The remainder of the paper is organized as follows. In ...
On the Approximation of Functional Classes Equipped with a Uniform Measure Using Ridge Functions
, 1999
Abstract

Cited by 8 (5 self)
The contributions of this paper are threefold: (i) the construction of a uniform measure over a functional class B which is similar to a Besov class; (ii) proving a lower bound on the degree of approximation by ridge functions which holds for all functions in some subset of B of probability measure 1 − δ with respect to the uniform measure; (iii) introducing a probabilistic width dn,δ for nonlinear approximation and estimating dn,δ(B, μ, Mn) for a uniform measure μ.
Error bounds for functional approximation and estimation using mixtures of experts
, 1998
Abstract

Cited by 8 (5 self)
We examine some mathematical aspects of learning unknown mappings with the Mixture of Experts Model (MEM). Specifically, we observe that the MEM is at least as powerful as a class of neural networks, in a sense that will be made precise. Upper bounds on the approximation error are established for a wide class of target functions. The general theorem states that inf ‖f − fn‖p ≤ c/n^(r/d) holds uniformly for f ∈ Wr(L) (a Sobolev class over [−1, 1]^d), where fn belongs to an n-dimensional manifold of normalized ridge functions. The same bound holds for the MEM as a special case of the above. The stochastic error, in the context of learning from i.i.d. examples, is also examined. An asymptotic analysis establishes the limiting behavior of this error in terms of certain pseudo-information matrices. These results substantiate the intuition behind the MEM and motivate applications.
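The c/n^(r/d) bound makes the curse of dimensionality explicit: for fixed smoothness r, raising the input dimension d slows the rate. A small numeric sketch (the constant c and the chosen n, r, d values are illustrative assumptions):

```python
def ridge_error_bound(n, r, d, c=1.0):
    """Upper bound c / n**(r/d) on the approximation error for a
    target with smoothness r over a d-dimensional domain, using an
    n-dimensional manifold of ridge functions (c is assumed)."""
    return c / n ** (r / d)

# Fixed smoothness r = 2, fixed budget n = 10000:
# the bound degrades sharply as the dimension d grows.
for d in (2, 10, 50):
    print(d, ridge_error_bound(n=10000, r=2, d=d))
```

Conversely, increasing the smoothness r at fixed d tightens the bound, which is the usual trade-off in nonlinear approximation rates of this form.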