## Neural Networks: A Pattern Recognition Perspective (1996)

Citations: 2 (0 self)

### BibTeX

```bibtex
@techreport{Bishop96neuralnetworks,
  author      = {Christopher M. Bishop},
  title       = {Neural Networks: A Pattern Recognition Perspective},
  institution = {},
  year        = {1996}
}
```

### Abstract

Neural networks have been exploited in a wide variety of applications, the majority of which are concerned with pattern recognition in one form or another. However, it has become widely acknowledged that the effective solution of all but the simplest of such problems requires a principled treatment, in other words one based on a sound theoretical framework. From the perspective of pattern recognition, neural networks can be regarded as an extension of the many conventional techniques which have been developed over several decades. Lack of understanding of the basic principles of statistical pattern recognition lies at the heart of many of the common mistakes in the application of neural networks. In this chapter we aim to show that the 'black box' stigma of neural networks is largely unjustified, and that there is actually considerable insight available into the way in which neural networks operate, and how to use them effectively. Some of the ke...

### Citations

4828 |
Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context: ...squares error function was derived from the requirement that the network output vector should represent the conditional mean of the target data, as a function of the input vector. It is easily shown (Bishop, 1995) that minimization of this error, for an infinitely large data set and a highly flexible network model, does indeed lead to a network satisfying this property. We have derived the sum-of-squares erro... |
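The excerpt above concerns the sum-of-squares error and the conditional mean. As a quick numerical sketch (illustrative, not code from the chapter), one can check that the constant prediction minimizing a sum-of-squares error over noisy targets is the sample mean, the finite-sample analogue of the network output approximating E[t|x]:

```python
import numpy as np

# Sketch: for a fixed input x, the constant prediction that minimizes the
# sum-of-squares error over noisy targets is the sample mean -- mirroring
# the claim that the trained network output approximates E[t|x].
rng = np.random.default_rng(0)
targets = 2.0 + rng.normal(0.0, 0.5, size=10_000)  # noisy targets at one x

candidates = np.linspace(0.0, 4.0, 2001)
errors = [np.sum((targets - y) ** 2) for y in candidates]
best = candidates[int(np.argmin(errors))]

print(abs(best - targets.mean()) < 1e-2)  # minimizer is the sample mean
```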

3921 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...sification error will occur if we assign a new pattern to class $C_1$ when in fact it belongs to class $C_2$, or vice versa. We can calculate the total probability of an error of either kind by writing (Duda and Hart, 1973) $P(\text{error}) = P(x \in R_2, C_1) + P(x \in R_1, C_2) = P(x \in R_2 | C_1) P(C_1) + P(x \in R_1 | C_2) P(C_2) = \int_{R_2} p(x|C_1) P(C_1)\,dx + \int_{R_1} p(x|C_2) P(C_2)\,dx$ (15), where $P(x \in R_1, C_2)$ is t... |
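Equation (15) in the excerpt can be evaluated numerically. The following sketch (the class densities and priors are illustrative choices, not taken from the chapter) computes the total misclassification probability for two 1-D Gaussian classes under the Bayes decision rule:

```python
import numpy as np

def gauss(x, mu, sigma):
    # Normal density N(x; mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative setup: two classes with equal priors P(C1) = P(C2) = 0.5.
x = np.linspace(-10, 10, 200_001)
d1 = gauss(x, -1.0, 1.0) * 0.5   # p(x|C1) P(C1)
d2 = gauss(x, +1.0, 1.0) * 0.5   # p(x|C2) P(C2)

# Bayes decision regions: assign to C1 where d1 > d2 (region R1).
in_r1 = d1 > d2

# P(error) = int over R2 of p(x|C1)P(C1) dx + int over R1 of p(x|C2)P(C2) dx.
dx = x[1] - x[0]
p_error = np.sum(np.where(in_r1, d2, d1)) * dx
print(round(p_error, 3))  # about 0.159 for unit-variance classes at -1 and +1
```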

2723 |
Learning internal representations by error propagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context: ...fficient compared with a simple direct evaluation of derivatives. For network training algorithms, this efficiency is crucial. 8. The original learning algorithm for multi-layer feed-forward networks (Rumelhart et al., 1986) was based on gradient descent. In fact the problem of optimizing the weights in a network corresponds to unconstrained non-linear optimization, for which many substantially more powerful algorithms h... |
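As a minimal sketch of the gradient descent mentioned in this excerpt (the data, model, and step size here are illustrative, not the chapter's), consider fitting the weights of a linear model by repeatedly stepping down the gradient of a sum-of-squares error:

```python
import numpy as np

# Sketch of plain gradient descent on E(w) = ||Xw - t||^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([0.5, -2.0, 1.0])
t = X @ true_w                      # noise-free targets for clarity

w = np.zeros(3)
eta = 0.001                         # fixed learning rate (hand-tuned)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - t)    # dE/dw
    w -= eta * grad

print(np.allclose(w, true_w, atol=1e-3))
```

In practice the conjugate gradient and quasi-Newton methods cited elsewhere on this page converge far faster and need no hand-tuned learning rate.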

2649 | Introduction to Statistical Pattern Recognition, 2nd edition - Fukunaga - 1990 |

1346 |
Practical Optimization
- Gill, Murray, et al.
- 1981
Citation Context: ...timal values for these parameters will often vary during the optimization process. In fact much more powerful techniques have been developed for solving non-linear optimization problems (Polak, 1971; Gill et al., 1981; Dennis and Schnabel, 1983; Luenberger, 1984; Fletcher, 1987; Bishop, 1995). These include conjugate gradient methods, quasi-Newton algorithms, and the Levenberg-Marquardt technique. It should be not... |

1225 | Multilayer feedforward networks are universal approximators. Neural Netw - Hornik, Stinchcombe, et al. - 1989 |

1081 |
Practical Methods of Optimization
- Fletcher
- 1981
Citation Context: ...imization process. In fact much more powerful techniques have been developed for solving non-linear optimization problems (Polak, 1971; Gill et al., 1981; Dennis and Schnabel, 1983; Luenberger, 1984; Fletcher, 1987; Bishop, 1995). These include conjugate gradient methods, quasi-Newton algorithms, and the Levenberg-Marquardt technique. It should be noted that the term back-propagation is used in the neural compu... |

1040 |
Linear and Nonlinear Programming
- Luenberger
- 1984
Citation Context: ...ary during the optimization process. In fact much more powerful techniques have been developed for solving non-linear optimization problems (Polak, 1971; Gill et al., 1981; Dennis and Schnabel, 1983; Luenberger, 1984; Fletcher, 1987; Bishop, 1995). These include conjugate gradient methods, quasi-Newton algorithms, and the Levenberg-Marquardt technique. It should be noted that the term back-propagation is used in ... |

942 |
Numerical Methods for Unconstrained Optimization and Nonlinear Equations
- Dennis, Schnabel
- 1983
Citation Context: ...ese parameters will often vary during the optimization process. In fact much more powerful techniques have been developed for solving non-linear optimization problems (Polak, 1971; Gill et al., 1981; Dennis and Schnabel, 1983; Luenberger, 1984; Fletcher, 1987; Bishop, 1995). These include conjugate gradient methods, quasi-Newton algorithms, and the Levenberg-Marquardt technique. It should be noted that the term back-propa... |

841 | Approximation by superpositions of a sigmoidal function - Cybenko - 1989 |

721 |
Cross-Validatory Choices and Assessment of Statistical Prediction (with Discussion
- Stone
- 1974
Citation Context: ...see that the optimum complexity can be chosen by comparing the performance of a range of trained models using an independent test set. A more elaborate version of this procedure is cross-validation (Stone, 1974, 1978; Wahba and Wold, 1975). Instead of directly varying the number of adaptive parameters in a network, the effective complexity of the model may be controlled through the technique of regularizati... |
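The cross-validation procedure cited here can be sketched in a few lines. This toy example (the data, fold scheme, and polynomial candidates are illustrative choices) scores each candidate model complexity by its average held-out squared error:

```python
import numpy as np

# Sketch of S-fold cross-validation for model selection: each candidate
# polynomial degree is scored by its mean squared error on held-out folds.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 60)
t = np.sin(np.pi * x) + rng.normal(0, 0.1, 60)

def cv_error(degree, folds=5):
    idx = np.arange(len(x))
    errs = []
    for f in range(folds):
        test = idx % folds == f          # every folds-th point held out
        coef = np.polyfit(x[~test], t[~test], degree)
        pred = np.polyval(coef, x[test])
        errs.append(np.mean((pred - t[test]) ** 2))
    return np.mean(errs)

scores = {d: cv_error(d) for d in range(10)}
best = min(scores, key=scores.get)
print(best)  # a moderate degree should beat both the under- and over-fit extremes
```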

645 | Pattern recognition: A statistical approach - Devijver, Kittler - 1982 |

608 | Bayesian Learning for Neural Networks
- Neal
- 1996
Citation Context: ...to a practical scheme which involves relatively small modifications to conventional algorithms. An alternative approach to the Bayesian treatment of neural networks is to use Monte Carlo techniques (Neal, 1994) to perform the required integrations numerically without making analytical approximations. Again, this leads to a practical scheme which has been applied to some real-world problems. An interesting ... |
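The Monte Carlo alternative mentioned here replaces analytic integration over weights with posterior sampling. The following sketch (a one-parameter toy posterior with a plain Metropolis random walk, far simpler than Neal's hybrid Monte Carlo) conveys the idea:

```python
import numpy as np

# Sketch: sample a 1-D "weight" from its posterior (Gaussian likelihood
# with known noise variance 0.25, standard normal prior) by Metropolis,
# then estimate the posterior mean from the samples.
rng = np.random.default_rng(4)
data = rng.normal(1.0, 0.5, size=50)

def log_post(w):
    return -0.5 * np.sum((data - w) ** 2) / 0.25 - 0.5 * w ** 2

w, samples = 0.0, []
for _ in range(20_000):
    prop = w + rng.normal(0, 0.2)            # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(w):
        w = prop                             # accept
    samples.append(w)

posterior_mean = np.mean(samples[5_000:])    # discard burn-in
print(abs(posterior_mean - np.mean(data)) < 0.1)
```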

522 | Bayesian interpolation
- Mackay
- 1992
Citation Context: ...Although the Bayesian approach is very appealing, a full implementation is intractable for neural networks. Two principal approximation schemes have therefore been considered. In the first of these (MacKay, 1992a, 1992b, 1992c) the distribution over weights is approximated by a Gaussian centred on the most probable weight vector. Integrations over weight space can then be performed analytically, and this lea... |

476 |
Fast learning in networks of locally-tuned processing units
- Moody, Darken
- 1989
Citation Context: ...was given by Hornik et al. (1990). 5 The other major class of network model, which also possesses universal approximation capabilities, is the radial basis function network (Broomhead and Lowe, 1988; Moody and Darken, 1989). Such networks again take the form (4), but the basis functions now depend on some measure of distance between the input vector $x$ and a prototype vector $\mu_j$. A typical example would be a Gaussian bas... |
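The radial basis function network described in this excerpt can be sketched directly: Gaussian basis functions centred on prototype vectors, followed by a linear output layer. The prototypes, width, and weights below are illustrative values, not a trained model:

```python
import numpy as np

def rbf_forward(x, prototypes, sigma, w, b):
    # phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2)); output = sum_j w_j phi_j + b
    d2 = np.sum((x[:, None, :] - prototypes[None, :, :]) ** 2, axis=2)
    phi = np.exp(-d2 / (2 * sigma ** 2))
    return phi @ w + b

prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])   # the mu_j
w = np.array([1.0, -1.0])                          # output-layer weights
x = np.array([[0.0, 0.0], [1.0, 1.0]])             # evaluate at the prototypes
y = rbf_forward(x, prototypes, sigma=0.5, w=w, b=0.0)
print(y)
```

Each input sitting exactly on a prototype activates that basis function fully (phi = 1) while the other contributes only the tail of its Gaussian.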

446 |
Adaptive Control Processes: A Guided Tour
- Bellman
- 1961
Citation Context: ...w like $d^M$, which represents a dramatic growth in the number of degrees of freedom in the model as the dimensionality of the input space increases. This is an example of the curse of dimensionality (Bellman, 1961). The presence of a large number of adaptive parameters in a model can cause major problems, as we shall discuss in Section 4. In order that the model make good predictions for new inputs it is necess... |
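The $d^M$ growth in the excerpt is easy to make concrete: a general polynomial of order $M$ in $d$ inputs has $\binom{d+M}{M}$ independent coefficients (a standard counting result), which for fixed $M$ grows roughly like $d^M$:

```python
from math import comb

# Number of independent coefficients in an order-M polynomial of d variables.
def n_coefficients(d, M):
    return comb(d + M, M)

for d in (1, 10, 100):
    print(d, n_coefficients(d, 3))  # cubic models: 4, 286, 176851 coefficients
```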

434 |
Multivariable functional interpolation and adaptive networks
- Broomhead, Lowe
- 1988
Citation Context: ...ction and its derivatives was given by Hornik et al. (1990). 5 The other major class of network model, which also possesses universal approximation capabilities, is the radial basis function network (Broomhead and Lowe, 1988; Moody and Darken, 1989). Such networks again take the form (4), but the basis functions now depend on some measure of distance between the input vector $x$ and a prototype vector $\mu_j$. A typical example... |

406 | Projection pursuit regression
- Friedman, Stuetzle
- 1981
Citation Context: ...[Figure 2. An example of a feed-forward network having two layers of adaptive weights.] the statistics literature and include, for example, projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985), which has a form remarkably similar to that of the feed-forward network discussed above. The procedures for determining the parameters in projection pursuit regression are, however, qui... |

399 | A practical Bayesian framework for backpropagation networks
- MacKay
- 1992
Citation Context: ...Although the Bayesian approach is very appealing, a full implementation is intractable for neural networks. Two principal approximation schemes have therefore been considered. In the first of these (MacKay, 1992a, 1992b, 1992c) the distribution over weights is approximated by a Gaussian centred on the most probable weight vector. Integrations over weight space can then be performed analytically, and this lea... |

339 | On the Approximate Realization of Continuous Mappings by Neural Networks - Funahashi - 1989 |

339 | Connectionist Learning Procedures
- Hinton
- 1989
Citation Context: ... (27). As usual, it is more convenient to minimize the negative logarithm of the likelihood. This leads to the cross-entropy error function (Hopfield, 1987; Baum and Wilczek, 1988; Solla et al., 1988; Hinton, 1989; Hampshire and Pearlmutter, 1990) in the form $E = -\sum_n \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$ (28). For the network model introduced in (4) the outputs were linear functions of the a... |
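The two-class cross-entropy error of equation (28), $E = -\sum_n \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$, is a one-liner to evaluate; the targets and outputs below are illustrative:

```python
import numpy as np

# Cross-entropy error for binary targets t_n and network outputs y_n in (0, 1).
def cross_entropy(y, t, eps=1e-12):
    y = np.clip(y, eps, 1 - eps)   # guard the logarithms at 0 and 1
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

t = np.array([1.0, 0.0, 1.0])      # target class labels
y = np.array([0.9, 0.1, 0.8])      # network outputs (posterior estimates)
print(round(cross_entropy(y, t), 4))  # -(ln 0.9 + ln 0.9 + ln 0.8) = 0.4339
```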

304 | Backpropagation applied to handwritten zip code recognition - Cun, Boser, et al. - 1989 |

232 |
Projection pursuit
- Huber
- 1985
Citation Context: ...[Figure 2. An example of a feed-forward network having two layers of adaptive weights.] the statistics literature and include, for example, projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985), which has a form remarkably similar to that of the feed-forward network discussed above. The procedures for determining the parameters in projection pursuit regression are, however, quite different ... |

204 | Approximation capabilities of multilayer feedforward networks - Hornik - 1991 |

158 | Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks - Hornik, Stinchcombe, et al. - 1990 |

153 | The evidence framework applied to classification networks
- MacKay
- 1992
Citation Context: ...Although the Bayesian approach is very appealing, a full implementation is intractable for neural networks. Two principal approximation schemes have therefore been considered. In the first of these (MacKay, 1992a, 1992b, 1992c) the distribution over weights is approximated by a Gaussian centred on the most probable weight vector. Integrations over weight space can then be performed analytically, and this lea... |

139 | Discrimination and Classification - Hand - 1981 |

106 | Neurocomputing: Foundations of Research - Anderson, Rosenfeld - 1989 |

93 |
Backpropagation: The basic theory
- Rumelhart, Durbin, et al.
- 1993
Citation Context: ...he hidden units. While this is appropriate for regression problems, we need to consider the correct choice of output unit activation function for the case of classification problems. We shall assume (Rumelhart et al., 1995) that the class-conditional distributions of the outputs of the hidden units, represented here by the vector $z$, are described by $p(z|C_k) = \exp\{ A(\theta_k) + B(z, \phi) + \theta_k^{\mathrm{T}} z \}$ (29), which is a membe... |

88 | Theory of the backpropagation neural networks - Hecht-Nielsen - 1989 |

76 |
Computational Methods in Optimization: A Unified Approach
- Polak
- 1971
Citation Context: ...rmore, the optimal values for these parameters will often vary during the optimization process. In fact much more powerful techniques have been developed for solving non-linear optimization problems (Polak, 1971; Gill et al., 1981; Dennis and Schnabel, 1983; Luenberger, 1984; Fletcher, 1987; Bishop, 1995). These include conjugate gradient methods, quasi-Newton algorithms, and the Levenberg-Marquardt techniqu... |

56 |
Accelerated learning in layered neural networks
- Solla, Levin, et al.
- 1988
Citation Context: ...$(1 - y_n)^{1 - t_n}$ (27). As usual, it is more convenient to minimize the negative logarithm of the likelihood. This leads to the cross-entropy error function (Hopfield, 1987; Baum and Wilczek, 1988; Solla et al., 1988; Hinton, 1989; Hampshire and Pearlmutter, 1990) in the form $E = -\sum_n \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$ (28). For the network model introduced in (4) the outputs were linear func... |

54 | Equivalence proofs for multi-layer perceptron classifiers and the Bayes discriminant function
- Hampshire, Pearlmutter
- 1990
Citation Context: ..., it is more convenient to minimize the negative logarithm of the likelihood. This leads to the cross-entropy error function (Hopfield, 1987; Baum and Wilczek, 1988; Solla et al., 1988; Hinton, 1989; Hampshire and Pearlmutter, 1990) in the form $E = -\sum_n \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$ (28). For the network model introduced in (4) the outputs were linear functions of the activations of the hidden units. W... |

48 | Mixture density networks
- Bishop
- 1994
Citation Context: ...ch cases is to combine a feed-forward network with a Gaussian mixture model (i.e. a linear combination of Gaussian functions), thereby allowing general conditional distributions $p(t|x)$ to be modelled (Bishop, 1994). [3.2 Error functions for classification] In the case of classification problems, the goal as we have seen is to approximate the posterior probabilities of class membership $P(C_k|x)$ given the inp... |
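The mixture-density idea in this excerpt is that the network maps $x$ to the parameters of a Gaussian mixture over $t$, giving a general conditional density $p(t|x)$. In the sketch below the "network" is a hand-coded, purely illustrative parameter mapping, not a trained model:

```python
import numpy as np

# Hypothetical parameter mapping standing in for a trained network:
# two mixture components whose means depend on the input x.
def mixture_params(x):
    alpha = np.array([0.5, 0.5])            # mixing coefficients (sum to 1)
    mu = np.array([np.sin(x), -np.sin(x)])  # component means
    sigma = np.array([0.1, 0.1])            # component widths
    return alpha, mu, sigma

# Conditional density p(t|x) as a sum of weighted Gaussian components.
def p_t_given_x(t, x):
    alpha, mu, sigma = mixture_params(x)
    norm = alpha / (sigma * np.sqrt(2 * np.pi))
    return np.sum(norm * np.exp(-0.5 * ((t - mu) / sigma) ** 2))

# The resulting conditional density is bimodal: either mode is far more
# probable than the conditional mean, which a sum-of-squares network
# would be forced to predict.
x = 1.0
print(p_t_given_x(np.sin(x), x) > p_t_given_x(0.0, x))
```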

47 |
Supervised learning of probability distributions by neural networks
- Baum, Wilczek
- 1988
Citation Context: ...$\prod_n (y_n)^{t_n} (1 - y_n)^{1 - t_n}$ (27). As usual, it is more convenient to minimize the negative logarithm of the likelihood. This leads to the cross-entropy error function (Hopfield, 1987; Baum and Wilczek, 1988; Solla et al., 1988; Hinton, 1989; Hampshire and Pearlmutter, 1990) in the form $E = -\sum_n \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$ (28). For the network model introduced in (4) the outp... |

47 |
Exact calculation of the hessian matrix for the multilayer perceptron
- Bishop
- 1992
Citation Context: ...r than the simple sum-of-squares, and to the evaluation of other quantities such as the Hessian matrix, whose elements comprise the second derivatives of the error function with respect to the weights (Bishop, 1992). Similarly, the second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes (discussed above), many of which are substantially more effe... |
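For a linear model with sum-of-squares error $E(w) = \|Xw - t\|^2$ the Hessian is exactly $2X^{\mathrm{T}}X$, which makes a convenient test case for the excerpt's second-derivative computations. This sketch (illustrative data; for multi-layer networks the exact Hessian needs the extended back-propagation of Bishop, 1992) confirms a central finite-difference estimate against the closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))
t = rng.normal(size=20)
E = lambda w: np.sum((X @ w - t) ** 2)   # sum-of-squares error

w0 = np.zeros(2)
h = 1e-4
H_fd = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        # Central finite difference for d2E / dw_i dw_j.
        ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
        H_fd[i, j] = (E(w0 + ei + ej) - E(w0 + ei - ej)
                      - E(w0 - ei + ej) + E(w0 - ei - ej)) / (4 * h * h)

print(np.allclose(H_fd, 2 * X.T @ X, atol=1e-4))  # matches the exact Hessian
```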

39 | Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions - Stinchcombe, White - 1989 |

36 | Bayesian methods: General background
- Jaynes
- 1986
Citation Context: ...d on the principle of maximum likelihood, which itself stems from the frequentist school of statistics. A more fundamental, and potentially more powerful, approach is given by the Bayesian viewpoint (Jaynes, 1986). Instead of describing a trained network by a single weight vector $w$, the Bayesian approach expresses our uncertainty in the values of the weights through a probability distribution $p(w)$. The ef... |

35 |
Learning algorithms and probability distributions in feedforward and feed-back networks
- Hopfield
- 1987
Citation Context: ...is then given by $\prod_n (y_n)^{t_n} (1 - y_n)^{1 - t_n}$ (27). As usual, it is more convenient to minimize the negative logarithm of the likelihood. This leads to the cross-entropy error function (Hopfield, 1987; Baum and Wilczek, 1988; Solla et al., 1988; Hinton, 1989; Hampshire and Pearlmutter, 1990) in the form $E = -\sum_n \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$ (28). For the network model in... |

29 | Cross-validation: A review - Stone - 1978 |

20 | Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory - Ito - 1991 |

14 | Arbitrary Nonlinearity Is Sufficient to Represent All Functions by Neural Networks: A Theorem - Kreinovich - 1991 |

12 | The Stone-Weierstrass Theorem and Its Application to Neural Networks - Cotter - 1990 |

1 | Sejnowski, E., Hinton (Eds.) |