Results 1–10 of 11
Exponentiated Gradient Versus Gradient Descent for Linear Predictors
 Information and Computation
, 1995
Abstract

Cited by 326 (14 self)
this paper, we concentrate on linear predictors. To any vector u ∈ R ...
Information Geometry of the EM and em Algorithms for Neural Networks
 Neural Networks
, 1995
Abstract

Cited by 122 (9 self)
In order to realize an input-output relation given by noise-contaminated examples, it is effective to use a stochastic model of neural networks. A model network includes hidden units whose activation values are not specified nor observed. It is useful to estimate the hidden variables from the observed or specified input-output data based on the stochastic model. Two algorithms, the EM- and em-algorithms, have so far been proposed for this purpose. The EM-algorithm is an iterative statistical technique using the conditional expectation, and the em-algorithm is a geometrical one given by information geometry. The em-algorithm iteratively minimizes the Kullback-Leibler divergence in the manifold of neural networks. These two algorithms are equivalent in most cases. The present paper gives a unified information-geometrical framework for studying stochastic models of neural networks, by focusing on the EM and em algorithms, and proves a condition which guarantees their equ...
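The iterative conditional-expectation step this abstract refers to can be illustrated on a model far simpler than a stochastic neural network. A minimal sketch of EM, assuming a hypothetical one-dimensional two-component Gaussian mixture; the data and all names are illustrative, not the paper's setting:

```python
import numpy as np

def em_gmm(x, iters=50):
    """Fit a two-component 1-D Gaussian mixture to x by EM."""
    mu = np.array([x.min(), x.max()], dtype=float)  # crude initialization
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])                       # mixing weights
    for _ in range(iters):
        # E-step: conditional expectation -- the posterior responsibility
        # of each component for each observation
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the expected statistics
        n_k = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
        pi = n_k / len(x)
    return mu, var, pi

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
mu, var, pi = em_gmm(x)
```

Each E-step/M-step pair is exactly the alternating divergence-minimization that the em-algorithm interprets geometrically: projecting between the data manifold and the model manifold.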
Conditional Distribution Learning with Neural Networks and Its Application to Channel Equalization, to appear
 IEEE Trans. Signal Processing
, 1997
Abstract

Cited by 30 (11 self)
Abstract — We present a conditional distribution learning formulation for real-time signal processing with neural networks based on a recent extension of maximum likelihood theory—partial likelihood (PL) estimation—which allows for i) dependent observations and ii) sequential processing. For a general neural network conditional distribution model, we establish a fundamental information-theoretic connection, the equivalence of maximum PL estimation and accumulated relative entropy (ARE) minimization, and obtain large-sample properties of PL for the general case of dependent observations. As an example, the binary case with the sigmoidal perceptron as the probability model is presented. It is shown that the single- and multilayer perceptron (MLP) models satisfy conditions for the equivalence of the two cost functions: ARE and negative log partial likelihood. The practical issue of their gradient descent minimization is then studied within the well-formed cost functions framework. It is shown that these are well-formed cost functions for networks without hidden units; hence, their gradient descent minimization is guaranteed to converge to a solution if one exists on such networks. The formulation is applied to adaptive channel equalization, and simulation results are presented to show the ability of the least relative entropy equalizer to realize complex decision boundaries and to recover during training from convergence at the wrong extreme in cases where the mean-square-error-based MLP equalizer cannot.
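The binary case the abstract singles out (a sigmoidal perceptron with no hidden units, trained by gradient descent on the negative log likelihood) can be sketched in a few lines. A hedged illustration: the data stream, labels, and learning rate below are invented for the demo, and the plain i.i.d. stream stands in for the dependent observations the PL framework actually handles:

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def online_logistic(stream, dim, lr=0.1):
    """One sequential pass of gradient descent on the negative log
    likelihood of a sigmoidal perceptron p(y=1|x) = sigmoid(w . x)."""
    w = [0.0] * dim
    for x, y in stream:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        # per-observation gradient of the negative log likelihood: (p - y) x
        for i in range(dim):
            w[i] -= lr * (p - y) * x[i]
    return w

random.seed(0)
stream = []
for _ in range(2000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1), 1.0]  # last entry is a bias input
    y = 1 if x[0] + x[1] > 0 else 0                          # a separable synthetic rule
    stream.append((x, y))

w = online_logistic(stream, 3)
acc = sum(
    (sigmoid(sum(wi * xi for wi, xi in zip(w, x))) > 0.5) == (y == 1)
    for x, y in stream
) / len(stream)
```

With no hidden units this cost is well formed in the abstract's sense, which is why the single sequential gradient pass is enough to reach a good decision boundary here.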
Information Geometric Measurements of Generalisation
, 1995
Abstract

Cited by 11 (8 self)
Neural networks can be regarded as statistical models, and can be analysed in a Bayesian framework. Generalisation is measured by the performance on independent test data drawn from the same distribution as the training data. Such performance can be quantified by the posterior average of the information divergence between the true and the model distributions. Averaging over the Bayesian posterior guarantees internal coherence; using information divergence guarantees invariance with respect to representation. The theory generalises the least-mean-squares theory for linear Gaussian models to general problems of statistical estimation. The main results are: (1) the ideal optimal estimate is always given by the average over the posterior; (2) the optimal estimate within a computational model is given by the projection of the ideal estimate onto the model. This incidentally shows that some currently popular methods dealing with hyperpriors are in general unnecessary and misleading. The extension of in...
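Result (1) can be checked numerically in a toy discrete setting. A sketch assuming a Bernoulli model with a hypothetical grid posterior over the parameter: scanning for the predictive probability that minimizes the posterior-expected information divergence should recover the posterior average. The grid and weights are invented for the illustration:

```python
import math

# hypothetical discrete posterior over Bernoulli parameters
thetas = [0.1 * k for k in range(1, 10)]
post = [t * (1.0 - t) for t in thetas]          # unnormalized weights
z = sum(post)
post = [p / z for p in post]

def expected_kl(q):
    """Posterior-expected KL(Bernoulli(theta) || Bernoulli(q))."""
    return sum(
        p * (t * math.log(t / q) + (1 - t) * math.log((1 - t) / (1 - q)))
        for t, p in zip(thetas, post)
    )

posterior_mean = sum(t * p for t, p in zip(thetas, post))
grid = [0.001 * k for k in range(1, 1000)]
best = min(grid, key=expected_kl)               # numerical minimizer
```

Setting the derivative of the expected divergence to zero gives q = E[theta | data], so `best` lands on the posterior mean; the same averaging argument is what the abstract generalises beyond the Gaussian case.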
Knowledge Enhancement and Reuse with Radial Basis Function Networks
, 2002
Abstract

Cited by 7 (0 self)
This paper presents a technique for enhancing an RBFN when provided with additional information in the form of new features, without retraining or resorting to the original features. The proposed technique improves the learning speed as well as network performance as compared to a network that is trained from scratch. We also present a method of reusing knowledge embedded in an RBFN for initializing another RBFN to be trained on a related problem. Both methods have several real-life applications.
Statistical Physics of Clustering Algorithms
 Diploma Thesis (Diplomarbeit), Technische Universität, FB Physik, Institut für Theoretische Physik
, 1998
Entropy Estimation
, 1996
Abstract
We consider two algorithms for online prediction based on a linear model. The algorithms are the well-known gradient descent (GD) algorithm and a new algorithm, which we call EG±. They both maintain a weight vector using simple updates. For the GD algorithm, the update is based on subtracting the gradient of the squared error made on a prediction. The EG± algorithm uses the components of the gradient in the exponents of factors that are used in updating the weight vector multiplicatively. We present worst-case loss bounds for EG± and compare them to previously known bounds for the GD algorithm. The bounds suggest that the losses of the algorithms are in general incomparable, but EG± has a much smaller loss if only a few components of the input are relevant for the predictions. We have performed experiments, which show that our worst-case upper bounds are quite tight already on simple artificial data.
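The two updates this abstract compares can be sketched for online linear prediction with squared loss. A hedged illustration, not the paper's exact algorithm: it shows plain GD next to the basic normalized EG update (the EG± variant extends this to signed weights via a paired pair of positive vectors, omitted here); the dimensions, data, and learning rates are invented for the demo, which also exercises the "few relevant components" regime where EG is claimed to shine:

```python
import math
import random

def gd_update(w, x, y, lr=0.05):
    """GD: subtract the gradient of the squared prediction error."""
    yhat = sum(wi * xi for wi, xi in zip(w, x))
    return [wi - lr * 2.0 * (yhat - y) * xi for wi, xi in zip(w, x)]

def eg_update(w, x, y, lr=0.5):
    """EG: multiply each weight by the exponential of the negative
    gradient component, then renormalize onto the probability simplex."""
    yhat = sum(wi * xi for wi, xi in zip(w, x))
    w = [wi * math.exp(-lr * 2.0 * (yhat - y) * xi) for wi, xi in zip(w, x)]
    s = sum(w)
    return [wi / s for wi in w]

random.seed(0)
n = 20                          # many inputs, only one of them relevant
w_gd = [0.0] * n
w_eg = [1.0 / n] * n            # EG starts at the uniform weight vector
for _ in range(500):
    x = [random.uniform(0.0, 1.0) for _ in range(n)]
    y = x[0]                    # the target depends on one component only
    w_gd = gd_update(w_gd, x, y)
    w_eg = eg_update(w_eg, x, y)
```

Because the target is sparse, the multiplicative update concentrates nearly all of EG's weight on the single relevant component, which is the qualitative behaviour behind its better loss bounds in this regime.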
and
Abstract
We consider two algorithms for online prediction based on a linear model. The algorithms are the well-known gradient descent (GD) algorithm and a new algorithm, which we call EG±. They both maintain a weight vector using simple updates. For the GD algorithm, the update is based on subtracting the gradient of the squared error made on a prediction. The EG± algorithm uses the components of the gradient in the exponents of factors that are used in updating the weight vector multiplicatively. We present worst-case loss bounds for EG± and compare them to previously known bounds for the GD algorithm. The bounds suggest that the losses of the algorithms are in general incomparable, but EG± has a much smaller loss if only a few components of the input are relevant for the predictions. We have performed experiments which show that our worst-case upper bounds are quite tight already on simple artificial data. © 1997 Academic Press