Results 1  10
of
183
Natural Gradient Works Efficiently in Learning
 Neural Computation
, 1998
"... When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for ..."
Abstract

Cited by 289 (16 self)
 Add to MetaCart
When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for blind source separation) and the space of linear dynamical systems (for blind source deconvolution). The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters. This suggests that the plateau phenomenon which appears in the backpropagation learning algorithm of multilayer perceptrons might disappear or might be not so serious when the natural gradient is used. An adaptive method of updating the learning rate is proposed and analyzed. 1 Introduction The stochastic gradient method (Widrow, 1963; Amari, 1967; Tsypkin, 1973; Rumelhart et al...
Learning with Labeled and Unlabeled Data
, 2001
"... In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as ..."
Abstract

Cited by 165 (3 self)
 Add to MetaCart
In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as numerous suggestions for potential future work. Therefore, this work contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of inputdependent regularization. We postulate a number of baseline methods, being algorithms or algorithmic schemes which can more or less straightforwardly be applied to the problem, without the need for genuinely new concepts. However, some of them might serve as basis for a genuine method. In the literature revi...
Information Geometry on Hierarchy of Probability Distributions
, 2001
"... An exponential family or mixture family of probability distributions has a natural hierarchical structure. This paper gives an “orthogonal” decomposition of such a system based on information geometry. A typical example is the decomposition of stochastic dependency among a number of random variables ..."
Abstract

Cited by 72 (5 self)
 Add to MetaCart
An exponential family or mixture family of probability distributions has a natural hierarchical structure. This paper gives an “orthogonal” decomposition of such a system based on information geometry. A typical example is the decomposition of stochastic dependency among a number of random variables. In general, they have a complex structure of dependencies. Pairwise dependency is easily represented by correlation, but it is more difficult to measure effects of pure triplewise or higher order interactions (dependencies) among these variables. Stochastic dependency is decomposed quantitatively into an “orthogonal” sum of pairwise, triplewise, and further higher order dependencies. This gives a new invariant decomposition of joint entropy. This problem is important for extracting intrinsic interactions in firing patterns of an ensemble of neurons and for estimating its functional connections. The orthogonal decomposition is given in a wide class of hierarchical structures including both exponential and mixture families. As an example, we decompose the dependency in a higher order Markov chain into a sum of those in various lower order Markov chains.
Streaming and sublinear approximation of entropy and information distances
 In ACMSIAM Symposium on Discrete Algorithms
, 2006
"... In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al posed the pr ..."
Abstract

Cited by 55 (13 self)
 Add to MetaCart
In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al posed the problem of property testing with respect to the JensenShannon distance. We present optimal algorithms for estimating bounded, symmetric fdivergences (including the JensenShannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear time and space constrained algorithms. The mediating model is the random order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation invariant functions can be simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum et al regarding the relationship between property testing and stream algorithms. Further, we give a polylogspace PTAS for estimating the entropy of a one pass random order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model. 1
Csiszár’s divergences for nonnegative matrix factorization: Family of new algorithms
 LNCS
, 2006
"... In this paper we discus a wide class of loss (cost) functions for nonnegative matrix factorization (NMF) and derive several novel algorithms with improved efficiency and robustness to noise and outliers. We review several approaches which allow us to obtain generalized forms of multiplicative NMF a ..."
Abstract

Cited by 52 (19 self)
 Add to MetaCart
In this paper we discus a wide class of loss (cost) functions for nonnegative matrix factorization (NMF) and derive several novel algorithms with improved efficiency and robustness to noise and outliers. We review several approaches which allow us to obtain generalized forms of multiplicative NMF algorithms and unify some existing algorithms. We give also the flexible and relaxed form of the NMF algorithms to increase convergence speed and impose some desired constraints such as sparsity and smoothness of components. Moreover, the effects of various regularization terms and constraints are clearly shown. The scope of these results is vast since the proposed generalized divergence functions include quite large number of useful loss functions such as the squared Euclidean distance,KulbackLeibler divergence, ItakuraSaito, Hellinger, Pearson’s chisquare, and Neyman’s chisquare distances, etc. We have applied successfully the developed algorithms to blind (or semi blind) source separation (BSS) where sources can be generally statistically dependent, however they satisfy some other conditions or additional constraints such as nonnegativity, sparsity and/or smoothness.
Ensemble learning for independent component analysis
 in Advances in Independent Component Analysis
, 2000
"... i Abstract This thesis is concerned with the problem of Blind Source Separation. Specifically we considerthe Independent Component Analysis (ICA) model in which a set of observations are modelled by xt = Ast: (1) where A is an unknown mixing matrix and st is a vector of hidden source components atti ..."
Abstract

Cited by 49 (2 self)
 Add to MetaCart
i Abstract This thesis is concerned with the problem of Blind Source Separation. Specifically we considerthe Independent Component Analysis (ICA) model in which a set of observations are modelled by xt = Ast: (1) where A is an unknown mixing matrix and st is a vector of hidden source components attime t. The ICA problem is to find the sources given only a set of observations. In chapter 1, the blind source separation problem is introduced. In chapter 2 the methodof Ensemble Learning is explained. Chapter 3 applies Ensemble Learning to the ICA model and chapter 4 assesses the use of Ensemble Learning for model selection.Chapters 57 apply the Ensemble Learning ICA algorithm to data sets from physics (a medical imaging data set consisting of images of a tooth), biology (data sets from cDNAmicroarrays) and astrophysics (Planck image separation and galaxy spectra separation).
Bankruptcy Analysis with SelfOrganizing Maps in Learning Metrics
 IEEE Transactions on Neural Networks
, 2001
"... We introduce a method for deriving a metric, locally based on the Fisher information matrix, into the data space. A SelfOrganizing Map is computed in the new metric to explore financial statements of enterprises. The metric measures local distances in terms of changes in the distribution of an auxi ..."
Abstract

Cited by 48 (19 self)
 Add to MetaCart
We introduce a method for deriving a metric, locally based on the Fisher information matrix, into the data space. A SelfOrganizing Map is computed in the new metric to explore financial statements of enterprises. The metric measures local distances in terms of changes in the distribution of an auxiliary random variable that reflects what is important in the data. In this paper the variable indicates bankruptcy within the next few years. The conditional density of the auxiliary variable is first estimated, and the change in the estimate resulting from local displacements in the primary data space is measured using the Fisher information matrix. When a SelfOrganizing Map is computed in the new metric it still visualizes the data space in a topologypreserving fashion, but represents the (local) directions in which the probability of bankruptcy changes the most.
Divergence Measures and Message Passing
, 2005
"... This paper presents a unifying view of messagepassing algorithms, as methods to approximate a complex Bayesian network by a simpler network with minimum information divergence. In this view, the difference between meanfield methods and belief propagation is not the amount of structure they model, b ..."
Abstract

Cited by 48 (2 self)
 Add to MetaCart
This paper presents a unifying view of messagepassing algorithms, as methods to approximate a complex Bayesian network by a simpler network with minimum information divergence. In this view, the difference between meanfield methods and belief propagation is not the amount of structure they model, but only the measure of loss they minimize (‘exclusive ’ versus ‘inclusive’ KullbackLeibler divergence). In each case, messagepassing arises by minimizing a localized version of the divergence, local to each factor. By examining these divergence measures, we can intuit the types of solution they prefer (symmetrybreaking, for example) and their suitability for different tasks. Furthermore, by considering a wider variety of divergence measures (such as alphadivergences), we can achieve different complexity and performance goals. 1
Algebraic analysis for nonidentifiable learning machines
 Neural Computation
"... This paper clarifies the relation between the learning curve and the algebraic geometrical structure of a nonidentifiable learning machine such as a multilayer neural network whose true parameter set is an analytic set with singular points. By using a concept in algebraic analysis, we rigorously pr ..."
Abstract

Cited by 46 (14 self)
 Add to MetaCart
This paper clarifies the relation between the learning curve and the algebraic geometrical structure of a nonidentifiable learning machine such as a multilayer neural network whose true parameter set is an analytic set with singular points. By using a concept in algebraic analysis, we rigorously prove that the Bayesian stochastic complexity or the free energy is asymptotically equal to λ1 log n − (m1 − 1) log log n+constant, where n is the number of training samples and λ1 and m1 are the rational number and the natural number which are determined as the birational invariant values of the singularities in the parameter space. Also we show an algorithm to calculate λ1 and m1 based on the resolution of singularities in algebraic geometry. In regular statistical models, 2λ1 is equal to the number of parameters and m1 = 1, whereas in nonregular models such as multilayer networks, 2λ1 is not larger than the number of parameters and m1 ≥ 1. Since the increase of the stochastic complexity is equal to the learning curve or the generalization error, the nonidentifiable learning machines are the better models than the regular ones if the Bayesian ensemble learning is applied. 1 1
Bayesian conditional random fields
 In Conference on Artificial Intelligence and Statistics (AISTATS), 2005. 193 Yuan
, 2005
"... We propose Bayesian Conditional Random Fields (BCRFs) for classifying interdependent and structured data, such as sequences, images or webs. BCRFs are a Bayesian approach to training and inference with conditional random fields, which were previously trained by maximizing likelihood (ML) (Lafferty e ..."
Abstract

Cited by 41 (1 self)
 Add to MetaCart
We propose Bayesian Conditional Random Fields (BCRFs) for classifying interdependent and structured data, such as sequences, images or webs. BCRFs are a Bayesian approach to training and inference with conditional random fields, which were previously trained by maximizing likelihood (ML) (Lafferty et al., 2001). Our framework eliminates the problem of overfitting, and offers the full advantages of a Bayesian treatment. Unlike the ML approach, we estimate the posterior distribution of the model parameters during training, and average over this posterior during inference. We apply an extension of EP method, the power EP method, to incorporate the partition function. For algorithmic stability and accuracy, we flatten the approximation structures to avoid twolevel approximations. We demonstrate the superior prediction accuracy of BCRFs over conditional random fields trained with ML or MAP on synthetic and real datasets. 1