Results 1-10 of 45
Probabilistic Latent Semantic Indexing
, 1999
"... Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized ..."
Abstract

Cited by 784 (8 self)
 Add to MetaCart
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
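The latent class (aspect) model described above can be fitted with the plain EM iteration on a term-document count matrix; the following is a minimal sketch on made-up counts with two aspects (all names and sizes are illustrative, and the paper's tempered generalization of EM is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-document count matrix n(d, w): 4 documents, 6 terms
# (illustrative data, not from the paper).
N = rng.integers(1, 5, size=(4, 6)).astype(float)
D, W = N.shape
K = 2  # number of latent aspects z

# Random initialization of P(z), P(d|z), P(w|z), each normalized.
Pz = np.full(K, 1.0 / K)
Pd_z = rng.random((K, D)); Pd_z /= Pd_z.sum(axis=1, keepdims=True)
Pw_z = rng.random((K, W)); Pw_z /= Pw_z.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: posterior P(z | d, w) proportional to P(z) P(d|z) P(w|z).
    joint = Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]
    post = joint / joint.sum(axis=0, keepdims=True)
    # M-step: reestimate each factor from expected counts n(d,w) P(z|d,w).
    weighted = N[None, :, :] * post
    Pz = weighted.sum(axis=(1, 2)) / N.sum()
    Pd_z = weighted.sum(axis=2)
    Pd_z /= Pd_z.sum(axis=1, keepdims=True)
    Pw_z = weighted.sum(axis=1)
    Pw_z /= Pw_z.sum(axis=1, keepdims=True)
```

Each iteration is guaranteed not to decrease the likelihood of the counts under the aspect model; the fitted P(w|z) factors play the role that singular vectors play in standard LSI.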
Natural Gradient Works Efficiently in Learning
 Neural Computation
, 1998
"... When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for ..."
Abstract

Cited by 289 (16 self)
 Add to MetaCart
When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for blind source separation) and the space of linear dynamical systems (for blind source deconvolution). The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters. This suggests that the plateau phenomenon which appears in the backpropagation learning algorithm of multilayer perceptrons might disappear, or might not be so serious, when the natural gradient is used. An adaptive method of updating the learning rate is proposed and analyzed.
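The core claim can be seen on a toy quadratic: premultiplying the gradient by the inverse of the metric tensor removes the ill-conditioning that slows ordinary gradient descent. In this sketch the metric is simply taken equal to the Hessian A (an assumption that makes the natural-gradient step exact in one iteration; it is not how the paper computes the Fisher metric for perceptrons):

```python
import numpy as np

# f(w) = 0.5 * w^T A w, a badly conditioned quadratic. The metric
# tensor G is taken equal to A itself in this illustration.
A = np.array([[10.0, 0.0], [0.0, 0.1]])
w_ord = np.array([1.0, 1.0])
w_nat = np.array([1.0, 1.0])

eta = 0.09  # near the largest stable step for the ordinary gradient
for _ in range(100):
    w_ord = w_ord - eta * (A @ w_ord)  # ordinary gradient descent

# Natural gradient: premultiply the gradient by the inverse metric.
w_nat = w_nat - np.linalg.solve(A, A @ w_nat)
```

After 100 ordinary-gradient steps, the component along the small-curvature direction has barely moved, while the single natural-gradient step reaches the minimum.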
Information Geometry of the EM and em Algorithms for Neural Networks
 Neural Networks
, 1995
"... In order to realize an inputoutput relation given by noisecontaminated examples, it is effective to use a stochastic model of neural networks. A model network includes hidden units whose activation values are not specified nor observed. It is useful to estimate the hidden variables from the obs ..."
Abstract

Cited by 101 (8 self)
 Add to MetaCart
In order to realize an input-output relation given by noise-contaminated examples, it is effective to use a stochastic model of neural networks. A model network includes hidden units whose activation values are not specified nor observed. It is useful to estimate the hidden variables from the observed or specified input-output data based on the stochastic model. Two algorithms, the EM and em algorithms, have so far been proposed for this purpose. The EM algorithm is an iterative statistical technique using the conditional expectation, and the em algorithm is a geometrical one given by information geometry. The em algorithm iteratively minimizes the Kullback-Leibler divergence in the manifold of neural networks. These two algorithms are equivalent in most cases. The present paper gives a unified information-geometrical framework for studying stochastic models of neural networks, by focusing on the EM and em algorithms, and proves a condition which guarantees their equivalence.
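The alternation the abstract describes can be sketched on the simplest stochastic model with hidden variables, a two-component Gaussian mixture: the E-step (the em-algorithm's e-projection) estimates the hidden component memberships, and the M-step (m-projection) refits the model. All numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Observations from a two-component, unit-variance Gaussian mixture; the
# hidden variable is which component generated each point.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

mu = np.array([-1.0, 1.0])   # illustrative initial means
pi = np.array([0.5, 0.5])
for _ in range(50):
    # E-step (e-projection): posterior responsibilities of the hidden variable.
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)
    # M-step (m-projection): reestimate mixing weights and means.
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
```

For this exponential-family model the two projections coincide with the statistical EM steps, which is the "equivalent in most cases" situation the paper makes precise.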
Information Geometry on Hierarchy of Probability Distributions
, 2001
"... An exponential family or mixture family of probability distributions has a natural hierarchical structure. This paper gives an “orthogonal” decomposition of such a system based on information geometry. A typical example is the decomposition of stochastic dependency among a number of random variables ..."
Abstract

Cited by 72 (5 self)
 Add to MetaCart
An exponential family or mixture family of probability distributions has a natural hierarchical structure. This paper gives an “orthogonal” decomposition of such a system based on information geometry. A typical example is the decomposition of stochastic dependency among a number of random variables. In general, they have a complex structure of dependencies. Pairwise dependency is easily represented by correlation, but it is more difficult to measure effects of pure triplewise or higher order interactions (dependencies) among these variables. Stochastic dependency is decomposed quantitatively into an “orthogonal” sum of pairwise, triplewise, and further higher order dependencies. This gives a new invariant decomposition of joint entropy. This problem is important for extracting intrinsic interactions in firing patterns of an ensemble of neurons and for estimating its functional connections. The orthogonal decomposition is given in a wide class of hierarchical structures including both exponential and mixture families. As an example, we decompose the dependency in a higher order Markov chain into a sum of those in various lower order Markov chains.
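For three binary variables, the "pure triplewise" interaction is the third-order coefficient of the log-linear expansion of log p; it vanishes exactly when the joint distribution carries nothing beyond pairwise structure. A small numerical check (distributions chosen for illustration):

```python
import numpy as np
from itertools import product

def theta_123(p):
    # Third-order log-linear coefficient: an alternating sum of log
    # probabilities over the 8 states (+ when x1 + x2 + x3 is odd).
    return sum((-1) ** (1 + sum(x)) * np.log(p[x])
               for x in product([0, 1], repeat=3))

# Independent bits: log p is additive, so the triplewise term vanishes.
marg = [0.3, 0.6, 0.5]
p_ind = {x: np.prod([m if xi else 1 - m for m, xi in zip(marg, x)])
         for x in product([0, 1], repeat=3)}

# A distribution whose log contains an explicit x1*x2*x3 term with
# coefficient 2.0; theta_123 recovers that coefficient.
raw = {x: np.exp(2.0 * x[0] * x[1] * x[2]) for x in product([0, 1], repeat=3)}
Z = sum(raw.values())
p_tri = {x: v / Z for x, v in raw.items()}
```

The orthogonal decomposition in the paper generalizes this coefficient to a divergence-based measure of each interaction order, but the vanishing condition is the same.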
Blind Source Separation: Semiparametric Statistical Approach
 IEEE Trans. Signal Processing
, 1997
"... The semiparametric statistical model is used to formulate the problem of blind source separation. The method of estimating functions is applied to this problem. It is shown that estimation of the mixing matrix or its learning rule version is given by an estimating function. The statistical efficienc ..."
Abstract

Cited by 55 (8 self)
 Add to MetaCart
The semiparametric statistical model is used to formulate the problem of blind source separation. The method of estimating functions is applied to this problem. It is shown that estimation of the mixing matrix, or its learning-rule version, is given by an estimating function. The statistical efficiencies of these algorithms are studied. The main results are as follows: 1) the space of all the estimating functions is derived; 2) the space is decomposed into the orthogonal sum of effective and redundant ancillary parts; 3) the Fisher-efficient (that is, asymptotically best) estimating functions are derived; 4) the stability of learning algorithms is studied. Corresponding author: Shun-ichi Amari, RIKEN FRP, 2-1 Hirosawa, Wako-shi, Saitama, Japan (amari@zoo.riken.go.jp).
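The learning-rule version mentioned above is best known in its natural-gradient form, dW ∝ (I − E[φ(y)yᵀ])W. A sketch on two artificial Laplacian sources; the mixing matrix, the nonlinearity φ = tanh (a common choice for super-Gaussian sources), and the step size are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 5000
S = rng.laplace(size=(2, T))            # two independent super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])  # illustrative mixing matrix
X = A @ S                               # observed mixtures

W = np.eye(2)
eta = 0.1
for _ in range(500):
    Y = W @ X
    # Natural-gradient learning rule: dW = eta * (I - E[phi(y) y^T]) W,
    # with phi = tanh; the expectation is a batch average here.
    C = (np.tanh(Y) @ Y.T) / T
    W = W + eta * (np.eye(2) - C) @ W

P = W @ A  # close to a scaled permutation when separation succeeds
```

The stationarity condition E[φ(y)yᵀ] = I is exactly an estimating-function equation of the kind the paper analyzes.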
Bankruptcy Analysis with Self-Organizing Maps in Learning Metrics
 IEEE Transactions on Neural Networks
, 2001
"... We introduce a method for deriving a metric, locally based on the Fisher information matrix, into the data space. A SelfOrganizing Map is computed in the new metric to explore financial statements of enterprises. The metric measures local distances in terms of changes in the distribution of an auxi ..."
Abstract

Cited by 48 (19 self)
 Add to MetaCart
We introduce a method for deriving a metric, locally based on the Fisher information matrix, into the data space. A Self-Organizing Map is computed in the new metric to explore financial statements of enterprises. The metric measures local distances in terms of changes in the distribution of an auxiliary random variable that reflects what is important in the data. In this paper the variable indicates bankruptcy within the next few years. The conditional density of the auxiliary variable is first estimated, and the change in the estimate resulting from local displacements in the primary data space is measured using the Fisher information matrix. When a Self-Organizing Map is computed in the new metric it still visualizes the data space in a topology-preserving fashion, but represents the (local) directions in which the probability of bankruptcy changes the most.
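For a binary auxiliary variable (bankrupt vs. not), the local metric can be written down directly: the Fisher information of the estimated conditional distribution, pulled back to the data space. The logistic conditional model and its weights below are purely illustrative stand-ins for the estimated density:

```python
import numpy as np

W_AUX = np.array([1.0, -0.5])  # hypothetical logistic weights
B_AUX = 0.2                    # hypothetical bias

def p_bankrupt(x):
    # Illustrative stand-in for the estimated conditional density
    # p(bankrupt | financial indicators x): a logistic model.
    return 1.0 / (1.0 + np.exp(-(W_AUX @ x + B_AUX)))

def fisher_metric(x):
    # Binary auxiliary variable c: J(x) = sum_c (grad p_c)(grad p_c)^T / p_c,
    # which for the logistic model reduces to p(1-p) * w w^T.
    p = p_bankrupt(x)
    grad = p * (1.0 - p) * W_AUX        # gradient of p(bankrupt | x)
    return np.outer(grad, grad) * (1.0 / p + 1.0 / (1.0 - p))

x = np.array([0.3, 0.8])
dx = np.array([0.01, 0.0])
d2 = dx @ fisher_metric(x) @ dx  # local squared distance in the learning metric
```

Displacements orthogonal to the gradient of the bankruptcy probability have zero length in this metric, which is exactly the "directions in which the probability changes the most" behavior described above.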
Tractable Bayesian Learning of Tree Belief Networks
, 2000
"... In this paper we present decomposable priors, a family of priors over structure and parameters of tree belief nets for which Bayesian learning with complete observations is tractable, in the sense that the posterior is also decomposable and can be completely determined analytically in polynomial tim ..."
Abstract

Cited by 36 (1 self)
 Add to MetaCart
In this paper we present decomposable priors, a family of priors over structure and parameters of tree belief nets for which Bayesian learning with complete observations is tractable, in the sense that the posterior is also decomposable and can be completely determined analytically in polynomial time. This follows from two main results: First, we show that factored distributions over spanning trees in a graph can be integrated in closed form. Second, we examine priors over tree parameters and show that a set of assumptions similar to (Heckerman et al., 1995) constrain the tree parameter priors to be a compactly parametrized product of Dirichlet distributions. Besides allowing for exact Bayesian learning, these results permit us to formulate a new class of tractable latent variable models in which the likelihood of a data point is computed through an ensemble average over tree structures.
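The first main result, integrating a factored distribution over all spanning trees in closed form, rests on the weighted matrix-tree theorem: the sum over spanning trees of the product of edge weights equals any cofactor of the weighted graph Laplacian. A small numerical check on K4 (weights illustrative):

```python
import numpy as np
from itertools import combinations

n = 4
rng = np.random.default_rng(3)
# Symmetric positive edge weights on the complete graph K4.
Wt = np.triu(rng.random((n, n)), k=1)
Wt = Wt + Wt.T

# Matrix-tree theorem: the tree sum equals a cofactor of the Laplacian.
L = np.diag(Wt.sum(axis=1)) - Wt
Z_det = np.linalg.det(L[1:, 1:])

# Brute-force check: enumerate the 3-edge acyclic subsets of K4's edges.
edges = [(u, v) for u in range(n) for v in range(u + 1, n)]

def is_spanning_tree(tree):
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for u, v in tree:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # edge closes a cycle
        parent[ru] = rv
    return True  # n-1 acyclic edges on n nodes form a spanning tree

Z_brute = sum(np.prod([Wt[u, v] for u, v in t])
              for t in combinations(edges, n - 1) if is_spanning_tree(t))
```

The determinant replaces an exponential-size sum with an O(n^3) computation, which is what makes the posterior normalizer (and hence exact Bayesian learning) polynomial.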
A Review of Kernel Methods in Machine Learning
, 2006
"... We review recent methods for learning with positive definite kernels. All these methods formulate learning and estimation problems as linear tasks in a reproducing kernel Hilbert space (RKHS) associated with a kernel. We cover a wide range of methods, ranging from simple classifiers to sophisticate ..."
Abstract

Cited by 35 (3 self)
 Add to MetaCart
We review recent methods for learning with positive definite kernels. All these methods formulate learning and estimation problems as linear tasks in a reproducing kernel Hilbert space (RKHS) associated with a kernel. We cover a wide range of methods, ranging from simple classifiers to sophisticated methods for estimation with structured data.
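One concrete instance of a "linear task in an RKHS" is kernel ridge regression: by the representer theorem the estimate is a kernel expansion over the training points. The kernel, bandwidth, regularization, and data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
# Noisy samples of a nonlinear function (toy data).
X = np.linspace(-3.0, 3.0, 40)
y = np.sin(X) + 0.1 * rng.normal(size=40)

def rbf(a, b, gamma=0.5):
    # Gaussian RBF, one standard positive definite kernel.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# Kernel ridge regression: the RKHS solution is
# f(x) = sum_i alpha_i k(x, x_i) with alpha = (K + lam*I)^{-1} y.
lam = 1e-2
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.array([0.0, 1.5])
pred = rbf(X_test, X) @ alpha  # predictions at the test points
```

Swapping the squared loss for the hinge loss in the same RKHS formulation yields the support vector machine, which is the sense in which the methods reviewed share one linear template.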
Assessing the Distinguishability of Models and the Informativeness of Data
"... A difficulty in the development and testing of psychological models is that they are typically evaluated solely on their ability to fit experimental data, with little consideration given to their ability to fit other possible data patterns. By examining how well model A fits data generated by mod ..."
Abstract

Cited by 13 (2 self)
 Add to MetaCart
A difficulty in the development and testing of psychological models is that they are typically evaluated solely on their ability to fit experimental data, with little consideration given to their ability to fit other possible data patterns. By examining how well model A fits data generated by model B, and vice versa (a technique that we call landscaping), much safer inferences can be made about the meaning of a model's fit to data. We demonstrate the landscaping technique using four models of retention and 77 historical data sets, and show how the method can be used to (1) evaluate the distinguishability of models, (2) evaluate the informativeness of data in distinguishing between models, and (3) suggest new ways to distinguish between models. The generality of the method is demonstrated in two other research areas (information integration and categorization), and its relationship to the important notion of model complexity is discussed.
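The landscaping idea can be sketched with two toy retention models (exponential and power decay; the data, parameter ranges, and grid search are illustrative, not the paper's models or fitting procedure): generate data from one model, then compare each model's best achievable fit to that data.

```python
import numpy as np

t = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # toy retention intervals

def exp_model(p, t):
    return p[0] * np.exp(-p[1] * t)

def pow_model(p, t):
    return p[0] * t ** (-p[1])

def best_fit_sse(model, data):
    # Crude grid search over (a, b); enough for a landscaping sketch.
    grid_a = np.linspace(0.05, 1.0, 60)
    grid_b = np.linspace(0.05, 1.5, 60)
    return min(np.sum((model((a, b), t) - data) ** 2)
               for a in grid_a for b in grid_b)

# Landscaping step: generate data from the exponential model, then ask
# how well EACH model can fit that data.
data_from_exp = exp_model((0.9, 0.35), t)
sse_self = best_fit_sse(exp_model, data_from_exp)   # model fitting its own data
sse_cross = best_fit_sse(pow_model, data_from_exp)  # rival model, same data
```

Repeating this in both directions, over many generating parameters, produces the "landscape" of cross-fits from which distinguishability is read off: if the rival model fits nearly as well as the generator, a good fit to real data says little.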
Information Geometry on Hierarchical Decomposition of Stochastic Interactions
 IEEE Transactions on Information Theory
, 1999
"... A joint probability distribution represents stochastic dependency among a number of random variables. They in general have complex structure of dependencies. A pairwise dependency is easily represented by correlation, but it is more difficult to extract triplewise or higherorder interactions (depen ..."
Abstract

Cited by 10 (2 self)
 Add to MetaCart
A joint probability distribution represents stochastic dependency among a number of random variables. They in general have a complex structure of dependencies. A pairwise dependency is easily represented by correlation, but it is more difficult to extract triplewise or higher-order interactions (dependencies) among these variables. They form a hierarchy of dependencies. The present paper decomposes the higher order dependency into an "orthogonal" sum of pairwise, triplewise and further higher-order dependencies, by using information geometry. This naturally gives a new invariant decomposition of joint entropy. This problem is important for extracting intrinsic interactions in firing patterns of an ensemble of neurons and for estimating its functional connections. The results are generalized to give an orthogonal decomposition in a wide class of hierarchical structures. As an example, we decompose higher order Markov chains into various lower order Markov chains.