Results 1–10 of 235
Clustering with Bregman Divergences
Journal of Machine Learning Research, 2005
"... A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergence ..."
Cited by 310 (52 self)
Abstract:
A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical k-means and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical k-means algorithm, while generalizing the basic idea to a very large class of clustering loss functions. There are two main contributions in this paper. First, we pose the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate-distortion theory, and present an algorithm to minimize this loss. Second, we show an explicit bijection between Bregman divergences and exponential families. The bijection enables the development of an alternative interpretation of an efficient EM scheme for learning models involving mixtures of exponential distributions. This leads to a simple soft clustering algorithm for all Bregman divergences.
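The hard-clustering side of this can be illustrated with a short sketch: a k-means-style loop in which assignment uses an arbitrary Bregman divergence while the representative update stays the plain arithmetic mean (a property the paper establishes for all Bregman divergences). This is a minimal illustration under those assumptions, not the paper's full algorithm; the two divergences below are just common examples.

    # Minimal sketch of Bregman hard clustering; assumes the arithmetic mean is
    # the optimal cluster representative for any Bregman divergence.
    import numpy as np

    def squared_euclidean(x, mu):
        return np.sum((x - mu) ** 2, axis=-1)

    def generalized_kl(x, mu, eps=1e-12):
        # Bregman divergence generated by sum(x * log x); assumes nonnegative data.
        x, mu = x + eps, mu + eps
        return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

    def bregman_kmeans(X, k, divergence=squared_euclidean, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: nearest center under the chosen divergence.
            dists = np.stack([divergence(X, c) for c in centers], axis=1)
            labels = dists.argmin(axis=1)
            # Update step: arithmetic mean of the assigned points.
            centers = np.stack([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        return labels, centers

With squared_euclidean this reduces to classical k-means; with generalized_kl it clusters nonnegative data in the information-theoretic sense the abstract mentions.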
Natural Gradient Works Efficiently in Learning
Neural Computation, 1998
"... When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for ..."
Cited by 289 (16 self)
Abstract:
When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for blind source separation) and the space of linear dynamical systems (for blind source deconvolution). The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters. This suggests that the plateau phenomenon which appears in the backpropagation learning algorithm of multilayer perceptrons might disappear, or might not be so serious, when the natural gradient is used. An adaptive method of updating the learning rate is proposed and analyzed.
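The update the abstract refers to has the standard form θ ← θ − η F(θ)⁻¹ ∇L(θ), where F is the Fisher information matrix acting as the Riemannian metric of the parameter space. A minimal sketch follows; loss_grad and fisher_matrix are placeholders for model-specific computations, not anything defined in the paper.

    # One natural gradient step: precondition the ordinary gradient by the
    # inverse Fisher information matrix (the metric of the parameter space).
    import numpy as np

    def natural_gradient_step(theta, loss_grad, fisher_matrix, eta=0.1, damping=1e-6):
        g = loss_grad(theta)                        # ordinary (Euclidean) gradient
        F = fisher_matrix(theta)                    # Fisher information at theta
        F = F + damping * np.eye(len(theta))        # small damping for stability
        return theta - eta * np.linalg.solve(F, g)  # avoids forming F^{-1} explicitly

When F is the identity this reduces to ordinary gradient descent; the damping term is a common numerical safeguard, not part of the paper's analysis.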
Toward a method of selecting among computational models of cognition
Psychological Review, 2002
"... The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models of cognition are increasingly being advanced as explanations of behavior. The success of this line of inquiry depends on the development of robust methods to ..."
Cited by 74 (4 self)
Abstract:
The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models of cognition are increasingly being advanced as explanations of behavior. The success of this line of inquiry depends on the development of robust methods to guide the evaluation and selection of these models. This article introduces a method of selecting among mathematical models of cognition known as minimum description length, which provides an intuitive and theoretically well-grounded understanding of why one model should be chosen. A central but elusive concept in model selection, complexity, can also be derived with the method. The adequacy of the method is demonstrated in 3 areas of cognitive modeling: psychophysics, information integration, and categorization.
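To make the selection criterion concrete, a crude two-part sketch is shown below: total description length is roughly the data misfit plus a parametric complexity penalty. The paper's treatment uses more refined complexity terms (e.g., based on the Fisher information), so this BIC-like approximation only conveys the flavor of the criterion.

    # Crude two-part MDL score: -log-likelihood plus (k/2) * log(n), with k the
    # number of free parameters and n the sample size. The refined geometric
    # complexity term from the paper is not shown.
    import numpy as np

    def mdl_score(neg_log_likelihood, n_params, n_samples):
        complexity = 0.5 * n_params * np.log(n_samples)
        return neg_log_likelihood + complexity

    # The candidate with the smallest total description length is preferred, e.g.
    # best = min(models, key=lambda m: mdl_score(m["nll"], m["k"], n))  # hypothetical dicts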
Information Geometry on Hierarchy of Probability Distributions
2001
"... An exponential family or mixture family of probability distributions has a natural hierarchical structure. This paper gives an “orthogonal” decomposition of such a system based on information geometry. A typical example is the decomposition of stochastic dependency among a number of random variables ..."
Cited by 72 (5 self)
Abstract:
An exponential family or mixture family of probability distributions has a natural hierarchical structure. This paper gives an “orthogonal” decomposition of such a system based on information geometry. A typical example is the decomposition of stochastic dependency among a number of random variables. In general, they have a complex structure of dependencies. Pairwise dependency is easily represented by correlation, but it is more difficult to measure effects of pure triplewise or higher order interactions (dependencies) among these variables. Stochastic dependency is decomposed quantitatively into an “orthogonal” sum of pairwise, triplewise, and further higher order dependencies. This gives a new invariant decomposition of joint entropy. This problem is important for extracting intrinsic interactions in firing patterns of an ensemble of neurons and for estimating its functional connections. The orthogonal decomposition is given in a wide class of hierarchical structures including both exponential and mixture families. As an example, we decompose the dependency in a higher order Markov chain into a sum of those in various lower order Markov chains.
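The basic object being decomposed can be sketched as follows: the total stochastic dependency (multi-information) of a set of variables, here three binary ones, measured as the KL divergence from the joint distribution to the product of its marginals. The paper's contribution, the orthogonal split of this total into pairwise and pure triplewise parts via KL projection onto lower-order families, is not reproduced in this sketch.

    # Total dependency (multi-information) of three binary variables:
    #   KL( p(x,y,z) || p(x) p(y) p(z) ).
    # The orthogonal pairwise/triplewise decomposition itself is not computed here.
    import numpy as np

    def multi_information(p_xyz, eps=1e-12):
        p = p_xyz / p_xyz.sum()                      # normalize a 2x2x2 joint table
        px = p.sum(axis=(1, 2))
        py = p.sum(axis=(0, 2))
        pz = p.sum(axis=(0, 1))
        p_ind = px[:, None, None] * py[None, :, None] * pz[None, None, :]
        return float(np.sum(p * np.log((p + eps) / (p_ind + eps))))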
A Hilbert space embedding for distributions
In Algorithmic Learning Theory: 18th International Conference, 2007
"... Abstract. We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. Applications of this technique can be found in twosample tests, which are used for ..."
Cited by 53 (26 self)
Abstract:
We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. Applications of this technique can be found in two-sample tests, which are used for determining whether two sets of observations arise from the same distribution, covariate shift correction, local learning, measures of independence, and density estimation. Kernel methods are widely used in supervised learning [1, 2, 3, 4]; however, they are much less established in the areas of testing, estimation, and analysis of probability distributions, where information-theoretic approaches [5, 6] have long been dominant. Recent examples include [7] in the context of construction of graphical models, [8] in the context of feature extraction, and [9] in the context of independent component analysis. These methods have by and large a common issue: to compute quantities such as the mutual information, entropy, or Kullback-Leibler divergence, we require sophisticated space partitioning and/or ...
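The distance between two such mean embeddings is the maximum mean discrepancy (MMD), which underlies the two-sample tests mentioned above. A minimal sketch of the biased empirical estimate with a Gaussian RBF kernel is shown below; bandwidth selection and the unbiased and test-statistic variants are omitted.

    # Biased empirical MMD^2 between samples X and Y:
    #   mean k(x, x') + mean k(y, y') - 2 * mean k(x, y),
    # i.e. the squared RKHS distance between the two sample mean embeddings.
    import numpy as np

    def rbf_kernel(A, B, sigma=1.0):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-sq / (2.0 * sigma**2))

    def mmd2_biased(X, Y, sigma=1.0):
        return (rbf_kernel(X, X, sigma).mean()
                + rbf_kernel(Y, Y, sigma).mean()
                - 2.0 * rbf_kernel(X, Y, sigma).mean())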
Is there something out there? Inferring space from sensorimotor dependencies
Neural Computation, 2002
"... This paper suggests that in biological organisms, the perceived structure of reality, in particular the notions of body, environment, space, object, and attribute, could be a consequence of an effort on the part of brains to account for the dependency between their inputs and their outputs in terms ..."
Cited by 52 (3 self)
Abstract:
This paper suggests that in biological organisms, the perceived structure of reality, in particular the notions of body, environment, space, object, and attribute, could be a consequence of an effort on the part of brains to account for the dependency between their inputs and their outputs in terms of a small number of parameters. To validate this idea, a procedure is demonstrated whereby the brain of an organism with arbitrary input and output connectivity can deduce the dimensionality of the rigid group of the space underlying its input-output relationship, that is, the dimension of what the organism will call physical space.
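As a loose illustration (not the paper's actual procedure), one way a dimensionality can be read off input-output data is to probe the environment with small random motor commands and count the directions of sensor variation that carry non-negligible variance. The sensor_response callable below is a hypothetical black box standing in for the organism's sensorimotor loop.

    # Hypothetical dimensionality probe: local PCA on sensor readings gathered
    # around a base motor command, counting significant directions of variation.
    import numpy as np

    def estimated_dimension(sensor_response, base_command, n_probes=500,
                            scale=1e-2, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        probes = base_command + scale * rng.standard_normal((n_probes, base_command.size))
        readings = np.stack([sensor_response(c) for c in probes])
        readings = readings - readings.mean(axis=0)
        variances = np.linalg.svd(readings, compute_uv=False) ** 2 / n_probes
        return int(np.sum(variances > tol * variances.max()))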
Covariant Policy Search
2003
"... We investigate the problem of noncovariant behavior of policy gradient reinforcement learning algorithms. ..."
Cited by 48 (4 self)
Abstract:
We investigate the problem of noncovariant behavior of policy gradient reinforcement learning algorithms.
On Bregman Voronoi Diagrams
 in "Proc. 18th ACMSIAM Sympos. Discrete Algorithms
, 2007
"... The Voronoi diagram of a point set is a fundamental geometric structure that partitions the space into elementary regions of influence defining a discrete proximity graph and dually a wellshaped Delaunay triangulation. In this paper, we investigate a framework for defining and building the Voronoi ..."
Cited by 42 (22 self)
Abstract:
The Voronoi diagram of a point set is a fundamental geometric structure that partitions the space into elementary regions of influence, defining a discrete proximity graph and, dually, a well-shaped Delaunay triangulation. In this paper, we investigate a framework for defining and building the Voronoi diagrams for a broad class of distortion measures called Bregman divergences, which includes not only the traditional (squared) Euclidean distance, but also various divergence measures based on entropic functions. As a by-product, Bregman Voronoi diagrams allow one to define information-theoretic Voronoi diagrams in statistical parametric spaces based on the relative entropy of distributions. We show that for a given Bregman divergence, one can define several types of Voronoi diagrams related to each other ...
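A small sketch of the underlying partition rule follows: a query point belongs to the site with the smallest divergence. Because Bregman divergences are in general asymmetric, evaluating D(query, site) and D(site, query) induces two different diagrams, which is one sense in which several diagram types coexist. The Itakura-Saito divergence below is just one standard example of a non-Euclidean Bregman divergence.

    # Point location in a Bregman Voronoi partition: nearest site under a
    # (possibly asymmetric) Bregman divergence.
    import numpy as np

    def itakura_saito(x, y, eps=1e-12):
        # Bregman divergence generated by -sum(log x); assumes positive data.
        r = (x + eps) / (y + eps)
        return np.sum(r - np.log(r) - 1.0, axis=-1)

    def nearest_site(query, sites, divergence=itakura_saito, swap=False):
        d = np.array([divergence(s, query) if swap else divergence(query, s)
                      for s in sites])
        return int(d.argmin())                 # index of the owning Voronoi cell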
Sufficient Dimensionality Reduction
Journal of Machine Learning Research, 2003
"... Dimensionality reduction of empirical cooccurrence data is a fundamental problem in unsupervised learning. It is also a well studied problem in statistics known as the analysis of crossclassified data. One principled approach to this problem is to represent the data in low dimension with minimal l ..."
Cited by 35 (8 self)
Abstract:
Dimensionality reduction of empirical co-occurrence data is a fundamental problem in unsupervised learning. It is also a well-studied problem in statistics, known as the analysis of cross-classified data. One principled approach to this problem is to represent the data in low dimension with minimal loss of the (mutual) information contained in the original data. In this paper we introduce an information-theoretic nonlinear method for finding such a most informative dimension reduction. In contrast with ...
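The quantity being preserved can be sketched directly: the mutual information of an empirical co-occurrence table p(x, y). The SDR optimization itself, which searches for low-dimensional features retaining as much of this information as possible, is not reproduced here.

    # Mutual information of an empirical co-occurrence table (counts over x, y).
    import numpy as np

    def mutual_information(counts, eps=1e-12):
        p = counts / counts.sum()
        px = p.sum(axis=1, keepdims=True)       # marginal over rows
        py = p.sum(axis=0, keepdims=True)       # marginal over columns
        return float(np.sum(p * np.log((p + eps) / (px @ py + eps))))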
Fast Joint Separation and Segmentation of Mixed Images
2004
"... We consider the problem of the blind separation of noisy instantaneously mixed images. The images are modeled by hidden Markov fields with unknown parameters. Given the observed images, we give a Bayesian formulation and we propose a fast version of the MCMC algorithm based on the Bartlett decomposi ..."
Cited by 31 (22 self)
Abstract:
We consider the problem of the blind separation of noisy, instantaneously mixed images. The images are modeled by hidden Markov fields with unknown parameters. Given the observed images, we give a Bayesian formulation and we propose a fast version of the MCMC algorithm, based on the Bartlett decomposition, for the resulting data augmentation problem. We separate the unknown variables into two categories: (1) the parameters of interest, which are the mixing matrix, the noise covariance, and the parameters of the source distributions; and (2) the hidden variables, which are the unobserved sources and the unobserved pixel segmentation labels. The proposed algorithm provides, in the stationary regime, samples drawn from the posterior distributions of all the variables involved in the problem, leading to great flexibility in the cost function choice. Finally, we show results for both synthetic and real data to illustrate the feasibility of the proposed solution. © 2004 SPIE and IS&T. [DOI: 10.1117/1.1666873]
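The data-augmentation structure described above alternates between the two categories of unknowns, and can be outlined as a two-block Gibbs sampler. Every sample_* function below is a placeholder for a model-specific conditional sampler; the Bartlett-decomposition acceleration from the paper is not shown.

    # Outline of the two-block data-augmentation (Gibbs) loop sketched in the
    # abstract. The conditional samplers are hypothetical placeholders.
    def gibbs_separation(Y, init_params, n_sweeps, sample_hidden, sample_params):
        params, draws = init_params, []
        for _ in range(n_sweeps):
            # Block 1: hidden variables (unobserved sources and segmentation
            # labels) given the observed images and current parameters.
            sources, labels = sample_hidden(Y, params)
            # Block 2: parameters of interest (mixing matrix, noise covariance,
            # source-distribution parameters) given the completed data.
            params = sample_params(Y, sources, labels)
            draws.append((sources, labels, params))
        return draws    # after burn-in, samples from the joint posterior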