Clustering with Bregman Divergences
Journal of Machine Learning Research, 2005
Cited by 309 (52 self)
Abstract: A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical k-means and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical k-means algorithm, while generalizing the basic idea to a very large class of clustering loss functions. There are two main contributions in this paper. First, we pose the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate-distortion theory, and present an algorithm to minimize this loss. Second, we show an explicit bijection between Bregman divergences and exponential families. The bijection enables the development of an alternative interpretation of an efficient EM scheme for learning models involving mixtures of exponential family distributions. This leads to a simple soft clustering algorithm for all Bregman divergences.
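The hard-clustering scheme summarized above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' reference implementation: the divergence helpers and `bregman_hard_cluster` are my own naming, and the empty-cluster fallback is an added safeguard. The re-estimation step uses the paper's key observation that the arithmetic mean is the optimal cluster representative for every Bregman divergence.

```python
import numpy as np

def squared_euclidean(X, mu):
    # Bregman divergence generated by F(x) = ||x||^2 (recovers classical k-means)
    return np.sum((X - mu) ** 2, axis=-1)

def generalized_kl(X, mu):
    # Bregman divergence generated by F(x) = sum_j x_j log x_j
    # (requires strictly positive coordinates)
    return np.sum(X * np.log(X / mu) - X + mu, axis=-1)

def bregman_hard_cluster(X, k, divergence, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment: each point goes to the centroid with smallest divergence
        dists = np.stack([divergence(X, mu) for mu in centroids])
        labels = np.argmin(dists, axis=0)
        # Re-estimation: the arithmetic mean is optimal for *every*
        # Bregman divergence; keep the old centroid if a cluster empties
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids
```

Swapping `squared_euclidean` for `generalized_kl` (on positive data) changes the loss but not the algorithm's structure, which is the point of the paper's unification.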
A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
In KDD, 2004
Cited by 97 (25 self)
Abstract: Co-clustering is a powerful data mining technique with varied applications such as text clustering, microarray analysis and recommender systems. Recently, an information-theoretic co-clustering approach applicable to empirical joint probability distributions was proposed. In many situations, co-clustering of more general matrices is desired. In this paper, we present a substantially generalized co-clustering framework wherein any Bregman divergence can be used in the objective function, and various conditional expectation based constraints can be considered based on the statistics that need to be preserved. Analysis of the co-clustering problem leads to the minimum Bregman information principle, which generalizes the maximum entropy principle, and yields an elegant meta-algorithm that is guaranteed to achieve local optimality. Our methodology yields new algorithms and also encompasses several previously known clustering and co-clustering algorithms based on alternate minimization.
Information, Divergence and Risk for Binary Experiments
Journal of Machine Learning Research, 2009
Cited by 17 (6 self)
Abstract: We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC curves and statistical information. We do this by systematically studying integral and variational representations of these various objects and in so doing identify their primitives, which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants.
Sided and symmetrized Bregman centroids
IEEE Transactions on Information Theory, 2009
Cited by 15 (7 self)
Abstract: In this paper, we generalize the notions of centroids (and barycenters) to the broad class of information-theoretic distortion measures called Bregman divergences. Bregman divergences form a rich and versatile family of distances that unifies quadratic Euclidean distances with various well-known statistical entropic measures. Since, besides the squared Euclidean distance, Bregman divergences are asymmetric, we consider the left-sided and right-sided centroids and the symmetrized centroids as minimizers of average Bregman distortions. We prove that all three centroids are unique and give closed-form solutions for the sided centroids that are generalized means. Furthermore, we design a provably fast and efficient arbitrarily close approximation algorithm for the symmetrized centroid based on its exact geometric characterization. The geometric approximation algorithm requires only walking on a geodesic linking the two left/right-sided centroids. We report on our implementation for computing entropic centers of image histogram clusters and entropic centers of multivariate normal distributions, which are useful operations for processing multimedia information and retrieval. These experiments illustrate that our generic methods compare favorably with former limited ad hoc methods.
Index Terms: Bregman divergence, Bregman information, Bregman power divergence, Burbea–Rao divergence, centroid,
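The sided centroids described above admit the closed forms the abstract mentions. A minimal numerical sketch for the generalized Kullback-Leibler divergence (function names are mine, not the paper's): the right-sided centroid, minimizing the average divergence over the second argument, is the arithmetic mean for any Bregman divergence, while the left-sided centroid is the generalized mean (grad F)^{-1} of the averaged gradients, which for the KL generator F(x) = sum_j x_j log x_j reduces to the coordinate-wise geometric mean.

```python
import numpy as np

def kl_div(p, q):
    # Generalized (non-normalized) Kullback-Leibler divergence between
    # positive vectors: the Bregman divergence of F(x) = sum_j x_j log x_j.
    return np.sum(p * np.log(p / q) - p + q)

def right_centroid(X):
    # argmin_c (1/n) sum_i D(x_i || c): the arithmetic mean,
    # for any Bregman divergence.
    return X.mean(axis=0)

def kl_left_centroid(X):
    # argmin_c (1/n) sum_i D(c || x_i): the generalized mean
    # (grad F)^{-1}((1/n) sum_i grad F(x_i)); for the KL generator
    # this is the coordinate-wise geometric mean.
    return np.exp(np.log(X).mean(axis=0))
```

By the AM-GM inequality the left-sided KL centroid never exceeds the right-sided one coordinate-wise, one concrete face of the asymmetry the paper addresses.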
Bayesian Quadratic Discriminant Analysis
Journal of Machine Learning Research, 2007
Cited by 9 (3 self)
Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes.
Bregman divergences and surrogates for learning
IEEE Trans. Pattern Anal. Mach. Intell., 2009. [Online]. Available: http://ieeexplore.ieee.org/xpl/preabsprintf.jsp?arnumber=4626960
Cited by 8 (6 self)
Abstract: Bartlett et al. (2006) recently proved that a ground condition for surrogates, classification calibration, ties up their consistent minimization to that of the classification risk, and left as an important problem the algorithmic questions about their minimization. In this paper, we address this problem for a wide set which lies at the intersection of classification-calibrated surrogates and those of Murata et al. (2004). This set coincides with those satisfying three common assumptions about surrogates. Equivalent expressions for the members (sometimes well known) follow for convex and concave surrogates, frequently used in the induction of linear separators and decision trees. Most notably, they share remarkable algorithmic features: for each of these two types of classifiers, we give a minimization algorithm provably converging to the minimum of any such surrogate. While seemingly different, we show that these algorithms are offshoots of the same "master" algorithm. This provides a new and broad unified account of different popular algorithms, including additive regression with the squared loss, the logistic loss, and the top-down induction performed in CART, C4.5. Moreover, we show that the induction enjoys the most popular boosting features, regardless of the surrogate. Experiments are provided on 40 readily available domains.
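As a hedged sketch of the kind of surrogate minimization studied above, for the linear-separator case only: the helper names and the plain gradient-descent choice below are mine, not the paper's "master" algorithm. Because the logistic loss is a classification-calibrated convex surrogate, driving it down also drives down the classification risk.

```python
import numpy as np

def logistic_surrogate(w, X, y):
    # Classification-calibrated convex surrogate for the 0/1 loss:
    # the logistic loss log(1 + exp(-y * <w, x>)), with labels y in {-1, +1}.
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

def fit_linear(X, y, lr=0.1, n_iter=500):
    # Plain gradient descent on the surrogate risk for a linear separator.
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = y * (X @ w)
        # d/dw of mean log(1 + exp(-y <w, x>)) = -mean(sigma(-m) * y * x)
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w
```

On linearly separable data this reduces both the surrogate and the empirical classification error, illustrating the calibration link without reproducing the paper's boosting-style guarantees.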
Probabilistic coherence and proper scoring rules
IEEE Transactions on Information Theory, 2009
Cited by 8 (4 self)
Abstract: We provide a self-contained proof of a theorem relating probabilistic coherence of forecasts to their non-domination by rival forecasts with respect to any proper scoring rule. The theorem recapitulates insights achieved by other investigators, and clarifies the connection of coherence and proper scoring rules to Bregman divergence.
Functional Bregman Divergence and Bayesian Estimation of Distributions
CoRR
Cited by 7 (1 self)
Abstract: A class of distortions termed functional Bregman divergences is defined, which includes squared error and relative entropy. A functional Bregman divergence acts on functions or distributions, and generalizes the standard Bregman divergence for vectors and a previous pointwise Bregman divergence that was defined for functions. A recent result showed that the mean minimizes the expected Bregman divergence. The new functional definition enables the extension of this result to the continuous case, showing that the mean minimizes the expected functional Bregman divergence over a set of functions or distributions. It is shown how this theorem applies to the Bayesian estimation of distributions. Estimation of the uniform distribution from independent and identically drawn samples is presented as a case study.
Index Terms: Bayesian estimation, Bregman divergence, convexity, Fréchet derivative, uniform distribution.
Functional Bregman divergence
Int. Symp. Inf. Theory, 2008
Cited by 5 (0 self)
Abstract: To characterize the differences between two positive functions or two distributions, a class of distortion functions has recently been defined, termed the functional Bregman divergences. The class generalizes the standard Bregman divergence defined for vectors, and includes total squared difference and relative entropy. Recently a key property was discovered for the vector Bregman divergence: that the mean minimizes the average Bregman divergence for a finite set of vectors. In this paper the analogous result is proven: that the mean function minimizes the average Bregman divergence for a set of positive functions that can be parameterized by a finite number of parameters. In addition, the relationship of the functional Bregman divergence to the vector Bregman divergence and pointwise Bregman divergence is stated, as well as some important properties.
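The central property above, that the mean function minimizes the average functional Bregman divergence, can be checked numerically on a discretized grid. This is a sketch under my own discretization and naming (none of it is from the paper), using the two divergences the abstract names: total squared difference and relative entropy.

```python
import numpy as np

def total_squared_difference(f, g, dx):
    # Functional Bregman divergence generated by phi(f) = integral of f^2
    return np.sum((f - g) ** 2) * dx

def relative_entropy(f, g, dx):
    # Generalized KL divergence between two positive functions on a grid
    return np.sum(f * np.log(f / g) - f + g) * dx

# A small set of positive functions, discretized on [0, 1]
x = np.linspace(0.0, 1.0, 200)
dx = x[1] - x[0]
functions = np.array([np.exp(-(x - c) ** 2 / 0.02) + 0.1 for c in (0.3, 0.5, 0.7)])
mean_function = functions.mean(axis=0)  # pointwise mean of the set
```

Perturbing `mean_function` in any direction should never decrease the average divergence from the set, for either divergence, which is the finite-grid analogue of the theorem.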
Quantum theory as inductive inference
2010
Cited by 4 (4 self)
Abstract: We present the elements of a new approach to the foundations of quantum theory and information theory which is based on the algebraic approach to integration, information geometry, and maximum relative entropy methods. It enables us to deal with conceptual and mathematical problems of quantum theory without any appeal to the Hilbert space framework and without a frequentist or subjective interpretation of probability.
PACS: 89.70.Cf, 02.50.Cw, 03.67.-a, 03.65.-w