Results 1–10 of 20
Clustering with Bregman Divergences
 Journal of Machine Learning Research
, 2005
Abstract

Cited by 309 (52 self)
A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical k-means and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical k-means algorithm, while generalizing the basic idea to a very large class of clustering loss functions. There are two main contributions in this paper. First, we pose the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate-distortion theory, and present an algorithm to minimize this loss. Second, we show an explicit bijection between Bregman divergences and exponential families. The bijection enables the development of an alternative interpretation of an efficient EM scheme for learning models involving mixtures of exponential distributions. This leads to a simple soft clustering algorithm for all Bregman divergences.
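The hard clustering algorithm described above can be sketched as a generalized Lloyd iteration. The following is a minimal illustration of the idea (not the authors' implementation), using the squared Euclidean divergence, under which it reduces to classical k-means; the key property it exploits is that the arithmetic mean is the optimal centroid for every Bregman divergence:

```python
import numpy as np

def squared_euclidean(x, y):
    # Bregman divergence generated by phi(z) = ||z||^2:
    # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> = ||x - y||^2
    return float(np.sum((x - y) ** 2))

def bregman_hard_clustering(X, k, divergence, n_iter=50, seed=0):
    """Generalized k-means: assign each point to the centroid with the
    smallest divergence, then re-estimate each centroid as the arithmetic
    mean of its assigned points (optimal for any Bregman divergence)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.array(
            [np.argmin([divergence(x, c) for c in centroids]) for x in X]
        )
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

Plugging in a different divergence function (e.g., relative entropy for probability vectors) changes only the assignment step; the mean-based centroid update stays the same.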
Logistic Regression, AdaBoost and Bregman Distances
, 2000
Abstract

Cited by 203 (43 self)
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
Additive Models, Boosting, and Inference for Generalized Divergences
 In Proc. 12th Annu. Conf. on Comput. Learning Theory
, 1999
Abstract

Cited by 39 (3 self)
We present a framework for designing incremental learning algorithms derived from generalized entropy functionals. Our approach is based on the use of Bregman divergences together with the associated class of additive models constructed using the Legendre transform. A particular one-parameter family of Bregman divergences is shown to yield a family of loss functions that includes the log-likelihood criterion of logistic regression as a special case, and that closely approximates the exponential loss criterion used in the AdaBoost algorithms of Schapire et al., as the natural parameter of the family varies. We also show how the quadratic approximation of the gain in Bregman divergence results in a weighted least-squares criterion. This leads to a family of incremental learning algorithms that builds upon and extends the recent interpretation of boosting in terms of additive models proposed by Friedman, Hastie, and Tibshirani.
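The two endpoint losses named in the abstract can be compared numerically. This is a hedged sketch of only those well-known endpoints (the paper's actual one-parameter family interpolating between them is defined there, not here), written as losses of the margin m = y * f(x):

```python
import math

def logistic_loss(margin):
    # log-likelihood criterion of logistic regression, as a margin loss
    return math.log(1.0 + math.exp(-margin))

def exponential_loss(margin):
    # exponential criterion minimized by AdaBoost
    return math.exp(-margin)

# Both losses vanish for large positive margins (confident, correct
# predictions); the exponential loss grows much faster for large
# negative margins (confident mistakes).
for m in (-2.0, 0.0, 2.0):
    print(m, logistic_loss(m), exponential_loss(m))
```

The heavier penalty the exponential loss places on large negative margins is one reason AdaBoost can be more sensitive to outliers than logistic regression.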
Statistical Learning Algorithms Based on Bregman Distances
, 1997
Abstract

Cited by 23 (1 self)
We present a class of statistical learning algorithms formulated in terms of minimizing Bregman distances, a family of generalized entropy measures associated with convex functions. The inductive learning scheme is akin to growing a decision tree, with the Bregman distance filling the role of the impurity function in tree-based classifiers. Our approach is based on two components. In the feature selection step, each linear constraint in a pool of candidate features is evaluated by the reduction in Bregman distance that would result from adding it to the model. In the constraint satisfaction step, all of the parameters are adjusted to minimize the Bregman distance subject to the chosen constraints. We introduce a new iterative estimation algorithm for carrying out both the feature selection and constraint satisfaction steps, and outline a proof of the convergence of these algorithms.
Information, Divergence and Risk for Binary Experiments
 Journal of Machine Learning Research
, 2009
Abstract

Cited by 17 (6 self)
We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC curves and statistical information. We do this by systematically studying integral and variational representations of these various objects and in so doing identify their primitives, all of which are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants.
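The classical Pinsker inequality that the abstract's generalised versions extend can be spot-checked numerically. A small sketch (not from the paper), with the KL divergence in nats:

```python
import numpy as np

def kl_divergence(p, q):
    # Kullback-Leibler divergence in nats (assumes strictly positive q)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def variational_divergence(p, q):
    # variational (L1) divergence between two distributions
    return float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

# Classical Pinsker inequality: KL(p || q) >= V(p, q)^2 / 2
p, q = [0.3, 0.7], [0.6, 0.4]
print(kl_divergence(p, q), variational_divergence(p, q) ** 2 / 2)
```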
The Burbea-Rao and Bhattacharyya centroids
 IEEE Transactions on Information Theory
, 2010
Abstract

Cited by 10 (7 self)
We study the centroid with respect to the class of information-theoretic Burbea-Rao divergences that generalize the celebrated Jensen-Shannon divergence by measuring the non-negative Jensen difference induced by a strictly convex and differentiable function. Although those Burbea-Rao divergences are symmetric by construction, they are not metrics since they fail to satisfy the triangle inequality. We first explain how a particular symmetrization of Bregman divergences called Jensen-Bregman distances yields exactly those Burbea-Rao divergences. We then proceed by defining skew Burbea-Rao divergences, and show that skew Burbea-Rao divergences amount in limit cases to computing Bregman divergences. We then prove that Burbea-Rao centroids can be arbitrarily finely approximated by a generic iterative concave-convex optimization algorithm with a guaranteed convergence property. In the second part of the paper, we consider the Bhattacharyya distance, which is commonly used to measure the degree of overlap of probability distributions. We show that Bhattacharyya distances between members of the same statistical exponential family amount to computing a Burbea-Rao divergence in disguise. We thus obtain an efficient algorithm for computing the Bhattacharyya centroid of a set of parametric distributions belonging to the same exponential family, improving over former specialized methods found in the literature that were limited to univariate or "diagonal" multivariate Gaussians. To illustrate the performance of our Bhattacharyya/Burbea-Rao centroid algorithm, we present experimental performance results for k-means and hierarchical clustering methods of Gaussian mixture models.
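The construction the abstract describes is easy to make concrete for its motivating example: the Jensen-Shannon divergence is the Burbea-Rao (Jensen) difference induced by the Shannon entropy. A minimal sketch, not taken from the paper:

```python
import numpy as np

def shannon_entropy(p):
    # Shannon entropy in bits; 0 * log 0 is treated as 0
    p = np.asarray(p, float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def jensen_shannon(p, q):
    """Burbea-Rao (Jensen) difference induced by the concave Shannon
    entropy: JS(p, q) = H((p + q)/2) - (H(p) + H(q)) / 2.
    Symmetric and non-negative, but not a metric (no triangle inequality)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))
```

Replacing the entropy with any other strictly concave, differentiable function yields a different member of the Burbea-Rao class by the same Jensen-difference recipe.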
Bayesian Quadratic Discriminant Analysis
 Journal of Machine Learning Research
, 2007
Abstract

Cited by 9 (3 self)
Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes.
Functional Bregman Divergence and Bayesian Estimation of Distributions
 CoRR
Abstract

Cited by 7 (1 self)
A class of distortions termed functional Bregman divergences is defined, which includes squared error and relative entropy. A functional Bregman divergence acts on functions or distributions, and generalizes the standard Bregman divergence for vectors and a previous pointwise Bregman divergence that was defined for functions. A recent result showed that the mean minimizes the expected Bregman divergence. The new functional definition enables the extension of this result to the continuous case to show that the mean minimizes the expected functional Bregman divergence over a set of functions or distributions. It is shown how this theorem applies to the Bayesian estimation of distributions. Estimation of the uniform distribution from independent and identically drawn samples is presented as a case study. Index Terms: Bayesian estimation, Bregman divergence, convexity, Fréchet derivative, uniform distribution.
Testing for Non-Nested Conditional Moment Restrictions using Unconditional Empirical Likelihood
, 2008
Abstract

Cited by 5 (2 self)
We propose non-nested hypothesis tests for conditional moment restriction models based on the method of generalized empirical likelihood (GEL). By utilizing the implied GEL probabilities from a sequence of unconditional moment restrictions that contains information equivalent to the conditional moment restrictions, we construct Kolmogorov-Smirnov and Cramér-von Mises type moment encompassing tests. Advantages of our tests over Otsu and Whang's (2007) tests are: (i) they are free from smoothing parameters, (ii) they can be applied to weakly dependent data, and (iii) they allow non-smooth moment functions. We derive the null distributions, validity of a bootstrap procedure, and local and global power properties of our tests. The simulation results show that our tests have reasonable size and power performance in finite samples.
Functional Bregman divergence
 Int. Symp. Inf. Theory
, 2008
Abstract

Cited by 5 (0 self)
To characterize the differences between two positive functions or two distributions, a class of distortion functions has recently been defined, termed functional Bregman divergences. The class generalizes the standard Bregman divergence defined for vectors, and includes total squared difference and relative entropy. Recently a key property was discovered for the vector Bregman divergence: that the mean minimizes the average Bregman divergence for a finite set of vectors. In this paper the analogous result is proven: that the mean function minimizes the average Bregman divergence for a set of positive functions that can be parameterized by a finite number of parameters. In addition, the relationship of the functional Bregman divergence to the vector Bregman divergence and pointwise Bregman divergence is stated, as well as some important properties.
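The vector-case property cited above (the mean minimizes the average Bregman divergence, where the candidate enters as the second argument) can be spot-checked numerically. A small sketch using the KL divergence, the Bregman divergence generated by negative entropy:

```python
import numpy as np

def kl_divergence(p, q):
    # Bregman divergence generated by negative entropy (in nats)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

points = np.array([[0.2, 0.8], [0.5, 0.5], [0.4, 0.6]])
mean = points.mean(axis=0)  # the claimed minimizer

def average_divergence(c):
    # average of D(p_i, c); the candidate is the *second* argument,
    # which is the form in which the theorem holds
    return float(np.mean([kl_divergence(p, c) for p in points]))

# Perturbing the candidate away from the mean can only increase the average.
for delta in ([0.05, -0.05], [-0.05, 0.05]):
    assert average_divergence(mean) <= average_divergence(mean + delta)
```

Note the asymmetry matters: for the first argument, the minimizer is generally not the arithmetic mean.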