Results 1-10 of 122
Clustering with Bregman Divergences
Journal of Machine Learning Research, 2005
Cited by 441 (59 self)
A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical k-means and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical k-means algorithm, while generalizing the basic idea to a very large class of clustering loss functions. There are two main contributions in this paper. First, we pose the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate-distortion theory, and present an algorithm to minimize this loss. Secondly, we show an explicit bijection between Bregman divergences and exponential families. The bijection enables the development of an alternative interpretation of an efficient EM scheme for learning models involving mixtures of exponential distributions. This leads to a simple soft clustering algorithm for all Bregman divergences.
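The hard clustering idea in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the initialization by random data points, and the stopping rule are assumptions made for illustration. It relies on the known property that the arithmetic mean is the best cluster representative under any Bregman divergence, so only the assignment step changes when the divergence changes.

```python
import numpy as np

def squared_euclidean(x, mu):
    """Bregman divergence generated by phi(x) = ||x||^2 (gives classical k-means)."""
    return np.sum((x - mu) ** 2, axis=-1)

def generalized_kl(x, mu, eps=1e-12):
    """Bregman divergence generated by phi(x) = sum_i x_i log x_i (generalized I-divergence)."""
    x = np.maximum(x, eps)    # clip to keep the logarithm finite
    mu = np.maximum(mu, eps)
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

def bregman_hard_clustering(X, k, divergence, n_iter=100, seed=0):
    """k-means-style alternation with a pluggable Bregman divergence (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: divergence from every point to every centroid, shape (n, k).
        dists = divergence(X[:, None, :], centroids[None, :, :])
        new_labels = np.argmin(dists, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centroids will not change either
        labels = new_labels
        # Update step: the arithmetic mean, whatever the divergence.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

Calling `bregman_hard_clustering(X, 2, squared_euclidean)` reduces to ordinary k-means, while passing `generalized_kl` on nonnegative data gives an information-theoretic variant with the same loop.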
Exponentiated Gradient Versus Gradient Descent for Linear Predictors
Information and Computation, 1995
Cited by 325 (14 self)
... this paper, we concentrate on linear predictors. To any vector u ∈ R ...
Learning with Labeled and Unlabeled Data
2001
Cited by 197 (3 self)
In this paper, on the one hand, we aim to give a review on literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being a part of the author's first year PhD report, the paper serves as a frame to bundle related work by the author as well as numerous suggestions for potential future work. Therefore, this work contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of input-dependent regularization. We postulate a number of baseline methods, being algorithms or algorithmic schemes which can more or less straightforwardly be applied to the problem, without the need for genuinely new concepts. However, some of them might serve as basis for a genuine method. In the literature revi...
Maximum Entropy Discrimination
1999
Cited by 141 (21 self)
We present a general framework for discriminative estimation based on the maximum entropy principle and its extensions. All calculations involve distributions over structures and/or parameters rather than specific settings and reduce to relative entropy projections. This holds even when the data is not separable within the chosen parametric class, in the context of anomaly detection rather than classification, or when the labels in the training set are uncertain or incomplete. Support vector machines are naturally subsumed under this class and we provide several extensions. We are also able to estimate exactly and efficiently discriminative distributions over tree structures of class-conditional models within this framework. Preliminary experimental results are indicative of the potential in these techniques.
Csiszár’s divergences for nonnegative matrix factorization: Family of new algorithms
LNCS, 2006
Cited by 77 (20 self)
In this paper we discuss a wide class of loss (cost) functions for nonnegative matrix factorization (NMF) and derive several novel algorithms with improved efficiency and robustness to noise and outliers. We review several approaches which allow us to obtain generalized forms of multiplicative NMF algorithms and unify some existing algorithms. We also give the flexible and relaxed form of the NMF algorithms to increase convergence speed and impose some desired constraints such as sparsity and smoothness of components. Moreover, the effects of various regularization terms and constraints are clearly shown. The scope of these results is vast since the proposed generalized divergence functions include quite a large number of useful loss functions such as the squared Euclidean distance, Kullback-Leibler divergence, Itakura-Saito, Hellinger, Pearson's chi-square, and Neyman's chi-square distances, etc. We have successfully applied the developed algorithms to blind (or semi-blind) source separation (BSS) where sources can be generally statistically dependent, however they satisfy some other conditions or additional constraints such as nonnegativity, sparsity and/or smoothness.
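For readers unfamiliar with multiplicative NMF algorithms, the following sketch shows the classical Lee-Seung updates for the squared Euclidean loss, which the abstract lists as one member of the generalized family. It is not one of the paper's Csiszár-divergence algorithms; the function name, the random initialization, and the small constant `eps` guarding against division by zero are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (n x m) as W @ H with W, H >= 0.

    Uses the classical multiplicative updates for the squared Euclidean loss.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Assumed initialization: strictly positive random factors.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each factor is rescaled elementwise by a nonnegative ratio,
        # so nonnegativity is preserved without any projection step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because every update multiplies by a nonnegative ratio, no explicit constraint handling is needed. Other divergences lead to updates of the same multiplicative shape but with different ratios.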
A Unified Framework for Model-based Clustering
Journal of Machine Learning Research, 2003
Cited by 74 (7 self)
Model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data. This paper presents a unified framework for probabilistic model-based clustering based on a bipartite graph view of data and models that highlights the commonalities and differences among existing model-based clustering algorithms. In this view, clusters are represented as probabilistic models in a model space that is conceptually separate from the data space. For partitional clustering, the view is conceptually similar to the Expectation-Maximization (EM) algorithm. For hierarchical clustering, the graph-based view helps to visualize critical/important distinctions between similarity-based approaches and model-based approaches.
Maximum Conditional Likelihood via Bound Maximization and the CEM Algorithm
In Advances in Neural Information Processing Systems 11, 1998
Cited by 61 (7 self)
We present the CEM (Conditional Expectation Maximization) algorithm as an extension of the EM (Expectation Maximization) algorithm to conditional density estimation under missing data. A bounding and maximization process is given to specifically optimize conditional likelihood instead of the usual joint likelihood. We apply the method to conditioned mixture models and use bounding techniques to derive the model's update rules. Monotonic convergence, computational efficiency and regression results superior to EM are demonstrated.
Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient
In Advances in Neural Information Processing Systems, 1997
Cited by 56 (6 self)
The parameter space of neural networks has the Riemannian metric structure. The natural Riemannian gradient should be used instead of the conventional gradient, since the former denotes the steepest descent direction of a loss function in the Riemannian space. The behavior of the stochastic gradient learning algorithm is much more effective if the natural gradient is used. The present paper studies the information-geometrical structure of perceptrons and other networks, and proves that the online learning method based on the natural gradient is asymptotically as efficient as the optimal batch algorithm. Adaptive modification of the learning constant is proposed and analyzed in terms of the Riemannian measure and is shown to be efficient. The natural gradient is finally applied to blind separation of mixed independent signal sources. 1 Introduction Neural learning takes place in the parameter space of modifiable synaptic weights of a neural network. The role of each parameter is dif...
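As a rough illustration of the natural gradient idea, the sketch below preconditions the ordinary gradient with the inverse of a metric matrix G, i.e. it steps along G^{-1} times the gradient. The toy problem (batch linear regression with unit-variance Gaussian noise, for which the Fisher information of the weights is E[x x^T]), the variable names, and the step size are assumptions chosen for illustration; the paper itself analyzes online learning in perceptrons and other networks.

```python
import numpy as np

def natural_gradient_step(theta, grad, metric, lr):
    """Return theta - lr * G^{-1} grad, solving a linear system rather than inverting G."""
    return theta - lr * np.linalg.solve(metric, grad)

# Assumed toy data: two inputs on very different scales, which slows plain gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([10.0, 0.1])
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.normal(size=500)

# Empirical Fisher information of the weights under unit noise variance.
G = X.T @ X / len(X)

w = np.zeros(2)
for _ in range(20):
    grad = X.T @ (X @ w - y) / len(X)  # gradient of half the mean squared error
    w = natural_gradient_step(w, grad, G, lr=0.5)
```

With this metric each step halves the distance to the least-squares solution regardless of how badly the inputs are scaled, which is the invariance that motivates replacing the conventional gradient.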
The em algorithm for kernel matrix completion with auxiliary data
Journal of Machine Learning Research, 2003
Cited by 53 (6 self)
In biological data, it is often the case that observed data are available only for a subset of samples. When a kernel matrix is derived from such data, we have to leave the entries for unavailable samples as missing. In this paper, the missing entries are completed by exploiting an auxiliary kernel matrix derived from another information source. The parametric model of kernel matrices is created as a set of spectral variants of the auxiliary kernel matrix, and the missing entries are estimated by fitting this model to the existing entries. For model fitting, we adopt the em algorithm (distinguished from the EM algorithm of Dempster et al., 1977) based on the information geometry of positive definite matrices. We will report promising results on bacteria clustering experiments using two marker sequences: 16S and gyrB.