Supervised Learning of Quantizer Codebooks by Information Loss Minimization
2007
Cited by 35 (0 self)
Abstract
This paper proposes a technique for jointly quantizing continuous features and the posterior distributions of their class labels based on minimizing empirical information loss, such that the index K of the quantizer region to which a given feature X is assigned approximates a sufficient statistic for its class label Y. We derive an alternating minimization procedure for simultaneously learning codebooks in the Euclidean feature space and in the simplex of posterior class distributions. The resulting quantizer can be used to encode unlabeled points outside the training set and to predict their posterior class distributions, and has an elegant interpretation in terms of lossless source coding. The proposed method is extensively validated on synthetic and real datasets, and is applied to two diverse problems: learning discriminative visual vocabularies for bag-of-features image classification, and image segmentation.
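The alternating minimization the abstract describes can be pictured with a short sketch. The distortion weight `beta`, the specific combination of squared Euclidean distance with a KL term, and the mean-based centroid updates are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def info_loss_quantizer(X, P, k, beta=1.0, n_iter=20):
    """Sketch of an alternating-minimization quantizer in the spirit of
    the abstract above (illustrative, not the paper's exact method).

    X: (n, d) features; P: (n, c) posterior class distributions.
    Each region j keeps a Euclidean centroid mu[j] and a posterior
    centroid q[j]; point i is assigned to the region minimizing
        ||x_i - mu_j||^2 + beta * KL(p_i || q_j).
    """
    n = len(X)
    # deterministic init for simplicity: evenly spaced training points
    idx = np.linspace(0, n - 1, k).astype(int)
    mu, q = X[idx].astype(float).copy(), P[idx].astype(float).copy()
    eps = 1e-12
    for _ in range(n_iter):
        # assignment step: distortion for every (point, region) pair
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)            # (n, k)
        kl = (P[:, None, :] * (np.log(P + eps)[:, None, :]
                               - np.log(q + eps)[None])).sum(-1)  # (n, k)
        z = np.argmin(d2 + beta * kl, axis=1)
        # update step: for this assignment, the summed distortion is
        # minimized by taking both centroids as means of assigned points
        for j in range(k):
            m = z == j
            if m.any():
                mu[j], q[j] = X[m].mean(0), P[m].mean(0)
    return z, mu, q
```

The mean update is optimal for both terms: the squared-distance part by the usual k-means argument, and the KL part because the minimizer of the average KL(p || q) over q is the average of the p's.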
Discriminative Clustering by Regularized Information Maximization
Cited by 12 (1 self)
Abstract
Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method.
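The flavor of RIM's objective can be made concrete with a small sketch for an (unkernelized) multinomial logistic-regression instantiation. The L2 penalty and its weight `lam` are illustrative assumptions, not the paper's exact regularizer.

```python
import numpy as np

def rim_objective(W, b, X, lam=1.0):
    """Illustrative objective in the spirit of RIM (a sketch, not the
    authors' implementation): an empirical estimate of the mutual
    information between inputs and predicted labels, minus a
    complexity penalty,
        I(x; c) ~= H(mean_i p(c|x_i)) - mean_i p-entropy H(p(c|x_i)).
    A low conditional entropy separates classes; a high entropy of the
    predicted label marginal encourages class balance."""
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # p(c | x_i), shape (n, k)

    eps = 1e-12
    cond_ent = -np.mean((P * np.log(P + eps)).sum(axis=1))  # mean_i H(p(c|x_i))
    p_bar = P.mean(axis=0)                                  # predicted label marginal
    marg_ent = -(p_bar * np.log(p_bar + eps)).sum()         # H(mean_i p(c|x_i))
    return marg_ent - cond_ent - lam * (W ** 2).sum()
```

Because entropy is concave, the mutual-information estimate (the first two terms) is always nonnegative; training would maximize this objective over `W` and `b`.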
Information bottleneck for non co-occurrence data
 In Advances in Neural Information Processing Systems 19
2007
Cited by 12 (5 self)
Abstract
We present a general model-independent approach to the analysis of data in cases when these data do not appear in the form of co-occurrence of two variables X, Y, but rather as a sample of values of an unknown (stochastic) function Z(X, Y). For example, in gene expression data, the expression level Z is a function of gene X and condition Y; or in movie ratings data the rating Z is a function of viewer X and movie Y. The approach is a consistent extension of the Information Bottleneck method, which has previously relied on the availability of co-occurrence statistics. By altering the relevance variable we eliminate the need for a sample of the joint distribution of all input variables. This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. For the prediction of missing values (collaborative filtering) it improves on the best previously reported results.
PAC-Bayesian Analysis of Co-clustering and Beyond
Cited by 11 (5 self)
Abstract
We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization
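The trade-off the bounds suggest can be illustrated with a toy objective for the discriminative prediction task. The squared loss, the weight `beta`, and the reduction of the mutual-information terms to cluster-size entropies (which holds for hard clusterings under uniform row and column distributions) are simplifying assumptions, not the paper's bound.

```python
import numpy as np

def cocluster_tradeoff(R, row_z, col_z, beta=0.1):
    """Toy illustration of the fit-vs-information trade-off above.
    Assumes hard clusterings and uniform row/column distributions, so
    I(row ID; row cluster) reduces to the entropy of the row-cluster
    size distribution (likewise for columns).

    R: (n, m) matrix with np.nan marking missing entries.
    row_z: (n,) row-cluster labels; col_z: (m,) column-cluster labels.
    """
    obs = ~np.isnan(R)
    # block-constant prediction: mean of observed entries in each block
    pred = np.zeros_like(R)
    for a in range(row_z.max() + 1):
        for c in range(col_z.max() + 1):
            blk = np.ix_(row_z == a, col_z == c)
            vals = R[blk][obs[blk]]
            pred[blk] = vals.mean() if vals.size else 0.0
    mse = np.mean((R[obs] - pred[obs]) ** 2)   # empirical fit

    def size_entropy(z):
        p = np.bincount(z) / len(z)
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    # information the cluster variables preserve on row/column IDs
    return mse + beta * (size_entropy(row_z) + size_entropy(col_z))
```

Coarser clusterings shrink the entropy terms but typically fit the observed entries worse; minimizing the combined objective balances the two, mirroring the structure of the bound.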
Learning nearest-neighbor quantizers from labeled data by information loss minimization
 In Int. Conf. on AI and Stat
2007
Cited by 9 (1 self)
Abstract
This paper proposes a technique for jointly quantizing continuous features and the posterior distributions of their class labels based on minimizing empirical information loss, such that the index K of the quantizer region to which a given feature X is assigned approximates a sufficient statistic for its class label Y. We derive an alternating minimization procedure for simultaneously learning codebooks in the Euclidean feature space and in the simplex of posterior class distributions. The resulting quantizer can be used to encode unlabeled points outside the training set and to predict their posterior class distributions, and has an elegant interpretation in terms of universal lossless coding. The promise of our method is demonstrated for the application of learning discriminative visual vocabularies for bag-of-features image classification.
A Nonparametric Information Theoretic Clustering Algorithm
Cited by 7 (1 self)
Abstract
In this paper we propose a novel clustering algorithm based on maximizing the mutual information between data points and clusters. Unlike previous methods, we neither assume the data are given in terms of distributions nor impose any parametric model on the within-cluster distribution. Instead, we utilize a nonparametric estimation of the average cluster entropies and search for a clustering that maximizes the estimated mutual information between data points and clusters. The improved performance of the proposed algorithm is demonstrated on several standard datasets.
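One way to picture such a nonparametric score (a sketch under stated assumptions, not the authors' estimator): use the log distance to the k-th nearest neighbor as a crude per-cluster entropy proxy, and minimize the average within-cluster value. Since the data entropy H(X) is fixed, this corresponds, up to that constant, to maximizing an estimate of the mutual information between points and clusters.

```python
import numpy as np

def knn_entropy_proxy(X, k=2):
    """Crude nearest-neighbor entropy proxy: dimension times the mean
    log distance to the k-th neighbor. The additive constants of the
    usual Kozachenko-Leonenko estimator are dropped for simplicity, so
    only comparisons between partitions of the same data make sense."""
    d2 = ((X[:, None, :] - X[None]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude the point itself
    kth = np.sqrt(np.sort(d2, axis=1)[:, k - 1])  # k-th neighbor distance
    return X.shape[1] * np.mean(np.log(kth + 1e-12))

def mi_score(X, labels, k=2):
    """Average within-cluster entropy proxy (to be minimized).
    Illustrative sketch; assumes every cluster has more than k points."""
    score = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        score += (len(Xc) / len(X)) * knn_entropy_proxy(Xc, k)
    return score
```

On well-separated data, the correct partition yields tighter clusters, hence smaller k-th neighbor distances and a lower (better) score than an arbitrary labeling.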
Learning and generalization with the information bottleneck method
2008
Cited by 5 (2 self)
Abstract
The Information Bottleneck (IB) method, introduced in [22], is an information-theoretic framework for extracting relevant components of an ‘input’ random variable X, with respect to an ‘output’ random variable Y. This is performed by finding a compressed, nonparametric and model-independent representation
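For context, the self-consistent equations of the original IB method can be sketched as follows. This is a textbook-style illustration of the iterative algorithm, not code from the paper above.

```python
import numpy as np

def iterative_ib(pxy, n_t, beta=5.0, n_iter=100, seed=0):
    """Iterative Information Bottleneck updates (illustrative sketch).
    pxy: (nx, ny) joint distribution p(x, y); n_t: cardinality of the
    bottleneck variable T; beta trades compression I(X;T) against
    preserved relevance I(T;Y). Alternates the three IB equations:
        p(t|x) ∝ p(t) exp(-beta * KL(p(y|x) || p(y|t)))
        p(t)   = sum_x p(x) p(t|x)
        p(y|t) = sum_x p(y|x) p(t|x) p(x) / p(t)
    """
    rng = np.random.default_rng(seed)
    px = pxy.sum(1)                            # p(x)
    py_x = pxy / px[:, None]                   # p(y|x)
    qt_x = rng.random((len(px), n_t))
    qt_x /= qt_x.sum(1, keepdims=True)         # random init of p(t|x)
    eps = 1e-12
    for _ in range(n_iter):
        qt = px @ qt_x                         # p(t)
        qy_t = (qt_x * px[:, None]).T @ py_x / (qt[:, None] + eps)  # p(y|t)
        # KL(p(y|x) || p(y|t)) for every pair (x, t)
        kl = (py_x[:, None, :] * (np.log(py_x + eps)[:, None, :]
                                  - np.log(qy_t + eps)[None])).sum(-1)
        qt_x = qt[None] * np.exp(-beta * kl)
        qt_x /= qt_x.sum(1, keepdims=True)
    return qt_x, qt, qy_t
```

Each pass re-derives p(t), p(y|t), and p(t|x) from one another until the three distributions are mutually consistent, which is a local optimum of the IB functional.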
Parallel pairwise clustering
In SDM’09, Proceedings of the SIAM Data Mining Conference
2009
Cited by 3 (0 self)
Abstract
Given the pairwise affinity relations associated with a set of data items, the goal of a clustering algorithm is to automatically partition the data into a small number of homogeneous clusters. However, since the input size is quadratic in the number of data points, existing algorithms are infeasible for many practical applications. Here, we propose a simple strategy to cluster massive data by randomly splitting the original affinity matrix into small manageable affinity matrices that are clustered independently. Our proposal is most appealing in a parallel computing environment where at each iteration, each worker node clusters a subset of the input data and the results from all workers are then integrated in a master node to create a new clustering partition over the entire data. We demonstrate that this approach yields high quality clustering partitions for various real world problems, even though at each iteration only small fractions of the original data matrix are examined and at no point is the entire affinity matrix stored in memory or even computed. Furthermore, we demonstrate that the proposed algorithm has intriguing stochastic convergence properties that provide further insight into the clustering problem.
Allegro: Analyzing expression and sequence in concert to discover regulatory programs
2008
Cited by 2 (0 self)
Abstract
A major goal of systems biology is the characterization of transcription factors and microRNAs (miRNAs) and the transcriptional programs they regulate. We present Allegro, a method for de novo discovery of cis-regulatory transcriptional programs through joint analysis of genome-wide expression data and promoter or 3' UTR sequences. The algorithm uses a novel log-likelihood-based, nonparametric model to describe the expression pattern shared by a group of co-regulated genes. We show that Allegro is more accurate and sensitive than existing techniques, and can simultaneously analyze multiple expression datasets with more than 100 conditions. We apply Allegro to datasets from several species and report on the transcriptional modules it uncovers. Our analysis reveals a novel motif overrepresented in the promoters of genes highly expressed in murine oocytes, and several new motifs related to fly development. Finally, using stem-cell expression profiles, we identify three miRNA families with pivotal roles in human embryogenesis.
A PAC-Bayesian Analysis of Graph Clustering and Pairwise Clustering
Cited by 2 (1 self)
Abstract
We formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. This formulation enables practical and theoretical comparison of different approaches to graph clustering as well as comparison of graph clustering with other possible ways to model the graph. We adapt the PAC-Bayesian analysis of co-clustering (Seldin and Tishby, 2008; Seldin, 2009) to derive a PAC-Bayesian generalization bound for graph clustering. The bound shows that graph clustering should optimize a trade-off between empirical data fit and the mutual information that clusters preserve on the graph nodes. A similar trade-off derived from information-theoretic considerations was already shown to produce state-of-the-art results in practice (Slonim et al., 2005; Yom-Tov and Slonim, 2009). This paper supports the empirical evidence by providing a better theoretical foundation, suggesting formal generalization guarantees, and offering a more accurate way to deal with finite sample issues. We derive a bound minimization algorithm and show that it provides good results in real-life problems and that the derived PAC-Bayesian bound is reasonably tight.