Results 1–10 of 30
Tensor decompositions for learning latent variable models, 2014
Cited by 83 (7 self)
Abstract: This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models, including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation, which exploits a certain tensor structure in their low-order observable moments (typically, of second and third order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
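The power iteration mentioned in the abstract can be illustrated concretely. The sketch below (all dimensions, weights, and seeds are invented for illustration) builds a symmetric tensor with a known orthogonal decomposition T = Σᵢ λᵢ vᵢ⊗vᵢ⊗vᵢ and recovers one (λᵢ, vᵢ) pair via the power map u ← T(I, u, u)/‖T(I, u, u)‖:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3  # ambient dimension and number of components (toy sizes)

# Symmetric tensor with a known orthogonal decomposition:
# T = sum_i lam[i] * v_i (x) v_i (x) v_i
lam = np.array([3.0, 2.0, 1.0])
V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal columns v_1..v_k
T = np.einsum('i,ai,bi,ci->abc', lam, V, V, V)

def tensor_power_iteration(T, iters=100, seed=1):
    """Run the power map u <- T(I, u, u) / ||T(I, u, u)|| from a random start."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(iters):
        u = np.einsum('abc,b,c->a', T, u, u)   # contract T against u twice
        u /= np.linalg.norm(u)
    return np.einsum('abc,a,b,c->', T, u, u, u), u   # (eigenvalue, eigenvector)

lam_hat, u_hat = tensor_power_iteration(T)
# u_hat converges (up to sign) to one column of V, lam_hat to its weight
```

With a generic random start, the iteration converges quadratically to one of the components; deflating (subtracting λ̂ û⊗û⊗û) and repeating recovers the rest.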
Multi-View Clustering via Canonical Correlation Analysis
Cited by 76 (6 self)
Abstract: Clustering data in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Such techniques typically require stringent requirements on the separation between the cluster means (in order for the algorithm to be successful). Here, we show how using multiple views of the data can relax these stringent requirements. We use Canonical Correlation Analysis (CCA) to project the data in each view to a lower-dimensional subspace. Under the assumption that, conditioned on the cluster label, the views are uncorrelated, we show that the separation conditions required for the algorithm to be successful are rather mild (significantly weaker than those of prior results in the literature). We provide results for mixtures of ... The multi-view approach to learning is one in which we have 'views' of the data (sometimes in a rather abstract sense) and, if we understand the underlying relationship between these views, the hope is that this relationship can be used to alleviate the difficulty of a learning problem of interest [BM98, KF07, AZ07]. In this work, we explore how having 'two views' of the data makes ...
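A minimal sketch of the CCA projection step, on a toy two-view mixture in which both views share the cluster mean and carry independent noise (the sizes, regularization, and Cholesky-whitening route are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 10, 2  # samples, per-view dimension, clusters (toy sizes)

# Two views sharing the cluster mean, with independent noise: conditioned
# on the label, the views are uncorrelated, as the paper assumes
labels = rng.integers(0, k, size=n)
means = 4.0 * rng.standard_normal((k, d))
X = means[labels] + rng.standard_normal((n, d))   # view 1
Y = means[labels] + rng.standard_normal((n, d))   # view 2

def cca_projection(X, Y, k, reg=1e-6):
    """Top-k canonical directions for view 1 via an SVD of the whitened
    cross-covariance (reg keeps the Cholesky factors invertible)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx = Xc.T @ Xc / len(X) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / len(X)
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T   # Wx.T @ Cxx @ Wx = I
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, _, _ = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :k]

P = cca_projection(X, Y, k)
Z = (X - X.mean(0)) @ P   # project view 1 to k dimensions before clustering
```

Because only the cluster means are correlated across views, the top canonical directions concentrate on the between-cluster signal, so the cluster means stay separated in the k-dimensional projection Z.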
A Spectral Algorithm for Latent Dirichlet Allocation
Cited by 49 (11 self)
Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third-order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.
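The reason later SVDs can act on k × k matrices can be sketched as follows: one SVD of the large second-moment matrix identifies its k-dimensional range, after which everything happens in that subspace. The rank-k "pairs" matrix M2 = A diag(w) Aᵀ below is a hypothetical stand-in, with invented topic-word columns A and weights w (this shows only the dimension reduction, not the full Excess Correlation Analysis procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 300, 5  # vocabulary size vs. number of topics (toy sizes)

# Hypothetical rank-k second-moment matrix, as in moment-based topic models
A = rng.random((d, k))
A /= A.sum(axis=0)                        # columns are topic-word distributions
w = np.array([0.3, 0.25, 0.2, 0.15, 0.1])  # invented topic weights
M2 = A @ np.diag(w) @ A.T

# One SVD on the large d x d matrix finds its k-dimensional range ...
U = np.linalg.svd(M2, hermitian=True)[0]
Uk = U[:, :k]

# ... after which every further spectral step can act on k x k projections
M2_small = Uk.T @ M2 @ Uk
```

Since M2 has rank k and Uk spans its range, the small projection loses nothing: Uk M2_small Ukᵀ reconstructs M2 exactly (up to floating-point error), which is why only one expensive SVD is ever needed.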
Polynomial Learning of Distribution Families
Cited by 49 (0 self)
Abstract: The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions has remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. Specifically, we show that parameters of a Gaussian mixture distribution with a fixed number of components can be learned using a sample whose size is polynomial in the dimension and all other parameters. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in the parameters, and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions. Index Terms: Gaussian mixture learning, polynomial learnability.
Disentangling Gaussians
Communications of the ACM, 2012
doi:10.1145/2076450.2076474
Clustering with Interactive Feedback
Cited by 11 (1 self)
Abstract: In this paper, we initiate a theoretical study of the problem of clustering data under interactive feedback. We introduce a query-based model in which users can provide feedback to a clustering algorithm in a natural way via split and merge requests. We then analyze the "clusterability" of different concept classes in this framework (the ability to cluster correctly with a bounded number of requests under only the assumption that each cluster can be described by a concept in the class) and provide efficient algorithms as well as information-theoretic upper and lower bounds.
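The split/merge query model can be sketched as a feedback loop in which a simulated user compares the current clustering against a target partition. The oracle below stands in for the user, and the halving split rule and all names are illustrative, not the paper's algorithms:

```python
def oracle(clustering, target):
    """Simulated user: request a split of any impure cluster; otherwise request
    a merge of two pure clusters sharing a label; None once the target is reached."""
    for C in clustering:
        if len({target[x] for x in C}) > 1:
            return ('split', C)
    seen = {}
    for C in clustering:
        lbl = target[next(iter(C))]
        if lbl in seen:
            return ('merge', seen[lbl], C)
        seen[lbl] = C
    return None

def halve(C):
    """Illustrative split rule: cut the cluster into two arbitrary halves."""
    members = sorted(C)
    return frozenset(members[:len(members) // 2]), frozenset(members[len(members) // 2:])

def interactive_cluster(points, target, split_rule=halve):
    """Start from one big cluster and obey split/merge feedback until done."""
    clustering, queries = [frozenset(points)], 0
    while (req := oracle(clustering, target)) is not None:
        queries += 1
        if req[0] == 'split':
            clustering.remove(req[1])
            clustering.extend(split_rule(req[1]))
        else:
            _, a, b = req
            clustering.remove(a)
            clustering.remove(b)
            clustering.append(a | b)
    return clustering, queries

# Toy run: the target clusters are the even and odd integers below 8
final, queries = interactive_cluster(range(8), {x: x % 2 for x in range(8)})
```

Termination here follows from the model's structure: splits strictly shrink impure clusters until everything is pure, and merging two pure same-label clusters can never create an impure one; the theoretical question is how few requests a clever split rule needs.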
Robust PCA and clustering on noisy mixtures
In Proc. of SODA, 2009
Cited by 10 (1 self)
Abstract: This paper presents a polynomial algorithm for learning mixtures of log-concave distributions in R^n in the presence of malicious noise. That is, each sample is corrupted with some small probability, being replaced by a point about which we can make no assumptions. A key element of the algorithm is Robust Principal Components Analysis (PCA), which is less susceptible to corruption by noisy points. While noise may cause standard PCA to collapse well-separated mixture components so that they are indistinguishable, Robust PCA preserves the distance between some of the components, making a partition possible. It then recurses on each half of the mixture until every component is isolated. The success of this algorithm requires only an O*(log n) factor increase in the required separation between components of the mixture compared to the noiseless case.
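As a rough illustration of why robustifying PCA matters, the sketch below alternates between fitting a top principal direction and permanently discarding the farthest-out fraction of points along it. This is a crude trimming heuristic standing in for Robust PCA, not the paper's procedure, and the demo data (two separated clusters plus a few malicious far-off points) are invented:

```python
import numpy as np

def robust_top_direction(X, trim=0.05, rounds=3):
    """Outlier-trimming heuristic: fit the top direction on the points kept
    so far, then permanently drop the farthest-out fraction along it."""
    keep = np.ones(len(X), dtype=bool)
    for _ in range(rounds):
        mu = X[keep].mean(axis=0)
        v = np.linalg.svd(X[keep] - mu, full_matrices=False)[2][0]
        dist = np.abs((X - mu) @ v)
        keep &= dist <= np.quantile(dist[keep], 1 - trim)  # cumulative trim
    return v

# Invented demo: two well-separated clusters along axis 0, plus a handful
# of malicious points placed far away along axis 1
rng = np.random.default_rng(0)
inliers = rng.standard_normal((500, 10))
inliers[:, 0] += np.where(rng.random(500) < 0.5, 5.0, -5.0)
outliers = np.zeros((10, 10))
outliers[:, 1] = 1000.0
X = np.vstack([inliers, outliers])

v_robust = robust_top_direction(X)   # recovers the cluster-separating axis 0
v_plain = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2][0]  # dragged to axis 1
```

Standard PCA latches onto the huge variance the corrupted points inject along axis 1, collapsing the true separation; the trimmed fit discards them in the first round and finds the direction that actually separates the mixture components.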
Learning mixtures of Gaussians using the k-means algorithm, arXiv preprint arXiv:0912.0086, 2009
Improved spectral-norm bounds for clustering
In APPROX-RANDOM, 37–49, 2012
Cited by 5 (0 self)
Abstract: Aiming to unify known results about clustering mixtures of distributions under separation conditions, Kumar and Kannan [KK10] introduced a deterministic condition for clustering datasets. They showed that this single deterministic condition encompasses many previously studied clustering assumptions. More specifically, their proximity condition requires that, in the target k-clustering, the projection of a point x onto the line joining its cluster center µ and some other center µ′ is a large additive factor closer to µ than to µ′. This additive factor can be roughly described as k times the spectral norm of the matrix representing the differences between the given (known) dataset and the means of the (unknown) target clustering. Clearly, the proximity condition implies center separation: the distance between any two centers must be as large as the above-mentioned bound. In this paper we improve upon the work of Kumar and Kannan [KK10] along several axes. First, we weaken the center separation bound by a factor of √k, and secondly we weaken the proximity condition by a factor of k (in other words, the revised separation condition is independent of k). Using these weaker bounds we still achieve the same guarantees when all ...
Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians
Cited by 5 (0 self)
Abstract: We provide an algorithm for properly learning mixtures of two single-dimensional Gaussians without any separability assumptions. Given Õ(1/ε²) samples from an unknown mixture, our algorithm outputs a mixture that is ε-close in total variation distance, in time Õ(1/ε⁵). Our sample complexity is optimal up to logarithmic factors, and significantly improves upon both Kalai et al. (2010), whose algorithm has a prohibitive dependence on 1/ε, and Feldman et al. (2006), whose algorithm requires bounds on the mixture parameters and depends pseudo-polynomially on these parameters. One of our main contributions is an improved and generalized algorithm for selecting a good candidate distribution from among competing hypotheses. Namely, given a collection of N hypotheses containing at least one candidate that is ε-close to an unknown distribution, our algorithm outputs a candidate which is O(ε)-close to the distribution. The algorithm requires O(log N/ε²) samples from the unknown distribution and O(N log N/ε²) time, which improves previous such results (such as the Scheffé estimator) from a quadratic dependence of the running time on N to quasilinear. Given the wide use of such results for the purpose of hypothesis selection, our improved algorithm implies immediate improvements to any such use.
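For context, the quadratic-time baseline the abstract improves on can be sketched as a round-robin Scheffé tournament. The discrete candidates, sample size, and tie-breaking below are all illustrative, and this is the N² pairing scheme, not the paper's quasilinear selector:

```python
import numpy as np

def scheffe_winner(p, q, samples):
    """Scheffé pairwise test between two discrete candidate distributions p, q
    on {0,...,m-1}, given i.i.d. integer samples from the unknown truth."""
    A = p > q                                  # the Scheffé set {x : p(x) > q(x)}
    emp = float(np.mean(A[samples]))           # empirical mass of A
    # whichever candidate predicts A's probability better wins this pairing
    return 'p' if abs(p[A].sum() - emp) <= abs(q[A].sum() - emp) else 'q'

def tournament(candidates, samples):
    """Classical round-robin selection: N(N-1)/2 pairings, the quadratic
    running-time baseline the paper improves to quasilinear in N."""
    wins = [0] * len(candidates)
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            w = scheffe_winner(candidates[i], candidates[j], samples)
            wins[i if w == 'p' else j] += 1
    return max(range(len(candidates)), key=wins.__getitem__)

# Toy hypothesis selection: candidate 1 is the true distribution
cands = [np.array([0.7, 0.1, 0.1, 0.1]),
         np.array([0.25, 0.25, 0.25, 0.25]),
         np.array([0.1, 0.1, 0.1, 0.7])]
rng = np.random.default_rng(0)
samples = rng.choice(4, size=5000, p=cands[1])
best = tournament(cands, samples)
```

Each pairing checks which candidate better predicts the empirical mass of the Scheffé set; a candidate ε-close to the truth wins enough pairings to surface, which is why the tournament outputs an O(ε)-close hypothesis.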