Factorial LDA: Sparse Multi-Dimensional Text Models
"... Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each wo ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
(Show Context)
Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors.
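The generative structure stated in this abstract (a K-dimensional vector of latent variables per word token, with word distributions indexed by the full tuple) can be sketched as follows. This is a minimal illustration assuming symmetric Dirichlet priors and arbitrary sizes; it omits the paper's structured word priors and sparse product of factors.

```python
# Minimal sketch of a factorial-LDA-style generative step: every token draws
# one component per factor, and the word comes from a distribution indexed by
# the whole K-tuple. Priors and sizes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

factor_sizes = [5, 4, 2]      # components per factor, e.g. topic/discipline/focus
K = len(factor_sizes)         # number of factors
vocab_size = 1000

# Per-document distribution over each factor's components (symmetric Dirichlet).
theta = [rng.dirichlet(np.ones(s)) for s in factor_sizes]

# One word distribution for every K-dimensional tuple of components.
phi = rng.dirichlet(np.ones(vocab_size), size=tuple(factor_sizes))

def sample_token():
    # Draw a K-dimensional vector of latent variables, one per factor,
    # then a word from the tuple-indexed distribution.
    z = tuple(rng.choice(s, p=t) for s, t in zip(factor_sizes, theta))
    return z, rng.choice(vocab_size, p=phi[z])
```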
Which Clustering Do You Want? Inducing Your Ideal Clustering with Minimal Feedback
- Journal of Artificial Intelligence Research
"... Abstract While traditional research on text clustering has largely focused on grouping documents by topic, it is conceivable that a user may want to cluster documents along other dimensions, such as the author's mood, gender, age, or sentiment. Without knowing the user's intention, a clus ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
While traditional research on text clustering has largely focused on grouping documents by topic, it is conceivable that a user may want to cluster documents along other dimensions, such as the author's mood, gender, age, or sentiment. Without knowing the user's intention, a clustering algorithm will only group documents along the most prominent dimension, which may not be the one the user desires. To address the problem of clustering documents along the user-desired dimension, previous work has focused on learning a similarity metric from data manually annotated with the user's intention or having a human construct a feature space in an interactive manner during the clustering process. With the goal of reducing reliance on human knowledge for fine-tuning the similarity function or selecting the relevant features required by these approaches, we propose a novel active clustering algorithm, which allows a user to easily select the dimension along which she wants to cluster the documents by inspecting only a small number of words. We demonstrate the viability of our algorithm on a variety of commonly-used sentiment datasets.
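A hedged sketch of the word-inspection idea: show the user a few candidate words, keep only those she marks as matching her intended dimension, and reweight the feature space before re-clustering. The function name, the weighting scheme, and the use of KMeans are illustrative assumptions, not the authors' algorithm.

```python
# Illustrative feedback loop: inspected words the user rejects are zeroed out,
# kept words are boosted, and the documents are re-clustered in the
# reweighted feature space.
import numpy as np
from sklearn.cluster import KMeans

def recluster_with_feedback(X, candidate_idx, user_keeps, k=2, boost=5.0):
    """X: (docs x words) tf-idf matrix as a dense array;
    candidate_idx: word columns shown to the user;
    user_keeps: the subset of those the user marked as relevant."""
    weights = np.ones(X.shape[1])
    for j in candidate_idx:
        # Down-weight rejected words, up-weight the kept ones.
        weights[j] = boost if j in user_keeps else 0.0
    return KMeans(n_clusters=k, n_init=10).fit_predict(X * weights)
```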
A Nonparametric Bayesian Model for Multiple Clustering with Overlapping Feature Views
- Journal of Machine Learning Research
, 2012
"... Most clustering algorithms produce a single clustering solution. This is inadequate for many data sets that are multi-faceted and can be grouped and interpreted in many different ways. Moreover, for high-dimensional data, different features may be relevant or irrelevant to each clustering solution, ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Most clustering algorithms produce a single clustering solution. This is inadequate for many data sets that are multi-faceted and can be grouped and interpreted in many different ways. Moreover, for high-dimensional data, different features may be relevant or irrelevant to each clustering solution, suggesting the need for feature selection in clustering. Features relevant to one clustering interpretation may be different from the ones relevant for an alternative interpretation or view of the data. In this paper, we introduce a probabilistic nonparametric Bayesian model that can discover multiple clustering solutions from data and the feature subsets that are relevant for the clusters in each view. In our model, the features in different views may be shared and therefore the sets of relevant features are allowed to overlap. We model feature relevance to each view using an Indian Buffet Process and the cluster membership in each view using a Chinese Restaurant Process. We provide an inference approach to learn the latent parameters corresponding to this multiple partitioning problem. Our model not only learns the features and clusters in each view but also automatically learns the number of clusters, number of views and number of features in each view.
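The two priors named in the abstract are standard and can be forward-sampled directly: an Indian Buffet Process for the binary feature-to-view relevance matrix and a Chinese Restaurant Process for cluster memberships within each view. The sketch below samples from these priors only; the hyperparameters and the way the two are wired together are illustrative, and the paper's likelihood and inference procedure are omitted.

```python
# Forward samples from the IBP (feature relevance per view) and the CRP
# (cluster membership per view). Toy sizes and hyperparameters.
import numpy as np

rng = np.random.default_rng(1)

def ibp(n_features, alpha=2.0):
    """Indian Buffet Process: rows = features, columns = views; a 1 marks a
    feature as relevant to that view. The number of views is unbounded."""
    cols, counts = [], []
    for i in range(1, n_features + 1):
        for k in range(len(cols)):
            # Reuse an existing view with probability proportional to its use.
            take = int(rng.random() < counts[k] / i)
            cols[k].append(take)
            counts[k] += take
        for _ in range(rng.poisson(alpha / i)):     # brand-new views
            cols.append([0] * (i - 1) + [1])
            counts.append(1)
    return np.array(cols).T if cols else np.zeros((n_features, 0), dtype=int)

def crp(n_points, alpha=1.0):
    """Chinese Restaurant Process: a cluster assignment for each data point."""
    assign, sizes = [], []
    for _ in range(n_points):
        p = np.array(sizes + [alpha], dtype=float)
        table = rng.choice(len(p), p=p / p.sum())
        if table == len(sizes):
            sizes.append(0)                         # open a new cluster
        sizes[table] += 1
        assign.append(table)
    return assign

Z = ibp(n_features=20)                          # which features each view uses
views = [crp(100) for _ in range(Z.shape[1])]   # one clustering per view
```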
Evaluating unsupervised learning for natural language processing tasks
"... The development of unsupervised learning methods for natural language processing tasks has become an important and popular area of research. The primary advantage of these methods is that they do not require annotated data to learn a model. However, this advantage makes them difficult to evaluate ag ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
The development of unsupervised learning methods for natural language processing tasks has become an important and popular area of research. The primary advantage of these methods is that they do not require annotated data to learn a model. However, this advantage makes them difficult to evaluate against a manually labeled gold standard. Using unsupervised part-of-speech tagging as our case study, we discuss the reasons that render this evaluation paradigm unsuitable for the evaluation of unsupervised learning methods. Instead, we argue that the rarely used in-context evaluation is more appropriate and more informative, as it takes into account the way these methods are likely to be applied. Finally, bearing the issue of evaluation in mind, we propose directions for future work in unsupervised natural language processing.
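For context, the gold-standard paradigm this abstract argues against is commonly instantiated as many-to-one accuracy for unsupervised POS tagging: map each induced cluster to its most frequent gold tag, then score as ordinary tagging accuracy. A minimal sketch of that standard mapping rule (not something proposed in this paper):

```python
# Many-to-one evaluation: each induced cluster id is mapped to the gold tag
# it co-occurs with most often, then accuracy is computed under that mapping.
from collections import Counter, defaultdict

def many_to_one_accuracy(induced, gold):
    """induced, gold: equal-length sequences of cluster ids and gold POS tags."""
    by_cluster = defaultdict(Counter)
    for c, g in zip(induced, gold):
        by_cluster[c][g] += 1
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in by_cluster.items()}
    correct = sum(mapping[c] == g for c, g in zip(induced, gold))
    return correct / len(gold)
```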
Model-Based Clustering of High-Dimensional Data: Variable Selection versus Facet Determination
"... Variable selection is an important problem for cluster analysis of high-dimensional data. It is also a difficult one. The difficulty originates not only from the lack of class information but also the fact that high-dimensional data are often multifaceted and can be meaningfully clustered in multipl ..."
Abstract
- Add to MetaCart
(Show Context)
Variable selection is an important problem for cluster analysis of high-dimensional data. It is also a difficult one. The difficulty originates not only from the lack of class information but also from the fact that high-dimensional data are often multifaceted and can be meaningfully clustered in multiple ways. In such a case the effort to find one subset of attributes that presumably gives the “best” clustering may be misguided. It makes more sense to identify various facets of a data set (each being based on a subset of attributes), cluster the data along each one, and present the results to the domain experts for appraisal and selection. In this paper, we propose a generalization of the Gaussian mixture models and demonstrate its ability to automatically identify natural facets of data and cluster data along each of those facets simultaneously. We present empirical results to show that facet determination usually leads to better clustering results than variable selection.
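A simplified sketch of the facet idea: given a candidate partition of the attributes into facets, fit an independent Gaussian mixture per facet and score the partition, for example by summed BIC. The paper's model learns the facet partition itself; supplying it from outside, and the use of scikit-learn's GaussianMixture with a fixed number of components, are illustrative shortcuts.

```python
# One clustering per facet: each facet is a disjoint subset of attribute
# indices, clustered independently; lower summed BIC = better partition.
from sklearn.mixture import GaussianMixture

def cluster_by_facets(X, facets, k=3):
    """X: (n x d) data array; facets: list of disjoint attribute-index lists."""
    clusterings, total_bic = [], 0.0
    for attrs in facets:
        gm = GaussianMixture(n_components=k).fit(X[:, attrs])
        clusterings.append(gm.predict(X[:, attrs]))  # clustering along this facet
        total_bic += gm.bic(X[:, attrs])
    return clusterings, total_bic
```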
A flexible cluster-oriented alternative clustering algorithm for choosing from the …
, 2012
Iterative Discovery of Multiple Alternative Clustering Views
"... Abstract—Complex data can be grouped and interpreted in many different ways. Most existing clustering algorithms, however, only find one clustering solution, and provide little guidance to data analysts who may not be satisfied with that single clustering and may wish to explore alternatives. We int ..."
Abstract
- Add to MetaCart
Complex data can be grouped and interpreted in many different ways. Most existing clustering algorithms, however, only find one clustering solution, and provide little guidance to data analysts who may not be satisfied with that single clustering and may wish to explore alternatives. We introduce a novel approach that provides several clustering solutions to the user for the purposes of exploratory data analysis. Our approach additionally captures the notion that alternative clusterings may reside in different subspaces (or views). We present an algorithm that simultaneously finds these subspaces and the corresponding clusterings. The algorithm is based on an optimization procedure that incorporates terms for cluster quality and novelty relative to previously discovered clustering solutions. We present a range of experiments that compare our approach to alternatives and explore the connections between simultaneous and iterative modes of discovery of multiple clusterings.
Index Terms: Kernel methods, non-redundant clustering, alternative clustering, multiple clustering, dimensionality reduction
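The quality-plus-novelty trade-off described above can be illustrated with a simple scoring rule: reward cluster quality (here, silhouette) and penalize agreement with previously discovered clusterings (here, adjusted Rand index). The authors optimize subspaces and clusterings jointly; this scoring-only function is a hedged illustration, not their objective.

```python
# Score a candidate clustering: quality minus its worst-case redundancy
# against any previously found clustering. Metric choices are assumptions.
from sklearn.metrics import silhouette_score, adjusted_rand_score

def alternative_score(X, labels, previous_labelings, trade_off=1.0):
    """X: data array; labels: candidate clustering (>= 2 clusters);
    previous_labelings: list of earlier label vectors."""
    quality = silhouette_score(X, labels)
    redundancy = max((adjusted_rand_score(labels, prev)
                      for prev in previous_labelings), default=0.0)
    return quality - trade_off * redundancy
```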
A Comparison of Perfect Table Cryptanalytic Tradeoff Algorithms
, 2012
"... Abstract Analyses of three major time memory tradeoff algorithms were presented by a recent paper in such a way that facilitates comparisons of the algorithm performances at arbitrary choices of the algorithm parameters. The algorithms considered there were the classical Hellman tradeoff and the non ..."
Abstract
- Add to MetaCart
(Show Context)
Analyses of three major time memory tradeoff algorithms were presented by a recent paper in such a way that facilitates comparisons of the algorithm performances at arbitrary choices of the algorithm parameters. The algorithms considered there were the classical Hellman tradeoff and the non-perfect table versions of the distinguished point method and the rainbow table method. This paper adds the perfect table versions of the distinguished point method and the rainbow table method to the list, so that all the major tradeoff algorithms may now be compared against each other. The algorithm performance information provided by this and the preceding paper is aimed at making practical comparisons possible. Comparisons that take both the cost of precomputation and the efficiency of the online phase into account, at parameters that achieve a common success rate, can now be carried out with ease. Comparisons can be based on the expected execution complexities rather than the worst case complexities, and details such as the effects of false alarms and various storage optimization techniques need no longer be ignored. A large portion of this paper is allocated to accurately analyzing the execution behavior of the perfect table distinguished point method. In particular, we obtain a closed-form formula for the average length of chains associated with a perfect distinguished point table.
Keywords: time memory tradeoff · distinguished point · rainbow table · perfect table · algorithm complexity
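A small sketch of the distinguished-point chain construction that these tradeoffs build on: iterate a one-way step function from a random start until the output satisfies the distinguished-point property (here, t low-order zero bits), storing only the (start, endpoint) pair. The hash-based step function and all parameters are toy assumptions; with a t-bit property, chain lengths are geometric with mean about 2^t, which is the kind of quantity the paper's closed-form analysis makes precise for perfect tables.

```python
# Toy distinguished-point chain over a 24-bit space. Over-long chains are
# discarded, as distinguished-point tradeoffs do to bound online cost.
import hashlib

N_BITS, DP_BITS, MAX_LEN = 24, 6, 1 << 12

def step(x):
    """One-way step function: a truncated hash reduced back into the space."""
    h = hashlib.sha256(x.to_bytes(4, "big")).digest()
    return int.from_bytes(h[:4], "big") % (1 << N_BITS)

def build_chain(start):
    """Iterate until a distinguished point (DP_BITS low zero bits) is hit."""
    x, length = start, 0
    while x % (1 << DP_BITS) != 0:
        x = step(x)
        length += 1
        if length > MAX_LEN:
            return None                 # discard chains that never reach a DP
    return start, x, length             # store only (start, DP endpoint)
```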