Online Learning for Latent Dirichlet Allocation
Cited by 22 (5 self)
We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good as or better than those found with batch VB, and in a fraction of the time.
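The abstract above does not spell out the update rule, so the following is only a minimal sketch of what a stochastic step with a decaying learning rate can look like. It assumes the topic parameters are blended with a per-mini-batch estimate; the names `lam`, `lam_hat`, `tau0`, and `kappa` are illustrative assumptions, not taken from the listing.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t) ** -kappa (assumed schedule)."""
    return (tau0 + t) ** (-kappa)


def online_update(lam, lam_hat, rho):
    """Blend current topic parameters with a mini-batch estimate.

    lam     -- current topic-word parameters, one list of floats per topic
    lam_hat -- estimate computed as if the whole corpus looked like the
               current mini-batch (hypothetical input, computed elsewhere)
    rho     -- step size in (0, 1]
    """
    return [
        [(1.0 - rho) * old + rho * new for old, new in zip(row_old, row_new)]
        for row_old, row_new in zip(lam, lam_hat)
    ]


# Usage: one topic over a two-word vocabulary, step taken with rho = 0.5.
updated = online_update([[1.0, 1.0]], [[3.0, 5.0]], 0.5)
```

With a step size that shrinks over time, early mini-batches move the parameters a lot and later ones only refine them, which is what lets a single pass over a stream settle down.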
Creating Speech and Language Data With Amazon’s Mechanical Turk
Cited by 13 (1 self)
In this paper we give an introduction to using Amazon’s Mechanical Turk crowdsourcing platform for the purpose of collecting data for human language technologies. We survey the papers published in the NAACL-2010 Workshop. Twenty-four researchers participated in the workshop’s shared task to create data for speech and language applications with $100.
A two-dimensional topic-aspect model for discovering multi-faceted topics
In AAAI, 2010
Cited by 8 (4 self)
This paper presents the Topic-Aspect Model (TAM), a Bayesian mixture model which jointly discovers topics and aspects. We broadly define an aspect of a document as a characteristic that spans the document, such as an underlying theme or perspective. Unlike previous models which cluster words by topic or aspect, our model can generate token assignments in both of these dimensions, rather than assuming words come from only one of two orthogonal models. We present two applications of the model. First, we model a corpus of computational linguistics abstracts, and find that the scientific topics identified in the data tend to include both a computational aspect and a linguistic aspect. For example, the computational aspect of GRAMMAR emphasizes parsing,
Latent variable models of selectional preference
In ACL 2010
Cited by 7 (0 self)
This paper describes the application of so-called topic models to selectional preference induction. Three models related to Latent Dirichlet Allocation, a proven method for modelling document-word cooccurrences, are presented and evaluated on datasets of human plausibility judgements. Compared to previously proposed techniques, these models perform very competitively, especially for infrequent predicate-argument combinations where they exceed the quality of Web-scale predictions while using relatively little data.
Building and Using a Semantivisual Image Hierarchy
Cited by 6 (2 self)
A meaningful image hierarchy can ease the human effort in organizing thousands and millions of pictures (e.g., personal albums), and help to improve performance of end tasks such as image annotation and classification. Previous work has focused on using either low-level image features or textual tags to build image hierarchies, resulting in limited success in their general usage. In this paper, we propose a method to automatically discover the “semantivisual” image hierarchy by incorporating both image and tag information. This hierarchy encodes a general-to-specific image relationship. We pay particular attention to quantifying the effectiveness of the learned hierarchy, as well as comparing our method with others in the end-task applications. Our experiments show that humans find our semantivisual image hierarchy more effective than those solely based on texts or low-level visual features. Moreover, using the constructed image hierarchy as a knowledge ontology, our algorithm can perform challenging image classification and annotation tasks more accurately.
Unified analysis of streaming news
In WWW, 2011
Cited by 5 (2 self)
News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of dealing with dynamic data to cluster, interpret, and temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate the efficiency and accuracy on the publicly available TDT dataset and data from a major internet news site.
Interactive Topic Modeling
Cited by 4 (0 self)
Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
Word Features for Latent Dirichlet Allocation
Cited by 4 (0 self)
We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model.
Automatic Labelling of Topic Models
Cited by 3 (0 self)
We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
Optimizing Semantic Coherence in Topic Models
Cited by 3 (0 self)
Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) an analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
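The abstract above does not define the evaluation metric, so the following is only a rough sketch of one way a coherence score can be computed from the training documents alone. It assumes a co-document-frequency score: for each pair of a topic's top words, add log((D(w_m, w_l) + 1) / D(w_l)), where D counts the documents containing the given words. The function name and arguments are hypothetical.

```python
import math


def topic_coherence(top_words, documents):
    """Score a topic from document co-occurrence counts (assumed metric).

    top_words -- the topic's top words, ordered most to least probable
    documents -- training corpus as a list of token lists
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word never occurs in the corpus; skip the pair
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / denom)
    return score


# Usage: three tiny documents; "a" and "b" co-occur more often than "a" and "c".
docs = [["a", "b"], ["a", "b"], ["a", "c"]]
score_ab = topic_coherence(["a", "b"], docs)
score_ac = topic_coherence(["a", "c"], docs)
```

Scores closer to zero indicate top words that tend to appear in the same documents; strongly negative scores flag topics whose top words rarely co-occur.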

