Results 1 - 10 of 80

Sparse stochastic inference for latent Dirichlet allocation
In International Conference on Machine Learning, 2012
Cited by 43 (4 self)
Abstract: We present a hybrid algorithm for Bayesian topic models that combines the efficiency of sparse Gibbs sampling with the scalability of online stochastic inference. We used our algorithm to analyze a corpus of 1.2 million books (33 billion words) with thousands of topics. Our approach reduces the bias of variational inference and generalizes to many Bayesian hidden-variable models.

Termite: Visualization techniques for assessing textual topic models
In Proceedings of the International Working Conference on Advanced Visual Interfaces, 2012
Cited by 26 (3 self)
Abstract: Topic models aid analysis of text corpora by identifying latent topics based on co-occurring words. Real-world deployments of topic models, however, often require intensive expert verification and model refinement. In this paper we present Termite, a visual analysis tool for assessing topic model quality. Termite uses a tabular layout to promote comparison of terms both within and across latent topics. We contribute a novel saliency measure for selecting relevant terms and a seriation algorithm that both reveals clustering structure and promotes the legibility of related terms. In a series of examples, we demonstrate how Termite allows analysts to identify coherent and significant themes.

Exploring Topic Coherence over many models and many topics
Cited by 22 (0 self)
Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allow for comparing complete topic models. We further compare the automated measures to other metrics for topic models: comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics, while LSA is best at creating a compact semantic representation of documents and words in a corpus.

Improving topic coherence with regularized topic models
In Proc. of NIPS, 2011
Cited by 20 (4 self)
Abstract: Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflects broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.

Unsupervised graph-based topic labelling using DBpedia
In Proceedings of the 6th ACM International Conference on Web Search and Data Mining, WSDM '13, 2013

Evaluating Topic Coherence Using Distributional Semantics
Cited by 12 (3 self)
Abstract: This paper introduces distributional semantic similarity methods for automatically measuring the coherence of a set of words generated by a topic model. We construct a semantic space to represent each topic word by making use of Wikipedia as a reference corpus to identify context features and collect frequencies. Relatedness between topic words and context features is measured using variants of Pointwise Mutual Information (PMI). Topic coherence is determined by measuring the distance between these vectors computed using a variety of metrics. Evaluation on three data sets shows that the distributional-based measures outperform the state-of-the-art approach for this task.
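
As an aside on the method above: the simplest PMI-style coherence it builds on can be sketched as the average pairwise PMI of a topic's top words, estimated from document co-occurrence counts in a reference corpus such as Wikipedia. This is a minimal illustration only; the function and argument names (`pmi_coherence`, `doc_freq`, `pair_freq`) are hypothetical, not taken from the paper.

```python
import math
from itertools import combinations

def pmi_coherence(topic_words, doc_freq, pair_freq, n_docs, eps=1e-12):
    """Average pairwise PMI over the top words of a topic.

    doc_freq:  word -> number of reference documents containing it
    pair_freq: frozenset({w1, w2}) -> number of documents containing both
    n_docs:    total number of reference documents
    """
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1 = doc_freq.get(w1, 0) / n_docs
        p2 = doc_freq.get(w2, 0) / n_docs
        p12 = pair_freq.get(frozenset((w1, w2)), 0) / n_docs
        # PMI = log P(w1, w2) / (P(w1) P(w2)); eps avoids log(0)
        scores.append(math.log((p12 + eps) / (p1 * p2 + eps)))
    return sum(scores) / len(scores)
```

Words that co-occur no more often than chance score near zero, and independent or anti-correlated word pairs drag the topic's average down, which is the intuition behind PMI-based coherence.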
Validating Estimates of Latent Traits From Textual Data Using Human Judgment as a Benchmark
2012
Cited by 10 (5 self)
Abstract: Automated and statistical methods for estimating latent political traits and classes from textual data hold great promise, since virtually every political act involves the production of text. Statistical models of natural language features, however, are heavily laden with unrealistic assumptions about the process that generates this data, including the stochastic process of text generation, the functional link between political variables and observed text, and the nature of the variables (and dimensions) on which observed text should be conditioned. While acknowledging statistical models of latent traits to be "wrong", political scientists nonetheless treat their results as sufficiently valid to be useful. In this paper, we address the issue of substantive validity in the face of potential model failure, in the context of unsupervised scaling methods of latent traits. We critically examine one popular parametric measurement model of latent traits for text and then compare its results to systematic human judgments of the texts as a benchmark for validity.

Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment (Supplementary Materials)
Cited by 9 (1 self)
Abstract: the main paper with additional details in the caption. Supplementary Figure 3 shows additional data points for Figure 7 in the main paper. 2. Expert-Authored Concepts in Information Visualization: We conducted a survey asking ten experienced information visualization (InfoVis) researchers to identify what they consider to be significant and coherent areas of research in their field. Participants were asked to label each area, and describe it with lists of exemplary terms and documents. We focused on InfoVis research due to relevance, scope and familiarity. Analysis of academic publications is one of the common real-world uses of topic modeling.

On-line Trend Analysis with Topic Models
Cited by 9 (0 self)
Abstract: We present a novel topic modelling-based methodology to track emerging events in microblogs such as Twitter. Our topic model has an in-built update mechanism based on time slices and implements a dynamic vocabulary. We first show that the method is robust in detecting events using a range of datasets with injected novel events, and then demonstrate its application in identifying trending topics in Twitter.

Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality
In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), 2014
Cited by 9 (0 self)
Abstract: Topic models based on latent Dirichlet allocation and related methods are used in a range of user-focused tasks including document navigation and trend analysis, but evaluation of the intrinsic quality of the topic model and topics remains an open research area. In this work, we explore the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provide recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.
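
In its simplest form, the whole-model evaluation task mentioned above reduces to aggregating per-topic coherence scores into model-level summaries that can be compared across models. The sketch below is an illustrative assumption, not the paper's actual method or toolkit; the function name and choice of aggregates are hypothetical.

```python
import statistics

def model_coherence(per_topic_scores):
    """Aggregate per-topic coherence scores into model-level summaries.

    per_topic_scores: one coherence score per topic, from any
    single-topic measure (e.g. a PMI-based one).
    """
    return {
        "mean": statistics.mean(per_topic_scores),
        "median": statistics.median(per_topic_scores),
    }
```

Reporting both mean and median is one simple way to compare complete models while remaining robust to a few degenerate "junk" topics that would otherwise skew the average.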