Interactive Topic Modeling
Abstract
Cited by 33 (8 self)
Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework that allows users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
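The "selectively removing elements in the state of a Markov chain" step can be sketched as follows. This is a minimal illustration rather than the paper's exact procedure: the data layout and the `resample_token` callback (standing in for a collapsed Gibbs draw) are assumptions.

```python
def ablate_and_resample(z, docs, constraint, resample_token):
    """Invalidate the topic assignments of every token whose word belongs to
    a newly added constraint set, then resample only those tokens.

    z: per-document lists of topic assignments (parallel to docs)
    docs: per-document lists of word tokens
    constraint: set of words the user grouped together
    resample_token: callback (doc_idx, pos, word) -> new topic, e.g. a
        collapsed Gibbs draw conditioned on the remaining state
    """
    # Step 1: remove the affected assignments from the Markov chain state.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            if w in constraint:
                z[d][i] = None
    # Step 2: resample only the ablated tokens; untouched assignments
    # persist, so the chain resumes from a state consistent with the
    # new constraint rather than restarting from scratch.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            if z[d][i] is None:
                z[d][i] = resample_token(d, i, w)
    return z
```

Keeping the unconstrained assignments intact is what makes the refinement interactive: only the part of the state touched by the new constraint is re-inferred.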
A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic
 In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011
Abstract
Cited by 20 (2 self)
Topic models have been used successfully for a variety of problems, often in the form of application-specific extensions of the basic Latent Dirichlet Allocation (LDA) model. Because deriving these new models in order to encode domain knowledge can be difficult and time-consuming, we propose the Fold·all model, which allows the user to specify general domain knowledge in First-Order Logic (FOL). However, combining topic modeling with FOL can result in inference problems beyond the capabilities of existing techniques. We have therefore developed a scalable inference technique using stochastic gradient descent which may also be useful to the Markov Logic Network (MLN) research community. Experiments demonstrate the expressive power of Fold·all, as well as the scalability of our proposed inference method.
Reducing the Sampling Complexity of Topic Models
Abstract
Cited by 9 (0 self)
Inference in topic models typically involves a sampling step to associate latent variables with observations. Unfortunately the generative model loses sparsity as the amount of data increases, requiring O(k) operations per word for k topics. In this paper we propose an algorithm which scales linearly with the number of actually instantiated topics k_d in the document. For large document collections and in structured hierarchical models k_d ≪ k. This yields an order of magnitude speedup. Our method applies to a wide variety of statistical models such as the PDP [16, 4] and HDP [19]. At its core is the idea that dense, slowly changing distributions can be approximated efficiently by the combination of a Metropolis-Hastings step, use of sparsity, and amortized constant time sampling via Walker's alias method.
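Walker's alias method mentioned above admits a compact implementation. The sketch below is a generic version of the alias-table construction and O(1) draw; it is only the building block the paper relies on, not its full amortized Metropolis-Hastings sampler.

```python
import random

def build_alias(probs):
    """Build Walker's alias table in O(k) for a distribution over k topics."""
    k = len(probs)
    scaled = [p * k for p in probs]          # average cell mass becomes 1
    alias = [0] * k
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                         # cell s overflows into topic l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    return scaled, alias

def alias_draw(scaled, alias, rng=random):
    """Draw one topic in O(1): pick a cell uniformly, then flip a biased coin
    between the cell's own topic and its alias."""
    i = rng.randrange(len(scaled))
    return i if rng.random() < scaled[i] else alias[i]
```

Once the table is built, every subsequent draw costs constant time regardless of k, which is what makes amortizing the construction over many samples worthwhile for slowly changing distributions.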
Tracking Trends: Incorporating Term Volume into Temporal Topic Models
Abstract
Cited by 9 (1 self)
Text corpora with documents from a range of time epochs are natural and ubiquitous in many fields, such as research papers, newspaper articles and a variety of recently emerged types of social media. People would not only like to know what kinds of topics can be found in these data sources but also wish to understand the temporal dynamics of these topics and predict certain properties of terms or documents in the future. Topic models are usually utilized to find latent topics in text collections, and recently have been applied to temporal text corpora. However, most proposed models are general-purpose models to which no real tasks are explicitly associated. Therefore, current models may be difficult to apply in real-world applications, such as the problems of ...
Differential topic models
 In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014
Abstract
Cited by 4 (1 self)
In applications we may want to compare different document collections: they could have shared content but also different and unique aspects in particular collections. This task has been called comparative text mining or cross-collection modeling. We present a differential topic model for this application that models both topic differences and similarities. For this we use hierarchical Bayesian nonparametric models. Moreover, we found it was important to properly model power-law phenomena in topic-word distributions, and thus we used the full Pitman-Yor process rather than just a Dirichlet process. Furthermore, we propose the transformed Pitman-Yor process (TPYP) to incorporate prior knowledge, such as vocabulary variations in different collections, into the model. To deal with the non-conjugacy between the model prior and the likelihood in the TPYP, we propose an efficient sampling algorithm using a data augmentation technique based on the multinomial theorem. Experimental results show the model discovers interesting aspects of different collections. We also show the proposed MCMC-based algorithm achieves a dramatically reduced test perplexity compared to some existing topic models. Finally, we show our model outperforms the state-of-the-art for document classification/ideology prediction on a number of text collections. Index Terms—Differential topic model, transformed Pitman-Yor process, MCMC, data augmentation
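The power-law behaviour the authors obtain from the Pitman-Yor process can be seen directly in its two-parameter Chinese-restaurant seating scheme, sketched below. This is a generic PY sampler for intuition only, not the paper's TPYP or its data-augmentation inference; the parameter names follow the common (discount, concentration) convention.

```python
import random

def pitman_yor_crp(n, discount, concentration, seed=0):
    """Seat n customers under PY(discount, concentration) and return the
    table occupancy counts. With discount > 0 the number of tables grows
    as a power of n, mirroring power-law topic-word frequencies."""
    rng = random.Random(seed)
    counts = []          # customers per table
    total = 0
    for _ in range(n):
        # P(join existing table i) ∝ counts[i] - discount
        # P(open a new table)      ∝ concentration + discount * len(counts)
        r = rng.random() * (total + concentration)
        for i, c in enumerate(counts):
            r -= c - discount
            if r < 0:
                counts[i] += 1
                break
        else:
            counts.append(1)
        total += 1
    return counts
```

With discount = 0 the scheme reduces to the ordinary Dirichlet-process CRP, which is why the abstract contrasts the "full" Pitman-Yor process against just a Dirichlet process.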
Extracting Mobile Behavioral Patterns with the Distant N-Gram Topic Model
Abstract
Cited by 4 (1 self)
Mining patterns of human behavior from large-scale mobile phone data has the potential to explain certain phenomena in society. The study of such human-centric massive datasets requires new mathematical models. In this paper, we propose a probabilistic topic model that we call the distant n-gram topic model (DNTM) to address the problem of learning long-duration human location sequences. The DNTM is based on Latent Dirichlet Allocation (LDA). We define the generative process for the model, derive the inference procedure, and evaluate our model on real mobile data. We consider two different real-life human datasets collected from mobile phone locations, the first based on GPS coordinates and the second on cell tower connections. The DNTM successfully discovers topics on both datasets. Finally, the DNTM is compared to LDA in terms of log-likelihood on unseen data, demonstrating the predictive power of the model. We find that the DNTM consistently outperforms LDA as the sequence length increases.
Conditional topical coding: An efficient topic model conditioned on rich features
 In KDD, 2011
Abstract
Cited by 3 (1 self)
Probabilistic topic models have shown remarkable success in many application domains. However, a probabilistic conditional topic model can be extremely inefficient when considering a rich set of features because it needs to define a normalized distribution, which usually involves a hard-to-compute partition function. This paper presents conditional topical coding (CTC), a novel non-probabilistic formulation of conditional topic models. CTC relaxes the normalization constraints of probabilistic models and learns non-negative document codes and word codes. CTC does not need to define a normalized distribution and can efficiently incorporate a rich set of features for improved topic discovery and prediction tasks. Moreover, CTC can directly control the sparsity of inferred representations by using appropriate regularization. We develop an efficient and easy-to-implement coordinate descent learning algorithm, in which each coding substep has a closed-form solution. Finally, we demonstrate the advantages of CTC on online review analysis datasets. Our results show that conditional topical coding can achieve state-of-the-art prediction performance and is much more efficient in training (one order of magnitude faster) and testing (two orders of magnitude faster) than probabilistic conditional topic models.
Mining topics in documents: standing on the shoulders of big data
 In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014
Abstract
Cited by 2 (0 self)
Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics for generating coherent topics, and in practice many document collections do not contain that many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-links (meaning that two words should be in the same topic) and cannot-links (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
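The must-link mining step can be sketched as follows: word pairs that co-occur among the top words of topics learned in many past domains become must-link candidates. The raw support-count threshold is an illustrative simplification of the paper's reliability checks (which additionally handle wrong knowledge and transitivity).

```python
from collections import Counter
from itertools import combinations

def mine_must_links(past_topics, min_support=2):
    """past_topics: top-word lists of topics learned in previous domains.
    A word pair becomes a must-link candidate if it appears together in
    at least min_support past topics (illustrative thresholding)."""
    pair_counts = Counter()
    for top_words in past_topics:
        # Canonicalize each pair so ("a", "b") and ("b", "a") are counted once.
        for pair in combinations(sorted(set(top_words)), 2):
            pair_counts[pair] += 1
    return {pair for pair, c in pair_counts.items() if c >= min_support}
```

Requiring support across multiple past domains is what filters out pairs that co-occurred in a single topic by chance.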
An Extension of Topic Models for Text Classification: a Term Weighting Approach
Abstract
Text classification has become a critical step in big data analytics. For supervised machine learning approaches to text classification, the availability of sufficient training data with classification labels attached to individual text units is essential to performance. Since labeled data are usually scarce, however, it is desirable to devise a semi-supervised method in which unlabeled data are used in addition to labeled ones. One solution is to apply a latent factor model to generate clustered text features and use them for text classification. The main thrust of the current research is to extend Latent Dirichlet Allocation (LDA) for this purpose by considering word weights in sampling and maintaining balanced topic distributions. A series of experiments was conducted to evaluate the proposed method on classification tasks. The results show that the topic distributions generated by the balance-weighted topic modeling method add discriminative power to feature generation for classification.
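The word-weighting idea can be illustrated by replacing the unit count increments of a standard collapsed Gibbs sampler with per-word weights. The IDF-style weight below is one plausible choice, not necessarily the paper's exact scheme, and the count-table layout is an assumption for illustration.

```python
import math
from collections import defaultdict

def idf_weights(docs):
    """Assign each word an IDF-style weight, so frequent, low-information
    words contribute less to the topic-word statistics (illustrative choice)."""
    n = len(docs)
    df = defaultdict(int)
    for doc in docs:
        for w in set(doc):
            df[w] += 1
    return {w: math.log(n / df[w]) for w in df}

def weighted_assign(n_kw, n_k, topic, word, weights):
    """In a weighted Gibbs sweep, assigning `word` to `topic` adds the word's
    weight instead of 1 to the topic-word count and the topic total."""
    w = weights[word]
    n_kw[(topic, word)] = n_kw.get((topic, word), 0.0) + w
    n_k[topic] = n_k.get(topic, 0.0) + w
```

Because stop-word-like terms that occur in every document receive weight zero, they stop dominating the topic-word distributions, which is the discriminative effect the abstract describes.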