Improving Topic Models with Latent Feature Word Representations
Transactions of the Association for Computational Linguistics, 2015
Cited by 4 (2 self)
Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
Supervised topic models with word order . . .
2015
Cited by 1 (1 self)
One limitation of most existing probabilistic latent topic models for document classification is that the topic model itself does not consider useful side-information, namely, class labels of documents. Topic models that do consider this side-information, popularly known as supervised topic models, do not consider the word order structure in documents. One of the motivations behind considering the word order structure is to capture the semantic fabric of the document. We investigate a low-dimensional latent topic model for document classification. Class label information and word order structure are integrated into a supervised topic model, enabling a more effective interaction among such information for solving document classification. We derive a
A Survey on the Use of Topic Models when Mining Software Repositories
Empirical Software Engineering (manuscript)
... software development by mining and analyzing software repositories. Since the majority of the software engineering data is unstructured, researchers have applied Information Retrieval (IR) techniques to help software development. The recent advances of IR, especially statistical topic models, have helped make sense of unstructured data in software repositories even more. However, even though there are hundreds of studies on applying topic models to software repositories, there is no study that shows how the models are used in the software engineering research community, and which software engineering tasks are being supported through topic models. Moreover, since the performance of these topic models is directly related to the model parameters and usage, knowing how researchers use the topic models may also help future studies make optimal use of such models. Thus, we surveyed 167 articles from the software engineering literature that make use of topic models. We find that i) most studies centre around a limited number of software engineering tasks; ii) most studies use only basic topic models; and iii) researchers usually treat topic models as black boxes without fully exploring their underlying assumptions and parameter values. Our paper provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how to best apply topic models to a particular software engineering task.
Short and Sparse Text Topic Modeling via Self-Aggregation
The overwhelming amount of short text data on social media and elsewhere has posed great challenges to topic modeling due to the sparsity problem. Most existing attempts to alleviate this problem resort to heuristic strategies to aggregate short texts into pseudo-documents before the application of standard topic modeling. Although such strategies cannot be well generalized to more general genres of short texts, the success has shed light on how to develop a generalized solution. In this paper, we present a novel model towards this goal by integrating topic modeling with short text aggregation during topic inference. The aggregation is founded on general topical affinity of texts rather than particular heuristics, making the model readily applicable to various short texts. Experimental results on real-world datasets validate the effectiveness of this new model, suggesting that it can distill more meaningful topics from short texts.
Abstract Venue Concept Detection from Location-Based Social Networks
We investigate a new graphical model that can generate latent abstract concepts of venues, or Points of Interest (POIs), by exploiting text data in venue profiles obtained from location-based social networks (LBSNs). Our model offers tailor-made modeling for two different types of text data that commonly appear in venue profiles, namely, tags and comments. Such modeling can effectively exploit their different characteristics. Meanwhile, the modeling of these two parts is tied together in a coordinated manner. Experimental results show that our model can generate better abstract venue concepts than comparative models.
Improving Topic Coherence with Latent Feature Word Representations in MAP Estimation for Topic Modeling
Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature word vectors have been used to obtain high performance in many natural language processing (NLP) tasks. In this paper, we present a new approach by incorporating word vectors to directly optimize the maximum a posteriori (MAP) estimation in a topic model. Preliminary results show that the word vectors induced from the experimental corpus can be used to improve the assignments of topics to words.
Twitter-Network Topic Model: A Full Bayesian Treatment for Social Network and Text Modeling
Twitter data is extremely noisy – each tweet is short, unstructured and with informal language, a challenge for current topic modeling. On the other hand, tweets are accompanied by extra information such as authorship, hashtags and the user-follower network. Exploiting this additional information, we propose the Twitter-Network (TN) topic model to jointly model the text and the social network in a full Bayesian nonparametric way. The TN topic model employs the hierarchical Poisson-Dirichlet processes (PDP) for text modeling and a Gaussian process random function model for social network modeling. We show that the TN topic model significantly outperforms several existing nonparametric models due to its flexibility. Moreover, the TN topic model enables additional informative inference such as authors' interests, hashtag analysis, as well as leading to further applications such as author recommendation, automatic topic labeling and hashtag suggestion. Note our general inference framework can readily be applied to other topic models with embedded PDP nodes.