Results 1 - 10 of 78
Composition in distributional models of semantics
2010
"... Distributional models of semantics have proven themselves invaluable both in cog-nitive modelling of semantic phenomena and also in practical applications. For ex-ample, they have been used to model judgments of semantic similarity (McDonald, 2000) and association (Denhire and Lemaire, 2004; Griffit ..."
Abstract
-
Cited by 148 (3 self)
Distributional models of semantics have proven themselves invaluable both in cognitive modelling of semantic phenomena and also in practical applications. For example, they have been used to model judgments of semantic similarity (McDonald, 2000) and association (Denhière and Lemaire, 2004; Griffiths et al., 2007) and have been shown to achieve human-level performance on synonymy tests (Landauer and Dumais, 1997; Griffiths et al., 2007) such as those included in the Test of English as a Foreign Language (TOEFL). This ability has been put to practical use in automatic thesaurus extraction (Grefenstette, 1994). However, while there has been a considerable amount of research directed at the most effective ways of constructing representations for individual words, the representation of larger constructions, e.g., phrases and sentences, has received relatively little attention. In this thesis we examine this issue of how to compose meanings within distributional models of semantics to form representations of multi-word structures. Natural language data typically consists of such complex structures, rather than ...
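The question this thesis studies, how to build a phrase representation from word vectors, can be illustrated with two of the simplest composition functions examined in this line of work: vector addition and point-wise multiplication. The sketch below uses tiny invented co-occurrence vectors, not the thesis's corpora or models.

```python
import numpy as np

# Hypothetical co-occurrence counts over a tiny context vocabulary
# (contexts: run, legs, fur, engine, road) -- invented for illustration.
vectors = {
    "dog":   np.array([5.0, 9.0, 8.0, 0.0, 1.0]),
    "horse": np.array([7.0, 8.0, 6.0, 0.0, 2.0]),
    "car":   np.array([6.0, 0.0, 0.0, 9.0, 8.0]),
    "fast":  np.array([8.0, 2.0, 0.0, 5.0, 6.0]),
}

def compose_additive(u, v):
    # phrase vector as the sum of its parts
    return u + v

def compose_multiplicative(u, v):
    # element-wise product, which emphasises contexts shared by both words
    return u * v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for name, compose in [("additive", compose_additive), ("multiplicative", compose_multiplicative)]:
    fast_dog = compose(vectors["fast"], vectors["dog"])
    fast_car = compose(vectors["fast"], vectors["car"])
    print(f"{name:15s} sim(fast dog, horse)={cosine(fast_dog, vectors['horse']):.3f} "
          f"sim(fast car, horse)={cosine(fast_car, vectors['horse']):.3f}")
```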
Empirical Study of Topic Modeling in Twitter
- PROCEEDINGS OF THE SIGKDD WORKSHOP ON SOCIAL MEDIA ANALYTICS (SOMA)
2010
"... Social networks such as Facebook, LinkedIn, and Twitter have been a crucial source of information for a wide spectrum of users. In Twitter, popular information that is deemed important by the community propagates through the network. Studying the characteristics of content in the messages becomes im ..."
Abstract
-
Cited by 78 (1 self)
Social networks such as Facebook, LinkedIn, and Twitter have been a crucial source of information for a wide spectrum of users. In Twitter, popular information that is deemed important by the community propagates through the network. Studying the characteristics of content in the messages becomes important for a number of tasks, such as breaking news detection, personalized message recommendation, friend recommendation, sentiment analysis, and others. While many researchers wish to use standard text mining tools to understand messages on Twitter, the restricted length of those messages prevents them from being employed to their full potential. We address the problem of using standard topic models in microblogging environments by studying how the models can be trained on the dataset. We propose several schemes to train a standard topic model and compare their quality and effectiveness through a set of carefully designed experiments from both qualitative and quantitative perspectives. We show that by training a topic model on aggregated messages we can obtain a higher-quality learned model, which results in significantly better performance in two real-world classification problems. We also discuss how the state-of-the-art Author-Topic model fails to model hierarchical relationships between entities in social media.
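One aggregation scheme of the kind the paper compares is pooling each user's tweets into a single longer pseudo-document before training the topic model. A minimal sketch of that idea, with invented tweets and off-the-shelf scikit-learn LDA rather than the paper's own setup:

```python
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented tweets; "user" and "text" are illustrative field names.
tweets = [
    {"user": "alice", "text": "new phone camera is amazing"},
    {"user": "alice", "text": "battery life on this phone could be better"},
    {"user": "bob",   "text": "great goal in the football match tonight"},
    {"user": "bob",   "text": "what a save by the keeper"},
]

# Pool each user's tweets into one pseudo-document instead of treating
# every short tweet as its own training document.
pooled = defaultdict(list)
for t in tweets:
    pooled[t["user"]].append(t["text"])
docs = [" ".join(texts) for texts in pooled.values()]

X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-pseudo-document topic mixtures
```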
Exploiting internal and external semantics for the clustering of short texts using world knowledge
- In Proceedings of the 18th ACM Conference on Information and Knowledge Management
2009
"... Clustering of short texts, such as snippets, presents great challenges in existing aggregated search techniques due to the problem of data sparseness and the complex semantics of natural language. As short texts do not provide sufficient term co-occurrence information, traditional text representatio ..."
Abstract
-
Cited by 43 (11 self)
Clustering of short texts, such as snippets, presents great challenges for existing aggregated search techniques due to the problem of data sparseness and the complex semantics of natural language. As short texts do not provide sufficient term co-occurrence information, traditional text representation methods, such as the “bag of words” model, have several limitations when directly applied to short text tasks. In this paper, we propose a novel framework to improve the performance of short text clustering by exploiting the internal semantics of the original text and external concepts from world knowledge. The proposed method employs a hierarchical three-level structure to tackle the data sparsity problem of original short texts and reconstructs the corresponding feature space with the integration of multiple semantic knowledge bases – Wikipedia and WordNet. Empirical evaluation on Reuters and a real web dataset demonstrates that our approach achieves significant improvement compared to state-of-the-art methods.
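A rough sketch of the enrichment idea only, not the paper's three-level framework: each snippet is expanded with concepts looked up in an external resource before clustering. The concept map, snippets, and clustering setup below are all invented stand-ins for the Wikipedia/WordNet integration described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Tiny hand-made stand-in for Wikipedia/WordNet concept lookups.
concept_map = {
    "jaguar": ["animal", "car_brand"],
    "python": ["snake", "programming_language"],
    "pandas": ["animal", "software_library"],
}

snippets = [
    "jaguar speed in the wild",
    "python list comprehension syntax",
    "pandas dataframe merge example",
]

def enrich(text):
    # append external concepts for any token found in the concept map
    extra = [c for tok in text.split() for c in concept_map.get(tok, [])]
    return text + " " + " ".join(extra)

X = TfidfVectorizer().fit_transform([enrich(s) for s in snippets])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```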
Transferring Topical Knowledge from Auxiliary Long Texts for Short Text Clustering
"... With the rapid growth of social Web applications such as Twitter and online advertisements, the task of understanding short texts is becoming more and more important. Most traditional text mining techniques are designed to handle long text documents. For short text messages, many of the existing tec ..."
Abstract
-
Cited by 20 (0 self)
With the rapid growth of social Web applications such as Twitter and online advertisements, the task of understanding short texts is becoming more and more important. Most traditional text mining techniques are designed to handle long text documents. For short text messages, many of the existing techniques are not effective due to the sparseness of text representations. To understand short messages, we observe that it is often possible to find topically related long texts, which can be utilized as auxiliary data when mining the target short text data. In this article, we present a novel approach to cluster short text messages via transfer learning from auxiliary long text data. We show that while some previous work exists on enhancing short text clustering with related long texts, most of it ignores the semantic and topical inconsistencies between the target and auxiliary data, which may hurt clustering performance on the short texts. To accommodate the possible inconsistencies between source and target data, we propose a novel topic model, Dual Latent Dirichlet Allocation (DLDA), which jointly learns two sets of topics on short and long texts and couples the topic parameters to cope with the potential inconsistencies between data sets. We demonstrate through large-scale clustering experiments on both advertisements and Twitter data that we obtain superior performance over several state-of-the-art techniques for clustering short text documents.
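A much-simplified stand-in for the transfer idea, not the coupled DLDA model itself: topics are learned on auxiliary long texts, the short texts are projected into that topic space, and the resulting mixtures are clustered. All documents and parameters below are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

long_docs = [
    "the match ended with a late goal and the league table tightened",
    "the new handset ships with a larger battery and a faster camera sensor",
]
short_docs = ["great goal tonight", "camera on the new handset", "battery drains too quickly"]

# Learn vocabulary and topics from the auxiliary long texts only.
vec = CountVectorizer(stop_words="english").fit(long_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(vec.transform(long_docs))

# Project short texts into the learned topic space and cluster the mixtures.
theta_short = lda.transform(vec.transform(short_docs))
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(theta_short))
```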
Entity disambiguation with hierarchical topic models
- In the SIGKDD International Conference on Knowledge Discovery and Data Mining
2011
"... Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learn ..."
Abstract
-
Cited by 15 (0 self)
Disambiguating entity references by annotating them with unique IDs from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called the Wikipedia-based Pachinko Allocation Model (WPAM) that exploits: (1) all words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia’s category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia’s crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy.
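A loose illustration of topic-model-based disambiguation, far simpler than WPAM: candidate entities are scored by how well a mention's context matches the topic mixture of each entity's page. The toy pages are invented, and on data this small the learned topics are noisy; the sketch only shows the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented entity "pages" standing in for Wikipedia articles.
entity_pages = {
    "Jaguar_(animal)": "the jaguar is a large cat native to the americas that hunts in forests and rivers",
    "Jaguar_Cars": "jaguar cars is a british manufacturer of luxury sedans and sports cars",
}
mention_context = "saw a jaguar stalking prey near the river in the forest"

pages = list(entity_pages.values())
vec = CountVectorizer(stop_words="english").fit(pages)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(vec.transform(pages))

def topic_mix(text):
    # topic mixture of a text under the learned model
    return lda.transform(vec.transform([text]))[0]

ctx = topic_mix(mention_context)
scores = {name: float(ctx @ topic_mix(page)) for name, page in entity_pages.items()}
print(max(scores, key=scores.get), scores)
```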
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
"... The huge amount of knowledge in web communities has motivated the research interests in threaded discussions. The dynamic nature of threaded discussions poses lots of challenging problems for computer scientists. Although techniques such as semantic models and structural models have been shown to be ..."
Abstract
-
Cited by 14 (0 self)
The huge amount of knowledge in web communities has motivated research interest in threaded discussions. The dynamic nature of threaded discussions poses many challenging problems for computer scientists. Although techniques such as semantic models and structural models have been shown to be useful in a number of areas, they are inefficient at understanding threaded discussions for three reasons: (I) as most users read existing messages before posting, posts in a discussion thread are temporally dependent on the previous ones, which causes the semantics and structure to be coupled with each other in threaded discussions; (II) online discussion threads contain many junk posts which are useless and may disturb content analysis; and (III) it is very hard to judge the quality of a post. In this paper, ...
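A hedged sketch of the sparse-coding ingredient alone (the paper's model additionally deals with thread structure and post quality): each post is represented as a sparse combination of learned term-space basis vectors, here using scikit-learn's DictionaryLearning as a stand-in. The posts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import DictionaryLearning

posts = [
    "how do i install the driver on linux",
    "the driver install fails on my linux laptop",
    "try updating the kernel before installing the driver",
    "lol first post",  # junk post of the kind the paper wants to isolate
]

# TF-IDF term vectors, densified because DictionaryLearning expects dense input.
X = TfidfVectorizer().fit_transform(posts).toarray()

dl = DictionaryLearning(n_components=3, transform_algorithm="lasso_lars",
                        transform_alpha=0.1, random_state=0)
codes = dl.fit_transform(X)  # one sparse code (row) per post

# count of active basis vectors in each post's code
print((abs(codes) > 1e-6).sum(axis=1))
```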
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold
"... Abstract. Sentiment analysis over Twitter offers organisations and individuals a fast and effective way to monitor the publics ’ feelings towards them and their competitors. To assess the performance of sentiment analysis methods over Twitter a small set of evaluation datasets have been released in ..."
Abstract
-
Cited by 13 (4 self)
Sentiment analysis over Twitter offers organisations and individuals a fast and effective way to monitor the public’s feelings towards them and their competitors. To assess the performance of sentiment analysis methods over Twitter, a small number of evaluation datasets have been released in the last few years. In this paper we present an overview of eight publicly available and manually annotated evaluation datasets for Twitter sentiment analysis. Based on this review, we show that a common limitation of most of these datasets, when assessing sentiment analysis at the target (entity) level, is the lack of distinct sentiment annotations for the tweets and the entities contained in them. For example, the tweet “I love iPhone, but I hate iPad” can be annotated with a mixed sentiment label, but the entity iPhone within this tweet should be annotated with a positive sentiment label. Aiming to overcome this limitation, and to complement current evaluation datasets, we present STS-Gold, a new evaluation dataset where tweets and targets (entities) are annotated individually and therefore may present different sentiment labels. This paper also provides a comparative study of the various datasets along several dimensions, including total number of tweets, vocabulary size and sparsity. We also investigate the pair-wise correlations among these dimensions as well as their correlations to sentiment classification performance on the different datasets.
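The dataset dimensions the survey compares (number of tweets, vocabulary size, sparsity) are simple to compute; a small sketch with invented toy "datasets", not the actual eight surveyed corpora or STS-Gold:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy datasets; real use would load the surveyed corpora.
datasets = {
    "toy_A": ["i love this phone", "battery life is terrible", "great screen though"],
    "toy_B": ["the match was boring", "what a goal", "referee ruined the game", "great save"],
}

for name, tweets in datasets.items():
    X = CountVectorizer().fit_transform(tweets)
    n_tweets, vocab_size = X.shape
    # sparsity: fraction of zero entries in the tweet-term count matrix
    sparsity = 1.0 - X.nnz / (n_tweets * vocab_size)
    print(f"{name}: tweets={n_tweets} vocab={vocab_size} sparsity={sparsity:.3f}")
```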
SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation
"... We present SimLex-999, a gold standard re-source for evaluating distributional semantic models that improves on existing resources in several important ways. First, in contrast to gold standards such as WordSim-353 and MEN, it explicitly quantifies similarity rather than association or relatedness s ..."
Abstract
-
Cited by 9 (3 self)
We present SimLex-999, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways. First, in contrast to gold standards such as WordSim-353 and MEN, it explicitly quantifies similarity rather than association or relatedness, so that pairs of entities that are associated but not actually similar (Freud, psychology) have a low rating. We show that, via this focus on similarity, SimLex-999 incentivizes the development of models with a different, and arguably wider, range of applications than those which reflect conceptual association. Second, SimLex-999 contains a range of concrete and abstract adjective, noun and verb pairs, together with an independent rating of concreteness and (free) association strength for each pair. This diversity enables fine-grained analyses of the performance of models on concepts of different types, and consequently greater insight into how architectures can be improved. Further, unlike existing gold standard evaluations, for which automatic approaches have reached or surpassed the inter-annotator agreement ceiling, state-of-the-art models perform well below this ceiling on SimLex-999. There is therefore plenty of scope for SimLex-999 to quantify future improvements to distributional semantic models, guiding the development of the next generation of representation-learning architectures.
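Models are conventionally scored against resources like SimLex-999 by the Spearman correlation between the model's pair similarities and the human ratings. A minimal sketch with invented word vectors and ratings; a real evaluation would load the SimLex-999 file instead.

```python
import numpy as np
from scipy.stats import spearmanr

# Invented word vectors and ratings, for illustration only.
word_vectors = {
    "freud":      np.array([0.9, 0.1, 0.3]),
    "psychology": np.array([0.8, 0.2, 0.4]),
    "coast":      np.array([0.1, 0.9, 0.2]),
    "shore":      np.array([0.2, 0.8, 0.3]),
    "clothes":    np.array([0.3, 0.4, 0.9]),
    "closet":     np.array([0.5, 0.3, 0.8]),
}
# (word1, word2, human similarity rating on SimLex's 0-10 scale) -- made-up ratings
pairs = [("freud", "psychology", 1.5), ("coast", "shore", 8.9), ("clothes", "closet", 2.0)]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(word_vectors[a], word_vectors[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print("Spearman correlation with human similarity judgments:", round(float(rho), 3))
```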
NIFTY: A System for Large Scale Information Flow Tracking and Clustering
"... The real-time information on news sites, blogs and social networking sites changes dynamically and spreads rapidly through the Web. Developing methods for handling such information at a massive scale requires that we think about how information content varies over time, how it is transmitted, and ho ..."
Abstract
-
Cited by 8 (2 self)
The real-time information on news sites, blogs and social networking sites changes dynamically and spreads rapidly through the Web. Developing methods for handling such information at a massive scale requires that we think about how information content varies over time, how it is transmitted, and how it mutates as it spreads. We describe the News Information Flow Tracking, Yay! (NIFTY) system for large-scale real-time tracking of “memes” — short textual phrases that travel and mutate through the Web. NIFTY is based on a novel, highly scalable incremental meme-clustering algorithm that efficiently extracts and identifies mutational variants of a single meme. NIFTY runs orders of magnitude faster than our previous MEMETRACKER system, while also maintaining better consistency and quality of extracted memes. We demonstrate the effectiveness of our approach by processing a 20-terabyte dataset of 6.1 billion blog posts and news articles that we have been continuously collecting for the last four years. NIFTY extracted 2.9 billion unique textual phrases and identified more than 9 million memes. Our meme-tracking algorithm was able to process the entire dataset in less than five days using a single machine. Furthermore, we also provide a live deployment of the NIFTY system that allows users to explore the dynamics of online news in near real-time.
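A greatly simplified sketch of incremental phrase clustering, in the spirit of but not identical to NIFTY's algorithm: each incoming phrase joins the most similar existing cluster by word overlap, or opens a new cluster when nothing is close enough. The threshold and phrases are illustrative.

```python
def jaccard(a, b):
    # word-set overlap between two phrases
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def cluster_stream(phrases, threshold=0.5):
    clusters = []  # each cluster is a list of phrases; clusters[i][0] is its representative
    for p in phrases:
        scores = [jaccard(p, c[0]) for c in clusters]
        if scores and max(scores) >= threshold:
            clusters[scores.index(max(scores))].append(p)
        else:
            clusters.append([p])
    return clusters

stream = [
    "yes we can believe in change",
    "yes we can believe in real change",
    "read my lips no new taxes",
    "no new taxes read my lips",
]
for cluster in cluster_stream(stream):
    print(cluster)
```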
Document Hierarchies from Text and Links
"... Hierarchical taxonomies provide a multi-level view of large document collections, allowing users to rapidly drill down to finegrained distinctions in topics of interest. We show that automatically induced taxonomies can be made more robust by combining text with relational links. The underlying mech ..."
Abstract
-
Cited by 8 (2 self)
Hierarchical taxonomies provide a multi-level view of large document collections, allowing users to rapidly drill down to fine-grained distinctions in topics of interest. We show that automatically induced taxonomies can be made more robust by combining text with relational links. The underlying mechanism is a Bayesian generative model in which a latent hierarchical structure explains the observed data — thus finding hierarchical groups of documents with similar word distributions and dense network connections. As a nonparametric Bayesian model, our approach does not require pre-specification of the branching factor at each non-terminal, but finds the appropriate level of detail directly from the data. Unlike many prior latent space models of network structure, the complexity of our approach does not grow quadratically in the number of documents, enabling application to networks with more than ten thousand nodes. Experimental results on hypertext and citation network corpora demonstrate the advantages of our hierarchical, multimodal approach.
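Not the paper's nonparametric Bayesian model, just a sketch of the underlying intuition that textual similarity and link structure can be combined before inducing a document hierarchy. The documents, the link matrix, and the 50/50 weighting are all invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

docs = [
    "bayesian topic models for document collections",
    "nonparametric priors for topic hierarchies",
    "routing protocols in wireless sensor networks",
    "energy efficient routing for sensor networks",
]
# symmetric citation/hyperlink adjacency (invented)
links = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

text_sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))
combined = 0.5 * text_sim + 0.5 * links   # naive convex combination of the two signals
dist = 1.0 - np.clip(combined, 0.0, 1.0)
np.fill_diagonal(dist, 0.0)

# agglomerative merge tree over the four documents
Z = linkage(squareform(dist, checks=False), method="average")
print(Z)
```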