You Are What You Tweet: Analyzing Twitter for Public Health
Cited by 10 (0 self)
Analyzing user messages in social media can measure different population characteristics, including public health measures. For example, recent work has correlated Twitter messages with influenza rates in the United States; but this has largely been the extent of mining Twitter for public health. In this work, we consider a broader range of public health applications for Twitter. We apply the recently introduced Ailment Topic Aspect Model to over one and a half million health-related tweets and discover mentions of over a dozen ailments, including allergies, obesity and insomnia. We introduce extensions to incorporate prior knowledge into this model and apply it to several tasks: tracking illnesses over time (syndromic surveillance), measuring behavioral risk factors, localizing illnesses by geographic region, and analyzing symptoms and medication usage. We show quantitative correlations with public health data and qualitative evaluations of model output. Our results suggest that Twitter has broad applicability for public health research.
Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
Cited by 9 (1 self)
We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.
Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
Cited by 4 (1 self)
This paper describes DUALIST, an active learning annotation paradigm which solicits and learns from labels on both features (e.g., words) and instances (e.g., documents). We present a novel semi-supervised training algorithm developed for this setting, which is (1) fast enough to support real-time interactive speeds, and (2) at least as accurate as preexisting methods for learning with mixed feature and instance labels. Human annotators in user studies were able to produce near-state-of-the-art classifiers, on several corpora in a variety of application domains, with only a few minutes of effort.
Short Text Conceptualization Using a Probabilistic Knowledgebase
Cited by 3 (3 self)
Most text mining tasks, including clustering and topic detection, are based on statistical methods that treat text as bags of words. Semantics in the text is largely ignored in the mining process, and mining results often have low interpretability. One particular challenge faced by such approaches lies in short text understanding, as short texts lack enough content from which statistical conclusions can be drawn easily. In this paper, we improve text understanding by using a probabilistic knowledgebase that is as rich as our mental world in terms of the concepts (of worldly facts) it contains. We then develop a Bayesian inference mechanism to conceptualize words and short text. We conducted comprehensive experiments on conceptualizing textual terms, and clustering short pieces of text such as Twitter messages. Compared to purely statistical methods such as latent semantic topic modeling or methods that use existing knowledgebases (e.g., WordNet, Freebase and Wikipedia), our approach brings significant improvements in short text understanding as reflected by the clustering accuracy.
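As a rough illustration of what Bayesian conceptualization of a short text can look like, the sketch below scores candidate concepts for a bag of terms with a naive Bayes rule, P(concept | terms) ∝ P(concept) · Π P(term | concept). This is an assumption for illustration only: the abstract does not specify the inference mechanism, and the tiny knowledgebase, its probabilities, and the names `TERM_GIVEN_CONCEPT`, `CONCEPT_PRIOR`, `SMOOTHING`, and `conceptualize` are all hypothetical.

```python
import math

# Hypothetical toy knowledgebase: P(term | concept) values invented for illustration.
TERM_GIVEN_CONCEPT = {
    "fruit": {"apple": 0.30, "orange": 0.30, "banana": 0.40},
    "company": {"apple": 0.50, "microsoft": 0.40, "orange": 0.10},
}
# Hypothetical uniform prior over concepts.
CONCEPT_PRIOR = {"fruit": 0.5, "company": 0.5}
# Probability assigned to a term that is unseen under a concept (an assumption).
SMOOTHING = 1e-6


def conceptualize(terms):
    """Return P(concept | terms), assuming terms are independent given the concept."""
    log_scores = {}
    for concept, prior in CONCEPT_PRIOR.items():
        likelihoods = TERM_GIVEN_CONCEPT[concept]
        log_scores[concept] = math.log(prior) + sum(
            math.log(likelihoods.get(term, SMOOTHING)) for term in terms
        )
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}
```

Under these toy numbers, an ambiguous term such as "apple" is pulled towards "company" when it co-occurs with "microsoft" and towards "fruit" when it co-occurs with "banana", which is the kind of disambiguation a short text needs.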
Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
Cited by 3 (2 self)
Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn’t require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
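The candidate-generation step can be sketched roughly as follows. This is not the paper's method: plain character-level similarity from Python's standard `difflib` stands in for morphophonemic similarity, and the ill-formed word classifier and the context-based selection step are omitted. The lexicon and the name `correction_candidates` are hypothetical.

```python
import difflib

# Hypothetical in-vocabulary lexicon; a real system would use a large dictionary.
LEXICON = ["making", "sense", "tomorrow", "together", "people", "please"]


def correction_candidates(token, lexicon=LEXICON, cutoff=0.6, limit=3):
    """Return lexicon words similar to a token, best match first.

    A token already in the lexicon is returned unchanged as the only candidate.
    """
    word = token.lower()
    if word in lexicon:
        return [word]
    # get_close_matches ranks by SequenceMatcher ratio and drops matches below cutoff.
    return difflib.get_close_matches(word, lexicon, n=limit, cutoff=cutoff)
```

With this toy lexicon, the tokens "makn" and "sens" from the paper's title map to "making" and "sense". A fuller system would then rank the candidates using the surrounding context, as the abstract describes.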
Towards Conversation Entailment: An Empirical Investigation
Cited by 1 (0 self)
While a significant amount of research has been devoted to textual entailment, automated entailment from conversational scripts has received less attention. To address this limitation, this paper investigates the problem of conversation entailment: automated inference of hypotheses from conversation scripts. We examine two levels of semantic representations: a basic representation based on syntactic parsing from conversation utterances and an augmented representation that takes conversation structures into consideration. For each of these levels, we further explore two ways of capturing long distance relations between language constituents: implicit modeling based on the length of distance and explicit modeling based on actual patterns of relations. Our empirical findings show that the augmented representation with conversation structures is important and achieves the best performance when combined with explicit modeling of long distance relations.
Smoothing Techniques for Adaptive Online Language Models: Topic Tracking in Tweet Streams
Cited by 1 (0 self)
We are interested in the problem of tracking broad topics such as “baseball” and “fashion” in continuous streams of short texts, exemplified by tweets from the microblogging service Twitter. The task is conceived as a language modeling problem where per-topic models are trained using hashtags in the tweet stream, which serve as proxies for topic labels. Simple perplexity-based classifiers are then applied to filter the tweet stream for topics of interest. Within this framework, we evaluate, both intrinsically and extrinsically, smoothing techniques for integrating “foreground” models (to capture recency) and “background” models (to combat sparsity), as well as different techniques for retaining history. Experiments show that unigram language models smoothed using a normalized extension of stupid backoff and a simple queue for history retention perform well on the task.
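The overall framework can be sketched roughly as below: a unigram model over a fixed-length queue of recent tweets, mixed with a background model, and scored by perplexity. Note the swap: plain linear interpolation is used here in place of the normalized stupid backoff extension, whose details the abstract does not give. The class and function names, the mixing weight `lam`, and the probability `floor` are all assumptions.

```python
import math
from collections import Counter, deque


class QueueUnigramModel:
    """Unigram counts over the most recent `max_tweets` tweets (a simple queue)."""

    def __init__(self, max_tweets):
        self.max_tweets = max_tweets
        self.history = deque()
        self.counts = Counter()
        self.total = 0

    def add(self, tokens):
        self.history.append(tokens)
        self.counts.update(tokens)
        self.total += len(tokens)
        if len(self.history) > self.max_tweets:
            # Evict the oldest tweet so the model tracks recent usage.
            old = self.history.popleft()
            self.counts.subtract(old)
            self.total -= len(old)

    def prob(self, token):
        return self.counts[token] / self.total if self.total else 0.0


def perplexity(tokens, foreground, background, lam=0.7, floor=1e-6):
    """Perplexity of a tweet under a linear mix of foreground and background models."""
    if not tokens:
        return float("inf")
    log_sum = 0.0
    for tok in tokens:
        p = lam * foreground.prob(tok) + (1 - lam) * background.prob(tok)
        # Floor the probability so unseen tokens do not produce log(0).
        log_sum += math.log(max(p, floor))
    return math.exp(-log_sum / len(tokens))
```

A topic filter would keep one foreground model per hashtag-derived topic and accept a tweet for a topic when its perplexity falls below a tuned threshold. Lower perplexity means the tweet looks more like recent tweets on that topic.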
Linguistic Redundancy in Twitter
Cited by 1 (0 self)
In the last few years, the interest of the research community in micro-blogs and social media services, such as Twitter, has grown exponentially. Yet, so far not much attention has been paid to a key characteristic of microblogs: the high level of information redundancy. The aim of this paper is to systematically approach this problem by providing an operational definition of redundancy. We cast redundancy in the framework of Textual Entailment Recognition. We also provide quantitative evidence on the pervasiveness of redundancy in Twitter, and describe a dataset of redundancy-annotated tweets. Finally, we present a general purpose system for identifying redundant tweets. An extensive quantitative evaluation shows that our system successfully solves the redundancy detection task, improving over baseline systems with statistical significance.
Filter, Rank, and Transfer the Knowledge: Learning to Chat, 2010
We propose a discriminative approach for automatically training chatbots to provide relevant and interesting responses. In contrast to most prior work, our approach is not based on hard-wiring response rules, but rather relies on machine learning. We set ourselves the task of ranking a repository of responses to find the most suitable response. This work is just a first step towards the more general goal of then modifying the result to form a more appropriate response. We use a large corpus of public Twitter and LiveJournal conversations as training data for the learning task. Selecting an appropriate response from this repository, given new input from a user, is done in three phases. First, a fast filtering approach removes most irrelevant sentences. Second, a boosted tree ranker (using features that are very efficient to compute) further shrinks the set of candidate responses. Finally, a more precise content-oriented ranking framework is used to output the final response. In addition to our offline repository of dialogs, we also exploit a smaller repository of human-generated and labeled instances. These data are collected through a web application in which human users interact with the system and provide suggestions and feedback regarding the responses. The response selection is mainly based on content-oriented features and uses the “winnow” multiplicative weight online learning approach. Having a large corpus of noisy offline Twitter and LiveJournal data as a source knowledge domain, and a moderate repository of less noisy, labeled online conversations as a destination knowledge domain, we
Experimentation, Algorithms
Twitter, a micro-blogging platform with an estimated 20 million unique monthly visitors and over 100 million registered users, offers an abundance of rich, structured data at a rate exceeding 600 tweets per second. Recent efforts to leverage this social data to rank users by quality and topical relevance have largely focused on the “follow” relationship. Twitter’s data offers additional implicit relationships between users, however, such as “retweets” and “mentions”. In this paper we investigate the semantics of the follow and retweet relationships. Specifically, we show that the transitivity of topical relevance is better preserved over retweet links, and that retweeting a user is a significantly stronger indicator of topical interest than following him. We demonstrate these properties by ranking users with two variants of the PageRank algorithm; one based on the follows subgraph

