Trend detection in folksonomies
- Proc. First International Conference on Semantics and Digital Media Technology (SAMT), volume 4306 of LNCS, 2006
Cited by 21 (3 self)
As the number of resources on the web exceeds by far the number of documents one can track, it becomes increasingly difficult to remain up to date on one's own areas of interest. The problem becomes more severe with the increasing fraction of multimedia data, from which it is difficult to extract some conceptual description of their contents. One way to overcome this problem is to use social bookmark tools, which are rapidly emerging on the web. In such systems, users set up lightweight conceptual structures called folksonomies and thus overcome the knowledge acquisition bottleneck. As more and more people participate in the effort, the use of a common vocabulary becomes more and more stable. We present an approach for discovering topic-specific trends within folksonomies. It is based on a differential adaptation of the PageRank algorithm to the triadic hypergraph structure of a folksonomy. The approach allows for any kind of data, as it does not rely on the internal structure of the documents. In particular, this allows different data types to be considered in the same analysis step. We run experiments on a large-scale
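To make the idea of a differential, PageRank-style ranking over (user, tag, resource) triples more concrete, here is a minimal sketch. It is not the authors' implementation: the edge weighting (co-occurrence counts), the damping value, the iteration count, and all function names are assumptions made for illustration. "Differential" is read here as the difference between a topic-biased ranking and an unbiased baseline.

```python
from collections import defaultdict


def build_graph(assignments):
    """Turn (user, tag, resource) triples into an undirected weighted graph.

    Assumption: edge weight = number of triples in which two nodes co-occur.
    Node names are prefixed so users, tags and resources cannot collide.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for user, tag, resource in assignments:
        nodes = (f"u:{user}", f"t:{tag}", f"r:{resource}")
        for a in nodes:
            for b in nodes:
                if a != b:
                    weights[a][b] += 1.0
    return weights


def weighted_rank(weights, preference=None, damping=0.7, iterations=50):
    """Power iteration for a weighted PageRank with an optional preference vector."""
    nodes = sorted(weights)
    if preference is None:
        preference = {n: 1.0 for n in nodes}
    total = sum(preference.values())
    preference = {n: preference.get(n, 0.0) / total for n in nodes}
    out_sum = {n: sum(weights[n].values()) for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * preference[n] for n in nodes}
        for a in nodes:
            for b, w in weights[a].items():
                new_rank[b] += damping * rank[a] * w / out_sum[a]
        rank = new_rank
    return rank


def differential_rank(weights, topic_nodes, damping=0.7):
    """Topic-biased rank minus the unbiased baseline rank, per node."""
    baseline = weighted_rank(weights, damping=damping)
    biased = weighted_rank(
        weights, preference={n: 1.0 for n in topic_nodes}, damping=damping
    )
    return {n: biased[n] - baseline[n] for n in baseline}


# Hypothetical toy data, not taken from the paper.
assignments = [
    ("alice", "python", "r1"),
    ("bob", "python", "r2"),
    ("carol", "cooking", "r3"),
    ("carol", "recipes", "r3"),
]
graph = build_graph(assignments)
scores = differential_rank(graph, topic_nodes=["t:python"])
```

Nodes with a positive score are more strongly associated with the chosen topic than their general popularity would suggest. Comparing such scores across time slices is one plausible way to look for trends, though the abstract above does not spell out that step.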
Spatial Variation in Search Engine Queries
- 2008
Cited by 11 (1 self)
Local aspects of Web search — associating Web content and queries with geography — is a topic of growing interest. However, the underlying question of how spatial variation is manifested in search queries is still not well understood. Here we develop a probabilistic framework for quantifying such spatial variation; on complete Yahoo! query logs, we find that our model is able to localize large classes of queries to within a few miles of their natural centers based only on the distribution of activity for the query. Our model provides not only an estimate of a query’s geographic center, but also a measure of its spatial dispersion, indicating whether it has highly local interest or broader regional or national appeal. We also show how variations on our model can track geographically shifting topics over time, annotate a map with each location’s “distinctive queries,” and delineate the “spheres of influence” for competing queries in the same general domain.
The Gist of Everything New: Personalized Top-k Processing over Web 2.0 Streams
Cited by 3 (1 self)
Web 2.0 portals have made content generation easier than ever, with millions of users contributing news stories in the form of posts in weblogs or short textual snippets as in Twitter. Efficient and effective filtering solutions are key to allowing users to stay tuned to this ever-growing ocean of information, releasing only relevant trickles of personal interest. In classical information filtering systems, user interests are formulated using standard IR techniques, and data from all available information sources is filtered based on a predefined absolute quality-based threshold. In contrast to this restrictive approach, which may still overwhelm the user with the returned stream of data, we envision a system which continuously keeps the user updated with only the top-k relevant new information. Freshness of data is guaranteed by considering it valid for a particular time interval, controlled by a sliding window. Considering relevance as relative to the existing pool of new information creates a highly dynamic setting. We present POL-filter, which together with our maintenance module constitutes an efficient solution to this kind of problem. We show by comprehensive performance evaluations using real-world data, obtained from a weblog crawl, that our approach brings performance gains compared to the state of the art.
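The setting described above, where only the top-k items within a sliding time window are reported, can be illustrated with a naive baseline. This is not POL-filter or its maintenance module; the class name, the scoring input, and the assumption that timestamps arrive in non-decreasing order are all hypothetical choices for this sketch.

```python
import heapq
from collections import deque


class SlidingWindowTopK:
    """Naive top-k over a time-based sliding window (illustrative baseline only)."""

    def __init__(self, k, window):
        self.k = k
        self.window = window
        self.items = deque()  # (timestamp, score, item) in arrival order

    def add(self, timestamp, score, item):
        # Assumption: timestamps are non-decreasing across calls.
        self.items.append((timestamp, score, item))
        self._expire(timestamp)

    def _expire(self, now):
        # Drop items that are no longer valid, i.e. older than the window.
        while self.items and self.items[0][0] <= now - self.window:
            self.items.popleft()

    def top_k(self, now):
        self._expire(now)
        best = heapq.nlargest(self.k, self.items, key=lambda entry: entry[1])
        return [item for _, _, item in best]


# Hypothetical stream of (timestamp, relevance score, item).
stream = SlidingWindowTopK(k=2, window=10)
stream.add(0, 0.9, "a")
stream.add(3, 0.5, "b")
stream.add(6, 0.7, "c")
first = stream.top_k(6)
stream.add(12, 0.1, "d")
second = stream.top_k(12)
```

Because relevance is relative to whatever is currently in the window, item "b" enters the result once "a" expires, even though its own score never changed. This baseline rescans the window on every query, which is the kind of cost a dedicated maintenance scheme would aim to avoid.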
Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries
Cited by 2 (0 self)
In this paper we investigate temporal patterns of web search queries. We carry out several evaluations to analyze the properties of temporal profiles of queries, revealing promising semantic and pragmatic relationships between words. We focus on two applications: query suggestion and query categorization. The former shows a potential for time-series similarity measures to identify specific semantic relatedness between words, which results in state-of-the-art performance in query suggestion while providing complementary information to more traditional distributional similarity measures. The query categorization evaluation suggests that the temporal profile alone is not a strong indicator of broad topical categories.
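As a rough illustration of using time-series similarity for query suggestion, the sketch below ranks queries by the Pearson correlation of their temporal profiles. The abstract does not say which similarity measures the authors used, so Pearson correlation, the function names, and the toy frequency counts are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no temporal signal
    return cov / sqrt(var_x * var_y)


def suggest(query, profiles, top_n=3):
    """Return the queries whose temporal profiles correlate best with `query`."""
    target = profiles[query]
    scored = [
        (pearson(target, profile), other)
        for other, profile in profiles.items()
        if other != query
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_n]]


# Hypothetical bimonthly query counts, invented for illustration.
profiles = {
    "gazpacho": [1, 2, 8, 9, 3, 1],
    "summer rash": [2, 3, 9, 10, 4, 2],
    "snow tires": [9, 7, 1, 1, 4, 9],
}
suggestions = suggest("gazpacho", profiles, top_n=1)
```

Two queries can share a seasonal shape without sharing any words, which is why a temporal measure of this kind can complement distributional similarity.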
Durable Top-k Search in Document Archives
Cited by 2 (0 self)
We propose and study a new ranking problem in versioned databases. Consider a database of versioned objects which have different valid instances along a history (e.g., documents in a web archive). Durable top-k search finds the set of objects that are consistently in the top-k results of a query (e.g., a keyword query) throughout a given time interval (e.g., from June 2008 to May 2009). Existing work on temporal top-k queries mainly focuses on finding the most representative top-k elements within a time interval. Such methods are not readily applicable to durable top-k queries. To address this need, we propose two techniques that compute the durable top-k result. The first is adapted from the classic top-k rank aggregation algorithm NRA. The second technique is based on a shared execution paradigm and is more efficient than the first approach. In addition, we propose a special indexing technique for archived data. The index, coupled with a space partitioning technique, improves performance even further. We use data from Wikipedia and the Internet Archive to demonstrate the efficiency and effectiveness of our solutions.
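The definition of a durable top-k result can be made concrete with a brute-force baseline: compute the top-k set for every version in the interval and intersect them. This is neither the NRA-based technique nor the shared-execution technique from the abstract; the snapshot representation, the strict "in every version" reading of "consistently", and the function names are assumptions.

```python
def top_k_ids(scores, k):
    """Ids of the k highest-scoring objects; ties are broken by id."""
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return {obj for obj, _ in ranked[:k]}


def durable_top_k(snapshots, k, start, end):
    """Objects that are in the top-k of every snapshot with start <= time <= end.

    `snapshots` maps a timestamp to a dict of {object_id: query score}.
    """
    durable = None
    for timestamp in sorted(snapshots):
        if start <= timestamp <= end:
            current = top_k_ids(snapshots[timestamp], k)
            durable = current if durable is None else durable & current
    return durable or set()


# Hypothetical scores of three documents for one keyword query over three versions.
snapshots = {
    1: {"a": 0.9, "b": 0.8, "c": 0.1},
    2: {"a": 0.7, "b": 0.2, "c": 0.6},
    3: {"a": 0.95, "b": 0.9, "c": 0.3},
}
result = durable_top_k(snapshots, k=2, start=1, end=3)
```

The cost grows with the number of versions in the interval, since every version is ranked from scratch. That is the inefficiency that more elaborate techniques and indexes are meant to reduce.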
Document Clustering with Bursty Information
- Computing and Informatics
Cited by 1 (1 self)
Nowadays, almost all text corpora, such as blogs, emails and RSS feeds, are collections of text streams. The traditional vector space model (VSM), or bag-of-words representation, cannot capture the temporal aspect of these text streams. So far, only a few bursty features have been proposed to create text representations with temporal modeling for the text streams. We propose bursty feature representations that perform better than VSM on various text mining tasks, such as document retrieval, topic modeling and text categorization. For text clustering, we propose a novel framework to generate a bursty distance measure. We evaluated it on UPGMA, Star and K-Medoids clustering algorithms. The bursty distance measure not only performed equally well on various text collections, but was also able to cluster the news articles related to specific events much better than other models.
Does Bad News Go Away Faster
- In Proceedings of the International Conference on Weblogs and Social Media (ICWSM), 2011
Cited by 1 (0 self)
We study the relationship between content and temporal dynamics of information on Twitter, focusing on the persistence of information. We compare two extreme temporal patterns in the decay rate of URLs embedded in tweets, defining a prediction task to distinguish between URLs that fade rapidly following their peak of popularity and those that fade more slowly. Our experiments show a strong association between the content and the temporal dynamics of information: given unigram features extracted from corresponding HTML webpages, a linear SVM classifier can predict the temporal pattern of URLs with high accuracy. We further explore the content of URLs in the two temporal classes using various textual analysis techniques (via LIWC and trend detection). We find that the rapidly-fading information contains significantly more words related to negative emotion, actions, and more complicated cognitive processes, whereas the persistent information contains more words related to positive emotion, leisure, and lifestyle.
Ranking of Evolving Stories Through Meta-Aggregation
In this paper we focus on the problem of ranking news stories within their historical context by exploiting their content similarity. We observe that news stories evolve and thus have to be ranked in a time- and query-dependent manner. We do this in two steps. First, the mining step discovers metastories, which constitute meaningful groups of similar stories that occur at arbitrary points in time. Second, the ranking step uses well-known measures of content similarity to construct implicit links among all metastories, and uses them to rank those metastories that overlap the time interval provided in a user query. We use real data from conventional and social media sources (weblogs) to study the impact of different meta-aggregation techniques and similarity measures on the final ranking. We evaluate the framework using both objective and subjective criteria, and discuss the selection of clustering method and similarity measure that lead to the best ranking results.
CoCITe — Coordinating Changes In Text
Text streams are ubiquitous and contain a wealth of information, but are typically orders of magnitude too large in scale for comprehensive human inspection. There is a need for tools that can detect and group changes occurring within text streams and sub-streams, in order to find, structure and summarize these changes for presentation to human analysts. This paper describes a procedure for efficiently finding step changes, trends, bursts and cyclic changes affecting frequencies of words, or more general lexical items, within streams of documents which may be optionally labeled with metadata. The common phenomenon of over-dispersion is accommodated using mixture distributions. A streaming implementation is described which can process data from a continuous feed. Anomalies can be detected, grouped, and rendered visually for human comprehension.
Index Terms: Statistical software, modeling structured, textual and multimedia data, text mining.
TM-LDA: Efficient Online Modeling of the Latent Topic Transitions in Social Media
Latent topic analysis has emerged as one of the most effective methods for classifying, clustering and retrieving textual data. However, existing models such as Latent Dirichlet Allocation (LDA) were developed for static corpora of relatively large documents. In contrast, much of the textual content on the web, and especially social media, is temporally sequenced, and comes in short fragments such as Tweets, Facebook status updates, or comments on YouTube. In this paper we propose a novel topic model, Temporal-LDA or TM-LDA, for efficiently mining streams of social text such as a Twitter stream for an author, by modeling the topics and topic transitions that naturally arise in such data. TM-LDA learns the transition parameters among topics by minimizing the prediction error on topic distribution in subsequent postings. After training, TM-LDA is thus able to accurately predict the expected topic distribution in future posts. To make these predictions more efficient for a realistic online prediction setting, we develop an efficient updating algorithm to adjust transition parameters as new documents stream in. Our empirical results, over a corpus of more than 30 million Twitter posts, show that TM-LDA significantly outperforms state-of-the-art static LDA models for estimating the topic distribution of new documents over time. We also demonstrate how TM-LDA is able to highlight interesting variations of common patterns of behavior across different cities, such as differences in the work-life rhythm of cities, and factors responsible for area-specific problems and complaints.

