ConceptDoppler: A weather tracker for Internet censorship
In 14th ACM Conference on Computer and Communications Security, 2007
Cited by 11 (2 self)
The text of this paper has passed across many Internet routers on its way to the reader, but some routers will not pass it along unfettered because of censored words it contains. We present two sets of results: 1) Internet measurements of keyword filtering by the Great “Firewall” of China (GFC); and 2) initial results of using latent semantic analysis as an efficient way to reproduce a blacklist of censored words via probing. Our Internet measurements suggest that the GFC’s keyword filtering is more a panopticon than a firewall, i.e., it need not block every illicit word, but only enough to promote self-censorship. China’s largest ISP, ChinaNET, performed 83.3% of all filtering of our probes, and 99.1% of all filtering that occurred at the first hop past the Chinese border. Filtering occurred beyond the third hop for 11.8% of our probes, and there were sometimes as many as 13 hops past the border to a filtering router. Approximately 28.3% of the Chinese hosts we sent probes to were reachable along paths that were not filtered at all. While more tests are needed to provide a definitive picture of the GFC’s implementation, our results disprove the notion that GFC keyword filtering is a firewall strictly at the border of China’s Internet. While evading a firewall a single time defeats its purpose, it would be necessary to evade a panopticon almost every time. Thus, in lieu of evasion, we propose ConceptDoppler, an architecture
Towards Recency Ranking in Web Search
Cited by 11 (7 self)
In web search, recency ranking refers to ranking documents by relevance which takes freshness into account. In this paper, we propose a retrieval system which automatically detects and responds to recency sensitive queries. The system detects recency sensitive queries using a high precision classifier. The system responds to recency sensitive queries by using a machine learned ranking model trained for such queries. We use multiple recency features to provide temporal evidence which effectively represents document recency. Furthermore, we propose several training methodologies important for training recency sensitive rankers. Finally, we develop new evaluation metrics for recency sensitive queries. Our experiments demonstrate the efficacy of the proposed approaches.
Click-Through Prediction for News Queries
Cited by 10 (1 self)
A growing trend in commercial search engines is the display of specialized content such as news, products, etc. interleaved with web search results. Ideally, this content should be displayed only when it is highly relevant to the search query, as it competes for space with “regular” results and advertisements. One measure of the relevance to the search query is the click-through rate the specialized content achieves when displayed; hence, if we can predict this click-through rate accurately, we can use this as the basis for selecting when to show specialized content. In this paper, we consider the problem of estimating the click-through rate for dedicated news search results. For queries for which news results have been displayed repeatedly before, the click-through rate can be tracked online; however, the key challenge for which previously unseen queries to display news results remains. In this paper we propose a supervised model that offers accurate prediction of news click-through rates and satisfies the requirement of adapting quickly to emerging news events.
iScore: Measuring the Interestingness of Articles in a Limited User Environment
In IEEE Symposium on Computational Intelligence and Data Mining, 2007
Cited by 5 (2 self)
Search engines, such as Google, assign scores to news articles based on their relevancy to a query. However, not all relevant articles for the query may be interesting to a user. For example, if the article is old or yields little new information, the article would be uninteresting. Relevancy scores do not take into account what makes an article interesting, which would vary from user to user. Although methods such as collaborative filtering have been shown to be effective in recommendation systems, in a limited user environment there are not enough users that would make collaborative filtering effective. We present a general framework for defining and measuring the “interestingness” of articles, incorporating user-feedback. We show 21% improvement over traditional IR methods.
A Language Modeling Approach for Temporal Information Needs
In Proceedings of the 32nd European Conference on Information Retrieval (ECIR 2010), 2010
Cited by 5 (3 self)
This work addresses information needs that have a temporal dimension conveyed by a temporal expression in the user’s query. Temporal expressions such as “in the 1990s” are frequent, easily extractable, but not leveraged by existing retrieval models. One challenge when dealing with them is their inherent uncertainty. It is often unclear which exact time interval a temporal expression refers to. We integrate temporal expressions into a language modeling approach, thus making them first-class citizens of the retrieval model and considering their inherent uncertainty. Experiments on the New York Times Annotated Corpus using Amazon Mechanical Turk to collect queries and obtain relevance assessments demonstrate that our approach yields substantial improvements in retrieval effectiveness.
Concordance-based entity-oriented search
In IEEE/ACM Web Intelligence (WI-07), Silicon Valley, CA
Cited by 4 (0 self)
We consider the problem of finding the relevant named entities in response to a search query over a given text corpus. Entity search can readily be used to augment conventional web search engines for a variety of applications. To assess the significance of entity search, we analyzed the AOL dataset of 36 million web search queries with respect to two different sets of entities: namely (a) 2.3 million distinct entities extracted from a news text corpus and (b) 2.9 million Wikipedia article titles. The results clearly indicate that search engines should be aware of entities, for under various criteria of matching between 18-39% of all web search queries can be recognized as specifically searching for entities, while 73-87% of all queries contain entities. Our entity search engine creates a concordance document for each entity, consisting of all the sentences in the corpus containing that entity. We then index and search these documents using open-source search software. This gives a ranked list of entities as the result of search.
The Anatomy of a News Search Engine
2005
Cited by 4 (0 self)
Today, news browsing and searching is one of the most important Internet activities. This paper introduces a general framework to build a News search engine by describing Velthune, an academic News search engine available online.
Efficient Time-Travel on Versioned Text Collections
Cited by 2 (2 self)
The availability of versioned text collections such as the Internet Archive opens up opportunities for time-aware exploration of their contents. In this paper, we propose time-travel retrieval and ranking that extends traditional keyword queries with a temporal context in which the query should be evaluated. More precisely, the query is evaluated over all states of the collection that existed during the temporal context. In order to support these queries, we make key contributions in (i) defining extensions to well-known relevance models that take into account the temporal context of the query and the version history of documents, (ii) designing an immortal index over the full versioned text collection that avoids a blowup in index size, and (iii) making the popular NRA algorithm for top-k query processing aware of the temporal context. We present preliminary experimental analysis over the English Wikipedia revision history showing that the proposed techniques are both effective and efficient.
Durable Top-k Search in Document Archives
Cited by 2 (0 self)
We propose and study a new ranking problem in versioned databases. Consider a database of versioned objects which have different valid instances along a history (e.g., documents in a web archive). Durable top-k search finds the set of objects that are consistently in the top-k results of a query (e.g., a keyword query) throughout a given time interval (e.g., from June 2008 to May 2009). Existing work on temporal top-k queries mainly focuses on finding the most representative top-k elements within a time interval. Such methods are not readily applicable to durable top-k queries. To address this need, we propose two techniques that compute the durable top-k result. The first is adapted from the classic top-k rank aggregation algorithm NRA. The second technique is based on a shared execution paradigm and is more efficient than the first approach. In addition, we propose a special indexing technique for archived data. The index, coupled with a space partitioning technique, improves performance even further. We use data from Wikipedia and the Internet Archive to demonstrate the efficiency and effectiveness of our solutions.
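The durable top-k definition in the abstract above can be illustrated with a naive baseline. This is only a sketch under stated assumptions, not either of the paper's two techniques (the NRA adaptation or the shared-execution approach): it assumes the time interval has already been discretized into a list of per-version score tables, and it reads "consistently in the top-k" as "in the top-k of every version". The names `top_k`, `durable_top_k`, and `snapshots` are hypothetical.

```python
from typing import Dict, List, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the ids of the k highest-scoring objects (ties broken by id)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {obj_id for obj_id, _ in ranked[:k]}


def durable_top_k(snapshots: List[Dict[str, float]], k: int) -> Set[str]:
    """Return objects that are in the top-k of every snapshot (assumed reading)."""
    if not snapshots:
        return set()
    durable = top_k(snapshots[0], k)
    for scores in snapshots[1:]:
        durable &= top_k(scores, k)  # keep only objects still in the top-k
        if not durable:
            break  # nothing can become durable again once the set is empty
    return durable


# Hypothetical query scores for three versions of a small archive.
snapshots = [
    {"a": 0.9, "b": 0.8, "c": 0.1},
    {"a": 0.7, "b": 0.2, "c": 0.6},
    {"a": 0.5, "b": 0.4, "c": 0.3},
]
print(durable_top_k(snapshots, k=2))  # {'a'}
```

This baseline ranks every version in full, which is the cost the indexing and shared-execution ideas mentioned in the abstract aim to avoid.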
Time will tell: Leveraging temporal expressions in IR
In WSDM, 2009
Cited by 2 (1 self)
Temporal expressions, such as “between 1992 and 2000”, are frequent across many kinds of documents. Text retrieval, though, treats them as common terms, thus ignoring their inherent semantics. For queries with a strong temporal component, such as “U.S. president 1997”, this leads to a decrease in retrieval effectiveness, since relevant documents (e.g., a biography of Bill Clinton containing the aforementioned temporal expression) cannot be reliably matched to the query. We propose a novel approach, based on language models, to make temporal expressions first-class citizens of the retrieval model. In addition, we present experiments that show actual improvements in retrieval effectiveness.

