The Happy Searcher: Challenges in Web Information Retrieval
- Proceedings of the 8th PRICAI Conference, 2004
Cited by 11 (0 self)
Search has arguably become the dominant paradigm for finding information on the World Wide Web. In order to build a successful search engine, there are a number of challenges that arise where techniques from artificial intelligence can be used to have a significant impact. In this paper, we explore a number of problems related to finding information on the web and discuss approaches that have been employed in various research programs, including some of those at Google. Specifically, we examine issues such as web graph analysis, statistical methods for inferring meaning in text, and the retrieval and analysis of newsgroup postings, images, and sounds. We show that by leveraging the vast amounts of data on the web, it is possible to successfully address problems in innovative ways that vastly improve on standard, but often data-impoverished, methods. We also present a number of open research problems to help spur further research in these areas.
IQN routing: Integrating quality and novelty in p2p querying and ranking
- In EDBT, 2006
Cited by 11 (7 self)
We consider a collaboration of peers autonomously crawling the Web. A pivotal issue when designing a peer-to-peer (P2P) Web search engine in this environment is query routing: selecting a small subset of (a potentially very large number of relevant) peers to contact to satisfy a keyword query. Existing approaches for query routing work well on disjoint data sets. However, naturally, the peers' data collections often highly overlap, as popular documents are highly crawled. Techniques for estimating the cardinality of the overlap between sets, designed for and incorporated into information retrieval engines, are very much lacking. In this paper we present a comprehensive evaluation of appropriate overlap estimators, showing how they can be incorporated into an efficient, iterative approach to query routing, coined Integrated Quality Novelty (IQN). We propose to further enhance our approach using histograms, combining overlap estimation with the available score/ranking information. Finally, we conduct a performance evaluation in MINERVA, our prototype P2P Web search engine.
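The abstract does not say which overlap estimators were evaluated. As an illustration only, the sketch below uses min-wise hashing, one well-known way to estimate the overlap between two sets from compact signatures, and converts the estimate into a novelty figure (how many of a candidate peer's documents the initiator does not yet hold). All function names and parameters here are hypothetical and are not taken from the paper or from MINERVA.

```python
import hashlib


def _hash(item: str, seed: int) -> int:
    """64-bit hash of item, salted by seed (stands in for one random permutation)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(doc_ids, num_hashes=128):
    """Keep, for each salted hash function, the minimum hash over the set."""
    return [min(_hash(d, seed) for d in doc_ids) for seed in range(num_hashes)]


def estimate_resemblance(sig_a, sig_b):
    """The fraction of matching positions estimates |A & B| / |A | B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def estimate_novelty(size_a, size_b, resemblance):
    """Estimated number of documents in B that are not already in A.

    From J = I / (|A| + |B| - I) it follows that I = J * (|A| + |B|) / (1 + J).
    """
    overlap = resemblance * (size_a + size_b) / (1.0 + resemblance)
    return size_b - overlap


# Hypothetical peers whose collections share 100 of 200 documents each.
peer_a = {f"doc{i}" for i in range(0, 200)}
peer_b = {f"doc{i}" for i in range(100, 300)}
j = estimate_resemblance(minhash_signature(peer_a), minhash_signature(peer_b))
print(round(estimate_novelty(len(peer_a), len(peer_b), j)))
```

A router in this style would only need to exchange the fixed-size signatures and the collection sizes, not the document lists themselves.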
Domain independent approaches for finding diverse plans
- In Proceedings of IJCAI’07, 2007
Cited by 8 (4 self)
In many planning situations, a planner is required to return a diverse set of plans satisfying the same goals, which will be used collectively by external systems. We take a domain-independent approach to solving this problem. We propose different domain-independent distance functions among plans that can provide meaningful insights about the diversity in the plan set. We then describe how two representative state-of-the-art domain-independent planning approaches – one based on compilation to CSP, and the other based on heuristic local search – can be adapted to produce diverse plans. We present empirical evidence demonstrating the effectiveness of our approaches.
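The abstract does not define its distance functions. As one plausible, purely illustrative instance, the sketch below treats a plan as a sequence of action names and measures distance as the Jaccard distance between the two action sets; the diversity of a plan set is then taken as its minimum pairwise distance. Both function names and the choice of measure are assumptions, not the paper's definitions.

```python
from itertools import combinations


def action_set_distance(plan_a, plan_b):
    """Jaccard distance between the action sets of two plans (0 = same actions)."""
    a, b = set(plan_a), set(plan_b)
    if not a and not b:
        return 0.0  # two empty plans are treated as identical
    return 1.0 - len(a & b) / len(a | b)


def min_pairwise_distance(plans, distance=action_set_distance):
    """Diversity of a plan set, taken here as the smallest pairwise distance."""
    if len(plans) < 2:
        raise ValueError("need at least two plans to measure diversity")
    return min(distance(p, q) for p, q in combinations(plans, 2))


# Hypothetical plans for the same goal, written as lists of action names.
plans = [
    ["load", "drive", "unload"],
    ["load", "fly", "unload"],
    ["carry"],
]
print(min_pairwise_distance(plans))
```

Because this measure ignores action order, two plans that use the same actions in a different sequence count as identical; a sequence-aware or state-based distance would be needed to separate them.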
Utility-based information distillation over temporally sequenced documents
- Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007
Cited by 8 (4 self)
This paper examines a new approach to information distillation over temporally ordered documents, and proposes a novel evaluation scheme for such a framework. It combines the strengths of and extends beyond conventional adaptive filtering, novelty detection and non-redundant passage ranking with respect to long-lasting information needs ('tasks' with multiple queries). Our approach supports fine-grained user feedback via highlighting of arbitrary spans of text, and leverages such information for utility optimization in adaptive settings. For our experiments, we defined hypothetical tasks based on news events in the TDT4 corpus, with multiple queries per task. Answer keys (nuggets) were generated for each query and a semiautomatic procedure was used for acquiring rules that allow automatically matching nuggets against system responses. We also propose an extension of the NDCG metric for assessing the utility of ranked passages as a combination of relevance and novelty. Our results show encouraging utility enhancements using the new approach, compared to the baseline systems without incremental learning or the novelty detection components.
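The abstract does not give the form of the proposed NDCG extension. For orientation, the sketch below shows standard NDCG with a log2 rank discount, plus one simple, assumed way to make gains novelty-aware: a passage earns credit only for nuggets that no higher-ranked passage has already covered. The `novelty_gains` helper is hypothetical and is not the paper's metric.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains):
    """DCG normalized by the DCG of the same gains sorted in descending order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def novelty_gains(passage_nuggets):
    """Gain per ranked passage = number of nuggets not seen at earlier ranks."""
    seen = set()
    gains = []
    for nuggets in passage_nuggets:
        new = set(nuggets) - seen
        gains.append(len(new))
        seen |= new
    return gains


# Hypothetical ranked list: the second passage only repeats nugget "b".
ranked = [{"a", "b"}, {"b"}, {"c"}]
gains = novelty_gains(ranked)
print(gains, ndcg(gains))
```

Normalizing by the sorted gains is a simplification: with order-dependent gains, the truly ideal ranking would have to be found by searching over orderings.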
Exploring Independent Trends in a Topic-Based Search Engine
- 2004
Cited by 6 (1 self)
Topic-based search engines are an alternative to the simple keyword search engines that are common in today's intranets. The temporal behaviour of the topics in a topic-model-based search engine can be used for trend analysis, which is an important research goal in its own right. We apply topic modelling to data from an online financial newspaper and show that some of the trends in the topics are consistent with common understanding.
TREC2002 Web, Novelty and Filtering Track Experiments using PIRCS
- In Proceedings of TREC11, 2002
Cited by 6 (0 self)
Introduction. In TREC2002, we participated in three tracks: web, novelty and adaptive filtering. The Web track has two tasks: distillation and named-page retrieval. Distillation is a new utility concept for ranking documents, and needs a new design for the output ranked document list after an ad-hoc retrieval from the web (.gov) collection. The Novelty track is a new task that involves identifying sentences relevant to a question and removing duplicate or non-novel entries from the answer list. The third track is adaptive filtering. We revived a filtering program that was functional at TREC-9 with some added capability. Sections 2, 3, 4 describe our participation in these tracks respectively. Section 5 has our conclusion. 2 The Web Track. This year the web track involves two tasks: topic distillation and named-page finding. Named-page finding is similar to last year's home page finding [1] except that an answer page may be a sub-site address containing what the user wants that is named in
On the usage of global document occurrences in peer-to-peer information systems
- In COOPIS 2005
Cited by 4 (3 self)
There exist a number of approaches for query processing in Peer-to-Peer information systems that efficiently retrieve relevant information from distributed peers. However, very few of them take into consideration the overlap between peers: as the most popular resources (e.g., documents or files) are often present at most of the peers, a large fraction of the documents eventually received by the query initiator are duplicates. We develop a technique based on the notion of global document occurrences (GDO) that, when processing a query, penalizes frequent documents increasingly as more and more peers contribute their local results. We argue that the additional effort to create and maintain the GDO information is reasonably low, as the necessary information can be piggybacked onto the existing communication. Early experiments indicate that our approach significantly decreases the number of peers that have to be involved in a query to reach a certain level of recall and, thus, decreases user-perceived latency and the wastage of network resources.
Combine multiple forms of evidence while filtering
- In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005
Cited by 4 (2 self)
This paper studies how to go beyond relevance and enable a filtering system to learn more interesting and detailed data-driven user models from multiple forms of evidence. We carry out a user study using a real-time web-based personal news filtering system, and collect extensive multiple forms of evidence, including explicit and implicit user feedback. We explore the graphical modeling approach to combine these forms of evidence. To test whether the approach can help us understand the domain better, we use a graph structure learning algorithm to derive the causal relationships between different forms of evidence. To test whether the approach can help the system improve its performance, we use graphical inference algorithms to predict whether a user likes a document based on multiple forms of evidence. The results show that combining multiple forms of evidence using graphical models can help us better understand the filtering problem, improve filtering system performance, and handle various missing-data situations naturally.
Novelty detection for cross-lingual news stories with visual duplicates and speech transcripts
- In Proceedings of ACM Multimedia, 2007
Cited by 3 (0 self)
An overwhelming volume of news videos from different channels and languages is available today, which demands automatic management of this abundant information. To effectively search, retrieve, browse and track cross-lingual news stories, a news story similarity measure plays a critical role in assessing the novelty and redundancy among them. In this paper, we explore novelty and redundancy detection with visual duplicates and speech transcripts for cross-lingual news stories. News stories are represented by a sequence of keyframes in the visual track and a set of words extracted from the speech transcript in the audio track. A major difference from pure text documents is that the number of keyframes in one story is relatively small compared to the number of words, and there exist a large number of non-near-duplicate keyframes. These features make the behavior of similarity measures different from that on traditional textual collections. Furthermore, the textual features and visual features complement each other for news stories, and they can be combined to boost performance. Experiments on the TRECVID-2005 cross-lingual news video corpus show that approaches based on textual features and on visual features perform differently, and that measures on visual features are quite effective. Overall, the cosine distance on keyframes is still a robust measure. Language models built on visual features demonstrate promising performance. The fusion of textual and visual features improves overall performance.
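The abstract does not spell out how keyframes are turned into vectors for the cosine measure. The sketch below assumes each story has already been reduced to a bag of near-duplicate keyframe cluster identifiers (a hypothetical preprocessing step) and computes the cosine similarity and distance between two such bags; the same function works unchanged on transcript words.

```python
import math
from collections import Counter


def cosine_similarity(items_a, items_b):
    """Cosine similarity between two bags of items (keyframe cluster IDs or words)."""
    va, vb = Counter(items_a), Counter(items_b)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty story shares nothing with any other story
    return dot / (norm_a * norm_b)


def cosine_distance(items_a, items_b):
    """Distance form: 0 for identical bags, 1 for bags with nothing in common."""
    return 1.0 - cosine_similarity(items_a, items_b)


# Hypothetical stories described by near-duplicate keyframe cluster IDs.
story_1 = ["kf12", "kf40", "kf40", "kf77"]
story_2 = ["kf40", "kf77", "kf91"]
print(cosine_distance(story_1, story_2))
```

A new story whose distance to every earlier story stays above some threshold would be flagged as novel; the threshold itself would have to be tuned on held-out data.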
Identifying Opinion Leaders in the Blogosphere
Cited by 3 (0 self)
Opinion leaders are those who bring in new information, ideas, and opinions, then disseminate them to the masses, and thus influence the opinions and decisions of others by word of mouth. Opinion leaders capture the most representative opinions in the social network, and consequently are important for understanding the massive and complex blogosphere. In this paper, we propose a novel algorithm called InfluenceRank to identify opinion leaders in the blogosphere. The InfluenceRank algorithm ranks blogs according to not only how important they are compared to other blogs, but also how novel the information they contribute to the network is. Experimental results indicate that our proposed algorithm is effective in identifying influential opinion leaders.

