Modeling Search Engine Effectiveness for Federated Search
Cited by 16 (2 self)
Federated search links multiple search engines into a single, virtual search system. Most prior research on federated search focused on selecting search engines that have the most relevant contents, but ignored the retrieval effectiveness of individual search engines. This omission can cause serious problems when federating search engines of differing quality.
Central-rank-based collection selection in uncooperative distributed information retrieval
- Proc. ECIR Conf, 2007
Cited by 14 (4 self)
Collection selection is one of the key problems in distributed information retrieval. Due to resource constraints, it is usually not feasible to search all collections in response to a query. Therefore, the central component (broker) selects a limited number of collections to be searched for each submitted query. During the past decade, several collection selection algorithms have been introduced. However, their performance varies across testbeds. We propose a new collection-selection method based on the ranking of downloaded sample documents. We test our method on six testbeds and show that our technique can significantly outperform other state-of-the-art algorithms in most cases. We also introduce a new testbed based on the TREC GOV2 documents.
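The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of selecting collections from a ranking of sampled documents. It assumes the broker has already ranked its sampled documents against the query, and it uses a simple linear rank weight; the function name `score_collections`, the `top_n` cutoff, and the weighting scheme are all illustrative assumptions, not the authors' method.

```python
from collections import defaultdict


def score_collections(ranked_samples, top_n=5):
    """Score collections from a central ranking of sampled documents.

    ranked_samples: list of (doc_id, collection_id) pairs, best match first.
    Each sampled document in the top_n contributes (top_n - rank) to the
    score of the collection it was sampled from (an assumed linear weight).
    """
    scores = defaultdict(float)
    for rank, (_doc_id, collection_id) in enumerate(ranked_samples[:top_n]):
        scores[collection_id] += top_n - rank
    # Highest score first; ties broken by collection id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical central ranking of sampled documents for one query.
ranking = [("d1", "A"), ("d2", "B"), ("d3", "A"), ("d4", "C")]
selected = score_collections(ranking, top_n=3)
```

With `top_n=3`, only the first three sampled documents contribute, so collection C receives no score and would not be selected. A practical variant would probably also normalise for collection and sample size.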
Updating Collection Representations For Federated Search
Cited by 6 (1 self)
To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval is therefore related to how precisely each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change; however, a solution has not yet been proposed. In this study we examine the implications of out-of-date representation sets on retrieval accuracy and propose three different policies for managing necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance over time; however, adopting a suitable update policy can minimise this problem.
Towards better measures: Evaluation of estimated resource description quality for distributed IR
- In First International Conference on Scalable Information Systems. IEEE CS Society, 2006
Cited by 4 (1 self)
An open problem for Distributed Information Retrieval (DIR) systems is how to represent large document repositories, also known as resources, both accurately and efficiently. Obtaining resource description estimates is an important phase in DIR, especially in non-cooperative environments. Measuring the quality of an estimated resource description is a contentious issue, as current measures do not provide an adequate indication of quality. In this paper, we provide an overview of the currently applied measures of resource description quality, before proposing the Kullback-Leibler (KL) divergence as an alternative. Through experimentation we illustrate the shortcomings of these past measures, whilst providing evidence that KL is a more appropriate measure of quality. When applying KL to compare different Query-Based Sampling (QBS) algorithms, our experiments provide strong evidence in favour of a previously unsupported hypothesis originally posited in the initial Query-Based Sampling work.
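As a rough illustration of using KL divergence to compare an estimated resource description against the actual one, the sketch below treats both as term-frequency distributions. The direction of the divergence (actual as P, estimate as Q), the additive smoothing constant `epsilon`, and the function name `kl_divergence` are assumptions made for this example; the abstract does not specify them.

```python
import math
from collections import Counter


def kl_divergence(actual_terms, estimated_terms, epsilon=1e-9):
    """KL(P || Q) between two term distributions built from token lists.

    P comes from the actual resource and Q from the estimated description.
    A small additive constant (an assumption of this sketch) keeps Q
    non-zero for terms the estimate has never seen.
    """
    p_counts = Counter(actual_terms)
    q_counts = Counter(estimated_terms)
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + epsilon * len(vocab)
    q_total = sum(q_counts.values()) + epsilon * len(vocab)

    divergence = 0.0
    for term in vocab:
        p = (p_counts[term] + epsilon) / p_total
        q = (q_counts[term] + epsilon) / q_total
        divergence += p * math.log(p / q)
    return divergence


# Hypothetical token streams from a resource and from two sampled estimates.
actual = ["search", "search", "engine", "query", "index"]
good_estimate = ["search", "engine", "query", "index"]
poor_estimate = ["search", "search", "search"]
```

A lower value indicates an estimate whose term distribution is closer to that of the actual resource, so here `good_estimate` should score lower than `poor_estimate`.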
Classification-Based Resource Selection
Cited by 4 (3 self)
In some retrieval situations, a system must search across multiple collections. This task, referred to as federated search, occurs, for example, when searching a distributed index or aggregating content for web search. Resource selection refers to the subtask of deciding, given a query, which collections to search. Most existing resource selection methods rely on evidence found in collection content. We present an approach to resource selection that combines multiple sources of evidence to inform the selection decision. We derive evidence from three different sources: collection documents, the topic of the query, and query click-through data. We combine this evidence by treating resource selection as a multiclass machine learning problem. Although machine-learned approaches often require large amounts of manually generated training data, we present a method for using automatically generated training data. We make use of and compare against prior resource selection work and evaluate across three experimental testbeds.
Compact features for detection of near-duplicates in distributed retrieval
- In Proceedings of String Processing and Information Retrieval Symposium (to appear), 2006
Cited by 2 (1 self)
In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system.
Studies on Relevance, Ranking and Results Display
- WWW.JOURNALOFCOMPUTING.ORG
This study considers the extent to which users with the same query agree as to what is relevant, and how what is considered relevant may translate into a retrieval algorithm and results display. To combine user perceptions of relevance with algorithm rank and to present results, we created a prototype digital library of scholarly literature. We confine studies to one population of scientists (paleontologists), one domain of scholarly scientific articles (paleo-related), and a prototype system (PaleoLit) that we built for the purpose. Based on the principle that users do not pre-suppose answers to a given query but that they will recognize what they want when they see it, our system uses a rules-based algorithm to cluster results into fuzzy categories with three relevance levels. Our system matches at least 1/3 of our participants' relevancy ratings 87% of the time. Our subsequent usability study found that participants trusted our uncertainty labels but did not value our color-coded horizontal results layout above a standard retrieval list. We posit that users make such judgments in limited time, and that time optimization per task might help explain some of our findings. Index Terms: knowledge retrieval; uncertainty, "fuzzy," and probabilistic reasoning; knowledge representation formalisms and methods
In Language and Information Technologies
, 2011
This dissertation would not have been possible without the persistent guidance and encouragement of my mentors. I owe a big 'thank you' to my advisor, Jamie Callan. It was during the Fall of 2004, when I took two half-semester courses taught by Jamie (Digital Libraries and Text Data Mining), that I discovered the field of Information Retrieval. Seven years later, I am writing my dissertation acknowledgments. Jamie's advice on both research and life in general has been a great source of insight and support. His feedback on research papers, presentations, lecture slides, proposals, and this dissertation has taught me a great deal about communicating effectively. From Jamie's example, I have learned valuable lessons on research, teaching, and mentoring. I would like to thank Jaime Carbonell, Yiming Yang, and Fernando Diaz for agreeing to be on my thesis committee. Their feedback was critical in making this dissertation stronger. Fernando Diaz helped shape many of the ideas presented in this dissertation. It was during an internship with Fernando at Yahoo! that I began working on vertical selection. I enjoyed it so much that I returned for a second internship a year later. I have been fortunate to have had Fernando as a mentor and collaborator ever since.
Using Past Queries for Resource Selection in Distributed Information Retrieval
, 2011
Federated text search provides a unified search interface for multiple search engines of distributed text information sources. Resource selection is an important component of federated text search; it selects a small number of information sources that contain the largest number of relevant documents for a user query. Most prior research on resource selection focused on selecting information sources by analyzing static information about the available sources that is sampled offline. However, most prior research ignored a large amount of valuable information, such as the results of past queries. This paper proposes a new resource selection technique, called qSim, that uses the search results of past queries to estimate the utilities of available information sources for a specific user query. The new algorithm calculates the similarity between the given query and each past query, and then estimates the utilities of the available information sources as a combination of the past queries' results weighted by those similarities. The new resource selection algorithm is practical because it does not require relevance judgments for past queries and uses only a regression-based results merging method to rank the results of past queries. This paper
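The weighted combination described above can be sketched as follows. This is not the qSim implementation: the similarity measure (Jaccard overlap of query terms), the normalisation by total weight, and the names `jaccard` and `estimate_utilities` are assumptions chosen only to make the idea concrete, and the per-source utilities of past queries are taken as given.

```python
def jaccard(query_a, query_b):
    """Jaccard overlap of lowercase query terms (an assumed similarity)."""
    terms_a = set(query_a.lower().split())
    terms_b = set(query_b.lower().split())
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def estimate_utilities(query, past_queries):
    """Estimate per-source utility for `query` from past query results.

    past_queries: list of (query_text, {source_id: utility}) pairs, where
    each utility summarises how useful that source was for the past query.
    Returns a similarity-weighted average utility for each source.
    """
    totals = {}
    weight_sum = 0.0
    for past_text, source_utilities in past_queries:
        weight = jaccard(query, past_text)
        if weight == 0.0:
            continue  # unrelated past queries contribute nothing
        weight_sum += weight
        for source_id, utility in source_utilities.items():
            totals[source_id] = totals.get(source_id, 0.0) + weight * utility
    if weight_sum == 0.0:
        return {}
    return {source_id: total / weight_sum for source_id, total in totals.items()}


# Hypothetical query log with per-source utilities for each past query.
past = [
    ("federated search engines", {"A": 1.0, "B": 0.0}),
    ("protein folding", {"A": 0.0, "B": 1.0}),
]
utilities = estimate_utilities("federated search", past)
```

Only the first past query shares terms with the new query, so source A receives all of the estimated utility in this toy log. An empty result signals that no past query was similar enough to inform the estimate.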

