PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities
, 2003
Cited by 139 (11 self)
Abstract. We present PlanetP, a peer-to-peer (P2P) content search and retrieval infrastructure targeting communities wishing to share large sets of text documents. P2P computing is an attractive model for information sharing between ad hoc groups of users because of its low cost of entry and explicit model for resource scaling. As communities grow, however, a key challenge becomes finding relevant information. To address this challenge, our design centers around indexing, content search, and retrieval rather than scalable name-based object location, which has been the focus of recent P2P systems. PlanetP takes the novel approach of replicating the global directory and a compact summary index at every peer using gossiping. PlanetP then leverages this information to approximate a state-of-the-art document ranking algorithm to help users locate relevant information within the large communal data set. Using a prototype implementation together with simulation, we show: (i) it is possible to design a gossiping algorithm that reliably maintains a copy of communal state at each peer yet requires only a modest amount of bandwidth, (ii) our content search and retrieval algorithm tracks the performance of the original ranking algorithm very closely, giving P2P communities a search and retrieval algorithm as good as that possible assuming a centralized server, and (iii) PlanetP’s gossiping and search and retrieval algorithms both scale well to communities of at least several thousand peers.
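To illustrate why gossiping can replicate state at every peer cheaply, here is a minimal sketch of generic push gossip. It is not PlanetP's actual protocol: the function name `gossip_rounds`, the single-update model, and the uniform random peer choice are assumptions made for illustration only.

```python
import random

def gossip_rounds(num_peers, seed=0, max_rounds=1000):
    """Count push-gossip rounds until every peer has heard one update.

    Hypothetical model: peer 0 originates the update, and in each round
    every informed peer pushes it to one peer chosen uniformly at random.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < num_peers and rounds < max_rounds:
        # One outgoing message per informed peer in this round.
        targets = {rng.randrange(num_peers) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds, len(informed)

rounds, reached = gossip_rounds(64)
print(f"{reached} peers informed after {rounds} rounds")
```

Because the informed set can at most double per round, at least log2(n) rounds are needed; in this simple model each peer sends only one message per round, which is the intuition behind the modest bandwidth claim.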
Query-Based Sampling of Text Databases
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 1999
Cited by 134 (13 self)
... This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
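The sampling loop can be sketched as follows, under stated assumptions: the database is reachable only through a search function, each probe keeps a few top results, and the next probe term is drawn at random from the documents sampled so far. The names `DOCS`, `toy_search`, and `query_based_sample`, and all parameter values, are hypothetical and not taken from the paper.

```python
import random

# Hypothetical toy database, visible only through toy_search().
DOCS = {
    1: "peer gossip index",
    2: "gossip protocol bandwidth",
    3: "index ranking query",
    4: "query sampling database",
}

def toy_search(term):
    """Return (doc_id, text) pairs whose text contains the term."""
    return [(d, t) for d, t in sorted(DOCS.items()) if term in t.split()]

def query_based_sample(search, seed_term, num_queries=20, docs_per_query=2, seed=0):
    """Build a resource description: term -> document frequency in the sample."""
    rng = random.Random(seed)
    sampled = {}                          # doc_id -> text, deduplicated
    term = seed_term
    for _ in range(num_queries):
        for doc_id, text in search(term)[:docs_per_query]:
            sampled.setdefault(doc_id, text)
        vocabulary = sorted({w for text in sampled.values() for w in text.split()})
        if not vocabulary:
            break                         # the seed term matched nothing
        term = rng.choice(vocabulary)     # next probe comes from the sample itself
    description = {}
    for text in sampled.values():
        for word in set(text.split()):
            description[word] = description.get(word, 0) + 1
    return description

print(query_based_sample(toy_search, "gossip"))
```

The only thing the sampler needs from the provider is the ordinary search interface, which is why no cooperation is required.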
A Local Search Mechanism for Peer-to-Peer Networks
, 2002
Cited by 96 (6 self)
One important problem in peer-to-peer (P2P) networks is searching and retrieving the correct information. However, existing searching mechanisms in pure peer-to-peer networks are inefficient due to the decentralized nature of such networks. We propose two mechanisms for information retrieval in pure peer-to-peer networks. The first, the modified Breadth-First-Search (BFS) mechanism, is an extension of the current Gnutella protocol, allows searching with keywords, and is designed to minimize the number of messages that are needed to search the network. The second, the Intelligent Search mechanism, uses the past behavior of the P2P network to further improve the scalability of the search procedure. In this algorithm, each peer autonomously decides which of its peers are most likely to answer a given query. The algorithm is entirely distributed, and therefore scales well with the size of the network. We implemented our mechanisms as middleware platforms. To show the advantages of our mechanisms we present experimental results using the middleware implementation.
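One way a peer might decide which neighbors are most likely to answer a query is to compare the query with queries each neighbor answered in the past. The sketch below assumes a per-neighbor history of answered queries and a summed cosine-similarity score; both the data layout and the scoring rule are illustrative assumptions, not the paper's exact formula.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_neighbors(query, history, k=2):
    """Pick the k neighbors whose past answers best match the query.

    history (hypothetical layout): neighbor id -> list of past queries
    that this neighbor answered.
    """
    q = Counter(query.split())
    scores = {
        peer: sum(cosine(q, Counter(past.split())) for past in answered)
        for peer, answered in history.items()
    }
    # Highest score first; ties broken by neighbor id for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]

history = {
    "A": ["jazz music", "rock music"],
    "B": ["linux kernel"],
    "C": ["music video"],
}
print(rank_neighbors("jazz music", history))  # forwards only to the two best neighbors
```

Forwarding to k selected neighbors instead of all of them is what reduces message count relative to flooding; each peer uses only its own local history, so no global state is needed.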
Distributed search over the hidden web: Hierarchical database sampling and selection
- In VLDB
, 2002
Cited by 85 (12 self)
Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from “uncooperative” databases by using “focused query probes,” which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases. Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies. Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts.
Relevant Document Distribution Estimation Method for Resource Selection
, 2003
Cited by 64 (15 self)
Prior research under a variety of conditions has shown the CORI algorithm to be one of the most effective resource selection algorithms, but the range of database sizes studied was not large. This paper shows that the CORI algorithm does not do well in environments with a mix of "small" and "very large" databases. A new resource selection algorithm is proposed that uses information about database sizes as well as database contents. We also show how to acquire database size estimates in uncooperative environments as an extension of the query-based sampling used to acquire resource descriptions. Experiments demonstrate that the database size estimates are more accurate for large databases than estimates produced by a competing method; the new resource ranking algorithm is always at least as effective as the CORI algorithm; and the new algorithm results in better document rankings than the CORI algorithm.
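The role of database size can be illustrated with a simplified sketch. It assumes that documents sampled from all databases have already been ranked against the query in one pooled index, and that each top-ranked sampled document stands in for (estimated database size / sample size) documents in its source database. The function name, inputs, and fixed cutoff are hypothetical simplifications, not the paper's exact algorithm.

```python
def rank_databases(ranked_sample_hits, db_sizes, sample_sizes, top_n=3):
    """Rank databases by size-scaled counts of top-ranked sampled documents.

    ranked_sample_hits: source database of each sampled document, in
    ranked order for the query (hypothetical input format).
    """
    estimates = {db: 0.0 for db in db_sizes}
    for db in ranked_sample_hits[:top_n]:
        # One sampled hit represents size/sample_size documents in that database.
        estimates[db] += db_sizes[db] / sample_sizes[db]
    ranking = sorted(estimates, key=lambda db: (-estimates[db], db))
    return ranking, estimates

hits = ["small", "small", "large", "small"]
sizes = {"small": 1_000, "large": 100_000}
samples = {"small": 100, "large": 100}
print(rank_databases(hits, sizes, samples))
```

With equal sample sizes, a single hit from the very large database outweighs two hits from the small one, which is the kind of size effect a content-only score would miss.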
SETS: Search Enhanced by Topic Segmentation
, 2003
Cited by 61 (4 self)
We present SETS, an architecture for building topic-segmented networks for efficient search. The key idea is to arrange participants in a topic-segmented topology where most of the links are short-distance links joining pairs of sites with similar content. The resulting topically focused regions are joined together into a single network by long-distance links. Queries are then matched and routed to only the topically closest regions. We draw on ideas from machine learning and social network theory to build an efficient search network. We discuss a variety of design issues and tradeoffs that an implementor of SETS would face. We show that SETS is efficient in network traffic and query processing load.
A language modeling framework for resource selection and results merging
- IN CIKM 2002
, 2002
Cited by 60 (5 self)
Statistical language models have been proposed recently for several information retrieval tasks, including the resource selection task in distributed information retrieval. This paper extends the language modeling approach to integrate resource selection, ad-hoc searching, and merging of results from different text databases into a single probabilistic retrieval model. This new approach is designed primarily for Intranet environments, where it is reasonable to assume that resource providers are relatively homogeneous and can adopt the same kind of search engine. Experiments demonstrate that this new, integrated approach is at least as effective as the prior state-of-the-art in distributed IR.
The Impact of Database Selection on Distributed Searching
- SIGIR
, 2000
Cited by 53 (12 self)
The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts -- database selection, query processing, and results merging. In this paper we examine the effect of database selection on retrieval performance. We look at retrieval performance in three different distributed retrieval testbeds and distill some general results. First we find that good database selection can result in better retrieval effectiveness than can be achieved in a centralized database. Second we find that good performance can be achieved when only a few sites are selected and that the performance generally increases as more sites are selected. Finally we find that when database selection is employed, it is not necessary to maintain collection wide information (CWI), e.g. global idf. Local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. This work suggests that improvements in database selection can lead to broader improvements in retrieval performance, even in centralized (i.e. single database) systems. Given a centralized database and a good selection mechanism, retrieval performance can be improved by decomposing that database conceptually and employing a selection step.
Text-Based Content Search and Retrieval in ad hoc P2P Communities
- In Proceedings of the International Workshop on Peer-to-Peer Computing (co-located with Networking
, 2002
Cited by 48 (10 self)
We consider the problem of content search and retrieval in peer-to-peer (P2P) communities. P2P computing is a potentially powerful model for information sharing between ad hoc groups of users because of its low cost of entry and natural model for resource scaling with community size. As P2P communities grow in size, however, locating information distributed across the large number of peers becomes problematic. We present a distributed text-based content search and retrieval algorithm to address this problem. Our algorithm is based on a state-of-the-art text-based document ranking algorithm: the vector-space model, instantiated with the TFxIDF ranking rule. A naive application of TFxIDF would require each peer in a community to collect an inverted index of the entire community. This is costly both in terms of bandwidth and storage. Instead, we show how TFxIDF can be approximated given compact summaries of peers’ local inverted indexes. We make three contributions: (a) we show how the TFxIDF rule can be adapted to use the index summaries, (b) we provide a heuristic for adaptively determining the set of peers that should be contacted for a query, and (c) we show that our algorithm tracks TFxIDF’s performance very closely, regardless of how documents are distributed throughout the community. Furthermore, our algorithm preserves the main flavor of TFxIDF by retrieving close to the same set of documents for any given query.
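A rough sketch of ranking peers from summaries alone: assume each peer's summary only reveals which terms the peer holds (plain Python sets stand in for whatever compact structure is used), and weight each query term by how few peers hold it, in the spirit of IDF. The weight `log(1 + N / N_t)` and the name `rank_peers` are assumptions for illustration, not a statement of the paper's exact rule.

```python
import math

def rank_peers(query_terms, summaries):
    """Rank peers using only per-peer term summaries.

    summaries (hypothetical layout): peer id -> set of terms the peer holds.
    """
    num_peers = len(summaries)
    scores = {peer: 0.0 for peer in summaries}
    for term in set(query_terms):
        holders = [p for p, terms in summaries.items() if term in terms]
        if not holders:
            continue                      # no peer can answer for this term
        # Terms held by few peers are more discriminating, as with IDF.
        weight = math.log(1 + num_peers / len(holders))
        for p in holders:
            scores[p] += weight
    return sorted(scores, key=lambda p: (-scores[p], p))

summaries = {
    "p1": {"gossip", "index"},
    "p2": {"index"},
    "p3": {"music"},
}
print(rank_peers(["gossip", "index"], summaries))
```

A querying peer could then contact peers in this order and stop once further peers stop contributing to the top results, which corresponds to the adaptive stopping heuristic described in contribution (b).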
Collection selection and results merging with topically organized U.S. patents and TREC data
- In CIKM 2000
, 2000
Cited by 38 (7 self)
We investigate three issues in distributed information retrieval, considering both TREC data and U.S. Patents: (1) topical organization of large text collections, (2) collection ranking and selection with topically organized collections, and (3) results merging, particularly document score normalization, with topically organized collections. We find that it is better to organize collections topically, and that topical collections can be well ranked using either INQUERY’s CORI algorithm, or the Kullback-Leibler divergence (KL), but KL is far worse than CORI for non-topically organized collections. For results merging, collections organized by topic require global idfs for the best performance. Contrary to results found elsewhere, normalized scores are not as good as global idfs for merging when the collections are topically organized.
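As a minimal sketch of KL-based collection ranking, the code below scores each collection by the KL divergence from a maximum-likelihood query model to a smoothed collection model, where lower divergence means a better match. The linear (Jelinek-Mercer style) smoothing, the weight `lam`, and the name `rank_collections` are assumptions; the abstract does not specify these details.

```python
import math
from collections import Counter

def rank_collections(query, collections, lam=0.5):
    """Order collections by KL(query model || collection model), best first."""
    # Background model pooled over all collections, used for smoothing.
    background = Counter(w for text in collections.values() for w in text.split())
    bg_len = sum(background.values())
    q = Counter(query.split())
    q_len = sum(q.values())
    scores = {}
    for name, text in collections.items():
        c = Counter(text.split())
        c_len = max(sum(c.values()), 1)   # guard against an empty collection
        kl = 0.0
        for term, count in q.items():
            if background[term] == 0:
                continue                  # term unseen everywhere: cannot discriminate
            p_q = count / q_len
            p_c = lam * c[term] / c_len + (1 - lam) * background[term] / bg_len
            kl += p_q * math.log(p_q / p_c)
        scores[name] = kl
    return sorted(scores, key=lambda n: (scores[n], n))

collections = {
    "patents": "claim invention apparatus claim",
    "news": "election market weather",
}
print(rank_collections("invention claim", collections))
```

A topically focused collection concentrates probability mass on its topic's vocabulary, so queries on that topic yield a clearly lower divergence for it than for unrelated collections.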

