Routing indices for peer-to-peer systems, 2002
Cited by 313 (12 self)
Finding information in a peer-to-peer system currently requires either a costly and vulnerable central index, or flooding the network with queries. In this paper we introduce the concept of Routing Indices (RIs), which allow nodes to forward queries to neighbors that are more likely to have answers. If a node cannot answer a query, it forwards the query to a subset of its neighbors, based on its local RI, rather than by selecting neighbors at random or by flooding the network by forwarding the query to all neighbors. We present three RI schemes: the compound, the hop-count, and the exponential routing indices. We evaluate their performance via simulations, and find that RIs can improve performance by one or two orders of magnitude vs. a flooding-based system, and by up to 100% vs. a random forwarding system. We also discuss the tradeoffs between the different RI schemes and highlight the effects of key design variables on system performance.
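As a rough illustration of forwarding by a local routing index, the sketch below ranks neighbors by an estimated number of matching documents and keeps the best few. The index layout (a `_total` count plus per-topic counts), the scoring formula, and the names `RoutingIndex` and `rank_neighbors` are assumptions made for this example; the abstract does not specify them.

```python
from typing import Dict, List

# Hypothetical routing index: for each neighbor, the total number of documents
# reachable through it ("_total") and a per-topic document count.
RoutingIndex = Dict[str, Dict[str, int]]


def rank_neighbors(ri: RoutingIndex, query_topics: List[str], k: int) -> List[str]:
    """Return up to k neighbors, best first, skipping neighbors that score zero."""

    def goodness(entry: Dict[str, int]) -> float:
        # Assumed estimator: total documents times the fraction of documents
        # on each query topic, treating topics as independent.
        total = entry.get("_total", 0)
        if total == 0:
            return 0.0
        score = float(total)
        for topic in query_topics:
            score *= entry.get(topic, 0) / total
        return score

    scored = [(goodness(entry), name) for name, entry in ri.items()]
    scored = [(s, n) for s, n in scored if s > 0]
    # Highest score first; ties broken by neighbor name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


ri = {
    "A": {"_total": 100, "databases": 20, "languages": 30},
    "B": {"_total": 1000, "databases": 0, "languages": 50},
    "C": {"_total": 200, "databases": 100, "languages": 150},
}
print(rank_neighbors(ri, ["databases", "languages"], 2))
```

Neighbor B is never chosen for this query despite being the largest, because its index shows no documents on one of the query topics. Flooding would have forwarded the query to it anyway.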
Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks, 2003
Cited by 184 (7 self)
Content-based full-text search is a challenging problem in Peer-to-Peer (P2P) systems. Traditional approaches have either been centralized or use flooding to ensure accuracy of the results returned. In this paper, we present pSearch, a decentralized non-flooding P2P information retrieval system. pSearch distributes document indices through the P2P network based on document semantics generated by Latent Semantic Indexing (LSI). The search cost (in terms of different nodes searched and data transmitted) for a given query is thereby reduced, since the indices of semantically related documents are likely to be co-located in the network. We also describe techniques that help distribute the indices more evenly across the nodes, and further reduce the number of nodes accessed using appropriate index distribution as well as using index samples and recently processed queries to guide the search. Experiments show that pSearch can achieve performance comparable to centralized information retrieval systems by searching only a small number of nodes. For a system with 128,000 nodes and 528,543 documents (from news, magazines, etc.), pSearch searches only 19 nodes and transmits only 95.5KB data during the search, whereas the top 15 documents returned by pSearch and LSI have a 91.7% intersection.
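The co-location idea can be sketched in a few lines. This is a toy stand-in, not pSearch itself: the semantic vectors are written by hand rather than produced by LSI, and each document index is simply assigned to the node whose (hypothetical) position vector is most similar by cosine, instead of being routed through a real overlay network. All names are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def owner(vector: Sequence[float], node_positions: Dict[str, Sequence[float]]) -> str:
    """Pick the node whose position is most similar to the vector."""
    return max(sorted(node_positions), key=lambda name: cosine(vector, node_positions[name]))


def place_documents(
    doc_vectors: Dict[str, Sequence[float]],
    node_positions: Dict[str, Sequence[float]],
) -> Dict[str, List[str]]:
    """Store each document's index entry at the node that owns its vector."""
    placement: Dict[str, List[str]] = {name: [] for name in node_positions}
    for doc_id in sorted(doc_vectors):
        placement[owner(doc_vectors[doc_id], node_positions)].append(doc_id)
    return placement


nodes = {"n1": (1.0, 0.0), "n2": (0.0, 1.0)}
docs = {"d1": (0.9, 0.1), "d2": (0.2, 0.8), "d3": (0.8, 0.3)}
placement = place_documents(docs, nodes)

# A query is mapped with the same rule, so only one node has to be searched.
query = (1.0, 0.2)
candidates = placement[owner(query, nodes)]
print(candidates)
```

Because documents and queries are mapped by the same rule, semantically similar documents end up on the node the query reaches, which is the property the abstract credits for the reduced search cost.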
Context in Web Search - IEEE Data Engineering Bulletin, 2000
Cited by 100 (0 self)
Web search engines generally treat search requests in isolation. The results for a given query are identical, independent of the user, or the context in which the user made the request. Next-generation search engines will make increasing use of context information, either by using explicit or implicit context information from users, or by implementing additional functionality within restricted contexts. Greater use of context in web search may help increase competition and diversity on the web.
Comparing the Performance of Database Selection Algorithms, 1999
Cited by 89 (23 self)
We compare the performance of two database selection algorithms reported in the literature. Their performance is compared using a common testbed designed specifically for database selection techniques. The testbed is a decomposition of the TREC/TIPSTER data into 236 subcollections. We present results of a recent investigation of the performance of the CORI algorithm and compare the performance with earlier work that examined the performance of gGlOSS. The databases from our testbed were ranked using both the gGlOSS and CORI techniques and compared to the RBR baseline, a baseline derived from TREC relevance judgements. We examined the degree to which CORI and gGlOSS approximate this baseline. Our results confirm our earlier observation that the gGlOSS Ideal(l) ranks do not estimate relevance- This work supported in part by DARPA contract N6600197 -C-8542 and NASA GSRP NGT5-50062. y This work supported in part by NSF, the Library of Congress, and the Department of Commerce under agre...
Distributed search over the hidden web: Hierarchical database sampling and selection - In VLDB, 2002
Cited by 85 (12 self)
Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from “uncooperative” databases by using “focused query probes,” which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases. Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies. Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts.
pSearch: Information Retrieval in Structured Overlays, 2002
Cited by 68 (6 self)
We describe an efficient peer-to-peer information retrieval system, pSearch, that supports state-of-the-art content- and semantic-based full-text searches. pSearch avoids the scalability problem of existing systems that employ centralized indexing, or index/query flooding. It also avoids the nondeterminism that is exhibited by heuristic-based approaches. In pSearch, documents in the network are organized around their vector representations (based on modern document ranking algorithms) such that the search space for a given query is organized around related documents, achieving both efficiency and accuracy.
SETS: Search Enhanced by Topic Segmentation, 2003
Cited by 61 (4 self)
We present SETS, an architecture for building topic-segmented networks for efficient search. The key idea is to arrange participants in a topic-segmented topology where most of the links are short-distance links joining pairs of sites with similar content. The resulting topically focused regions are joined together into a single network by long-distance links. Queries are then matched and routed to only the topically closest regions. We draw on ideas from machine learning and social network theory to build an efficient search network. We discuss a variety of design issues and tradeoffs that an implementor of SETS would face. We show that SETS is efficient in network traffic and query processing load.
The Impact of Database Selection on Distributed Searching - SIGIR, 2000
Cited by 53 (12 self)
The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts: database selection, query processing, and results merging. In this paper we examine the effect of database selection on retrieval performance. We look at retrieval performance in three different distributed retrieval testbeds and distill some general results. First, we find that good database selection can result in better retrieval effectiveness than can be achieved in a centralized database. Second, we find that good performance can be achieved when only a few sites are selected and that the performance generally increases as more sites are selected. Finally, we find that when database selection is employed, it is not necessary to maintain collection-wide information (CWI), e.g. global idf. Local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. This work suggests that improvements in database selection can lead to broader improvements in retrieval performance, even in centralized (i.e. single database) systems. Given a centralized database and a good selection mechanism, retrieval performance can be improved by decomposing that database conceptually and employing a selection step.
QProber: A system for automatic classification of hidden-web databases - ACM TOIS, 2003
Cited by 53 (11 self)
The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web “crawlers.” Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here, we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
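To make the "number of matches per probe" idea concrete, here is a minimal sketch under stated assumptions. The search interface is modeled as a hypothetical callable `count_matches(query)` that returns only a hit count, the probe words are invented, and the decision rule (assign every category whose share of all matches reaches `min_share`) is a simplification chosen for the example rather than QProber's actual rule.

```python
from typing import Callable, Dict, List


def classify_database(
    count_matches: Callable[[str], int],
    probes: Dict[str, List[str]],
    min_share: float = 0.5,
) -> List[str]:
    """Label a database using only the match counts returned for query probes."""
    # Sum the reported match counts for each category's probes.
    totals = {cat: sum(count_matches(q) for q in qs) for cat, qs in probes.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no probe matched anything; leave the database unlabeled
    # Keep categories that account for a large enough share of all matches.
    return sorted(cat for cat, n in totals.items() if n / grand_total >= min_share)


# Stand-in for a real search interface: a fixed table of hit counts.
fake_counts = {"heart": 900, "cancer": 700, "soccer": 30, "playoffs": 10}
probes = {"Health": ["heart", "cancer"], "Sports": ["soccer", "playoffs"]}

labels = classify_database(lambda q: fake_counts.get(q, 0), probes)
print(labels)
```

No document is ever retrieved here: the classifier sees only the counts, which matches the constraint described in the abstract.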

