Results 1 - 10 of 66
Minerva: Collaborative p2p search
- In VLDB
, 2005
Cited by 53 (18 self)
This paper proposes the live demonstration of a prototype of MINERVA, a novel P2P Web search engine. The search engine is layered on top of a DHT-based overlay network that connects an a priori unlimited number of peers, each of which maintains a personal local database and a local search facility. Each peer posts a small amount of metadata to a physically distributed directory, which is used to efficiently select the peers from across the population that are best suited to execute a given query locally. The proposed demonstration serves as a proof of concept for P2P Web search by deploying the system on standard notebook PCs, and invites everybody to join the network by instantly installing a small piece of software from a USB memory stick.
Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys
- in Proceedings of IEEE ICDE
, 2007
Cited by 28 (7 self)
The suitability of Peer-to-Peer (P2P) approaches for full-text Web retrieval has recently been questioned because of the claimed unacceptable bandwidth consumption induced by retrieval from very large document collections. In this contribution we formalize a novel indexing/retrieval model that achieves high-performance, cost-efficient retrieval by indexing with highly discriminative keys (HDKs) stored in a distributed global index maintained in a structured P2P network. HDKs correspond to carefully selected terms and term sets that appear in a small number of collection documents. We provide a theoretical analysis of the scalability of our retrieval model and report experimental results obtained with our HDK-based P2P retrieval engine. These results show that, despite increased indexing costs, the total traffic generated with the HDK approach is significantly smaller than that of distributed single-term indexing strategies. Furthermore, our experiments show that the retrieval performance obtained with a random set of real queries is comparable to that of a centralized single-term solution using the best state-of-the-art BM25 relevance computation scheme. Finally, our scalability analysis demonstrates that the HDK approach can scale to large networks of peers indexing Web-size document collections, thus opening the way towards viable, truly decentralized Web retrieval.
Discovering and exploiting keyword and attribute-value co-occurrences to improve p2p routing indices
- In CIKM
, 2006
Cited by 19 (8 self)
Peer-to-Peer (P2P) search requires intelligent decisions for query routing: selecting the best peers to which a given query, initiated at some peer, should be forwarded for retrieving additional search results. These decisions are based on statistical summaries for each peer, which are usually organized on a per-keyword basis and managed in a distributed directory of routing indices. Such architectures disregard the ...
SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement
Cited by 19 (7 self)
We consider the problem of deep-web source selection and argue that existing source selection methods are inadequate because they are based on local similarity assessment. Specifically, they fail to account for the fact that sources can vary in trustworthiness and individual results can vary in importance. In response, we formulate a global measure that calculates the relevance and trustworthiness of a source based on the agreement between the answers provided by different sources. Agreement is modeled as a graph with sources as the vertices. On this agreement graph, source quality scores, called SourceRank, are calculated as the stationary visit probability of a weighted random walk. Our experiments on online databases and 675 book sources from Google Base show that SourceRank improves the relevance of the results by 25-40% over existing methods and Google Base ranking. SourceRank also decreases linearly with the corruption level of a source.
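The core computation in this abstract, a weighted random walk on the agreement graph whose stationary visit probabilities give the source scores, can be sketched roughly as follows. The four-source agreement matrix, the damping factor, and the iteration count are illustrative assumptions, not values from the paper:

```python
# Hypothetical agreement graph among four sources: W[i][j] is the
# degree to which source i's answers agree with source j's.
W = [
    [0.0, 0.6, 0.3, 0.1],
    [0.5, 0.0, 0.4, 0.1],
    [0.3, 0.5, 0.0, 0.2],
    [0.2, 0.2, 0.2, 0.0],
]

def source_rank(W, damping=0.85, iters=100):
    """Stationary visit probabilities of a weighted random walk on the
    agreement graph (power iteration with a PageRank-style reset term)."""
    n = len(W)
    # Row-normalize so each row becomes a transition distribution.
    P = [[w / sum(row) for w in row] for row in W]
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [(1 - damping) / n
             + damping * sum(r[i] * P[i][j] for i in range(n))
             for j in range(n)]
    return r

scores = source_rank(W)  # higher score = more trusted/relevant source
```

With the reset term, the score vector stays a probability distribution, so the scores can be compared directly across sources.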
P2P Content Search: Give the Web Back to the People
Cited by 17 (6 self)
The proliferation of peer-to-peer (P2P) systems has come with various compelling applications including file sharing based on distributed hash tables (DHTs) or other kinds of overlay networks. Searching the content of files (especially Web Search) requires multi-keyword querying with scoring and ranking. Existing approaches have no way of taking into account the correlation between the keywords in the query. This paper presents our solution that incorporates the queries and behavior of the users in the P2P network such that interesting correlations can be inferred.
p2pDating: Real Life Inspired Semantic Overlay Networks for Web Search
- Inf. Process. Manage
, 2005
Cited by 17 (3 self)
We consider a network of autonomous peers forming a logically global but physically distributed search engine, where every peer has its own local collection generated by independently crawling the Web. A challenging task in such systems is to efficiently route user queries to peers that can deliver high-quality results and to rank the returned results, thus satisfying the users' information need. The problem inherent in this scenario is selecting a few promising peers out of an a priori unlimited number of peers. Recent research has established a rather strict notion of semantic overlay networks: in most approaches, peers are squeezed into a semantic profile by clustering them based on their contents. In the spirit of the natural notion of autonomous peers participating in a P2P system, our strategy creates semantic overlay networks based on the notion of "peer-to-peer dating": peers are free to decide which connections they create and which they want to avoid, based on various usefulness estimators. The proposed techniques can be easily integrated into existing systems, as they require only small additional bandwidth consumption; most messages can be piggybacked onto established communication. We show how these additional semantic relations greatly benefit query routing in search engines such as MINERVA and in the JXP algorithm, which computes the PageRank authority measure in a completely decentralized manner.
DAW: Duplicate-aware federated query processing over the Web of Data
- In ISWC
, 2013
Cited by 17 (10 self)
Over the last years, the Web of Data has developed into a large compendium of interlinked data sets from multiple domains. Due to the decentralised architecture of this compendium, several of these datasets contain duplicated data. Yet, so far, only little attention has been paid to the effect of duplicated data on federated querying. This work presents DAW, a novel duplicate-aware approach to federated querying over the Web of Data. DAW is based on a combination of min-wise independent permutations and compact data summaries. It can be directly combined with existing federated query engines in order to achieve the same query recall values while querying fewer data sources. We extend three well-known federated query processing engines (DARQ, SPLENDID, and FedX) with DAW and compare our extensions with the original approaches. The comparison shows that DAW can greatly reduce the number of queries sent to the endpoints while keeping high query recall values, and can therefore significantly improve the performance of federated query processing engines. Moreover, DAW provides a source selection mechanism that maximises the query recall when query processing is limited to a subset of the sources.
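The min-wise independent permutations mentioned in the abstract underlie MinHash-style overlap estimation between the item sets of two sources. A minimal sketch of that idea, where the salted-SHA1 hash family, the sketch size k, and the toy sources are illustrative assumptions rather than DAW's actual summaries:

```python
import hashlib

def minhash_sketch(items, k=64):
    """Compact summary: for each of k hash functions (simulated by
    salting one hash), keep the minimum hash value over all items."""
    items = list(items)
    sketch = []
    for i in range(k):
        salt = str(i).encode()
        sketch.append(min(
            int.from_bytes(hashlib.sha1(salt + s.encode()).digest()[:8], "big")
            for s in items))
    return sketch

def jaccard_estimate(s1, s2):
    """The fraction of matching per-permutation minima is an unbiased
    estimate of the Jaccard overlap of the underlying sets."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

# Two hypothetical sources with overlapping data items.
source_a = {f"triple{i}" for i in range(100)}
source_b = {f"triple{i}" for i in range(80, 180)}
overlap = jaccard_estimate(minhash_sketch(source_a), minhash_sketch(source_b))
```

A federated engine can skip a source whose summary indicates its contribution is largely duplicated by sources already selected.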
Improving distributed join efficiency with extended bloom filter operations
- In 21st International Conference on Advanced Information Networking and Applications (AINA-07), IEEE Computer Society
, 2007
Cited by 16 (4 self)
Bloom filter based algorithms have proven successful as a very efficient technique to reduce the communication costs of database joins in a distributed setting. However, the full potential of Bloom filters has not yet been exploited. Especially in the case of multi-joins, where the data is distributed among several sites, additional optimization opportunities arise, which require new Bloom filter operations and computations. In this paper, we present these extensions and point out how they improve the performance of such distributed joins. While the paper focuses on efficient join computation, the described extensions are applicable to a wide range of use cases where Bloom filters serve as a compressed set representation.
IQN routing: Integrating quality and novelty in p2p querying and ranking
- In EDBT
, 2006
Cited by 16 (8 self)
We consider a collaboration of peers autonomously crawling the Web. A pivotal issue when designing a peer-to-peer (P2P) Web search engine in this environment is query routing: selecting a small subset of (a potentially very large number of relevant) peers to contact to satisfy a keyword query. Existing approaches for query routing work well on disjoint data sets. Naturally, however, the peers' data collections often overlap heavily, as popular documents are crawled by many peers. Techniques for estimating the cardinality of the overlap between sets, designed for and incorporated into information retrieval engines, are very much lacking. In this paper we present a comprehensive evaluation of appropriate overlap estimators, showing how they can be incorporated into an efficient, iterative approach to query routing, coined Integrated Quality Novelty (IQN). We propose to further enhance our approach using histograms, combining overlap estimation with the available score/ranking information. Finally, we conduct a performance evaluation in MINERVA, our prototype P2P Web search engine.
Global document frequency estimation in peer-to-peer web search
- In WebDB
, 2006
Cited by 15 (7 self)
Information retrieval (IR) in peer-to-peer (P2P) networks, where the corpus is spread across many loosely coupled peers, has recently gained importance. In contrast to IR systems on a centralized server or server farm, P2P IR faces the additional challenge of either being oblivious to global corpus statistics or having to compute the global measures from local statistics at the individual peers in an efficient, distributed manner. One specific measure of interest is the global document frequency of different terms, which would be very beneficial as term-specific weights in the scoring and ranking of merged search results obtained from different peers. This paper presents an efficient solution to the problem of estimating global document frequencies in a large-scale P2P network with very high dynamics, where peers can join and leave the network on short notice. In particular, the developed method takes into account the fact that the local document collections of autonomous peers may arbitrarily overlap, so that global counting needs to be duplicate-insensitive. The method is based on hash sketches as a technique for compact data synopses. Experimental studies demonstrate the estimator's accuracy, scalability, and ability to cope with high dynamics. Moreover, the benefit for ranking P2P search results is shown by experiments with real-world Web data and queries.
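Hash sketches in the Flajolet-Martin style support exactly this kind of duplicate-insensitive counting, because merging two sketches is a bitwise OR: a document counted at two overlapping peers sets the same bit twice. A minimal sketch of the idea, with parameter choices and document ids that are illustrative, not the paper's:

```python
import hashlib

def fm_sketch(doc_ids, bits=32):
    """Flajolet-Martin hash sketch: for each document id, set the bit at
    position rho(h(d)), where rho is the position of the lowest set bit
    of the hash. Re-adding the same id is a no-op."""
    sketch = 0
    for d in doc_ids:
        h = int.from_bytes(hashlib.sha1(d.encode()).digest()[:8], "big")
        rho = (h & -h).bit_length() - 1 if h else bits - 1
        sketch |= 1 << min(rho, bits - 1)
    return sketch

def merge(s1, s2):
    """Bitwise OR merges sketches; counting stays duplicate-insensitive."""
    return s1 | s2

def estimate(sketch, phi=0.77351):
    """Distinct-count estimate 2^R / phi, where R is the position of the
    lowest unset bit (single-sketch estimates have high variance)."""
    r = 0
    while sketch >> r & 1:
        r += 1
    return (1 << r) / phi

# Two peers with overlapping local collections; OR-merging counts each
# shared document only once toward the global document frequency.
peer1 = fm_sketch(f"doc{i}" for i in range(500))
peer2 = fm_sketch(f"doc{i}" for i in range(300, 800))
global_est = estimate(merge(peer1, peer2))
```

In practice many independent sketches are averaged to reduce the variance; the key property used here is only that the merge is idempotent under duplicates.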