The Architecture of PIER: an Internet-Scale Query Processor
In CIDR, 2005
Cited by 59 (5 self)
This paper presents the architecture of PIER, an Internet-scale query engine we have been building over the last three years. PIER is the first general-purpose relational query processor targeted at a peer-to-peer (p2p) architecture of thousands or millions of participating nodes on the Internet. It supports massively distributed, database-style dataflows for snapshot and continuous queries. It is intended to serve as a building block for a diverse set of Internet-scale information-centric applications, particularly those that tap into the standardized data readily available on networked machines, including packet headers, system logs, and file names.
Delay aware querying with Seaweed
In VLDB, 2006
Cited by 22 (1 self)
Large, highly distributed data sets are poorly supported by current query technologies. Applications such as endsystem-based network management are characterized by data stored on large numbers of endsystems, with frequent local updates and relatively infrequent global one-shot queries. The challenges are scale (10^3 to 10^9 endsystems) and endsystem unavailability. In such large systems, a significant fraction of endsystems, and their data, will be unavailable at any given time. Existing methods to provide high data availability despite endsystem unavailability involve centralizing, redistributing, or replicating the data. These methods do not scale to such sizes. We advocate a design that trades query delay for completeness, incrementally returning results as endsystems become available. We also introduce the idea of completeness prediction, which provides the user with explicit feedback about this delay/completeness trade-off. Completeness prediction is based on replication of compact data summaries and availability models. This metadata is orders of magnitude smaller than the data. Seaweed is a scalable query infrastructure supporting online aggregation and completeness prediction. Seaweed is built on a distributed hash table (DHT), but unlike previous DHT-based approaches it does not redistribute data across the network. It exploits the DHT infrastructure for failure-resilient metadata replication, query dissemination, and result aggregation. We analytically compare Seaweed’s scalability against other approaches and present an evaluation of the Seaweed prototype running on a large-scale network simulator driven by real-world traces.
Gossip-based search selection in hybrid peer-to-peer networks
In Proceedings of the 5th International Workshop on Peer-to-Peer Systems (IPTPS’06), 2006
Cited by 15 (0 self)
We present GAB, a search algorithm for hybrid P2P networks, that is, networks that search using both flooding and a DHT. GAB uses a gossip-style algorithm to collect global statistics about document popularity to allow each peer to make intelligent decisions about which search style to use for a given query. Moreover, GAB automatically adapts to changes in the operating environment. Synthetic and trace-driven simulations show that compared to a simple hybrid approach, GAB reduces the response time by 25-50% and the average query bandwidth cost by 45%, with no loss in recall. GAB scales well, with only a 7% degradation in performance despite a tripling in system size.
Xpath lookup queries in p2p networks
In WIDM’04: Proceedings of the 6th annual ACM international workshop on Web information and data management, 2004
Cited by 14 (0 self)
We address the problem of querying XML data over a P2P network. In P2P networks, the allowed kinds of queries are usually exact-match queries over file names. We discuss the extensions needed to deal with XML data and XPath queries. A single peer can hold a whole document or a partial or complete fragment of one. Each XML fragment or document is identified by a distinct path expression, which is encoded in a distributed hash table. Our framework differs from content-based routing mechanisms, which are biased towards finding the most relevant peers holding the data. We perform fragment placement and enable fragment lookup by exploiting only a few path expressions stored on each peer. By taking advantage of quasi-zero replication of global catalogs, our system supports fast full and partial XPath querying. To this end, we have extended the Chord simulator and performed an experimental evaluation of our approach.
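The idea of keying a fragment by its path expression in a distributed hash table can be illustrated with a minimal sketch. It is not the paper's placement scheme: the 16-bit ring, the `ring_id` and `responsible_peer` helpers, the peer names, and the sample path are all assumptions made for illustration, and the lookup is a plain Chord-style successor rule.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 16  # small identifier space, chosen only to keep the sketch readable


def ring_id(text: str) -> int:
    """Hash a string (peer name or path expression) onto the identifier ring."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


def responsible_peer(path_expr: str, peer_ids: list[int]) -> int:
    """Return the first peer id at or after the key, wrapping around the ring."""
    ordered = sorted(peer_ids)
    key = ring_id(path_expr)
    idx = bisect_left(ordered, key)  # first peer id >= key
    return ordered[idx % len(ordered)]  # wrap to the smallest id if none is larger


# Hypothetical peers and a hypothetical path expression.
peers = [ring_id(f"peer-{i}") for i in range(8)]
owner = responsible_peer("/library/book/title", peers)
```

Because the key depends only on the path expression, any peer can recompute it and route a lookup for the same path to the same owner.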
Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks
In P2PIR, 2006
Cited by 14 (5 self)
The state-of-the-art techniques for processing multi-term queries in P2P environments are query flooding and inverted list intersection. However, it has been shown that, for scalability reasons, both methods fail to support full-text search in large-scale document collections distributed among the nodes in a P2P network. Although a number of optimizations based on these techniques have been suggested recently, little evidence is given on their scalability. In this paper we suggest a novel query-driven indexing strategy which generates and maintains only those index entries that are actually used for query processing. In our approach, called Distributed Cache Table (DCT), we propose abandoning the distinction between data indexing and query caching, and storing result sets (caches) for the most profitable queries. DCT employs a distributed index to efficiently locate caches that can answer a given multi-term query and broadcasts the query to all the peers only if no such caches are found. Evaluations on real data and query loads show that DCT converges to a high cache-hit ratio and indeed offers a large-scale distributed solution for storing and efficiently querying vast amounts of documents in the P2P setting. DCT achieves two orders of magnitude improvement in traffic consumption compared to a standard distributed single-term indexing approach.
Searching dynamic communities with personal indexes
In Proc. 4th Intl. Semantic Web Conf., ISWC 2005, volume 3729 of LNCS, pages 491-505, 2005
Cited by 11 (3 self)
Often the challenge of finding relevant information is reduced to finding the ’right’ people who will answer our question. In this paper we present innovative algorithms called INGA (Interest-based Node Grouping Algorithms), which integrate personal routing indices into semantic query processing to boost performance. Similar to social networks, peers in INGA cooperate to efficiently route queries for documents along adaptive shortcut-based overlays using only local, but semantically well-chosen, information. We propose active and passive shortcut creation strategies for index building and a novel algorithm to select the most promising content providers depending on each peer index with respect to the individual query. We quantify the benefit of our indexing strategy by extensive performance experiments in the SWAP simulation infrastructure. While obtaining high recall values compared to other state-of-the-art algorithms, we show that INGA improves recall and reduces the number of messages significantly.
Efficient processing of XPath queries with structured overlay networks
In ODBASE’05, Agia, 2005
Cited by 11 (2 self)
Non-trivial search predicates beyond mere equality are a current focus of P2P research. Structured queries, as an important type of non-trivial search, have so far been studied extensively mainly for unstructured P2P systems. As unstructured P2P systems do not use indexing, structured queries are very easy to implement since they can be treated like any other type of query. However, this comes at the expense of very high bandwidth consumption and limitations in terms of the guarantees and expressiveness that can be provided. Structured P2P systems are an efficient alternative as they typically offer logarithmic search complexity in the number of peers. Though the use of a distributed index (typically a distributed hash table) makes the implementation of structured queries more efficient, it also introduces considerable complexity, and thus only a few approaches exist so far. In this paper we present a first solution for efficiently supporting structured queries, more specifically XPath queries, in structured P2P systems. For the moment we focus on supporting queries with descendant axes (“//”) and wildcards (“*”) and do not address joins. The results presented in this paper provide basic foundational functionality to be used by higher-level query engines for more efficient, complex query support.
XML processing in DHT networks
In ICDE, 2008
Cited by 10 (4 self)
We study the scalable management of XML data in P2P networks based on distributed hash tables (DHTs). We identify performance limitations in this context, and propose an array of techniques to lift them. First, we adapt the DHT platform’s index store and communication primitives to the needs of massive data processing. Second, we introduce a distributed hierarchical index and associated efficient algorithms to speed up query processing. Third, we present an innovative, XML-specific flavor of Bloom filters, to reduce data transfers entailed by query processing. Our approach is fully implemented in the KadoP system, used in a real-life software manufacturing application. Our experiments demonstrate the benefits of the proposed techniques.
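To see why a Bloom filter can reduce the data shipped between peers, consider the following minimal sketch of a plain Bloom filter. It is not the XML-specific variant the abstract refers to, whose design is not described here; the `BloomFilter` class, its parameters, and the document ids are hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: compact set summary with no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# One peer summarizes the ids it holds (hypothetical values) ...
remote_ids = BloomFilter()
for doc_id in ["doc-1", "doc-7", "doc-9"]:
    remote_ids.add(doc_id)

# ... and another peer ships only the candidates that may match.
local_candidates = ["doc-1", "doc-2", "doc-7", "doc-8"]
to_ship = [d for d in local_candidates if remote_ids.might_contain(d)]
```

The filter itself is only `num_bits` bits, so sending it first and filtering locally can be cheaper than shipping every candidate, at the cost of occasional false positives.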
Reliable Storage and Querying for Collaborative Data Sharing Systems
Cited by 9 (5 self)
The sciences, business confederations, and medicine urgently need infrastructure for sharing data and updates among collaborators’ constantly changing, heterogeneous databases. The ORCHESTRA system addresses these needs by providing data transformation and exchange capabilities across DBMSs, combined with archived storage of all database versions. ORCHESTRA adopts a peer-to-peer architecture in which individual collaborators contribute data and compute resources, but where there may be no dedicated server or compute cluster. We study how to take the combined resources of ORCHESTRA’s autonomous nodes, as well as PCs from “cloud” services such as Amazon EC2, and provide reliable, cooperative storage and query processing capabilities. We guarantee reliability and correctness as in distributed or cloud DBMSs, while also supporting cross-domain deployments, replication, and transparent failover, as provided by peer-to-peer systems. Our storage and query subsystem supports dozens to hundreds of nodes across different domains, possibly including nodes on cloud services. Our contributions include (1) a modified data partitioning substrate that combines cluster and peer-to-peer techniques, (2) an efficient implementation of replicated, reliable, versioned storage of relational data, (3) new query processing and indexing techniques over this storage layer, and (4) a mechanism for incrementally recomputing query results that ensures correct, complete, and duplicate-free results in the event of node failure during query execution. We experimentally validate query processing performance, failure detection methods, and the performance benefits of incremental recovery in a prototype implementation.
Finding Rare Data Objects in P2P File-Sharing Systems
2005
Cited by 5 (4 self)
Peer-to-peer file-sharing systems have hundreds of thousands of users sharing petabytes of data; however, their search functionality is limited. In general, query results contain many references to the same data object. These references are grouped, and the size of the group (the number of references it contains) is the typical ranking metric. Although group size is effective in finding popular data, it works poorly for rare, less popular data. Other ranking functions, such as precision and cosine similarity, are more appropriate in this case. We show the significant performance benefit in finding rare data using these ranking functions through extensive simulation.
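The difference between the two ranking metrics can be illustrated with a small sketch. The reference data, the query, and the choice to score a group by its best-matching file name are assumptions made for illustration; the paper's exact definitions may differ.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(query: str, text: str) -> float:
    """Cosine similarity between term-frequency vectors of two strings."""
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    dot = sum(q[w] * t[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in t.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical query results: (data object id, file name of one reference).
references = [
    ("A", "demo live"),
    ("A", "demo mix"),
    ("A", "demo"),
    ("B", "rare demo tape"),
]

# Group references that point to the same data object.
groups = defaultdict(list)
for object_id, name in references:
    groups[object_id].append(name)

query = "rare demo"

# Typical metric: larger groups (more popular objects) rank first.
by_group_size = sorted(groups, key=lambda g: len(groups[g]), reverse=True)

# Alternative: rank each group by its best-matching file name.
by_cosine = sorted(
    groups,
    key=lambda g: max(cosine_similarity(query, n) for n in groups[g]),
    reverse=True,
)
```

With this data, group size puts the popular object A first, while cosine similarity puts the rare object B first because its single file name matches both query terms.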

