Results 1 - 10
of
27
The Piazza Peer Data Management Project
, 2003
"... A major problem in today's information-driven world is that sharing heterogeneous, semantically rich data is incredibly difficult. Piazza is a peer data management system that enables sharing heterogeneous data in a distributed and scalable way. Piazza assumes the participants to be interested in sh ..."
Abstract
-
Cited by 41 (3 self)
- Add to MetaCart
A major problem in today's information-driven world is that sharing heterogeneous, semantically rich data is incredibly difficult. Piazza is a peer data management system that enables sharing heterogeneous data in a distributed and scalable way. Piazza assumes the participants to be interested in sharing data, and willing to define pairwise mappings between their schemas. Then, users formulate queries over their preferred schema, and a query answering system expands recursively any mappings relevant to the query, retrieving data from other peers. In this paper, we provide a brief overview of the Piazza project including our work on developing mapping languages and query reformulation algorithms, assisting the users in defining mappings, indexing, and enforcing access control over shared data.
Service-Based Distributed Querying on the Grid
- IN PROC. OF THE 1ST INT. CONF. ON SERVICE ORIENTED COMPUTING
, 2003
"... Service-based approaches (such as Web Services and the Open Grid Services Architecture) have gained considerable attention recently for supporting distributed application development in e-business and e-science. The emergence of a service-oriented view of hardware and software resources raises t ..."
Abstract
-
Cited by 32 (21 self)
- Add to MetaCart
Service-based approaches (such as Web Services and the Open Grid Services Architecture) have gained considerable attention recently for supporting distributed application development in e-business and e-science. The emergence of a service-oriented view of hardware and software resources raises the question as to how database management systems and technologies can best be deployed or adapted for use in such an environment. This paper explores one aspect of service-based computing and data management, viz., how to integrate query processing technology with a service-based Grid. The paper describes in detail the design and implementation of a service-based distributed query processor for the Grid. The query processor is service-based in two orthogonal senses: firstly, it supports querying over data storage and analysis resources that are made available as services, and, secondly, its internal architecture factors out as services the functionalities related to the construction of distributed query plans on the one hand, and to their execution over the Grid on the other. The resulting system both provides a declarative approach to service orchestration in the Grid, and demonstrates how query processing can benefit from dynamic access to computational resources on the Grid.
Bypass caching: Making scientific databases good network citizens
- In: ICDE. (2005
, 2005
"... Scientific database federations are geographically distributed and network bound. Thus, they could benefit from proxy caching. However, existing caching techniques are not suitable for their workloads, which compare and join large data sets. Existing techniques reduce parallelism by conducting distr ..."
Abstract
-
Cited by 13 (5 self)
- Add to MetaCart
Scientific database federations are geographically distributed and network bound. Thus, they could benefit from proxy caching. However, existing caching techniques are not suitable for their workloads, which compare and join large data sets. Existing techniques reduce parallelism by conducting distributed queries in a single cache and lose the data reduction benefits of performing selections at each database. We develop the bypass-yield formulation of caching, which reduces network traffic in wide-area database federations, while preserving parallelism and data reduction. Bypass-yield caching is altruistic; caches minimize the overall network traffic generated by the federation, rather than focusing on local performance. We present an adaptive, workload-driven algorithm for managing a bypass-yield cache. We also develop on-line algorithms that make no assumptions about workload: a k-competitive deterministic algorithm and a randomized algorithm with minimal space complexity. We verify the efficacy of bypass-yield caching by running workload traces collected from the Sloan Digital Sky Survey through a prototype implementation.
The design and implementation of ogsa-dqp: A service-based distributed query processo
- FUTURE GENERATION COMPUTER SYSTEMS 25 (3)
, 2009
"... Service-based approaches are rising to prominence because of their potential to meet the requirements for distributed application development in e-business and e-science. The emergence of a service-oriented view of hardware and software resources raises the question as to how database management sys ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
Service-based approaches are rising to prominence because of their potential to meet the requirements for distributed application development in e-business and e-science. The emergence of a service-oriented view of hardware and software resources raises the question as to how database management systems and technologies can best be deployed or adapted for use in such an environment. This paper explores one aspect of service-based computing and data management, viz., how to integrate query processing technology with a service-based architecture suitable for a Grid environment. The paper addresses this by describing in detail the design and implementation of a service-based distributed query processor. The query processor is service-based in two orthogonal senses: firstly, it supports querying over data storage and analysis resources that are made available as services, and, secondly, its internal architecture factors out as services the functionalities related to the construction and execution of distributed query plans. The resulting system both provides a declarative approach to service orchestration, and demonstrates how query processing can benefit from a service-based architecture. As well as describing and motivating the architecture used, the paper also describes usage scenarios, and, using a bioinformatics application, presents performance results that benchmark the system and illustrate the benefits provided by the service-based architecture.
A workload-driven unit of cache replacement for mid-tier database caching
- In DASFAA ’07, 12th International Conference on Database Systems for Advanced Applications
, 2007
"... Abstract. Making multi-terabyte scientific databases publicly accessible over the Internet is increasingly important in disciplines such as Biology and Astronomy. However, contention at a centralized, backend database is a major performance bottleneck, limiting the scalability of Internet-based, dat ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
Abstract. Making multi-terabyte scientific databases publicly accessible over the Internet is increasingly important in disciplines such as Biology and Astronomy. However, contention at a centralized, backend database is a major performance bottleneck, limiting the scalability of Internet-based, database applications. Midtier caching reduces contention at the backend database by distributing database operations to the cache. To improve the performance of mid-tier caches, we propose the caching of query prototypes, a workload-driven unit of cache replacement in which the cache object is chosen from various classes of queries in the workload. In existing mid-tier caching systems, the storage organization in the cache is statically defined. Our approach adapts cache storage to workload changes, requires no prior knowledge about the workload, and is transparent to the application. Experiments over a one-month, 1.4 million query Astronomy workload demonstrate up to 70% reduction in network traffic and reduce query response time by up to a factor of three when compared with alternative units of cache replacement. 1
AstroDAS: Sharing Assertions across Astronomy Catalogues through
- Distributed Annotation, International Provenance and Annotation Workshop (I-PAW), 2006, Springer-Verlag LNCS
, 2006
"... Abstract. As diverse scientific data collections migrate online, researchers want the ability to share their assertions regarding the entities that span these disparate databases. We focus on a case study provided by the astronomical community’s Virtual Observatory effort to investigate the use of a ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Abstract. As diverse scientific data collections migrate online, researchers want the ability to share their assertions regarding the entities that span these disparate databases. We focus on a case study provided by the astronomical community’s Virtual Observatory effort to investigate the use of annotation to record and share the celestial object mappings asserted by different research groups. The prototype for our Astronomy Distributed Annotation System (AstroDAS) complements the existing OpenSkyQuery tools for federated database queries, and provides web service methods to allow clients to create and store mapping annotations as relational database tuples on annotation servers. We expect the mechanisms for creating and querying annotations in AstroDAS can be extended to assist with tasks other than entity mapping, in other domains with relational data sources. 1
LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases
"... Workloads that comb through vast amounts of data are gaining importance in the sciences. These workloads consist of “needle in a haystack ” queries that are long running and data intensive so that query throughput limits performance. To maximize throughput for data-intensive queries, we put forth Li ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
Workloads that comb through vast amounts of data are gaining importance in the sciences. These workloads consist of “needle in a haystack ” queries that are long running and data intensive so that query throughput limits performance. To maximize throughput for data-intensive queries, we put forth LifeRaft: a query processing system that batches queries with overlapping data requirements. Rather than scheduling queries in arrival order, LifeRaft executes queries concurrently against an ordering of the data that maximizes data sharing among queries. This decreases I/O and increases cache utility. However, such batch processing can increase query response time by starving interactive workloads. LifeRaft addresses starvation using techniques inspired by head scheduling in disk drives. Depending upon the workload saturation and queuing times, the system adaptively and incrementally trades-off processing queries in arrival order and data-driven batch processing. Evaluating LifeRaft in the SkyQuery federation of astronomy databases reveals a two-fold improvement in query throughput. 1.
Data Exploration of Turbulence Simulations using a Database Cluster ABSTRACT
"... We describe a new environment for the exploration of turbulent flows that uses a cluster of databases to store complete histories of Direct Numerical Simulation (DNS) results. This allows for spatial and temporal exploration of highresolution data that were traditionally too large to store and too c ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
We describe a new environment for the exploration of turbulent flows that uses a cluster of databases to store complete histories of Direct Numerical Simulation (DNS) results. This allows for spatial and temporal exploration of highresolution data that were traditionally too large to store and too computationally expensive to produce on demand. We perform analysis of these data directly on the databases nodes, which minimizes the volume of network traffic. The low network demands enable us to provide public access to this experimental platform and its datasets through Web services. This paper details the system design and implementation. Specifically, we focus on hierarchical spatial indexing, cache-sensitive spatial scheduling of batch workloads, localizing computation through data partitioning, and load balancing techniques that minimize data movement. We provide real examples of how scientists use the system to perform high-resolution turbulence research from standard desktop computing environments. 1.
A Black-Box Approach to Query Cardinality Estimation
- IN PROC. CIDR
, 2007
"... We present a “black-box” approach to estimating query cardinality that has no knowledge of query execution plans and data distribution, yet provides accurate estimates. It does so by grouping queries into syntactic families and learning the cardinality distribution of that group directly from points ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
We present a “black-box” approach to estimating query cardinality that has no knowledge of query execution plans and data distribution, yet provides accurate estimates. It does so by grouping queries into syntactic families and learning the cardinality distribution of that group directly from points in a high-dimensional input space constructed from the query’s attributes, operators, function arguments, aggregates, and constants. We envision an increasing need for such an approach in applications in which query cardinality is required for resource optimization and decision-making at locations that are remote from the data sources. Our primary case study is the Open SkyQuery federation of Astronomy archives, which uses a scheduling and caching mechanism at the mediator for execution of federated queries at remote sources. Experiments using real workloads show that the black-box approach produces accurate estimates and is frugal in its use of space and in computation resources. Also, the black-box approach provides dramatic improvements in the performance of caching in Open SkyQuery.
Minimizing Communication Cost in Distributed Multi-query Processing
, 2009
"... Increasing prevalence of large-scale distributed monitoring and computing environments such as sensor networks, scientific federations, Grids etc., has led to a renewed interest in the area of distributed query processing and optimization. In this paper we address a general, distributed multiquery ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
Increasing prevalence of large-scale distributed monitoring and computing environments such as sensor networks, scientific federations, Grids etc., has led to a renewed interest in the area of distributed query processing and optimization. In this paper we address a general, distributed multiquery processing problem motivated by the need to minimize the communication cost in these environments. Specifically we address the problem of optimally sharing data movement across the communication edges in a distributed communication network given a set of overlapping queries and query plans for them (specifying the operations to be executed). Most of the problem variations of our general problem can be shown to be NP-Hard by a reduction from the Steiner tree problem. However, we show that the problem can be solved optimally if the communication network is a tree, and present a novel algorithm for finding an optimal data movement plan. For general communication networks, we present efficient approximation algorithms for several variations of the problem. Finally, we present an experimental study over synthetic datasets showing both the need for exploiting the sharing of data movement and the effectiveness of our algorithms at finding such plans.

