Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications
- In Proc. of the 2006 Intl. Conf. on Very Large Data Bases
, 2006
Cited by 16 (5 self)
We present the NIMO system that automatically learns cost models for predicting the execution time of computational-science applications running on large-scale networked utilities such as computational grids. Accurate cost models are important for selecting efficient plans for executing these applications on the utility. Computational-science applications are often scripts (written, e.g., in languages like Perl or Matlab) connected using a workflow-description language, and therefore pose different challenges compared to modeling the execution of plans for declarative queries with well-understood semantics. NIMO generates appropriate training samples for these applications to learn fairly-accurate cost models quickly using statistical learning techniques. NIMO’s approach is active and noninvasive: it actively deploys and monitors the application under varying conditions, and obtains its training data from passive instrumentation streams that require no changes to the operating system or applications. Our experiments with real scientific applications demonstrate that NIMO significantly reduces the number of training samples and the time to learn fairly-accurate cost models.
Perm: Processing Provenance and Data on the same Data Model through Query Rewriting
- In ICDE ’09: Proceedings of the 25th International Conference on Data Engineering
, 2009
Cited by 10 (3 self)
Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored and optimized using standard relational database techniques. In the paper, we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information while inducing only a small overhead on normal operations.
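The rewriting idea in this abstract can be illustrated with a small, hand-written example. The sketch below is not the Perm implementation and does not use its rewrite rules. It uses Python's built-in `sqlite3` module (rather than PostgreSQL) and hypothetical tables `shop` and `sale`; the `prov_*` column names are an assumed convention. It only shows that a query extended with the contributing source attributes still returns an ordinary relation.

```python
import sqlite3

# Hypothetical example data; none of these tables come from the paper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shop (name TEXT, city TEXT);
    CREATE TABLE sale (shop TEXT, amount INTEGER);
    INSERT INTO shop VALUES ('A', 'Zurich'), ('B', 'Basel');
    INSERT INTO sale VALUES ('A', 10), ('A', 5), ('B', 7);
""")

# Original query: which shops have at least one sale above 6?
original = """
    SELECT DISTINCT s.name
    FROM shop s JOIN sale t ON s.name = t.shop
    WHERE t.amount > 6
"""

# Hand-written rewrite (assumption: prov_* naming, DISTINCT dropped so that
# every contributing combination of source tuples appears as its own row).
rewritten = """
    SELECT s.name,
           s.name AS prov_shop_name, s.city AS prov_shop_city,
           t.shop AS prov_sale_shop, t.amount AS prov_sale_amount
    FROM shop s JOIN sale t ON s.name = t.shop
    WHERE t.amount > 6
"""

def run(sql):
    """Execute a query and return its rows in sorted order."""
    return sorted(conn.execute(sql).fetchall())

original_rows = run(original)
provenance_rows = run(rewritten)
```

Because the rewritten result is a plain relation, it can be filtered or joined with ordinary SQL, for example to list only the `sale` tuples that contributed to a given shop appearing in the result.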
The design and implementation of OGSA-DQP: A service-based distributed query processor
- FUTURE GENERATION COMPUTER SYSTEMS 25 (3)
, 2009
Cited by 7 (3 self)
Service-based approaches are rising to prominence because of their potential to meet the requirements for distributed application development in e-business and e-science. The emergence of a service-oriented view of hardware and software resources raises the question as to how database management systems and technologies can best be deployed or adapted for use in such an environment. This paper explores one aspect of service-based computing and data management, viz., how to integrate query processing technology with a service-based architecture suitable for a Grid environment. The paper addresses this by describing in detail the design and implementation of a service-based distributed query processor. The query processor is service-based in two orthogonal senses: firstly, it supports querying over data storage and analysis resources that are made available as services, and, secondly, its internal architecture factors out as services the functionalities related to the construction and execution of distributed query plans. The resulting system both provides a declarative approach to service orchestration, and demonstrates how query processing can benefit from a service-based architecture. As well as describing and motivating the architecture used, the paper also describes usage scenarios, and, using a bioinformatics application, presents performance results that benchmark the system and illustrate the benefits provided by the service-based architecture.
Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options
- In Proc. 3rd Intl. Conference on Autonomic Computing
, 2006
Cited by 6 (5 self)
Writing parallel programs that can take advantage of non-dedicated processors is much more difficult than writing such programs for networks of dedicated processors. In a non-dedicated environment, such programs must use autonomic techniques to respond to the unpredictable load fluctuations that prevail in the
Deriving and managing data products in an environmental observation and forecasting system
- Proc. of Conference on Innovative Data Systems Research (CIDR)
, 2005
Harnessing Grid Resources with Data-Centric Task Farms
, 2007
Cited by 5 (4 self)
As the size of scientific data sets and the resources required for analysis increase, data locality becomes crucial to the efficient use of large-scale distributed systems for scientific and data-intensive applications. In order to support interactive analysis of large quantities of data in many scientific disciplines, we propose a data diffusion approach, in which the resources required for data analysis are acquired dynamically, in response to demand. Acquired resources (compute and storage) can be “cached” for some time, thus allowing more rapid responses to subsequent requests. We define an abstract model for data-centric task farms as a common parallel pattern that drives the independent computational tasks, taking into consideration the data locality in order to optimize the performance of the analysis of large datasets. This approach can provide the benefits of dedicated hardware without the associated high costs. We will validate our abstract model through discrete-event simulations; we expect simulations to show the model is both efficient and scalable given a wide range of simulation parameters. To explore the practical realization of our abstract model, we have developed a Fast and Light-weight tasK executiON framework (Falkon). Falkon provides for dynamic acquisition and release of resources, data management capabilities, and the dispatch of analysis tasks via a data-aware scheduler. We have
Early Hash Join: A Configurable Algorithm for the Efficient and Early Production of Join Results
, 2005
Cited by 5 (0 self)
Minimizing both the response time to produce the first few thousand results and the overall execution time is important for interactive querying. Current join algorithms either minimize the execution time at the expense of response time or minimize response time by producing results early without optimizing the total time. We present a hash-based join algorithm, called early hash join, which can be dynamically configured at any point during join processing to trade off faster production of results for overall execution time. We demonstrate that varying how inputs are read has a major effect on these two factors and provide formulas that allow an optimizer to calculate the expected rate of join output and the number of I/O operations performed using different input reading strategies. Experimental results show that early hash join performs significantly fewer I/O operations and executes faster than other early join algorithms, especially for one-to-many joins. Its overall execution time is comparable to standard hybrid hash join, but its response time is an order of magnitude faster. Thus, early hash join can replace hybrid hash join in any situation where a fast initial response time is beneficial without the penalty in overall execution time exhibited by other early join algorithms.
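To make "producing results early" concrete, here is a minimal in-memory sketch of a symmetric hash join, a simpler and well-known early-result join. It is not the early hash join algorithm from the paper: it has no memory limit, no flushing to disk, and only one fixed reading strategy (alternating between the two inputs). The function name, the key extractors and the sample relations are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import zip_longest

def symmetric_hash_join(left, right, key_left, key_right):
    """Yield (left_row, right_row) pairs as soon as both rows have been read.

    Simplified illustration only: everything stays in memory and the two
    inputs are read alternately, one row at a time.
    """
    left_table = defaultdict(list)    # join key -> left rows seen so far
    right_table = defaultdict(list)   # join key -> right rows seen so far
    missing = object()                # marks the end of the shorter input

    for l_row, r_row in zip_longest(left, right, fillvalue=missing):
        if l_row is not missing:
            k = key_left(l_row)
            left_table[k].append(l_row)
            # Probe the rows already read from the other input.
            for match in right_table.get(k, []):
                yield (l_row, match)
        if r_row is not missing:
            k = key_right(r_row)
            right_table[k].append(r_row)
            for match in left_table.get(k, []):
                yield (match, r_row)

# Hypothetical one-to-many example: customers (id, name), orders (id, customer_id).
customers = [(1, "ann"), (2, "bob")]
orders = [(101, 1), (102, 2), (103, 1)]

results = list(symmetric_hash_join(customers, orders,
                                   key_left=lambda c: c[0],
                                   key_right=lambda o: o[1]))
```

In this example the first joined pair is available after only one row has been read from each input, whereas a join that first builds a hash table on one whole input cannot emit anything until that input is exhausted.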
Data Stream Sharing
- In Proc. of the Intl. Workshop on Pervasive Information Management
, 2005
Cited by 5 (4 self)
Recent research efforts in the fields of data stream processing and data stream management systems (DSMSs) show the increasing importance of processing data streams, e.g., in the e-science domain. Together with the advent of peer-to-peer (P2P) networks and grid computing, this leads to the necessity of developing new techniques for distributing and processing continuous queries over data streams in such networks. In this paper, we present a novel approach for optimizing the integration, distribution, and execution of newly registered continuous queries over data streams in grid-based P2P networks. We introduce Windowed XQuery (WXQuery), our XQuery-based subscription language for continuous queries over XML data streams supporting window-based operators. Concentrating on filtering and window-based aggregation, we present our stream sharing algorithms as well as experimental evaluation results from the astrophysics application domain to assess our approach.
A Black-Box Approach to Query Cardinality Estimation
- In Proc. CIDR
, 2007
Cited by 4 (1 self)
We present a “black-box” approach to estimating query cardinality that has no knowledge of query execution plans and data distribution, yet provides accurate estimates. It does so by grouping queries into syntactic families and learning the cardinality distribution of that group directly from points in a high-dimensional input space constructed from the query’s attributes, operators, function arguments, aggregates, and constants. We envision an increasing need for such an approach in applications in which query cardinality is required for resource optimization and decision-making at locations that are remote from the data sources. Our primary case study is the Open SkyQuery federation of Astronomy archives, which uses a scheduling and caching mechanism at the mediator for execution of federated queries at remote sources. Experiments using real workloads show that the black-box approach produces accurate estimates and is frugal in its use of space and in computation resources. Also, the black-box approach provides dramatic improvements in the performance of caching in Open SkyQuery.
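The grouping step described above can be sketched as follows. This is a rough illustration under strong assumptions, not the paper's method: queries are grouped by replacing numeric literals with `?` (so identifiers must not contain digits), and the learned model is swapped for a plain nearest-neighbour lookup over the constants. `split_query`, `TemplateEstimator` and the sample queries are hypothetical names and data.

```python
import re
from collections import defaultdict

# Assumption: only numeric literals vary between queries of one family.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def split_query(sql):
    """Return (template, constants), with numeric literals replaced by '?'."""
    constants = tuple(float(x) for x in NUMBER.findall(sql))
    template = NUMBER.sub("?", sql)
    return template, constants

class TemplateEstimator:
    """Nearest-neighbour stand-in for a learned per-family cardinality model."""

    def __init__(self):
        self.samples = defaultdict(list)  # template -> [(constants, cardinality)]

    def observe(self, sql, cardinality):
        """Record the observed result size of an executed query."""
        template, constants = split_query(sql)
        self.samples[template].append((constants, cardinality))

    def estimate(self, sql):
        """Return the cardinality of the closest observed query, or None."""
        template, constants = split_query(sql)
        seen = self.samples.get(template)
        if not seen:
            return None  # no query of this family has been observed yet

        def distance(sample):
            return sum((a - b) ** 2 for a, b in zip(sample[0], constants))

        return min(seen, key=distance)[1]

estimator = TemplateEstimator()
estimator.observe("SELECT * FROM stars WHERE ra < 10", 100)
estimator.observe("SELECT * FROM stars WHERE ra < 50", 500)
guess = estimator.estimate("SELECT * FROM stars WHERE ra < 12")
unknown = estimator.estimate("SELECT * FROM galaxies WHERE z > 3")
```

The estimator never looks at execution plans or at the data itself, only at query text and previously observed result sizes, which is the sense in which such an approach is "black-box".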
Minimizing Communication Cost in Distributed Multi-query Processing
, 2009
Cited by 4 (0 self)
The increasing prevalence of large-scale distributed monitoring and computing environments such as sensor networks, scientific federations, Grids, etc., has led to a renewed interest in the area of distributed query processing and optimization. In this paper, we address a general, distributed multi-query processing problem motivated by the need to minimize the communication cost in these environments. Specifically, we address the problem of optimally sharing data movement across the communication edges in a distributed communication network, given a set of overlapping queries and query plans for them (specifying the operations to be executed). Most variations of our general problem can be shown to be NP-Hard by a reduction from the Steiner tree problem. However, we show that the problem can be solved optimally if the communication network is a tree, and present a novel algorithm for finding an optimal data movement plan. For general communication networks, we present efficient approximation algorithms for several variations of the problem. Finally, we present an experimental study over synthetic datasets showing both the need for exploiting the sharing of data movement and the effectiveness of our algorithms at finding such plans.
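A toy calculation shows why sharing data movement matters on a tree network. The sketch below is not the paper's algorithm. It assumes a single data item, a single source at the root of the tree, unit cost per edge, and a hypothetical five-node network. Under those assumptions, shipping the item once along the union of the source-to-destination paths is the cheapest plan, because paths in a tree are unique.

```python
def path_edges(parent, node):
    """Edges on the unique tree path from `node` up to the root (the source)."""
    edges = []
    while parent[node] is not None:
        edges.append((parent[node], node))
        node = parent[node]
    return edges

def movement_costs(parent, destinations):
    """Return (cost without sharing, cost with sharing) in edge traversals."""
    # Without sharing, every query ships its own copy along its full path.
    unshared = sum(len(path_edges(parent, d)) for d in destinations)
    # With sharing, each edge carries the data item at most once.
    shared_edges = set()
    for d in destinations:
        shared_edges.update(path_edges(parent, d))
    return unshared, len(shared_edges)

# Hypothetical network, rooted at the data source "src":
#   src -> a -> b,  a -> c,  src -> d
parent = {"src": None, "a": "src", "b": "a", "c": "a", "d": "src"}
unshared, shared = movement_costs(parent, ["b", "c", "d"])
```

Here the queries at `b` and `c` both need the edge from `src` to `a`, so sharing saves one traversal (4 instead of 5). The general problem in the abstract is harder because queries overlap only partially and operators can be placed at different nodes.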

