Executing Multiple Pipelined Data Analysis Operations in the Grid
, 2002
Cited by 32 (7 self)
Processing of data in many data analysis applications can be represented as an acyclic, coarse grain data flow, from data sources to the client. This paper is concerned with scheduling of multiple data analysis operations, each of which is represented as a pipelined chain of processing on data. We define the scheduling problem for effectively placing components onto Grid resources, and propose two scheduling algorithms. Experimental results are presented using a visualization application.
Processing Large-Scale Multidimensional Data in Parallel and Distributed Environments
, 2002
Cited by 13 (9 self)
Analysis of data is an important step in understanding and solving a scientific problem. Analysis involves extracting the data of interest from all the available raw data in a dataset and processing it into a data product. However, in many areas of science and engineering, a scientist's ability to analyze information is increasingly becoming hindered by dataset sizes. The vast amount of data in scientific datasets makes it a difficult task to efficiently access the data of interest, and manage potentially heterogeneous system resources to process the data. Subsetting and aggregation are common operations executed in a wide range of data-intensive applications. We argue that common runtime and programming support can be developed for applications that query and manipulate large datasets. This paper presents a compendium of frameworks and methods we have developed to support efficient execution of subsetting and aggregation operations in applications that query and manipulate large, multi-dimensional datasets in parallel and distributed computing environments.
Adaptable mirroring in cluster servers
- In Proc. of High Performance Distributed Computing (HPDC) Conference
, 2001
An Efficient System for Multi-perspective Imaging and Volumetric Shape Analysis
, 2001
Cited by 6 (5 self)
We present a high performance system for efficient multiperspective image analysis on very large image datasets, implemented as a customized extension of the Active Data Repository (ADR) object-oriented framework. The resulting system provides a flexible framework for handling multiperspective video sequences in any parallel or distributed computing environment that can support ADR. We have employed the framework to produce an efficient volumetric shape analysis application implementation. Initial performance results show that using an effective data distribution algorithm to produce good load balancing allows the ADR implementation to achieve scalable high performance.
A Distributed Computing Environment for Interdisciplinary Applications
- Concurrency and Computation: Practice and Experience
, 2002
Cited by 6 (0 self)
Practical applications are generally interdisciplinary in nature. The technology is mature for single-discipline applications but not for interdisciplinary ones. Hence, there is a need to couple the capabilities of several different computational disciplines to address these interdisciplinary practical applications. One approach is to use coupled or multi-physics software, which typically involves developing and validating the entire software spectrum for a specific application; this is time consuming and delays delivery to the end user. The other approach is to integrate mature software from the individual computational disciplines, taking advantage of existing scalable software and validation investments and of the tremendous developments in computer science and computational sciences. This integrated approach requires a consistent data model, data format, data management, seamless data movement, and robust modular scalable including coupling algorithms. To address these requirements, we developed a new flexible data exchange mechanism for HPC codes and tools, known as the eXtensible Data Model and Format (XDMF). XDMF provides computational engines with the tools necessary to exist in a modern computing environment with minimal modification. Instead of imposing a new programming paradigm on HPC codes, XDMF uses the existing concept of file I/O for distributed coordination. XDMF incorporates Network Distributed Global Memory (NDGM), Hierarchical Data Format version 5 (HDF5), and eXtensible Markup Language (XML) to provide a flexible yet efficient data exchange mechanism. This paper discusses the development and implementation of a distributed computing environment for interdisciplinary applications utilizing the concept...
Iteration Aware Prefetching for Large Multidimensional Scientific Datasets
- Proc. 17th International Scientific and Statistical Database Management Conference
, 2005
Cited by 5 (2 self)
Most caching and prefetching research does not take advantage of prior knowledge of access patterns, or does not adequately address the storage issues associated with multidimensional scientific data. Armed with an access pattern specified at run time as an iteration over a multidimensional array stored as a disk file, we use prefetching to greatly reduce the number of disk accesses and mitigate the cost of read latency. We call this iteration aware prefetching.
We assume the pattern of access is not known until runtime, in contrast to chunking methods that preprocess a file for a particular access pattern. Our approach results in dramatic performance improvements over file system caching. We also significantly outperform chunking without having to reorganize the data, and can do even better by applying our approach on top of a chunked file.
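The idea of using a run-time iteration to cut disk accesses can be illustrated with a minimal sketch. This is not the paper's implementation: `CountingFile`, `iterate_on_demand`, `iterate_with_prefetch`, and the `band` parameter are hypothetical names, the "file" is an in-memory list that only counts read requests, and the strategy (one contiguous read per row for each band of columns) is an assumed simplification.

```python
class CountingFile:
    """Stand-in for a disk file holding a row-major 2D array.

    Hypothetical helper: it only counts how many read requests are issued.
    """

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.data = list(range(rows * cols))
        self.reads = 0

    def read(self, offset, length):
        self.reads += 1
        return self.data[offset:offset + length]


def column_major(r0, r1, c0, c1):
    """The access pattern, known only at run time: column by column."""
    for c in range(c0, c1):
        for r in range(r0, r1):
            yield r, c


def iterate_on_demand(f, r0, r1, c0, c1):
    """Baseline: one read request per element visited."""
    return [f.read(r * f.cols + c, 1)[0] for r, c in column_major(r0, r1, c0, c1)]


def iterate_with_prefetch(f, r0, r1, c0, c1, band):
    """Use the iteration bounds to fetch `band` columns ahead of time.

    For each band, one contiguous read per row covers every element the
    next `band` columns of the iteration will touch.
    """
    out = []
    for b0 in range(c0, c1, band):
        b1 = min(b0 + band, c1)
        buf = {r: f.read(r * f.cols + b0, b1 - b0) for r in range(r0, r1)}
        out.extend(buf[r][c - b0] for r, c in column_major(r0, r1, b0, b1))
    return out


if __name__ == "__main__":
    demand, prefetch = CountingFile(8, 16), CountingFile(8, 16)
    a = iterate_on_demand(demand, 2, 6, 4, 12)
    b = iterate_with_prefetch(prefetch, 2, 6, 4, 12, band=4)
    print(a == b, demand.reads, prefetch.reads)
```

In this toy setting both functions return the elements in the same order; the prefetching variant trades a buffer of `band` columns for fewer read requests, and the file itself is never reorganized.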
Efficient manipulation of large datasets on heterogeneous storage systems
- In Proceedings of the 11th Heterogeneous Computing Workshop (HCW2002). IEEE Computer
, 2002
Cited by 3 (3 self)
In this paper we are concerned with the efficient use of a collection of disk-based storage systems and computing platforms in a heterogeneous setting for retrieving and processing large scientific datasets. We demonstrate, in the context of a data-intensive visualization application, how heterogeneity affects performance and show a set of optimization techniques that can be used to improve performance in a component-based framework. In particular, we examine the application of parallelism via transparent copies of application components in the pipelined processing of data.
Supporting scalable and distributed data subsetting and aggregation in large-scale seismic data analysis
- The Journal of High Performance Computing Applications
, 2006
Cited by 2 (1 self)
The ability to query and process very large, terabyte-scale datasets has become a key step in many scientific and engineering applications. In this paper, we describe the application of two middleware frameworks in an integrated fashion to provide a scalable and efficient system for execution of seismic data analysis on large datasets in a distributed environment. We investigate different strategies for efficient querying of large datasets and parallel implementations of a seismic image reconstruction algorithm. Our results on a state-of-the-art mass storage system coupled with a high-end compute cluster show that our implementation is scalable and can achieve a data processing rate of about 2.9 GB/s, about 70% of the maximum 4.2 GB/s application-level raw I/O bandwidth of the storage platform.
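The quoted efficiency figure can be checked from the two rates given in the abstract. The variable names below are illustrative only.

```python
# Rates as stated in the abstract, in gigabytes per second.
achieved_gb_s = 2.9
peak_gb_s = 4.2

# Fraction of the storage platform's raw I/O bandwidth that is reached.
fraction = achieved_gb_s / peak_gb_s

if __name__ == "__main__":
    print(f"{fraction:.1%}")  # prints 69.0%
```

The ratio is roughly 0.69, consistent with the abstract's "about 70%".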
Optimizing Reduction Computations In a Distributed Environment
, 2003
Cited by 1 (1 self)
We investigate runtime strategies for data-intensive applications that involve generalized reductions on large, distributed datasets. Our set of strategies includes replicated filter state, partitioned filter state, and hybrid options between these two extremes. We evaluate these strategies using emulators of three real applications, different query and output sizes, and a number of configurations. We consider execution in a homogeneous cluster and in a distributed environment where only a subset of nodes host the data. Our results show that replicating the filter state scales well and outperforms other schemes if sufficient memory is available and sufficient computation is involved to offset the cost of the global merge step. In other cases, the hybrid strategy is usually the best. Moreover, in almost all cases, the performance of the hybrid strategy is quite close to that of the best strategy. Thus, we believe that hybrid is an attractive approach when the relative performance of different schemes cannot be predicted.
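The two extremes named in the abstract can be contrasted with a small sketch of a sum-by-key reduction. Everything here is an assumption made for illustration: the function names, the `key % n` ownership rule, and the two crude cost counters (accumulator elements merged versus input elements forwarded) are not taken from the paper, and the hybrid strategy is not modeled.

```python
def reduce_replicated(chunks, num_bins):
    """Replicated state: every worker keeps a full accumulator.

    Workers update their own copy with no communication; a global merge
    then combines all copies. Returns (result, accumulator elements merged).
    """
    partials = []
    for chunk in chunks:  # one chunk of (key, value) pairs per worker
        acc = [0] * num_bins
        for key, value in chunk:
            acc[key] += value
        partials.append(acc)
    merged = [sum(column) for column in zip(*partials)]
    return merged, len(chunks) * num_bins


def reduce_partitioned(chunks, num_bins):
    """Partitioned state: each key is owned by exactly one worker.

    An element whose key is owned elsewhere must be forwarded to its
    owner; no global merge is needed. Returns (result, elements forwarded).
    """
    n = len(chunks)
    shards = [dict() for _ in range(n)]
    forwarded = 0
    for worker, chunk in enumerate(chunks):
        for key, value in chunk:
            owner = key % n  # assumed ownership rule
            if owner != worker:
                forwarded += 1
            shards[owner][key] = shards[owner].get(key, 0) + value
    result = [0] * num_bins
    for shard in shards:
        for key, total in shard.items():
            result[key] = total
    return result, forwarded


if __name__ == "__main__":
    chunks = [[(0, 1), (1, 2), (3, 1)], [(1, 5), (2, 2), (0, 1)]]
    print(reduce_replicated(chunks, 4))
    print(reduce_partitioned(chunks, 4))
```

Both functions produce the same reduction result. In this sketch the replicated variant pays in memory and merge volume, which grow with the accumulator size, while the partitioned variant pays in forwarded elements, which grow with the input size.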
Out of Core Visualization Using Iterator Aware Multidimensional Prefetching
, 2005
Visualization of multidimensional data presents special challenges for the design of efficient out-of-core data access. Elements that are nearby in the visualization may not be nearby in the underlying data file, which can severely tax the operating system's disk cache. The Granite Scientific Database System can address these problems because it is aware of the organization of the data on disk, and it knows the visualization method's pattern of access. The access pattern is expressed using a toolkit of iterators that both describe the access pattern and perform the iteration itself. Because our system has knowledge of both the data organization and the access pattern, we are able to provide significant performance improvements while hiding the details of out-of-core access from the visualization programmer.

