Results 1 - 10 of 10
Bitmap indexes for large scientific data sets: A case study
- In IPDPS, 2006
Cited by 7 (4 self)
The data used by today’s scientific applications are often very high in dimensionality and staggering in size. These characteristics necessitate the use of a good multidimensional indexing strategy to provide efficient access to the data. Researchers have previously proposed the use of bitmap indexes for high-dimension scientific data as a way of overcoming the drawbacks of traditional multidimensional indexes such as R-trees and KD-trees, which are bulky and whose performance does not scale well as the number of dimensions increases. However, the techniques proposed in previous work on bitmap indexes are not sufficient to address all problems that arise in practice. In experiments with real datasets, we experienced problems with index size and query performance. To overcome these shortcomings, we propose the use of adaptive, multilevel, multiresolution bitmap indexes, and evaluate their performance in two scientific domains. Our preliminary experiments with a parallel query processor and index creator also show that it is very easy to parallelize a bitmap index.
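As a rough illustration of the general idea behind a binned bitmap index (not the adaptive, multilevel, multiresolution scheme the paper proposes, whose details are not given in this abstract), the sketch below keeps one bitmap per value bin and answers a range query by OR-ing the fully covered bins and checking only the two edge bins against the raw values. The class name `BinnedBitmapIndex`, the helper `rows_of`, and the choice of Python integers as bitmaps are assumptions made for this example.

```python
from bisect import bisect_right


def rows_of(bitmap):
    """Return the sorted row ids whose bits are set in an integer bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


class BinnedBitmapIndex:
    """Hypothetical single-attribute index: one bitmap (a Python int) per bin."""

    def __init__(self, values, bin_edges):
        # bin_edges must be sorted; bin i holds edges[i-1] <= v < edges[i],
        # with open-ended bins below the first edge and above the last.
        self.values = list(values)
        self.bin_edges = list(bin_edges)
        self.bitmaps = [0] * (len(self.bin_edges) + 1)
        for row, v in enumerate(self.values):
            self.bitmaps[bisect_right(self.bin_edges, v)] |= 1 << row

    def range_query(self, lo, hi):
        """Return a bitmap of the rows with lo <= value < hi."""
        first = bisect_right(self.bin_edges, lo)
        last = bisect_right(self.bin_edges, hi)
        hits = 0
        # Bins strictly between the edge bins lie entirely inside [lo, hi).
        for b in range(first + 1, last):
            hits |= self.bitmaps[b]
        # Edge bins are only partly covered, so verify them against raw data.
        candidates = self.bitmaps[first] | self.bitmaps[last]
        for row in rows_of(candidates):
            if lo <= self.values[row] < hi:
                hits |= 1 << row
        return hits


# A conjunction over two attributes is a bitwise AND of the per-attribute results.
energy = BinnedBitmapIndex([0.5, 3.2, 7.9, 4.4, 9.1, 2.0], [2.0, 4.0, 6.0, 8.0])
temperature = BinnedBitmapIndex([10, 20, 30, 40, 50, 60], [25, 45])
both = energy.range_query(3.0, 8.0) & temperature.range_query(25, 45)
print(rows_of(both))
```

Because each attribute is indexed independently and combined with cheap bitwise operations, adding a dimension adds one more index rather than deepening a tree, which is the property that makes bitmap indexes attractive for high-dimensional queries.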
GODIVA: Lightweight Data Management for Scientific Visualization
- 20th International Conference on Data Engineering (ICDE), 2004
Cited by 7 (2 self)
Scientific visualization applications are very data-intensive, with high demands for I/O and data management. Developers of many visualization tools hesitate to use traditional DBMSs, due to the lack of support for these DBMSs on parallel platforms and the risk of reducing the portability of their tools and the user data. In this paper, we propose the GODIVA framework, which provides simple database-like interfaces to help visualization tool developers manage their in-memory data, and I/O optimizations such as prefetching and caching to improve input performance at run time. We implemented the GODIVA interfaces in a stand-alone, portable user library, which can be used by all types of visualization codes: interactive and batch-mode, sequential and parallel. Performance results from running a visualization tool using the GODIVA library on multiple platforms show that the GODIVA framework is easy to use, alleviates developers' data management burden, and can bring substantial I/O performance improvement.
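The prefetching and caching mentioned in this abstract can be pictured with a generic sketch like the one below. It is not the GODIVA interface, which the abstract does not describe; the `Prefetcher` class, its method names, and the `loader` callback are all hypothetical. The idea shown is only that a read for a later time step is started in a background thread while the current step is being processed, and that a loaded item is served from memory until it is released.

```python
from concurrent.futures import ThreadPoolExecutor


class Prefetcher:
    """Hypothetical helper: background loads keyed by an application-chosen id."""

    def __init__(self, loader, max_workers=2):
        self._loader = loader      # function(key) -> data; assumed slow (I/O)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = {}         # key -> Future, doubles as the cache

    def prefetch(self, key):
        """Start loading `key` in the background unless it is already known."""
        if key not in self._pending:
            self._pending[key] = self._pool.submit(self._loader, key)

    def get(self, key):
        """Return the data for `key`, blocking only if the load is unfinished."""
        self.prefetch(key)
        return self._pending[key].result()

    def release(self, key):
        """Drop `key` from memory once the application is done with it."""
        self._pending.pop(key, None)

    def close(self):
        self._pool.shutdown()


def load_step(step):
    # Stand-in for reading one time step from disk.
    return [step * 10 + i for i in range(3)]


reader = Prefetcher(load_step)
for step in range(3):
    reader.prefetch(step + 1)   # overlap the next read with current processing
    frame = reader.get(step)
    print(step, sum(frame))
    reader.release(step)
reader.close()
```

Keeping release explicit, rather than evicting automatically, mirrors the situation where the visualization code knows best when a data set is no longer needed; an automatic eviction policy would be an equally reasonable choice.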
Agent-Based Query Optimisation in a Grid Environment
- In Proceedings of the IASTED International Conference on Applied Informatics (AI 2001), 2001
Cited by 6 (3 self)
The next-generation experiments in High Energy Physics are the driving force for setting up an International Data Grid at CERN, the European Organization for Nuclear Research. Hundreds of Petabytes of data will be distributed and replicated all over the globe starting from 2005. In order to analyse this massive set of distributed data efficiently, we propose a hierarchical query optimisation architecture based on multi-agent technology. The architecture is optimised for the High Energy Physics community but is also representative of other data-intensive scientific applications that use distributed data stores and mass storage systems.
Keywords: query optimisation, distributed computing, agent technology
1 Introduction
The idea of building an International Data Grid [1,2] at CERN, the European Organization for Nuclear Research, was driven by the needs of the next-generation accelerator, the Large Hadron Collider (LHC), which is scheduled to be in operation in 2005. Several Petabyte...
A Scientific Data Management System for Irregular Applications
Cited by 3 (1 self)
Many scientific applications are I/O intensive and generate large data sets, spanning hundreds or thousands of "files." Management, storage, efficient access, and analysis of this data present an extremely challenging task. We have developed a software system, called Scientific Data Manager (SDM), that uses a combination of parallel file I/O and database support for high-performance scientific data management. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data and a database to store application-related metadata. In this paper, we describe how we designed and implemented SDM to support irregular applications. SDM can efficiently handle the reading and writing of data in an irregular mesh, as well as the distribution of index values. We describe the SDM user interface and how we have implemented it to achieve high performance. SDM makes extensive use of MPI-IO's noncontiguous collective I/O functions. SDM also uses the concept of a history file to optimize the cost of the index distribution using the metadata stored in the database. We present performance results with two irregular applications, a CFD code called FUN3D and a Rayleigh-Taylor instability code, on the SGI Origin2000 at Argonne National Laboratory.
Indexing Scientific Data
- 2007
Cited by 2 (0 self)
The ability to extract information from collected data has always driven science. Today’s large computers and automated sensing technologies collect terabytes of data in a few weeks. Extracting information from such large amounts of data is like trying to find a needle in a haystack. For efficient information extraction, we need disk-based indexing schemes that can efficiently handle queries restricting ranges on dozens of attributes. Unfortunately, the unique characteristics of scientific data and queries cause traditional indexing techniques to have poor performance on scientific workloads, occupy excessive space, or both. Bitmap indexes were proposed as a solution to these problems. However, in experiments with scientific data and queries, we found that previously proposed variations of bitmap indexes either were quite slow or required excessive storage for processing the large-range query conditions our scientists used. Scientists also told us that bitmap indexes, though smaller than traditional indexes, were too large for scientific data warehouses. Our scientists also wanted an efficient method to consolidate the data points returned by the indexes into larger, more meaningful regions of interest. To address these three problems, we introduced multi-resolution bitmap indexes, which group
High-Performance Scientific Data Management System
- Journal of Parallel and Distributed Computing
Cited by 1 (0 self)
Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or data sets. Management, storage, efficient access, and analysis of this data present an extremely challenging task. Traditionally, two different solutions have been used for this task: file I/O or databases. File I/O can provide high performance but is tedious to use with large numbers of files and large and complex data sets. Databases can be convenient, flexible, and powerful but do not perform and scale well for parallel supercomputing applications. We have developed a software system, called Scientific Data Manager (SDM), that combines the good features of both file I/O and databases. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data (using various I/O optimizations available in MPI-IO) and a database to store application-related metadata. In order to support I/O in irregular applications, SDM makes extensive use of MPI-IO’s noncontiguous collective I/O functions. Moreover, SDM uses the concept of a history file to optimize the cost of the index distribution using the metadata stored in the database. We describe the design and implementation of SDM and present performance results with two regular applications, ASTRO3D and an Euler solver, and with two irregular applications, a CFD code called FUN3D and a Rayleigh-Taylor instability code.
Dealing with Massive Data: From Parallel I/O to Grid I/O
- 2003
Cited by 1 (0 self)
Acknowledgements Many people have helped us find our way during the development of this thesis. Erich Schikuta, our supervisor, provided a motivating, enthusiastic, and critical atmosphere during our discussions. It was a great pleasure for us to conduct this thesis under his supervision. We also acknowledge Heinz and Kurt Stockinger who provided constructive comments. We would also like to thank everybody for providing us with feedback.
Experience with BXGrid: a data repository and computing grid for biometrics research
- CLUSTER COMPUT, 2009
Cited by 1 (1 self)
Research in the field of biometrics depends on the effective management and analysis of many terabytes of digital data. The quality of an experimental result is often highly dependent upon the sheer amount of data marshalled to support it. However, the current state of the art requires researchers to have a heroic level of expertise in systems software to perform large-scale experiments. To address this, we have designed and implemented BXGrid, a data repository and workflow abstraction for biometrics research. The system is composed of a relational database, an active storage cluster, and a campus computing grid. End users interact with the system through a high-level abstraction of four stages: Select, Transform, AllPairs, and Analyze. A high degree of availability and reliability is achieved through transparent failover, three-phase operations, and independent auditing. BXGrid is currently in daily production use by an active biometrics research group at the University of Notre Dame. We discuss our experience in constructing and using the system and offer lessons learned in conducting collaborative research in e-Science.
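The four-stage abstraction named in this abstract (Select, Transform, AllPairs, Analyze) can be pictured with the toy pipeline below. It is only a single-process sketch of the stage sequence, not BXGrid's actual interface or its distributed execution; the function signatures, the record fields (`subject`, `quality`, `bits`), and the bit-matching similarity are assumptions invented for this example.

```python
def select(records, predicate):
    """Stage 1: pick the records of interest from the repository."""
    return [r for r in records if predicate(r)]


def transform(records, fn):
    """Stage 2: convert each selected record into a comparable form."""
    return [fn(r) for r in records]


def all_pairs(items_a, items_b, compare):
    """Stage 3: compare every item in A with every item in B (a score matrix)."""
    return [[compare(a, b) for b in items_b] for a in items_a]


def analyze(matrix, subjects, threshold):
    """Stage 4: count above-threshold pairs, split by same or different subject.

    Assumes the matrix compares one set with itself, so the diagonal
    (self-comparisons) is skipped.
    """
    result = {"true_matches": 0, "false_matches": 0}
    for i, row in enumerate(matrix):
        for j, score in enumerate(row):
            if i == j or score < threshold:
                continue
            key = "true_matches" if subjects[i] == subjects[j] else "false_matches"
            result[key] += 1
    return result


def similarity(a, b):
    # Hypothetical comparison: number of positions where two bit strings agree.
    return sum(x == y for x, y in zip(a, b))


# Made-up records standing in for biometric samples.
records = [
    {"subject": "A", "quality": 0.90, "bits": "1011"},
    {"subject": "A", "quality": 0.80, "bits": "1001"},
    {"subject": "B", "quality": 0.95, "bits": "0100"},
    {"subject": "B", "quality": 0.20, "bits": "1111"},
]

chosen = select(records, lambda r: r["quality"] >= 0.5)
templates = transform(chosen, lambda r: r["bits"])
scores = all_pairs(templates, templates, similarity)
print(analyze(scores, [r["subject"] for r in chosen], threshold=3))
```

The AllPairs stage grows quadratically with the number of selected records, which is why a system of this kind would push it onto a cluster or grid rather than run it in one process as done here.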
Maitri: A Format-Independent Framework for Managing Large Scale Scientific Data
Even traditional commercial database systems do not scale to the size of today’s large scientific data sets, whose growth is outpacing Moore’s Law. Instead, scientists are wedded to special-purpose data formats and their associated I/O libraries, even though these libraries provide only basic functionality. Thus there is a need for a scalable data management system that can support these formats and, when needed, provide more sophisticated functionality for indexing, buffering, caching, concurrency control, metadata management, and querying. This demonstration showcases Maitri, a framework that can be used to address these needs. The Maitri framework consists of a set of standard, very narrow interfaces for format-agnostic, loosely-coupled libraries offering aspects of
High-Performance Scientific Data Management System
Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or data sets. Management, storage, efficient access, and analysis of this data present an

