An Analytic Behavior Model for Disk Drives With Readahead Caches and Request Reordering
1998
Cited by 56 (8 self)
Modern disk drives read-ahead data and reorder incoming requests in a workload-dependent fashion. This improves their performance, but makes simple analytical models of them inadequate for performance prediction, capacity planning, workload balancing, and so on. To address this problem we have developed a new analytic model for disk drives that do readahead and request reordering. We did so by developing performance models of the disk drive components (queues, caches, and the disk mechanism) and a workload transformation technique for composing them. Our model includes the effects of workload-specific parameters such as request size and spatial locality. The result is capable of predicting the behavior of a variety of real-world devices to within 17% across a variety of workloads and disk drives.
Variable Length Queries for Time Series Data
In ICDE, 2000
Cited by 45 (7 self)
Finding similar patterns in a time sequence is a well-known problem that has been addressed by many authors. Most of the current techniques work well for queries of a prespecified length, but fail for variable length queries. We propose a new indexing technique that works well for variable length queries. Our idea is to store index structures at different resolutions for a given dataset. The resolutions are based on wavelets. A number of subqueries at different resolutions are generated for each variable length query. The ranges of the subqueries are progressively refined based on results from previous subqueries. Our experiments show that the total cost for our method is 4 to 20 times less than the current techniques including Linear Scan. Because of the need to store information at multiple resolution levels, the storage requirement of our method could potentially be large. In the second part of the paper, we show how the index information can be compressed with minimal information loss. According to our experimental results, even after compressing the size of the index to one fifth, the total cost of our method is 3 to 15 times less than the current techniques.
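As a rough illustration of the multi-resolution idea, the sketch below splits a query of arbitrary length into subqueries whose lengths are powers of two, one per resolution level at which an index could be kept. The power-of-two decomposition, the `min_level` cutoff, and the function name are assumptions made for illustration; the abstract does not say how the subqueries are formed.

```python
def split_into_subqueries(query, min_level=0):
    """Split `query` into consecutive pieces whose lengths are powers of two.

    Pieces are taken greedily, largest power of two first. Any tail shorter
    than 2**min_level is returned separately, because (under this sketch's
    assumption) no index resolution is fine enough to cover it.
    """
    pieces = []
    start = 0
    remaining = len(query)
    while remaining > 0 and remaining >= 2 ** min_level:
        level = remaining.bit_length() - 1  # largest power of two <= remaining
        size = 2 ** level
        pieces.append((level, query[start:start + size]))
        start += size
        remaining -= size
    tail = query[start:]
    return pieces, tail
```

Under these assumptions, a query of length 13 yields subqueries at levels 3, 2, and 0 (lengths 8, 4, and 1), each of which would be answered against the index of the matching resolution.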
Performance Modeling for Realistic Storage Devices
1997
Cited by 36 (8 self)
Managing large amounts of storage is difficult and becoming more so as both the complexity and number of storage devices are increasing. One approach to this problem is a self-managing storage system. Since a self-managing storage system is a real-time system, it requires a model that quickly approximates the behavior of the storage device in a workload-dependent fashion. We develop such a model.
Our approach to modeling storage devices is to model the individual physical components of the device, such as queues, caches, and disk mechanisms, and then compose the component models. Each component model determines its behavior from the specification of the entering workload and the lower-level device behavior. To support the lower-level component model in determining its behavior, each component model creates a modified workload specification that reflects the way the physical component would modify the entering workload. Modifying the workload specification allows us, for example, to capture the altered spatial locality that occurs when queues reorder their requests.
Our model predicts the device behavior in terms of response time within a relative error ranging from 2% to 30% for interesting subsets of the domain of devices and workloads. To demonstrate this, the model has been validated with synthetic traces of parallel scientific file system workloads and video-on-demand applications and traces of transaction processing applications.
Our contributions to the area of performance modeling for storage devices include the following:
- An infrastructure for developing a composite model. The infrastructure
supports the development of more complicated devices and workloads
than we have validated.
- Methods to approximate the mean seek time and rotational latency of
a disk mechanism using measures of workload spatial locality.
- Methods to approximate the miss probability and the full- and partial-hit
probabilities in an I/O system's data caches using measures of workload
spatial locality.
- Methods to approximate the queue delay for non-FCFS scheduling algorithms
using a description of the workload arrival process.
These methods can be composed to provide analytic estimation procedures for the behavior of a subset of current storage devices.
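One way to picture the composition is as a chain of component models, each of which takes a workload description, computes its own contribution to response time, and hands a transformed workload to the component below it. The sketch is a toy under stated assumptions: the `Workload` fields, the constants, the fixed cache hit probability, and the M/M/1 queue formula are stand-ins chosen for brevity, not the approximations developed in the work itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Workload:
    arrival_rate: float   # requests per second (hypothetical field)
    sequentiality: float  # fraction of requests that follow the previous one


def disk_model(w):
    """Mean service time in seconds; sequential requests skip positioning."""
    positioning, transfer = 0.010, 0.002  # illustrative constants only
    return (1.0 - w.sequentiality) * positioning + transfer


def cache_model(w, lower, hit_prob=0.3, hit_time=0.0002):
    """Hits are served from the cache; only misses reach the lower component."""
    # Transformed workload: the lower level sees just the miss stream.
    miss_workload = replace(w, arrival_rate=w.arrival_rate * (1.0 - hit_prob))
    return hit_prob * hit_time + (1.0 - hit_prob) * lower(miss_workload)


def queue_model(w, lower):
    """M/M/1 mean response time, a stand-in for a real scheduling model."""
    service = lower(w)
    utilization = w.arrival_rate * service
    if utilization >= 1.0:
        raise ValueError("workload saturates the device")
    return service / (1.0 - utilization)


def device_response_time(w):
    """Compose queue -> cache -> disk into one device-level estimate."""
    return queue_model(w, lambda inner: cache_model(inner, disk_model))
```

Calling `device_response_time(Workload(arrival_rate=50.0, sequentiality=0.5))` returns a single mean response time in seconds. Swapping any one component for a more faithful model leaves the others unchanged, which is the point of composing them.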
Declustering two-dimensional datasets over MEMS-based Storage
In EDBT, 2003
Cited by 5 (0 self)
Due to the large difference between seek time and transfer time in current disk technology, it is advantageous to perform large I/O using a single sequential access rather than multiple small random I/O accesses. However, prior optimal cost and data placement approaches for processing range queries over two-dimensional datasets do not consider this property. In particular, these techniques do not consider the issue of sequential data placement when multiple I/O blocks need to be retrieved from a single device. In this paper, we reevaluate the optimal cost of range queries, and prove that, in general, it is impossible to achieve the new optimal cost. This is because disks cannot facilitate the two-dimensional sequential access that the new optimal cost requires. Fortunately, MEMS-based storage is being developed to reduce I/O cost. We first show that the two-dimensional sequential access requirement cannot be satisfied by simply modeling MEMS-based storage as conventional disks. Then we propose a new placement scheme that exploits the physical properties of MEMS-based storage to solve this problem. Our
Speeding Up Whole-Genome Alignment by Indexing Frequency Vectors
2004
Cited by 5 (1 self)
Motivation: Many biological applications require the comparison of large genome strings. Current techniques suffer from high computational and I/O costs.
Accessing scientific data: Simpler is better
In Proc. of the 8th International Symposium on Spatial and Temporal Databases, 2003
Cited by 4 (2 self)
A variety of index structures have been proposed for supporting fast access and summarization of large multidimensional data sets. Some of these indices are fairly involved, hence few are used in practice. In this paper we examine how to reduce the I/O cost by taking full advantage of recent trends in hard disk development which favor reading large chunks of consecutive disk blocks over seeking and searching. We present the Multiresolution File Scan (MFS) approach, which is based on a surprisingly simple and flexible data structure that outperforms sophisticated multidimensional indices, even if they are bulk-loaded and hence optimized for query processing. Our approach also has the advantage that it can incorporate a priori knowledge about the query workload. It readily supports summarization using distributive (e.g., count, sum, max, min) and algebraic (e.g., avg) aggregate operators.
Shift and Scale Invariant Search of Multi-attribute Time Sequences
2001
Cited by 3 (2 self)
We investigate the problem of searching for similar multi-attribute time sequences in databases. Such sequences arise naturally in a number of medical, financial, video, weather forecast, and stock market databases where more than one attribute is of interest at a time instant. We formulate a new symmetric scale and shift invariant notion of distance for such sequences. We also propose a new index structure that transforms the data sequences and clusters them according to their shiftings and scalings. This clustering improves the efficiency considerably. According to our experiments with real and synthetic datasets, the index structure's performance is 5 to 60 times better than competing techniques, with the exact speedup depending on other optimizations such as caching and replication. Finally, we also consider the subsequence search problem.
Joining Massive High-Dimensional Datasets
In Proc. ICDE, 2003
Cited by 3 (1 self)
We consider the problem of joining massive datasets. We propose two techniques for minimizing the disk I/O cost of join operations for both spatial and sequence data. Our techniques optimize the use of available buffer space using a global view of the datasets. We build a boolean matrix on the pages of the given datasets using a lower bounding distance predictor. The marked entries of this matrix represent candidate page pairs to be joined. Our first technique joins the marked pages iteratively. Our second technique clusters the marked entries using rectangular dense regions that have minimal perimeter and fit into the buffer. These clusters are then ordered so that the total number of common pages between consecutive clusters is maximal. The clusters are then read from disk and joined. Our experimental results on various real datasets show that our techniques are 2 to 86 times faster than the competing techniques for spatial datasets, and 13 to 133 times faster than the competing techniques for sequence datasets.
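The candidate matrix can be sketched as follows. The example assumes each page is summarized by an axis-aligned bounding box and uses the minimum distance between boxes as the lower-bounding predictor; both choices, and all names, are illustrative assumptions rather than the paper's definitions.

```python
import math


def min_box_distance(a, b):
    """Smallest Euclidean distance between two axis-aligned boxes.

    Each box is a (low, high) pair of equal-length coordinate tuples.
    The result never exceeds the distance between any two points drawn
    from the two boxes, so it is a valid lower bound.
    """
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    gaps = [max(bl - ah, al - bh, 0.0)
            for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi)]
    return math.sqrt(sum(g * g for g in gaps))


def candidate_matrix(pages_r, pages_s, epsilon):
    """Mark page pairs whose lower-bound distance is within epsilon.

    Unmarked pairs cannot contain any joining points and need not be read.
    """
    return [[min_box_distance(r, s) <= epsilon for s in pages_s]
            for r in pages_r]
```

Only the marked entries need to be fetched and joined; how they are grouped and ordered to fit the buffer is the subject of the two techniques described above and is not attempted here.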
Optimizing Similarity Search for Arbitrary Length Time Series Queries
2003
Cited by 3 (0 self)
We consider the problem of finding similar patterns in a time sequence. Typical applications of this problem involve large databases consisting of long time sequences of different lengths. Current time sequence search techniques work well for queries of a prespecified length, but not for arbitrary length queries. We propose a novel indexing technique that works well for arbitrary length queries. The proposed technique stores index structures at different resolutions for a given dataset. We prove that this index structure is su
A preliminary version of this paper appeared in ICDE 2001 [9].

