Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems
, 2008
Cited by 16 (5 self)
As storage systems reach the petabyte scale, it has become increasingly difficult for users and storage administrators to understand and manage their data. File metadata, such as inode and extended attributes, is a valuable source of information that can aid in locating and identifying files, and can also facilitate administrative tasks, such as storage provisioning and recovery from backups. Unfortunately, most storage systems have no way to quickly and easily search file metadata at large scale. To address these issues, we developed Spyglass, an indexing system that efficiently gathers, indexes, and queries file metadata in large-scale storage systems. Our analysis of file metadata from real-world workloads showed that metadata has spatial locality in the storage namespace and that the distribution of metadata is highly skewed. Based on these findings, we designed Spyglass to use index partitioning and signature files to quickly prune the file search space. We also developed techniques to efficiently handle index versioning, facilitating both fast updates and queries across historical indexes. Experiments on systems with up to 300 million files show that the Spyglass prototype is as much as several thousand times faster than current database solutions while requiring only a fraction of the space.
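The pruning idea in this abstract (partition the index along the namespace, then use a per-partition signature to skip partitions that cannot match) can be sketched roughly as follows. This is an illustrative toy, not the Spyglass implementation: the partitioning rule (top-level directory), the 64-bit signature width, the CRC-based hash, and all function names are assumptions made for the example.

```python
import zlib
from collections import defaultdict

SIG_BITS = 64  # assumed signature width, chosen only for illustration


def _bit(attr, value):
    """Map an (attribute, value) pair to one bit position in the signature."""
    return zlib.crc32(f"{attr}={value}".encode()) % SIG_BITS


class Partition:
    """One index partition: files under a namespace subtree plus a signature."""

    def __init__(self):
        self.files = []      # list of (path, metadata dict)
        self.signature = 0   # Python int used as a bit array

    def add(self, path, meta):
        self.files.append((path, meta))
        for attr, value in meta.items():
            self.signature |= 1 << _bit(attr, value)

    def may_contain(self, attr, value):
        # False means definitely absent; True may be a false positive.
        return bool((self.signature >> _bit(attr, value)) & 1)


def build_index(files):
    """Group files by top-level directory (a stand-in for namespace locality)."""
    index = defaultdict(Partition)
    for path, meta in files:
        top = path.strip("/").split("/")[0]
        index[top].add(path, meta)
    return index


def search(index, attr, value):
    """Return matching paths and the number of partitions actually scanned."""
    hits, scanned = [], 0
    for part in index.values():
        if not part.may_contain(attr, value):
            continue  # pruned by the signature without touching its files
        scanned += 1
        hits += [p for p, m in part.files if m.get(attr) == value]
    return hits, scanned
```

A signature can report false positives, so a partition that passes the test is still scanned; it never reports false negatives, so pruning does not lose results.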
Architecture of a Database System
Cited by 9 (0 self)
Database Management Systems (DBMSs) are a ubiquitous and critical component of modern computing, and the result of decades of research and development in both academia and industry. Historically, DBMSs were among the earliest multi-user server systems to be developed, and thus pioneered many systems design techniques for scalability and reliability now in use in many other contexts. While many of the algorithms and abstractions used by a DBMS are textbook material, there has been relatively sparse coverage in the literature of the systems design issues that make a DBMS work. This paper presents an architectural discussion of DBMS design principles, including process models, parallel architecture, storage system design, transaction system implementation, query processor and optimizer architectures, and typical shared components and utilities. Successful commercial and open-source systems are used as points of reference, particularly when multiple alternative designs have been adopted by different groups.
Answering Aggregation Queries in a Secure System Model
Cited by 8 (0 self)
As more sensitive data is captured in electronic form, security becomes more and more important. Data encryption is the main technique for achieving security. While in the past enterprises were hesitant to implement database encryption because of the very high cost, complexity, and performance degradation, they now have to face the ever-growing risk of data theft as well as emerging legislative requirements. Data encryption can be done at multiple tiers within the enterprise. Different choices on where to encrypt the data offer different security features that protect against different attacks. One class of attack that needs to be taken seriously is the compromise of the database server, its software or administrator. A secure way to address this threat is for a DBMS to directly process queries on the ciphertext, without decryption. We conduct a comprehensive study on answering SUM and AVG aggregation queries in such a system model by using a secure homomorphic encryption scheme in a novel way. We demonstrate that the performance of such a solution is comparable to a traditional symmetric encryption scheme (e.g., DES) in which each value is decrypted and the computation is performed on the plaintext. Clearly this traditional encryption scheme is not a viable solution to the problem because the server must have access to the secret key and the plaintext, which violates our system model and security requirements. We study the problem in the setting of a read-optimized DBMS for data warehousing applications, in which SUM and AVG are frequent and crucial.
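To make the idea of computing SUM directly on ciphertext concrete, here is a toy sketch using the Paillier cryptosystem, an additively homomorphic scheme. The abstract does not name the scheme the authors use, so Paillier is an assumption, as are the function names and the tiny hard-coded primes, which give no real security.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (insecure; illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover the plaintext with the private key (lam, mu)."""
    lam, mu = priv
    n2 = n * n
    l_value = (pow(c, lam, n2) - 1) // n
    return (l_value * mu) % n


def encrypted_sum(n, ciphertexts):
    """Server side: multiplying ciphertexts adds the hidden plaintexts."""
    total = 1  # 1 is a valid encryption of 0
    for c in ciphertexts:
        total = (total * c) % (n * n)
    return total
```

In this sketch the server only ever calls `encrypted_sum`; the key holder decrypts the single result, and AVG follows by dividing the decrypted sum by the row count. The true sum must stay below `n` to avoid wrap-around.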
Dynamic tables: An architecture for managing evolving, heterogeneous biomedical data in relational database management systems
- Journal of the American Medical Informatics Association
, 2007
Cited by 6 (0 self)
Data sparsity and schema evolution issues affecting bioinformatics and medical informatics communities have forced the adoption of vertical or object-attribute-value based database schemas to overcome limitations posed when using conventional relational database technology. Through our collaboration with the Yale Center for Medical Informatics (YCMI), we explore the reasons for this and show why their data is difficult to model using conventional relational techniques. We propose a solution to these obstacles based on a relational database engine using a sparse, column-store architecture. We provide benchmarks comparing the performance of queries and schema-modification operations using three different strategies: (1) the standard conventional relational design, (2) past approaches used by clinical and neuroinformatics researchers, and (3) our sparse, column-store architecture. Our performance results show that our architecture is a promising technique for storing and processing many types of data that were not handled well by conventional or other derived semantic data models.
One size fits all? – Part 2: benchmarking results
- In CIDR
, 2007
Cited by 6 (1 self)
Two years ago, some of us wrote a paper predicting the demise of “One Size Fits All” (OSFA) [Sto05a]. In that paper, we examined the stream processing and data warehouse markets and gave reasons for a substantial performance advantage to specialized architectures in both markets. Herein, we make three additional contributions. First, we present reasons why the same performance advantage is enjoyed by specialized implementations in the text processing market. Second, the major contribution of the paper is to show “apples to apples” performance numbers between commercial implementations of specialized architectures and relational DBMSs in both stream processing and data warehouses. Finally, we also show comparison numbers between an academic prototype of a specialized architecture for scientific and intelligence applications, a relational DBMS, and a widely used mathematical computation tool. In summary, there appear to be at least four markets where specialized architectures enjoy an overwhelming performance advantage.
Executing Stream Joins on the Cell Processor
Cited by 3 (1 self)
Low-latency and high-throughput processing are key requirements of data stream management systems (DSMSs). Hence, multi-core processors that provide high aggregate processing capacity are ideal matches for executing costly DSMS operators. The recently developed Cell processor is a good example of a heterogeneous multi-core architecture and provides a powerful platform for executing data stream operators with high performance. On the downside, exploiting the full potential of a multi-core processor like Cell is often challenging, mainly due to the heterogeneous nature of the processing elements, the software-managed local memory at the co-processor side, and the unconventional programming model in general. In this paper, we study the problem of scalable execution of windowed stream join operators on multi-core processors, and specifically on the Cell processor. By examining various aspects of join execution flow, we determine the right set of techniques to apply in order to minimize the sequential segments and maximize parallelism. Concretely, we show that basic windows coupled with low-overhead pointer-shifting techniques can be used to achieve efficient join window partitioning, column-oriented join window organization can be used to minimize scattered data transfers, delay-optimized double buffering can be used for effective pipelining, rate-aware batching can be used to balance join throughput and tuple delay, and finally SIMD (single-instruction multiple-data) optimized operator code can be used to exploit data parallelism.
Our experimental results show that, following the design guidelines and implementation techniques outlined in this paper, windowed stream joins can achieve high scalability (linear in the number of co-processors) by making efficient use of the extensive hardware parallelism provided by the Cell processor (reaching data processing rates of ≈ 13 GB/sec) and significantly surpass the performance obtained from conventional high-end processors (supporting a combined input stream rate of 2000 tuples/sec using 15-minute windows and without dropping any tuples, resulting in ≈ 8.3 times higher output rate compared to an SSE implementation on dual 3.2 GHz Intel Xeon).
The SBC-Tree: An Index for Run-Length Compressed Sequences
Cited by 2 (0 self)
Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + |p| + T) I/O operations, where |p| is the
The Case for RodentStore, an Adaptive, Declarative Storage System
, 2009
Cited by 2 (0 self)
Recent excitement in the database community surrounding new applications—analytic, scientific, graph, geospatial, etc.—has led to an explosion in research on database storage systems. New storage systems are vital to the database community, as they are at the heart of making database systems perform well in new application domains. Unfortunately, each such system also represents a substantial engineering effort including a great deal of duplication of mechanisms for features such as transactions and caching. In this paper, we make the case for RodentStore, an adaptive and declarative storage system providing a high-level interface for describing the physical representation of data. Specifically, RodentStore uses a declarative storage algebra whereby administrators (or database design tools) specify how a logical schema should be grouped into collections of rows, columns, and/or arrays, and the order in which those groups should be laid out on disk. We describe the key operators and types of our algebra, outline the general architecture of RodentStore, which interprets algebraic expressions to generate a physical representation of the data, and describe the interface between RodentStore and other parts of a database system, such as the query optimizer and executor. We provide a case study of the potential use of RodentStore in representing dense geospatial data collected from a mobile sensor network, showing the ease with which different storage layouts can be expressed using some of our algebraic constructs and the potential performance gains that a RodentStore-built storage system can offer.
Light-weight, Runtime Verification of Query Sources
Cited by 1 (0 self)
Modern database systems increasingly make use of networked storage. This storage can be in the form of SANs or in the form of shared-nothing nodes in a cluster. One type of attack on databases is arbitrary modification of data in a database through the file system, bypassing database access control. Additionally, for many applications, ensuring strict and definite authenticity of query source and results is required or highly desirable. In this paper, we propose a lightweight approach for verifying the minimum information that a database server needs from the storage system to execute a query. The verification is definite and produces high confidence results because of its online manner (i.e., the information is verified right before it is used). It is lightweight in three ways: (1) We use the Merkle hash tree data structure and fast cryptographic hash functions to ensure the verification itself is fast and secure; (2) We verify the minimum number of bytes needed to ensure the authenticity of the source related to the query result; and (3) We achieve high concurrency of multiple reader and writer transactions and avoid delays due to locking by using the compare-and-swap primitive. We then prove the correctness and progress guarantees of the algorithms using concepts from the theory of distributed computing. We also analyze the performance of the algorithm. Finally, we perform a comprehensive empirical study on various parameter choices and on the system performance and concurrency with our approaches.
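The Merkle hash tree mentioned in this abstract can be illustrated with a short generic sketch: hash each storage block, hash pairs upward to a single root, and verify one block against the root using only its sibling hashes. This shows the general data structure, not the paper's algorithms; the use of SHA-256, the rule of pairing an odd node with itself, and the function names are assumptions for the example, and the concurrency aspects (compare-and-swap) are not covered.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA-256 digest; stands in for any fast cryptographic hash."""
    return hashlib.sha256(data).digest()


def build_levels(blocks):
    """Return all tree levels, leaves first; an odd node is paired with itself."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def proof_for(levels, index):
    """Collect sibling hashes from leaf to root for the block at `index`."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1  # neighbor within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(block, proof, root):
    """Recompute the path to the root using only the block and its proof."""
    digest = h(block)
    for sibling, sibling_is_left in proof:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == root
```

Verifying one block touches only a logarithmic number of hashes rather than the whole data set, which is the general property that makes this structure attractive for checking data right before it is used.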
Adaptive Physical Design for Curated Archives
Cited by 1 (0 self)
We introduce AdaptPD, an automated physical design tool that improves database performance by continuously monitoring changes in the workload and adapting the physical design to suit the incoming workload. Current physical design tools are offline and require specification of a representative workload. AdaptPD is “always on” and incorporates online algorithms which profile the incoming workload to calculate the relative benefit of transitioning to an alternative design. Efficient query and transition cost estimation modules allow AdaptPD to quickly decide between various design configurations. We evaluate AdaptPD with the SkyServer Astronomy database using queries submitted by SkyServer’s users. Experiments show that AdaptPD adapts to changes in the workload, improves query performance substantially over offline tools, and introduces minor computational overhead.

