Results 1 - 10
of
30
XMill: an Efficient Compressor for XML Data
, 1999
"... We describe a tool for compressing XML data, with applications in data exchange and archiving, which usually achieves about twice the compression ratio of gzip at roughly the same speed. The compressor, called XMill, incorporates and combines existing compressors in order to apply them to heterogene ..."
Abstract
-
Cited by 165 (0 self)
- Add to MetaCart
We describe a tool for compressing XML data, with applications in data exchange and archiving, which usually achieves about twice the compression ratio of gzip at roughly the same speed. The compressor, called XMill, incorporates and combines existing compressors in order to apply them to heterogeneous XML data: it uses zlib, the library function for gzip, a collection of datatype specific compressors for simple data types, and, possibly, user defined compressors for application specific data types. 1 Introduction We have implemented a compressor/decompressor for XML data, to be used in data exchange and archiving, that achieves about twice the compression rate of general-purpose compressors (gzip), at about the same speed. The tool can be downloaded from www.research.att.com/sw/tools/xmill/. XML is now being adopted by many organizations and industry groups, like the healthcare, banking, chemical, and telecommunications industries. The attraction in XML is that it is a self-describi...
Weaving Relations for Cache Performance
, 2001
"... Relational database systems have traditionally optimzed for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance is becoming increasingly important on m ..."
Abstract
-
Cited by 83 (14 self)
- Add to MetaCart
Relational database systems have traditionally optimzed for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance is becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results, when compared to NSM (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM's stall time due to data cache accesses, (b) range selection queries and updates on memoryresident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster.
Super-Scalar RAM-CPU Cache Compression
- In Proceedings of the International Conference of Data Engineering (IEEE ICDE
, 2006
"... CWI is a founding member of ERCIM, the European Research Consortium for Informatics and Mathematics. CWI's research has a theme-oriented structure and is grouped into four clusters. Listed below are the names of the clusters and in parentheses their acronyms. ..."
Abstract
-
Cited by 49 (12 self)
- Add to MetaCart
CWI is a founding member of ERCIM, the European Research Consortium for Informatics and Mathematics. CWI's research has a theme-oriented structure and is grouped into four clusters. Listed below are the names of the clusters and in parentheses their acronyms.
The Implementation and Performance of Compressed Databases
, 1998
"... In this paper, we show how compression can be integrated into a relational database system. Specifically, we describe how the storage manager, the query execution engine, and the query optimizer of a database system can be extended to deal with compressed data. Our main result is that compression ca ..."
Abstract
-
Cited by 44 (5 self)
- Add to MetaCart
In this paper, we show how compression can be integrated into a relational database system. Specifically, we describe how the storage manager, the query execution engine, and the query optimizer of a database system can be extended to deal with compressed data. Our main result is that compression can significantly improve the response time of queries if very light-weight compression techniques are used. We will present such light-weight compression techniques and give the results of running the TPC-D benchmark on a so compressed database and a non-compressed database using the AODB database system, an experimental database system that was developed at the Universities of Mannheim and Passau. Our benchmark results demonstrate that compression indeed offers high performance gains (up to 55%) for IO-intensive queries and moderate gains for CPU-intensive queries. Compression can, however, also increase the running time of certain update operations. In all, we recommend to extend today's da...
Improving Index Performance through Prefetching
, 2001
"... This paper proposes and evaluates Prefetching B -Trees#, which use prefetching to accelerate two important operations on B -Tree indices: searches and range scans. To accelerate searches, pB -Trees use prefetching to e#ectively create wider nodes than the natural data transfer size: e.g., ..."
Abstract
-
Cited by 33 (7 self)
- Add to MetaCart
This paper proposes and evaluates Prefetching B -Trees#, which use prefetching to accelerate two important operations on B -Tree indices: searches and range scans. To accelerate searches, pB -Trees use prefetching to e#ectively create wider nodes than the natural data transfer size: e.g., eight vs. one cache lines or disk pages. These wider nodes reduce the height of the B -Tree, thereby decreasing the number of expensive misses when going from parenttochild without signi#cantly increasing the cost of fetching a given node. Our results show that this technique speeds up search and update times by a factor of 1.2#1.5 for main-memory B -Trees. In addition, it outperforms and is complementary to #Cache-SensitiveB -Trees." To accelerate range scans, pB -Trees provide arrays of pointers to their leaf nodes. These allow the pB -Tree to prefetch arbitrarily far ahead, even for nonclustered indices, thereby hiding the normally expensive cache misses associated with traversing the leaves within the range. Our results show that this technique yields over a sixfold speedup on range scans of 1000+ keys. Although our experimental evaluation focuses on main memory databases, the techniques that we propose are also applicable to hiding disk latency.
The Link Database: Fast Access to Graphs of the Web
"... ... graph where URLs are nodes and hyperlinks are directed edges. The Link Database provides fast access to the hyperlinks. To support a wide range of graph algorithms, we find it important to fit the Link Database into memory. In the first version of the Link Database, we achieved this fit by using ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
... graph where URLs are nodes and hyperlinks are directed edges. The Link Database provides fast access to the hyperlinks. To support a wide range of graph algorithms, we find it important to fit the Link Database into memory. In the first version of the Link Database, we achieved this fit by using machines with lots of memory (8GB), and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the web graph. The first compression technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second technique takes advantage of the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6 billion-edge graph in under 20 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database, and provides performance numbers for compression ratios and decompression speed.
Query Optimization In Compressed Database Systems
- In ACM SIGMOD
, 2001
"... Over the lastd ecad es, improvements in CPU speed have outpaced improvements in main memory and d isk access rates by ord ers of magnitud , enabling the use ofd ata compression techniques to improve the performance ofd atabase systems. Previous work d scribes the benefits of compression for numerica ..."
Abstract
-
Cited by 28 (0 self)
- Add to MetaCart
Over the lastd ecad es, improvements in CPU speed have outpaced improvements in main memory and d isk access rates by ord ers of magnitud , enabling the use ofd ata compression techniques to improve the performance ofd atabase systems. Previous work d scribes the benefits of compression for numerical attributes, whered8 a is stored in compressed format ond isk. Despite the abund3& e of stringvalued attributes in relational schemas there is little work on compression for string attributes in ad atabase context. Moreover, none of the previous work suitablyad2 esses the role of the query optimizer: During query execution, dD a is either eagerly d compressed when it is read into main memory, or dD a lazily stays compressed in main memory and is d compressed ond emand only. In this paper, we present an e#ective approach for dD abase compression based on lightweight, attribute-level compression techniques. We propose a Hierarchical ictionary Encod ing strategy that intelligently selects the most e#ective compression method for string-valued attributes. We show that eager and lazy d compression strategies prod1 e suboptimal plans for queries involving compressed string attributes. We then formalize the problem of compressionaware query optimizationand propose one provably optimal and two fast heuristic algorithms for selecting a query plan for relational schemas with compressed attributes; our algorithms can easily be integrated into existing cost-based query optimizers. Experiments using TPC-Hd atad emonstrate the impact of our string compression method s and show the importance of compression-aware query optimization. Our approach results in up to an or d r speed up over existing approaches. 1.
Engineering the Compression of Massive Tables: An Experimental Approach
, 2000
"... We study the problem of compressing massive tables. We devise a novel compression paradigm---training for lossless compression--- which assumes that the data exhibit dependencies that can be learned by examining a small amount of training material. We develop an experimental methodology to test the ..."
Abstract
-
Cited by 25 (3 self)
- Add to MetaCart
We study the problem of compressing massive tables. We devise a novel compression paradigm---training for lossless compression--- which assumes that the data exhibit dependencies that can be learned by examining a small amount of training material. We develop an experimental methodology to test the approach. Our result is a system, pzip, which outperforms gzip by factors of two in compression size and both compression and uncompression time for various tabular data. Pzip is now in production use in an AT&T network traffic data warehouse.
Data Page Layouts for Relational Databases on Deep Memory Hierarchies
, 2002
"... Relational database systems have traditionally optimized for I/0 performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). ..."
Abstract
-
Cited by 24 (2 self)
- Add to MetaCart
Relational database systems have traditionally optimized for I/0 performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages).
Approximate Lineage for Probabilistic Databases
"... In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems s.a. Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In large databases lineage formulas can b ..."
Abstract
-
Cited by 17 (7 self)
- Add to MetaCart
In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems s.a. Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In large databases lineage formulas can become huge: in one public database (the Gene Ontology) we often observed 10MB of lineage (provenance) data for a single tuple. In this paper we propose to use approximate lineage, which is a much smaller formula keeping track of only the most important derivations, which the system can use to process queries and provide explanations. We discuss in detail two specific kinds of approximate lineage: (1) a conservative approximation called sufficient lineage that records the most important derivations for each tuple, and (2) polynomial lineage, which is more aggressive and can provide higher compression ratios, and which is based on Fourier approximations of Boolean expressions. In this paper we define approximate lineage formally, describe algorithms to compute approximate lineage and prove formally their error bounds, and validate our approach experimentally on a real data set. 1.

