## Engineering the Compression of Massive Tables: An Experimental Approach (2000)

### Download Links

- [www.research.att.com]
- [www2.research.att.com]
- [www.cs.princeton.edu]
- DBLP

### Other Repositories/Bibliography

Citations: 26 (2 self)

### BibTeX

```bibtex
@MISC{Buchsbaum00engineeringthe,
  author = {Adam L. Buchsbaum and Donald F. Caldwell and S. Muthukrishnan},
  title  = {Engineering the Compression of Massive Tables: An Experimental Approach},
  year   = {2000}
}
```

### Abstract

We study the problem of compressing massive tables. We devise a novel compression paradigm---training for lossless compression---which assumes that the data exhibit dependencies that can be learned by examining a small amount of training material. We develop an experimental methodology to test the approach. Our result is a system, pzip, which outperforms gzip by factors of two in compression size and in both compression and uncompression time for various tabular data. Pzip is now in production use in an AT&T network traffic data warehouse.
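The training paradigm the abstract describes can be sketched in a few lines: compress a small training sample under several candidate column groupings, keep the grouping that compresses best, and then apply that plan to the full table. This is a minimal illustration only, not pzip's actual implementation; the candidate plans, the column-major serialization, and the use of zlib as a stand-in base compressor are all assumptions.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Proxy cost: size under a generic Lempel-Ziv coder (zlib here)."""
    return len(zlib.compress(data, 9))

def plan_cost(rows, plan):
    """Cost of a plan = sum of compressed sizes of its column groups,
    each group serialized column-major."""
    total = 0
    for group in plan:
        blob = b"".join(row[c].encode() for c in group for row in rows)
        total += compressed_size(blob)
    return total

def train_plan(training_rows, candidate_plans):
    """Pick the column grouping that compresses the training sample best."""
    return min(candidate_plans, key=lambda p: plan_cost(training_rows, p))

# Toy table: column 1 duplicates column 0, so grouping them should win.
rows = [(str(i), str(i), "x") for i in range(500)]
plans = [
    [[0], [1], [2]],   # every column compressed independently
    [[0, 1], [2]],     # columns 0 and 1 combined
]
best = train_plan(rows[:100], plans)   # train on a small prefix only
```

The point of the sketch is the "train once, apply many times" structure: the plan is chosen on a sample, on the assumption that the dependencies it captures persist in the rest of the table.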

### Citations

1139 | A universal algorithm for sequential data compression
- Ziv, Lempel
- 1977
Citation Context: ...ning sets, and we compress test sets with respect to the plans. We compare the resulting compression to the straightforward approach of treating the tables as text and applying Lempel-Ziv compression [20, 21]. It will be clear that comparable performance would falsify our assumptions about the data dependencies. In all cases, however, our algorithms provide substantial compression improvements. While trai...

942 | A method for the construction of minimum redundancy codes
- Huffman
- 1952
Citation Context: ...in Section 1 are satisfied. From an information-theoretic point of view, the table can be treated as a string, e.g., of bytes in row-major order. It would thus suffice to perform Lempel-Ziv [20, 21] or Huffman [9] compression, yielding provably optimal asymptotic performance in terms of certain ergodic properties of the source that generates the table. This does not, however, adequately solve the table compres...

730 | Compression of individual sequences via variable-rate coding
- Ziv, Lempel
- 1978
Citation Context: ...ning sets, and we compress test sets with respect to the plans. We compare the resulting compression to the straightforward approach of treating the tables as text and applying Lempel-Ziv compression [20, 21]. It will be clear that comparable performance would falsify our assumptions about the data dependencies. In all cases, however, our algorithms provide substantial compression improvements. While trai...

664 | Arithmetic coding for data compression
- Witten, Neal, et al.
- 1987
Citation Context: ...hich have already been well optimized: e.g., compress [18, 21], gzip [20], and vdelta [10]. Each is fast, on-line, and well-suited to our application. Of other available compressors, we note that PPM [6, 19], which exploits context sensitivity and thus seems applicable to table data, and bzip [1] are too slow for our environment, although attempts have been made to tune PPM for speed at the expense of co...

565 | A Block-sorting Lossless Data Compression Algorithm
- Burrows, Wheeler
- 1994
Citation Context: ...ch is fast, on-line, and well-suited to our application. Of other available compressors, we note that PPM [6, 19], which exploits context sensitivity and thus seems applicable to table data, and bzip [1] are too slow for our environment, although attempts have been made to tune PPM for speed at the expense of compression size [16]. We therefore do not use bzip and PPM in our compression scheme, but w...

414 | A technique for high performance data compression
- Welch
- 1984
Citation Context: ...ted the base problem of computing ... and .... Rather than develop our own base compression method, we decided to use one of the standard programs, which have already been well optimized: e.g., compress [18, 21], gzip [20], and vdelta [10]. Each is fast, on-line, and well-suited to our application. Of other available compressors, we note that PPM [6, 19], which exploits context sensitivity and thus seems app...

330 | Data compression using adaptive coding and partial string matching
- Cleary, Witten
- 1984
Citation Context: ...hich have already been well optimized: e.g., compress [18, 21], gzip [20], and vdelta [10]. Each is fast, on-line, and well-suited to our application. Of other available compressors, we note that PPM [6, 19], which exploits context sensitivity and thus seems applicable to table data, and bzip [1] are too slow for our environment, although attempts have been made to tune PPM for speed at the expense of co...

319 | Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation
- Jain, Vazirani
Citation Context: ...set ... and iterate. Otherwise we are done. We call this greedy differential compression. The final solution is roughly ...-optimal under the metric assumption [14]. Better approximations [3, 4, 5, 11] are known, but the greedy algorithm suffices for our purpose of testing the presence of differential dependencies. Greedy differential compression produces the following compression plan: compress ea...

261 | Approximation algorithms for facility location problems
- Shmoys, Tardos, et al.
- 1997
Citation Context: ...s and those ... as derived columns. Given a mapping ..., we define the cost to be ... The goal is to find a pair of minimum cost. This is precisely the facility location problem [17]. We will assume that the differential cost is a metric. In general, this depends on the base compressor. We apply the simple, greedy algorithm for this problem [14]. At any time, we have a candidate ...

223 | Linear Prediction of Speech
- Markel, Gray
- 1976
Citation Context: ...sumptions about the data dependencies. In all cases, however, our algorithms provide substantial compression improvements. While training has been applied to lossy compression, e.g., in speech coding [12, 15], ours is the first known instance of applying training to lossless compression. For our primary application, compression exploiting implicit dependencies outperformed that using explicit dependencies...

205 | Improved combinatorial algorithms for facility location problems
- Charikar, Guha
Citation Context: ...set ... and iterate. Otherwise we are done. We call this greedy differential compression. The final solution is roughly ...-optimal under the metric assumption [14]. Better approximations [3, 4, 5, 11] are known, but the greedy algorithm suffices for our purpose of testing the presence of differential dependencies. Greedy differential compression produces the following compression plan: compress ea...

122 | Improved approximation algorithms for uncapacitated facility location
- Chudak
- 1998
Citation Context: ...set ... and iterate. Otherwise we are done. We call this greedy differential compression. The final solution is roughly ...-optimal under the metric assumption [14]. Better approximations [3, 4, 5, 11] are known, but the greedy algorithm suffices for our purpose of testing the presence of differential dependencies. Greedy differential compression produces the following compression plan: compress ea...

117 | Implementing the PPM data compression scheme
- Moffat
- 1990
Citation Context: ...s context sensitivity and thus seems applicable to table data, and bzip [1] are too slow for our environment, although attempts have been made to tune PPM for speed at the expense of compression size [16]. We therefore do not use bzip and PPM in our compression scheme, but we do compare our scheme against bzip and PPM by themselves. We note but do not consider in this paper hybrid approaches, in which...

43 | Compressing relations and indexes
- Goldstein, Ramakrishnan, et al.
- 1998
Citation Context: ...ich anticipates processing 1 TB of satellite images every two weeks.) Finally, the approaches to database compression include lightweight techniques such as compressing each tuple by simple encodings [7, 8] and tiling the entire table [8]. These approaches are not appropriate for table compression: the former is too wasteful, and the latter too expensive and cumbersome. Our contribution is a novel appro...

30 | An Empirical Study of Delta Algorithms
- Hunt, Vo, et al.
- 1996
Citation Context: ...g ... and .... Rather than develop our own base compression method, we decided to use one of the standard programs, which have already been well optimized: e.g., compress [18, 21], gzip [20], and vdelta [10]. Each is fast, on-line, and well-suited to our application. Of other available compressors, we note that PPM [6, 19], which exploits context sensitivity and thus seems applicable to table data, and b...

19 | Architecture and Design of Storage and Data Management for the NASA
- Kobler, Berbert, et al.
- 1995
Citation Context: ...several string fields of variable length; table data are more homogeneous, with fixed field lengths. Also, non-tabular databases are not routinely TBs in size. (An exception is NASA’s EOSDIS database [13], which anticipates processing 1 TB of satellite images every two weeks.) Finally, the approaches to database compression include lightweight techniques such as compressing each tuple by simple encodi...

7 | Distortion Performance of Vector Quantization for LPC Voice Coding
- Juang, Wang, et al.
Citation Context: ...sumptions about the data dependencies. In all cases, however, our algorithms provide substantial compression improvements. While training has been applied to lossy compression, e.g., in speech coding [12, 15], ours is the first known instance of applying training to lossless compression. For our primary application, compression exploiting implicit dependencies outperformed that using explicit dependencies...

4 | Data compression in a data base system
- Cormack
- 1985
Citation Context: ...m our table compression problem in many ways. First, the goals are different. Database compression stresses the preservation of indexing—the ability to retrieve an arbitrary record—under compression [7]. Table compression does not require indexing to be preserved. Next, the data are different. Database records are often dynamic, unlike table data, which have a write-once discipline. Databases consis...

1 | Analysis of a local search heuristic for facility location problems
- Korupolu, Plaxton, et al.
- 1998
Citation Context: ...sely the facility location problem [17]. We will assume that the differential cost is a metric. In general, this depends on the base compressor. We apply the simple, greedy algorithm for this problem [14]. At any time, we have a candidate pair ... We determine the smallest-cost solution obtained by (1) removing a column from ..., (2) adding a column to ..., or (3) substituting one o...
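The greedy local search described in the contexts above (remove a base column, add one, or swap one, stopping when no move improves the cost) can be illustrated as follows. Everything here is a hypothetical reconstruction rather than the paper's algorithm: the differential cost is approximated by the extra bytes a column adds to a base column's compressed stream, with zlib standing in for the base compressor.

```python
import zlib

def csize(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def diff_cost(base: bytes, col: bytes) -> int:
    # Stand-in differential cost: extra compressed bytes needed to
    # encode `col` once `base` has already been compressed.
    return csize(base + col) - csize(base)

def plan_cost(cols, bases):
    # Base columns pay their full compressed size; every other column
    # pays its cheapest differential cost against some base column.
    total = sum(csize(cols[b]) for b in bases)
    for i, col in enumerate(cols):
        if i not in bases:
            total += min(diff_cost(cols[b], col) for b in bases)
    return total

def greedy_differential(cols):
    """Local search over sets of base columns: try removing, adding, or
    swapping one column at a time; stop when no move lowers the cost."""
    bases = frozenset({0})
    cost = plan_cost(cols, bases)
    improved = True
    while improved:
        improved = False
        moves = []
        for i in range(len(cols)):
            if i in bases and len(bases) > 1:
                moves.append(bases - {i})              # remove a column
            elif i not in bases:
                moves.append(bases | {i})              # add a column
        for i in bases:
            for j in range(len(cols)):
                if j not in bases:
                    moves.append((bases - {i}) | {j})  # swap two columns
        for m in moves:
            c = plan_cost(cols, m)
            if c < cost:
                bases, cost, improved = m, c, True
    return set(bases), cost
```

Since the search starts from a single base column and only ever accepts improving moves, the final cost is never worse than compressing everything against column 0; whether it approaches the optimum depends on the metric assumption the snippet mentions.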