## Bayesian Networks for Lossless Dataset Compression (1999)

Venue: Conference on Knowledge Discovery in Databases (KDD)

Citations: 7 (2 self)

### BibTeX

@INPROCEEDINGS{Davies99bayesiannetworks,
  author    = {Scott Davies and Andrew Moore},
  title     = {Bayesian Networks for Lossless Dataset Compression},
  booktitle = {Conference on Knowledge Discovery in Databases (KDD)},
  year      = {1999}
}

### Abstract

The recent explosion in research on probabilistic data mining algorithms such as Bayesian networks has been focused primarily on their use in diagnostics, prediction and efficient inference. In this paper, we examine the use of Bayesian networks for a different purpose: lossless compression of large datasets. We present algorithms for automatically learning Bayesian networks and new structures called "Huffman networks" that model statistical relationships in the datasets, and algorithms for using these models to then compress the datasets. These algorithms often achieve significantly better compression ratios than those achieved with common dictionary-based algorithms such as those used by programs like ZIP.

### Citations

7440 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988

Citation Context: ...uch a model would typically be useless for compression, since it would usually require as much space as D itself. What kinds of probabilistic models might be useful for compression? Bayesian networks [12], also commonly known as belief networks, are a popular class of probabilistic models that work well in conjunction with compression, although they are primarily used for data analysis and decision-ma...

2689 | Estimating the dimension of a model
- Schwarz
- 1978

Citation Context: ...search procedure to find a network B that maximizes (or at least hopefully comes close to maximizing) a scoring function C(B;D). A popular scoring function is the Bayesian Information Criterion (BIC) [15], C(B;D) = log P(D | B) − 0.5 |B| log k, where |B| is the number of parameters (probabilities) stored in the net and k is the number of records in the dataset. 4 Using Bayesian networks with arit...
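The BIC trade-off in the snippet above can be sketched as a one-line scoring function. This is a minimal illustration, not the paper's code; the function name and the toy numbers are hypothetical.

```python
from math import log

def bic_score(log_likelihood, num_params, num_records):
    """BIC as described above: C(B;D) = log P(D|B) - 0.5 * |B| * log k,
    i.e. fit minus a complexity penalty that grows with the parameter
    count |B| and the dataset size k."""
    return log_likelihood - 0.5 * num_params * log(num_records)

# A network with twice the parameters must buy a much higher likelihood
# to win; here the simpler model scores better despite a slightly worse fit.
simple = bic_score(-1000.0, num_params=10, num_records=5000)
complex_ = bic_score(-999.0, num_params=20, num_records=5000)
```

With k = 5000 records each extra parameter costs about 4.26 nats, so the extra 10 parameters would need to improve the log-likelihood by roughly 42.6 to pay for themselves.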

1036 | A method for the construction of minimum-redundancy codes
- Huffman
- 1952

Citation Context: ...unmanageably large dictionaries in order to do so. Huffman Coding Given a small discrete set of source symbols and their associated probabilities, a simple greedy algorithm developed by David Huffman [7] can be used to find an optimal code with which to encode these source symbols on an individual basis. However, if the probability for one particular source symbol is very high (theoretically only nee...
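Huffman's greedy merge can be sketched in a few lines (a minimal illustration; the function name and example string are hypothetical, not from the paper):

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix code by repeatedly merging the two least-probable
    subtrees, prepending a bit to every codeword in each."""
    # Heap entries: (weight, unique tiebreak, {symbol: codeword-so-far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

freqs = Counter("abracadabra")   # a:5, b:2, r:2, c:1, d:1
code = huffman_code(freqs)
```

The most frequent symbol gets the shortest codeword, but never shorter than one bit, which is exactly the inefficiency the snippet goes on to describe.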

688 | Arithmetic coding for data compression
- Witten, Neal, et al.
- 1987

Citation Context: ... can be inefficient, as the code requires at least one bit for each source symbol encoded. Arithmetic Coding Arithmetic coding (developed by Rissanen [13] and Pasco [11]; see Witten, Neal, and Cleary [16] for a tutorial) allows sequences of symbols to be encoded nearly optimally (in the limit as k increases) without requiring the enumeration of all possible source code sequences of length k. Arithmeti...
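The interval-narrowing idea behind arithmetic coding can be sketched as follows (a simplified, non-bit-level illustration; the function name and the toy distribution are hypothetical). The width of the final interval is the product of the symbol probabilities, so a very likely symbol costs only a fraction of a bit:

```python
import math

def arithmetic_interval(symbols, probs):
    """Narrow [low, high) once per symbol; encoding the message needs
    roughly -log2(high - low) bits in total."""
    # Cumulative distribution over the source alphabet
    cum, acc = {}, 0.0
    for s, p in probs.items():
        cum[s] = (acc, acc + p)
        acc += p
    low, high = 0.0, 1.0
    for s in symbols:
        lo, hi = cum[s]
        span = high - low
        low, high = low + span * lo, low + span * hi
    return low, high

probs = {"a": 0.95, "b": 0.05}
low, high = arithmetic_interval("aaaaaaaaaa", probs)
bits = -math.log2(high - low)   # ten 0.95-probability symbols: ~0.74 bits total
```

Ten symbols fit in under one bit total, whereas a Huffman code would spend at least ten bits; this is the gap the snippet describes.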

682 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
Citation Context: ... similar to an algorithm previously used by Sahami for classification [14]. In the special case where c is 1, this algorithm reduces to a penalized version of Chow and Liu's dependency-tree algorithm [3], and is provably optimal. While the network chosen by this greedy algorithm won't generally be as accurate as one found via a more thorough search, this algorithm has the advantage of being more comp...
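The Chow-Liu construction referenced above weights each attribute pair by empirical mutual information and keeps a maximum-weight spanning tree. A minimal sketch (function names and the toy dataset are hypothetical, not from the paper):

```python
from collections import Counter
from itertools import combinations
from math import log2

def chow_liu_edges(records):
    """Chow-Liu dependency tree: score pairs by empirical mutual
    information, then run Kruskal's algorithm for a max spanning tree."""
    n, k = len(records[0]), len(records)

    def mi(i, j):
        pi = Counter(r[i] for r in records)
        pj = Counter(r[j] for r in records)
        pij = Counter((r[i], r[j]) for r in records)
        return sum((c / k) * log2(c * k / (pi[a] * pj[b]))
                   for (a, b), c in pij.items())

    edges = sorted(((mi(i, j), i, j) for i, j in combinations(range(n), 2)),
                   reverse=True)
    # Union-find keeps the heaviest edges that don't form a cycle
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Attributes 0 and 1 are perfect copies; attribute 2 is independent
records = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
tree = chow_liu_edges(records)
```

The tree always links the strongly dependent pair (0, 1), since its mutual information is 1 bit while the other pairs score 0.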

275 | Graphical Models for Machine Learning and Digital Communication
- Frey
- 1998

Citation Context: ...will be examined in [4]. 6 Concluding remarks Related Work Automatically-learned Bayesian networks have been used previously in conjunction with arithmetic encoding in recent research by Brendan Frey [6]. Rather than learning the structure of the Bayesian network, Frey uses a fixed network structure in which each node has many parents; the probability of each node given its parents is parameterized u...

199 | Learning Bayesian belief networks: An approach based on the MDL principle
- Lam, Bacchus
- 1994

Citation Context: ...hat are good for compression. This "minimum description length" (or MDL) approach has also been used for learning Bayesian networks in cases where compression is not necessarily the primary objective [8]. Bayesian networks are straightforward to use with arithmetic coding. To encode a record I of the dataset with a Bayesian network B, one treats each of the values in I as an individual "source symbol...
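The encoding step described above, treating each attribute value in a record as a source symbol whose model probability is conditioned on its parents, can be sketched by computing a record's ideal code length. The function name, the table layout, and the toy network are hypothetical:

```python
from math import log2

def record_code_length(record, parents, cpts):
    """Bits needed to arithmetic-code one record under a Bayesian net:
    each value contributes -log2 of its conditional probability given
    the values of its parent attributes in the same record."""
    bits = 0.0
    for i, value in enumerate(record):
        context = tuple(record[p] for p in parents[i])
        bits += -log2(cpts[i][context][value])
    return bits

# Toy net over two binary attributes: X0 -> X1
parents = {0: (), 1: (0,)}
cpts = {
    0: {(): {0: 0.5, 1: 0.5}},
    1: {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.1, 1: 0.9}},
}
bits = record_code_length((0, 0), parents, cpts)  # 1 bit + -log2(0.9)
```

Because X1 is highly predictable from X0, the second value costs only about 0.15 bits, which is where the compression gain over attribute-independent coding comes from.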

161 | Learning Bayesian networks is NP-complete
- Chickering
- 1995

Citation Context: ...l in B's probability tables optimally: namely, we simply use the empirical distributions appearing in D. However, even with complete data, the problem of finding the best network structure is NP-hard [2]. Learning a Bayesian network is thus typically done by using a search procedure to find a network B that maximizes (or at least hopefully comes close to maximizing) a scoring function C(B;D). A popul...

145 | Arithmetic coding revisited
- Moffat, Neal, et al.
- 1998

Citation Context: ... In conjunction with the Bayesian network learning algorithms discussed above, we used a limited-precision arithmetic coding library written by Carpinelli et al. [1] based on a paper by Moffat et al. [9]. We compare the compression performance of arithmetic coding with Bayesian networks (using the best of the two algorithms described above) and Dynamic Bayesian networks with the performance of gzip a...

127 | Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
- Moore, Lee
- 1998

Citation Context: ...s for learning Bayesian networks. The first algorithm uses a form of stochastic hillclimbing over possible network structures using the Bayesian Information Criterion as its scoring function. ADTrees [10] are used to speed up this search by decreasing the amount of time necessary to calculate the dataset statistics required for the search. The second algorithm takes two sweeps through the dataset. In ...

112 | Generalized Kraft inequality and arithmetic coding
- Rissanen
- 1976

Citation Context: ... only needing a fraction of a bit), Huffman coding can be inefficient, as the code requires at least one bit for each source symbol encoded. Arithmetic Coding Arithmetic coding (developed by Rissanen [13] and Pasco [11]; see Witten, Neal, and Cleary [16] for a tutorial) allows sequences of symbols to be encoded nearly optimally (in the limit as k increases) without requiring the enumeration of all pos...

112 | Learning limited-dependence Bayesian classifiers
- Sahami
- 1996

Citation Context: ...er the dataset to fill in the probability tables of the resulting network. See [4] for further details. This algorithm is somewhat similar to an algorithm previously used by Sahami for classification [14]. In the special case where c is 1, this algorithm reduces to a penalized version of Chow and Liu's dependency-tree algorithm [3], and is provably optimal. While the network chosen by this greedy algo...

111 | Probabilistic temporal reasoning
- Dean, Kanazawa
- 1988

Citation Context: ...ish to scan through the dataset sequentially, we can take advantage of potential correlations between the ith record and i + 1th record to obtain further compression. We use Dynamic Bayesian Networks [5] to seek out and learn such correlations and exploit them in a tighter encoding. These networks are learned with a greedy algorithm similar to the greedy Bayesian network-learning algorithm described i...

56 | Source Coding Algorithms for Fast Data Compression
- Pasco
- 1976

Citation Context: ... fraction of a bit), Huffman coding can be inefficient, as the code requires at least one bit for each source symbol encoded. Arithmetic Coding Arithmetic coding (developed by Rissanen [13] and Pasco [11]; see Witten, Neal, and Cleary [16] for a tutorial) allows sequences of symbols to be encoded nearly optimally (in the limit as k increases) without requiring the enumeration of all possible source co...

41 | Coding theorems for individual sequences
- Ziv
- 1978

Citation Context: ...e LZ77 algorithm [18] (employed by gzip), which uses a sliding window of previously encoded symbols as its dictionary. These algorithms can be shown to achieve asymptotically optimal compression rates [17]; however, they may require the use of unmanageably large dictionaries in order to do so. Huffman Coding Given a small discrete set of source symbols and their associated probabilities, a simple greed...

36 | A universal algorithm for data compression
- Ziv, Lempel
- 1977

Citation Context: ...e that appears in the dictionary, that sequence's position in the dictionary is encoded rather than the individual symbols themselves. An example of a dictionary-based algorithm is the LZ77 algorithm [18] (employed by gzip), which uses a sliding window of previously encoded symbols as its dictionary. These algorithms can be shown to achieve asymptotically optimal compression rates [17]; however, they m...
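The sliding-window dictionary idea behind LZ77 can be sketched as a greedy tokenizer (a simplified illustration with a tiny window, not gzip's implementation; the function name is hypothetical):

```python
def lz77_tokens(data, window=16):
    """Greedy LZ77 sketch: emit (offset, length, next_char) triples,
    where offset/length point back into the sliding window of recently
    seen symbols. For simplicity the last character is always emitted
    as a literal, so every token carries a next_char."""
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one short of the end so a literal always remains
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        tokens.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens

tokens = lz77_tokens("abababab")
```

The repeated "ab" pattern collapses into a single back-reference; note the match at offset 2 is allowed to overlap the position being encoded, which is how LZ77 handles runs.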

2 | Bit Based Compression Using Arithmetic Coding. Available for download at ftp://munnari.oz.au/pub/arith coder
- Word
- 1995

Citation Context: ...th arithmetic coding on real datasets. In conjunction with the Bayesian network learning algorithms discussed above, we used a limited-precision arithmetic coding library written by Carpinelli et al. [1] based on a paper by Moffat et al. [9]. We compare the compression performance of arithmetic coding with Bayesian networks (using the best of the two algorithms described above) and Dynamic Bayesian n...