## Common pitfalls using normalized compression distance: what to watch out for in a compressor (2005)

Venue: Communications in Information and Systems

Citations: 10 (3 self)

### BibTeX

@ARTICLE{Án05commonpitfalls,

author = {Manuel Cebrián and Manuel Alfonseca and Alfonso Ortega},

title = {Common pitfalls using normalized compression distance: what to watch out for in a compressor},

journal = {Communications in Information and Systems},

year = {2005},

volume = {5},

pages = {367--384}

}

### Abstract

Using the mathematical background for algorithmic complexity developed by Kolmogorov in the sixties, Cilibrasi and Vitányi have designed a similarity distance, the normalized compression distance, applicable to the clustering of objects of any kind, such as music, texts or gene sequences. The normalized compression distance is a quasi-universal normalized admissible distance under certain conditions. This paper shows that the compressors used to compute the normalized compression distance are not idempotent in some cases: they are strongly skewed by the size of the objects and by the window size, which causes a deviation in the identity property of the distance unless we take care that the objects to be compressed fit the windows. The relationship between the precision of the distance and the size of the objects has been analyzed for several well-known compressors, and especially in depth for three cases, bzip2, gzip and PPMZ, which are examples of the three main types of compressors: block-sorting, Lempel-Ziv, and statistical.
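The identity deviation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' experimental setup: it assumes the usual NCD formula from Cilibrasi and Vitányi, takes C(.) to be the compressed length in bytes, and uses Python's `zlib` (DEFLATE, 32 KiB window) as a stand-in for gzip. The helper `pseudo_random_bytes` and the chosen object sizes are hypothetical.

```python
import bz2
import random
import zlib
from typing import Callable


def ncd(x: bytes, y: bytes,
        compress: Callable[[bytes], bytes] = zlib.compress) -> float:
    """Normalized compression distance, with C(.) = compressed length in bytes."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def pseudo_random_bytes(n: int, seed: int = 0) -> bytes:
    """Deterministic, nearly incompressible test data (hypothetical helper)."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")


if __name__ == "__main__":
    # NCD(x, x) should ideally be 0; watch it grow once x + x no longer
    # fits the compressor's window (zlib) or block (bz2).
    for size in (4 * 1024, 16 * 1024, 64 * 1024, 256 * 1024):
        x = pseudo_random_bytes(size)
        print(f"{size:>7} bytes"
              f"  zlib NCD(x, x) = {ncd(x, x):.3f}"
              f"  bz2 NCD(x, x) = {ncd(x, x, bz2.compress):.3f}")
```

Any function mapping `bytes` to `bytes` can be passed as `compress`, so the same sweep can be repeated for other compressors and window settings.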

### Citations

1777 | An introduction to Kolmogorov complexity and its applications
- Li, Vitányi
- 1997
Citation Context ...al Turing machine, outputs x and halts, when it is fed with input y. The Kolmogorov Complexity K(x) is the Conditional Kolmogorov Complexity for y = λ. K(x|y) and K(x) are both incomputable. Reference [4] provides a more detailed treatment of algorithmic information theory. Definition 4. A normalized admissible distance f is quasi-universal if for every computable normalized admissible distance d and ...

1210 | A universal algorithm for sequential data compression
- Ziv, Lempel
- 1977
Citation Context ...l the distance saturates at 1. We’ll call again the two zones weak dependency and strong dependency, for analogy with the previous section. The kernel of gzip [8] uses a variant of the LZ77 algorithm [9] for preprocessing and a statistical compressor (usually Huffman) as post-processing. The skew with the object size is fully explained by the compression scheme of the LZ77 algorithm. As in the prev...

642 | Text compression
- Bell, Cleary, et al.
- 1990
Citation Context ..., while file pic is highly compressible, because of large amounts of white space in the picture, represented by long runs of zeros. More reasons for choosing this benchmark are explained in reference [5]. (Footnote 1: All the experiments published in [2] were performed using this toolkit. Footnote 2: Available on the Internet at http://www.complearn.org) Table ...

591 | A Block-sorting Lossless Data Compression Algorithm
- Burrows, Wheeler
- 1994
Citation Context ...objects add to more than 900 Kbytes, the catenated object is divided into parts smaller than 900 Kbytes before being compressed. A more detailed explanation of the algorithms in bzip2 can be found in [6]. Let’s start with the weak dependency, which can be observed in the [1 Kbyte, 450 Kbytes] interval, whose upper bound is exactly half the size of the block. In this zone, the size of the catenated objects is smaller than...

352 | Data compression using adaptive coding and partial string matching
- Cleary, Witten
- 1984
Citation Context ...: C(xy) ≥ C(x). 3. Symmetry: C(xy) = C(yx). 4. Distributivity: C(xy) + C(z) ≤ C(xz) + C(yz). Definition 2. A distance d(x, y) is a normalized admissible distance or similarity distance if it takes values in [0, 1] and satisfies the following conditions for all objects x, y, z: 1. Identity: d(x, y) = 0 if x = y. 2. Symmetry: d(x, y) = d(y, x). 3. Triang...
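The compressor conditions quoted in this excerpt can be probed empirically. The sketch below is only an illustration under stated assumptions: the idempotency condition is cut off in the excerpt, so the usual form C(xx) = C(x) is assumed; C(.) is measured as compressed length in bytes; and the function name `normality_gaps` is hypothetical. It reports how far a given compressor is from each condition rather than asserting that any of them holds.

```python
import zlib
from typing import Callable, Dict


def normality_gaps(x: bytes, y: bytes, z: bytes,
                   compress: Callable[[bytes], bytes] = zlib.compress
                   ) -> Dict[str, int]:
    """Return, in bytes, how far `compress` is from each quoted condition."""

    def c(s: bytes) -> int:
        return len(compress(s))

    return {
        # Assumed form of idempotency: C(xx) = C(x), so the ideal gap is 0.
        "idempotency": c(x + x) - c(x),
        # Monotonicity: C(xy) >= C(x), so this should be non-negative.
        "monotonicity": c(x + y) - c(x),
        # Symmetry: C(xy) = C(yx), so the ideal gap is 0.
        "symmetry": abs(c(x + y) - c(y + x)),
        # Distributivity: C(xy) + C(z) <= C(xz) + C(yz), so this slack
        # should be non-negative.
        "distributivity": c(x + z) + c(y + z) - (c(x + y) + c(z)),
    }


if __name__ == "__main__":
    sample = normality_gaps(b"abc" * 200, b"xyz" * 200, b"abcxyz" * 100)
    for name, gap in sample.items():
        print(f"{name:>14}: {gap}")
```

Real compressors typically satisfy these conditions only up to small additive terms, so small non-zero gaps are expected; large idempotency gaps are the symptom the paper is concerned with.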

208 | The similarity metrics
- Li, Chen, et al.
- 2003
Citation Context ... compressor must hold to be useful in the computation of the NCD, and by giving formal expression to the quality of the distance in comparison with an ideal distance proposed by Vitányi and others in [3]. In this paper we show that the ∗Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, E-mail: {manuel.cebrian, manuel.alfonseca, alfonso.ortega}@uam.e...

193 | Clustering by compression
- Cilibrasi
- 2003
Citation Context ...ll, it means that a lot of information contained in x can be used to code y, following the similarity conditions described in the previous paragraph. This was formalized by Rudi Cilibrasi and Paul Vitányi [2], giving rise to the concept of normalized compression distance (NCD), which is based on the use of compressors to provide a measure of the similarity between the objects. This distance may then be us...

25 | A Method for The Construction of Minimum Redundancy
- Huffman
Citation Context ...s applied and the output is “20103040400030” (see [6]). Huffman coding: The frequencies of the characters are measured as 0:8, 1:1, 2:1, 3:2, 4:2 and the compressed string is built using 26 bits (see [7]). Using the same scheme, the string “drdobbs” is compressed using 17 bits, so the distance is NCD = (26 − 17)/17 = 0.529. Now another symbol “w” is added to the string, so that the new string whose distan...
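The arithmetic in this last excerpt matches the usual NCD definition. As a worked check, assuming that the 26-bit figure is the compressed size of the string concatenated with itself and that 17 bits is the compressed size of "drdobbs" alone (so x = y):

```latex
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}},
\qquad
\mathrm{NCD}(x, x) = \frac{26 - 17}{17} = \frac{9}{17} \approx 0.529 .
```

An idempotent compressor would give C(xx) = C(x) and hence a distance of 0, so the value 0.529 is a direct measure of the identity deviation in this toy case.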