## Compression of Correlated Bit-Vectors (1990)

Venue: | Information Systems |

Citations: | 27 - 2 self |

### BibTeX

@ARTICLE{Bookstein90compressionof,
    author = {A. Bookstein and S.T. Klein},
    title = {Compression of Correlated Bit-Vectors},
    journal = {Information Systems},
    year = {1990},
    volume = {16},
    pages = {387--400}
}

### Abstract

Bitmaps are data structures occurring often in information retrieval. They are useful; they are also large and expensive to store. For this reason, considerable effort has been devoted to finding techniques for compressing them. These techniques are most effective for sparse bitmaps. We propose a preprocessing stage, in which bitmaps are first clustered and the clusters used to transform their member bitmaps into sparser ones that can be more effectively compressed. The clustering method efficiently generates a graph structure on the bitmaps. In some situations, it is desired to impose restrictions on the graph; finding the optimal graph satisfying these restrictions is shown to be NP-complete. The results of applying our algorithm to the Bible are presented: for some sets of bitmaps, our method almost doubled the compression savings.

1. Introduction

Textual Information Retrieval Systems (IRS) are voracious consumers of computer storage resources. Most conspicuous, of course, is the...
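The preprocessing idea in the abstract (use related bitmaps to turn each member into a sparser one) can be illustrated with a small sketch. The abstract does not spell out the transform, so the XOR-with-parent step below is an assumption, as are the names `sparsify`, `restore`, and the `parent` array describing the graph structure; bitmaps are modelled as Python integers.

```python
from typing import List, Optional


def sparsify(bitmaps: List[int], parent: List[Optional[int]]) -> List[int]:
    """Assumed transform: replace each non-root bitmap by its XOR with its parent."""
    return [b if p is None else b ^ bitmaps[p] for b, p in zip(bitmaps, parent)]


def restore(deltas: List[int], parent: List[Optional[int]]) -> List[int]:
    """Invert sparsify by resolving each bitmap from its root downwards."""
    out: List[Optional[int]] = [None] * len(deltas)

    def resolve(i: int) -> int:
        if out[i] is None:
            p = parent[i]
            out[i] = deltas[i] if p is None else deltas[i] ^ resolve(p)
        return out[i]

    return [resolve(i) for i in range(len(deltas))]


def ones(bitmaps: List[int]) -> int:
    """Total number of 1-bits over all bitmaps."""
    return sum(bin(b).count("1") for b in bitmaps)


bitmaps = [0b11110000, 0b11110001, 0b01110001]
parent = [None, 0, 1]  # hypothetical tree: bitmap 0 is the root
deltas = sparsify(bitmaps, parent)
```

In this toy input the three correlated bitmaps hold 13 one-bits in total, while the transformed set holds 6, and `restore` recovers the originals exactly. Any sparse-bitmap compressor can then be applied to `deltas`.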

### Citations

11193 |
Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979
Citation Context ...n (1) can be checked in time $O(n)$, conditions (2) and (4) in time $O(m\ell)$ and condition (3) in time $O(n^2 \ell)$. Thus HDC $\in$ NP. For the reduction, we use the following problem known to be NP-complete (see [8]): 3DM (3 Dimensional Matching): Instance: A set $M \subseteq W \times X \times Y$, where $W$, $X$ and $Y$ are disjoint sets having the same number $q$ of elements. Question: Does $M$ contain a matching, that is, a su... |

461 |
On the shortest spanning subtree of a graph and the traveling salesman problem
- Kruskal
- 1956
Citation Context ...f the oriented trees in G. G is the optimal forest we were seeking. Many algorithms for finding a MST for a non-directed graph appear in the literature, ranging from Kruskal's simple greedy algorithm [12], which has in our case complexity $O(m^2 \log m)$, to Yao's more involved technique [20], which would need $O(m^2 \log \log m)$ operations for our application. 3.2 Algorithm statement Summarizing, we sugges... |
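The citation context above mentions running Kruskal's greedy algorithm on the graph of bitmaps. A minimal sketch follows, assuming the graph is the complete graph on the m bitmaps with Hamming distance as edge weight; that weight and the function name `hamming_mst` are assumptions, since the context shown here does not define them. Sorting the roughly $m^2/2$ edges is what gives the quoted $O(m^2 \log m)$ cost.

```python
from itertools import combinations
from typing import List, Tuple


def hamming_mst(bitmaps: List[int]) -> List[Tuple[int, int, int]]:
    """Kruskal's algorithm on the complete graph; returns edges as (i, j, weight)."""
    m = len(bitmaps)
    # All pairwise edges, weighted by the number of differing bit positions.
    edges = sorted(
        (bin(bitmaps[i] ^ bitmaps[j]).count("1"), i, j)
        for i, j in combinations(range(m), 2)
    )
    root = list(range(m))  # union-find forest

    def find(x: int) -> int:
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so keep it
            root[ri] = rj
            tree.append((i, j, w))
    return tree


tree = hamming_mst([0b1111, 0b1110, 0b0001, 0b0011])
```

The sketch returns an undirected tree; orienting it from a chosen root, as the "oriented trees" in the context suggest, would be a separate step.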

158 |
Data Compression: Methods and Theory
- Storer
- 1988
Citation Context ...ide range of data structures must be sought for the efficient operation of such systems [10]. To date, most attention has been given to, and progress made in, the area of text compression ([2], [13], [16]). In this paper, we shall describe and examine the possibilities of compressing bitmaps, a data structure often proposed for improving the performance of retrieval systems ([6], [18]). Bitmaps occur ... |

107 |
Signature files: An access method for documents and its analytical performance evaluation
- Faloutsos, Christodoulakis
- 1984
Citation Context ...mpression ([2], [13], [16]). In this paper, we shall describe and examine the possibilities of compressing bitmaps, a data structure often proposed for improving the performance of retrieval systems ([6], [18]). Bitmaps occur often in information retrieval. They can represent the occurrences of a word in the sentences or paragraphs making up a text; they can indicate the documents associated with an ... |

84 |
The Art of Computer Programming, Vol. I: Fundamental Algorithms
- Knuth
- 1997 |

41 |
A Compression Method for Clustered Bit-Vectors
- Teuhola
- 1978
Citation Context ...he space allocated for each run must be adequate for the largest possible run, such codes can be inefficient if many of the runs are of small or moderate length. The following variant, due to Teuhola [17], improves on simple run length coding by having a variable length representation of a run length. A run of r zeros is first broken up into successive blocks of zeros of exponentially increasi... |
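The variable-length run representation attributed to Teuhola can be sketched as below. The context is truncated, so this is a reconstruction under assumptions: blocks of sizes $2^k, 2^{k+1}, \ldots$ are peeled off the run, each full block is written as a 1-bit, a 0-bit ends the block list, and the remainder is written as a binary integer of $k + t$ bits, where $t$ is the number of full blocks. The function names and exact bit conventions are hypothetical and may differ from the original paper.

```python
def encode_run(r: int, k: int) -> str:
    """Encode a run of r zeros using exponentially growing blocks (assumed scheme)."""
    t = 0
    while r >= (1 << (k + t)):  # peel off full blocks of size 2^(k+t)
        r -= 1 << (k + t)
        t += 1
    width = k + t  # the remainder is now smaller than 2^(k+t)
    tail = format(r, "b").zfill(width) if width else ""
    return "1" * t + "0" + tail


def decode_run(code: str, k: int) -> int:
    """Invert encode_run for a single codeword."""
    t = code.index("0")  # number of leading 1-bits, one per full block
    width = k + t
    tail = code[t + 1 : t + 1 + width]
    base = sum(1 << (k + i) for i in range(t))
    return base + (int(tail, 2) if tail else 0)
```

For example, with k = 2 a run of 5 zeros consists of one full block of 4 plus a remainder of 1, which this sketch writes as `10001`. The codeword length grows with the logarithm of r, not with the logarithm of the longest possible run.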

16 |
Mathematical analysis of various superimposed coding methods
- Stiassny
- 1960
Citation Context ...rticularly effective for bitmaps which are not extremely sparse. This may have several applications. For example, bit-slices of signature methods are often chosen so that the density of 1-bits is 1/2 [15]. Such vectors are almost impossible to compress individually. There may however be a possible gain by using clustering. Also, the possibility of compression would permit us to increase the size of th... |

15 |
Improved hierarchical bit-vector compression in document retrieval systems
- Choueka, Fraenkel, et al.
- 1986
Citation Context ...s also contribute to our ability to compress bitmaps effectively, as evidenced by the fact that actual IR bitmaps are more compressible than randomly generated bitmaps with the same density of 1-bits [4]. The reason for the better results is a cluster-effect: since the segment positions in the bitmaps are usually ordered by topic or chronologically, adjacent bits often correspond to segments treating... |

10 |
Novel compression of sparse bit-strings
- Fraenkel, Klein
- 1985
Citation Context ...g of zeros only, and blocks with only a single 1-bit, have much higher probabilities than the other blocks, so the average codeword length of the Huffman code will be smaller than k. Fraenkel & Klein [7] combine Huffman coding with run-length coding. Once again, a parameter k is chosen as a block size. However, since for very sparse vectors the probability of a block of k zeros is high, runs of block... |

10 |
Storing Text Retrieval Systems on CD-ROM
- Klein, Bookstein, Deerwester
- 1989
Citation Context ...ures must be created that themselves require a substantial amount of space. Thus, mechanisms for compressing a wide range of data structures must be sought for the efficient operation of such systems [10]. To date, most attention has been given to, and progress made in, the area of text compression ([2], [13], [16]). In this paper, we shall describe and examine the possibilities of compressing bitmaps... |

6 |
Using bitmaps for medium sized information retrieval systems
- Bookstein, Klein
- 1990
Citation Context ... categories of bitmaps. We concentrate on sets of bitmaps such as are generally found in information retrieval (IR) systems. How such bitmaps can be used to enhance the system is discussed in [5] and [3]. Each bit-position corresponds to a specified sub-unit of the database, henceforth referred to as a segment; below, a segment will refer to a paragraph of text, though, in other contexts, a full docu... |

6 |
An $O(|E| \log \log |V|)$ algorithm for finding minimum spanning trees
- Yao
- 1975
Citation Context ...r finding a MST for a non-directed graph appear in the literature, ranging from Kruskal's simple greedy algorithm [12], which has in our case complexity $O(m^2 \log m)$, to Yao's more involved technique [20], which would need $O(m^2 \log \log m)$ operations for our application. 3.2 Algorithm statement Summarizing, we suggest the following procedure as the first stage for compressing a set of m bitmaps $X_1$; ... |

5 |
Compression of Large Inverted Files with Hyperbolic Term Distribution
- Schuegraf
- 1976
Citation Context ...en successive 1-bits, that is, give the position of a 1-bit relative to the preceding 1-bit position rather than relative to the beginning of the vector. This is known as run-length coding (Schuegraf [14]). In its simplest form, the length of every run is encoded by a fixed length codeword; since this codeword must be large enough to accommodate the theoretical maximum run length, this is equivalent t... |
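The simplest form of run-length coding described above (each run of zeros before a 1-bit stored in a fixed-length codeword) might look like the following sketch. Bitmaps are modelled as strings of `0` and `1`; the helper names and the choice of codeword width (just enough bits for the longest run possible in a vector of that length) are illustrative assumptions.

```python
from typing import List


def zero_runs(bits: str) -> List[int]:
    """Length of the run of zeros preceding each 1-bit."""
    runs, run = [], 0
    for b in bits:
        if b == "1":
            runs.append(run)
            run = 0
        else:
            run += 1
    return runs


def codeword_width(n: int) -> int:
    """Bits needed to hold the longest possible run in a vector of n bits."""
    return max(1, (n - 1).bit_length())


def encode_fixed(bits: str) -> str:
    width = codeword_width(len(bits))
    return "".join(format(r, "b").zfill(width) for r in zero_runs(bits))


def decode_fixed(code: str, n: int) -> str:
    width = codeword_width(n)
    parts = [
        "0" * int(code[i : i + width], 2) + "1" for i in range(0, len(code), width)
    ]
    return "".join(parts).ljust(n, "0")  # trailing zeros are implied by n


bits = "0001000000010000"
code = encode_fixed(bits)
```

Here the 16-bit vector has runs of 3 and 7 zeros, so two 4-bit codewords suffice. The decoder needs the vector length `n` because zeros after the last 1-bit are not stored.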

3 |
On the use of bit-maps for multiple key retrieval
- Vallarino
- 1976
Citation Context ...sion ([2], [13], [16]). In this paper, we shall describe and examine the possibilities of compressing bitmaps, a data structure often proposed for improving the performance of retrieval systems ([6], [18]). Bitmaps occur often in information retrieval. They can represent the occurrences of a word in the sentences or paragraphs making up a text; they can indicate the documents associated with an index ... |

1 |
Modeling for Text
- Bell, Witten, Cleary
- 1989
Citation Context ...ressing a wide range of data structures must be sought for the efficient operation of such systems [10]. To date, most attention has been given to, and progress made in, the area of text compression ([2], [13], [16]). In this paper, we shall describe and examine the possibilities of compressing bitmaps, a data structure often proposed for improving the performance of retrieval systems ([6], [18]). Bi... |

1 |
Huffman coding
- Jakobsson
- 1978
Citation Context ... (c) we can explicitly represent the number of zeros in the last block as a binary integer with $k + t$ bits. Thus a run of length r is encoded by $O(\log r)$ bits instead of $O(\log(\text{max length}))$. Jakobsson [9] suggests the use of Huffman coding for bitmaps. The bit-vector is partitioned into blocks of fixed size k, and statistics are collected on the frequency of occurrence of the $2^k$ bit patterns. Based o... |
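The block-based Huffman approach attributed to Jakobsson can be sketched as follows: cut the bit-vector into blocks of k bits, count how often each pattern occurs, and build a Huffman code over the observed patterns. This is a generic Huffman construction, not necessarily the exact procedure of [9]; the function name `block_huffman` is hypothetical, and the sketch assumes the vector length is a multiple of k.

```python
import heapq
from collections import Counter
from typing import Dict


def block_huffman(bits: str, k: int) -> Dict[str, str]:
    """Build a Huffman code for the k-bit block patterns that occur in bits."""
    blocks = [bits[i : i + k] for i in range(0, len(bits), k)]
    freq = Counter(blocks)
    # Heap items are (weight, unique tiebreak, {pattern: partial codeword}).
    heap = [(w, n, {p: ""}) for n, (p, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct pattern still needs a 1-bit codeword
        return {p: "0" for p in heap[0][2]}
    n = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {p: "0" + c for p, c in c1.items()}
        merged.update({p: "1" + c for p, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, n, merged))
        n += 1
    return heap[0][2]


bits = "0000" * 6 + "0001" * 2 + "1010"
code = block_huffman(bits, 4)
encoded = "".join(code[bits[i : i + 4]] for i in range(0, len(bits), 4))
```

In this sparse toy vector the all-zero block dominates and receives a 1-bit codeword, so the 36 input bits shrink to 12 coded bits (not counting the cost of storing the code table).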