## Lempel-Ziv parsing and sublinear-size index structures for string matching (Extended Abstract) (1996)

Venue: | Proc. 3rd South American Workshop on String Processing (WSP'96) |

Citations: | 50 - 1 self |

### BibTeX

@INPROCEEDINGS{Kärkkäinen96lempel-zivparsing,

author = {Juha Kärkkäinen and Esko Ukkonen},

title = {Lempel-Ziv parsing and sublinear-size index structures for string matching (Extended Abstract)},

booktitle = {Proc. 3rd South American Workshop on String Processing (WSP'96)},

year = {1996},

pages = {141--155},

publisher = {Carleton University Press}

}

### Abstract

String matching over a long text can be significantly speeded up with an index structure formed by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinear-size index structure. The new structure is based on Lempel-Ziv parsing of the text and has size linear in N, the size of the Lempel-Ziv parse. For a text of length n, N = O(n / log n) and can be still smaller if the text is compressible. With the new index structure, all occurrences of a pattern string of length m can be found in time O(m 2

### Citations

1221 | A universal algorithm for sequential data compression
- Ziv, Lempel
- 1977
Citation Context ...string. The scheme occurs in several variations. Our variation is the original from the seminal paper by Lempel and Ziv [19] which preceded the papers describing their famous data compression methods [26, 27]. The notation is borrowed from [9]. We define that the LZ parse of string T is a sequence of the form Z = (P_1, L_1, C_1) ⋯ (P_i, L_i, C_i) ⋯ (P_N, L_N, C_N... |

1211 | Multidimensional binary search trees used for associative searching
- Bentley
- 1975
Citation Context ... for 2DPM. 4.2 2-D Tree for Strings The 2-d tree is a binary search tree for points in two dimensions. It is the two-dimensional case of the multidimensional binary search tree or k-d tree of Bentley [6]. The 2-d tree is like ordinary binary search tree except the odd levels use the x-coordinate as the discriminator and the even levels use the y-coordinate as the discriminator. The 2-d tree for N poin... |

779 | Compression of individual sequences via variable-rate coding
- Ziv, Lempel
- 1978
Citation Context ...string. The scheme occurs in several variations. Our variation is the original from the seminal paper by Lempel and Ziv [19] which preceded the papers describing their famous data compression methods [26, 27]. The notation is borrowed from [9]. We define that the LZ parse of string T is a sequence of the form Z = (P_1, L_1, C_1) ⋯ (P_i, L_i, C_i) ⋯ (P_N, L_N, C_N... |

673 | Suffix arrays: A new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ... uneconomically large to be really attractive in practical applications. The size depends on implementation details and the structure of the text, but will never be as low as 10n bytes. Suffix arrays [20, 11] (size 5n bytes), level-compressed tries [3, 4] (size about 11n bytes), suffix cactuses [15] (size 9n bytes), and suffix binary search trees [13] (size about 10n bytes) are alternative smaller data st... |

568 | A space economical suffix tree construction algorithm
- McCreight
- 1976
Citation Context ... is a trie-like data structure representing all suffixes of T. It can be constructed in time O(n log c) and space O(n), where n = |T| is the length of T and c = |Σ| is the size of the alphabet [25, 21, 24]. The existence of an occurrence of P in T can be detected using the suffix tree in time O(m log c) and all occurrences can be listed in time O(m log c + L), where m = |P| and L is the number of occur... |

445 | Linear pattern matching algorithms
- Weiner
- 1973
Citation Context ... is a trie-like data structure representing all suffixes of T. It can be constructed in time O(n log c) and space O(n), where n = |T| is the length of T and c = |Σ| is the size of the alphabet [25, 21, 24]. The existence of an occurrence of P in T can be detected using the suffix tree in time O(m log c) and all occurrences can be listed in time O(m log c + L), where m = |P| and L is the number of occur... |

247 | On the complexity of finite sequences
- Lempel, Ziv
- 1976
Citation Context ...onsecutive disjoint blocks that follow the internal repetitive structure of the string. The scheme occurs in several variations. Our variation is the original from the seminal paper by Lempel and Ziv [19] which preceded the papers describing their famous data compression methods [26, 27]. The notation is borrowed from [9]. We define that the LZ parse of string T is a sequence of the form Z = (P_1, L ... |

181 | Priority search trees
- McCreight
- 1985
Citation Context ...omputational geometry problem is the interval intersection problem, for which there exists a number of data structures such as the segment tree [7], the interval tree [8] and the priority search tree [22]. The last of the three is the most suitable for our purposes due to its small space requirement. The priority search tree, invented by McCreight [22], was originally designed for solving semi-infinit... |

98 | Let sleeping files lie: Pattern matching in z-compressed files
- Amir, Benson, et al.
- 1996
Citation Context ...g has also been applied to searching q-grams [16]. The actual data structures, however, are very different from the ones described here. Technically related is also the problem of compressed matching [1, 9] which asks one to find the occurrences of P in T directly from the LZ compressed representation of T. The problem is quite different but the solutions utilize the same properties of LZ parsing as we... |

90 | String matching in Lempel-Ziv compressed strings, in 27th Annual ACM Symposium on the Theory of Computing
- Farach, Thorup
- 1995
Citation Context ... the 2-d tree for strings and the two-dimensional heap. Both may have applications beyond the present context. Our search algorithm is based on a special property of LZ parsing of T, already noted in [9]: The first occurrence (from left to right) of P in T must overlap the last symbol of some block produced by the LZ parsing. Therefore we divide the occurrences into two types: The primary occurrences... |

79 | Patricia--practical algorithm to retrieve information coded in alphanumeric
- Morrison
- 1968
Citation Context ...rimary search alone is enough. A totally different possibility to construct a small index structure is to build a sparse suffix tree, i.e., a suffix tree that represents only a subset of all suffixes [11, 23, 17, 2]. In this way, all occurrences of P that overlap the starting point of a suffix that is present in the tree can easily be found. The other occurrences are difficult and lead to a brute-force search. T... |

72 | On-line construction of suffix-trees
- Ukkonen
- 1995
Citation Context ... is a trie-like data structure representing all suffixes of T. It can be constructed in time O(n log c) and space O(n), where n = |T| is the length of T and c = |Σ| is the size of the alphabet [25, 21, 24]. The existence of an occurrence of P in T can be detected using the suffix tree in time O(m log c) and all occurrences can be listed in time O(m log c + L), where m = |P| and L is the number of occur... |

66 | Data Compression with Finite Windows
- Fiala, Greene
- 1989
Citation Context ...r this variation. The advantage of this variation is that the parse can be constructed in O(N) space (and O(n log c) time) using a sparse suffix tree. The data compression method of Fiala and Greene [10] is based on this variation. The size of our index structure will be proportional to the length N of Z. We next derive a bound for N (cf. [19]). Lemma 1. Let T be a text over alphabet of size c and le... |

63 | Asymptotic behavior of the Lempel–Ziv parsing scheme and digital search trees, Theor
- Jacquet, Szpankowski
- 1995
Citation Context ...ough for our results. The expected value of N has been shown to be about nh / log n, where h is the entropy of the text, for various versions of LZ parsing and various models of randomness (see, e.g., [14] and references therein). Some empirical measurements of N are given in Table 1. Random text represents the worst case, uncompressible text. Table 1. Size of the LZ parse for various texts of length 3... |

54 | A new approach to rectangle intersections
- Edelsbrunner
- 1983
Citation Context ... (x ≤ y). Another closely related computational geometry problem is the interval intersection problem, for which there exists a number of data structures such as the segment tree [7], the interval tree [8] and the priority search tree [22]. The last of the three is the most suitable for our purposes due to its small space requirement. The priority search tree, invented by McCreight [22], was originally... |

38 | Suffix cactus: A cross between suffix tree and suffix array
- Kärkkäinen
- 1995
Citation Context ...lementation details and the structure of the text, but will never be as low as 10n bytes. Suffix arrays [20, 11] (size 5n bytes), level-compressed tries [3, 4] (size about 11n bytes), suffix cactuses [15] (size 9n bytes), and suffix binary search trees [13] (size about 10n bytes) are alternative smaller data structures with almost the same properties as the suffix tree. Their space requirement is stil... |

37 | Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees
- Lee, Wong
- 1977
Citation Context ...log N) time [6]. The range query in a 2-d tree works in the obvious way by continuing the search in one or both subtrees of each node depending on the result of comparisons in the node. Lee and Wong [18] have shown that the range query in a balanced 2-d tree of N points takes O(√N + l) time in the worst case, where l is the size of the answer. As noted earlier, the 2DPM problem can be interpreted a... |

35 | Efficient implementation of suffix trees
- Andersson, Nilsson
- 1995
Citation Context ...practical applications. The size depends on implementation details and the structure of the text, but will never be as low as 10n bytes. Suffix arrays [20, 11] (size 5n bytes), level-compressed tries [3, 4] (size about 11n bytes), suffix cactuses [15] (size 9n bytes), and suffix binary search trees [13] (size about 10n bytes) are alternative smaller data structures with almost the same properties as the... |

33 | Improved behaviour of tries by adaptive branching
- Andersson, Nilsson
- 1993
Citation Context ...practical applications. The size depends on implementation details and the structure of the text, but will never be as low as 10n bytes. Suffix arrays [20, 11] (size 5n bytes), level-compressed tries [3, 4] (size about 11n bytes), suffix cactuses [15] (size 9n bytes), and suffix binary search trees [13] (size about 10n bytes) are alternative smaller data structures with almost the same properties as the... |

30 | Suffix trees on words
- Andersson, Larsson, et al.
- 1999
Citation Context ...rimary search alone is enough. A totally different possibility to construct a small index structure is to build a sparse suffix tree, i.e., a suffix tree that represents only a subset of all suffixes [11, 23, 17, 2]. In this way, all occurrences of P that overlap the starting point of a suffix that is present in the tree can easily be found. The other occurrences are difficult and lead to a brute-force search. T... |

23 | Lexicographical indices for text: inverted files vs
- Gonnet, Baeza-Yates, et al.
- 1991
Citation Context ... uneconomically large to be really attractive in practical applications. The size depends on implementation details and the structure of the text, but will never be as low as 10n bytes. Suffix arrays [20, 11] (size 5n bytes), level-compressed tries [3, 4] (size about 11n bytes), suffix cactuses [15] (size 9n bytes), and suffix binary search trees [13] (size about 10n bytes) are alternative smaller data st... |

20 | An efficient algorithm for dynamic text indexing
- Gu, Farach, et al.
- 1994
Citation Context ...ent but the solutions utilize the same properties of LZ parsing as we do. Finally, we note that the subproblem of finding block matches in the dynamic text indexing algorithm of Gu, Farach and Beigel [12] resembles an on-line (no preprocessing) version of our primary searching. 2 LZ Parsing The Lempel-Ziv scheme of data compression is based on parsing the string into consecutive disjoint blocks that f... |

18 | Lempel-Ziv index for q-grams
- Kärkkäinen, Sutinen
- 1996
Citation Context ... present paper that also finds such secondary occurrences fast using the properties of LZ parsing. The basic two-phase technique based on Lempel-Ziv parsing has also been applied to searching q-grams [16]. The actual data structures, however, are very different from the ones described here. Technically related is also the problem of compressed matching [1, 9] which asks one to find the occurrences of ... |

15 | Suffix Binary Search Trees
- Irving
- 1995
Citation Context ...t will never be as low as 10n bytes. Suffix arrays [20, 11] (size 5n bytes), level-compressed tries [3, 4] (size about 11n bytes), suffix cactuses [15] (size 9n bytes), and suffix binary search trees [13] (size about 10n bytes) are alternative smaller data structures with almost the same properties as the suffix tree. Their space requirement is still high for large n and it can be impossible to store ... |

11 | An optimal worst-case algorithm for reporting intersections of rectangles
- Bentley, Wood
- 1980
Citation Context ...the pairs are intervals (x ≤ y). Another closely related computational geometry problem is the interval intersection problem, for which there exists a number of data structures such as the segment tree [7], the interval tree [8] and the priority search tree [22]. The last of the three is the most suitable for our purposes due to its small space requirement. The priority search tree, invented by McCreig... |

8 | Sparse suffix trees
- Kärkkäinen, Ukkonen
- 1996
Citation Context ...rimary search alone is enough. A totally different possibility to construct a small index structure is to build a sparse suffix tree, i.e., a suffix tree that represents only a subset of all suffixes [11, 23, 17, 2]. In this way, all occurrences of P that overlap the starting point of a suffix that is present in the tree can easily be found. The other occurrences are difficult and lead to a brute-force search. T... |

5 | Optimized binary search and text retrieval
- Barbosa, Navarro, et al.
- 1995
Citation Context ...quirement is still high for large n and it can be impossible to store the entire data structure in the fast memory. Using suffix trees and arrays in secondary memory environment is considered e.g. in [11, 5]. As the slow secondary memory operations can, in practice, destroy the good theoretical performance, there is a need to find small alternatives for suffix trees and arrays, even at the cost of increa... |