## The SBC-Tree: An Index for Run-Length Compressed Sequences (December 2005)

### BibTeX

@MISC{Eltabakh05december2005the,

author = {Mohamed Y. Eltabakh and Wing-kai Hon and Rahul Shah and Walid G. Aref and Jeffrey S. Vitter},

title = {The SBC-Tree: An Index for Run-Length Compressed Sequences},

year = {2005}

}

### Abstract

Run-Length Encoding (RLE) is a data compression technique that is used in various applications, e.g., biological sequence databases, multimedia, and facsimile transmission. One of the main challenges is how to operate, e.g., indexing, searching, and retrieval, on the compressed data without decompressing it. In this paper, we present the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure. The SBC-tree supports substring as well as prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of $O(N/B)$ pages, where $N$ is the total length of the compressed sequences, and $B$ is the disk page size. The insertion and deletion of all suffixes of a compressed sequence of length $m$ takes $O(m \log_B (N + m))$ I/O operations. Substring matching, prefix matching, and range search execute in an optimal $O(\log_B N + \frac{|p| + T}{B})$ I/O operations, where $|p|$ is the length of the compressed query pattern and $T$ is the query output size. We also present two variants of the SBC-tree: the SBC-tree that is based on an R-tree instead of the 3-sided structure, and the one-level SBC-tree that does not use a two-dimensional index. These variants do not have provable worst-case theoretical bounds for search operations, but perform well in practice. The SBC-tree index is realized inside PostgreSQL in the context of a biological protein database application. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, up to 30% reduction in I/Os for the insertion operations, and retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.
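
To make the idea of operating on RLE data without decompressing it concrete, the following is a minimal Python sketch. It is not the SBC-tree and does not come from the paper: the function names `rle_encode`, `rle_decode`, and `rle_contains` are hypothetical, and the matcher is a naive in-memory scan over runs, shown only to illustrate that a substring test can be answered by comparing (character, run length) pairs.

```python
from itertools import groupby


def rle_encode(seq):
    """Encode a string as a list of (character, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(seq)]


def rle_decode(runs):
    """Expand a list of (character, run_length) pairs back into a string."""
    return "".join(ch * n for ch, n in runs)


def rle_contains(text_runs, pat_runs):
    """Naive substring test performed directly on RLE runs.

    Assumes both inputs are maximal-run encodings (adjacent runs use
    different characters), as produced by rle_encode. This is a linear
    scan for illustration, not the indexed search described in the abstract.
    """
    k = len(pat_runs)
    if k == 0:
        return True
    if k == 1:
        ch, n = pat_runs[0]
        return any(tc == ch and tn >= n for tc, tn in text_runs)
    for i in range(len(text_runs) - k + 1):
        window = text_runs[i:i + k]
        # The first pattern run may match the tail of a longer text run.
        first_ok = window[0][0] == pat_runs[0][0] and window[0][1] >= pat_runs[0][1]
        # The last pattern run may match the head of a longer text run.
        last_ok = window[-1][0] == pat_runs[-1][0] and window[-1][1] >= pat_runs[-1][1]
        # Interior runs must match exactly in character and length.
        middle_ok = window[1:-1] == pat_runs[1:-1]
        if first_ok and last_ok and middle_ok:
            return True
    return False


text = rle_encode("aaabbbbc")
print(text)
print(rle_contains(text, rle_encode("abbbbc")))
print(rle_contains(text, rle_encode("abbc")))
```

The matcher only looks at runs, so its cost depends on the number of runs rather than on the uncompressed length; the first and last pattern runs are compared with `>=` because they may align with part of a longer run in the text.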

### Citations

2219 | R-trees: A dynamic index structure for spatial searching
- Guttman
- 1984
(Show Context)
Citation Context ...We discuss the assignment of the tags in detail in Section 5.2. The second level of the SBC-tree can be any two-dimensional index structure. We consider, in this paper, the use of the R-tree =-=[28]-=- and the 3-sided range query structure [8]. The R-tree has a good performance in practice and is available in several DBMSs; however, no theoretical bounds are guaranteed. The 3-sided range q... |

1138 | A Universal Algorithm for Sequential Data Compression - Ziv, Lempel - 1977 |

565 | A Block-sorting Lossless Data Compression Algorithm - Burrows, Wheeler - 1994 |

555 | The ubiquitous B-tree
- Comer
- 1979
(Show Context)
Citation Context ...he string. The logical keys are sorted inside the String B-tree according to the lexicographic order of the corresponding suffixes (See Figure 2). The String B-tree is a combination of the B-tree =-=[16]-=- and the Patricia trie [37], where the entries inside each B-tree node are organized in a Patricia trie structure instead of a sequential array (See Figure 2(c)). The goal of the Patricia trie is... |

426 | Fast subsequence matching in time-series databases - Faloutsos, Ranganathan, et al. - 1994 |

426 | Linear pattern matching algorithms - Weiner - 1973 |

396 | Algorithms on strings, trees, and sequences: Computer science and computational biology - Gusfield - 1997 |

320 | External Memory Algorithms and Data Structures: Dealing with Massive Data
- Vitter
- 2001
(Show Context)
Citation Context ...essed data, e.g., [1, 2, 6, 13, 14, 20, 23, 26, 32, 4?]. However, none of the proposed algorithms address the problem of indexing and searching compressed data using external memory techniques =-=[46]-=-. In this paper, we propose the SBC-tree (String B-tree for Compressed sequences) for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index str... |

259 | Trie memory - Fredkin - 1960 |

193 | High-order entropy-compressed text indexes - Grossi, Gupta, et al. - 2003 |

180 | Opportunistic Data Structures with Applications - Ferragina, Manzini - 2000 |

155 | Two algorithms for maintaining order in a list - Dietz, Sleator |

95 | Let sleeping files lie: pattern matching in Z-compressed files - Amir, Benson, et al. - 1996 |

93 |
Prefix B-trees
- Bayer, Unterauer
- 1977
(Show Context)
Citation Context ...oposed. These structures include suffix trees [27, 34, 47], suffix binary search trees [31], suffix arrays [21, 27, 33], inverted files [39], tries [22, 37], B-trees [9, 16], and the prefix B-tree =-=[10]-=-. Several variants of these structures have been proposed to index efficiently strings of unbounded length. The persistent suffix trees have been proposed in [12, 30]. A buffer management strategy... |

80 | Efficient two-dimensional compressed matching - Amir, Benson - 1992 |

62 | SEQ: A model for sequence databases
- Seshadri, Livny, et al.
- 1995
(Show Context)
Citation Context ...uences are usually treated as string sequences. Therefore, indexing compressed sequences is closely tied to text and sequence indexing. A model for sequence databases, called SEQ, is proposed in =-=[40]-=-. SEQ models different types of sequence data and defines a set of operators to query the sequences. A data structure for indexing numeric sequences is proposed in [18], where sequences are mapped i... |

55 | A Database Index to Large Biological Sequences - Hunt, Atkinson, et al. - 2001 |

48 | Efficient pattern matching with scaling - Amir, Landau, et al. - 1992 |

27 | Practical Suffix Tree Construction - Tata, Hankins, et al. - 2004 |

18 | Approximate matching of run-length compressed strings
- Mäkinen, Navarro, et al.
- 2001
(Show Context)
Citation Context ...algorithms have been proposed to search various formats of compressed data. Algorithms for searching RLE-compressed sequences include substring matching [3, 4, 44], approximate pattern matching =-=[32]-=-, edit distance [7, 14], and longest common subsequence [6, 23]. Algorithms over other compression schemes include searching Lempel-Ziv compressed data [2, 38], searching antidictionaries compr... |

15 | Engineering a fast online persistent suffix tree construction - Bedathur, Haritsa - 2004 |

13 | Regular expression searching on compressed text - Navarro - 2003 |

12 | Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism - Freschi, Bogliolo |

11 | Searching BWT compressed text with the Boyer-Moore algorithm and binary search - Bell, Mukherjee, et al. - 2002 |

10 | Inplace run-length 2D compressed search - Amir, Landau, et al. |

9 | Efficient Run-Length Encodings - Tanaka, Leon-Garcia - 1982 |

8 | Searching on the secondary structure of protein sequences - Hammel, Patel - 2002 |

6 | The organization of a Multilist-type associative memory - Prywes, Gray - 1963 |

3 | Organization and Maintenance of Large Ordered Indices. Acta Informatica 1 - Bayer, McCreight - 1972 |

2 | Matching patterns in strings subject to multi-linear transformations - Tzoreff - 1988 |

1 | Matching for run-length encoded strings. Journal of Complexity - Landau, Skiena - 1999 |

1 | When indexing equals compression: experiments with compressing suffix arrays and applications - Vitter - 2004 |

1 | Suffix arrays: A new method for on-line string searches - Manber, Myers - 1993 |

1 | A space-economical suffix tree construction algorithm - McCreight - 1976 |

1 |
Priority search trees
- McCreight
- 1985
(Show Context)
Citation Context ...all points $(x, y)$, where $a_1 \le x \le a_2$ and $y \ge b_1$ (See Figure 3). The 3-sided range query structure [8] is an external memory structure that is based on the external memory priority search tree =-=[35]-=- and the persistent B-tree [11, 45]. The 3-sided range query structure has an optimal worst-case theoretical bound for the update and 3-sided range query operations. The following lemma states... |

1 | Implementing the PPM data compression scheme - Moffat - 1990 |

1 |
Patricia: Practical algorithm to retrieve information coded in alphanumeric
- Morrison
- 1968
(Show Context)
Citation Context ...are sorted inside the String B-tree according to the lexicographic order of the corresponding suffixes (See Figure 2). The String B-tree is a combination of the B-tree [16] and the Patricia trie =-=[37]-=-, where the entries inside each B-tree node are organized in a Patricia trie structure instead of a sequential array (See Figure 2(c)). The goal of the Patricia trie is to avoid the logarithmic se... |

1 |
Pattern matching in text compressed by using antidictionaries
- Shinohara, Arikawa
- 1999
(Show Context)
Citation Context ...ance [7, 14], and longest common subsequence [6, 23]. Algorithms over other compression schemes include searching Lempel-Ziv compressed data [2, 38], searching antidictionaries compressed text =-=[41]-=-, and searching Burrows-Wheeler transform (BWT) compressed data [13]. For applications such as entropy compressed text, the encoding scheme is complex, and hence the search mechanisms and c... |