## CPS-tree: A compact partitioned suffix tree for disk-based indexing on large genome sequences (2007)

Venue: Proceedings of the International Conference on Data Engineering

Citations: 3 (0 self)

### BibTeX

@INPROCEEDINGS{Wong07cps-tree:a,

author = {Swee-seong Wong and Wing-kin Sung},

title = {CPS-tree: A compact partitioned suffix tree for disk-based indexing on large genome sequences},

booktitle = {In Proceedings of the International Conference on Data Engineering},

year = {2007},

pages = {1350--1354}

}

### Abstract

A suffix tree is an important data structure for indexing a long sequence (such as a genome sequence) or a concatenation of sequences. It has many practical applications, especially in bioinformatics. A suffix tree supports efficient pattern search in time independent of the sequence length. However, the performance of a disk-based suffix tree is a concern: poor locality of access leads to high IO disk access, which slows it down significantly. The focus of this paper is to design an IO-efficient and Compact Partitioned Suffix tree representation (CPS-tree) on disk. We show that representing a suffix tree as a CPS-tree has several advantages. First, our representation allows us to visit any node in the suffix tree by accessing at most log n pages of the tree, where n is the length of the sequence. Second, our storage scheme improves the access pattern and reduces the number of page faults, resulting in efficient search retrieval and efficient tree traversal operations. Third, by bit packing, our index is compact. Experimental results show that CPS-tree outperforms other indexes on disk. When fully loaded into main memory, CPS-tree is still efficient. Hence, we expect CPS-tree to be a good disk-based representation of the suffix tree, with potential use in practical applications.
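The claim that pattern search takes time independent of the sequence length can be illustrated with a minimal in-memory sketch. This is not the CPS-tree and not the authors' code: it is a naive, uncompressed suffix trie, and the names `Node`, `build_suffix_trie`, and `find_occurrences` are hypothetical. Storing start positions at every node is a simplification made here for brevity; a real suffix tree would collect them from the leaves of the matched subtree.

```python
from typing import Dict, List


class Node:
    """One trie node: outgoing edges plus the start positions of all
    suffixes whose path passes through this node (illustrative shortcut)."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.starts: List[int] = []


def build_suffix_trie(text: str) -> Node:
    """Insert every suffix of text. Quadratic space, for illustration only."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root


def find_occurrences(root: Node, pattern: str) -> List[int]:
    """Follow at most len(pattern) edges from the root, so the walk itself
    does not depend on the length of the indexed text."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []  # pattern does not occur in the text
    return sorted(node.starts)


trie = build_suffix_trie("GATTACA")
print(find_occurrences(trie, "A"))   # [1, 4, 6]
print(find_occurrences(trie, "TA"))  # [3]
print(find_occurrences(trie, "CG"))  # []
```

On disk, each step of that walk may touch a different page, which is the access-locality problem the abstract describes.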

### Citations

699 | Suffix arrays: A new method for on-line string searches
- Manber, Myers
- 1993
Citation Context ... Exact match query: WOTD-tree [5, 9] min{m, H} + (m+occ)/B; CPS-tree min{m, log n} + (m+occ)/B. Exact match count query (SB-tree, CPT, WOTD-tree, CPS-tree): log_B n + m/B; H/√B + log_B n + (m+occ)/B; min{m, H} + (m+occ)/B; min{m, log n} + m/B. Edge label access (same order): log_B n + ℓ/B; H/√B + log_B n + ℓ/B; ℓ/B; ℓ/B. suffix array [8] m log n + occ/B; m log n; log n + ℓ/B; log_B n + ℓ/B; log |A|; 1; 1; log n + ℓ/B. The WOTD-tree is generated using the TDD construction algorithm [9]. The SB-tree does not maintain the original suffix tree st...

125 | The string B-tree: a new data structure for string search in external memory and its applications
- Ferragina, Grossi
Citation Context ...x tree in memory is no longer feasible. We need to have a disk-based representation of suffix tree that allows for efficient access. We have seen a number of disk-based representations of suffix tree [1, 4, 9] in the literature. However, these disk-based suffix trees either fail to support all the general suffix tree operations well or have high IO disk access for certain operations. This paper focuses on ...

123 | Reducing the space requirements of suffix trees
- Kurtz
- 1998
Citation Context ...d suffix tree on disk, assuming that every position on the text is to be indexed and addressable using a 4-byte word. Our scheme is comparable to the most space-efficient suffix tree representations [1, 4, 6]. Retrieval of the matching occurrences on the text, given a search string, can be performed by traversing the suffix tree to access all the leaf nodes within a subtree. Alternatively, this can be han...

88 | Efficient suffix trees on secondary storage
- Clark, Munro
- 1996
Citation Context ...x tree in memory is no longer feasible. We need to have a disk-based representation of suffix tree that allows for efficient access. We have seen a number of disk-based representations of suffix tree [1, 4, 9] in the literature. However, these disk-based suffix trees either fail to support all the general suffix tree operations well or have high IO disk access for certain operations. This paper focuses on ...

42 | Efficient implementation of lazy suffix trees
- Giegerich, Kurtz, et al.
- 2003
Citation Context ...d throughout this paper for easy refer... Table columns: suffix structure; exact match query; exact match count query; edge label access; child node access. Exact match query: SB-tree [4] log_B n + (m+occ)/B; CPT [1] H/√B + log_B n + (m+occ)/B; WOTD-tree [5, 9] min{m, H} + (m+occ)/B; CPS-tree min{m, log n} + (m+occ)/B. Exact match count query (same order): log_B n + m/B; H/√B + log_B n + (m+occ)/B; min{m, H} + (m+occ)/B; min{m, log n} + m/B. Edge label access (same order): log_B n + ℓ/B; H/√B + log_B n + ℓ/B; ℓ/B; ℓ/B. suffix array [8] m log n + oc...

30 | Practical Suffix Tree Construction
- Tata, Hankins, et al.
Citation Context ... problems. The first problem is on constructing suffix tree efficiently. Fortunately, a suffix tree (or a suffix array) for human genome of 3 billion characters can now be constructed within 30 hours [7, 9]. Hence, the problem on suffix tree construction has largely been solved in practice. The second problem is on accessing the suffix tree. As the genome database gets bigger, maintaining suffix tree in...

27 | A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays
- Lam, Sadakane, et al.
- 2002
Citation Context ... problems. The first problem is on constructing suffix tree efficiently. Fortunately, a suffix tree (or a suffix array) for human genome of 3 billion characters can now be constructed within 30 hours [7, 9]. Hence, the problem on suffix tree construction has largely been solved in practice. The second problem is on accessing the suffix tree. As the genome database gets bigger, maintaining suffix tree in...

24 | Fast string searching in secondary storage: Theoretical developments and experimental results
- Ferragina, Grossi
- 1996

18 | Clustering techniques for minimizing external path length
- Diwan, Rane, et al.
- 1996
Citation Context ... block). This partitioning method guarantees a good IO disk access bound for both worst and average cases. There are several tree partitioning methods in the literature. In the paper by Diwan et al. [2], bottom-up tree partitioning methods have been proposed that find the optimal layout minimizing either the worst (maximum) or average block access when traversing from the root to any leaf in the t...