Space-time tradeoffs for longest-common-prefix array computation
- In Proc. 19th ISAAC
, 2008
Cited by 8 (1 self)
Abstract. The suffix array, a space-efficient alternative to the suffix tree, is an important data structure for string processing, enabling efficient and often optimal algorithms for pattern matching, data compression, repeat finding and many problems arising in computational biology. An essential augmentation to the suffix array for many of these tasks is the Longest Common Prefix (LCP) array. In particular the LCP array allows one to simulate bottom-up and top-down traversals of the suffix tree with significantly less memory overhead (but in the same time bounds). Since 2001 the LCP array has been computable in Θ(n) time, but the algorithm (even after subsequent refinements) requires relatively large working memory. In this paper we describe a new algorithm that provides a continuous space-time tradeoff for LCP array construction, running in O(nv) time and requiring n + O(n/√v + v) bytes of working space, where v can be chosen to suit the available memory. Furthermore, the algorithm processes the suffix array, and outputs the LCP, strictly left-to-right, making it suitable for use with external memory. We show experimentally that for many naturally occurring strings our algorithm is faster than the linear-time algorithms, while using significantly less working memory.
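To make the LCP array concrete, here is a minimal Python sketch of the linear-time construction usually attributed to Kasai et al. (2001), which is presumably the Θ(n) baseline the abstract refers to. It is not the space-time tradeoff algorithm of the paper. The function names `naive_suffix_array` and `kasai_lcp` are hypothetical, and the suffix array is built by plain sorting purely for illustration.

```python
def naive_suffix_array(text):
    # Illustration only: sorts suffixes directly, which is far from linear time.
    return sorted(range(len(text)), key=lambda i: text[i:])


def kasai_lcp(text, sa):
    # lcp[r] = length of the longest common prefix of the suffixes
    # at ranks r-1 and r in the suffix array; lcp[0] is defined as 0.
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):  # visit suffixes in text order
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h-1 characters
    return lcp


sa = naive_suffix_array("banana")
lcp = kasai_lcp("banana", sa)
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so `sa` is `[5, 3, 1, 0, 4, 2]` and `lcp` is `[0, 1, 3, 0, 0, 2]`. The sketch keeps `rank`, `sa` and `lcp` in memory at once, which illustrates the working-memory cost the abstract mentions.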
Inducing the LCP-array
- In International Conference on Algorithms and Data Structures (WADS)
, 2011
Fast frequent string mining using suffix arrays
- In: Proc. ICDM, IEEE Computer Society
, 2005
On the Number of Elements to Reorder When Updating a Suffix Array
, 2011
Cited by 4 (2 self)
Recently new algorithms appeared for updating the Burrows-Wheeler transform or the suffix array, when the text they index is modified. These algorithms proceed by reordering entries and the number of such reordered entries may be as high as the length of the text. However, in practice, these algorithms are faster for updating the Burrows-Wheeler transform or the suffix array than the fastest reconstruction algorithms. In this article we focus on the number of elements to be reordered for real-life texts. We show that this number is related to LCP values and that, on average, L_ave entries are reordered, where L_ave denotes the average LCP value, defined as the average length of the longest common prefix between two consecutive sorted suffixes. Since we know little about the LCP distribution for real-life texts, we conduct experiments on a corpus that consists of DNA sequences and natural language texts. The results show that apart from texts containing large repetitions, the average LCP value is close to the one expected on a random text.
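The average LCP value defined in this abstract can be illustrated with a short Python sketch. It assumes the average is taken over the n-1 pairs of consecutive sorted suffixes; the article may normalise differently. The name `average_lcp` is hypothetical, and the code favours clarity over speed.

```python
def average_lcp(text):
    # Sort suffix start positions lexicographically (naive, for illustration).
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def common_prefix_len(a, b):
        # Length of the longest common prefix of text[a:] and text[b:].
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k

    # One LCP value per pair of adjacent suffixes in sorted order.
    values = [common_prefix_len(sa[r - 1], sa[r]) for r in range(1, len(sa))]
    return sum(values) / len(values) if values else 0.0


print(average_lcp("banana"))
```

For "banana" the adjacent-pair LCP values are 1, 3, 0, 0 and 2, so the average is 6/5 = 1.2. A highly repetitive input such as "aaaaaaaa" gives a much larger value relative to its length, consistent with the abstract's remark about texts containing large repetitions.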
Visualising the repeat structure of genomic sequences
- Complex Systems
, 2008
Cited by 1 (0 self)
Repeats are a common feature of genomic sequences and much remains to be understood of their origin and structure. The identification of repeated strings in genomic sequences is therefore of importance for a variety of applications in biology. In this paper a new method for finding all repeats and visualising them in a two dimensional plot is presented. The method is first applied to a set of constructed sequences in order to develop a comparative framework. Several complete genomes are then analysed, including the whole human genome. The technique reveals the complex repeat structure of genomic sequences. In particular, interesting differences in the repeat character of the coding and non-coding regions of bacterial genomes are noted. The method allows fast identification of all repeats and easy inter-genome comparison. In doing this the plot effectively creates a signature of a sequence which allows some classes of repeat present in a sequence to be identified by simple visual inspection. To our knowledge this is the first time all exact repeats have been visualised in a single plot that highlights the degree to which repeats occur within a genomic sequence, giving an indication of the important
STRING DATA STRUCTURES FOR COMPUTATIONAL MOLECULAR BIOLOGY
Cited by 1 (0 self)
The topic of the chapter is string data structures with applications in the field of computational molecular biology. Let Σ be a finite alphabet consisting of a set of characters (or symbols). The cardinality of the alphabet, denoted by |Σ|, expresses the number of distinct characters in the alphabet. A string or word is an ordered list
Fast Optimal Algorithms for Computing All the Repeats in a String
, 2008
Cited by 1 (0 self)
Given a string x = x[1..n] on an alphabet of size α, and a threshold pmin ≥ 1, we first describe a new algorithm PSY1 that, based on suffix array construction, computes all the complete nonextendible repeats in x of length p ≥ pmin. PSY1 executes in Θ(n) time independent of alphabet size and is an order of magnitude faster than the two other algorithms previously proposed for this problem. Second, we describe a new fast algorithm PSY2 for computing all complete supernonextendible repeats in x that also executes in Θ(n) time independent of alphabet size, thus asymptotically faster than methods previously proposed. Both algorithms require 9n bytes of storage, including preprocessing (with a minor caveat for PSY1). We conclude with a brief discussion of applications to bioinformatics and data compression.
Sequence analysis mkESA: enhanced suffix array construction tool
Summary: We introduce the tool mkESA, an open source program for constructing enhanced suffix arrays (ESAs), striving for low memory consumption, yet high practical speed. mkESA is a user-friendly program written in portable C99, based on a parallelized version of the Deep-Shallow suffix array construction algorithm, which is known for its high speed and small memory usage. The tool handles large FASTA files with multiple sequences, and computes suffix arrays and various additional tables, such as the LCP table (longest common prefix) or the inverse suffix array, from given sequence data. Availability: The source code of mkESA is freely available under the terms of the GNU General Public License (GPL) version 2 at
To obtain the doctoral degree
, 2013
Development of methods and algorithms for the characterization and annotation of transcriptomes with high-throughput sequencers