Space-time tradeoffs for longest-common-prefix array computation
In Proc. 19th ISAAC, 2008
Cited by 8 (1 self)
Abstract. The suffix array, a space-efficient alternative to the suffix tree, is an important data structure for string processing, enabling efficient and often optimal algorithms for pattern matching, data compression, repeat finding and many problems arising in computational biology. An essential augmentation to the suffix array for many of these tasks is the Longest Common Prefix (LCP) array. In particular, the LCP array allows one to simulate bottom-up and top-down traversals of the suffix tree with significantly less memory overhead (but in the same time bounds). Since 2001 the LCP array has been computable in Θ(n) time, but the algorithm (even after subsequent refinements) requires relatively large working memory. In this paper we describe a new algorithm that provides a continuous space-time tradeoff for LCP array construction, running in O(nv) time and requiring n + O(n/√v + v) bytes of working space, where v can be chosen to suit the available memory. Furthermore, the algorithm processes the suffix array, and outputs the LCP, strictly left-to-right, making it suitable for use with external memory. We show experimentally that for many naturally occurring strings our algorithm is faster than the linear time algorithms, while using significantly less working memory.
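For readers unfamiliar with the LCP array, the following is a minimal sketch of a linear-time LCP computation in the style of the well-known 2001 algorithm (usually attributed to Kasai et al.). It is not the space-time tradeoff algorithm of the paper above. The function names `suffix_array` and `lcp_kasai` are illustrative, and the suffix array is built naively by sorting suffixes, which is only reasonable for short strings.

```python
def suffix_array(s):
    # Naive construction (assumption for illustration): sort the start
    # positions by the suffixes they denote. Fine for small inputs only.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_kasai(s, sa):
    # lcp[r] = length of the longest common prefix of the suffixes at
    # sa[r - 1] and sa[r]; lcp[0] is defined as 0.
    n = len(s)
    rank = [0] * n          # inverse suffix array: rank[sa[r]] = r
    for r, i in enumerate(sa):
        rank[i] = r

    lcp = [0] * n
    h = 0                   # current match length, carried between steps
    for i in range(n):      # visit suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]          # predecessor in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1      # the next suffix shares at least h - 1 chars
        else:
            h = 0
    return lcp


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)                    # [5, 3, 1, 0, 4, 2]
    print(lcp_kasai(text, sa))   # [0, 1, 3, 0, 0, 2]
```

The sketch keeps the text, the suffix array, the inverse suffix array and the LCP array in memory at once, which illustrates the working-memory cost that the abstract refers to.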
Inducing the LCP-Array
In International Conference on Algorithms and Data Structures (WADS), 2011
Fast frequent string mining using suffix arrays
In: Proc. ICDM, IEEE Computer Society, 2005
On the Number of Elements to Reorder When Updating a Suffix Array
2011
Cited by 4 (2 self)
Recently, new algorithms have appeared for updating the Burrows-Wheeler transform or the suffix array when the text they index is modified. These algorithms proceed by reordering entries, and the number of such reordered entries may be as high as the length of the text. However, in practice, these algorithms are faster for updating the Burrows-Wheeler transform or the suffix array than the fastest reconstruction algorithms. In this article we focus on the number of elements to be reordered for real-life texts. We show that this number is related to LCP values and that, on average, L_ave entries are reordered, where L_ave denotes the average LCP value, defined as the average length of the longest common prefix between two consecutive sorted suffixes. Since we know little about the LCP distribution for real-life texts, we conduct experiments on a corpus that consists of DNA sequences and natural language texts. The results show that, apart from texts containing large repetitions, the average LCP value is close to the one expected on a random text.
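The average LCP value defined in this abstract can be illustrated with a short sketch. The function name `average_lcp` is hypothetical, the suffixes are sorted naively, and the sum is divided by the number of adjacent suffix pairs (n - 1); the article itself may normalise differently, so treat the divisor as an assumption.

```python
def average_lcp(text):
    # Average length of the longest common prefix between consecutive
    # suffixes in sorted order. Naive sketch for small inputs.
    n = len(text)
    if n < 2:
        return 0.0

    # Start positions of the suffixes, in lexicographic order.
    sa = sorted(range(n), key=lambda i: text[i:])

    def common_prefix_len(a, b):
        # Count matching characters of text[a:] and text[b:].
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    total = sum(common_prefix_len(sa[r - 1], sa[r]) for r in range(1, n))
    # Assumption: average over the n - 1 adjacent pairs.
    return total / (n - 1)


if __name__ == "__main__":
    print(average_lcp("banana"))   # adjacent LCPs are 1, 3, 0, 0, 2
    print(average_lcp("aaaaaaaa")) # a highly repetitive text scores higher
```

A highly repetitive input such as `"aaaaaaaa"` yields a much larger value than a text without long repeats, in line with the abstract's remark about texts containing large repetitions.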
Visualising the repeat structure of genomic sequences
Complex Systems, 2008
Cited by 1 (0 self)
Repeats are a common feature of genomic sequences and much remains to be understood of their origin and structure. The identification of repeated strings in genomic sequences is therefore of importance for a variety of applications in biology. In this paper a new method for finding all repeats and visualising them in a two-dimensional plot is presented. The method is first applied to a set of constructed sequences in order to develop a comparative framework. Several complete genomes are then analysed, including the whole human genome. The technique reveals the complex repeat structure of genomic sequences. In particular, interesting differences in the repeat character of the coding and noncoding regions of bacterial genomes are noted. The method allows fast identification of all repeats and easy inter-genome comparison. In doing this, the plot effectively creates a signature of a sequence, which allows some classes of repeat present in a sequence to be identified by simple visual inspection. To our knowledge this is the first time all exact repeats have been visualised in a single plot that highlights the degree to which repeats occur within a genomic sequence, giving an indication of the important
STRING DATA STRUCTURES FOR COMPUTATIONAL MOLECULAR BIOLOGY
Cited by 1 (0 self)
The topic of the chapter is string data structures with applications in the field of computational molecular biology. Let Σ be a finite alphabet consisting of a set of characters (or symbols). The cardinality of the alphabet, denoted by |Σ|, expresses the number of distinct characters in the alphabet. A string or word is an ordered list
Fast Optimal Algorithms for Computing All the Repeats in a String
2008
Cited by 1 (0 self)
Given a string x = x[1..n] on an alphabet of size α, and a threshold p_min ≥ 1, we first describe a new algorithm PSY1 that, based on suffix array construction, computes all the complete nonextendible repeats in x of length p ≥ p_min. PSY1 executes in Θ(n) time independent of alphabet size and is an order of magnitude faster than the two other algorithms previously proposed for this problem. Second, we describe a new fast algorithm PSY2 for computing all complete supernonextendible repeats in x that also executes in Θ(n) time independent of alphabet size, and is thus asymptotically faster than methods previously proposed. Both algorithms require 9n bytes of storage, including preprocessing (with a minor caveat for PSY1). We conclude with a brief discussion of applications to bioinformatics and data compression.
Sequence analysis mkESA: enhanced suffix array construction tool
Summary: We introduce the tool mkESA, an open source program for constructing enhanced suffix arrays (ESAs), striving for low memory consumption, yet high practical speed. mkESA is a user-friendly program written in portable C99, based on a parallelized version of the Deep-Shallow suffix array construction algorithm, which is known for its high speed and small memory usage. The tool handles large FASTA files with multiple sequences, and computes suffix arrays and various additional tables, such as the LCP table (longest common prefix) or the inverse suffix array, from given sequence data. Availability: The source code of mkESA is freely available under the terms of the GNU General Public License (GPL) version 2 at
to obtain the doctoral degree
2013
Development of methods and algorithms for the characterization and annotation of transcriptomes with high-throughput sequencers