The enhanced suffix array and its applications to genome analysis
In Proc. Workshop on Algorithms in Bioinformatics, Lecture Notes in Computer Science, 2002
Cited by 44 (6 self)
Abstract. In large-scale applications such as computational genome analysis, the space requirement of the suffix tree is a severe drawback. In this paper, we present a uniform framework that enables us to systematically replace every string processing algorithm that is based on a bottom-up traversal of a suffix tree by a corresponding algorithm based on an enhanced suffix array (a suffix array enhanced with the lcp-table). In this framework, we will show how maximal, supermaximal, and tandem repeats, as well as maximal unique matches, can be efficiently computed. Because enhanced suffix arrays require much less space than suffix trees, very large genomes can now be indexed and analyzed, a task which was not feasible before. Experimental results demonstrate that our programs require not only less space but also much less time than other programs developed for the same tasks.
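As a rough illustration of what "a suffix array enhanced with the lcp-table" means, the following Python sketch builds both tables and uses them to report a longest repeated substring. The function names, the naive sorting-based construction, and the choice of example query are assumptions made for brevity; this is not the construction or the repeat-finding procedure described in the paper.

```python
def suffix_array(text):
    # Naive construction (assumption: fine for short inputs only):
    # sort suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_table(text, sa):
    # Kasai-style computation: lcp[k] is the length of the longest common
    # prefix of the suffixes at sa[k - 1] and sa[k]; lcp[0] is 0.
    n = len(text)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


def longest_repeat(text):
    # The largest lcp value marks two lexicographically adjacent suffixes
    # that share the longest prefix, i.e. a longest repeated substring.
    sa = suffix_array(text)
    lcp = lcp_table(text, sa)
    if not lcp:
        return ""
    best = max(range(len(sa)), key=lambda k: lcp[k])
    return text[sa[best]:sa[best] + lcp[best]]
```

For example, `longest_repeat("banana")` scans the lcp-table of the six suffixes and returns the substring shared by the two adjacent suffixes with the largest lcp value.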
Finding approximate repetitions under Hamming distance
Theoretical Computer Science, 2001
Cited by 27 (1 self)
The problem of computing tandem repetitions with K possible mismatches is studied. Two main definitions are considered, and for both of them an O(nK log K + S) algorithm is proposed (where S is the size of the output). This improves, in particular, the bound obtained in [LS93]. Finally, other possible definitions are briefly analyzed.
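To make the problem statement concrete, here is a naive Python sketch of one plausible reading: report every position and period where two adjacent blocks of equal length differ in at most K positions (Hamming distance). The function name, the `min_period` parameter, and this particular definition are assumptions for illustration; the sketch runs in quadratic time and is not the O(nK log K + S) algorithm of the paper.

```python
def approximate_tandem_repeats(text, k, min_period=1):
    # Returns (start, period) pairs such that text[start:start+period] and
    # text[start+period:start+2*period] differ in at most k positions.
    n = len(text)
    found = []
    for period in range(min_period, n // 2 + 1):
        # Mismatch count for the window starting at position 0.
        mismatches = sum(text[i] != text[i + period] for i in range(period))
        for start in range(n - 2 * period + 1):
            if mismatches <= k:
                found.append((start, period))
            # Slide the window one position to the right, if possible.
            if start + 2 * period < n:
                mismatches -= text[start] != text[start + period]
                mismatches += text[start + period] != text[start + 2 * period]
    return found
```

With `k = 0` this reduces to finding exact squares; raising `k` admits progressively noisier repeats, which is why short periods are usually filtered out with `min_period`.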
Finding maximal pairs with bounded gap
Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 1645 of Lecture Notes in Computer Science, 1999
Cited by 26 (5 self)
A pair in a string is the occurrence of the same substring twice. A pair is maximal if the two occurrences of the substring cannot be extended to the left and right without making them different. The gap of a pair is the number of characters between the two occurrences of the substring. In this paper we present methods for finding all maximal pairs under various constraints on the gap. In a string of length n we can find all maximal pairs with gap in an upper and lower bounded interval in time O(n log n + z) where z is the number of reported pairs. If the upper bound is removed the time reduces to O(n+z). Since a tandem repeat is a pair where the gap is zero, our methods can be seen as a generalization of finding tandem repeats. The running time of our methods equals the running time of well known methods for finding tandem repeats.
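The definitions above (pair, maximality, gap) can be checked on small inputs with a brute-force Python sketch. The function name, the tuple layout `(i, j, length)`, and the `min_gap`/`max_gap` parameters are assumptions for illustration; the sketch takes quadratic time or worse and is not the O(n log n + z) method of the paper.

```python
def maximal_pairs(text, min_gap=0, max_gap=None):
    # Returns (i, j, length) with i < j such that text[i:i+length] equals
    # text[j:j+length], the pair cannot be extended left or right, and the
    # gap j - (i + length) lies within the requested bounds.
    n = len(text)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-maximality: the preceding characters must differ.
            if i > 0 and text[i - 1] == text[j - 1]:
                continue
            # Right-maximality: extend as far as the characters agree.
            length = 0
            while j + length < n and text[i + length] == text[j + length]:
                length += 1
            if length == 0:
                continue
            gap = j - (i + length)
            if gap < min_gap:
                continue
            if max_gap is not None and gap > max_gap:
                continue
            pairs.append((i, j, length))
    return pairs
```

Calling `maximal_pairs(text, 0, 0)` restricts the output to pairs with gap zero, which matches the abstract's remark that a tandem repeat is a pair whose gap is zero.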
Maximal repetitions in strings
2008
Cited by 16 (6 self)
The cornerstone of any algorithm computing all repetitions in strings of length n in O(n) time is the fact that the number of maximal repetitions (runs) is linear. Therefore, the most important part of the analysis of the running time of such algorithms is counting the number of runs. Kolpakov and Kucherov [FOCS’99] proved it to be cn but could not provide any value for c. Recently, Rytter [STACS’06] proved that c ≤ 5. His analysis has been improved by Puglisi et al. to obtain 3.48 and by Rytter to 3.44 (both submitted). The conjecture of Kolpakov and Kucherov, supported by computations, is that c = 1. Here we improve dramatically the previous results by proving that c ≤ 1.6 and show how it could be improved by computer verification down to 1.18 or less. While the conjecture may be very difficult to prove, we believe that our work provides a good approximation for all practical purposes. For the stronger result concerning the linearity of the sum of exponents, we give the first explicit bound: 5.6n. Kolpakov and Kucherov did not have any and Rytter considered “unsatisfactory” the bound that could be deduced from his proof. Our bound could be as well improved by computer verification down to 2.9n or less.
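For readers who want to see what is being counted, the following Python sketch enumerates the runs of a short string by brute force. It assumes the usual definition of a run: a maximal interval whose smallest period p satisfies length ≥ 2p. The function name and the `(start, end, period)` output format (with `end` exclusive) are hypothetical, and the quadratic scan is only meant for small experiments, not as one of the linear-time algorithms the abstract refers to.

```python
def runs(text):
    # Returns sorted (start, end, period) triples, end exclusive.
    n = len(text)
    found = {}
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if text[k] != text[k + p]:
                k += 1
                continue
            start = k
            # Maximal stretch of positions k with text[k] == text[k + p].
            while k + p < n and text[k] == text[k + p]:
                k += 1
            # text[start:k + p] has period p; keep it if it is at least 2p long.
            if k - start >= p:
                # Periods are tried in increasing order, so the first period
                # recorded for an interval is its smallest one.
                found.setdefault((start, k + p), p)
    return sorted((s, e, p) for (s, e), p in found.items())
```

Comparing `len(runs(s))` with `len(s)` on small strings gives a feel for the constant c discussed in the abstract.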
New lower bounds for the maximum number of runs in a string
In Proc. Prague Stringology Conference (PSC’08), 2008
Cited by 13 (2 self)
Abstract. We show a new lower bound for the maximum number of runs in a string. We prove that for any ε > 0, (α − ε)n is an asymptotic lower bound, where α = 174719/184973 ≈ 0.944565. It is superior to the previous bound 3/(1 + √5) ≈ 0.927 given by Franěk et al. [6,7]. Moreover, our construction of the strings and the proof are much simpler than theirs.
Finding approximate tandem repeats in genomic sequences
J. Comp. Biol., 2005
Cited by 13 (0 self)
An efficient algorithm is presented for detecting approximate tandem repeats in genomic sequences. The algorithm is based on a flexible statistical model which allows a wide range of definitions of approximate tandem repeats. The ideas and methods underlying the algorithm are described and examined and its effectiveness on genomic data is demonstrated.
Spectral Repeat Finder (SRF): Identification of Repetitive Sequences using Fourier Transformation
Bioinformatics
Cited by 11 (2 self)
Motivation: Repetitive DNA sequences, besides having a variety of regulatory functions, are one of the principal causes of genomic instability. Understanding their origin and evolution is of fundamental importance for genome studies. The identification of repeats and their units helps in deducing the intragenomic dynamics as an important feature of comparative genomics. A major difficulty in identification of repeats arises from the fact that the repeat units can be either exact or imperfect, in tandem or dispersed, and of unspecified length. Results: The Spectral Repeat Finder program circumvents these problems by using a discrete Fourier transformation to identify significant periodicities present in a sequence. The specific regions of the sequence that contribute to a given periodicity are located through a sliding window analysis, and an exact search method is then used to find the repetitive units. Efficient and complete detection of repeats is provided together with interactive and detailed visualization of the spectral analysis of input sequence. We demonstrate the utility of our method with various examples that contain previously unannotated repeats. A Web server has been developed for convenient access to the automated program.
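The idea of using a discrete Fourier transformation to expose periodicities can be sketched in a few lines of Python. The sketch below assumes one common encoding (a 0/1 indicator sequence per nucleotide, with the power spectra summed) and a direct O(n²) transform; the function names and the encoding are assumptions for illustration and do not reproduce the Spectral Repeat Finder program, its sliding-window step, or its exact search step.

```python
import cmath


def power_spectrum(seq, alphabet="ACGT"):
    # Sum of squared DFT magnitudes over one indicator sequence per base.
    n = len(seq)
    spectrum = [0.0] * n
    for base in alphabet:
        indicator = [1.0 if c == base else 0.0 for c in seq]
        for f in range(n):
            coeff = sum(
                x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(indicator)
            )
            spectrum[f] += abs(coeff) ** 2
    return spectrum


def dominant_period(seq):
    # Frequency f (cycles per sequence length) corresponds to period n / f.
    # Frequency 0 only reflects base composition, so it is skipped.
    n = len(seq)
    spectrum = power_spectrum(seq)
    best = max(range(1, n // 2 + 1), key=lambda f: spectrum[f])
    return n / best
```

A perfectly periodic input such as `"ACG" * 8` concentrates its power at the frequency matching a repeat unit of length 3; imperfect repeats spread the peak but typically leave it visible.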
Finding Repeats With Fixed Gap
In: Proc. of the 7th Int’l Symp. on String Processing and Information Retrieval (SPIRE). Washington: IEEE Computer Society, 2000
Cited by 8 (2 self)
We propose an algorithm for finding in a word all pairs of occurrences of the same subword with a given distance r between them. The obtained complexity is O(n log r + S), where S is the size of the output. We also show how the algorithm can be modified in order to find all such pairs of occurrences separated by a given word. The solution uses an algorithm for finding all quasi-squares in two strings, a problem that generalizes the known problem of searching for squares.