The String B-Tree: A New Data Structure for String Search in External Memory and its Applications.
Journal of the ACM, 1998
Cited by 122 (12 self)
We introduce a new text-indexing data structure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In a short phrase, it is a combination of B-trees and Patricia tries for internal-node indices that is made more effective by adding extra pointers to speed up search and update operations. Consequently, the String B-Tree overcomes the theoretical limitations of inverted files, B-trees, prefix B-trees, suffix arrays, compacted tries and suffix trees. String B-trees have the same worst-case performance as B-trees but they manage unbounded-length strings and perform much more powerful search operations such as the ones supported by suffix trees. String B-trees are also effective in main memory (RAM model) because they improve the online suffix tree search on a dynamic set of strings. They can also be successfully applied to database indexing and software duplication.
Combinatorial algorithms for DNA sequence assembly
Algorithmica, 1993
Cited by 42 (3 self)
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover it uses a limited form ...
Incremental String Comparison
SIAM Journal on Computing, 1995
Cited by 38 (3 self)
The problem of comparing two sequences A and B to determine their LCS or the edit distance between them has been much studied. In this paper we consider the following incremental version of these problems: given an appropriate encoding of a comparison between A and B, can one incrementally compute the answer for A and bB, and the answer for A and Bb with equal efficiency, where b is an additional symbol? Our main result is a theorem exposing a surprising relationship between the dynamic programming solutions for two such "adjacent" problems. Given a threshold k on the number of differences to be permitted in an alignment, the theorem leads directly to an O(k) algorithm for incrementally computing a new solution from an old one, in contrast to the O(k²) time required to compute a solution from scratch. We further show with a series of applications that this algorithm is indeed more powerful than its non-incremental counterpart by solving the applications with greater asymptotic ef...
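As a point of reference for the thresholded comparison problem described above, the following Python sketch computes the edit distance between two strings only when it does not exceed a threshold k. It is not the incremental algorithm of the paper, nor the O(k²) from-scratch method the abstract mentions: it is a simpler banded dynamic program that fills only cells within k of the main diagonal, which takes roughly O(k · |A|) time. The function name `edit_distance_within` is hypothetical.

```python
from typing import Optional


def edit_distance_within(a: str, b: str, k: int) -> Optional[int]:
    """Return the edit distance of a and b if it is at most k, else None.

    Banded dynamic programming: a cell (i, j) with |i - j| > k can never
    lie on an alignment with at most k differences, so it is skipped.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = k + 1  # any value above k means "over the threshold"
    # prev[j] holds the distance between a[:i-1] and b[:j], capped at inf.
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            cur[0] = i  # delete the first i symbols of a
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # deletion from a
                cur[j - 1] + 1,      # insertion into a
                inf,
            )
        prev = cur
    return prev[m] if prev[m] <= k else None
```

For example, `edit_distance_within("kitten", "sitting", 3)` returns 3, while the same call with k = 2 returns `None` because the true distance exceeds the threshold.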
Suffix Trees and their Applications in String Algorithms
1993
Cited by 17 (0 self)
The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others. In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching. Work partially supported by the ESPRIT BRA ALCOM II under contract no. 7141 and by the Italian MURST Project "Algoritmi, Modelli di Calcolo e Strutture Informative". Part of this work was done while the author was visiting AT&T Bell Laboratories. Email: grossi@di.uni...
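To make the idea of a trie over all suffixes concrete, here is a minimal Python sketch. It builds an uncompacted suffix trie with nested dictionaries, so it takes quadratic time and space in the text length. A real suffix tree compacts unary paths into single edges and can be built in linear time, which this sketch does not attempt. The names `build_suffix_trie` and `contains` are hypothetical.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a trie of nested dicts (not compacted)."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Follow the edge labelled ch, creating it if it is missing.
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    """A pattern occurs in the text iff it is a prefix of some suffix,
    i.e. iff it can be spelled out along a path starting at the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

With `trie = build_suffix_trie("banana")`, the query `contains(trie, "nan")` is true and `contains(trie, "nab")` is false; each query costs time proportional to the pattern length, independent of the text length.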
A 2 2/3-Approximation Algorithm for the Shortest Superstring Problem
DIMACS Workshop on Sequencing and Mapping, 1995
Cited by 14 (0 self)
Given a collection of strings S = {s_1, ..., s_n} over an alphabet \Sigma, a superstring \alpha of S is a string containing each s_i as a substring; that is, for each i, 1 <= i <= n, \alpha contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring of minimum length. The shortest superstring problem has applications in both data compression and computational biology. In data compression, the problem is a part of a general model of string compression proposed by Gallant, Maier and Storer (JCSS '80). Much of the recent interest in the problem is due to its application to DNA sequence assembly. The problem has been shown to be NP-hard; in fact, it was shown by Blum et al. (JACM '94) to be MAX SNP-hard. The first O(1)-approximation was also due to Blum et al., who gave an algorithm that always returns a superstring no more than 3 times the length of an optimal solution. Several researchers have published results that improve on the approximation ratio; of these, the best previous result is our algorithm ShortString, which achieves a 2 3
Sequential and Parallel Approximation of Shortest Superstrings
1997
Cited by 7 (1 self)
Superstrings have many applications in data compression and genetics. However, the decision version of the shortest superstring problem is NP-complete. In this paper we examine the complexity of approximating shortest superstrings. There are two basic measures of the approximations: the length factor and the compression factor. The well-known and practical approximation algorithm is the sequential algorithm GREEDY. It approximates the shortest superstring with a compression factor of 1/2 and a length factor of 4. Our main results are: (1) A sequential length approximation algorithm which achieves a length factor of 2.83. This result improves the best previously known bound of 2.89 due to Teng and Yao. Very recently, this bound was improved by Kosaraju, Park, and Stein to 2.79, and by Armen and Stein to 2.75. (2) A proof that the algorithm GREEDY is not paralleliz
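The sequential GREEDY heuristic named above is usually described as repeatedly merging the two strings with the largest overlap. The Python sketch below follows that description under a few assumptions that the abstract does not specify: strings contained in another input string are dropped first, ties between equally good pairs are broken by scan order, and overlaps are found by brute force. The function names `overlap` and `greedy_superstring` are hypothetical.

```python
from typing import List


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: List[str]) -> str:
    """Merge the pair with maximum overlap until one string remains."""
    # Drop duplicates and strings already contained in another input string.
    uniq = list(dict.fromkeys(strings))
    parts = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    while len(parts) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(parts):
            for j, b in enumerate(parts):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # ties keep the first pair found
                        best_k, best_i, best_j = k, i, j
        # Glue the two strings together, writing the shared part only once.
        merged = parts[best_i] + parts[best_j][best_k:]
        parts = [p for idx, p in enumerate(parts)
                 if idx not in (best_i, best_j)] + [merged]
    return parts[0] if parts else ""
```

For the input `["abc", "bcd", "cde"]` the sketch returns `"abcde"`. Every merge keeps both merged strings as substrings, so the result always contains each input string; how far its length is from optimal is the subject of the bounds quoted in the abstract.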
An experimental study of SB-trees
In ACM-SIAM Symposium on Discrete Algorithms, 1996
Cited by 2 (0 self)
In a previous work of ours [13], we proposed a text indexing data structure for external memory, which we called SB-tree, that combines the best B-tree and suffix array qualities to overcome the limitations of inverted files, suffix arrays, suffix trees, and prefix B-trees. In this paper, we study the performance of SB-trees in a practical setting by running a large number of searching and updating experiments. We obtain fast practical performance by means of a new space-efficient and alphabet-independent organization of SB-tree nodes and a new batch insertion procedure that avoids thrashing. 1 Introduction Textual data in electronic form are more available than before and range from published documents (e.g., electronic dictionaries, libraries and archives, etc.) to private databases (e.g., marketing information, legal records, medical histories, etc.). Online providers of legal and newswire texts (such as Westlaw and LexisNexis) already have hundreds of text gigabytes and will have...
Incremental clone detection. Diploma thesis
2008
Cited by 2 (1 self)
I would like to thank the members of the Software Engineering Group for discussing my ideas and giving me feedback. Special thanks go to Rainer Koschke and Renate Klempien-Hinrichs for supervising this thesis. Furthermore, I would like to apologize to my family and friends, who did not get the share of my time that they deserved. Declaration of Authorship I declare that I wrote this thesis without external help. I did not use any sources except those explicitly stated or referenced in the bibliography. All parts which have been literally or according to their meaning taken from publications are indicated as such.
A 2 2/3 Superstring Approximation Algorithm
1998
Cited by 1 (0 self)
Given a collection of strings S = {s_1, ..., s_n} over an alphabet \Sigma, a superstring \alpha of S is a string containing each s_i as a substring; that is, for each i, 1 <= i <= n, \alpha contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring \alpha of minimum length.
The shortest superstring problem has applications in both data compression and computational biology. It was shown by Blum et al. [3] to be MAX SNP-hard. The first O(1)-approximation algorithm also appeared in [3], which returns a superstring no more than 3 times the length of an optimal solution. Prior to the algorithm described in this paper, there were several published results that improved on the approximation ratio; of these, the best was our algorithm ShortString, a 2 3/4-approximation [1].
We present our new algorithm, GShortString, which achieves an approximation ratio of 2 2/3. Our approach builds on the work in [1], in which we identified classes of strings that have a nested periodic structure, and which must be present in the worst case for our algorithms. We introduced machinery to describe these strings and proved strong structural properties about them. In this paper we extend this study to strings that exhibit a more relaxed form of the same structure, and we use this understanding to obtain our improved result.
Algorithms for Three Versions of the Shortest Common Superstring Problem
Cited by 1 (0 self)
Abstract. The input to the Shortest Common Superstring (SCS) problem is a set S of k words of total length n. In the classical version the output is an explicit word SCS(S) in which each s ∈ S occurs at least once. In our paper we consider two versions with multiple occurrences, in which the input includes additional numbers (multiplicities), given in binary. Our output is the word SCS(S) given implicitly in a compact form, since its real size could be exponential. We also consider a case when all input words are of length two, where our main algorithmic tool is a compact representation of Eulerian cycles in multigraphs. Due to exponential multiplicities of edges such cycles can be exponential and the compact representation is needed. Other tools used in our paper are a polynomial case of integer linear programming and a min-plus product of matrices.