Suffix Trees and their Applications in String Algorithms
, 1993
Cited by 14 (0 self)
The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others. In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching. Work partially supported by the ESPRIT BRA ALCOM II under contract no. 7141 and by the Italian MURST Project "Algoritmi, Modelli di Calcolo e Strutture Informative". Part of this work was done while the author was visiting AT&T Bell Laboratories. Email: grossi@di.uni...
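To make the "compacted trie of all suffixes" definition concrete, here is a minimal sketch of a naive quadratic-time construction. It is not one of the linear-time or parallel algorithms the survey covers. The names `Node`, `build_suffix_tree`, `contains`, and `count_leaves` are illustrative only, and the code assumes the character `$` does not occur in the text.

```python
class Node:
    """A suffix tree node; edges are stored as first_char -> (label, child)."""

    def __init__(self):
        self.children = {}
        self.suffix_start = None  # set only on leaves


def build_suffix_tree(text):
    """Naive O(n^2) construction: insert every suffix of text + '$'.

    Assumption: '$' does not occur in text, so no suffix is a prefix of
    another and every suffix ends at its own leaf.
    """
    text += "$"
    root = Node()
    for start in range(len(text)):
        node, suffix = root, text[start:]
        while True:
            entry = node.children.get(suffix[0])
            if entry is None:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.suffix_start = start
                node.children[suffix[0]] = (suffix, leaf)
                break
            label, child = entry
            # Length of the common prefix of the edge label and the suffix.
            k = 0
            while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
                k += 1
            if k == len(label):
                # The whole edge matches: continue below the child.
                node, suffix = child, suffix[k:]
                continue
            # Mismatch inside the edge: split it at position k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[label[0]] = (label[:k], mid)
            leaf = Node()
            leaf.suffix_start = start
            mid.children[suffix[k]] = (suffix[k:], leaf)
            break
    return root


def contains(root, pattern):
    """Return True if pattern occurs in the indexed text."""
    node = root
    while pattern:
        entry = node.children.get(pattern[0])
        if entry is None:
            return False
        label, child = entry
        n = min(len(label), len(pattern))
        if label[:n] != pattern[:n]:
            return False
        node, pattern = child, pattern[n:]
    return True


def count_leaves(node):
    """Number of leaves below node (one per suffix, terminator included)."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, child in node.children.values())
```

For example, `contains(build_suffix_tree("banana"), "nan")` returns `True`, and the tree has one leaf per suffix of `banana$`. Storing edge labels as explicit substrings keeps the sketch short but costs quadratic space; practical implementations store index pairs into the text instead.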
Optimal Logarithmic Time Randomized Suffix Tree Construction
- In Proc 23rd ICALP
, 1996
Cited by 14 (3 self)
The suffix tree of a string, the fundamental data structure in the area of combinatorial pattern matching, has many elegant applications. In this paper, we present a novel, simple sequential algorithm for the construction of suffix trees. We are also able to parallelize our algorithm so that we settle the main open problem in the construction of suffix trees: we give a Las Vegas CRCW PRAM algorithm that constructs the suffix tree of a binary string of length n in O(log n) time and O(n) work with high probability. In contrast, the previously known work-optimal algorithms, while deterministic, take Ω(log n) time.
Efficient Approximate and Dynamic Matching of Patterns Using a Labeling Paradigm (Extended Abstract)
Cited by 13 (1 self)
A key approach in string processing algorithmics has been the labeling paradigm [KMR72], which is based on assigning labels to some of the substrings of a given string. If these labels are chosen consistently, they can enable fast comparisons of substrings. Until the first optimal parallel algorithm for suffix tree construction was given in [SV94], the labeling paradigm was considered not to be competitive with other approaches. In this paper we show that this general method is also useful for several central problems in the area of string processing: approximate string matching, dynamic dictionary matching, and dynamic text indexing. The approximate string matching problem deals with finding all substrings of a text which match a pattern "approximately", i.e., with at most m differences. The differences can be in the form of inserted, deleted, or replaced characters. The text indexing problem deals with finding all occurrences of a pattern in a text, after the text is prep...
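The approximate string matching problem defined above (at most m inserted, deleted, or replaced characters) can be illustrated with the classic quadratic dynamic program. This is a textbook baseline, not the labeling-based algorithm of the paper, and the function name `approx_match_ends` is an assumption made for this sketch.

```python
def approx_match_ends(text, pattern, m):
    """Return 1-based end positions j in text such that some substring of
    text ending at j is within m edits (insert, delete, replace) of pattern.

    Classic O(len(text) * len(pattern)) dynamic program, shown only as a
    baseline for the problem statement.
    """
    p = len(pattern)
    # prev[i] = edit distance between pattern[:i] and the empty substring.
    prev = list(range(p + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        # cur[0] = 0 because a match may start at any text position.
        cur = [0] * (p + 1)
        for i in range(1, p + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or replace
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from the text
            )
        if cur[p] <= m:
            ends.append(j)
        prev = cur
    return ends
```

With `m = 0` the function reports exact occurrences: `approx_match_ends("abcabc", "abc", 0)` returns `[3, 6]`. Only one column of the table is kept at a time, so the extra space is linear in the pattern length.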
Optimal Parallel Construction of Minimal Suffix and Factor Automata
, 1995
Cited by 5 (0 self)
This paper gives optimal parallel algorithms for the construction of the smallest deterministic finite automata recognizing all the suffixes and the factors of a string. The algorithms use recently discovered optimal parallel suffix tree construction algorithms together with data structures for the efficient manipulation of trees, exploiting the well known relation between suffix and factor automata and suffix trees.
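As a point of reference for the automata mentioned above, the following is a minimal sketch of the well-known sequential, online construction of the suffix automaton. It is not the parallel algorithm of the paper, and the names `State`, `build_suffix_automaton`, `is_factor`, and `is_suffix` are illustrative assumptions. Treating every state as accepting recognizes all factors, although the result is not necessarily the minimal factor automaton.

```python
class State:
    """One automaton state: longest string length, suffix link, transitions."""

    def __init__(self, length, link=-1):
        self.length = length
        self.link = link
        self.next = {}


def build_suffix_automaton(s):
    """Sequential online construction; returns (states, index of last state)."""
    states = [State(0)]
    last = 0
    for ch in s:
        cur = len(states)
        states.append(State(states[last].length + 1))
        p = last
        # Add the new transition along the suffix-link chain where missing.
        while p != -1 and ch not in states[p].next:
            states[p].next[ch] = cur
            p = states[p].link
        if p == -1:
            states[cur].link = 0
        else:
            q = states[p].next[ch]
            if states[p].length + 1 == states[q].length:
                states[cur].link = q
            else:
                # Split q: the clone takes over the shorter strings.
                clone = len(states)
                copy = State(states[p].length + 1, states[q].link)
                copy.next = dict(states[q].next)
                states.append(copy)
                while p != -1 and states[p].next.get(ch) == q:
                    states[p].next[ch] = clone
                    p = states[p].link
                states[q].link = clone
                states[cur].link = clone
        last = cur
    return states, last


def _walk(states, w):
    """Follow w from the initial state; return the state reached or None."""
    v = 0
    for ch in w:
        if ch not in states[v].next:
            return None
        v = states[v].next[ch]
    return v


def is_factor(states, w):
    """Every state accepts: w is a factor iff the walk does not fail."""
    return _walk(states, w) is not None


def is_suffix(states, last, w):
    """Terminal states are those on the suffix-link chain from `last`."""
    terminals = set()
    v = last
    while v != -1:
        terminals.add(v)
        v = states[v].link
    return _walk(states, w) in terminals
```

For the string `abcbc`, `is_factor` accepts `bcb` and rejects `cc`, while `is_suffix` accepts `cbc` and rejects `bcb`.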
Space and time efficient parallel algorithms and software for EST clustering
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 2003
Cited by 5 (1 self)
Expressed sequence tags, abbreviated as ESTs, are DNA molecules experimentally derived from expressed portions of genes. Clustering of ESTs is essential for gene recognition and for understanding important genetic variations such as those resulting in diseases. In this paper, we present the algorithmic foundations and implementation of PaCE, a parallel software system we developed for large-scale EST clustering. The novel features of our approach include 1) design of space-efficient algorithms to limit the space required to linear in the size of the input data set, 2) a combination of algorithmic techniques to reduce the total work without sacrificing the quality of EST clustering, and 3) use of parallel processing to reduce runtime and facilitate clustering of large data sets. Using a combination of these techniques, we report the clustering of 327,632 rat ESTs in 47 minutes, and 420,694 Triticum aestivum ESTs in 3 hours and 15 minutes, using a 60-processor IBM xSeries cluster. These problems are well beyond the capabilities of state-of-the-art sequential software. We also present a thorough experimental evaluation of our software, including quality assessment using benchmark Arabidopsis EST data.
Linear-Time Construction of Two-Dimensional Suffix Trees (Extended Abstract)
- In Proceedings of the 26th International Colloquium on Automata, Languages and Programming (ICALP), volume 1644 of LNCS
, 1999
Cited by 4 (1 self)
Dong Kyue Kim and Kunsoo Park, Department of Computer Engineering, Seoul National University, Seoul 151-742, Korea ({dkkim,kpark}@theory.snu.ac.kr). The suffix tree of a string S is a compacted trie that represents all suffixes of S. Linear-time algorithms for constructing the suffix tree have been known for quite a while. In two dimensions, however, linear-time construction of two-dimensional suffix trees has been an open problem. We present the first linear-time algorithm for constructing two-dimensional suffix trees.
A Parallel Algorithm for the Extraction of Structured Motifs
, 2004
Cited by 4 (1 self)
In this work we propose a parallel algorithm for the efficient extraction of binding-site consensus from genomic sequences. This algorithm, based on an existing approach, extracts structured motifs, which consist of an ordered collection of p ≥ 1 boxes with sizes and spacings between them specified by given parameters. The contents of the boxes, which represent the extracted motifs, are unknown at the start of the process and are found by the algorithm using a suffix tree as the fundamental data structure. By partitioning the structured motif search space, we divide the most demanding part of the algorithm among a number of processors that can be loosely coupled. In this way we obtain, under conditions that are easily met, a speedup that is linear in the number of available processing units. This speedup is verified by both theoretical and experimental analysis, also presented in this paper.
Approximate Pattern Matching Using Locally Consistent Parsing
, 1997
Cited by 3 (0 self)
A key approach in string processing algorithmics has been the labeling paradigm [KMR72], which is based on assigning labels to some of the substrings of a given string. If these labels are chosen consistently, they can enable fast comparisons of substrings. Until the first optimal parallel algorithm for suffix tree construction was given in [SV94], the labeling paradigm was considered not to be competitive with the most efficient approaches. In this paper we show that this general method can be used to obtain a linear time, deterministic algorithm for the Approximate String Matching problem. The approximate string matching problem deals with finding all substrings of a text which match a pattern "approximately", i.e., with at most m differences. The differences can be in the form of inserted, deleted, or replaced characters. Department of Computer Science, University of Maryland, College Park, MD 20742; and Information Sciences Center, Bell Laboratories, Murray Hill, NJ 07974. y In...
Parallel EST Clustering
, 2002
Expressed sequence tags, abbreviated ESTs, are DNA fragments experimentally derived from expressed portions of genes. Clustering of ESTs is essential for gene recognition and understanding important genetic variations such as those resulting in diseases. In this paper, we present the design and development of a parallel software system for EST clustering. The novel features of our approach include 1) space efficient algorithms to keep the space requirement linear in the size of the input data set, 2) a combination of algorithmic techniques to reduce the total work without sacrificing the quality of EST clustering, and 3) use of parallel processing to reduce the run-time and facilitate the clustering of large data sets. Using a combination of these techniques, we report the clustering of 50,000 maize ESTs in 16 minutes on a 32-processor IBM SP. To our knowledge, this is the first effort in building a parallel software system for EST clustering.
Suffix Trees for Integer Alphabets Revisited
, 1999
Farach recently gave a linear-time algorithm for constructing suffix trees for integer alphabets, which solves a major open problem on index data structures. We present a new and somewhat cleaner algorithm for constructing suffix trees for integer alphabets in linear time.

