Results 1 - 10
of
206
A Guided Tour to Approximate String Matching
- ACM Computing Surveys
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract
-
Cited by 306 (38 self)
- Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems. 1
XRANK: Ranked Keyword Search over XML Documents
, 2003
"... We consider the problem of efficiently producing ranked results for keyword search queries over hyperlinked XML documents. Evaluating ..."
Abstract
-
Cited by 161 (1 self)
- Add to MetaCart
We consider the problem of efficiently producing ranked results for keyword search queries over hyperlinked XML documents. Evaluating
The LCA problem revisited
- In Latin American Theoretical INformatics
, 2000
"... Abstract. We present a very simple algorithm for the Least Common Ancestors problem. We thus dispel the frequently held notion that optimal LCA computation is unwieldy and unimplementable. Interestingly, this algorithm is a sequentialization of a previously known PRAM algorithm. 1 ..."
Abstract
-
Cited by 144 (6 self)
- Add to MetaCart
Abstract. We present a very simple algorithm for the Least Common Ancestors problem. We thus dispel the frequently held notion that optimal LCA computation is unwieldy and unimplementable. Interestingly, this algorithm is a sequentialization of a previously known PRAM algorithm. 1
An O(ND) Difference Algorithm and Its Variations
- Algorithmica
, 1986
"... The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a s ..."
Abstract
-
Cited by 133 (4 self)
- Add to MetaCart
The problems of finding a longest common subsequence of two sequences A and B and a shortest edit script for transforming A into B have long been known to be dual problems. In this paper, they are shown to be equivalent to finding a shortest/longest path in an edit graph. Using this perspective, a simple O(ND) time and space algorithm is developed where N is the sum of the lengths of A and B and D is the size of the minimum edit script for A and B. The algorithm performs well when differences are small (sequences are similar) and is consequently fast in typical applications. The algorithm is shown to have O(N +D expected-time performance under a basic stochastic model. A refinement of the algorithm requires only O(N) space, and the use of suffix trees leads to an O(NlgN +D ) time variation.
Simple linear work suffix array construction
, 2003
"... Abstract. Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more exp ..."
Abstract
-
Cited by 119 (6 self)
- Add to MetaCart
Abstract. Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any v ∈ [1, √ n], it runs in O(vn) time using O(n / √ v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
Optimal Suffix Tree Construction with Large Alphabets
, 1997
"... The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. Weiner [Wei73], who introduced the data structure, gave an O(n) time algorithm algorithm for building the suffix tree of an n character string drawn from a constant size alphabet. In the comparison model ..."
Abstract
-
Cited by 118 (0 self)
- Add to MetaCart
The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. Weiner [Wei73], who introduced the data structure, gave an O(n) time algorithm algorithm for building the suffix tree of an n character string drawn from a constant size alphabet. In the comparison model, there is a trivial\Omega\Gamma n log n) time lower bound based on sorting, and Weiner's algorithm matches this bound trivially. Since Weiner's paper, the main open question has been how to deal with integer alphabets. There is no super-linear lower bound, and the fastest known algorithm was the O(n log n) time comparison based algorithm. We settle this open problem by closing the gap: we build suffix trees in linear time for integer alphabet.
TreeJuxtaposer: scalable tree comparison using Focus+Context with guaranteed visibility
- ACM Transactions on Graphics
, 2003
"... Structural comparison of large trees is a difficult task that is only partially supported by current visualization techniques, which are mainly designed for browsing. We present TreeJuxtaposer, a system designed to support the comparison task for large trees of several hundred thousand nodes. We int ..."
Abstract
-
Cited by 89 (5 self)
- Add to MetaCart
Structural comparison of large trees is a difficult task that is only partially supported by current visualization techniques, which are mainly designed for browsing. We present TreeJuxtaposer, a system designed to support the comparison task for large trees of several hundred thousand nodes. We introduce the idea of “guaranteed visibility”, where highlighted areas are treated as landmarks that must remain visually apparent at all times. We propose a new methodology for detailed structural comparison between two trees and provide a new nearly-linear algorithm for computing the best corresponding node from one tree to another. In addition, we present a new rectilinear Focus+Context technique for navigation that is well suited to the dynamic linking of side-by-side views while guaranteeing landmark visibility and constant frame rates. These three contributions result in a system delivering a fluid exploration experience that scales both in the size of the dataset and the number of pixels in the display. We have based the design decisions for our system on the needs of a target audience of biologists who must understand the structural details of many phylogenetic, or evolutionary, trees. Our tool is also useful in many other application domains where tree comparison is needed, ranging from network management to call graph optimization to genealogy.
A few logs suffice to build (almost) all trees (I)
- II. THEORETICAL COMPUTER SCIENCE
, 1999
"... A phylogenetic tree (also called an "evolutionary tree") is a leaf-labelled tree which represents the evolutionary history for a set of species, and the construction of such trees is a fundamental problem in biology. Here we address the issue of how many sequence sites are required in order to recov ..."
Abstract
-
Cited by 81 (24 self)
- Add to MetaCart
A phylogenetic tree (also called an "evolutionary tree") is a leaf-labelled tree which represents the evolutionary history for a set of species, and the construction of such trees is a fundamental problem in biology. Here we address the issue of how many sequence sites are required in order to recover the tree with high probability when the sites evolve under standard Markov-style i.i.d. mutation models. We provide analytic upper and lower bounds for the required sequence length, by developing a new (and polynomial time) algorithm. In particular we show that when the mutation probabilities are bounded the required sequence length can grow surprisingly slowly (a power of log n) in the number n of sequences, for almost all trees.
Ambivalent Data Structures For Dynamic 2-Edge-Connectivity And k Smallest Spanning Trees
- SIAM J. Comput
, 1991
"... . Ambivalent data structures are presented for several problems on undirected graphs. These data structures are used in finding the k smallest spanning trees of a weighted undirected graph in O(m log #(m, n) + min{k 3/2 ,km 1/2 }) time, where m is the number of edges and n the number of vertice ..."
Abstract
-
Cited by 73 (1 self)
- Add to MetaCart
. Ambivalent data structures are presented for several problems on undirected graphs. These data structures are used in finding the k smallest spanning trees of a weighted undirected graph in O(m log #(m, n) + min{k 3/2 ,km 1/2 }) time, where m is the number of edges and n the number of vertices in the graph. The techniques are extended to find the k smallest spanning trees in an embedded planar graph in O(n + k(log n) 3 ) time. Ambivalent data structures are also used to dynamically maintain 2-edge-connectivity information. Edges and vertices can be inserted or deleted in O(m 1/2 ) time, and a query as to whether two vertices are in the same 2-edge-connected component can be answered in O(log n) time, where m and n are understood to be the current number of edges and vertices, respectively. Key words. analysis of algorithms, data structures, embedded planar graph, fully persistent data structures, k smallest spanning trees, minimum spanning tree, on-line updating, topology tr...
Nearest Common Ancestors: A survey and a new distributed algorithm
, 2002
"... Several papers describe linear time algorithms to preprocess a tree, such that one can answer subsequent nearest common ancestor queries in constant time. Here, we survey these algorithms and related results. A common idea used by all the algorithms for the problem is that a solution for complete ba ..."
Abstract
-
Cited by 65 (8 self)
- Add to MetaCart
Several papers describe linear time algorithms to preprocess a tree, such that one can answer subsequent nearest common ancestor queries in constant time. Here, we survey these algorithms and related results. A common idea used by all the algorithms for the problem is that a solution for complete balanced binary trees is straightforward. Furthermore, for complete balanced binary trees we can easily solve the problem in a distributed way by labeling the nodes of the tree such that from the labels of two nodes alone one can compute the label of their nearest common ancestor. Whether it is possible to distribute the data structure into short labels associated with the nodes is important for several applications such as routing. Therefore, related labeling problems have received a lot of attention recently.

