Results 1 - 10 of 30
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
, 2005
Cited by 189 (17 self)
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_|Σ| n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_|Σ| n + lg^ε_|Σ| n) search time in the worst case, for any constant ...
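As a point of reference for the classical indexes this abstract compares against, here is a minimal Python sketch of a plain (uncompressed) suffix array searched by binary search. It is not the compressed index the paper presents. The names `build_suffix_array` and `find_occurrences` are illustrative assumptions, the construction is deliberately naive, and the search costs O(m lg n) character comparisons; the O(m + lg n) bound quoted in the abstract needs additional longest-common-prefix information that is omitted here.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction (sorts full suffix copies); fine for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, in increasing order."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # The occ matching suffixes are contiguous in the array.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

The array stores n integers of about lg n bits each, which is the Ω(n lg n)-bit overhead the abstract contrasts with the n lg |Σ| bits of the text itself.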
Speeding Up Two String-Matching Algorithms
 ALGORITHMICA
, 1994
Cited by 94 (17 self)
We show how to speed up two string-matching algorithms: the Boyer-Moore algorithm (BM algorithm), and its version called here the reverse factor algorithm (RF algorithm). The RF algorithm is based on factor graphs for the reverse of the pattern. The main feature of both algorithms is that they scan the text right-to-left from the supposed right position of the pattern. The BM algorithm goes as far as the scanned segment (factor) is a suffix of the pattern. The RF algorithm scans while the segment is a factor of the pattern. Both algorithms make a shift of the pattern, forget the history, and start again. The RF algorithm usually makes bigger shifts than BM, but is quadratic in the worst case. We show that it is enough to remember the last matched segment (represented by two pointers to the text) to speed up the RF algorithm considerably (to make a linear number of inspections of text symbols, with small coefficient), and to speed up the BM algorithm (to make at most 2n comparisons). Only a constant additional memory is needed for the search phase. We give alternative versions of an accelerated RF algorithm: the first one is based on combinatorial properties of primitive words, and the other two use the power of suffix trees extensively. The paper demonstrates the techniques to transform algorithms, and also shows interesting new applications of data structures representing all subwords of the pattern in compact form.
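The baseline RF scan described in this abstract (scan right-to-left while the scanned segment is a factor of the pattern, then shift and forget the history) can be sketched in Python as follows. This is only an illustration of the unaccelerated scan, not the paper's sped-up variant: the function name `rf_search` is an assumption, and the factor test uses Python's substring operator in place of the factor graph the paper relies on, so each test is far slower than in a real implementation.

```python
def rf_search(text, pattern):
    """Return all start positions of pattern in text using a naive reverse-factor scan."""
    n, m = len(text), len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    occurrences = []
    pos = 0  # left end of the current window text[pos:pos + m]
    while pos <= n - m:
        j = m      # the scanned segment is text[pos + j : pos + m]
        shift = m  # default shift when no prefix of the pattern was seen
        # Extend the segment to the left while it is still a factor of the pattern.
        while j > 0 and text[pos + j - 1:pos + m] in pattern:
            j -= 1
            segment = text[pos + j:pos + m]
            if pattern.startswith(segment):
                if j == 0:
                    occurrences.append(pos)  # the whole window matched
                else:
                    shift = j  # longest pattern prefix ending at the window's right end
        pos += shift
    return occurrences


print(rf_search("abracadabra", "abra"))  # [0, 7]
```

The shift aligns the start of the pattern with the longest scanned segment that is a proper prefix of the pattern, which is why no occurrence can be skipped.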
A Generalized Suffix Tree and Its (Un)Expected Asymptotic Behaviors
 SIAM J. Computing
, 1996
Cited by 53 (29 self)
Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compressions and codes. Despite this, very little is known about their typical behaviors. In a probabilistic framework, we consider a family of suffix trees, further called b-suffix trees, built from the first n suffixes of a random word. In this family a non-compact suffix tree (i.e., such that every edge is labeled by a single symbol) is represented by b = 1, and a compact suffix tree (i.e., without unary nodes) is asymptotically equivalent to b → ∞ as n → ∞. We study several parameters of b-suffix trees, namely: the depth of a given suffix, the depth of insertion, the height and the shortest feasible path. Some new results concerning typical (i.e., almost sure) behaviors of these parameters are established. These findings are used to obtain several insights into certain algorithms on words, molecular biology and universal data compression schemes. ...
Tries for Approximate String Matching
 IEEE Transactions on Knowledge and Data Engineering
, 1996
Cited by 30 (1 self)
Tries offer text searches with costs which are independent of the size of the document being searched, and so are important for large documents requiring secondary storage. Approximate searches, in which the search pattern differs from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie-based method whose cost is independent of document size. Our experiments show that this new method significantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases. The linear cost (in k) of the other methods begins to catch up, for our small files, only at k=2. For larger files, complexity arguments i...
H. Shang and T.H. Merrett are at the School of Computer Science, McGill University, Montréal, Québec, Canada H3A 2A7. Email: {shang, tim}@cs.mcgill.ca
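To make the idea of approximate search over a trie concrete, here is a minimal Python sketch. It is not the authors' method: it indexes a small list of words rather than a document, it counts only substitutions, insertions and deletions (transpositions are left out), and the names `build_trie` and `approx_search` are illustrative assumptions. It walks the trie while maintaining one row of the edit-distance table per node and abandons a branch as soon as every entry of the row exceeds k.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def approx_search(root, pattern, k):
    """Return sorted (word, distance) pairs for stored words within k edits of pattern."""
    m = len(pattern)
    results = []

    def walk(node, row):
        # row[i] = edit distance between pattern[:i] and the path to this node.
        if None in node and row[m] <= k:
            results.append((node[None], row[m]))
        for ch, child in node.items():
            if ch is None:
                continue
            new_row = [row[0] + 1]
            for i in range(1, m + 1):
                cost = 0 if pattern[i - 1] == ch else 1
                new_row.append(min(new_row[i - 1] + 1,   # skip a pattern symbol
                                   row[i] + 1,           # skip a trie symbol
                                   row[i - 1] + cost))   # match or substitute
            if min(new_row) <= k:  # otherwise no descendant can be within k
                walk(child, new_row)

    walk(root, list(range(m + 1)))
    return sorted(results)


trie = build_trie(["cat", "cart", "car", "dog"])
print(approx_search(trie, "cat", 1))  # [('car', 1), ('cart', 1), ('cat', 0)]
```

Because whole subtrees are pruned once the distance bound is exceeded, the work done depends mainly on the pattern and on k rather than on how many words are stored.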
Analysis of the average depth in a suffix tree under a Markov model
 In International Conference on the Analysis of Algorithms
, 2005
Cited by 12 (4 self)
In this report, we prove that under a Markovian model of order one, the average depth of suffix trees of index n is asymptotically similar to the average depth of tries (a.k.a. digital trees) built on n independent strings. This leads to an asymptotic behavior of (log n)/h + C for the average of the depth of the suffix tree, where h is the entropy of the Markov model and C is constant. Our proof compares the generating functions for the average depth in tries and in suffix trees; the difference between these generating functions is shown to be asymptotically small. We conclude by using the asymptotic behavior of the average depth in a trie under the Markov model found by Jacquet and Szpankowski ([4]).
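The leading term (log n)/h in this abstract can be evaluated numerically once the entropy h of the Markov model is known. The Python sketch below is a hedged illustration, not taken from the report: it assumes the chain is given as a row-stochastic transition matrix `P`, that it is ergodic so that power iteration converges to its stationary distribution, and it uses the standard entropy-rate formula h = -Σ_i π_i Σ_j P[i][j] log P[i][j]. The function names are hypothetical, and the constant C is not computed because the abstract does not give it.

```python
import math


def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of an ergodic chain by power iteration."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi


def entropy_rate(P):
    """Entropy rate (in nats) of an order-one Markov chain with transition matrix P."""
    pi = stationary_distribution(P)
    k = len(P)
    return -sum(pi[i] * P[i][j] * math.log(P[i][j])
                for i in range(k) for j in range(k) if P[i][j] > 0)


def leading_depth_term(n, P):
    """Leading term (log n)/h of the average depth; the additive constant C is omitted."""
    return math.log(n) / entropy_rate(P)


# Memoryless fair binary source: h = ln 2, so the leading term is log2(n).
fair = [[0.5, 0.5], [0.5, 0.5]]
print(leading_depth_term(1024, fair))  # approximately 10
```

The ratio does not depend on the base of the logarithm, provided log n and h use the same base.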
String Transformation Learning
 In Proceedings of ACL/EACL'97
, 1997
Cited by 7 (0 self)
String transformation systems have been introduced in (Brill, 1995) and have several applications in natural language processing. In this work we consider the computational problem of automatically learning from a given corpus the set of transformations presenting the best evidence. We introduce an original data structure and efficient algorithms that learn some families of transformations that are relevant for part-of-speech tagging and phonological rule systems. We also show that the same learning problem becomes NP-hard in cases of an unbounded use of don't care symbols in a transformation.
Evolution of Musical Motifs in Polyphonic Passages
 Symposium on AI and Creativity in Arts and Science, Proceedings of AISB’02
, 2002
Cited by 5 (3 self)
In this paper we consider the problem of motif evolution in polyphonic musical sequences. A related problem, where a set of sequences of notes (one sequence for a voice) and a pattern is given, is to find whether approximate occurrences of the pattern occur distributed across the sequences (Holub et al., 1999; Lemstrom and Tarhio, 2000). Formally, this related problem is as follows: given a set t of h strings (each representing a voice) t^i_1, ..., t^i_n, i ∈ {1..h}, for some constant h and a pattern p = p_1, ..., p_m, we say that p occurs at position j of t if p_1 = t^{i_1}_j, p_2 = t^{i_2}_{j+1}, ..., p_m = t^{i_m}_{j+m-1} for some {i_1, ..., i_m} ∈ {1..h}. Our problem of finding evolutionary chains is defined as follows: given a set t of h strings t (the target), for some constant h and a motif p, find whether there exists a sequence u_1 = p, u_2, ..., u_ℓ occurring in the target t such that u_{j+1} occurs to the right of u_j in t and for any given j ∈ {1..ℓ-1}, u_j and u_{j+1} are similar enough, i.e., they do not differ more than by a certain number of basic operations: insertions, deletions and substitutions. In this paper, we consider several variants of the evolutionary chain problem and present efficient algorithms solving them.
Random Suffix Search Trees
, 2003
Cited by 3 (2 self)
A random suffix search tree is a binary search tree constructed for the suffixes X_i = 0.B_i B_{i+1} B_{i+2} ... of a sequence B_1, B_2, B_3, ... of independent identically distributed random b-ary digits B_j. Let D_n denote the depth of the node for X_n in this tree when B_1 is uniform on Z_b. We show that for any value of b > 1, E D_n = 2 log n + O(log log n), just as for the random binary search tree. We also show that D_n / E D_n → 1 in probability.
Sequential and indexed two-dimensional combinatorial template matching allowing rotations
 THEORETICAL COMPUTER SCIENCE A
, 2005
Cited by 3 (2 self)
We present new and faster algorithms to search for a 2-dimensional pattern in a 2-dimensional text allowing any rotation of the pattern. This has applications such as image databases and computational biology. We consider the cases of exact and approximate matching under several matching models, using a combinatorial approach that generalizes string matching techniques. We focus on sequential algorithms, where only the pattern can be preprocessed, as well as on indexed algorithms, where the text is preprocessed and an index built on it. On sequential searching we derive average-case lower bounds and then obtain optimal average-case algorithms for all the matching models. At the same time, these algorithms are worst-case optimal. On indexed searching we obtain search time polylogarithmic on the text size, as well as sublinear time in general for approximate searching.