Reducing the Space Requirement of Suffix Trees
 Software – Practice and Experience
, 1999
Cited by 118 (10 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Faster Approximate String Matching
 Algorithmica
, 1999
Cited by 72 (24 self)
We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω(log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e. whenever mk = O(log n)), where m is the pattern length and k < m the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk/w · n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps and others, at essentially the same search cost. We then explore other novel techniques t...
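To make the bit-parallel idea concrete, here is a minimal sketch of the row-wise scheme attributed above to Wu and Manber, which costs O(kn) for short patterns. It is not the improved state packing this paper proposes. The function name `wu_manber_search` is an assumption made for illustration, and Python integers stand in for machine words, so the word-length limit w does not show up here.

```python
def wu_manber_search(pattern, text, k):
    """Return 0-based end positions in text where pattern matches with <= k errors."""
    m = len(pattern)
    if m == 0 or not 0 <= k < m:
        raise ValueError("need a non-empty pattern and 0 <= k < len(pattern)")

    # masks[ch] has bit i set iff pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    full = (1 << m) - 1       # keeps every row within m bits
    accept = 1 << (m - 1)     # bit of the final automaton state

    # rows[d] has bit i set iff pattern[0..i] matches a suffix of the text
    # read so far with at most d errors. Initially only deletions apply.
    rows = [(1 << d) - 1 for d in range(k + 1)]

    hits = []
    for j, ch in enumerate(text):
        mask = masks.get(ch, 0)
        old_below = rows[0]                      # row d-1 before this step
        rows[0] = ((rows[0] << 1) | 1) & mask    # exact Shift-And row
        for d in range(1, k + 1):
            old_here = rows[d]
            match = ((old_here << 1) | 1) & mask
            insertion = old_below                # text character is extra
            # substitution uses the old row d-1, deletion the new row d-1
            subst_or_del = ((old_below | rows[d - 1]) << 1) | 1
            rows[d] = (match | insertion | subst_or_del) & full
            old_below = old_here
        if rows[k] & accept:
            hits.append(j)
    return hits
```

For example, `wu_manber_search("abc", "xxabcxx", 1)` reports the end positions 3, 4 and 5, since "ab", "abc" and "abcx" are each within one error of the pattern.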
A Comparison of Imperative and Purely Functional Suffix Tree Constructions
 Science of Computer Programming
, 1995
Cited by 19 (7 self)
We explore the design space of implementing suffix tree algorithms in the functional paradigm. We review the linear time and space algorithms of McCreight and Ukkonen. Based on a new terminology of nested suffixes and nested prefixes, we give a simpler and more declarative explanation of these algorithms than was previously known. We design two "naive" versions of these algorithms which are not linear time, but use simpler data structures, and can be implemented in a purely functional style. Furthermore, we present a new, "lazy" suffix tree construction which is even simpler. We evaluate both imperative and functional implementations of these algorithms. Our results show that the naive algorithms perform very favourably, and in particular, the lazy construction compares very well to all the others. 1 Introduction Suffix trees are the method of choice when a large sequence of symbols, the "text", is to be searched frequently for occurrences of short sequences, the "patterns". Given tha...
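As a rough illustration of the search use case described above, the following sketch builds an uncompressed suffix trie from nested dictionaries and answers substring queries. It is deliberately naive (quadratic time and space) and is neither the McCreight nor the Ukkonen construction, nor the paper's lazy functional version. The names `build_suffix_trie` and `contains` are assumptions made for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Create the child for ch on first use, then descend into it.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True iff pattern occurs in the indexed text (time O(len(pattern)))."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Once the trie for a text such as "banana" is built, each query like `contains(trie, "nan")` only walks the pattern, which is why the index pays off when the same text is searched many times.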
Improving an Algorithm for Approximate Pattern Matching
 Algorithmica
, 1998
Cited by 16 (8 self)
We study a recent algorithm for fast online approximate string matching. This is the problem of searching a pattern in a text allowing errors in the pattern or in the text. The algorithm is based on a very fast kernel which is able to search short patterns using a nondeterministic finite automaton, which is simulated using bit-parallelism. A number of techniques to extend this kernel for longer patterns are presented in that work. However, the techniques can be integrated in many ways and the optimal interplay among them is by no means obvious. The solution to this problem starts at a very low level, by obtaining basic probabilistic information about the problem which was not previously known, and ends integrating analytical results with empirical data to obtain the optimal heuristic. The conclusions obtained via analysis are experimentally confirmed. We also improve many of the techniques and obtain a combined heuristic which is faster than the original work. This work sho...
An Approach to Identify Duplicated Web Pages
, 2002
Cited by 15 (2 self)
A relevant consequence of the unceasing expansion of the Web and e-commerce is the growth of the demand of new Web sites and Web applications. As a result, Web sites and applications are usually developed without a formalized process, but Web pages are directly coded in an incremental way, where new pages are obtained by duplicating existing ones. Duplicated Web pages, having the same structure and just differing for the data they include, can be considered as clones. The identification of clones may reduce the effort devoted to test, maintain and evolve Web sites and applications. Moreover, clone detection among different Web sites aims to detect cases of possible plagiarism.
Approximate string searching under weighted edit distance
 In Proceedings of the 3rd South American Workshop on String Processing (WSP ’96). Carleton Univ
, 1996
Cited by 11 (1 self)
Abstract. Let p ∈ Σ* be a string of length m and t ∈ Σ* be a string of length n. The approximate string searching problem is to find all approximate matches of p in t having weighted edit distance at most k from p. We present a new method that preprocesses the pattern into a DFA which scans t online in linear time, thereby recognizing all positions in t where an approximate match ends. We show how to reduce the exponential preprocessing effort and propose two practical algorithms. The first algorithm constructs the states of the DFA up to a certain depth r ≥ 1. It runs in O(|Σ|^(r+1) · m + q · m + n) time and O(|Σ|^(r+1) + |Σ|^r · m) space where q ≤ n decreases as r increases. The second algorithm constructs the transitions of the DFA when they are demanded. It runs in O(q_s · |Σ| + q_t · m + n) time and O(q_s · (|Σ| + m)) space where q_s ≤ q_t ≤ n depend on the problem instance. Practical measurements show that our algorithms work well in practice and beat previous methods for problems of interest in molecular biology.
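The problem statement above can be sketched with the classical column-wise dynamic programming scan, which takes O(mn) time. This is the baseline that a DFA-based method would aim to beat, not the DFA construction from the paper. The function name `weighted_search`, the `sub_cost` callback and the single `indel_cost` parameter are assumptions made for illustration.

```python
def weighted_search(pattern, text, k, sub_cost=None, indel_cost=1):
    """Return 0-based end positions in text where some substring has
    weighted edit distance <= k from pattern."""
    if sub_cost is None:
        # Default to unit costs: 0 for equal characters, 1 otherwise.
        sub_cost = lambda a, b: 0 if a == b else 1
    m = len(pattern)

    # col[i] = cheapest way to align pattern[:i] with a suffix of the text
    # read so far. col[0] stays 0 because a match may start anywhere.
    col = [i * indel_cost for i in range(m + 1)]
    hits = []
    for j, ch in enumerate(text):
        new = [0]
        for i in range(1, m + 1):
            new.append(min(
                col[i - 1] + sub_cost(pattern[i - 1], ch),  # match/substitute
                col[i] + indel_cost,                        # extra text char
                new[i - 1] + indel_cost,                    # skipped pattern char
            ))
        col = new
        if col[m] <= k:
            hits.append(j)
    return hits
```

With unit costs, `weighted_search("abc", "abd", 1)` reports positions 1 and 2. Raising the substitution cost to 2 leaves only position 1, because replacing "c" by "d" then exceeds the budget.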
Algebraic dynamic programming
 Algebraic Methodology And Software Technology, 9th International Conference, AMAST 2002
, 2002
Cited by 9 (5 self)
Abstract. Dynamic programming is a classic programming technique, applicable in a wide variety of domains, like stochastic systems analysis, operations research, combinatorics of discrete structures, flow problems, parsing with ambiguous grammars, or biosequence analysis. Yet, no methodology is available for designing such algorithms. The matrix recurrences that typically describe a dynamic programming algorithm are difficult to construct, error-prone to implement, and almost impossible to debug. This article introduces an algebraic style of dynamic programming over sequence data. We define the formal framework including a formalization of Bellman’s principle, specify an executable specification language, and show how algorithm design decisions and tuning for efficiency can be described on a convenient level of abstraction.
A Partial Deterministic Automaton for Approximate String Matching
 In Proc. of Fourth South American Workshop on String Processing (WSP'97)
, 1997
Cited by 7 (4 self)
One of the simplest approaches to approximate string matching is to consider the associated nondeterministic finite automaton and make it deterministic. Besides automaton generation, the search time is O(n) in the worst case, where n is the text size. This solution is mentioned in the classical literature but has not been further pursued, due to the large number of automaton states that may be generated. We study the idea of generating the deterministic automaton on the fly. That is, we only generate the states that are actually reached when the text is traversed. We show that this limits drastically the number of states actually generated. Moreover, the algorithm is competitive, being the fastest one for intermediate error ratios and pattern lengths. 1 Introduction Approximate string matching is one of the main problems in classical string algorithms, with applications to text searching, computational biology, pattern recognition, etc. The problem is defined as follows: given a t...
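The on-the-fly idea can be sketched as follows, under the assumption that each deterministic state is represented by a dynamic programming column with values capped at k + 1. The paper's own state encoding may differ. Transitions are computed only when the text first requires them and are cached afterwards. The name `lazy_dfa_search` is hypothetical.

```python
def lazy_dfa_search(pattern, text, k):
    """Return (end positions of matches with <= k errors, number of DFA states built)."""
    m = len(pattern)
    cap = k + 1  # all values above k behave identically, so clamp them

    def step(state, ch):
        # One column update of the edit-distance recurrence, clamped at cap.
        new = [0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new.append(min(state[i - 1] + cost, state[i] + 1, new[i - 1] + 1, cap))
        return tuple(new)

    start = tuple(min(i, cap) for i in range(m + 1))
    transitions = {}  # (state, character) -> next state, filled on demand
    state = start
    hits = []
    for j, ch in enumerate(text):
        key = (state, ch)
        nxt = transitions.get(key)
        if nxt is None:
            nxt = step(state, ch)   # build this transition only now
            transitions[key] = nxt
        state = nxt
        if state[m] <= k:
            hits.append(j)
    reached = {start} | set(transitions.values())
    return hits, len(reached)
```

On the text "xxabcxx" with pattern "abc" and k = 1, only five states are ever materialised, which illustrates why generating states lazily can stay far below the size of the full deterministic automaton.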
An Approach to Identify Duplicated Web Pages
A relevant consequence of the unceasing expansion of the Web and e-commerce is the growth of the demand of new Web sites and Web applications. As a result, Web sites and applications are usually developed without a formalized process, but Web pages are directly coded in an incremental way, where new pages are obtained by duplicating existing ones. Duplicated Web pages, having the same structure and just differing for the data they include, can be considered as clones. The identification of clones may reduce the effort devoted to test, maintain and evolve Web sites and applications. Moreover, clone detection among different Web sites aims to detect cases of possible plagiarism. In this paper we propose an approach, based on similarity metrics, to detect duplicated pages in Web sites and applications, implemented with HTML language and ASP technology. The proposed approach has been assessed by analyzing several Web sites and Web applications. The obtained results are reported in the paper with respect to some case studies.
A General Technique to Improve Filter Algorithms for Approximate String Matching
 Universitat
, 1997
Approximate string matching searches for occurrences of a pattern in a text, where a certain number of character differences (errors) is allowed. Fast methods use filters: A fast preprocessing phase determines regions of the text where a match cannot occur; only the remaining text regions must be scrutinized by the slower approximate matching algorithm. Such filters can be very effective, but they (naturally) degrade at a critical error threshold. We introduce a general technique to improve the efficiency of filters and hence to push out further this critical threshold value. Our technique intermittently reevaluates the possibility of a match in a given region. It combines precise information about the region already scanned with filtering information about the region yet to be searched. We apply this technique to four approximate string matching algorithms published by Chang & Lawler and Sutinen & Tarhio. 1 Introduction The problem of approximate string matching is st...