Results 1-10 of 38
A Guided Tour to Approximate String Matching
ACM Computing Surveys, 1999
Cited by 404 (38 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
Reducing the Space Requirement of Suffix Trees
Software – Practice and Experience, 1999
Cited by 118 (10 self)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Faster Approximate String Matching
Algorithmica, 1999
Cited by 72 (24 self)
We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω(log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e. whenever mk = O(log n)), where m is the pattern length and k < m the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O((mk/w) n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps and others, at essentially the same search cost. We then explore other novel techniques t...
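The bit-parallel automaton simulation mentioned in this abstract can be illustrated with a short sketch. The code below follows the row-wise Wu and Manber style of simulation that the abstract names as its baseline, not the paper's own state packing; the function name `wu_manber_search` and the use of Python's unbounded integers in place of fixed-width machine words are assumptions made for illustration.

```python
def wu_manber_search(pattern, text, k):
    """Return 1-based end positions where pattern matches with at most k errors.

    Row-wise bit-parallel NFA simulation (Wu-Manber style sketch).
    Bit i of R[d] is set when pattern[:i+1] matches a suffix of the
    text read so far with at most d errors.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1          # keep only the m automaton states
    accept = 1 << (m - 1)        # final state of each row
    masks = {}                   # masks[c] has bit i set when pattern[i] == c
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # Row d starts with its first d states active (d pattern characters skipped).
    R = [((1 << d) - 1) & full for d in range(k + 1)]
    ends = []
    for j, ch in enumerate(text, start=1):
        b = masks.get(ch, 0)
        prev_old = R[0]                       # old value of row d-1
        R[0] = ((R[0] << 1) | 1) & b          # exact-match row
        for d in range(1, k + 1):
            old = R[d]
            match = ((old << 1) | 1) & b      # same row, characters agree
            insert = prev_old                 # extra character in the text
            # substitution (old row d-1) and deletion (new row d-1), both shifted
            subst_or_delete = ((prev_old | R[d - 1]) << 1) | 1
            R[d] = (match | insert | subst_or_delete) & full
            prev_old = old
        if R[k] & accept:
            ends.append(j)
    return ends
```

Each text character costs k + 1 row updates, which matches the O(kn) behaviour the abstract attributes to Wu and Manber when a row fits in one machine word.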
q-gram based database searching using a suffix array (QUASAR)
Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99), 1999
Cited by 63 (6 self)
With the increasing amount of DNA sequence information deposited in public databases, searching for similarity to a query sequence has become a basic operation in molecular biology. But even today’s fast algorithms reach their limits when applied to all-versus-all comparisons of large databases. Here we present a new database searching algorithm called QUASAR (Q-gram Alignment based on Suffix ARrays) which was designed to quickly detect sequences with strong similarity to the query in a context where many searches are conducted on one database. Our algorithm applies a modification of q-tuple filtering implemented on top of a suffix array. Two versions were developed, one for a RAM resident suffix array and one for access to the suffix array on disk. We compared our implementation with BLAST and found that our approach is an order of magnitude faster. It is, however, restricted to the search for strongly similar DNA sequences as is typically required, e.g., in the context of clustering expressed sequence tags (ESTs).
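The idea of q-gram filtering on top of a suffix array can be sketched roughly as follows. This is not the QUASAR implementation: the naive suffix array construction, the fixed non-overlapping blocks, and the names `build_suffix_array` and `qgram_filter` are assumptions chosen to keep the example short. The threshold is left to the caller; a common choice comes from the q-gram lemma, which says that two strings of length w within k differences share at least w + 1 - (k + 1)q q-grams.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def build_suffix_array(db):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(db)), key=lambda i: db[i:])


def qgram_filter(db, query, q, block_size, threshold):
    """Return indices of database blocks with at least `threshold` q-gram hits.

    Each q-gram of the query is located in the suffix array by binary
    search, and every occurrence increments the counter of the block
    that contains its start position.
    """
    sa = build_suffix_array(db)
    # The q-character prefixes of the sorted suffixes are themselves sorted,
    # so all occurrences of a q-gram form one contiguous range.
    prefixes = [db[i:i + q] for i in sa]
    hits = Counter()
    for s in range(len(query) - q + 1):
        gram = query[s:s + q]
        lo = bisect_left(prefixes, gram)
        hi = bisect_right(prefixes, gram)
        for pos in sa[lo:hi]:
            hits[pos // block_size] += 1
    return sorted(block for block, n in hits.items() if n >= threshold)


# Hypothetical toy database: only the middle block resembles the query.
db = "AAAAAAAA" + "ACGTACGT" + "CCCCCCCC"
query = "ACGTACGA"
q, k = 3, 1
threshold = len(query) + 1 - (k + 1) * q   # q-gram lemma bound
candidates = qgram_filter(db, query, q, block_size=8, threshold=threshold)
```

Blocks that pass the filter would still need to be checked by a real alignment step; the filter only discards regions that cannot contain a strong match.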
Algorithms and Complexity for Annotated Sequence Analysis
1999
Cited by 42 (1 self)
Molecular biologists use algorithms that compare and otherwise analyze sequences that represent genetic and protein molecules. Most of these algorithms, however, operate on the basic sequence and do not incorporate the additional information that is often known about the molecule and its pieces. This research describes schemes to combinatorially annotate this information onto sequences so that it can be analyzed in tandem with the sequence; the overall result would thus reflect both types of information about the sequence. These annotation schemes include adding colours and arcs to the sequence. Colouring a sequence would produce a same-length sequence of colours or other symbols that highlight or label parts of the sequence. Arcs can be used to link sequence symbols (or coloured substrings) to indicate molecular bonds or other relationships. Adding these annotations to sequence analysis problems such as sequence alignment or finding the longest common subsequence can make the problem more complex, often depending on the complexity of the annotation scheme. This research examines the different annotation schemes and the corresponding problems of verifying annotations, creating annotations, and finding the longest common subsequence of pairs of sequences with annotations. This work involves both the conventional complexity framework and parameterized complexity, and includes algorithms and hardness results for both frameworks. Automata and transducers are created for some annotation verification and creation problems. Different restrictions on layered substring and arc annotation are considered to determine what properties an annotation scheme must have to make its incorporation feasible. Extensions to the algorithms that use weighting schemes are explored. Examin...
A Comparison of Approximate String Matching Algorithms
1991
Cited by 37 (0 self)
Experimental comparison of the running time of approximate string matching algorithms for the k differences problem is presented. Given a pattern string, a text string, and integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable. KEY WORDS: String matching; Edit distance; k differences problem
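The k differences problem stated in this abstract has a classical dynamic programming solution, sketched below as a baseline. It is not taken from the paper's code; the function name `k_differences_search` is hypothetical, and the sketch keeps a single column of the dynamic programming table, so it runs in O(mn) time for a pattern of length m and a text of length n.

```python
def k_differences_search(pattern, text, k):
    """Return 1-based end positions of occurrences with at most k differences.

    col[i] holds the smallest number of insertions, deletions and changes
    needed to turn pattern[:i] into some substring of the text that ends
    at the current position. col[0] stays 0 because an occurrence may
    start anywhere in the text.
    """
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        diag = col[0]                      # value of col[i-1] in the previous column
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(diag + cost,      # match or change
                         old + 1,          # extra text character
                         col[i - 1] + 1)   # missing pattern character
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

With k = 0 the function reports exactly the end positions of exact occurrences, which is a convenient sanity check.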
CubyHum: A Fully Operational Query by Humming System
ISMIR 2002 Conference Proceedings, 2002
Cited by 33 (1 self)
'Query by humming' is an interaction concept in which the identity of a song has to be revealed fast and orderly from a given sung input using a large database of known melodies. In short, it tries to detect the pitches in a sung melody and compares these pitches with symbolic representations of the known melodies. Melodies that are similar to the sung pitches are retrieved. Approximate pattern matching in the melody comparison process compensates for the errors in the sung melody by using classical dynamic programming. A filtering method is used to save computation in the dynamic programming framework. This paper presents the algorithms for pitch detection, note onset detection, quantization, melody encoding and approximate pattern matching as they have been implemented in the CubyHum software system. Since human reproduction of melodies is imperfect, findings from an experimental singing study were a crucial input to the development of the algorithms. Future research should pay special attention to the reliable detection of note onsets in any preferred singing style. In addition, research on index methods and fast bit-parallelism algorithms for approximate pattern matching needs to be further pursued to decrease computational requirements when dealing with large melody databases.
A Survey of Music Information Retrieval Systems
In ISMIR, 2005
Cited by 33 (4 self)
This survey paper provides an overview of content-based music information retrieval systems, both for audio and for symbolic music notation. Matching algorithms and indexing methods are briefly presented. The need for a TREC-like comparison of matching algorithms such as MIREX at ISMIR becomes clear from the high number of quite different methods which so far only have been used on different data collections. We placed the systems on a map showing the tasks and users for which they are suitable, and we find that existing content-based retrieval systems fail to cover a gap between the very general and the very specific retrieval tasks.
Multiple Approximate String Matching
In Proc. of WADS'97, LNCS 1272, 1997
Cited by 19 (9 self)
We present two new algorithms for online multiple approximate string matching. These are extensions of previous algorithms that search for a single pattern. The single-pattern version of the first one is based on the simulation with bits of a nondeterministic finite automaton built from the pattern and using the text as input. To search for multiple patterns, we superimpose their automata, using the result as a filter. The second algorithm partitions the pattern in subpatterns that are searched with no errors, with a fast exact multipattern search algorithm. To handle multiple patterns, we search the subpatterns of all of them together. The average running time achieved is in both cases O(n) for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when each algorithm should be used, and show experimentally that they are faster ...
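The partitioning idea behind the second algorithm can be sketched for a single pattern. If the pattern is cut into k + 1 pieces, any occurrence with at most k errors must contain at least one piece unchanged, so exact hits of the pieces mark the only text areas worth verifying. The sketch below is an illustration under assumptions, not the paper's method: it uses Python's `str.find` instead of a fast exact multipattern search, verifies candidates with plain dynamic programming, and the names `partition_search`, `split_pieces` and `_ends_within_k` are hypothetical.

```python
def _ends_within_k(pattern, window, k):
    """1-based end positions in window where pattern matches with <= k errors."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(window, start=1):
        diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != c), old + 1, col[i - 1] + 1)
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def split_pieces(pattern, k):
    """Cut pattern into k + 1 non-empty pieces; return (offset, piece) pairs."""
    m = len(pattern)
    bounds = [i * m // (k + 1) for i in range(k + 2)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(k + 1)]


def partition_search(pattern, text, k):
    """Filter by exact piece search, then verify only the candidate areas."""
    m = len(pattern)
    if k >= m:
        raise ValueError("requires k < len(pattern) so every piece is non-empty")
    ends = set()
    for offset, piece in split_pieces(pattern, k):
        start = text.find(piece)
        while start != -1:
            # An occurrence containing this piece lies inside this window.
            lo = max(0, start - offset - k)
            hi = min(len(text), start - offset + m + k)
            for e in _ends_within_k(pattern, text[lo:hi], k):
                ends.add(lo + e)
            start = text.find(piece, start + 1)
    return sorted(ends)
```

For several patterns, the pieces of all patterns would be searched together in one pass, as the abstract describes; the verification step stays the same.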
Average-optimal single and multiple approximate string matching
ACM Journal of Experimental Algorithmics (JEA)
Cited by 19 (11 self)
We present a new algorithm for multiple approximate string matching. It is based on reading backwards enough ℓ-grams from text windows so as to prove that no occurrence can contain the part of the window read, and then shifting the window. Three variants of the algorithm are presented, which give different tradeoffs between how much they work in the window and how much they shift it. We show analytically that two of our algorithms are optimal on average. Compared to the first average-optimal multipattern approximate string matching algorithm [Fredriksson and Navarro, CPM 2003], the new algorithms are much faster and are optimal up to difference ratios of 1/2, contrary to the maximum of 1/3 that could be reached in previous work. This is also a contribution to the area of single-pattern approximate string matching, as the only average-optimal algorithm [Chang and Marr, CPM 1994] also reached a difference ratio of 1/3. We show experimentally that our algorithms are very competitive, displacing the long-standing best algorithms