A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
Cited by 523
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast-growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
Learning String Edit Distance
, 1997
Cited by 236
In many applications, it is necessary to determine the similarity of two strings. A widely used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string edit distance. Our stochastic model allows us to learn a string edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string edit distance with nearly one fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
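The untrained Levenshtein distance that this abstract uses as its baseline can be sketched with the textbook dynamic program below. This is only an illustration of the definition quoted above (unit cost for each insertion, deletion, and substitution), not the learned stochastic model from the report; the function name `levenshtein` is an assumption made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string is i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb, or match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so the sketch uses O(min-row) memory while still taking time proportional to the product of the two string lengths.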
A fast bitvector algorithm for approximate string matching based on dynamic programming
 J. ACM
, 1999
Cited by 158
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k or fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log σ/w) time, where w is the word size of the machine (e.g., 32 or 64 in practice) and σ is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm’s performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. [1996]. This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, which computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, which computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
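A minimal sketch of the bit-vector idea follows, assuming the commonly presented formulation in which each column of the dynamic programming matrix is encoded by two bit masks of vertical differences (+1 and -1). The names `bitvector_search`, `pv`, and `mv` are choices made for this example, and Python's arbitrary-precision integers stand in for a single machine word, so the sketch shows the recurrence rather than the O(nm/w) blocking described in the abstract.

```python
def bitvector_search(pattern: str, text: str, k: int) -> list[int]:
    """Return end positions j in text where pattern matches with at most k differences."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1      # keep only the m bits that represent pattern rows
    high = 1 << (m - 1)      # bit of the last pattern row
    # peq[c] has bit i set when pattern[i] == c.
    peq: dict[str, int] = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0         # first column: every vertical difference is +1
    score = m                # value in the last row of the current column
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal differences of +1
        mh = pv & xh                    # horizontal differences of -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift in 0: row 0 is all zeros because a match may start anywhere.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            hits.append(j)
    return hits


print(bitvector_search("abc", "xxabcxx", 0))
```

Each text character costs a constant number of word operations regardless of k, which is the property the abstract highlights.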
Local similarity in RNA secondary structures
, 2003
Cited by 81
We present a systematic treatment of alignment distance and local similarity algorithms on trees and forests. We build upon the tree alignment algorithm for ordered trees given by Jiang et al. (1995) and extend it to calculate local forest alignments, which is essential for finding locally similar regions in RNA secondary structures. The time complexity of our algorithm is O(|F1| · |F2| · deg(F1) · deg(F2) · (deg(F1) + deg(F2))), where |Fi| is the number of nodes in forest Fi and deg(Fi) is the degree of Fi. We provide carefully engineered dynamic programming implementations using dense, two-dimensional tables, which considerably reduce the space requirement. We suggest a new representation of RNA secondary structures as forests that allows reasonable scoring of edit operations on RNA secondary structures. The comparison of RNA secondary structures is facilitated by a new visualization technique for RNA secondary structure alignments. Finally, we show how potential regulatory motifs can be discovered solely by their structural preservation, and independent of their sequence conservation and position.
Secure and Private Sequence Comparisons
 In WPES’03: Proceedings of the 2003 ACM workshop on Privacy in the electronic society
, 2003
Cited by 44
We give an efficient protocol for sequence comparisons of the edit-distance kind, such that neither party reveals anything about their private sequence to the other party (other than what can be inferred from the edit distance between their two sequences, which is unavoidable because computing that distance is the purpose of the protocol). The amount of communication done by our protocol is proportional to the time complexity of the best-known algorithm for performing the sequence comparison.
SECURE OUTSOURCING OF SEQUENCE COMPARISONS
Cited by 36
Large-scale problems in the physical and life sciences are being revolutionized by Internet computing technologies, like grid computing, that make possible the massive cooperative sharing of computational power, bandwidth, storage, and data. A weak computational device, once connected to such a grid, is no longer limited by its slow speed, small amounts of local storage, and limited bandwidth: It can avail itself of the abundance of these resources that is available elsewhere on the network. An impediment to the use of “computational outsourcing” is that the data in question is often sensitive, e.g., of national security importance, or proprietary and containing commercial secrets, or to be kept private for legal requirements such as the HIPAA legislation, Gramm-Leach-Bliley, or similar laws. This motivates the design of techniques for computational outsourcing in a privacy-preserving manner, i.e., without revealing to the remote agents whose computational power is being used, either one’s data or the outcome of the computation on the data. This paper investigates such secure outsourcing for widely applicable sequence comparison problems, and gives an efficient protocol for a
Differencing and Merging Architectural Views
 Automated Software Engineering Journal
Cited by 36
As architecture-based techniques become more widely adopted, software architects face the problem of reconciling different versions of architectural models. However, existing approaches to differencing and merging architectural views are based on restrictive assumptions, such as requiring view elements to have unique identifiers or explicitly log changes between versions. To overcome some of the above limitations, we propose differencing and merging architectural views based on structural information. To that effect, we generalize a published polynomial-time tree-to-tree correction algorithm (that detects inserts, renames and deletes) into a novel algorithm to additionally detect restricted moves and support forcing and preventing matches between view elements. We implement a set of tools to compare and merge component-and-connector (C&C) architectural views, incorporating the algorithm. Finally, we provide an empirical evaluation of the algorithm and the tools on case studies with real software, illustrating the practicality of the approach to find and reconcile interesting divergences between architectural views.
On Using Guidance in Relational Reinforcement Learning
 Machine Learning
, 2004
Cited by 34
Reinforcement learning, and Q-learning in particular, encounter two major problems when dealing with large state spaces. First, learning the Q-function in tabular form may be infeasible because of the excessive amount of memory needed to store the table and because the Q-function only converges after each state has been visited multiple times. Second, rewards in the state space may be so sparse that with random exploration they will only be discovered extremely slowly. The first problem is often solved by learning a generalization of the encountered examples (e.g., using a neural net or decision tree). Relational reinforcement learning (RRL) is such an approach; it makes Q-learning feasible in structural domains by incorporating a relational learner into Q-learning. To solve the second problem, the use of "reasonable policies" to provide guidance has been suggested. In this paper we investigate the best ways to provide guidance in two different domains.
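For readers unfamiliar with the tabular form mentioned above, the standard Q-learning update can be sketched as follows. This is the generic textbook rule, not the relational learner studied in the paper; the function name `q_update`, the state and action labels, and the default values of `alpha` (learning rate) and `gamma` (discount factor) are assumptions made for illustration.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Value of the best action available in the successor state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Move the current estimate toward the one-step bootstrapped target.
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-state example; the table holds one entry per (state, action).
q = defaultdict(float)
actions = ["left", "right"]
q_update(q, "s0", "right", 1.0, "s1", actions)
q_update(q, "s1", "left", 0.0, "s0", actions)
print(dict(q))
```

The table grows with the number of distinct state-action pairs, which is the memory problem the abstract describes for large state spaces.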
A Theory of Multiple Classifier Systems And Its Application to Visual Word Recognition
, 1992
"... Despite the success of many pattern recognition systems in constrained domains, problems that involve noisy input and many classes remain difficult. A promising direction is to use several classifiers simultaneously, such that they can complement each other in correctness. This thesis is concerned w ..."
Cited by 33
Object count/Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms
 International Journal on Document Analysis and Recognition
Cited by 30
Evaluation of object detection algorithms is a nontrivial task: a detection result is usually evaluated by comparing the bounding box of the detected object with the bounding box of the ground truth object. The commonly used precision and recall measures are computed from the overlap area of these two rectangles. However, these measures have several drawbacks: they don’t give intuitive information about the proportion of the correctly detected objects and the number of false alarms, and they cannot be accumulated across multiple images without creating ambiguity in their interpretation. Furthermore, quantitative and qualitative evaluation is often mixed, resulting in ambiguous measures. In this paper we propose a new approach which tackles these problems. The performance of a detection algorithm is illustrated intuitively by performance graphs which present object-level precision and recall depending on constraints on detection quality. In order to compare different detection algorithms, a representative single performance value is computed from the graphs. The influence of the test database on the detection performance is illustrated by performance/generality graphs. The evaluation method can be applied to different types of object detection algorithms. It has been tested on different text detection algorithms, among which are the participants of the ICDAR 2003 text detection competition.
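The overlap-based measures that the abstract takes as its starting point can be sketched as below, assuming the usual area definitions: precision is the overlap area divided by the detected box's area, and recall is the overlap area divided by the ground truth box's area. The box layout `(x_min, y_min, x_max, y_max)` and the function names are assumptions for this example; the paper's own graph-based measures are not reproduced here.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def area(box: Box) -> float:
    """Area of an axis-aligned rectangle."""
    return (box[2] - box[0]) * (box[3] - box[1])


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned rectangles (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def area_precision_recall(detected: Box, truth: Box) -> tuple[float, float]:
    """Return (precision, recall) computed from the overlap of the two boxes."""
    inter = overlap_area(detected, truth)
    return inter / area(detected), inter / area(truth)


# Detected box shifted right by half the width of the ground truth box.
print(area_precision_recall((5, 0, 15, 10), (0, 0, 10, 10)))
```

A pair such as (0.5, 0.5) does not say whether one object was half found or two objects were each partly found, which illustrates the ambiguity the abstract criticizes.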