Approximate Nearest Neighbors and Sequence Comparison With Block Operations
- In STOC, 2000
Cited by 36 (6 self)
We study sequence nearest neighbors (SNN). Let D be a database of n sequences; we would like to preprocess D so that given any on-line query sequence Q we can quickly find a sequence S in D for which d(S, Q) ≤ d(T, Q) for any other sequence T in D. Here d(S, Q) denotes the distance between sequences S and Q, defined to be the minimum number of edit operations needed to transform one to another (all edit operations will be reversible so that d(S, T) = d(T, S) for any two sequences T and S). These operations correspond to the notion of similarity between sequences that we wish to capture in a given application. Natural edit operations include character edits (inserts, replacements, deletes, etc.), block edits (moves, copies, deletes, reversals) and block numerical transformations (scaling by an additive or a multiplicative constant). The SNN problem arises in many applications. We present the first known efficient algorithm for "approximate" nearest neighbor search for sequences with p...
Incremental String Comparison
- SIAM Journal on Computing, 1995
Cited by 34 (3 self)
The problem of comparing two sequences A and B to determine their LCS or the edit distance between them has been much studied. In this paper we consider the following incremental version of these problems: given an appropriate encoding of a comparison between A and B, can one incrementally compute the answer for A and bB, and the answer for A and Bb with equal efficiency, where b is an additional symbol? Our main result is a theorem exposing a surprising relationship between the dynamic programming solutions for two such "adjacent" problems. Given a threshold k on the number of differences to be permitted in an alignment, the theorem leads directly to an O(k) algorithm for incrementally computing a new solution from an old one, in contrast to the O(k²) time required to compute a solution from scratch. We further show with a series of applications that this algorithm is indeed more powerful than its non-incremental counterpart by solving the applications with greater asymptotic ef...
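The incremental question in this abstract can be illustrated with the textbook dynamic program. The sketch below is not the paper's O(k) algorithm; it only shows why appending a symbol (A versus Bb) is easy for the plain DP, since one new row is computed from the previous one, whereas prepending (A versus bB) would force a full recomputation with this naive scheme. The function names `extend_row` and `dp_row` are hypothetical, and unit-cost insert, delete, and replace operations are assumed.

```python
def extend_row(a, prev_row, ch):
    """Given prev_row[j] = edit distance between a[:j] and B,
    return the corresponding row for B + ch in O(len(a)) time."""
    row = [prev_row[0] + 1]  # a[:0] versus B + ch: one more insertion
    for j in range(1, len(a) + 1):
        cost = 0 if a[j - 1] == ch else 1
        row.append(min(prev_row[j] + 1,          # delete ch
                       row[j - 1] + 1,           # insert a[j-1]
                       prev_row[j - 1] + cost))  # match or replace
    return row


def dp_row(a, b):
    """Last DP row for a versus b; its final entry is the edit distance."""
    row = list(range(len(a) + 1))  # row for the empty prefix of b
    for ch in b:
        row = extend_row(a, row, ch)
    return row


if __name__ == "__main__":
    old = dp_row("kitten", "sittin")
    new = extend_row("kitten", old, "g")  # incremental step for Bb
    print(new[-1])
```

Here the incremental step costs time proportional to the length of A; the abstract's result concerns doing better than that under a difference threshold k, and handling the bB case as efficiently as the Bb case.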
Efficient similarity joins for near duplicate detection
- In WWW, 2008
Cited by 32 (5 self)
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show our proposed algorithms can outperform previous algorithms on several real datasets.
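The prefix filtering principle mentioned in this abstract can be sketched as follows. This is only the baseline idea that the abstract attributes to existing algorithms, not the paper's new filtering techniques. Assumptions made here for illustration: records are token sets, similarity is Jaccard, tokens are ordered by increasing document frequency, and the names `prefix_filter_join` and `brute_force_join` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)


def prefix_filter_join(records, t):
    """Return index pairs (i, j), i < j, with Jaccard similarity >= t."""
    freq = Counter(tok for r in records for tok in set(r))
    # Canonical form: distinct tokens sorted by one global order (rare first).
    canon = [sorted(set(r), key=lambda tok: (freq[tok], tok)) for r in records]
    index = defaultdict(list)  # prefix token -> ids of earlier records
    result = set()
    for i, x in enumerate(canon):
        if not x:
            continue
        # Two records with similarity >= t must share a token within their
        # prefixes of length |x| - ceil(t * |x|) + 1 (epsilon guards rounding).
        plen = len(x) - math.ceil(t * len(x) - 1e-9) + 1
        candidates = set()
        for tok in x[:plen]:
            candidates.update(index[tok])
            index[tok].append(i)
        for j in candidates:  # verify only the surviving candidates
            if jaccard(x, canon[j]) >= t:
                result.add((j, i))
    return result


def brute_force_join(records, t):
    """All-pairs reference used only to cross-check the filtered join."""
    n = len(records)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if records[i] and records[j]
            and jaccard(records[i], records[j]) >= t}


if __name__ == "__main__":
    data = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
    print(prefix_filter_join(data, 0.5))
```

The filter only prunes candidate pairs; every reported pair is still verified with an exact similarity computation, so the result should agree with the brute-force join.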
Pink Panther: A Complete Environment for Ground-Truthing and Benchmarking . . .
- 1998
Cited by 31 (0 self)
We describe a new approach for the automatic evaluation of document page segmentation algorithms. Unlike techniques that rely on OCR output, our method is region-based: segmentation quality is assessed by comparing the segmentation output, described as a set of regions, to the corresponding ground-truth. Error maps are used to keep track of all the errors associated with each pixel, regardless of the document complexity. Misclassifications, splitting, and merging of regions are among the errors detected by the system. Each error can be weighted individually and the system can be customized to benchmark virtually any type of segmentation task. © 1998 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. Keywords: Document Page Segmentation Benchmarking Ground-truth OCR Recognition
Faster Bit-parallel Approximate String Matching
- In Proc. 13th Combinatorial Pattern Matching (CPM'2002), LNCS 2373, 2002
Cited by 26 (18 self)
We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one [Myers, J. of the ACM, 1999] searches for a pattern of length m in a text of length n permitting k differences in O(mn/w) time, where w is the width of the computer word. The second one [Navarro and Raffinot, ACM JEA, 2000] extends a sublinear-time exact algorithm to approximate searching. The latter technique makes use of an O(kmn/w) time algorithm [Wu and Manber, Comm. ACM, 1992] for its internal workings.
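As a rough illustration of the first technique cited in this abstract, the sketch below follows the commonly published bit-vector formulation of Myers' algorithm for approximate searching. It is not the new technique of the paper. One simplification is assumed: a Python integer stands in for a single machine word of m bits, so the blocking into words of width w is not modelled. The names `myers_search` and `dp_search` are hypothetical; `dp_search` is a plain column-wise dynamic program included only for cross-checking.

```python
def myers_search(pattern, text, k):
    """End positions j in text where pattern matches with at most k differences."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    peq = {}  # per-character bitmask of pattern positions
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    pv, mv, score = mask, 0, m  # vertical +1 / -1 deltas; score = last DP cell
    ends = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask  # horizontal +1 deltas
        mh = pv & xh                   # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask  # shift in 0: a match may start anywhere in text
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            ends.append(j)
    return ends


def dp_search(pattern, text, k):
    """Reference dynamic program with a free starting position in the text."""
    col = list(range(len(pattern) + 1))
    ends = []
    for j, c in enumerate(text):
        new = [0]
        for i in range(1, len(pattern) + 1):
            new.append(min(col[i] + 1, new[i - 1] + 1,
                           col[i - 1] + (pattern[i - 1] != c)))
        col = new
        if col[-1] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(myers_search("abc", "xxabcxx", 0))
```

Each text character is processed with a constant number of word operations, which is where the O(mn/w) bound quoted in the abstract comes from once the pattern is split into ⌈m/w⌉ words.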
Longest Common Subsequences
- In Proc. of 19th MFCS, number 841 in LNCS, 1994
Cited by 25 (1 self)
The length of a longest common subsequence (LLCS) of two or more strings is a useful measure of their similarity. The LLCS of a pair of strings is related to the `edit distance', or number of mutations/errors/editing steps required in passing from one string to the other. In this talk, we explore some of the combinatorial properties of the sub- and super-sequence relations, survey various algorithms for computing the LLCS, and introduce some results on the expected LLCS for pairs of random strings.
1 Introduction
The set $\Sigma^*$ of finite strings over an unordered finite alphabet $\Sigma$ admits of several natural partial orders. Some, such as the substring, prefix, and suffix relations, depend on contiguity and lead to many interesting combinatorial questions with practical applications to string-matching. An excellent survey is given by Aho in [1]. In this talk however we will focus on the `subsequence' partial order. We say that $u = u_1 \cdots u_m$ is a subsequence of ...
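The relationship between the LLCS and edit distance mentioned in this abstract can be made concrete with the classic quadratic dynamic program. This is a minimal sketch of the textbook method, not any particular algorithm surveyed in the talk. It assumes an edit distance that allows only single-character insertions and deletions, for which the distance equals |u| + |v| - 2·LLCS(u, v). The function names are hypothetical.

```python
def llcs(u, v):
    """Length of a longest common subsequence of u and v, using O(len(v)) space."""
    row = [0] * (len(v) + 1)
    for a in u:
        diag = 0  # value of the cell up and to the left
        for j, b in enumerate(v, start=1):
            above = row[j]
            row[j] = diag + 1 if a == b else max(above, row[j - 1])
            diag = above
    return row[-1]


def indel_distance(u, v):
    """Edit distance when only insertions and deletions are allowed."""
    return len(u) + len(v) - 2 * llcs(u, v)


def is_subsequence(u, v):
    """True if u can be obtained from v by deleting zero or more symbols."""
    it = iter(v)
    return all(ch in it for ch in u)


if __name__ == "__main__":
    print(llcs("AGGTAB", "GXTXAYB"), indel_distance("AGGTAB", "GXTXAYB"))
```

Every symbol outside a longest common subsequence must be deleted from one string or inserted into the other, which is why the two measures determine each other under this cost model.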
Approximate Pattern Matching with Samples
- In Proc. of ISAAC'94, 1994
Cited by 22 (1 self)
We simplify in this paper the algorithm by Chang and Lawler for the approximate string matching problem, by adopting the concept of sampling. We have a more general analysis of expected time with the simplified algorithm for the one-dimensional case under a non-uniform probability distribution, and we show that our method can easily be generalized to the two-dimensional approximate pattern matching problem with sublinear expected time.
1 Introduction
Since the inaugural papers on string matching algorithms were published by Knuth, Morris and Pratt [11] and Boyer and Moore [5], the problem diversified into various directions. Let us call string matching one-dimensional pattern matching. One is two-dimensional pattern matching and the other is approximate pattern matching where up to k differences are allowed for a match. Yet another theme is two-dimensional approximate pattern matching. There are numerous papers in these new research areas. We cite just a few of them to compare...
Approximate String Matching using Within-word Parallelism
- Software Practice and Experience, 1994
Cited by 22 (0 self)
This paper shows how the basic dynamic programming problem for the approximate string matching problem can be parallelized by using `chunks' of computer words. This technique is most useful when the alphabet is small and when the word size of the processor is large. For example, when the word size is 64 bits and the alphabet size is 8 (using three bits per character), the degree of parallelism is 21. Empirical results indicate an approximate speedup of 113 over the basic dynamic programming algorithm for alphabet size 8 on a 64-bit processor, and an approximate speedup of 47 for alphabet size 64 on a 64-bit processor. The speedups on a 32-bit processor are approximately half of these.
SEMEX - An Efficient Music Retrieval Prototype
- In First International Symposium on Music Information Retrieval (ISMIR’2000), 2000
Cited by 20 (2 self)
We present an efficient prototype for music information retrieval. The prototype uses bit-parallel algorithms for locating transposition invariant matches of monophonic query melodies within monophonic or polyphonic music stored in a database. When dealing with monophonic music, we employ a fast approximate bit-parallel algorithm with special edit distance metrics.
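To make "transposition invariant" concrete, here is a small sketch of one common way to obtain it for monophonic melodies: compare sequences of pitch intervals rather than absolute pitches. This is an assumption made for illustration and not necessarily how the prototype works; it performs exact matching only, with none of the bit-parallelism or special edit distance metrics the abstract mentions. Pitches are assumed to be MIDI-style note numbers, and the function names are hypothetical.

```python
def intervals(pitches):
    """Differences between consecutive pitches; unchanged by transposition."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def find_transposed(query, melody):
    """Start indices in melody where query occurs under some transposition."""
    if len(query) < 2:
        raise ValueError("query needs at least two notes")
    q = intervals(query)
    m = intervals(melody)
    return [i for i in range(len(m) - len(q) + 1) if m[i:i + len(q)] == q]


if __name__ == "__main__":
    melody = [60, 62, 64, 65, 67]  # C D E F G
    query = [72, 74, 76]           # C D E, one octave higher
    print(find_transposed(query, melody))
```

Because only the intervals are compared, the query matches wherever the melody has the same contour of semitone steps, regardless of key.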

