Results 1 - 10 of 54
A Metric Index for Approximate String Matching
In LATIN, 2002
Cited by 20 (1 self)
We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This lets us find the R occurrences of a pattern of length m in a text of length n in average time $O(m \log^2 n + m^2 + R)$, using $O(n \log n)$ space and $O(n \log^2 n)$ index construction time. This complexity improves by far over all other previous methods. We also show a simpler scheme needing $O(n)$ space.
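The abstract above treats approximate matching as a proximity query in a metric space. The following minimal sketch illustrates only that general idea, not the paper's suffix-tree index: it uses the triangle inequality with a single pivot string to skip distance computations. The names `edit_distance`, `range_query`, `sites`, and `pivot` are hypothetical and chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def range_query(query, radius, sites, pivot, pivot_dists):
    """Return all sites within `radius` of `query`.

    pivot_dists[i] is the precomputed distance from `pivot` to sites[i].
    Because edit distance is a metric, d(query, s) >= |d(query, pivot) - d(pivot, s)|,
    so a site whose bound already exceeds `radius` is discarded without
    computing its distance to the query.
    """
    dq = edit_distance(query, pivot)
    hits = []
    for site, dp in zip(sites, pivot_dists):
        if abs(dq - dp) > radius:
            continue  # pruned by the triangle inequality
        if edit_distance(query, site) <= radius:
            hits.append(site)
    return hits


sites = ["survey", "surgery", "sugar", "banana"]
pivot = "surgery"
pivot_dists = [edit_distance(pivot, s) for s in sites]
print(range_query("surgey", 1, sites, pivot, pivot_dists))
```

The pruning never changes the answer; it only reduces how many full distance computations are performed.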
Disorder inequality: A combinatorial approach to nearest neighbor search
In WSDM’08
Cited by 17 (4 self)
We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a’th most similar object to z and y is the b’th most similar object to z, then x is among the D(S) · (a + b) most similar objects to y. Assuming that disorder is small, we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
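The disorder inequality stated in this abstract can be checked by brute force on a small set. The sketch below is an assumption-laden illustration, not the paper's code: ranks are taken over the other objects of S only, ties are broken by index, and the function names `rank_table` and `disorder_constant` are hypothetical. Only comparisons between similarity values are used, in the combinatorial spirit of the abstract.

```python
from itertools import permutations


def rank_table(items, similarity):
    """ranks[y][x] = 1-based position of x among the other items,
    ordered by decreasing similarity to y (ties broken by index)."""
    n = len(items)
    ranks = [[0] * n for _ in range(n)]
    for y in range(n):
        others = [x for x in range(n) if x != y]
        others.sort(key=lambda x: -similarity(items[y], items[x]))
        for pos, x in enumerate(others, start=1):
            ranks[y][x] = pos
    return ranks


def disorder_constant(items, similarity):
    """Smallest D such that rank_y(x) <= D * (rank_z(x) + rank_z(y))
    holds for every triple of distinct objects x, y, z."""
    ranks = rank_table(items, similarity)
    worst = 0.0
    for x, y, z in permutations(range(len(items)), 3):
        worst = max(worst, ranks[y][x] / (ranks[z][x] + ranks[z][y]))
    return worst


points = [0, 1, 2, 4]
closeness = lambda p, q: -abs(p - q)  # higher value means more similar
print(disorder_constant(points, closeness))
```

The cubic scan over triples is only practical for tiny sets; it is meant to make the definition concrete.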
Atomic wedgie: Efficient query filtering for streaming time series
In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), 2005
Cited by 17 (6 self)
In many applications it is desirable to monitor a streaming time series for predefined patterns. In domains as diverse as the monitoring of space telemetry, patient intensive care data, and insect populations, where data streams at a high rate and the number of predefined patterns is large, it may be impossible for the comparison algorithm to keep up. We propose a novel technique that exploits the commonality among the predefined patterns to allow monitoring at higher bandwidths, while maintaining a guarantee of no false dismissals. Our approach is based on the widely used envelope-based lower bounding technique. Extensive experiments demonstrate that our approach achieves tremendous improvements in performance in the offline case, and significant improvements in the fastest possible arrival rate of the data stream that can be processed with guaranteed no false dismissal.
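The envelope-based lower bounding mentioned in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions: patterns and the candidate window have equal length, distance is Euclidean, and a single envelope covers all patterns. The names `wedge`, `lower_bound`, and `matches` are hypothetical and do not come from the paper.

```python
import math


def wedge(patterns):
    """Pointwise upper and lower envelope enclosing every pattern."""
    upper = [max(col) for col in zip(*patterns)]
    lower = [min(col) for col in zip(*patterns)]
    return upper, lower


def lower_bound(candidate, upper, lower):
    """Distance from the candidate to the envelope. It never exceeds the
    Euclidean distance to any enclosed pattern, so pruning on it causes
    no false dismissals."""
    total = 0.0
    for c, hi, lo in zip(candidate, upper, lower):
        if c > hi:
            total += (c - hi) ** 2
        elif c < lo:
            total += (lo - c) ** 2
    return math.sqrt(total)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matches(candidate, patterns, threshold):
    """Indices of patterns within `threshold` of the candidate window."""
    upper, lower = wedge(patterns)
    if lower_bound(candidate, upper, lower) > threshold:
        return []  # one comparison rules out every pattern at once
    return [i for i, p in enumerate(patterns) if euclidean(candidate, p) <= threshold]


patterns = [[0, 1, 2], [0, 2, 2]]
print(matches([5, 5, 5], patterns, 1.0))
print(matches([0, 1.5, 2], patterns, 0.6))
```

When the patterns share a similar shape the envelope is tight, so most non-matching windows are rejected after a single bound computation.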
Higher lower bounds for near-neighbor and further rich problems
In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS)
Cited by 16 (2 self)
We convert cell-probe lower bounds for polynomial space into stronger lower bounds for near-linear space. Our technique applies to any lower bound proved through the richness method. For example, it applies to partial match, and to near-neighbor problems, either for randomized exact search, or for deterministic approximate search (which are thought to exhibit the curse of dimensionality). These problems are motivated by search in large databases, so near-linear space is the most relevant regime. Typically, richness has been used to imply $\Omega(d / \lg n)$ lower bounds for polynomial-space data structures, where d is the number of bits of a query. This is the highest lower bound provable through the classic reduction to communication complexity. However, for space $n \lg^{O(1)} n$, we now obtain bounds of $\Omega(d / \lg d)$. This is a significant improvement for natural values of d, such as $\lg^{O(1)} n$. In the most important case of $d = \Theta(\lg n)$, we have the first superconstant lower bound. From a complexity-theoretic perspective, our lower bounds are the highest known for any static data structure problem, significantly improving on previous records.
A linear size index for approximate pattern matching
In Proc. 17th Annual Symposium on Combinatorial Pattern Matching, 2006
Cited by 12 (1 self)
This paper revisits the problem of indexing a text S[1..n] to support searching substrings in S that match a given pattern P[1..m] with at most k errors. A naive solution either has a worst-case matching time complexity of $\Omega(m^k)$ or requires $\Omega(n^k)$ space. Devising a solution with better performance has been a challenge until Cole et al. [5] showed an $O(n \log^k n)$-space index that can support k-error matching in $O(m + occ + \log^k n \log\log n)$ time, where occ is the number of occurrences. Motivated by the indexing of DNA, we investigate in this paper the feasibility of devising a linear-size index that still has a time complexity linear in m. In particular, we give an $O(n)$-space index that supports k-error matching in $O(m + occ + (\log n)^{k(k+1)} \log\log n)$ worst-case time. Furthermore, the index can be compressed from $O(n)$ words into $O(n)$ bits with a slight increase in the time complexity.
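To make the k-error matching problem in this abstract concrete, here is a minimal index-free baseline: the classic column-wise dynamic program (often attributed to Sellers) that scans the text in O(nm) time. It is not the linear-size index the paper describes, and the function name `k_error_end_positions` is hypothetical.

```python
def k_error_end_positions(text, pattern, k):
    """Return 1-based end positions in `text` where some substring matches
    `pattern` with at most k errors (insertions, deletions, substitutions)."""
    m = len(pattern)
    # col[i] = fewest errors needed to match pattern[:i] against some
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text, start=1):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(old + 1,                             # text char is extra
                         col[i - 1] + 1,                      # pattern char is missing
                         prev_diag + (pattern[i - 1] != ch))  # substitute or match
            prev_diag = old
        if col[m] <= k:
            ends.append(pos)
    return ends


print(k_error_end_positions("abcdefg", "cxe", 1))
```

An index such as the one described above aims to avoid this full scan of the text for every query.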
Efficient algorithms for substring near neighbor problem
In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006
Cited by 12 (3 self)
In this paper we consider the problem of finding the approximate nearest neighbor when the data set points are the substrings of a given text T. Specifically, for a string T of length n, we present a data structure which does the following: given a pattern P, if there is a substring of T within the distance R from P, it reports a (possibly different) substring of T within distance cR from P. The length of the pattern P, denoted by m, is not known in advance. For the case where the distances are measured using the Hamming distance, we present a data structure which uses $\tilde{O}(n^{1+1/c})$ space and $\tilde{O}(n^{1/c} + m n^{o(1)})$ query time. This essentially matches the earlier bounds of [Ind98], which assumed that the pattern length m is fixed in advance. In addition, our data structure can be constructed in time $\tilde{O}(n^{1+1/c} + n^{1+o(1)} M^{1/3})$, where M is an upper bound for m. This essentially matches the preprocessing bound of [Ind98] as long as the term $\tilde{O}(n^{1+1/c})$ dominates the running time, which is the case when, e.g., c < 3. We also extend our results to the case where the distances are measured according to the $\ell_1$ distance. The query time and the space bound are essentially the same, while the preprocessing time becomes $\tilde{O}(n^{1+1/c} + n^{1+o(1)} M^{2/3})$.
Unifying the landscape of cell-probe lower bounds
2008
Cited by 11 (1 self)
We show that a large fraction of the data-structure lower bounds known today in fact follow by reduction from the communication complexity of lopsided (asymmetric) set disjointness. This includes lower bounds for:
• high-dimensional problems, where the goal is to show large space lower bounds.
• constant-dimensional geometric problems, where the goal is to bound the query time for space $O(n \cdot \mathrm{polylog}\, n)$.
• dynamic problems, where we are looking for a tradeoff between query and update time. (In this case, our bounds are slightly weaker than the originals, losing a $\lg\lg n$ factor.)
Our reductions also imply the following new results:
• an $\Omega(\lg n / \lg\lg n)$ bound for 4-dimensional range reporting, given space $O(n \cdot \mathrm{polylog}\, n)$. This is quite timely, since a recent result [39] solved 3D reporting in $O(\lg^2 \lg n)$ time, raising the prospect that higher dimensions could also be easy.
• a tight space lower bound for the partial match problem, for constant query time.
• the first lower bound for reachability oracles.
In the process, we prove optimal randomized lower bounds for lopsided set disjointness.
Pattern matching with address errors: rearrangement distances
In SODA, 2006
Cited by 10 (0 self)
Historically, approximate pattern matching has mainly focused on coping with errors in the data, while the order of the text/pattern was assumed to be more or less correct. In this paper we consider a class of pattern matching problems where the content is assumed to be correct, while the locations may have shifted/changed. We formally define a broad class of problems of this type, capturing situations in which the pattern is obtained from the text by a sequence of rearrangements. We consider several natural rearrangement schemes, including the analogues of the ℓ1 and ℓ2 distances, as well as two distances based on interchanges. For these, we present efficient algorithms to solve the resulting string matching problems.