A Metric Index for Approximate String Matching
- In LATIN, 2002
Cited by 16 (1 self)
We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This allows us to find the R occurrences of a pattern of length m in a text of length n in average time O(m log n+m +R), using O(n log n) space and O(n log n) index construction time. This complexity improves by far on all previous methods. We also show a simpler scheme needing O(n) space.
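The abstract above treats approximate matching as a proximity query in a metric space. The following is a minimal sketch of that general idea only, not the authors' suffix-tree index: it assumes a flat list of candidate strings (`sites`), a single hypothetical `pivot` string, and precomputed distances from each site to the pivot. Because edit distance is a metric, the triangle inequality lets a query discard sites without computing their distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def range_query(sites, pivot, pivot_dists, query, k):
    """Return (hits, computed): sites within distance k of query, and how
    many full distance computations against sites were needed.

    pivot_dists[i] must equal edit_distance(sites[i], pivot).
    """
    dq = edit_distance(query, pivot)
    hits, computed = [], 0
    for site, ds in zip(sites, pivot_dists):
        # Triangle inequality: d(query, site) >= |dq - ds|, so if that
        # lower bound already exceeds k the site cannot be a hit.
        if abs(dq - ds) > k:
            continue
        computed += 1
        if edit_distance(query, site) <= k:
            hits.append(site)
    return hits, computed
```

With more pivots the lower bound tightens, at the cost of storing more precomputed distances; that trade-off is the usual design knob in pivot-based metric indexes.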
Atomic wedgie: Efficient query filtering for streaming time series
- In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), 2005
Cited by 13 (4 self)
In many applications it is desirable to monitor a streaming time series for predefined patterns. In domains as diverse as the monitoring of space telemetry, patient intensive care data, and insect populations, where data streams at a high rate and the number of predefined patterns is large, it may be impossible for the comparison algorithm to keep up. We propose a novel technique that exploits the commonality among the predefined patterns to allow monitoring at higher bandwidths, while maintaining a guarantee of no false dismissals. Our approach is based on the widely used envelope-based lower bounding technique. Extensive experiments demonstrate that our approach achieves tremendous improvements in performance in the offline case, and significant improvements in the fastest possible arrival rate of the data stream that can be processed with a guarantee of no false dismissals.
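The envelope-based lower bound mentioned above can be illustrated with a small sketch. It assumes Euclidean distance, equal-length patterns, and a single flat envelope over all patterns; the function names are hypothetical and the paper's own construction may differ. The envelope bound never exceeds the true distance to any enclosed pattern, so rejecting a window on the bound alone causes no false dismissals.

```python
import math


def build_wedge(patterns):
    """Pointwise upper and lower envelope enclosing every pattern."""
    upper = [max(col) for col in zip(*patterns)]
    lower = [min(col) for col in zip(*patterns)]
    return upper, lower


def wedge_lower_bound(window, upper, lower):
    """Distance from the window to the envelope; it lower-bounds the
    Euclidean distance from the window to each enclosed pattern."""
    total = 0.0
    for x, u, l in zip(window, upper, lower):
        if x > u:
            total += (x - u) ** 2
        elif x < l:
            total += (l - x) ** 2
    return math.sqrt(total)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matching_patterns(window, patterns, radius):
    """Indices of patterns within `radius` of the window."""
    upper, lower = build_wedge(patterns)
    # One cheap test can dismiss every pattern at once.
    if wedge_lower_bound(window, upper, lower) > radius:
        return []
    return [i for i, p in enumerate(patterns) if euclidean(window, p) <= radius]
```

In a streaming setting the envelope would be built once and reused for every incoming window; it is rebuilt per call here only to keep the sketch short.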
A linear size index for approximate pattern matching
- In Proc. 17th Annual Symposium on Combinatorial Pattern Matching, 2006
Cited by 12 (1 self)
This paper revisits the problem of indexing a text S[1..n] to support searching substrings in S that match a given pattern P[1..m] with at most k errors. A naive solution either has a worst-case matching time complexity of Ω(m^k) or requires Ω(n^k) space. Devising a solution with better performance has been a challenge until Cole et al. [5] showed an O(n log^k n)-space index that can support k-error matching in O(m + occ + log^k n log log n) time, where occ is the number of occurrences. Motivated by the indexing of DNA, we investigate in this paper the feasibility of devising a linear-size index that still has a time complexity linear in m. In particular, we give an O(n)-space index that supports k-error matching in O(m + occ + (log n)^{k(k+1)} log log n) worst-case time. Furthermore, the index can be compressed from O(n) words into O(n) bits with a slight increase in the time complexity.
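To make the k-error matching problem concrete, here is a minimal index-free baseline, assuming edit-distance errors. It is the classical column-by-column dynamic program that scans the whole text in O(nm) time, not the index described in the abstract, and the function name is hypothetical. It reports every 0-based text position at which some substring matching the pattern with at most k errors ends.

```python
def k_error_end_positions(text: str, pattern: str, k: int):
    """End positions in `text` of substrings within edit distance k of `pattern`."""
    m = len(pattern)
    # col[i] = fewest errors aligning pattern[:i] with some substring of
    # the text ending at the previous position.
    col = list(range(m + 1))
    ends = []
    for j, tc in enumerate(text):
        new = [0]  # the empty pattern prefix matches anywhere at zero cost
        for i in range(1, m + 1):
            new.append(min(col[i] + 1,                           # extra text character
                           new[i - 1] + 1,                       # missing pattern character
                           col[i - 1] + (pattern[i - 1] != tc)))  # substitution or match
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

An index aims to answer the same question without touching all n text positions per query, which is where the space and time bounds quoted above come in.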
Disorder inequality: A combinatorial approach to nearest neighbor search
- In WSDM’08
Cited by 11 (4 self)
We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a’th most similar object to z and y is the b’th most similar object to z, then x is among the D(S) · (a + b) most similar objects to y. Assuming that disorder is small we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
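The disorder inequality above can be checked by brute force on a small set. This is a sketch under stated assumptions, not the paper's algorithm: ranks are 1-based and taken among the other objects only, ties are broken by input order, and x, y, z are required to be distinct. The names `ranks_from` and `disorder_constant` are hypothetical. The function returns the smallest D for which the inequality holds on the given set.

```python
from itertools import permutations


def ranks_from(z, items, sim):
    """Map each other item to its 1-based similarity rank as seen from z."""
    others = [o for o in items if o != z]
    others.sort(key=lambda o: sim(z, o), reverse=True)  # most similar first
    return {o: r for r, o in enumerate(others, start=1)}


def disorder_constant(items, sim):
    """Smallest D with rank_y(x) <= D * (rank_z(x) + rank_z(y)) for all
    distinct x, y, z. Items must be hashable and pairwise distinct."""
    rank = {z: ranks_from(z, items, sim) for z in items}
    worst = 0.0
    for x, y, z in permutations(items, 3):
        worst = max(worst, rank[y][x] / (rank[z][x] + rank[z][y]))
    return worst
```

The computation is cubic in the number of items, so it is useful only for inspecting small samples. Note that only comparisons between similarity values are used (through sorting), which matches the combinatorial setting described in the abstract.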
Higher lower bounds for near-neighbor and further rich problems
- In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS)
Cited by 9 (2 self)
We convert cell-probe lower bounds for polynomial space into stronger lower bounds for near-linear space. Our technique applies to any lower bound proved through the richness method. For example, it applies to partial match, and to near-neighbor problems, either for randomized exact search, or for deterministic approximate search (which are thought to exhibit the curse of dimensionality). These problems are motivated by search in large databases, so near-linear space is the most relevant regime. Typically, richness has been used to imply Ω(d / lg n) lower bounds for polynomial-space data structures, where d is the number of bits of a query. This is the highest lower bound provable through the classic reduction to communication complexity. However, for space n lg^{O(1)} n, we now obtain bounds of Ω(d / lg d). This is a significant improvement for natural values of d, such as lg^{O(1)} n. In the most important case of d = Θ(lg n), we have the first superconstant lower bound. From a complexity theoretic perspective, our lower bounds are the highest known for any static data structure problem, significantly improving on previous records.
Efficient algorithms for substring near neighbor problem
- In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006
Cited by 9 (2 self)
In this paper we consider the problem of finding the approximate nearest neighbor when the data set points are the substrings of a given text T. Specifically, for a string T of length n, we present a data structure which does the following: given a pattern P, if there is a substring of T within distance R from P, it reports a (possibly different) substring of T within distance cR from P. The length of the pattern P, denoted by m, is not known in advance. For the case where the distances are measured using the Hamming distance, we present a data structure which uses Õ(n^{1+1/c}) space and has Õ(n^{1/c} + m n^{o(1)}) query time. This essentially matches the earlier bounds of [Ind98], which assumed that the pattern length m is fixed in advance. In addition, our data structure can be constructed in time Õ(n^{1+1/c} + n^{1+o(1)} M^{1/3}), where M is an upper bound for m. This essentially matches the preprocessing bound of [Ind98] as long as the term Õ(n^{1+1/c}) dominates the running time, which is the case when, e.g., c < 3. We also extend our results to the case where the distances are measured according to the ℓ1 distance. The query time and the space bound are essentially the same, while the preprocessing time becomes Õ(n^{1+1/c} + n^{1+o(1)} M^{2/3}).
Pattern matching with address errors: rearrangement distances
- In SODA, 2006
Cited by 6 (0 self)
Historically, approximate pattern matching has mainly focused on coping with errors in the data, while the order of the text/pattern was assumed to be more or less correct. In this paper we consider a class of pattern matching problems where the content is assumed to be correct, while the locations may have shifted/changed. We formally define a broad class of problems of this type, capturing situations in which the pattern is obtained from the text by a sequence of rearrangements. We consider several natural rearrangement schemes, including the analogues of the ℓ1 and ℓ2 distances, as well as two distances based on interchanges. For these, we present efficient algorithms to solve the resulting string matching problems.
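One way to read the ℓ1 analogue mentioned above is as the minimum total displacement needed to rearrange one string into another. The sketch below rests on that assumption: moving a symbol from position i to position j costs |i - j|, and the distance is the cheapest rearrangement of x into y. The function name is hypothetical and the paper's exact definition may differ. For this cost, pairing the t-th occurrence of each symbol in x with its t-th occurrence in y is optimal, since matching sorted positions minimizes a sum of absolute differences.

```python
from collections import Counter, defaultdict


def l1_rearrangement_distance(x: str, y: str):
    """Minimum total displacement turning x into y, or None when y is
    not a rearrangement of x."""
    if Counter(x) != Counter(y):
        return None
    pos_x, pos_y = defaultdict(list), defaultdict(list)
    for i, c in enumerate(x):
        pos_x[c].append(i)
    for j, c in enumerate(y):
        pos_y[c].append(j)
    # Positions are collected in increasing order, so zip pairs the
    # t-th occurrence in x with the t-th occurrence in y.
    return sum(abs(i - j)
               for c in pos_x
               for i, j in zip(pos_x[c], pos_y[c]))
```

Applied to a single pattern and one equal-length text window this takes linear time; sliding it over every window of a text gives a simple quadratic baseline for the matching problem.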
Approximate String Matching with Lempel-Ziv Compressed Indexes
Cited by 5 (4 self)
A compressed full-text self-index for a text T is a data structure that requires reduced space and is capable of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, thus it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond the basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm has a competitive performance and provides a useful space-time tradeoff compared to classical indexes.

