Results 1  10
of
73
A Guided Tour to Approximate String Matching
 ACM COMPUTING SURVEYS
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract

Cited by 529 (38 self)
 Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
The grid file: an adaptable, symmetric multikey file structure
 In Trends in Information Processing Systems, Proc. 3rd ECZ Conference, A. Duijvestijn and P. Lockemann, Eds., Lecture Notes in Computer Science 123
, 1981
"... Traditional file structures that provide multikey access to records, for example, inverted files, are extensions of file structures originally designed for singlekey access. They manifest various deficiencies in particular for multikey access to highly dynamic files. We study the dynamic aspects of ..."
Abstract

Cited by 422 (4 self)
 Add to MetaCart
Traditional file structures that provide multikey access to records, for example, inverted files, are extensions of file structures originally designed for singlekey access. They manifest various deficiencies in particular for multikey access to highly dynamic files. We study the dynamic aspects of tile structures that treat all keys symmetrically, that is, file structures which avoid the distinction between primary and secondary keys. We start from a bitmap approach and treat the problem of file design as one of data compression of a large sparse matrix. This leads to the notions of a grid partition of the search space and of a grid directory, which are the keys to a dynamic file structure called the grid file. This tile system adapts gracefully to its contents under insertions and deletions, and thus achieves an upper hound of two disk accesses for single record retrieval; it also handles range queries and partially specified queries efficiently. We discuss in detail the design decisions that led to the grid file, present simulation results of its behavior, and compare it to other multikey access file structures.
Analysis of the clustering properties of the Hilbert spacefilling curve
 IEEE Transactions on Knowledge and Data Engineering
, 2001
"... AbstractÐSeveral schemes for the linear mapping of a multidimensional space have been proposed for various applications, such as access methods for spatiotemporal databases and image compression. In these applications, one of the most desired properties from such linear mappings is clustering, whic ..."
Abstract

Cited by 177 (11 self)
 Add to MetaCart
(Show Context)
AbstractÐSeveral schemes for the linear mapping of a multidimensional space have been proposed for various applications, such as access methods for spatiotemporal databases and image compression. In these applications, one of the most desired properties from such linear mappings is clustering, which means the locality between objects in the multidimensional space being preserved in the linear space. It is widely believed that the Hilbert spacefilling curve achieves the best clustering [1], [14]. In this paper, we analyze the clustering property of the Hilbert spacefilling curve by deriving closedform formulas for the number of clusters in a given query region of an arbitrary shape (e.g., polygons and polyhedra). Both the asymptotic solution for the general case and the exact solution for a special case generalize previous work [14]. They agree with the empirical results that the number of clusters depends on the hypersurface area of the query region and not on its hypervolume. We also show that the Hilbert curve achieves better clustering than the z curve. From a practical point of view, the formulas given in this paper provide a simple measure that can be used to predict the required disk access behaviors and, hence, the total access time.
Fast Algorithms for Sorting and Searching Strings
, 1997
"... We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching a ..."
Abstract

Cited by 162 (0 self)
 Add to MetaCart
We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algorithms date back at least to the 1960s, but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partialmatch searching. 1. Introduction Section 2 briefly reviews Hoare's [9] Quicksort and binary search trees. We emphasize a wellknown isomorphism relating the two, and summarize other basic facts. The multikey algorithms and data structures are presented in Section 3. Multikey Quicksort orders a set of n vectors with k components each. Like regular Quicksort, it partitions its input into...
Fractals for Secondary Key Retrieval
"... In this paper we propose the use of fractals and especially the Hilbert curve, in order to design good distancepreserving mappings. Such mappings improve the performance of secondarykey and spatial access methods, where multidimensional points have to be stored on an 1dimensional medium (e.g., ..."
Abstract

Cited by 159 (19 self)
 Add to MetaCart
(Show Context)
In this paper we propose the use of fractals and especially the Hilbert curve, in order to design good distancepreserving mappings. Such mappings improve the performance of secondarykey and spatial access methods, where multidimensional points have to be stored on an 1dimensional medium (e.g., disk). Good clustering reduces the number of disk accesses on retrieval, improving the response time. Our experiments on range queries and nearest neighbor queries showed that the proposed Hilbert curve achieves better clustering than older methods (&quot;bitshuffling&quot;, or Peano curve), for every situation we tried.
Disk allocation for cartesian product files on multipledisk systems
 ACM Transactions of Database Systems
, 1982
"... Cartesian product files have recently been shown to exhibit attractive properties for partial match queries. This paper considers the file allocation problem for Cartesian product files, which can be stated as follows: Given a kattribute Cartesian product file and an mdisk system, allocate buckets ..."
Abstract

Cited by 64 (0 self)
 Add to MetaCart
(Show Context)
Cartesian product files have recently been shown to exhibit attractive properties for partial match queries. This paper considers the file allocation problem for Cartesian product files, which can be stated as follows: Given a kattribute Cartesian product file and an mdisk system, allocate buckets among the m disks in such a way that, for all possible partial match queries, the concurrency of disk accesses is maximis ed. The Risk Modulo (DM) allocation method is described first, and it is shown to be strict optimal under many conditions commonly occurring in practice, including all possible partial match queries when the number of disks is 2 or 3. It is also shown that although it has good performance, the DM allocation method is not strict optimal for all possible partial match queries when the number of disks is greater than 3. The General Disk Modulo (GDM) allocation method is then described, and a sufficient but not necessary condition for strict optimal&y of the GDM method for all partial match queries and any number of disks is then derived. Simulation studies comparing the DM and random allocation methods in terms of the average number of disk accesses, in response to various classes of partial match queries, show the former to be significantly more effective even when the number of disks is greater than 3, that is, even in cases where the DM method is not strict optimal. The results that have been derived formally and shown by simulation can be used for more effective design of optimal file systems for partial match queries. When considering multipledisk systems with independent access paths, it is important to ensure that similar records are clustered into the same or similar buckets, while similar buckets should be dispersed uniformly among the disks.
Dynamical Sources in Information Theory: A General Analysis of Trie Structures
 ALGORITHMICA
, 1999
"... Digital trees, also known as tries, are a general purpose flexible data structure that implements dictionaries built on sets of words. An analysis is given of three major representations of tries in the form of arraytries, list tries, and bsttries ("ternary search tries"). The size an ..."
Abstract

Cited by 59 (7 self)
 Add to MetaCart
Digital trees, also known as tries, are a general purpose flexible data structure that implements dictionaries built on sets of words. An analysis is given of three major representations of tries in the form of arraytries, list tries, and bsttries ("ternary search tries"). The size and the search costs of the corresponding representations are analysed precisely in the average case, while a complete distributional analysis of height of tries is given. The unifying data model used is that of dynamical sources and it encompasses classical models like those of memoryless sources with independent symbols, of finite Markovchains, and of nonuniform densities. The probabilistic behaviour of the main parameters, namely size, path length, or height, appears to be determined by two intrinsic characteristics of the source: the entropy and the probability of letter coincidence. These characteristics are themselves related in a natural way to spectral properties of specific transfer operators of the Ruelle type.
Lower bounds for high dimensional nearest neighbor search and related problems
, 1999
"... In spite of extensive and continuing research, for various geometric search problems (such as nearest neighbor search), the best algorithms known have performance that degrades exponentially in the dimension. This phenomenon is sometimes called the curse of dimensionality. Recent results [38, 37, 40 ..."
Abstract

Cited by 52 (2 self)
 Add to MetaCart
(Show Context)
In spite of extensive and continuing research, for various geometric search problems (such as nearest neighbor search), the best algorithms known have performance that degrades exponentially in the dimension. This phenomenon is sometimes called the curse of dimensionality. Recent results [38, 37, 40] show that in some sense it is possible to avoid the curse of dimensionality for the approximate nearest neighbor search problem. But must the exact nearest neighbor search problem suffer this curse? We provide some evidence in support of the curse. Specifically we investigate the exact nearest neighbor search problem and the related problem of exact partial match within the asymmetric communication model first used by Miltersen [43] to study data structure problems. We derive nontrivial asymptotic lower bounds for the exact problem that stand in contrast to known algorithms for approximate nearest neighbor search. 1
Burst Tries: A Fast, Efficient Data Structure for String Keys
 ACM Transactions on Information Systems
, 2002
"... Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, t ..."
Abstract

Cited by 39 (10 self)
 Add to MetaCart
(Show Context)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or nearsorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
A new algorithm for optimal constraint satisfaction and its implications
 Alexander D. Scott) Mathematical Institute, University of Oxford
, 2004
"... We present a novel method for exactly solving (in fact, counting solutions to) general constraint satisfaction optimization with at most two variables per constraint (e.g. MAX2CSP and MIN2CSP), which gives the first exponential improvement over the trivial algorithm; more precisely, it is a cons ..."
Abstract

Cited by 38 (1 self)
 Add to MetaCart
(Show Context)
We present a novel method for exactly solving (in fact, counting solutions to) general constraint satisfaction optimization with at most two variables per constraint (e.g. MAX2CSP and MIN2CSP), which gives the first exponential improvement over the trivial algorithm; more precisely, it is a constant factor improvement in the base of the runtime exponent. In the case where constraints have arbitrary weights, there is a (1 + ǫ)approximation with roughly the same runtime, modulo polynomial factors. Our algorithm may be used to count the number of optima in MAX2SAT and MAXCUT instances in O(m 3 2 ωn/3) time, where ω < 2.376 is the matrix product exponent over a ring. This is the first known algorithm solving MAX2SAT and MAXCUT in provably less than c n steps in the worst case, for some c < 2; similar new results are obtained for related problems. Our main construction may also be used to show that any improvement in the runtime exponent of either kclique solution (even when k = 3) or matrix multiplication over GF(2) would improve the runtime exponent for solving 2CSP optimization. As a corollary, we prove that an n o(k)time kclique algorithm implies SNP ⊆ DTIME[2 o(n)], for any k(n) ∈ o ( √ n / log n). Further extensions of our technique yield connections between the complexity of some (polynomial time) high dimensional geometry problems and that of some general NPhard problems. For example, if there are sufficiently faster algorithms for computing the diameter of n points in ℓ1, then there is an (2 −ǫ) n algorithm for MAXLIN. Such results may be construed as either lower bounds on these highdimensional problems, or hope that better algorithms exist for more general NPhard problems. 1