A Guided Tour to Approximate String Matching
 ACM Computing Surveys
, 1999
Cited by 409 (38 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
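As an illustration of online searching under edit distance, the following is a minimal sketch of the classical column-wise dynamic programming approach. It is not taken from the survey itself; the function name `approx_search` and its interface are assumptions made for this example.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 0-based end positions in `text` where some substring
    ending there is within edit distance `k` of `pattern`."""
    m = len(pattern)
    # col[i] = minimum edits needed to match pattern[:i] against some
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    hits = []
    for pos, ch in enumerate(text):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(old + 1,           # extra text character
                         col[i - 1] + 1,    # missing pattern character
                         prev_diag + cost)  # match or substitution
            prev_diag = old
        if col[m] <= k:
            hits.append(pos)
    return hits
```

For example, `approx_search("abc", "xxabcxx", 1)` reports the positions where "ab", "abc", and "abcx" end. The sketch runs in O(mn) time for a pattern of length m and a text of length n, and reads the text one character at a time.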
The grid file: an adaptable, symmetric multikey file structure
 In Trends in Information Processing Systems, Proc. 3rd ECI Conference, A. Duijvestijn and P. Lockemann, Eds., Lecture Notes in Computer Science 123
, 1981
Cited by 383 (4 self)
Traditional file structures that provide multikey access to records, for example, inverted files, are extensions of file structures originally designed for single-key access. They manifest various deficiencies, in particular for multikey access to highly dynamic files. We study the dynamic aspects of file structures that treat all keys symmetrically, that is, file structures which avoid the distinction between primary and secondary keys. We start from a bitmap approach and treat the problem of file design as one of data compression of a large sparse matrix. This leads to the notions of a grid partition of the search space and of a grid directory, which are the keys to a dynamic file structure called the grid file. This file system adapts gracefully to its contents under insertions and deletions, and thus achieves an upper bound of two disk accesses for single record retrieval; it also handles range queries and partially specified queries efficiently. We discuss in detail the design decisions that led to the grid file, present simulation results of its behavior, and compare it to other multikey access file structures.
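The two-access bound for single record retrieval can be pictured with a small in-memory sketch. Everything here is an assumption made for illustration (the class name `GridFileSketch`, the use of dictionaries in place of disk pages, and the omission of bucket splitting and merging); it only shows how a grid partition and a grid directory lead from a key to a bucket.

```python
from bisect import bisect_right


class GridFileSketch:
    """Toy model of a grid file lookup (no insertions or splits)."""

    def __init__(self, scales, directory, buckets):
        # One sorted list of split points per key dimension; assumed to be
        # small enough to stay in main memory.
        self.scales = scales
        # Grid directory: cell coordinates -> bucket id (would live on disk).
        self.directory = directory
        # Buckets: bucket id -> {key tuple: record} (would live on disk).
        self.buckets = buckets

    def cell_of(self, key):
        """Locate the grid cell of a key using only the in-memory scales."""
        return tuple(bisect_right(s, k) for s, k in zip(self.scales, key))

    def lookup(self, key):
        cell = self.cell_of(key)
        bucket_id = self.directory[cell]         # first disk access
        return self.buckets[bucket_id].get(key)  # second disk access
```

Several neighbouring cells may point to the same bucket, which is how a sparse region of the key space can share one bucket without wasting storage.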
Fast Algorithms for Sorting and Searching Strings
, 1997
Cited by 147 (0 self)
We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algorithms date back at least to the 1960s, but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching. 1. Introduction Section 2 briefly reviews Hoare's [9] Quicksort and binary search trees. We emphasize a well-known isomorphism relating the two, and summarize other basic facts. The multikey algorithms and data structures are presented in Section 3. Multikey Quicksort orders a set of n vectors with k components each. Like regular Quicksort, it partitions its input into...
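The blend of Quicksort and radix sort can be sketched as follows. This is a simplified, not-in-place illustration based on three-way partitioning on one character position at a time; it is not the paper's C implementation, and the function name and pivot choice are assumptions made for this example.

```python
def multikey_quicksort(strings, depth=0):
    """Sort strings that already agree on their first `depth` characters."""
    if len(strings) <= 1:
        return list(strings)

    def char_at(s):
        # -1 marks "end of string" so that shorter strings sort first.
        return ord(s[depth]) if depth < len(s) else -1

    pivot = char_at(strings[len(strings) // 2])
    less = [s for s in strings if char_at(s) < pivot]
    equal = [s for s in strings if char_at(s) == pivot]
    greater = [s for s in strings if char_at(s) > pivot]

    result = multikey_quicksort(less, depth)
    if pivot == -1:
        # Every string in `equal` has ended, so they are all identical.
        result += equal
    else:
        # Radix-sort step: move on to the next character position.
        result += multikey_quicksort(equal, depth + 1)
    result += multikey_quicksort(greater, depth)
    return result
```

The "less" and "greater" parts are handled as in ordinary Quicksort on the same character position, while only the "equal" part advances to the next character.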
Analysis of the clustering properties of the Hilbert space-filling curve
 IEEE Transactions on Knowledge and Data Engineering
, 2001
Cited by 141 (10 self)
Several schemes for the linear mapping of a multidimensional space have been proposed for various applications, such as access methods for spatio-temporal databases and image compression. In these applications, one of the most desired properties from such linear mappings is clustering, which means the locality between objects in the multidimensional space being preserved in the linear space. It is widely believed that the Hilbert space-filling curve achieves the best clustering [1], [14]. In this paper, we analyze the clustering property of the Hilbert space-filling curve by deriving closed-form formulas for the number of clusters in a given query region of an arbitrary shape (e.g., polygons and polyhedra). Both the asymptotic solution for the general case and the exact solution for a special case generalize previous work [14]. They agree with the empirical results that the number of clusters depends on the hypersurface area of the query region and not on its hypervolume. We also show that the Hilbert curve achieves better clustering than the z curve. From a practical point of view, the formulas given in this paper provide a simple measure that can be used to predict the required disk access behaviors and, hence, the total access time.
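The notion of a cluster (a maximal run of consecutive curve positions inside the query region) can be explored empirically with a short sketch. The code below is an illustration, not the paper's closed-form analysis; the function names are assumptions, the grid is two-dimensional, and the side length `n` is assumed to be a power of two.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n-by-n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate or flip the quadrant for the next level
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def z_index(n, x, y):
    """Position of cell (x, y) along the z curve (bit interleaving)."""
    d, bit = 0, 0
    while (1 << bit) < n:
        d |= ((x >> bit) & 1) << (2 * bit)
        d |= ((y >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return d


def count_clusters(index_fn, n, x0, y0, x1, y1):
    """Number of runs of consecutive curve positions in a rectangle."""
    idx = sorted(index_fn(n, x, y)
                 for x in range(x0, x1 + 1)
                 for y in range(y0, y1 + 1))
    return 1 + sum(1 for a, b in zip(idx, idx[1:]) if b != a + 1)
```

Calling `count_clusters` with both index functions over many query rectangles gives an empirical cluster count per curve, which can be set against the behaviour the formulas predict.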
Fractals for Secondary Key Retrieval
Cited by 140 (17 self)
In this paper we propose the use of fractals, and especially the Hilbert curve, in order to design good distance-preserving mappings. Such mappings improve the performance of secondary-key and spatial access methods, where multidimensional points have to be stored on a 1-dimensional medium (e.g., disk). Good clustering reduces the number of disk accesses on retrieval, improving the response time. Our experiments on range queries and nearest neighbor queries showed that the proposed Hilbert curve achieves better clustering than older methods ("bit-shuffling", or Peano curve), for every situation we tried.
Disk allocation for Cartesian product files on multiple-disk systems
 ACM Transactions on Database Systems
, 1982
Cited by 57 (0 self)
Cartesian product files have recently been shown to exhibit attractive properties for partial match queries. This paper considers the file allocation problem for Cartesian product files, which can be stated as follows: Given a k-attribute Cartesian product file and an m-disk system, allocate buckets among the m disks in such a way that, for all possible partial match queries, the concurrency of disk accesses is maximised. The Disk Modulo (DM) allocation method is described first, and it is shown to be strict optimal under many conditions commonly occurring in practice, including all possible partial match queries when the number of disks is 2 or 3. It is also shown that although it has good performance, the DM allocation method is not strict optimal for all possible partial match queries when the number of disks is greater than 3. The General Disk Modulo (GDM) allocation method is then described, and a sufficient but not necessary condition for strict optimality of the GDM method for all partial match queries and any number of disks is then derived. Simulation studies comparing the DM and random allocation methods in terms of the average number of disk accesses, in response to various classes of partial match queries, show the former to be significantly more effective even when the number of disks is greater than 3, that is, even in cases where the DM method is not strict optimal. The results that have been derived formally and shown by simulation can be used for more effective design of optimal file systems for partial match queries. When considering multiple-disk systems with independent access paths, it is important to ensure that similar records are clustered into the same or similar buckets, while similar buckets should be dispersed uniformly among the disks.
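A small sketch can make the allocation problem concrete. The abstract does not state the DM formula; the code assumes the commonly cited definition, in which bucket (i1, ..., ik) goes to disk (i1 + ... + ik) mod m. The helper `partial_match_load` and its query encoding (`None` for an unspecified attribute) are hypothetical names introduced for this example.

```python
from itertools import product
from math import ceil


def disk_modulo(bucket, m):
    """Assumed DM rule: sum of the bucket coordinates, modulo m disks."""
    return sum(bucket) % m


def partial_match_load(query, domain_sizes, m):
    """Return (busiest-disk load, ideal load) for a partial match query.

    `query` holds one entry per attribute: an int when the attribute is
    specified, or None when it is left unspecified.
    """
    ranges = [range(size) if q is None else (q,)
              for q, size in zip(query, domain_sizes)]
    load = [0] * m
    for bucket in product(*ranges):
        load[disk_modulo(bucket, m)] += 1
    # An allocation is strict optimal for this query when the busiest
    # disk holds no more than ceil(qualifying buckets / m) buckets.
    return max(load), ceil(sum(load) / m)
```

When the two returned numbers are equal, the qualifying buckets are spread as evenly as possible across the disks for that query.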
Dynamical Sources in Information Theory: A General Analysis of Trie Structures
 ALGORITHMICA
, 1999
Cited by 50 (7 self)
Digital trees, also known as tries, are a general purpose flexible data structure that implements dictionaries built on sets of words. An analysis is given of three major representations of tries in the form of array-tries, list tries, and bst-tries ("ternary search tries"). The size and the search costs of the corresponding representations are analysed precisely in the average case, while a complete distributional analysis of height of tries is given. The unifying data model used is that of dynamical sources and it encompasses classical models like those of memoryless sources with independent symbols, of finite Markov chains, and of nonuniform densities. The probabilistic behaviour of the main parameters, namely size, path length, or height, appears to be determined by two intrinsic characteristics of the source: the entropy and the probability of letter coincidence. These characteristics are themselves related in a natural way to spectral properties of specific transfer operators of the Ruelle type.
Lower bounds for high dimensional nearest neighbor search and related problems
, 1999
Cited by 47 (2 self)
In spite of extensive and continuing research, for various geometric search problems (such as nearest neighbor search), the best algorithms known have performance that degrades exponentially in the dimension. This phenomenon is sometimes called the curse of dimensionality. Recent results [38, 37, 40] show that in some sense it is possible to avoid the curse of dimensionality for the approximate nearest neighbor search problem. But must the exact nearest neighbor search problem suffer this curse? We provide some evidence in support of the curse. Specifically we investigate the exact nearest neighbor search problem and the related problem of exact partial match within the asymmetric communication model first used by Miltersen [43] to study data structure problems. We derive nontrivial asymptotic lower bounds for the exact problem that stand in contrast to known algorithms for approximate nearest neighbor search.
A new algorithm for optimal constraint satisfaction and its implications
 Alexander D. Scott) Mathematical Institute, University of Oxford
, 2004
Cited by 30 (1 self)
We present a novel method for exactly solving (in fact, counting solutions to) general constraint satisfaction optimization with at most two variables per constraint (e.g. MAX-2-CSP and MIN-2-CSP), which gives the first exponential improvement over the trivial algorithm; more precisely, it is a constant factor improvement in the base of the runtime exponent. In the case where constraints have arbitrary weights, there is a $(1 + \epsilon)$-approximation with roughly the same runtime, modulo polynomial factors. Our algorithm may be used to count the number of optima in MAX-2-SAT and MAX-CUT instances in $O(m^3 2^{\omega n/3})$ time, where $\omega < 2.376$ is the matrix product exponent over a ring. This is the first known algorithm solving MAX-2-SAT and MAX-CUT in provably less than $c^n$ steps in the worst case, for some $c < 2$; similar new results are obtained for related problems. Our main construction may also be used to show that any improvement in the runtime exponent of either k-clique solution (even when k = 3) or matrix multiplication over GF(2) would improve the runtime exponent for solving 2-CSP optimization. As a corollary, we prove that an $n^{o(k)}$-time k-clique algorithm implies $\mathrm{SNP} \subseteq \mathrm{DTIME}[2^{o(n)}]$, for any $k(n) \in o(\sqrt{n} / \log n)$. Further extensions of our technique yield connections between the complexity of some (polynomial time) high dimensional geometry problems and that of some general NP-hard problems. For example, if there are sufficiently faster algorithms for computing the diameter of n points in $\ell_1$, then there is a $(2 - \epsilon)^n$ algorithm for MAX-LIN. Such results may be construed as either lower bounds on these high-dimensional problems, or hope that better algorithms exist for more general NP-hard problems.
Scalability Analysis of Declustering Methods for Multidimensional Range Queries
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 1998
Cited by 29 (17 self)
Efficient storage and retrieval of multi-attribute datasets have become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multi-attribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files across multiple disks to obtain high performance for disk accesses. Though the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far. In this paper we derive formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of queries. These formulas disclose the limited scalability of the declustering methods and are corroborated by extensive simulation experiments. From the practical point of view, the formulas given in this paper provide ...
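The limited scalability mentioned above can be observed on tiny range queries. The sketch below is an illustration rather than the paper's analysis. It assumes the usual definitions of the two methods (Disk Modulo sums the bucket coordinates modulo the number of disks; Fieldwise Xor takes the bitwise XOR of the coordinates modulo the number of disks), and `range_query_cost` is a hypothetical helper.

```python
from functools import reduce
from itertools import product
from math import ceil
from operator import xor


def disk_modulo(bucket, m):
    """Assumed DM rule: sum of coordinates modulo m."""
    return sum(bucket) % m


def fieldwise_xor(bucket, m):
    """Assumed FX rule: bitwise XOR of coordinates modulo m."""
    return reduce(xor, bucket) % m


def range_query_cost(allocate, m, ranges):
    """Return (busiest-disk load, ideal load) for an inclusive range query.

    `ranges` is a list of (low, high) bucket coordinates, one per attribute.
    """
    load = [0] * m
    for bucket in product(*(range(lo, hi + 1) for lo, hi in ranges)):
        load[allocate(bucket, m)] += 1
    return max(load), ceil(sum(load) / m)
```

With four disks, a 2-by-2 range query touches four buckets, yet under both rules two of them land on the same disk, so the query costs two parallel accesses instead of the ideal one.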