LSH Forest: self-tuning indexes for similarity search
- In WWW, 2005
- Cited by 29 (0 self)
We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH’s performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.
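The idea of variable-length LSH labels can be illustrated with a small sketch. It is not the paper's index: it assumes random-hyperplane hash bits, stores labels in plain dictionaries and scans them linearly instead of using prefix trees, and all names (`PrefixForest`, `num_trees`, `k_max`, `m`) are hypothetical. What it shows is the self-tuning step, where the query shortens the required label prefix until enough candidates are found, so no fixed label length has to be chosen in advance.

```python
import random


def make_planes(dim, k_max, rng):
    """Draw k_max random hyperplanes; each one contributes one label bit."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k_max)]


def label(planes, vec):
    """Bit string giving the side of each hyperplane on which vec falls."""
    return "".join(
        "1" if sum(p * v for p, v in zip(plane, vec)) >= 0 else "0"
        for plane in planes
    )


class PrefixForest:
    """Several independent label sets ("trees"), queried by shrinking prefixes."""

    def __init__(self, dim, num_trees=5, k_max=16, seed=0):
        rng = random.Random(seed)
        self.k_max = k_max
        self.trees = [make_planes(dim, k_max, rng) for _ in range(num_trees)]
        self.labels = [dict() for _ in range(num_trees)]  # point id -> label

    def add(self, pid, vec):
        for planes, labels in zip(self.trees, self.labels):
            labels[pid] = label(planes, vec)

    def query(self, vec, m=3):
        """Return ids sharing the longest label prefix that yields >= m candidates."""
        qlabels = [label(planes, vec) for planes in self.trees]
        candidates = set()
        for x in range(self.k_max, -1, -1):  # shorten the prefix step by step
            for q, labels in zip(qlabels, self.labels):
                prefix = q[:x]
                candidates.update(
                    pid for pid, lab in labels.items() if lab.startswith(prefix)
                )
            if len(candidates) >= m:
                break
        return candidates


forest = PrefixForest(dim=3, num_trees=4, k_max=8, seed=1)
forest.add("a", [1.0, 0.0, 0.0])
forest.add("b", [0.9, 0.1, 0.0])
forest.add("c", [0.0, 1.0, 0.0])
forest.add("d", [0.0, 0.0, -1.0])
print(forest.query([1.0, 0.0, 0.0], m=1))
```

A real index would replace the linear scans with prefix trees over the labels and would rank the returned candidates by their actual similarity to the query.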
OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences
- In VLDB, 2003
- Cited by 27 (4 self)
A common query against large protein and gene sequence data sets is to locate targets that are similar to an input query sequence. The current set of popular search tools, such as BLAST, employ heuristics to improve the speed of such searches. However, such heuristics can sometimes miss targets, which in many cases is undesirable.
Tries for Approximate String Matching
- IEEE Transactions on Knowledge and Data Engineering, 1996
- Cited by 22 (1 self)
Tries offer text searches with costs which are independent of the size of the document being searched, and so are important for large documents requiring secondary storage. Approximate searches, in which the search pattern differs from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie-based method whose cost is independent of document size. Our experiments show that this new method significantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases. The linear cost (in k) of the other methods begins to catch up, for our small files, only at k=2. For larger files, complexity arguments i...
H. Shang and T.H. Merrett are at the School of Computer Science, McGill University, Montréal, Québec, Canada H3A 2A7. Email: {shang, tim}@cs.mcgill.ca
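A minimal sketch of approximate search over a trie, under simplifying assumptions: it indexes a small word list rather than the suffixes of a document, it counts substitutions, insertions and deletions but not transpositions, and the helper names (`build_trie`, `search`) are hypothetical rather than taken from the paper. One row of the edit-distance table is computed per trie edge, and a branch is abandoned once every entry in its row exceeds k.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks a complete word (assumes no "$" in words)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root


def search(trie, pattern, k):
    """Return sorted (word, distance) pairs whose edit distance to pattern is <= k."""
    results = []

    def walk(node, prev_row):
        if "$" in node and prev_row[-1] <= k:
            results.append((node["$"], prev_row[-1]))
        for ch, child in node.items():
            if ch == "$":
                continue
            # Extend the dynamic-programming table by one row for character ch.
            row = [prev_row[0] + 1]
            for j in range(1, len(pattern) + 1):
                row.append(min(
                    row[j - 1] + 1,                           # skip a pattern character
                    prev_row[j] + 1,                          # skip the trie character
                    prev_row[j - 1] + (pattern[j - 1] != ch)  # match or substitute
                ))
            if min(row) <= k:  # otherwise no descendant can be within k edits
                walk(child, row)

    walk(trie, list(range(len(pattern) + 1)))
    return sorted(results)


trie = build_trie(["cat", "cart", "car", "dog"])
print(search(trie, "cat", 1))  # [('car', 1), ('cart', 1), ('cat', 0)]
```

Because the pruning test depends only on the pattern and k, the part of the trie that is visited does not grow with the number of indexed words that lie far from the pattern.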
A Metric Index for Approximate String Matching
- In LATIN, 2002
- Cited by 16 (1 self)
We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits us to find the R occurrences of a pattern of length m in a text of length n in average time O(m log n + m + R), using O(n log n) space and O(n log n) index construction time. This complexity improves by far over all other previous methods. We also show a simpler scheme needing O(n) space.
Longest-match String Searching for Ziv-Lempel Compression
- 1993
- Cited by 14 (2 self)
This paper presents eight data structures that can be used to accelerate the searching, including adaptations of four methods normally used for exact-match searching. The algorithms are evaluated analytically and empirically, indicating the trade-offs available between compression speed and memory consumption. Two of the algorithms are well-known methods of finding the longest match: the time-consuming linear search and the storage-intensive trie (digital search tree). The trie is adapted along the lines of a PATRICIA tree to operate economically. Hashing, binary search trees, splay trees and the Boyer-Moore searching algorithm are traditionally used to search for exact matches, but we show how these can be adapted to find longest matches. In addition, two data structures specifically designed for the application are presented.
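The linear search named above as the baseline can be sketched as follows. This is an illustration only: the function name and the `window` and `max_len` limits are assumptions, not values from the paper. Every earlier position inside the window is tried as a match start, which is why the method is slow but needs no auxiliary storage.

```python
def longest_match(data, pos, window=4096, max_len=255):
    """Return (offset, length) of the longest earlier match for data[pos:].

    offset is the distance back from pos; (0, 0) means no match was found.
    Matches may overlap the current position, as in LZ77-style coders.
    """
    best_off, best_len = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while (length < max_len
               and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:  # strict comparison keeps the earliest candidate on ties
            best_off, best_len = pos - cand, length
    return best_off, best_len


print(longest_match(b"abcabcabcd", 3))  # (3, 6)
```

The cost is proportional to the window size times the match length for every position coded, which is the expense the faster data structures in the paper aim to avoid.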
Distributed generation of suffix arrays
- In 8th Annual Symposium on Combinatorial Pattern Matching, 1997
Self-indexed text compression using straight-line programs
- In Proc. 34th MFCS, 2009
- Cited by 6 (4 self)
Straight-line programs (SLPs) offer powerful text compression by representing a text T[1, u] in terms of a restricted context-free grammar of n rules, so that T can be recovered in O(u) time. However, the problem of operating the grammar in compressed form has not been studied much. We present a grammar representation whose size is of the same order as that of a plain SLP representation, and which can answer other queries apart from expanding nonterminals. This can be of independent interest. We then extend it to achieve the first grammar representation capable of extracting text substrings, and of searching the text for patterns, in time o(n). We also give byproducts on representing binary relations.
1 Introduction and Related Work
Grammar-based compression has been a well-known technique since at least the seventies, and is still a very active area of research. From the different variants of the idea, we focus on the case where a given text T[1, u] is replaced by a context-free grammar (CFG) G that generates just the string T. Then one can store G instead
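To make the notion of a straight-line program concrete, here is a small sketch with a hypothetical six-rule grammar. It is not the representation proposed in the paper: it only shows full expansion in time linear in the text length, plus the textbook way of reading one character by descending the grammar with precomputed expansion lengths. Positions are 0-based here, unlike the 1-based T[1, u] of the abstract.

```python
from functools import lru_cache

# Hypothetical SLP: each rule is either a single terminal character
# or a pair of previously defined nonterminals.
RULES = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),  # ab
    "X4": ("X3", "X2"),  # aba
    "X5": ("X4", "X3"),  # abaab
    "X6": ("X5", "X4"),  # abaababa
}


def expand(rules, sym):
    """Recover the full text generated by sym."""
    rhs = rules[sym]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


def make_length(rules):
    """Return a memoized function giving the expansion length of a symbol."""
    @lru_cache(maxsize=None)
    def length(sym):
        rhs = rules[sym]
        if isinstance(rhs, str):
            return 1
        return length(rhs[0]) + length(rhs[1])
    return length


def char_at(rules, length, sym, i):
    """Return character i (0-based) of the expansion of sym without expanding it."""
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < length(left):
            sym = left
        else:
            i -= length(left)
            sym = right


length = make_length(RULES)
print(expand(RULES, "X6"))              # abaababa
print(char_at(RULES, length, "X6", 4))  # b
```

The descent in `char_at` takes time proportional to the height of the grammar, so access is fast only when the grammar is balanced.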
B-trees: Bearing Fruits of All Kinds
- In Proc. 13th Australasian Database Conference (ADC ’02), IEEE CS, 2001
- Cited by 3 (0 self)
Index structures are often used to support search operations in large databases. Many advanced database application domains such as spatial databases, multimedia databases, temporal databases, and object-oriented databases, call for index structures that are specially designed and tailored for the domains. Interestingly, in each of these domains, we find methods that are based on one distinct structure: the B-tree. Invented some thirty years ago, the B-tree has been challenged repeatedly, but has retained its competitiveness. In this paper, we first give a quick review of B-trees. We then present their adaptations to various domains. For each domain, we present representative B-tree-based structures and their search operations. We conclude that the B-tree is truly a ubiquitous structure that has stood the test of time with wide acceptance in many domains.
Average-Case Analysis of Approximate Trie Search
- Cited by 1 (0 self)
For the exact search of a pattern of length m in a database of n strings the trie data structure allows an optimal lookup time of O(m). If errors are allowed between the pattern and the database strings, no such structure with reasonable size is known. Using a trie some work can be saved and running times superior to the comparison with every string in the database can be achieved. We investigate a comparison-based model where "errors" and "matches" are defined between pairs of characters. When comparing two characters, let p be the probability of an error. Between any two strings we bound the number of errors by D, which we consider a function of n. We study the average-case complexity of the number of comparisons for searching in a trie in dependence of the parameters p and D. Our analysis yields the asymptotic behavior for memoryless sources with uniform probabilities. It turns out that there is a jump in the average-case complexity at certain thresholds for p and D. Our results can be applied for any comparison-based error model, for instance, mismatches (Hamming distance), don't cares, or geometric character distances.
1 Introduction
We study the average-case behavior of the simple problem of finding a given pattern in a set of patterns subject to two conditions. The set of patterns is given in advance and may have been preprocessed with linear space, and we are also interested in occurrences where the pattern is found with a given number of errors.
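The quantity analyzed here, the number of character comparisons made while searching a trie with a bounded number of errors, can be observed on a toy example. The sketch below assumes the mismatch (Hamming) model, equal-length strings, and hypothetical helper names; it counts one comparison per trie edge inspected and stops descending once the error bound D is exceeded.

```python
def build_string_trie(strings):
    """Nested-dict trie; the key "$" marks a complete string (assumes no "$" in strings)."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = s
    return root


def search_hamming(trie, pattern, max_errors):
    """Return (matches, comparisons) for strings within max_errors mismatches.

    matches is a sorted list of (string, errors); comparisons counts every
    character comparison between the pattern and a trie edge.
    """
    matches = []
    comparisons = 0

    def walk(node, depth, errors):
        nonlocal comparisons
        if depth == len(pattern):
            if "$" in node:
                matches.append((node["$"], errors))
            return
        for ch, child in node.items():
            if ch == "$":
                continue
            comparisons += 1
            e = errors + (ch != pattern[depth])
            if e <= max_errors:  # prune branches that already exceed the bound
                walk(child, depth + 1, e)

    walk(trie, 0, 0)
    return sorted(matches), comparisons


trie = build_string_trie(["abc", "abd", "xbc", "xyz"])
print(search_hamming(trie, "abc", 0))  # ([('abc', 0)], 5)
print(search_hamming(trie, "abc", 1))  # ([('abc', 0), ('abd', 1), ('xbc', 1)], 8)
```

Raising the bound from 0 to 1 increases the comparison count from 5 to 8 on this input, a small-scale view of how the cost depends on D.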
High-Bandwidth Packet Switching on the Raw General-Purpose Architecture
- 2002
One of the distinct features of modern Internet routers is that most performance-critical tasks, such as the switching of packets, are currently done using Application-Specific Integrated Circuits (ASICs) or custom-designed hardware. The few cases in which off-the-shelf general-purpose processors or specialized network processors are used are route lookup, Quality of Service (QoS), fabric scheduling, and the like, while existing general-purpose architectures have failed to give a useful interface to sufficient bandwidth to support high-bandwidth routing. By using an architecture that is more general-purpose, routers can gain from economies of scale and increased flexibility compared to special-purpose hardware. The work presented in this thesis proposes the use of the Raw general-purpose processor as both a network processor and switch fabric for multigigabit routing. The Raw processor, through its tiled architecture and software-exposed on-chip networking, has enough internal and external bandwidth to deal with multigigabit routing. This thesis has three main goals. First, it proposes a single-chip router design on the Raw general-purpose processor. We demonstrate that a 4-port edge router running on a 250 MHz Raw processor is able to switch 3.3 million packets per second at peak rate, which results in a throughput of 26.9 gigabits per second for 1,024-byte packets. Second, it shows that it is possible to obtain an efficient mapping of a dynamic communications pattern, such as the connections of the switch fabric of a router, to a compile-time static interconnect of the Raw processor tiles, and proposes a Rotating Crossbar design that achieves efficient routing on the Raw static network. Third, it proposes the incorporation of computation into the communication interconnect of the switch fabric of a router.

