Results 1  10
of
50
SelfIndexing Inverted Files for Fast Text Retrieval
 ACM Transactions on Information Systems
, 1996
"... Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Here we show that query response time for conjunctive Boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, b ..."
Abstract

Cited by 171 (27 self)
 Add to MetaCart
Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Here we show that query response time for conjunctive Boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for Boolean queries of 510 terms, can reduce processing time to under one fifth of the previous cost. Similarly, ranked queries of 4050 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
On the limits of cacheobliviousness
 IN PROC. 35TH ANNUAL ACM SYMPOSIUM ON THEORY OF COMPUTING
, 2003
"... In this paper, we present lower bounds for permuting and sorting in the cacheoblivious model. We prove that (1) I/O optimal cacheoblivious comparison based sorting is not possible without a tall cache assumption, and (2) there does not exist an I/O optimalcacheoblivious algorithm for permuting, ..."
Abstract

Cited by 42 (6 self)
 Add to MetaCart
(Show Context)
In this paper, we present lower bounds for permuting and sorting in the cacheoblivious model. We prove that (1) I/O optimal cacheoblivious comparison based sorting is not possible without a tall cache assumption, and (2) there does not exist an I/O optimalcacheoblivious algorithm for permuting, not even in the presence of a tall cache assumption.Our results for sorting show the existence of an inherent tradeoff in the cacheoblivious model between the strength of the tall cache assumption and the overhead for the case M >> B, and show that Funnelsort and recursive binary mergesort are optimal algorithms in the sense that they attain this tradeoff.
Adaptive searching in succinctly encoded binary relations and treestructured documents (Extended Abstract)
 THEORETICAL COMPUTER SCIENCE
, 2005
"... This paper deals with succinct representations of data types motivated by applications in posting lists for search engines, in querying XML documents, and in the more general setting (which extends XML) of multilabeled trees, where several labels can be assigned to each node of a tree. To find th ..."
Abstract

Cited by 42 (14 self)
 Add to MetaCart
(Show Context)
This paper deals with succinct representations of data types motivated by applications in posting lists for search engines, in querying XML documents, and in the more general setting (which extends XML) of multilabeled trees, where several labels can be assigned to each node of a tree. To find the set of references corresponding to a set of keywords, one typically intersects the list of references associated with each keyword. We view this instead as having a single list of objects [n] = {1,..., n} (the references), each of which has a subset of the labels [σ] = {1,..., σ} (the keywords) associated with it. We are able to find the objects associated with an arbitrary set of keywords in time O(δk lg lg σ) using a data structure requiring only t(lg σ +o(lg σ)) bits, where δ is the number of steps required by a nondeterministic algorithm to check the answer, k is the number of keywords in the query, σ is the size of the set from which the keywords are chosen, and t is the number of associations between references and keywords. The data structure is succinct in that it differs from the space needed to write down all t occurrences of keywords by only a lower order term. An XML document is, for our purpose, a labeled rooted tree. We deal primarily with “nonrecursive labeled trees”, where no label occurs more than once on any root to leaf path. We find the set of nodes which path from the root include a set of keywords in the same time, O(δk lg lg σ), on a representation of the tree using essentially minimum space, 2n + n(lg σ + o(lg σ)) bits, where n is the number of nodes in the tree. If we permit nodes to have multiple
Faster adaptive set intersections for text searching
 EXPERIMENTAL ALGORITHMS: 5TH INTERNATIONAL WORKSHOP, WEA 2006, CALA GALDANA, MENORCA
, 2006
"... The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we engineer a better algorithm for this task, which improves over those proposed by Demaine, Munro and LópezOrtiz [SODA 2000/ALENEX 2001], by using a va ..."
Abstract

Cited by 34 (4 self)
 Add to MetaCart
(Show Context)
The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we engineer a better algorithm for this task, which improves over those proposed by Demaine, Munro and LópezOrtiz [SODA 2000/ALENEX 2001], by using a variant of interpolation search. More specifically, our contributions are threefold. First, we corroborate and complete the practical study from Demaine et al. on comparison based intersection algorithms. Second, we show that in practice replacing binary search and galloping (onesided binary) search [4] by interpolation search improves the performance of each main intersection algorithms. Third, we introduce and test variants of interpolation search: this results in an even better intersection algorithm.
Engineering a cacheoblivious sorting algorithm
 IN PROC. 6TH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS
, 2004
"... The cacheoblivious model of computation is a twolevel memory model with the assumption that the parameters of the model are unknown to the algorithms. A consequence of this assumption is that an algorithm efficient in the cache oblivious model is automatically efficient in a multilevel memory mod ..."
Abstract

Cited by 30 (1 self)
 Add to MetaCart
The cacheoblivious model of computation is a twolevel memory model with the assumption that the parameters of the model are unknown to the algorithms. A consequence of this assumption is that an algorithm efficient in the cache oblivious model is automatically efficient in a multilevel memory model. Since the introduction of the cacheoblivious model by Frigo et al. in 1999, a number of algorithms and data structures in the model has been proposed and analyzed. However, less attention has been given to whether the nice theoretical proporities of cacheoblivious algorithms carry over into practice. This paper is an algorithmic engineering study of cacheoblivious sorting. We investigate a number of implementation issues and parameters choices for the cacheoblivious sorting algorithm Lazy Funnelsort by empirical methods, and compare the final algorithm with Quicksort, the established standard for comparison based sorting, as well as with recent cacheaware proposals. The main result is a carefully implemented cacheoblivious sorting algorithm, which we compare to the best implementation of Quicksort we can find, and find that it competes very well for input residing in RAM, and outperforms Quicksort for input on disk.
Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences
 In Proceedings of 12th International Conference on String Processing and Information Retrieval (SPIRE
, 2005
"... Abstract. This work presents an experimental comparison of intersection algorithms for sorted sequences, including the recent algorithm of BaezaYates. This algorithm performs on average less comparisons than the total number of elements of both inputs (n and m respectively) when n = αm (α> 1). W ..."
Abstract

Cited by 28 (1 self)
 Add to MetaCart
(Show Context)
Abstract. This work presents an experimental comparison of intersection algorithms for sorted sequences, including the recent algorithm of BaezaYates. This algorithm performs on average less comparisons than the total number of elements of both inputs (n and m respectively) when n = αm (α> 1). We can find applications of this algorithm on query processing in Web search engines, where large intersections, or differences, must be performed fast. In this work we concentrate in studying the behavior of the algorithm in practice, using for the experiments test data that is close to the actual conditions of its applications. We compare the efficiency of the algorithm with other intersection algorithm and we study different optimizations, showing that the algorithm is more efficient than the alternatives in most cases, especially when one of the sequences is much larger than the other. 1
Finding maximal pairs with bounded gap
 Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 1645 of Lecture Notes in Computer Science
, 1999
"... A pair in a string is the occurrence of the same substring twice. A pair is maximal if the two occurrences of the substring cannot be extended to the left and right without making them different. The gap of a pair is the number of characters between the two occurrences of the substring. In this pape ..."
Abstract

Cited by 28 (5 self)
 Add to MetaCart
A pair in a string is the occurrence of the same substring twice. A pair is maximal if the two occurrences of the substring cannot be extended to the left and right without making them different. The gap of a pair is the number of characters between the two occurrences of the substring. In this paper we present methods for finding all maximal pairs under various constraints on the gap. In a string of length n we can find all maximal pairs with gap in an upper and lower bounded interval in time O(n log n + z) where z is the number of reported pairs. If the upper bound is removed the time reduces to O(n+z). Since a tandem repeat is a pair where the gap is zero, our methods can be seen as a generalization of finding tandem repeats. The running time of our methods equals the running time of well known methods for finding tandem repeats.
An Experimental Investigation of Set Intersection Algorithms for Text Searching
, 2009
"... The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection ..."
Abstract

Cited by 26 (2 self)
 Add to MetaCart
The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We perform an experimental comparison with the algorithms from the previous studies from Demaine, LópezOrtiz and Munro [ALENEX 2001], and from BaezaYates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm from Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider both the random data set from BaezaYates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure the performance both in terms of the number of comparisons and searches performed, and in terms of the CPU time on two different architectures. Our results confirm or improve the results from both previous studies in their respective context (comparison model on real data and CPU measures on random data), and extend them to new contexts. In particular we show that valuebased search algorithms perform well in posting lists in terms of the number of comparisons performed.
A.: Compact set representation for information retrieval
 In: SPIRE. Lecture
, 2007
"... Abstract. Conjunctive Boolean queries are a fundamental operation in web search engines. These queries can be reduced to the problem of intersecting ordered sets of integers, where each set represents the documents containing one of the query terms. But there is tension between the desire to store t ..."
Abstract

Cited by 19 (1 self)
 Add to MetaCart
Abstract. Conjunctive Boolean queries are a fundamental operation in web search engines. These queries can be reduced to the problem of intersecting ordered sets of integers, where each set represents the documents containing one of the query terms. But there is tension between the desire to store the lists effectively, in a compressed form, and the desire to carry out intersection operations efficiently, using nonsequential processing modes. In this paper we evaluate intersection algorithms on compressed sets, comparing them to the best nonsequential arraybased intersection algorithms. By adding a simple, lowcost, auxiliary index, we show that compressed storage need not hinder efficient and highspeed intersection operations. 1