Faster adaptive set intersections for text searching
- Experimental Algorithms: 5th International Workshop, WEA 2006, Cala Galdana, Menorca, 2006
Cited by 22 (5 self)
Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we engineer a better algorithm for this task, which improves over those proposed by Demaine, Munro and López-Ortiz [SODA 2000/ALENEX 2001], by using a variant of interpolation search. More specifically, our contributions are threefold. First, we corroborate and complete the practical study from Demaine et al. on comparison-based intersection algorithms. Second, we show that in practice replacing binary search and galloping (one-sided binary) search [4] by interpolation search improves the performance of each of the main intersection algorithms. Third, we introduce and test variants of interpolation search: this results in an even better intersection algorithm.
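As a rough illustration of the two search strategies this abstract contrasts, the following Python sketch intersects two sorted lists of distinct integers using either galloping search or plain interpolation search. It is not the paper's algorithm or any of its interpolation variants; the function names `gallop`, `interpolate` and `intersect` are hypothetical, chosen only for this example.

```python
from bisect import bisect_left


def gallop(arr, target, lo=0):
    """Return the smallest index i >= lo with arr[i] >= target (len(arr) if none)."""
    n = len(arr)
    if lo >= n or arr[lo] >= target:
        return lo
    step, prev = 1, lo  # invariant: arr[prev] < target
    # Double the step until we reach an element >= target or run off the end.
    while lo + step < n and arr[lo + step] < target:
        prev = lo + step
        step *= 2
    hi = min(lo + step, n)
    # The answer lies in (prev, hi]; finish with an ordinary binary search.
    return bisect_left(arr, target, prev + 1, hi)


def interpolate(arr, target, lo=0):
    """Same contract as gallop, but probes where the values suggest target lies.

    Assumes integer keys; this is plain interpolation search, not a tuned variant.
    """
    hi = len(arr) - 1
    while lo <= hi:
        if arr[lo] >= target:
            return lo
        if arr[hi] < target:
            return hi + 1
        # Here arr[lo] < target <= arr[hi], so the denominator is positive.
        mid = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[mid] < target:
            lo = mid + 1
        else:
            # The answer is in (lo, mid]; advancing lo guarantees progress.
            lo, hi = lo + 1, mid
    return lo


def intersect(a, b, search=gallop):
    """Intersect two sorted lists of distinct integers by searching the longer one."""
    if len(a) > len(b):
        a, b = b, a
    out, pos = [], 0
    for x in a:
        pos = search(b, x, pos)  # resume from the previous position
        if pos == len(b):
            break
        if b[pos] == x:
            out.append(x)
    return out
```

Passing `search=interpolate` swaps the search strategy without touching the intersection loop, which mirrors the kind of substitution the abstract describes.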
Adaptive searching in succinctly encoded binary relations and tree-structured documents (Extended Abstract)
- Theoretical Computer Science, 2005
Cited by 21 (9 self)
This paper deals with succinct representations of data types motivated by applications in posting lists for search engines, in querying XML documents, and in the more general setting (which extends XML) of multi-labeled trees, where several labels can be assigned to each node of a tree. To find the set of references corresponding to a set of keywords, one typically intersects the list of references associated with each keyword. We view this instead as having a single list of objects [n] = {1,..., n} (the references), each of which has a subset of the labels [σ] = {1,..., σ} (the keywords) associated with it. We are able to find the objects associated with an arbitrary set of keywords in time O(δk lg lg σ) using a data structure requiring only t(lg σ + o(lg σ)) bits, where δ is the number of steps required by a non-deterministic algorithm to check the answer, k is the number of keywords in the query, σ is the size of the set from which the keywords are chosen, and t is the number of associations between references and keywords. The data structure is succinct in that it differs from the space needed to write down all t occurrences of keywords by only a lower order term. An XML document is, for our purpose, a labeled rooted tree. We deal primarily with “non-recursive labeled trees”, where no label occurs more than once on any root-to-leaf path. We find the set of nodes whose path from the root includes a set of keywords in the same time, O(δk lg lg σ), on a representation of the tree using essentially minimum space, 2n + n(lg σ + o(lg σ)) bits, where n is the number of nodes in the tree. If we permit nodes to have multiple
An Experimental Investigation of Set Intersection Algorithms for Text Searching
Cited by 11 (1 self)
Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We perform an experimental comparison with the algorithms from the previous studies by Demaine, López-Ortiz and Munro [ALENEX 2001], and by Baeza-Yates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm from Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider the random data set from Baeza-Yates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure the performance both in terms of the number of comparisons and searches performed, and in terms of the CPU time on two different architectures. Our results confirm or improve the results from both previous studies in their respective contexts (comparison model on real data and CPU measures on random data), and extend them to new contexts. In particular we show that value-based search algorithms perform well in posting lists in terms of the number of comparisons performed.
Alternation and Redundancy Analysis of the Intersection Problem
Cited by 7 (2 self)
The intersection of sorted arrays problem has applications in search engines such as Google. Previous work proposes and compares deterministic algorithms for this problem, in an adaptive analysis based on the encoding size of a certificate of the result (cost analysis). We define the alternation analysis, based on the non-deterministic complexity of an instance. In this analysis we prove that there is a deterministic algorithm asymptotically performing as well as any randomized algorithm in the comparison model. We define the redundancy analysis, based on a measure of the internal redundancy of the instance. In this analysis we prove that any algorithm optimal in the redundancy analysis is optimal in the alternation analysis, but that there is a randomized algorithm which performs strictly better than any deterministic algorithm in the comparison model. Finally, we describe how those results can be extended beyond the comparison model.
Improving the Performance of List Intersection
Cited by 2 (0 self)
List intersection is a central operation, used extensively for query processing on text and databases. We present list intersection algorithms for an arbitrary number of sorted and unsorted lists tailored to the characteristics of modern hardware architectures. Two new list intersection algorithms are presented for sorted lists. The first algorithm, termed Dynamic Probes, dynamically decides the probing order on the lists, exploiting information from previous probes at runtime. This information is utilized as a cache-resident microindex. The second algorithm, termed Quantile-based, deduces in advance a good probing order, thus avoiding the overhead of adaptivity, and is based on detecting lists with non-uniform distribution of document identifiers. For unsorted lists, we present a novel hash-based algorithm that avoids the overhead of sorting. A detailed experimental evaluation is presented based on real and synthetic data using existing chip multiprocessor architectures with eight cores, validating the efficiency and efficacy of the proposed algorithms.
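To make the idea of intersecting unsorted lists without sorting them concrete, here is a minimal Python sketch of a generic hash-based approach. It is only an assumption about what such a method can look like, not the algorithm proposed in the paper; the name `hash_intersect` is hypothetical, and the sketch assumes each list holds distinct, hashable document identifiers.

```python
def hash_intersect(lists):
    """Intersect unsorted lists of distinct hashable IDs without sorting their contents."""
    if not lists:
        return []
    # Order the lists by length only (their contents stay unsorted),
    # so the hash set is built from the shortest list.
    ordered = sorted(lists, key=len)
    candidates = set(ordered[0])
    for other in ordered[1:]:
        # Probe each element of the next list against the current candidates.
        candidates = {x for x in other if x in candidates}
        if not candidates:
            break  # nothing can survive further intersections
    # Report survivors in the order they appear in the shortest list.
    return [x for x in ordered[0] if x in candidates]
```

Building the set from the shortest list keeps the hash table small, and the candidate set can only shrink as further lists are probed.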
Median Selection Requires (2+ε)n Comparisons
- In Proceedings of the 37th Annual Symposium on Foundations of Computer Science, 1999
Cited by 1 (0 self)
Improving a long-standing result of Bent and John, and extending a recent result of Dor, Håstad, Ulfberg and Zwick, we obtain a (2+ε)n lower bound (for some fixed ε > 0) on the number of comparisons required, in the worst case, for selecting the median of n elements.
Average case analysis of the merging algorithm of Hwang and Lin
Cited by 1 (0 self)
We derive an asymptotic equivalent to the average running time of the merging algorithm of Hwang and Lin applied on two linearly ordered lists of numbers a_1 < a_2 < ... < a_m and b_1 < b_2 < ... < b_n when m and n tend to infinity in such a way that the ratio ρ = m/n is constant. We show that the distribution of the running time is concentrated around its expectation except when ρ is a power of 2. When ρ is a power of 2, we obtain an asymptotic equivalent for the expectation of the running time. Key words: merging, average case analysis. 1 Introduction In the merging problem we are given two linearly ordered lists of numbers a_1 < a_2 < ... < a_m and b_1 < b_2 < ... < b_n and the task consists of pooling these two lists into a third ordered list c_1 < c_2 < ... < c_{n+m}. We assume that the n + m elements are distinct and n m. The merging is performed by pairwise comparisons between the elements in the two lists. The measure of complexity of a merging algorithm is the number of comparison...
On the Comparison Cost of Partial Orders
- 1992
Cited by 1 (0 self)
A great deal of effort has been directed towards determining the minimum number of binary comparisons sufficient to produce various partial orders given some partial order. For example, the sorting problem considers the minimum number of comparisons sufficient to construct a total order starting from n elements. The merging problem considers the minimum number of comparisons sufficient to construct a total order from two total orders. The searching problem can be seen as a special case of the merging problem in which one of the total orders is a singleton. The selection problem considers the minimum number of comparisons sufficient to select the i-th largest of n elements. Little, however, is known about the minimum number of comparisons sufficient to produce an arbitrary partial order. In this paper we briefly survey the known results on this problem and we present some first results on partial orders which can be produced using either restricted types of comparisons or a limited n...
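The remark that searching is merging with a singleton can be illustrated with a short Python sketch: merging a one-element total order into a sorted list amounts to a binary search for the insertion point. This is an illustration only, not taken from the paper; the name `merge_singleton` is hypothetical, and the comparison counter is added to expose the cost measure the abstract discusses.

```python
def merge_singleton(total_order, x):
    """Merge the singleton order [x] into a sorted list, counting binary comparisons."""
    lo, hi, comparisons = 0, len(total_order), 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1  # one binary comparison between x and an element
        if total_order[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    # lo is the insertion point, so the merged list is again a total order.
    merged = total_order[:lo] + [x] + total_order[lo:]
    return merged, comparisons
```

For a list of n elements this binary search uses at most ceil(lg(n + 1)) comparisons, which matches the information-theoretic bound for locating one of n + 1 possible insertion points.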
Faster Set Intersection Algorithms for Text Searching
Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We perform an experimental comparison with the algorithms from the previous studies by Demaine, López-Ortiz and Munro [ALENEX 2001], and by Baeza-Yates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm from Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider the random data set from Baeza-Yates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure the performance both in terms of the number of comparisons and searches performed, and in terms of the CPU time on two different architectures. Our results confirm or improve the results from both previous studies in their respective contexts (comparison model on real data and CPU measures on random data), and extend them to new contexts. In particular we show that value-based search algorithms perform well in posting lists in terms of the number of comparisons performed.
Intersection in Integer Inverted Indices
Inverted index data structures are the key to fast search engines. The predominant operation on inverted indices asks for intersecting two sorted lists of document IDs which might have vastly varying lengths. We compare previous theoretical approaches, methods used in practice, and one new algorithm which exploits the fact that the intersection uses small integer keys. We also take different data compression techniques into account. The new algorithm is very fast, simple, has good space efficiency, and is the only algorithm that performs well over the entire spectrum of relative list length ratios.

