Results 1  10
of
31
Type less, find more: fast autocompletion search with a succinct index
 IN 29TH CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR’06
, 2006
"... We consider the following fulltext search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of th ..."
Abstract

Cited by 41 (4 self)
 Add to MetaCart
We consider the following fulltext search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. Known indexing data structures that apply to this problem either incur large processing times for a substantial class of queries, or they use a lot of space. We present a new indexing data structure that uses no more space than a stateoftheart compressed inverted index, but that yields an order of magnitude faster query processing times. Even on the large TREC Terabyte collection, which comprises over 25 million documents, we achieve, on a single machine and with the index on disk, average response times of one tenth of a second. We have built a fullfledged, interactive search engine that realizes the proposed autocompletion feature combined with support for proximity search, semistructured (XML) text, subword and phrase completion, and semantic tags.
Faster adaptive set intersections for text searching
 Experimental Algorithms: 5th International Workshop, WEA 2006, Cala Galdana, Menorca
, 2006
"... Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we engineer a better algorithm for this task, which improves over those proposed by Demaine, Munro and LópezOrtiz [SODA 2000/ALENEX 2001], by us ..."
Abstract

Cited by 28 (4 self)
 Add to MetaCart
Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we engineer a better algorithm for this task, which improves over those proposed by Demaine, Munro and LópezOrtiz [SODA 2000/ALENEX 2001], by using a variant of interpolation search. More specifically, our contributions are threefold. First, we corroborate and complete the practical study from Demaine et al. on comparison based intersection algorithms. Second, we show that in practice replacing binary search and galloping (onesided binary) search [4] by interpolation search improves the performance of each main intersection algorithms. Third, we introduce and test variants of interpolation search: this results in an even better intersection algorithm.
Adaptive searching in succinctly encoded binary relations and treestructured documents (Extended Abstract)
 THEORETICAL COMPUTER SCIENCE
, 2005
"... This paper deals with succinct representations of data types motivated by applications in posting lists for search engines, in querying XML documents, and in the more general setting (which extends XML) of multilabeled trees, where several labels can be assigned to each node of a tree. To find th ..."
Abstract

Cited by 26 (9 self)
 Add to MetaCart
This paper deals with succinct representations of data types motivated by applications in posting lists for search engines, in querying XML documents, and in the more general setting (which extends XML) of multilabeled trees, where several labels can be assigned to each node of a tree. To find the set of references corresponding to a set of keywords, one typically intersects the list of references associated with each keyword. We view this instead as having a single list of objects [n] = {1,..., n} (the references), each of which has a subset of the labels [σ] = {1,..., σ} (the keywords) associated with it. We are able to find the objects associated with an arbitrary set of keywords in time O(δk lg lg σ) using a data structure requiring only t(lg σ +o(lg σ)) bits, where δ is the number of steps required by a nondeterministic algorithm to check the answer, k is the number of keywords in the query, σ is the size of the set from which the keywords are chosen, and t is the number of associations between references and keywords. The data structure is succinct in that it differs from the space needed to write down all t occurrences of keywords by only a lower order term. An XML document is, for our purpose, a labeled rooted tree. We deal primarily with “nonrecursive labeled trees”, where no label occurs more than once on any root to leaf path. We find the set of nodes which path from the root include a set of keywords in the same time, O(δk lg lg σ), on a representation of the tree using essentially minimum space, 2n + n(lg σ + o(lg σ)) bits, where n is the number of nodes in the tree. If we permit nodes to have multiple
Experimental Analysis of a Fast Intersection Algorithm for Sorted Sequences
 In Proceedings of 12th International Conference on String Processing and Information Retrieval (SPIRE
, 2005
"... Abstract. This work presents an experimental comparison of intersection algorithms for sorted sequences, including the recent algorithm of BaezaYates. This algorithm performs on average less comparisons than the total number of elements of both inputs (n and m respectively) when n = αm (α> 1). We c ..."
Abstract

Cited by 21 (1 self)
 Add to MetaCart
Abstract. This work presents an experimental comparison of intersection algorithms for sorted sequences, including the recent algorithm of BaezaYates. This algorithm performs on average less comparisons than the total number of elements of both inputs (n and m respectively) when n = αm (α> 1). We can find applications of this algorithm on query processing in Web search engines, where large intersections, or differences, must be performed fast. In this work we concentrate in studying the behavior of the algorithm in practice, using for the experiments test data that is close to the actual conditions of its applications. We compare the efficiency of the algorithm with other intersection algorithm and we study different optimizations, showing that the algorithm is more efficient than the alternatives in most cases, especially when one of the sequences is much larger than the other. 1
Compressed representations of permutations, and applications
 SYMPOSIUM ON THEORETICAL ASPECTS OF COMPUTER SCIENCE
"... We explore various techniques to compress a permutation π over n integers, taking advantage of ordered subsequences in π, while supporting its application π(i) and the application of its inverse π −1 (i) in small time. Our compression schemes yield several interesting byproducts, in many cases mat ..."
Abstract

Cited by 19 (11 self)
 Add to MetaCart
We explore various techniques to compress a permutation π over n integers, taking advantage of ordered subsequences in π, while supporting its application π(i) and the application of its inverse π −1 (i) in small time. Our compression schemes yield several interesting byproducts, in many cases matching, improving or extending the best existing results on applications such as the encoding of a permutation in order to support iterated applications π k (i) of it, of integer functions, and of inverted lists and suffix arrays.
A.: Compact set representation for information retrieval
 In: SPIRE. Lecture
, 2007
"... Abstract. Conjunctive Boolean queries are a fundamental operation in web search engines. These queries can be reduced to the problem of intersecting ordered sets of integers, where each set represents the documents containing one of the query terms. But there is tension between the desire to store t ..."
Abstract

Cited by 17 (1 self)
 Add to MetaCart
Abstract. Conjunctive Boolean queries are a fundamental operation in web search engines. These queries can be reduced to the problem of intersecting ordered sets of integers, where each set represents the documents containing one of the query terms. But there is tension between the desire to store the lists effectively, in a compressed form, and the desire to carry out intersection operations efficiently, using nonsequential processing modes. In this paper we evaluate intersection algorithms on compressed sets, comparing them to the best nonsequential arraybased intersection algorithms. By adding a simple, lowcost, auxiliary index, we show that compressed storage need not hinder efficient and highspeed intersection operations. 1
An Experimental Investigation of Set Intersection Algorithms for Text Searching ⋆
"... Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the inte ..."
Abstract

Cited by 16 (2 self)
 Add to MetaCart
Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We perform an experimental comparison with the algorithms from the previous studies from Demaine, LópezOrtiz and Munro [ALENEX 2001], and from BaezaYates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm from Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider both the random data set from BaezaYates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure the performance both in terms of the number of comparisons and searches performed, and in terms of the CPU time on two different architectures. Our results confirm or improve the results from both previous studies in their respective context (comparison model on real data and CPU measures on random data), and extend them to new contexts. In particular we show that valuebased search algorithms perform well in posting lists in terms of the number of comparisons performed. 1
Alternation and Redundancy Analysis of the Intersection Problem
"... The intersection of sorted arrays problem has applications in search engines such as Google. Previous work propose and compare deterministic algorithms for this problem, in an adaptive analysis based on the encoding size of a certificate of the result (cost analysis). We define the alternation analy ..."
Abstract

Cited by 10 (3 self)
 Add to MetaCart
The intersection of sorted arrays problem has applications in search engines such as Google. Previous work propose and compare deterministic algorithms for this problem, in an adaptive analysis based on the encoding size of a certificate of the result (cost analysis). We define the alternation analysis, based on the nondeterministic complexity of an instance. In this analysis we prove that there is a deterministic algorithm asymptotically performing as well as any randomized algorithm in the comparison model. We define the redundancy analysis, based on a measure of the internal redundancy of the instance. In this analysis we prove that any algorithm optimal in the redundancy analysis is optimal in the alternation analysis, but that there is a randomized algorithm which performs strictly better than any deterministic algorithm in the comparison model. Finally, we describe how those results can be extended beyond the comparison model.
Improving the Performance of List Intersection
"... List intersection is a central operation, utilized excessively for query processing on text and databases. We present list intersection algorithms for an arbitrary number of sorted and unsorted lists tailored to the characteristics of modern hardware architectures. Two new list intersection algorith ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
List intersection is a central operation, utilized excessively for query processing on text and databases. We present list intersection algorithms for an arbitrary number of sorted and unsorted lists tailored to the characteristics of modern hardware architectures. Two new list intersection algorithms are presented for sorted lists. The first algorithm, termed Dynamic Probes, dynamically decides the probing order on the lists exploiting information from previous probes at runtime. This information is utilized as a cacheresident microindex. The second algorithm, termed Quantilebased, deduces in advance a good probing order, thus avoiding the overhead of adaptivity and is based on detecting lists with nonuniform distribution of document identifiers. For unsorted lists, we present a novel hashbased algorithm that avoids the overhead of sorting. A detailed experimental evaluation is presented based on real and synthetic data using existing chip multiprocessor architectures with eight cores, validating the efficiency and efficacy of the proposed algorithms. 1.
DualSorted Inverted Lists
"... Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated to a ter ..."
Abstract

Cited by 6 (4 self)
 Add to MetaCart
Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated to a term, and different supplementary data, fit widely different IR tasks. Index designers have to choose the right order for one such task, rendering the index difficult to use for others. In this paper we introduce a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index. We show in particular that the technique allows combining an ordering by decreasing term frequency (useful for ranked document retrieval) with an ordering by increasing document identifier (useful for phrase and Boolean queries). We show that we can support not only the primitives required by the different search paradigms (e.g., in order to implement any intersection algorithm on top of our data structure), but also that the data structure offers novel ways of carrying out many operations of interest, including spacefree treatment of stemming and hierarchical documents.