Results 1-10 of 22
Compressed representations of permutations, and applications
 SYMPOSIUM ON THEORETICAL ASPECTS OF COMPUTER SCIENCE
Cited by 19 (11 self)
We explore various techniques to compress a permutation $\pi$ over $n$ integers, taking advantage of ordered subsequences in $\pi$, while supporting its application $\pi(i)$ and the application of its inverse $\pi^{-1}(i)$ in small time. Our compression schemes yield several interesting byproducts, in many cases matching, improving or extending the best existing results on applications such as the encoding of a permutation in order to support iterated applications $\pi^k(i)$ of it, of integer functions, and of inverted lists and suffix arrays.
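As a rough illustration of what "ordered subsequences" can mean here, the sketch below splits a permutation into maximal ascending runs and builds an explicit inverse. It is not the encoding from the paper: the function names `ascending_runs` and `inverse` are hypothetical, and the inverse is stored in plain form rather than compressed. The intuition it shows is only that a permutation with few runs has more exploitable order.

```python
from typing import List


def ascending_runs(perm: List[int]) -> List[List[int]]:
    """Split a permutation into maximal ascending runs of consecutive positions."""
    runs: List[List[int]] = []
    for value in perm:
        if runs and runs[-1][-1] < value:
            runs[-1].append(value)   # value extends the current run
        else:
            runs.append([value])     # a descent starts a new run
    return runs


def inverse(perm: List[int]) -> List[int]:
    """Explicit (uncompressed) inverse of a 0-based permutation: inv[perm[i]] == i."""
    inv = [0] * len(perm)
    for i, value in enumerate(perm):
        inv[value] = i
    return inv


perm = [2, 5, 7, 0, 1, 6, 3, 4]
print(ascending_runs(perm))  # three runs for this input
print(inverse(perm))
```

A permutation that is already sorted has a single run, while a reversed one has $n$ runs, so the run count gives a simple measure of how much order is present.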
A.: Compact set representation for information retrieval
 In: SPIRE. Lecture
, 2007
Cited by 17 (1 self)
Abstract. Conjunctive Boolean queries are a fundamental operation in web search engines. These queries can be reduced to the problem of intersecting ordered sets of integers, where each set represents the documents containing one of the query terms. But there is tension between the desire to store the lists effectively, in a compressed form, and the desire to carry out intersection operations efficiently, using non-sequential processing modes. In this paper we evaluate intersection algorithms on compressed sets, comparing them to the best non-sequential array-based intersection algorithms. By adding a simple, low-cost, auxiliary index, we show that compressed storage need not hinder efficient and high-speed intersection operations.
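The idea of pairing compressed storage with a small auxiliary index can be sketched as follows. This is a minimal illustration under stated assumptions, not the structure evaluated in the paper: the class `SampledGapList`, its `step` parameter, and the method `next_geq` are hypothetical names, and the gaps are kept as plain Python integers where a real index would use a byte- or bit-level code.

```python
from bisect import bisect_right
from typing import List, Optional


class SampledGapList:
    """Sorted document ids stored as gaps, plus every `step`-th absolute id as a sample."""

    def __init__(self, doc_ids: List[int], step: int = 4) -> None:
        self.step = step
        # gaps[0] is the first id itself; gaps[i] = doc_ids[i] - doc_ids[i - 1] afterwards.
        self.gaps = [b - a for a, b in zip([0] + doc_ids[:-1], doc_ids)]
        # Auxiliary index: absolute ids at positions 0, step, 2 * step, ...
        self.samples = doc_ids[::step]

    def next_geq(self, target: int) -> Optional[int]:
        """Return the smallest stored id >= target, or None if there is none."""
        if not self.samples:
            return None
        # Jump to the last sample that is <= target, then decode gaps sequentially.
        block = max(bisect_right(self.samples, target) - 1, 0)
        pos = block * self.step
        value = self.samples[block]
        while value < target:
            pos += 1
            if pos >= len(self.gaps):
                return None
            value += self.gaps[pos]
        return value


def intersect(small: List[int], large: SampledGapList) -> List[int]:
    """Keep the ids of `small` that also occur in the gap-encoded `large` list."""
    return [x for x in small if large.next_geq(x) == x]


large = SampledGapList([3, 8, 9, 15, 21, 22, 40, 57, 63], step=3)
print(intersect([8, 10, 22, 63, 70], large))
```

Each probe decodes at most `step - 1` gaps after the jump, so a smaller `step` trades extra space for faster non-sequential access.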
An Experimental Investigation of Set Intersection Algorithms for Text Searching
Cited by 16 (2 self)
Abstract. The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We perform an experimental comparison with the algorithms from the previous studies from Demaine, López-Ortiz and Munro [ALENEX 2001], and from Baeza-Yates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm from Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider the random data set from Baeza-Yates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure the performance both in terms of the number of comparisons and searches performed, and in terms of the CPU time on two different architectures. Our results confirm or improve the results from both previous studies in their respective context (comparison model on real data and CPU measures on random data), and extend them to new contexts. In particular we show that value-based search algorithms perform well in posting lists in terms of the number of comparisons performed.
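Searching a sorted array "in the intersection context" means each search can resume from where the previous one stopped. One well-known way to exploit this is galloping (doubling) search; the sketch below is a generic version of that technique, not one of the specific algorithms compared in the paper, and the names `gallop` and `intersect_sorted` are hypothetical.

```python
from bisect import bisect_left
from typing import List


def gallop(arr: List[int], target: int, lo: int) -> int:
    """Return the first index i >= lo with arr[i] >= target, or len(arr) if none.

    Assumes every element before index `lo` is already known to be < target.
    """
    step = 1
    hi = lo
    # Probe at exponentially growing distances until we overshoot the target.
    while hi < len(arr) and arr[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # The answer now lies in arr[lo : min(hi, len(arr)) + 1]; finish with binary search.
    return bisect_left(arr, target, lo, min(hi, len(arr)))


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted lists by galloping through the longer one."""
    if len(a) > len(b):
        a, b = b, a
    out: List[int] = []
    pos = 0
    for x in a:
        pos = gallop(b, x, pos)
        if pos == len(b):
            break               # the longer list is exhausted
        if b[pos] == x:
            out.append(x)
    return out


print(intersect_sorted([2, 3, 5, 7, 11, 13], list(range(1, 13))))
```

The cost of each search depends on how far the cursor moves rather than on the full length of the longer list, which is why this family of methods suits lists of very different sizes.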
Improving the Performance of List Intersection
Cited by 8 (0 self)
List intersection is a central operation, used extensively for query processing on text and databases. We present list intersection algorithms for an arbitrary number of sorted and unsorted lists tailored to the characteristics of modern hardware architectures. Two new list intersection algorithms are presented for sorted lists. The first algorithm, termed Dynamic Probes, dynamically decides the probing order on the lists, exploiting information from previous probes at runtime. This information is utilized as a cache-resident micro-index. The second algorithm, termed Quantile-based, deduces in advance a good probing order, thus avoiding the overhead of adaptivity, and is based on detecting lists with non-uniform distribution of document identifiers. For unsorted lists, we present a novel hash-based algorithm that avoids the overhead of sorting. A detailed experimental evaluation is presented based on real and synthetic data using existing chip multiprocessor architectures with eight cores, validating the efficiency and efficacy of the proposed algorithms.
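For the unsorted case, the general idea of hashing instead of sorting can be sketched as below. This is a plain textbook version, assuming Python's built-in `set` as the hash table; it is not the hardware-conscious algorithm the abstract describes, and the name `hash_intersect` is hypothetical.

```python
from typing import Sequence, Set


def hash_intersect(lists: Sequence[Sequence[int]]) -> Set[int]:
    """Intersect any number of unsorted lists without sorting them."""
    if not lists:
        return set()
    # Hash the shortest list so the table stays as small as possible.
    ordered = sorted(lists, key=len)
    candidates = set(ordered[0])
    for other in ordered[1:]:
        # Stream the next list and keep only the ids that are still candidates.
        candidates = {x for x in other if x in candidates}
        if not candidates:
            break   # nothing can survive further intersections
    return candidates


print(sorted(hash_intersect([[9, 2, 7, 4], [4, 9, 1], [5, 9, 4, 8, 2]])))
```

Only the list lengths are sorted here, not the lists themselves, so the work per list is a single pass with constant expected time per probe.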
Self-Indexing Natural Language
, 2008
Cited by 6 (4 self)
Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and arrays. Self-indexes represent a string in a space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and detaching searchable content from non-searchable presentation aspects in the text. As a result, we show that the self-index requires space very close to that of the best word-based compressors, and that it obtains better search time than inverted indexes (using the same overall space) when searching for phrases.
Fast Set Intersection in Memory
Cited by 6 (1 self)
Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given $k$ (preprocessed) sets, with $n$ elements in total, we will show how to compute their intersection in expected time $O(n/\sqrt{w} + kr)$, where $r$ is the intersection size and $w$ is the number of bits in a machine word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state-of-the-art techniques for both synthetic and real data sets and workloads.
Word-based Self-Indexes for Natural Language Text
Cited by 5 (3 self)
The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases. This is useful for computing relevance of words or phrases. We adapt self-indexes that succeeded in indexing arbitrary strings within compressed space to deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removing stopwords, etc. as is usual on inverted indexes.
Fast set intersection and two-patterns matching
 Theoretical Computer Science
, 2010
Cited by 5 (0 self)
Abstract. In this paper we present a new problem, the fast set intersection problem, which is to preprocess a collection of sets in order to efficiently report the intersection of any two sets in the collection. In addition we suggest new solutions for the two-dimensional substring indexing problem and the document listing problem for two patterns by reduction to the fast set intersection problem.

1 Introduction and Related Work

The intersection of large sets is a common problem in the context of retrieval algorithms, search engines, evaluation of relational queries and more. Relational databases use indices to decrease query time, but when a query involves two different indices, each one returning a different set of results, we have to intersect
Efficient set intersection for inverted indexing
 ACM Transactions on Information Systems
, 2010
Cited by 5 (0 self)
Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented, and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues to do with access patterns and data representation. Our purpose in this paper is to explore these tradeoffs, by investigating intersection techniques that make use of both uncompressed “integer” representations, as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage, and also faster intersection computations for conjunctive querying than is possible even with uncompressed representations.
Indexes for Highly Repetitive Document Collections
Cited by 4 (3 self)
We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection. We also introduce compressed self-indexes in the comparison. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower.
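To see why run-length compression of the differential list can help on repetitive collections, consider a word that appears in many consecutive versions of a document: its posting list contains long stretches of consecutive identifiers, so the gap sequence contains long runs of 1s. The sketch below is only an illustration of that observation, with hypothetical helper names (`to_gaps`, `run_length`, `decode`) and made-up document identifiers; it does not reproduce the codes used in the paper.

```python
from itertools import groupby
from typing import List, Tuple


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Differential (gap) form of a sorted posting list; the first gap is the first id."""
    return [b - a for a, b in zip([0] + doc_ids[:-1], doc_ids)]


def run_length(gaps: List[int]) -> List[Tuple[int, int]]:
    """Collapse equal consecutive gaps into (gap, repetitions) pairs."""
    return [(gap, len(list(group))) for gap, group in groupby(gaps)]


def decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild the original posting list from its run-length encoded gaps."""
    doc_ids: List[int] = []
    current = 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            doc_ids.append(current)
    return doc_ids


postings = [10, 11, 12, 13, 14, 15, 40, 41, 42, 43]   # illustrative ids only
runs = run_length(to_gaps(postings))
print(runs)
assert decode(runs) == postings
```

Ten postings collapse into four pairs in this example; on a list without such regularity the run-length form would be no shorter than the gaps themselves.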