Results 1 - 5 of 5
Fast Set Intersection in Memory
Abstract

Cited by 6 (1 self)
Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets with n elements in total, we will show how to compute their intersection in expected time O(n/√w + kr), where r is the intersection size and w is the number of bits in a machine word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state-of-the-art techniques for both synthetic and real data sets and workloads.
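The classical baseline such structures compete with is a sorted-merge intersection of the k preprocessed sets. As an illustrative sketch only (this is not the paper's word-parallel data structure, whose details are not reproduced in the abstract):

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists in O(|a| + |b|) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_k(sets):
    """Intersect k sorted lists pairwise, starting from the smallest,
    so the running intersection shrinks as early as possible."""
    sets = sorted(sets, key=len)
    result = sets[0]
    for s in sets[1:]:
        result = intersect_sorted(result, s)
    return result

print(intersect_k([[1, 3, 5, 7], [3, 4, 5, 9], [0, 3, 5, 8]]))  # [3, 5]
```

The paper's O(n/√w + kr) bound beats this baseline by exploiting word-level parallelism, i.e. comparing many candidate elements per machine word rather than one at a time.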
A New Data Layout For Set Intersection on GPUs
Abstract

Cited by 2 (0 self)
Abstract—Set intersection is the core of a variety of problems, e.g. frequent itemset mining and sparse boolean matrix multiplication. It is well-known that large speed gains can, for some computational problems, be obtained by using a graphics processing unit (GPU) as a massively parallel computing device. However, GPUs require highly regular control flow and memory access patterns, and for this reason previous GPU methods for intersecting sets have used a simple bitmap representation. This representation requires excessive space on sparse data sets. In this paper we present a novel data layout, BATMAP, that is particularly well suited for parallel processing, and is compact even for sparse data. Frequent itemset mining is one of the most important applications of set intersection. As a case study on the potential of BATMAPs we focus on frequent pair mining, which is a core special case of frequent itemset mining. The main finding is that our method is able to achieve speedups over both Apriori and FP-growth when the number of distinct items is large and the density of the problem instance is above 1%. Previous implementations of frequent itemset mining on GPU have not been able to show speedups over the best single-threaded implementations.
Keywords: Set intersection; Frequent itemset mining; Sparse boolean matrix multiplication; Data layout; GPU
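The simple bitmap representation that earlier GPU methods used, and that BATMAP improves on for sparse data, can be sketched as follows. This is an illustrative CPU-side model, not BATMAP itself: each fixed-width word is intersected by an independent bitwise AND, exactly the kind of regular, divergence-free operation a GPU thread can own. Note that the word array covers the whole universe, which is what wastes space when sets are sparse:

```python
WORD = 32  # fixed word width, mirroring a GPU's regular memory layout

def to_words(s, universe_size):
    """Dense bitmap over the full universe, stored as fixed-width words."""
    words = [0] * ((universe_size + WORD - 1) // WORD)
    for x in s:
        words[x // WORD] |= 1 << (x % WORD)
    return words

def intersect_words(a, b):
    """Elementwise AND; each word position is independent work."""
    return [wa & wb for wa, wb in zip(a, b)]

def from_words(words):
    """Decode set elements back out of the word array."""
    return [i * WORD + j
            for i, w in enumerate(words)
            for j in range(WORD)
            if w >> j & 1]

a = to_words({2, 40, 70}, 100)
b = to_words({40, 70, 95}, 100)
print(from_words(intersect_words(a, b)))  # [40, 70]
```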
Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes
, 2008
Abstract
Let Σ be a finite, ordered alphabet, and consider a string x = x1 x2 ... xn ∈ Σ^n. A secondary index for x answers alphabet range queries of the form: given a range [al, ar] ⊆ Σ, return the set I_[al;ar] = {i | xi ∈ [al; ar]}. Secondary indexes are heavily used in relational databases and scientific data analysis. It is well-known that the obvious solution, storing a dictionary for the set ∪i {xi} with a position set associated with each character, does not always give optimal query time. In this paper we give the first theoretically optimal data structure for the secondary indexing problem. In the I/O model, the amount of data read when answering a query is within a constant factor of the minimum space needed to represent the set I_[al;ar], assuming that the size of internal memory is (|Σ| lg n)^δ blocks, for some constant δ > 0. The space usage of the data structure is O(n lg |Σ|) bits in the worst case, and we further show how to bound the size of the data structure in terms of the 0th-order entropy of x. We show how to support updates achieving various time-space tradeoffs. We also consider an approximate version of the basic secondary indexing problem, where a query reports a superset of I_[al;ar] containing each element not in I_[al;ar] with probability at most ε, where ε > 0 is the false positive probability. For this problem the amount of data that needs to be read by the query algorithm is reduced to O(|I_[al;ar]| lg(1/ε)) bits. The main ideas for this work were conceived during Dagstuhl seminar No. 08081 on Data
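The "obvious solution" the abstract refers to, a dictionary mapping each character to its set of positions, can be sketched as follows. This is illustrative only; the paper's I/O-optimal structure is more involved and is not described in the abstract:

```python
from collections import defaultdict

def build_index(x):
    """The obvious secondary index: map each character of the string x
    to the sorted list of positions where it occurs."""
    idx = defaultdict(list)
    for i, c in enumerate(x):
        idx[c].append(i)
    return idx

def range_query(idx, lo, hi):
    """Return sorted {i : lo <= x[i] <= hi} by unioning the position
    lists of every character falling inside the range."""
    out = []
    for c, positions in idx.items():
        if lo <= c <= hi:
            out.extend(positions)
    return sorted(out)

idx = build_index("abracadabra")
print(range_query(idx, "a", "b"))  # [0, 1, 3, 5, 7, 8, 10]
```

The final sort over the merged position lists is one reason this obvious scheme does not always give optimal query time, which is the gap the paper closes.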
Bitlist: New Fulltext Index for Low Space Cost and Efficient Keyword Search
Abstract
Nowadays Web search engines are experiencing significant performance challenges caused by a huge number of Web pages and an increasingly large number of Web users. The key issue in addressing these challenges is to design a compact structure which can index Web documents with low space cost and meanwhile process keyword searches very fast. Unfortunately, current solutions typically separate space optimization from search improvement. As a result, such solutions either save space at the cost of search inefficiency, or allow fast keyword search but with a huge space requirement. In this paper, to address these challenges, we propose a novel structure, bitlist, with both a low space requirement and support for fast keyword search. Specifically, based on a simple yet very efficient encoding scheme, bitlist uses a single number to encode a set of integer document IDs for low space, and adopts fast bitwise operations for very efficient boolean-based keyword search. Our extensive experimental results on real and synthetic data sets verify that bitlist outperforms the recently proposed inverted list compression [23, 22], spending 36.71% less space and achieving 61.91% faster processing, and achieves running time comparable to [8] but with significantly lower space.
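The core idea of encoding a set of integer document IDs as a single number and answering boolean keyword queries with bitwise operations can be illustrated with a simplified sketch. This is not the paper's actual bitlist encoding scheme, only the basic bits-as-postings principle it builds on:

```python
def encode_ids(doc_ids):
    """Encode a set of integer document IDs as bits of one number.
    A simplified stand-in for the paper's bitlist encoding."""
    n = 0
    for d in doc_ids:
        n |= 1 << d
    return n

def decode_ids(n):
    """Recover the sorted document IDs from the encoded number."""
    return [i for i in range(n.bit_length()) if n >> i & 1]

# Posting sets for two hypothetical keywords, each held as one number.
apple = encode_ids([0, 2, 3, 7])
pie = encode_ids([2, 3, 5])

print(decode_ids(apple & pie))  # AND query: docs with both -> [2, 3]
print(decode_ids(apple | pie))  # OR query: docs with either -> [0, 2, 3, 5, 7]
```

A conjunctive or disjunctive query over any number of keywords thus reduces to chained bitwise ANDs or ORs over the encoded numbers, which is what makes the boolean-based search fast.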