Efficient Content Location Using Interest-Based Locality in Peer-to-Peer Systems
, 2003
Cited by 226 (2 self)
Locating content in decentralized peer-to-peer systems is a challenging problem. Gnutella, a popular file-sharing application, relies on flooding queries to all peers. Although flooding is simple and robust, it is not scalable. In this paper, we explore how to retain the simplicity of Gnutella, while addressing its inherent weakness: scalability. We propose a content location solution in which peers loosely organize themselves into an interest-based structure on top of the existing Gnutella network. Our approach exploits a simple, yet powerful principle called interest-based locality, which posits that if a peer has a particular piece of content that one is interested in, it is very likely that it will have other items that one is interested in as well. When using our algorithm, called interest-based shortcuts, a significant amount of flooding can be avoided, making Gnutella a more competitive solution. In addition, shortcuts are modular and can be used to improve the performance of other content location mechanisms including distributed hash table schemes.
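The shortcut idea in this abstract can be illustrated with a toy sketch. It is not the authors' algorithm: the `Peer` class, the most-recently-useful ordering, the `max_shortcuts` limit, and the linear scan that stands in for a Gnutella flood are all assumptions made for illustration.

```python
class Peer:
    """Toy peer that remembers which peers answered its earlier queries."""

    def __init__(self, name, files):
        self.name = name
        self.files = set(files)
        self.shortcuts = []  # hypothetical policy: most recently useful first

    def lookup(self, item, all_peers, max_shortcuts=10):
        # Step 1: ask shortcut peers first; a hit avoids flooding entirely.
        for peer in self.shortcuts:
            if item in peer.files:
                self._remember(peer, max_shortcuts)
                return peer, "shortcut"
        # Step 2: fall back to flooding (simulated here by scanning every peer).
        for peer in all_peers:
            if peer is not self and item in peer.files:
                self._remember(peer, max_shortcuts)
                return peer, "flood"
        return None, "miss"

    def _remember(self, peer, max_shortcuts):
        # Move the helpful peer to the front and cap the list length.
        if peer in self.shortcuts:
            self.shortcuts.remove(peer)
        self.shortcuts.insert(0, peer)
        del self.shortcuts[max_shortcuts:]


a = Peer("a", [])
b = Peer("b", ["song-x", "song-y"])
c = Peer("c", ["song-z"])
peers = [a, b, c]

first = a.lookup("song-x", peers)   # found by flooding, b becomes a shortcut
second = a.lookup("song-y", peers)  # found through the shortcut to b
```

Under interest-based locality, the second request is likely to be served by the peer that served the first, so the flood in step 2 is skipped.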
Inverted files for text search engines
 ACM Computing Surveys
, 2006
Cited by 192 (5 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
Compact Representations of Separable Graphs
 In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms
, 2003
Cited by 36 (11 self)
We consider the problem of representing graphs compactly while supporting queries efficiently. In particular we describe a data structure for representing n-vertex unlabeled graphs that satisfy an O(n^c)-separator theorem, c < 1. The structure uses O(n) bits, and supports adjacency and degree queries in constant time, and neighbor listing in constant time per neighbor. This generalizes previous results for graphs with constant genus, such as planar graphs.
On Compressing Social Networks
Cited by 35 (1 self)
Motivated by structural properties of the Web graph that support efficient data structures for in-memory adjacency queries, we study the extent to which a large network can be compressed. Boldi and Vigna (WWW 2004) showed that Web graphs can be compressed down to three bits of storage per edge; we study the compressibility of social networks, where again adjacency queries are a fundamental primitive. To this end, we propose simple combinatorial formulations that encapsulate efficient compressibility of graphs. We show that some of the problems are NP-hard yet admit effective heuristics, some of which can exploit properties of social networks such as link reciprocity. Our extensive experiments show that social networks and the Web graph exhibit vastly different compressibility characteristics.
Assigning Identifiers to Documents to Enhance the Clustering Property of Fulltext Indexes
 In Proc. of the 27th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval
, 2004
Cited by 24 (3 self)
Web Search Engines provide a large-scale text document retrieval service by processing huge Inverted File indexes. Inverted File indexes allow fast query resolution and good memory utilization since their d-gaps representation can be effectively and efficiently compressed by using variable-length encoding methods. This paper proposes and evaluates some algorithms that aim to find an assignment of the document identifiers which minimizes the average value of the d-gaps, thus enhancing the effectiveness of traditional compression methods. We ran several tests over the Google contest collection in order to validate the techniques proposed. The experiments demonstrated the scalability and effectiveness of our algorithms. Using the proposed algorithms, we were able to appreciably improve (up to 20.81%) the compression ratios of several encoding schemes.
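To make the d-gap idea concrete, here is a minimal sketch of one common variable-length scheme, variable-byte coding. The abstract does not say which encoders were evaluated, so the choice of variable-byte coding, the function names, and the sample posting lists are illustrative assumptions.

```python
def to_dgaps(doc_ids):
    """Turn a strictly increasing list of document IDs into d-gaps."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_dgaps(gaps):
    """Invert to_dgaps by taking running sums."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def vbyte_encode(numbers):
    """Variable-byte code: 7 payload bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()          # most significant group first
        chunk[-1] |= 0x80        # terminator flag on the final byte
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n = [], 0
    for b in data:
        if b & 0x80:
            numbers.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return numbers


clustered = [1000, 1001, 1003, 1004]       # IDs close together: small gaps
scattered = [1000, 20000, 400000, 9000000]  # IDs far apart: large gaps

clustered_size = len(vbyte_encode(to_dgaps(clustered)))
scattered_size = len(vbyte_encode(to_dgaps(scattered)))
```

With these toy lists the clustered postings take fewer bytes than the scattered ones, which is why an identifier assignment that shrinks d-gaps helps the encoder.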
Sorting out the document identifier assignment problem
 In Proc. of 29th European Conference on IR Research (ECIR)
, 2007
Cited by 22 (2 self)
The compression of Inverted File indexes in Web Search Engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves the overall retrieval performance, since it allows better exploitation of the memory hierarchy. In this paper we show empirically that, for collections of Web documents, we can enhance the performance of compression algorithms by simply assigning identifiers to documents according to the lexicographical ordering of the URLs. We validate this assumption by comparing several assignment techniques and several compression algorithms on a fairly large collection of about six million documents. The results are very encouraging, since we can improve the compression ratio by up to 40% using an algorithm that takes about ninety seconds to finish using only 100 MB of main memory.
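A small sketch of URL-ordered identifier assignment follows. The toy URLs, the posting lists, and the cost proxy (the bit length of each d-gap) are assumptions chosen for illustration; they are not the paper's data or its evaluation metric.

```python
def assign_ids_by_url(urls):
    """Give each URL an ID equal to its rank in lexicographic order."""
    return {url: i for i, url in enumerate(sorted(urls), start=1)}


def assign_ids_by_arrival(urls):
    """Baseline: give each URL an ID in the order it was crawled."""
    return {url: i for i, url in enumerate(urls, start=1)}


def gap_bits(postings, ids):
    """Crude cost proxy: sum of bit lengths of all d-gaps in the index."""
    total = 0
    for docs in postings.values():
        prev = 0
        for doc_id in sorted(ids[u] for u in docs):
            total += (doc_id - prev).bit_length()
            prev = doc_id
    return total


# Hypothetical crawl order that interleaves two hosts.
crawl_order = ["b.org/x", "a.com/1", "b.org/y", "a.com/2"]
# Hypothetical index in which each term appears on one host only.
postings = {
    "shop": ["a.com/1", "a.com/2"],
    "wiki": ["b.org/x", "b.org/y"],
}

cost_arrival = gap_bits(postings, assign_ids_by_arrival(crawl_order))
cost_url = gap_bits(postings, assign_ids_by_url(crawl_order))
```

Sorting by URL places pages from the same host next to each other, so terms concentrated on one host get consecutive identifiers and smaller gaps in this toy case.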
Compact Representations Of Ordered Sets
, 2004
Cited by 21 (3 self)
We consider the problem of efficiently representing sets S of size n from an ordered universe U = {0, ..., m-1}. Given any ordered dictionary structure (or comparison-based ordered set structure) D that uses O(n) pointers, we demonstrate a simple blocking technique that produces an ordered set structure supporting the same operations in the same time bounds but with O(n log((m+n)/n)) bits. This is within a constant factor of the information-theoretic lower bound. We assume the unit cost RAM model with word size Ω(log |U|) and a table of size O(m^α log^2 m) bits, for some constant α > 0. The time bound for our operations contains a factor of 1/α. We present
Alphabet Partitioning for Compressed Rank/Select and Applications
Cited by 18 (13 self)
We present a data structure that stores a string s[1..n] over the alphabet [1..σ] in nH0(s) + o(n)(H0(s)+1) bits, where H0(s) is the zero-order entropy of s. This data structure supports the queries access and rank in time O(lg lg σ), and the select query in constant time. This result improves on previously known data structures using nH0(s) + o(n lg σ) bits, where on highly compressible instances the redundancy o(n lg σ) ceases to be negligible compared to the nH0(s) bits that encode the data. The technique is based on combining previous results through an ingenious partitioning of the alphabet, and is practical enough to be implementable. It applies not only to strings, but also to several other compact data structures. For example, we achieve (i) faster search times and lower redundancy for the smallest existing full-text self-index; (ii) compressed permutations π with times for π() and π^{-1}() improved to log-logarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets.
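For readers unfamiliar with the three queries and with H0, the following naive reference definitions may help. They run in linear time on an uncompressed string and are only a specification of what the queries return; they are not the compressed structure the abstract describes. The function names and the 1-based indexing convention are assumptions.

```python
import math
from collections import Counter


def access(s, i):
    """Return the symbol at position i (1-based)."""
    return s[i - 1]


def rank(s, c, i):
    """Return the number of occurrences of c in s[1..i]."""
    return s[:i].count(c)


def select(s, c, j):
    """Return the 1-based position of the j-th occurrence of c, or None."""
    seen = 0
    for pos, ch in enumerate(s, start=1):
        if ch == c:
            seen += 1
            if seen == j:
                return pos
    return None


def zero_order_entropy(s):
    """H0(s) in bits per symbol: sum over symbols of (n_c/n) * log2(n/n_c)."""
    n = len(s)
    return sum((k / n) * math.log2(n / k) for k in Counter(s).values())


text = "abracadabra"
```

For example, `rank(text, "a", 5)` counts the occurrences of "a" in "abrac", and `select(text, "a", 3)` finds where the third "a" sits.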
Document identifier reassignment through dimensionality reduction
 In ECIR
, 2005
Cited by 17 (4 self)
Most modern retrieval systems use compressed Inverted Files (IF) for indexing. Recent works demonstrated that it is possible to reduce IF sizes by reassigning the document identifiers of the original collection, as it lowers the average distance between documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total number of bits per document pointer. However, the approximations developed so far require large amounts of time or use an uncontrolled amount of memory. This paper presents an efficient solution to the reassignment problem that consists of reducing the dimensionality of the input data using an SVD transformation. We tested this approximation with the Greedy-NN TSP algorithm and a more efficient variant based on dividing the original problem into subproblems. We present experimental tests and performance results on two TREC collections, obtaining good compression ratios with low running times. We also show experimental results on the trade-off between dimensionality reduction and compression, and on time performance.
An Experimental Analysis of a Compact Graph Representation
 In ALENEX04
, 2004
Cited by 15 (6 self)
In previous work we described a method that uses small separators to compactly represent graphs, and presented preliminary experimental results. In this paper we extend the experimental results in several ways, including extensions for dynamic insertion and deletion of edges, a comparison of a variety of coding schemes, and an implementation of two applications using the representation.