Results 1 - 10 of 14
Sorting out the document identifier assignment problem
- In Proc. of 29th European Conference on IR Research (ECIR), 2007
Cited by 18 (1 self)
The compression of Inverted File indexes in Web Search Engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves overall retrieval performance, since it allows better exploitation of the memory hierarchy. In this paper we empirically show that, for collections of Web documents, the performance of compression algorithms can be enhanced simply by assigning identifiers to documents according to the lexicographical ordering of their URLs. We validate this assumption by comparing several assignment techniques and several compression algorithms on a fairly large collection of about six million documents. The results are very encouraging: the compression ratio improves by up to 40% using an algorithm that takes about ninety seconds to finish and uses only 100 MB of main memory.
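The effect described in this abstract can be illustrated with a minimal sketch. It assumes a toy four-page collection and uses Elias gamma code lengths as the size measure. The helper names (`assign_ids_by_url`, `d_gaps`, `gamma_bits`, `index_bits`) are hypothetical and do not come from the paper, whose own compression algorithms are not reproduced here.

```python
from typing import Dict, List


def assign_ids_by_url(urls: List[str]) -> Dict[str, int]:
    """Assign consecutive document identifiers in lexicographic URL order."""
    return {url: doc_id for doc_id, url in enumerate(sorted(urls))}


def d_gaps(postings: List[int]) -> List[int]:
    """Turn a sorted posting list into gaps (first gap is doc_id + 1, so all gaps >= 1)."""
    gaps, prev = [], -1
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def gamma_bits(gap: int) -> int:
    """Length in bits of the Elias gamma code for a positive integer."""
    return 2 * (gap.bit_length() - 1) + 1


def index_bits(term_docs: Dict[str, List[str]], ids: Dict[str, int]) -> int:
    """Total gamma-coded size of all posting lists under a given identifier assignment."""
    total = 0
    for urls in term_docs.values():
        postings = sorted(ids[u] for u in urls)
        total += sum(gamma_bits(g) for g in d_gaps(postings))
    return total


# Toy data (an assumption): pages on the same host share a term.
urls = ["b.org/1", "a.com/2", "b.org/2", "a.com/1"]
term_docs = {"x": ["a.com/1", "a.com/2"], "y": ["b.org/1", "b.org/2"]}

url_sorted = assign_ids_by_url(urls)
interleaved = {"a.com/1": 0, "b.org/1": 1, "a.com/2": 2, "b.org/2": 3}

print(index_bits(term_docs, url_sorted))   # 6 bits
print(index_bits(term_docs, interleaved))  # 10 bits
```

In this toy case the URL-sorted assignment places same-host pages next to each other, so the gaps are smaller and the coded index is shorter. Real collections will behave differently in detail.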
MG4J at TREC 2005
- In the Fourteenth Text Retrieval Conference (TREC 2005) Proceedings, 2005
Cited by 8 (2 self)
MG4J participated in two tracks of TREC 2005: the ad hoc task and the efficiency task of the Terabyte Track (find all the relevant documents with high precision from 25.2 million pages from the .gov domain). It was the first time the MG4J group participated in TREC, and we concentrated our efforts on the ad hoc task, using a combination of techniques based on a new multi-index minimal-interval semantics and PageRank.
Permuting Web Graphs
Cited by 8 (1 self)
Since the first investigations on web graph compression, it has been clear that the ordering of the nodes of the graph has a fundamental influence on the compression rate (usually expressed as the number of bits per link). The authors of the LINK database [1], for instance, investigated three different approaches: an extrinsic ordering (URL ordering) and two intrinsic (or coordinate-free) orderings based on the rows of the adjacency matrix (lexicographic and Gray code); they concluded that URL ordering has many advantages in spite of a small penalty in compression. In this paper we approach this issue in a more systematic way, testing some old orderings and proposing some new ones. Our experiments are made in the WebGraph framework [2], and show that the compression technique and the structure of the graph can produce significantly different results. In particular, we show that for the transpose web graph URL ordering is significantly less effective, and that some new orderings combining host information and Gray/lexicographic orderings outperform all previous methods. Notably, in some large transposed graphs they yield the quite incredible compression rate of 1 bit per link.
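The two intrinsic orderings mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the adjacency matrix is a small dense list of 0/1 rows, and the function names (`lexicographic_order`, `gray_order`, `gray_rank_key`) are hypothetical. The Gray ordering uses the standard prefix-XOR transform, which maps a reflected Gray codeword to its rank; the paper's actual implementation in WebGraph may differ.

```python
from typing import List, Sequence, Tuple


def lexicographic_order(matrix: Sequence[Sequence[int]]) -> List[int]:
    """Node permutation that sorts adjacency rows lexicographically."""
    return sorted(range(len(matrix)), key=lambda i: tuple(matrix[i]))


def gray_rank_key(row: Sequence[int]) -> Tuple[int, ...]:
    """Prefix-XOR of the row: the binary rank of the row read as a Gray codeword."""
    key, parity = [], 0
    for bit in row:
        parity ^= bit
        key.append(parity)
    return tuple(key)


def gray_order(matrix: Sequence[Sequence[int]]) -> List[int]:
    """Node permutation that sorts adjacency rows in reflected Gray code order."""
    return sorted(range(len(matrix)), key=lambda i: gray_rank_key(matrix[i]))


# Toy adjacency rows (an assumption), one row per node.
rows = [(1, 0), (0, 0), (1, 1), (0, 1)]
print([rows[i] for i in lexicographic_order(rows)])  # (0,0) (0,1) (1,0) (1,1)
print([rows[i] for i in gray_order(rows)])           # (0,0) (0,1) (1,1) (1,0)
```

In the Gray ordering each row differs from its predecessor in a single position in this toy example, which is the property that makes consecutive successor lists similar.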
Scalable Techniques for Document Identifier Assignment in Inverted Indexes
- WWW 2010, 2010
Cited by 6 (3 self)
Web search engines are based on a full-text data structure called an inverted index. The size of the inverted index structures is a major performance bottleneck during query processing, and a large amount of research has focused on fast and effective techniques for compressing this structure. Several authors have recently proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant improvements in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on Travelling Salesman or graph partitioning problems that achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Travelling Salesman computation on a reduced sparse graph obtained using Locality Sensitive Hashing, which achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
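The general shape of such a framework can be sketched as below. This is a rough illustration, not the paper's algorithm: it assumes MinHash with banding as the hashing scheme, Jaccard similarity as the edge weight, and a greedy nearest-neighbour walk standing in for the Travelling Salesman computation. All function names and parameters are hypothetical, and every document is assumed to have at least one term.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def minhash(terms: Set[str], num_hashes: int) -> Tuple[int, ...]:
    """MinHash signature: for each seeded hash, the minimum value over the terms."""
    def h(seed: int, term: str) -> int:
        digest = hashlib.md5(f"{seed}:{term}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return tuple(min(h(seed, t) for t in terms) for seed in range(num_hashes))


def sparse_graph(docs: Dict[str, Set[str]], num_hashes: int = 8,
                 band_size: int = 2) -> Dict[str, Dict[str, float]]:
    """Link documents that collide in at least one signature band."""
    buckets = defaultdict(list)
    for name, terms in docs.items():
        sig = minhash(terms, num_hashes)
        for start in range(0, num_hashes, band_size):
            buckets[(start, sig[start:start + band_size])].append(name)
    graph: Dict[str, Dict[str, float]] = defaultdict(dict)
    for members in buckets.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                sim = len(docs[a] & docs[b]) / len(docs[a] | docs[b])
                graph[a][b] = graph[b][a] = sim  # Jaccard similarity as edge weight
    return graph


def greedy_tour(docs: Dict[str, Set[str]],
                graph: Dict[str, Dict[str, float]]) -> List[str]:
    """Walk to the most similar unvisited neighbour; restart when stuck."""
    order: List[str] = []
    visited: Set[str] = set()
    for start in sorted(docs):
        current = start
        while current is not None and current not in visited:
            visited.add(current)
            order.append(current)
            candidates = [(sim, n) for n, sim in graph.get(current, {}).items()
                          if n not in visited]
            current = max(candidates)[1] if candidates else None
    return order


# Toy collection (an assumption): d1 and d3 are identical, d2 is unrelated.
docs = {"d1": {"web", "index", "search"}, "d2": {"piano", "violin"},
        "d3": {"web", "index", "search"}}
tour = greedy_tour(docs, sparse_graph(docs))
new_ids = {name: doc_id for doc_id, name in enumerate(tour)}
print(tour)
```

The position of each document in the tour becomes its new identifier, so similar documents receive nearby identifiers. Only pairs that collide in a band are ever compared, which is what keeps the graph sparse.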
Improved Index Compression Techniques for Versioned Document Collections
Cited by 5 (2 self)
Current Information Retrieval systems use inverted index structures for efficient query processing. Due to the extremely large size of many data sets, these index structures are usually kept in compressed form, and many techniques for optimizing compressed size and query processing speed have been proposed. In this paper, we focus on versioned document collections, that is, collections where each document is modified over time, resulting in multiple versions of the document. Consecutive versions of the same document are often similar, and several researchers have explored ideas for exploiting this similarity to decrease index size. We propose new index compression techniques for versioned document collections that achieve reductions in index size over previous methods. In particular, we first propose several bitwise compression techniques that achieve a compact index structure but that are too slow for most applications. Based on the lessons learned, we then propose additional techniques that come close to the sizes of the bitwise technique while also improving on the speed of the best previous methods.
Compact Data Structures with Fast Queries
, 2005
Cited by 4 (0 self)
Many applications dealing with large data structures can benefit from keeping them in compressed form. Compression has many benefits: it can allow a representation to fit in main memory rather than swapping out to disk, and it improves cache performance since it allows more data to fit into the cache. However, a data structure is only useful if it allows the application to perform fast queries (and updates) to the data.
Characterization of a simple case of the reassignment of document identifiers as a pattern sequencing problem
- In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval
Cited by 3 (0 self)
In this poster, we analyze recent work on the document identifier reassignment problem. We then present a formalization of a simple case of the problem as a PSP (Pattern Sequencing Problem). This may facilitate future work, as it opens a new line of research for solving the general problem.
Faster Temporal Range Queries over Versioned Text
Cited by 1 (1 self)
Versioned textual collections are collections that retain multiple versions of a document as it evolves over time. Important large-scale examples are Wikipedia and the web collection of the Internet Archive. Search queries over such collections often use keywords as well as temporal constraints, most commonly a time range of interest. In this paper, we study how to support such temporal range queries over versioned text. Our goal is to process these queries faster than the corresponding keyword-only queries, by exploiting the additional constraint. A simple approach might partition the index into different time ranges, and then access only the relevant parts. However, specialized inverted index compression techniques are crucial for large versioned collections, and a naive partitioning can negatively affect index compression and query throughput. We show how to achieve high query throughput by using smart index partitioning techniques that take index compression into account. Experiments on over 85 million versions of Wikipedia articles show that queries can be executed in a few milliseconds on memory-based index structures, and only slightly more time on disk-based structures. We also show how to efficiently support the recently proposed stable top-k search primitive on top of our schemes.
Categories and Subject Descriptors: H.3 [Information Storage And Retrieval]
Characterization of a simple case of the reassignment of document identifiers as a pattern sequencing problem
Algorithms, Performance
We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important examples of such collections are Wikipedia and the web page archive maintained by the Internet Archive. A straightforward indexing approach would simply treat each document version as a separate document, such that index size scales linearly with the number of versions. However, several authors have recently studied approaches that exploit the significant similarities between different versions of the same document to obtain much smaller index sizes. In this paper, we propose new techniques for organizing and compressing inverted index structures for such collections. We also perform a detailed experimental comparison of the new techniques and the existing techniques in the literature. Our results on an archive of the English version of Wikipedia, and on a subset of the Internet Archive collection, show significant benefits over previous approaches.

