Algorithms, Performance

by Jinru He, Hao Yan, Torsten Suel

BibTeX

@MISC{He_algorithms,performance,
    author = {Jinru He and Hao Yan and Torsten Suel},
    title = {Algorithms, Performance},
    year = {}
}


Abstract

We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important examples of such collections are Wikipedia and the web page archive maintained by the Internet Archive. A straightforward indexing approach would simply treat each document version as a separate document, so that index size scales linearly with the number of versions. However, several authors have recently studied approaches that exploit the significant similarities between different versions of the same document to obtain much smaller index sizes. In this paper, we propose new techniques for organizing and compressing inverted index structures for such collections. We also perform a detailed experimental comparison of the new techniques with existing techniques from the literature. Our results on an archive of the English version of Wikipedia, and on a subset of the Internet Archive collection, show significant benefits over previous approaches.
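To make the size argument concrete, the following is a minimal sketch, not the techniques proposed in the paper, which the abstract does not detail. It contrasts the straightforward approach (one posting per term per version) with one simple way of exploiting similarity between versions: storing a single posting for each maximal run of consecutive versions that contain a term. The function names, the interval layout, and the toy collection are all assumptions made for illustration.

```python
from collections import defaultdict


def naive_index(collection):
    """Treat every version as a separate document.

    Emits one posting (doc_id, version) per distinct term per version,
    so the posting count grows linearly with the number of versions.
    """
    index = defaultdict(list)
    for doc_id, versions in collection.items():
        for v, text in enumerate(versions):
            for term in sorted(set(text.split())):
                index[term].append((doc_id, v))
    return index


def interval_index(collection):
    """Hypothetical alternative: one posting per run of versions.

    Emits (doc_id, first_version, last_version) for each maximal run of
    consecutive versions of a document that contain the term.
    """
    index = defaultdict(list)
    for doc_id, versions in collection.items():
        open_runs = {}  # term -> version at which its current run started
        for v, text in enumerate(versions):
            terms = set(text.split())
            # Close the run of every term that disappeared in this version.
            for term in [t for t in open_runs if t not in terms]:
                index[term].append((doc_id, open_runs.pop(term), v - 1))
            # Start a run for every term that is newly present.
            for term in terms:
                open_runs.setdefault(term, v)
        # Close the runs still open after the last version.
        last = len(versions) - 1
        for term, start in open_runs.items():
            index[term].append((doc_id, start, last))
    return index


def posting_count(index):
    """Total number of postings across all terms."""
    return sum(len(postings) for postings in index.values())


# Toy collection (made up): one document with three similar versions.
collection = {
    "doc1": [
        "the quick brown fox",
        "the quick brown fox jumps",
        "the quick red fox jumps",
    ],
}

print(posting_count(naive_index(collection)))     # 14
print(posting_count(interval_index(collection)))  # 6
```

On this toy input the naive index holds 14 postings and the interval variant holds 6, because terms that persist across versions are recorded once. Real systems would additionally compress the postings themselves, which this sketch does not attempt.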
