Managing Gigabytes: Compressing and Indexing Documents and Images - Errata (1996)

by I. H. Witten, A. Moffat, T. C. Bell

Results 1 - 10 of 505

Modern Information Retrieval

by Ricardo Baeza-Yates, Berthier Ribeiro-Neto, 1999
Cited by 1928 (24 self)
Information retrieval (IR) has changed considerably in recent years with the expansion of the Web (World Wide Web) and the advent of modern and inexpensive graphical user interfaces and mass storage devices. As a result, traditional IR textbooks have become quite out of date, which has led to the introduction of new IR books recently. Nevertheless, we believe that there is still great need for a book that approaches the field in a rigorous and complete way from a computer-science perspective (as opposed to a user-centered perspective). This book is an effort to partially fill this gap and should be useful for a first course on information retrieval as well as for a graduate course on the topic. The book ...

Video Google: A text retrieval approach to object matching in videos

by Josef Sivic, Andrew Zisserman - In Proc. ICCV, 2003
Cited by 550 (24 self)
We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user-outlined object in a video. The object is represented by a set of viewpoint-invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. The analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google. The method is illustrated for matching on two full-length feature films.

High-order entropy-compressed text indexes

by Roberto Grossi, Ankur Gupta, Jeffrey Scott Vitter, 2003
Cited by 163 (20 self)
We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of $n$ symbols over an alphabet $\Sigma$, where each symbol is encoded by $\lg |\Sigma|$ bits. We show that compressed suffix arrays use just $nH_h + O(n \lg\lg n / \lg_{|\Sigma|} n)$ bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length $m$ in $O(m \lg |\Sigma| + \mathrm{polylog}(n))$ time. The term $H_h \le \lg |\Sigma|$ denotes the $h$th-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that $H_h = o(1)$ and the alphabet size is small, we obtain a text index with $o(m)$ search time that requires only $o(n)$ bits. Further results and tradeoffs are reported in the paper.

Opportunistic Data Structures with Applications

by Paolo Ferragina, Giovanni Manzini, 2000
Cited by 142 (11 self)
In this paper we address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible and this space reduction is achieved at no significant slowdown in the query performance. More precisely, its space occupancy is optimal in an information-content sense because a text $T[1, u]$ is stored using $O(H_k(T)) + o(1)$ bits per input symbol in the worst case, where $H_k(T)$ is the $k$th-order empirical entropy of $T$ (the bound holds for any fixed $k$). Given an arbitrary string $P[1, p]$, the opportunistic data structure allows one to search for the $\mathrm{occ}$ occurrences of $P$ in $T$ in $O(p + \mathrm{occ} \log^{\epsilon} u)$ time (for any fixed $\epsilon > 0$). If data are uncompressible we achieve the best space bound currently known [12]; on compressible data our solution improves the succinct suffix array of [12] and the classical suffix tree and suffix array data structures either in space or in query time or both.

PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities

by Francisco Matias Cuenca-Acuna, Christopher Peery, Richard P. Martin, Thu D. Nguyen, 2003
Cited by 139 (11 self)
We present PlanetP, a peer-to-peer (P2P) content search and retrieval infrastructure targeting communities wishing to share large sets of text documents. P2P computing is an attractive model for information sharing between ad hoc groups of users because of its low cost of entry and explicit model for resource scaling. As communities grow, however, a key challenge becomes finding relevant information. To address this challenge, our design centers around indexing, content search, and retrieval rather than scalable name-based object location, which has been the focus of recent P2P systems. PlanetP takes the novel approach of replicating the global directory and a compact summary index at every peer using gossiping. PlanetP then leverages this information to approximate a state-of-the-art document ranking algorithm to help users locate relevant information within the large communal data set. Using a prototype implementation together with simulation, we show: (i) it is possible to design a gossiping algorithm that reliably maintains a copy of communal state at each peer yet requires only a modest amount of bandwidth, (ii) our content search and retrieval algorithm tracks the performance of the original ranking algorithm very closely, giving P2P communities a search and retrieval algorithm as good as that possible assuming a centralized server, and (iii) PlanetP’s gossiping and search and retrieval algorithms both scale well to communities of at least several thousand peers.

On the Feasibility of Peer-to-Peer Web Indexing and Search

by Jinyang Li, Boon Thau Loo, Joseph M. Hellerstein, M. Frans Kaashoek, et al. - In IPTPS’03, 2003
Cited by 121 (11 self)
This paper discusses the feasibility of peer-to-peer full-text keyword search of the Web. Two classes of keyword search techniques are in use or have been proposed: flooding of queries over an overlay network (as in Gnutella), and intersection of index lists stored in a distributed hash table. We present a simple feasibility analysis based on the resource constraints and search workload. Our study suggests that the peer-to-peer network does not have enough capacity to make naive use of either search technique attractive for Web search. The paper presents a number of existing and novel optimizations for P2P search based on distributed hash tables, estimates their effects on performance, and concludes that in combination these optimizations would bring the problem to within an order of magnitude of feasibility. The paper suggests a number of compromises that might achieve the last order of magnitude.

Arithmetic coding revisited

by Alistair Moffat, Radford M. Neal, Ian H. Witten - ACM Transactions on Information Systems, 1995
Cited by 118 (2 self)
Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available.

Searching the Web

by Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan - ACM Transactions on Internet Technology, 2001
Cited by 108 (1 self)
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.

Content-based query of image databases, inspirations from text retrieval: inverted files, frequency-based weights and relevance feedback

by David McG. Squire, Wolfgang Müller, Henning Müller, Jilali Raki, 1998
Cited by 93 (18 self)
Abstract not found

ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval

by Torsten Suel, Chandan Mathur, Jo-wen Wu, Jiangong Zhang, Alex Delis, Mehdi Kharrazi, Xiaohui Long, Kulesh Shanmugasundaram - In WebDB, 2003
Cited by 86 (3 self)
this paper appears in [15], and updated information is available at http://cis.poly.edu/westlab/odissea/