Building a distributed full-text index for the web
- ACM Trans. Inf. Syst., 2001
Cited by 63 (3 self)
We identify crucial design issues in building a distributed inverted index for a large collection of Web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creating and managing inverted files using an embedded database system. We suggest and compare different strategies for collecting global statistics from distributed inverted indexes. Finally, we present performance results from experiments on a testbed distributed Web indexing system that we have implemented.
Efficient Single-Pass Index Construction for Text Databases
- Jour. of the American Society for Information Science and Technology, 2003
Cited by 31 (2 self)
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this paper, we review the principal approaches to inversion, analyse their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
I/O-Conscious Data Preparation for Large-Scale Web Search Engines
- In VLDB 2002, Proceedings of 28th International Conference on Very Large Data Bases, 2002
Cited by 3 (1 self)
Given that commercial search engines cover billions of web pages, efficiently managing the corresponding volumes of disk-resident data needed to answer user queries quickly is a formidable data manipulation challenge. We present a general technique for efficiently carrying out large sets of simple transformation or querying operations over external-memory data tables. It greatly reduces the number of performed disk accesses and seeks by maximizing the temporal locality of data access and organizing most of the necessary disk accesses into long sequential reads or writes of data that is reused many times while in memory.
Comparing Distributed Indexing: To MapReduce or Not?
Cited by 3 (1 self)
Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular within the auspices of the MapReduce programming framework. In particular, we describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce indexing implementation that we propose. Experiments using the Hadoop MapReduce implementation and a large standard TREC corpus show our proposed MapReduce indexing implementation to be more efficient than those proposed in the original paper.
SmartSeer: Using a DHT to process continuous queries over peer-to-peer networks
- In Proceedings of the 2006 IEEE INFOCOM, 2006
Cited by 1 (0 self)
As the academic world moves away from physical journals and proceedings towards online document repositories, the ability to efficiently locate work of interest among the torrent of newly-generated papers will become increasingly important. To aid in this endeavor, we designed SmartSeer, a system that allows users to register personalized continuous queries over the CiteSeer database of technical documents. Users are then alerted whenever papers that match their queries are put online. SmartSeer has two main design requirements. First, to allow effective information retrieval, it should support rich continuous queries (as opposed to simple keyword searches). Second, to make effective use of donated infrastructure, it should be capable of running on a loosely maintained group of unreliable machines spread across multiple organizations (as opposed to assuming a reliable and tightly coupled distributed system). Existing work on distributed continuous query systems fails at least one of these requirements. Our design for SmartSeer is based on Distributed Hash Tables (DHTs), and thereby leverages previous work on DHT-based query systems. A prototype of SmartSeer has been implemented and evaluated on PlanetLab. Though we evaluate our design only for the SmartSeer application, we believe it also provides useful insights into other distributed and rich continuous query systems (web alerts, news alerts, etc.).
Implementation of a Modern Web Search Engine Cluster
Cited by 1 (0 self)
Yuntis is a fully-functional prototype of a complete web search engine with features comparable to those available in commercial-grade search engines. In particular, Yuntis supports page quality scoring based on global web linkage graph, extensively exploits text associated with links, computes pages' keywords and lists of similar pages of good quality, and provides a very flexible query language. This paper reports our experiences in the three-year development process of Yuntis, by presenting its design issues, software architecture, implementation details, and performance measurements.
Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother?
Cited by 1 (1 self)
Modern information retrieval research has evolved a standard workflow that involves first indexing a document collection and then running ad hoc queries sequentially to evaluate retrieval effectiveness using standard test collections. This paper explores how aspects of this workflow might change in a MapReduce cluster-based environment. First, we present and evaluate two algorithms for inverted indexing that take advantage of the programming model’s sorting mechanism to different extents. The running times of both algorithms scale linearly in terms of collection size up to 102 million web pages. Second, we show that it is possible to efficiently perform batch query evaluation with MapReduce by scanning all postings lists in parallel, as opposed to sequentially accessing each postings list. Third, we explore an approach that forgoes inverted indexing altogether and simply computes all query–document scores from document vectors themselves. Experimental results challenge us to think differently about previous assumptions in information retrieval, and show that brute force approaches are surprisingly compelling under certain circumstances: parallel scan of postings can effectively take advantage of large clusters and parallel scan of documents fits naturally with ranking functions that use document-level features.
A scalable architecture for XML retrieval
- 2003
While in classical text collections documents are regarded as atomic units, in XML collections nested elements of varying granularity are considered. This augmented view increases the number of potentially retrieved objects, e.g. documents, elements within documents, or aggregations of elements or of documents. The increase in the number of objects to be indexed and retrieved by XML retrieval systems leads, for XML collections of comparably small size (several 100 MB), already to the necessity to apply strategies for scalability, such as parallel and distributed processing, term, document and database pre-selection. We report in this paper on our approach for dealing with XML collections in general, and with the INEX collection in particular, using a scalable indexing and retrieval architecture.
A Reliable Storage Management Layer for
- In 12th ACM International Conference on Information and Knowledge Management, 2003
We present a storage management layer that facilitates the implementation of parallel information retrieval systems, and related applications, on networks of workstations. The storage management layer automates the process of adding and removing nodes, and implements a dispersed mirroring strategy to improve reliability. When nodes are added and removed, the document collection managed by the system is redistributed for load balancing purposes. The use of dispersed mirroring minimizes the impact of node failures and system modifications on query performance.

