Results 1 - 10
of
12
Distributed query processing using partitioned inverted files
- In Proc. of the 9th String Processing and Information Retrieval Symposium (SPIRE
, 2001
"... In this paper, we study query processing in a distributed text database. The novelty is a real distributed architecture implementation that offers concurrent query service. The distributed system adopts a network of workstations model and the client-server paradigm. The document collection is indexe ..."
Abstract
-
Cited by 35 (4 self)
- Add to MetaCart
In this paper, we study query processing in a distributed text database. The novelty is a real distributed architecture implementation that offers concurrent query service. The distributed system adopts a network of workstations model and the client-server paradigm. The document collection is indexed with an inverted file. We adopt two distinct strategies of index partitioning in the distributed system, namely local index partitioning and global index partitioning. In both strategies, documents are ranked using the vector space model along with a document filtering technique for fast ranking. We evaluate and compare the impact of the two index partitioning strategies on query processing performance. Experimental results on retrieval efficiency show that, within our framework, the global index partitioning outperforms the local index partitioning. 1.
Load balancing for term-distributed parallel retrieval
, 2006
"... Large-scale web and text retrieval systems deal with amounts of data that greatly exceed the capacity of any single machine. To handle the necessary data volumes and query throughput rates, parallel systems are used, in which the document and index data are split across tightly-clustered distributed ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
Large-scale web and text retrieval systems deal with amounts of data that greatly exceed the capacity of any single machine. To handle the necessary data volumes and query throughput rates, parallel systems are used, in which the document and index data are split across tightly-clustered distributed computing systems. The index data can be distributed either by document or by term. In this paper we examine methods for load balancing in term-distributed parallel architectures, and propose a suite of techniques for reducing net querying costs. In combination, the techniques we describe allow a 30 % improvement in query throughput when tested on an eight-node parallel computer system. Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content analysis and indexing – indexing methods; H.3.2 [Information Storage and Retrieval]:
Query-driven document partitioning and collection selection
- in INFOSCALE 2006: Proceedings of the first International Conference on Scalable Information Systems
, 2006
"... Abstract — We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We proposed a novel document representation called query-vectors model. Each document is represented as a l ..."
Abstract
-
Cited by 10 (3 self)
- Add to MetaCart
Abstract — We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We proposed a novel document representation called query-vectors model. Each document is represented as a list recording the queries for which the document itself is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results, and are used for collection selection. We show that this document partition strategy greatly boosts the performance of standard collection selection algorithms, including CORI, w.r.t. a round-robin assignment. Secondly, we show that performing collection selection by matching the query to the existing query clusters and successively choosing only one server, we reach an average precision-at-5 up to 1.74 and we constantly improve CORI precision of a factor between 11 % and 15%. As a side result we show a way to select rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52 % of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query. I.
Distributed query processing using suffix arrays
- In Int. Conf. on String Processing and Information Retrieval, Lecture Notes in Computer Science
, 2003
"... Abstract. Suffix arrays are more efficient than inverted files for solving complex queries in a number of applications related to text databases. Examples arise when dealing with biological or musical data or with texts written in oriental languages, and when searching for phrases, approximate patte ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
Abstract. Suffix arrays are more efficient than inverted files for solving complex queries in a number of applications related to text databases. Examples arise when dealing with biological or musical data or with texts written in oriental languages, and when searching for phrases, approximate patterns and, in general, regular expressions involving separators. In this paper we propose algorithms for processing in parallel batches of queries upon distributed text databases. We present efficient alternatives for speeding up query processing using distributed realizations of suffix arrays. Empirical results obtained from natural language text on a cluster of PCs show that the proposed algorithms are efficient in practice. 1
Parallel text query processing using Composite Inverted Lists
- IN SECOND INTERNATIONAL CONFERENCE ON HYBRID INTELLIGENT SYSTEMS (WEB COMPUTING SESSION). IO
, 2003
"... The inverted lists strategy is frequently used as an index data structure for very large textual databases. Its implementation and comparative performance has been studied in sequential and parallel applications. In the latter, with relatively few studies, there has been a sort of "which-is-better ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
The inverted lists strategy is frequently used as an index data structure for very large textual databases. Its implementation and comparative performance has been studied in sequential and parallel applications. In the latter, with relatively few studies, there has been a sort of "which-is-better" discussion about two alternative parallel realizations of the basic data structure and algorithms. We suggest that a mix between the two is actually a better alternative. Depending on the workload generated by the users, the composite inverted lists algorithm we propose in this paper can operate either as a local or global inverted list, or both at the same time.
Basic issues on the processing of web queries
- In: Proceedings of the 28th Annual International ACM SIGIR Conference on Reseach and Development in Information Retrieval (SIGIR’05
, 2005
"... Search engines represent a key component of Web economy these days. Despite that, there is not much technical literature available on their design, fine tuning, and internal operation. In this work, we make a preliminary attempt to partially fulfill this gap. We distinguish that Web query processing ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Search engines represent a key component of Web economy these days. Despite that, there is not much technical literature available on their design, fine tuning, and internal operation. In this work, we make a preliminary attempt to partially fulfill this gap. We distinguish that Web query processing is composed of two phases: (a) retrieving information on documents related to the queries and ranking them, and (b) generating snippets, title, and URL information for the answer page. The second phase has cost that is basically constant on the size of the collection, while the cost of the first phase is affected by the size of the collection. Thus, we concentrate here on studying the behavior of a search engine while executing the first phase of query processing. Using real data and a small cluster of index servers, we study four basic and key issues related to this first phase of query processing: load balance, broker behavior, performance by individual index servers, and overall throughput. Our study, while preliminary, does reveal interesting tradeoffs: (1) that load unbalance at low query arrival rates can be controlled with a simple measure of randomizing the distribution of documents among the index servers, (2) that the broker is not a bottleneck, (3) that disk and CPU utilization at individual servers depends on the relationship between memory size and the distribution of frequencies for the query terms, and (4) that load unbalance at high loads prevents higher throughput. Our results suggest that further studying and evaluating search engines is a promising research avenue.
A search engine accepting on-line updates
- In Euro-Par ’07: 13th International Conference on Parallel and Distributed Computing
, 2007
"... Abstract. We describe and evaluate the performance of a parallel search engine that is able to cope efficiently with concurrent read/write operations. Read operations come in the usual form of queries submitted to the search engine and write ones come in the form of new documents added to the text c ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Abstract. We describe and evaluate the performance of a parallel search engine that is able to cope efficiently with concurrent read/write operations. Read operations come in the usual form of queries submitted to the search engine and write ones come in the form of new documents added to the text collection in an on-line manner, namely the insertions are embedded into the main stream of user queries in an unpredictable arrival order but with query results respecting causality. The search engine is built upon distributed inverted files for which we propose generic strategies for load balance and concurrency control. 1
(Sync|Async) + MPI Search Engines
"... Abstract. We propose a parallel MPI search engine that is capable of automatically switching between asynchronous message passing and bulk-synchronous message passing modes of operation. When the observed query traffic is small or moderate the standard multiple managers/workers thread based model of ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Abstract. We propose a parallel MPI search engine that is capable of automatically switching between asynchronous message passing and bulk-synchronous message passing modes of operation. When the observed query traffic is small or moderate the standard multiple managers/workers thread based model of message passing is applied for processing the queries. However, when the query traffic increases a roundrobin based approach is applied in order to prevent from unstable behavior coming from queries demanding the use of a large amount of resources in computation, communication and disk accesses. This is achieved by (i) a suitable object-oriented multi-threaded MPI software design and (ii) an “atomic ” organization of the query processing which allows the use of a novel control strategy that decides the proper mode of operation. 1
ABSTRACT Distributed Processing of Conjunctive Queries
"... We distinguish that Web query processing is composed of two phases: (a) retrieving information on documents related to the queries and ranking them, and (b) generating snippets, title, and URL information for the answer page. Using real data and a small cluster of index servers, we study four basic ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
We distinguish that Web query processing is composed of two phases: (a) retrieving information on documents related to the queries and ranking them, and (b) generating snippets, title, and URL information for the answer page. Using real data and a small cluster of index servers, we study four basic and key issues related to this first phase of query processing: load balance, broker behavior, performance by individual index servers, and overall throughput. Our study reveals interesting tradeoffs: (1) that load unbalance at low query arrival rates can be controlled with a simple measure of randomizing the distribution of documents among the index servers, (2) that the broker is not a bottleneck, (3) that disk and CPU utilization at individual servers depends on the relationship between memory size and the distribution of frequencies for the query terms, and (4) that load unbalance at high loads prevents higher throughput. Categories and Subject Descriptors H.3.4 [Information Storage and Retrieval]: Systems
Load-Balancing and Caching for Collection Selection Architectures
"... Abstract — To address the rapid growth of the Internet, modern Web search engines have to adopt distributed organizations, where the collection of indexed documents is partitioned among several servers, and query answering is performed as a parallel and distributed task. Collection selection can be ..."
Abstract
- Add to MetaCart
Abstract — To address the rapid growth of the Internet, modern Web search engines have to adopt distributed organizations, where the collection of indexed documents is partitioned among several servers, and query answering is performed as a parallel and distributed task. Collection selection can be a way to reduce the overall computing load, by finding a trade-off between the quality of results retrieved and the cost of solving queries. In this paper, we analyze the relationship between the collection selection strategy, the effect on load balancing and on the caching subsystem, by exploring the design-space of a distributed search engine based on collection selection. In particular, we propose a strategy to perform collection selection in a load-driven way, and a novel caching policy able to incrementally refine the effectiveness of the results returned for each subsequent cache hit. The combination of load-driven collection selection and incremental caching strategies allows our system to retrieve two thirds of the top-ranked results returned by a baseline centralized index, with only one fifth of the computing workload. I.

