Inverted files for text search engines
ACM Computing Surveys, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
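As a rough illustration of the core structure this tutorial is about, the following minimal sketch builds an in-memory inverted index and answers a conjunctive (AND) query. The function names, the whitespace tokenizer, and the sample documents are assumptions made for this example; they are not taken from the survey, which covers compression, construction, and ranking techniques that are omitted here.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    counts = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            counts[term][doc_id] = counts[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in counts.items()}


def conjunctive_query(index, terms):
    """Return the sorted ids of documents that contain every query term."""
    doc_sets = [{doc_id for doc_id, _ in index.get(t, [])} for t in terms]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))


# Made-up sample collection, for illustration only.
docs = ["the quick brown fox", "the lazy dog", "quick brown dogs and the fox"]
index = build_index(docs)
matches = conjunctive_query(index, ["quick", "fox"])
```

Each postings list is kept sorted by document id, which is the property that real systems rely on for merging and for compressing the gaps between ids.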
Integrating Structured Data and Text: A relational approach
Journal of the American Society for Information Science, 1997
Cited by 50 (27 self)
We integrate structured data and text using the unchanged, standard relational model. We started with the premise that a relational system could be used to implement an Information Retrieval (IR) system. After implementing a prototype to verify that premise, we then began to investigate the performance of a parallel relational database system for this application. We also tested the effect of query reduction on accuracy and found that queries can be reduced prior to their implementation without incurring a significant loss in precision/recall. This reduction also serves to improve run-time performance. After comparing our results to a special purpose IR system, we conclude that the relational model offers scalable performance and includes the ability to integrate structured data and text in a portable fashion. 1 Introduction Increasingly, applications integrate structured and unstructured data, responding to requests such as "Find articles containing vehicle and sales published in jou...
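The idea of running retrieval on an unchanged relational system can be sketched by storing postings as ordinary rows and expressing a keyword query as a join. The schema (`documents`, `postings`), the `year` attribute, the `search` helper, and the sample rows below are all hypothetical, chosen only to show how one SQL statement can combine a text condition with a structured condition; the paper's own schema and queries may differ.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE postings (term TEXT, doc_id INTEGER, tf INTEGER);
    CREATE INDEX postings_term ON postings (term);
""")

# Made-up sample rows, for illustration only.
rows = [
    (1, "Vehicle sales report", 1995),
    (2, "Vehicle safety study", 1996),
    (3, "Retail sales of vehicle parts", 1996),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?)", rows)

# Index each title: one postings row per (term, document) pair.
for doc_id, title, _year in rows:
    for term, tf in Counter(title.lower().split()).items():
        conn.execute("INSERT INTO postings VALUES (?, ?, ?)", (term, doc_id, tf))


def search(conn, terms, year):
    """Documents from `year` containing all distinct `terms`, ranked by summed tf."""
    placeholders = ", ".join("?" * len(terms))
    sql = f"""
        SELECT d.doc_id, SUM(p.tf) AS score
        FROM postings p JOIN documents d ON d.doc_id = p.doc_id
        WHERE p.term IN ({placeholders}) AND d.year = ?
        GROUP BY d.doc_id
        HAVING COUNT(DISTINCT p.term) = ?
        ORDER BY score DESC, d.doc_id
    """
    return conn.execute(sql, [*terms, year, len(terms)]).fetchall()


hits = search(conn, ["vehicle", "sales"], 1996)
```

Because the text condition and the structured condition sit in the same `WHERE` clause, no separate IR engine is needed to combine them, which is the portability argument the abstract makes.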
Inverted File Partitioning Schemes in Multiple Disk Systems
IEEE Transactions on Parallel and Distributed Systems, 1995
Cited by 40 (0 self)
Multiple-disk I/O systems (Disk Arrays) have been an attractive approach to meet high performance I/O demands in data intensive applications such as information retrieval systems. When we partition and distribute files across multiple disks to exploit the potential for I/O parallelism, a balanced I/O workload distribution becomes important for good performance. Naturally, the performance of a parallel information retrieval system using an inverted file structure is affected by the partitioning scheme of the inverted file. In this paper, we propose two different partitioning schemes for an inverted file system for a shared-everything multiprocessor machine with multiple disks. We study the performance of these schemes by simulation under a number of workloads where the term frequencies in the documents are varied, the term frequencies in the queries are varied, the number of disks is varied and the multiprogramming level is varied. 1 Introduction Applying multiprocessor machines to th...
Evaluating the Performance of Distributed Architectures for Information Retrieval using a Variety of Workloads
ACM Transactions on Information Systems, 1997
Cited by 33 (7 self)
Information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this paper, we describe a fully functional distributed IR system based on the Inquery unified IR system. To refine this prototype, we implement a flexible simulation model which we use to present a series of experiments using a variety of workloads that measure system performance. We vary numerous system parameters such as the number of users, document collections, terms per query, query term frequency, think time, answers returned, and workload. Based on our initial results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate. Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems--distributed applications; C.4 [Performance of Systems]: Performance Attributes; H.3.4 [Information Storage and Retrieval]: Systems and Software. General Terms: Experimentation, Performance. Additional Key Words and Phrases: Distributed information retrieval architectures. This material is based on work supported by ...
On the allocation of documents in multiprocessor information retrieval systems
In Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1991
Cited by 16 (0 self)
Information retrieval is the selection of documents that are potentially relevant to a user’s information need. Given the vast volume of data stored in modern information retrieval systems, searching the document database requires vast computational resources. To meet these computational demands, various researchers have developed parallel information retrieval systems. As efficient exploitation of parallelism demands fast access to the documents, data organization and placement significantly affect the total processing time. We describe and evaluate a data placement strategy for distributed memory, distributed I/O multicomputers. Initially, a formal description of the Multiprocessor Document Allocation Problem (MDAP) and a proof that MDAP is NP-complete are presented. A document allocation
Caching and database scaling in distributed shared-nothing information retrieval systems
Proc. ACM SIGMOD Conf., 1992
Cited by 14 (2 self)
A common class of existing information retrieval system provides access to abstracts. For example Stanford University, through its FOLIO system, provides access to the INSPEC database of abstracts of the literature on physics, computer science, electrical engineering, etc. In this paper this database is studied by using a trace-driven simulation. We focus on physical index design, inverted index caching, and database scaling in a distributed shared-nothing system. All three issues are shown to have a strong effect on response time and throughput. Database scaling is explored in two ways. One way assumes an "optimal" configuration for a single host and then linearly scales the database by duplicating the host architecture as needed. The second way determines the optimal number of hosts given a fixed database size.
A Scaleable Technique For Best-Match Retrieval Of Sequential Information Using Metrics-Guided Search
Journal of Information Science, 1994
Cited by 13 (12 self)
A new technique is described for retrieving information by finding the best match or matches between a textual 'query' and a textual database. The technique uses principles of beam search with a measure of probability to guide the search and prune the search tree. Unlike many methods for comparing strings, the method gives a set of alternative matches, graded by the 'quality' of the matching achieved. For any one sequence of hits between a query and a database, the probability measure is an estimate of the probability that the observed configuration, or better, could have occurred by chance. This probability is an inverse measure of the redundancy between the query and the database. The new technique is embodied in a software simulation called SP21 which runs on a conventional computer. Examples are presented showing best-match retrieval of information from a textual database. Analytic and empirical evidence is presented showing that, in a serial processing environment, the search tech...
Load balancing for term-distributed parallel retrieval
2006
Cited by 11 (0 self)
Large-scale web and text retrieval systems deal with amounts of data that greatly exceed the capacity of any single machine. To handle the necessary data volumes and query throughput rates, parallel systems are used, in which the document and index data are split across tightly-clustered distributed computing systems. The index data can be distributed either by document or by term. In this paper we examine methods for load balancing in term-distributed parallel architectures, and propose a suite of techniques for reducing net querying costs. In combination, the techniques we describe allow a 30% improvement in query throughput when tested on an eight-node parallel computer system. Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content analysis and indexing – indexing methods; H.3.2 [Information Storage and Retrieval]:
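The difference between distributing an index by document and by term can be sketched on a toy index. The helper names, the modulo and round-robin assignment rules, and the sample data below are assumptions for illustration only; the paper's load-balancing techniques are not reproduced here.

```python
def partition_by_document(index, n_nodes):
    """Each node indexes a subset of documents and holds every term they contain."""
    shards = [dict() for _ in range(n_nodes)]
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            # Assumed placement rule: document id modulo node count.
            shards[doc_id % n_nodes].setdefault(term, []).append(doc_id)
    return shards


def partition_by_term(index, n_nodes):
    """Each node holds the complete postings lists for a subset of terms."""
    shards = [dict() for _ in range(n_nodes)]
    # Assumed placement rule: round-robin over the sorted vocabulary.
    for i, term in enumerate(sorted(index)):
        shards[i % n_nodes][term] = list(index[term])
    return shards


def nodes_touched(shards, query_terms):
    """Indices of the nodes that hold postings for at least one query term."""
    return [i for i, shard in enumerate(shards)
            if any(term in shard for term in query_terms)]


# Toy index: term -> sorted document ids (made-up data).
index = {"cat": [0, 1, 2, 3], "dog": [1, 3], "emu": [2], "fox": [0, 2]}
doc_shards = partition_by_document(index, 2)
term_shards = partition_by_term(index, 2)
```

In this toy setup a query for `cat` reaches both nodes under document distribution but only one node under term distribution. That concentration of work on whichever node holds a popular term is one way load can become uneven in a term-distributed system.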
Scalable Distributed Architectures for Information Retrieval
1999
Cited by 8 (4 self)
Zhihong Lu (B.Sc., Tongji University; M.Sc., Institute of Computing Technology, Chinese Academy of Sciences; Ph.D., University of Massachusetts Amherst), May 1999. Directed by Professor Kathryn S. McKinley.
As information explodes across the Internet and intranets, information retrieval (IR) systems must cope with the challenge of scale. How to provide scalable performance for rapidly increasing data and workloads is critical in the design of next generation information retrieval systems. This dissertation studies scalable distributed IR architectures that not only provide quick response but also maintain acceptable retrieval accuracy. Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers using symmetric multiprocessors, and use partial collection replication and selection as well as collection selection to restrict the search to a small percentage of data while maintaining ...
Distributed Queries And Incremental Updates In Information Retrieval Systems
1994
Cited by 8 (0 self)
The proliferation of the world's "information highways" has renewed interest in efficient document indexing techniques. This thesis explores the architecture of information retrieval systems for querying and indexing documents. Distributed queries are studied with analytical and trace-driven simulations. We focus on physical index design, inverted index caching, and database scaling in a distributed system. All three issues influence response time and throughput. Incremental updates of inverted lists are studied using a new dual-structure index data structure. This index structure separates long and short inverted lists dynamically and optimizes the retrieval, update, and storage of each type of list. To study the behavior of the index, engineering trade-offs are described that favor either update time or query performance. We explore these trade-offs quantitatively by using actual data and hardware and simulation to determine the best algorithm under a variety of criteria. Finally, imp...
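The idea of separating long and short inverted lists dynamically can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: the class name, the length threshold, the number of buckets, and the bucket-assignment rule are all hypothetical, and the thesis's actual on-disk layout and migration policy are not reproduced.

```python
class DualStructureIndex:
    """Short lists share a few buckets; a list that grows past a threshold
    is moved to its own entry in a separate long-list store."""

    def __init__(self, n_buckets=4, threshold=3):
        self.threshold = threshold                          # assumed cut-off length
        self.buckets = [dict() for _ in range(n_buckets)]   # shared by short lists
        self.long_lists = {}                                # one entry per long list

    def _bucket(self, term):
        # Assumed deterministic assignment: byte sum modulo bucket count.
        return self.buckets[sum(term.encode()) % len(self.buckets)]

    def add(self, term, doc_id):
        """Append a posting, migrating the list once it exceeds the threshold."""
        if term in self.long_lists:
            self.long_lists[term].append(doc_id)
            return
        bucket = self._bucket(term)
        postings = bucket.setdefault(term, [])
        postings.append(doc_id)
        if len(postings) > self.threshold:
            self.long_lists[term] = bucket.pop(term)

    def lookup(self, term):
        """Return a copy of the postings list for `term` (empty if unseen)."""
        if term in self.long_lists:
            return list(self.long_lists[term])
        return list(self._bucket(term).get(term, []))

    def is_long(self, term):
        return term in self.long_lists


# Made-up update stream: one frequent term and one rare term.
idx = DualStructureIndex(threshold=3)
for doc_id in range(5):
    idx.add("the", doc_id)
idx.add("zebra", 2)
```

Keeping rare terms together in shared buckets avoids allocating separate storage for many tiny lists, while giving each frequent term its own list lets it grow without disturbing its neighbours. The threshold controls where that boundary falls.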

