Inverted files for text search engines
ACM Computing Surveys, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
A Parallel Computing Approach to Creating Engineering Concept Spaces for Semantic Retrieval: The Illinois Digital Library Initiative Project
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996
Cited by 37 (12 self)
This research presents preliminary results generated from the semantic retrieval research component of the Illinois Digital Library Initiative (DLI) project. Using a variation of the automatic thesaurus generation techniques, to which we refer as the concept space approach, we aimed to create graphs of domain-specific concepts (terms) and their weighted co-occurrence relationships for all major engineering domains. Merging these concept spaces and providing traversal paths across different concept spaces could potentially help alleviate the vocabulary (difference) problem evident in large-scale information retrieval. We have experimented previously with such a technique for a smaller molecular biology domain (Worm Community System, with 10+ MBs of document collection) with encouraging results. In order to address the scalability issue related to large-scale information retrieval and analysis for the current Illinois DLI project, we recently conducted experiments using the concept sp...
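The idea of a graph of terms with weighted co-occurrence relationships can be illustrated with a minimal sketch. The weighting used here (the fraction of documents containing term a that also contain term b) is an assumption chosen for simplicity, not the weighting function of the paper, and the function name `cooccurrence_weights` and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_weights(docs):
    """Return {(a, b): weight} for every ordered pair of co-occurring terms.

    Assumed weighting (not taken from the paper): the fraction of
    documents containing term a that also contain term b, so the
    weights are asymmetric.
    """
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms
    for terms in docs:
        unique = set(terms)
        doc_freq.update(unique)
        pair_freq.update(permutations(sorted(unique), 2))
    return {(a, b): n / doc_freq[a] for (a, b), n in pair_freq.items()}


# Hypothetical toy collection; each document is a list of extracted terms.
docs = [["bridge", "steel", "load"], ["bridge", "steel"], ["steel", "alloy"]]
weights = cooccurrence_weights(docs)
```

In this toy collection, every document that mentions "bridge" also mentions "steel", while only two of the three "steel" documents mention "bridge", so the two directions of the edge carry different weights.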
Evaluating the Performance of Distributed Architectures for Information Retrieval using a Variety of Workloads
ACM Transactions on Information Systems, 1997
Cited by 33 (7 self)
Information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this paper, we describe a fully functional distributed IR system based on the Inquery unified IR system. To refine this prototype, we implement a flexible simulation model, which we use to present a series of experiments using a variety of workloads that measure system performance. We vary numerous system parameters such as the number of users, document collections, terms per query, query term frequency, think time, answers returned, and workload. Based on our initial results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate.
Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems -- distributed applications; C.4 [Performance of Systems]: Performance Attributes; H.3.4 [Information Storage and Retrieval]: Systems and Software
General Terms: Experimentation, Performance
Additional Key Words and Phrases: Distributed information retrieval architectures
This material is based on work supported by ...
Partial Collection Replication versus Caching for Information Retrieval Systems
In the ACM International Conference on Research and Development in Information Retrieval, 2000
Cited by 16 (0 self)
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. In this paper, we compare two mechanisms to improve IR system performance: partial collection replication and caching. When queries have locality, both mechanisms return results more quickly than sending queries to the original collection(s). Caches return results when a query exactly matches a previous one. Partial replicas are a form of caching that returns results when the IR technology determines the query is a good match. Caches are simpler and faster, but replicas can increase locality by detecting similarity between queries that are not exactly the same. We use real traces from THOMAS and Excite to measure query locality and similarity. With a very restrictive definition of query similarity, similarity improves query locality up to 15% over exact match. We use a validated simulator to compare their performance, and find that even if the partial replica hit rate increases only 3 to 6%, it will outperform simple caching under a variety of configurations. A combined approach will probably yield the best performance.
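The difference between an exact-match cache and similarity-based matching can be sketched as follows. This is only an illustration: the abstract says the IR technology decides whether a replica is a good match, whereas the Jaccard term-overlap test, the `threshold` parameter, the class and function names, and the document IDs below are all assumptions made for the example.

```python
def normalize(query):
    """Reduce a query to its set of lowercase terms."""
    return frozenset(query.lower().split())


class ExactMatchCache:
    """Returns stored results only when the query string is identical."""

    def __init__(self):
        self._store = {}

    def put(self, query, results):
        self._store[query] = results

    def get(self, query):
        return self._store.get(query)


def replica_is_good_match(query, replica_queries, threshold=0.5):
    """Assumed stand-in for replica selection: Jaccard overlap of query terms.

    The real system uses its retrieval machinery to make this decision;
    term overlap is used here only to show that non-identical queries can hit.
    """
    q = normalize(query)
    for past in replica_queries:
        p = normalize(past)
        union = q | p
        if union and len(q & p) / len(union) >= threshold:
            return True
    return False


cache = ExactMatchCache()
cache.put("health care reform", ["doc7", "doc12"])  # hypothetical document IDs

exact_hit = cache.get("health care reform")      # identical string: hit
reordered_hit = cache.get("reform health care")  # same terms, different string: miss
replica_hit = replica_is_good_match("reform health care", ["health care reform"])
```

The reordered query misses the cache but still counts as a match under the similarity test, which is the kind of extra locality the abstract attributes to replicas.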
On the Enhancements of a Sparse Matrix Information Retrieval Approach
Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 2000
Cited by 8 (2 self)
A novel approach to information retrieval is proposed and evaluated. By representing an inverted index as a sparse matrix, matrix-vector multiplication algorithms can be used to query the index. As many parallel sparse matrix multiplication algorithms exist, such an information retrieval approach lends itself to parallelism. This enables us to attack the problem of parallel information retrieval, which has resisted good scalability. We evaluate our proposed approach using several document collections from within the commonly used NIST TREC corpus. Our results indicate that our approach saves approximately 30% of the total storage requirements for the inverted index. Additionally, to improve accuracy, we develop a novel matrix-based relevance feedback technique as well as a proximity search algorithm.
1 Introduction
With constantly growing text resources, efficiency improvements via parallel processing, storage space reduction and the improvement of search effectiveness are the main ...
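Querying an inverted index by sparse matrix-vector multiplication can be sketched in a few lines. The sketch assumes raw term frequencies as matrix entries and a dictionary-of-dictionaries storage layout; the weighting scheme and storage format actually used in the paper are not stated in the abstract, and the names `build_sparse_index` and `score` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_sparse_index(docs):
    """Return {term: {doc_id: tf}}: the nonzero entries of a
    term-by-document matrix, stored one term row at a time."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def score(index, query, num_docs):
    """Multiply the transposed matrix by the sparse query vector,
    visiting only the rows of terms that occur in the query."""
    query_vec = defaultdict(int)
    for term in query.lower().split():
        query_vec[term] += 1
    scores = [0.0] * num_docs
    for term, q_weight in query_vec.items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += q_weight * tf
    return scores


# Hypothetical three-document collection.
docs = [
    "parallel sparse matrix multiplication",
    "inverted index for text retrieval",
    "sparse inverted index",
]
index = build_sparse_index(docs)
scores = score(index, "sparse index", len(docs))
```

Because each query term's row is processed independently and the partial sums are simply added, the rows can be distributed across processors, which is the property that makes the formulation amenable to parallel sparse matrix algorithms.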
Scalable Distributed Architectures for Information Retrieval
1999
Cited by 8 (4 self)
Scalable Distributed Architectures for Information Retrieval. May 1999. Zhihong Lu: B.Sc., Tongji University; M.Sc., Institute of Computing Technology, Chinese Academy of Sciences; Ph.D., University of Massachusetts Amherst. Directed by: Professor Kathryn S. McKinley.
As information explodes across the Internet and intranets, information retrieval (IR) systems must cope with the challenge of scale. How to provide scalable performance for rapidly increasing data and workloads is critical in the design of next generation information retrieval systems. This dissertation studies scalable distributed IR architectures that not only provide quick response but also maintain acceptable retrieval accuracy. Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers using symmetric multiprocessors, and use partial collection replication and selection as well as collection selection to restrict the search to a small percentage of data while maintaining ...
The Hardware/Software Balancing Act for Information Retrieval on Symmetric Multiprocessors
In Proceedings of Euro-Par'98, 1998
Cited by 4 (3 self)
Web search engines, such as AltaVista and Infoseek, handle tremendous loads by exploiting the parallelism implicit in their tasks and using symmetric multiprocessors to support their services. The web searching problem that they solve is a special case of the more general information retrieval (IR) problem of locating documents relevant to the information need of users. In this paper, we investigate how to exploit a symmetric multiprocessor to build high performance IR servers. Although the problem can be solved by throwing lots of CPU and disk resources at it, the important questions are how much of which hardware and what software structure are needed to effectively exploit hardware resources. We have found, to our surprise, that in some cases adding hardware degrades performance rather than improves it. We show that multiple threads are needed to fully utilize hardware resources. Our investigation is based on InQuery, a state-of-the-art full-text information retrieval engine.
1 In...
Partial collection replication for information retrieval
Information Retrieval, 1999
Cited by 4 (0 self)
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of unstructured text. This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system. In this work, a partial replica includes a subset of the documents from larger collection(s) and the corresponding inference network search mechanisms. For each query, the distributed system determines if a partial replica is a good match and then searches it, or it searches the original collection. We demonstrate that partial replication performs better than systems that use caches, which store only previous query and answer pairs. We first use logs from THOMAS and Excite to show how to build partial replicas and caches from frequent queries. We show that searching replicas can improve locality (from 3 to 20%) over the exact match required by caching. Replicas increase locality because they satisfy queries which are distinct but return the same or very similar answers. We then present a novel inference network replica selection function. We vary its parameters and compare it to previous collection selection functions, demonstrating a configuration that directs most of the appropriate queries to replicas in a replica hierarchy. We then explore the performance of partial replication in a distributed IR system. We compare it with caching and partitioning. Our validated simulator shows that the increases in locality due to replication make it preferable to caching alone, and that even a small increase of 4% in locality translates into a performance advantage. We also show a hybrid system with caches and replicas that performs better than either on its own.
Massive Parallelism on the Hybrid Text Retrieval Machine
Information Processing & Management, Vol. 31, No. 6, 1995
Cited by 3 (0 self)
The design of a high-performance, cost-effective machine for retrieving textual data is discussed in this paper. High performance and cost effectiveness are achieved by a combination of low-cost hard disks, software filtering techniques, and a large amount of main memory. The discussion focuses on the signature processor, which is based on the partitioned signature file technique, and the mass storage system, which is based on a disk array. A performance evaluation of the individual system components, namely, the signature processor and the mass storage system, as well as of the entire system is presented.
*The author is on leave from the Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210.
1 Introduction
Information retrieval has been a very important application for computers. However, the massive amount of information handled by information retrieval applications such as library systems and office automation systems often overwhelms the lar...
A Concept Space Approach To Semantic Exchange
2000
Cited by 1 (0 self)
This dissertation has been submitted in partial fulfillment of requirements for

