Results 1 - 10
of
48
Self-Indexing Inverted Files for Fast Text Retrieval
- ACM Transactions on Information Systems
, 1996
"... Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Here we show that query response time for conjunctive Boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, b ..."
Abstract
-
Cited by 127 (23 self)
- Add to MetaCart
Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Here we show that query response time for conjunctive Boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for Boolean queries of 5--10 terms, can reduce processing time to under one fifth of the previous cost. Similarly, ranked queries of 40--50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
Incremental Updates of Inverted Lists for Text Document Retrieval
, 1993
"... With the proliferation of the world's "information highways" a renewed interest in efficient document indexing techniques has come about. In this paper, the problem of incremental updates of inverted lists is addressed using a new dual-structure index data structure. The index dynamically separates ..."
Abstract
-
Cited by 83 (9 self)
- Add to MetaCart
With the proliferation of the world's "information highways" a renewed interest in efficient document indexing techniques has come about. In this paper, the problem of incremental updates of inverted lists is addressed using a new dual-structure index data structure. The index dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. To study the behavior of the index, a space of engineering tradeoffs which range from optimizing update time to optimizing query performance is described. We quantitatively explore this space by using actual data and hardware in combination with a simulation of an information retrieval system. We then describe the best algorithm for a variety of criteria. 1 Introduction As the world's "information highways" proliferate and grow in capacity, they are providing access to an ever growing number of electronic document repositories. At each repository, the number of documents available on-line is...
A survey of information retrieval and filtering methods
, 1995
"... We survey the major techniques for information retrieval. In the rst part, weprovide an overview of the traditional ones (full text scanning, inversion, signature les and clustering). In the second part we discuss attempts to include semantic information (natural language processing, latent semantic ..."
Abstract
-
Cited by 82 (0 self)
- Add to MetaCart
We survey the major techniques for information retrieval. In the rst part, weprovide an overview of the traditional ones (full text scanning, inversion, signature les and clustering). In the second part we discuss attempts to include semantic information (natural language processing, latent semantic indexing and neural networks).
Inverted files versus signature files for text indexing
- ACM Transactions on Database Systems
, 1998
"... Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We have examined their relative performance using bo ..."
Abstract
-
Cited by 74 (3 self)
- Add to MetaCart
Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We have examined their relative performance using both experimentation and a refined approach to modeling of signature files, and demonstrate that inverted files are distinctly superior to signature files. Not only can inverted files be used to evaluate typical queries in less time than can signature files, but inverted files require less space and provide greater functionality. Our results also show that a synthetic text database can provide a realistic indication of the behavior of an actual text database. The tools used to generate the synthetic database have been made publicly available.
Fast Incremental Indexing for Full-Text Information Retrieval
, 1994
"... Full-text information retrieval systems have traditionally been designed for archival environments. They often provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed. Modern applications, such as informatio ..."
Abstract
-
Cited by 69 (3 self)
- Add to MetaCart
Full-text information retrieval systems have traditionally been designed for archival environments. They often provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed. Modern applications, such as information filtering, operate in dynamic environments that require frequent additions to document collections. We provide this ability using a traditional inverted file index built on top of a persistent object store. The data management facilities of the persistent object store are used to produce efficient incremental update of the inverted lists. We describe our system and present experimental results showing superior incremental indexing and competitive query processing performance. Keywords: full-text document retrieval, incremental indexing, persistent object store, performance 1 Introduction Full-text information retrieval (IR) systems are well established tools for satisfying a user's inf...
Dissemination of Collection Wide Information in a Distributed Information Retrieval System
- In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 1995
"... We find that dissemination of collection wide information (CWI) in a distributed collection of documents is needed to achieve retrieval effectiveness comparable to a centralized collection. Complete dissemination is unnecessary. The required dissemination level depends upon how documents are allocat ..."
Abstract
-
Cited by 64 (10 self)
- Add to MetaCart
We find that dissemination of collection wide information (CWI) in a distributed collection of documents is needed to achieve retrieval effectiveness comparable to a centralized collection. Complete dissemination is unnecessary. The required dissemination level depends upon how documents are allocated among sites. Low dissemination is needed for random document allocation, but higher levels are needed when documents are allocated based on content. We define parameters to control dissemination and document allocation and present results from four test collections. We define the notion of iso-knowledge lines with respect to the number of sites and level of dissemination in the distributed archive, and show empirically that iso-knowledge lines are also isoeffectiveness lines when documents are randomly allocated. 1 Introduction In the rapidly evolving internetworks of today, a vast diversity of information is becoming electronically available. The information environment is highly dist...
Building a distributed full-text index for the web
- ACM Trans. Inf. Syst
, 2001
"... We identify crucial design issues in building a distributed inverted index for a large collection of Web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creati ..."
Abstract
-
Cited by 63 (3 self)
- Add to MetaCart
We identify crucial design issues in building a distributed inverted index for a large collection of Web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creating and managing inverted files using an embedded database system. We suggest and compare different strategies for collecting global statistics from distributed inverted indexes. Finally, we present performance results from experiments on a testbed distributed Web indexing system that we have implemented.
Performance of inverted indices in shared-nothing distributed text document information retrieval systems
- In Proceedings of the Second International Conference on Parallel and Distributed Information Systems
, 1993
"... The performance of distributed text document retrieval systems is strongly in uenced bytheorganization of the inverted index. This paper compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and ..."
Abstract
-
Cited by 54 (6 self)
- Add to MetaCart
The performance of distributed text document retrieval systems is strongly in uenced bytheorganization of the inverted index. This paper compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and queries. Simulation experiments determine which variables most strongly inuence response time and throughput. This leadstoa set of design trade-o s over a range of hardware con gurations and new parallel query processing strategies. 1
Tree Pattern Relaxation
, 2002
"... Tree patterns are fundamental to querying tree-structured data like XML. Because of the heterogeneity of XML data, it is often more appropriate to permit approximate query matching and return ranked answers, in the spirit of Information Retrieval, than to return only exact answers. In this paper ..."
Abstract
-
Cited by 45 (5 self)
- Add to MetaCart
Tree patterns are fundamental to querying tree-structured data like XML. Because of the heterogeneity of XML data, it is often more appropriate to permit approximate query matching and return ranked answers, in the spirit of Information Retrieval, than to return only exact answers. In this paper, we study the problem of approximate XML query matching, based on tree pattern relaxations, and devise efficient algorithms to evaluate relaxed tree patterns. We consider weighted tree patterns, where exact and relaxed weights, associated with nodes and edges of the tree pattern, are used to compute the scores of query answers. We are

