Results 1 - 10 of 49
Inverted files for text search engines - ACM Computing Surveys, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
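As a rough illustration of the kind of core structure the tutorial abstract refers to, the sketch below builds a tiny in-memory inverted file and answers a Boolean AND query. It is not taken from the paper: the function names `build_index` and `search_all`, the whitespace tokenizer, and the sample documents are all assumptions made for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: lowercase whitespace tokenization, no stemming or stopping.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, query):
    """Return IDs of documents that contain every query term (Boolean AND)."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    result = set(lists[0])
    for ids in lists[1:]:
        result &= set(ids)
    return sorted(result)

# Hypothetical three-document collection.
docs = [
    "text search engines",
    "inverted files for text search",
    "index construction",
]
index = build_index(docs)
```

Calling `search_all(index, "text search")` intersects the postings lists for the two terms. A production index would also store term frequencies or positions and compress the lists, which this sketch omits.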
Incremental Updates of Inverted Lists for Text Document Retrieval, 1993
Cited by 83 (9 self)
With the proliferation of the world's "information highways" a renewed interest in efficient document indexing techniques has come about. In this paper, the problem of incremental updates of inverted lists is addressed using a new dual-structure index data structure. The index dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. To study the behavior of the index, a space of engineering tradeoffs which range from optimizing update time to optimizing query performance is described. We quantitatively explore this space by using actual data and hardware in combination with a simulation of an information retrieval system. We then describe the best algorithm for a variety of criteria.
1 Introduction
As the world's "information highways" proliferate and grow in capacity, they are providing access to an ever growing number of electronic document repositories. At each repository, the number of documents available on-line is...
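The abstract of this entry says the index dynamically separates long and short inverted lists but does not spell out the mechanism. One possible reading, sketched below under stated assumptions, keeps each list in a "short" store until it grows past a cutoff and then migrates it to a "long" store. The class name `DualStructureIndex`, the `threshold` parameter, and the use of two dictionaries are hypothetical stand-ins, not the paper's design.

```python
class DualStructureIndex:
    """Toy index that stores short and long postings lists separately."""

    def __init__(self, threshold=4):
        self.threshold = threshold  # hypothetical cutoff between short and long
        self.short = {}  # term -> list; stands in for storage shared by short lists
        self.long = {}   # term -> list; stands in for dedicated storage per long list

    def add(self, term, doc_id):
        """Append a posting, migrating the list once it exceeds the threshold."""
        if term in self.long:
            self.long[term].append(doc_id)
            return
        plist = self.short.setdefault(term, [])
        plist.append(doc_id)
        if len(plist) > self.threshold:
            # The list has become long: move it out of the short store.
            self.long[term] = self.short.pop(term)

    def postings(self, term):
        """Look up a term in whichever store currently holds it."""
        if term in self.long:
            return self.long[term]
        return self.short.get(term, [])
```

With `threshold=2`, a term seen in three documents ends up in `long` while a term seen once stays in `short`; lookups go through `postings` and do not need to know which store holds the list.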
A survey of information retrieval and filtering methods, 1995
Cited by 82 (0 self)
We survey the major techniques for information retrieval. In the first part, we provide an overview of the traditional ones (full text scanning, inversion, signature files and clustering). In the second part we discuss attempts to include semantic information (natural language processing, latent semantic indexing and neural networks).
Fast Incremental Indexing for Full-Text Information Retrieval, 1994
Cited by 69 (3 self)
Full-text information retrieval systems have traditionally been designed for archival environments. They often provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed. Modern applications, such as information filtering, operate in dynamic environments that require frequent additions to document collections. We provide this ability using a traditional inverted file index built on top of a persistent object store. The data management facilities of the persistent object store are used to produce efficient incremental update of the inverted lists. We describe our system and present experimental results showing superior incremental indexing and competitive query processing performance.
Keywords: full-text document retrieval, incremental indexing, persistent object store, performance
1 Introduction
Full-text information retrieval (IR) systems are well established tools for satisfying a user's inf...
Indexing and retrieval of scientific literature - Proceedings of the 8th International Conference on Information and Knowledge Management, 1999
Cited by 68 (14 self)
The web has greatly improved access to scientific literature. However, scientific articles on the web are largely disorganized, with research articles being spread across archive sites, institution sites, journal sites, and researcher homepages. No index covers all of the available literature, and the major web search engines typically do not index the content of Postscript/PDF documents at all. This paper discusses the creation of digital libraries of scientific literature on the web, including the efficient location of articles, full-text indexing of the articles, autonomous citation indexing, information extraction, display of query-sensitive summaries and citation context, hubs and authorities computation, similar document detection, user profiling, distributed error correction, graph analysis, and detection of overlapping documents. The software for the system is available at no cost for non-commercial use.
An Object-Oriented Architecture for Text Retrieval - In Conference Proceedings of RIAO'91, Intelligent Text and Image Handling, 1991
Cited by 35 (10 self)
For almost all aspects of information access systems it is still the case that their optimal composition and functionality is hotly debated. Moreover, different application scenarios put different demands on individual components. It is therefore of the essence to be able to quickly build systems that permit exploration of different designs and implementation strategies. This paper presents a software implementation architecture for text retrieval systems that facilitates (a) functional modularization (b) mix-and-match combination of module implementations and (c) definition of inter-module protocols. We show how an object-oriented approach easily accommodates this type of architecture. The design principles are exemplified by code examples in Common Lisp. Taken together these code examples constitute an operational retrieval system. The design principles and protocols implemented have also been instantiated in a large scale retrieval prototype in our research laboratory.
1 Introductio...
In-Place versus Re-Build versus Re-Merge: Index Maintenance Strategies for . . . - In Proceedings of the 27th Conference on Australasian Computer Science, 2004
Cited by 27 (2 self)
Indexes are the key technology underpinning efficient text search. A range of algorithms have been developed for fast query evaluation and for index creation, but update algorithms for high-performance indexes have not been evaluated or even fully described. In this paper, we explore the three main alternative strategies for index update: in-place update, index merging, and complete re-build. Our experiments with large volumes of web data show that re-merge is for large numbers of updates the fastest approach, but in-place update is suitable when the rate of update is low or buffer size is limited.
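To make the contrast between two of the strategies named in this abstract concrete, the sketch below applies the same buffered update either by re-merging into a fresh index or by updating lists in place. It is a simplified in-memory illustration, not the paper's algorithms: plain dictionaries stand in for on-disk structures, the function names are hypothetical, and new document IDs are assumed to be larger than all existing ones so that concatenation keeps each list sorted.

```python
def remerge(on_disk, buffer):
    """Write a fresh index by merging every existing list with buffered postings.

    Every term is read and rewritten, even those with no new postings.
    """
    merged = {}
    for term in sorted(set(on_disk) | set(buffer)):
        merged[term] = on_disk.get(term, []) + buffer.get(term, [])
    return merged

def in_place(on_disk, buffer):
    """Append buffered postings to existing lists, touching only updated terms."""
    for term, new_ids in buffer.items():
        on_disk.setdefault(term, []).extend(new_ids)
    return on_disk

# Hypothetical existing index and a buffer holding postings for new document 3.
old = {"index": [0, 2], "text": [1]}
buf = {"text": [3], "update": [3]}

merged = remerge(old, buf)
updated = in_place({t: list(p) for t, p in old.items()}, buf)
```

Both calls produce the same logical index. The difference lies in the work done: `remerge` rewrites every list sequentially, while `in_place` touches only the terms present in the buffer.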
On the Update of Term Weights in Dynamic Information Retrieval Systems - In Proceedings of the 4th International Conference on Knowledge and Information Management, 1995
Cited by 19 (4 self)
Using the vector space information retrieval model, we show that the update of term weights under document insertions is computationally expensive for weighting schemes that use collection statistics and normalization by document vector lengths. In the dynamic setting, we argue that strict adherence to such schemes is impractical and unnecessary as long as retrieval effectiveness commensurate with strict adherence is attained. Experiments using standard test collections as a source of document insertions support this argument. These experiments indicate that term weights may drift from their mathematically defined values without a serious loss of retrieval effectiveness. The only problematic setting is when new terms are present in newly inserted documents. Ignoring these terms can cause an effectiveness degradation.
1 Introduction
The rapid growth in online information has fueled recent interest in techniques to handle the burgeoning flood of data becoming electronically available. ...
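The cost described in this abstract can be seen with a small calculation. The sketch below uses an unnormalized tf-idf weight as one example of a scheme that depends on collection statistics; the abstract does not name a specific formula, so the `idf` and `weight` functions and all the numbers are assumptions chosen for illustration.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency, one common collection statistic."""
    return math.log(num_docs / doc_freq)

def weight(term_freq, num_docs, doc_freq):
    """Unnormalized tf-idf weight of a term in one document."""
    return term_freq * idf(num_docs, doc_freq)

# Hypothetical collection: 100 documents, and the term occurs in 10 of them.
stored = weight(term_freq=3, num_docs=100, doc_freq=10)

# Insert one document that does not contain the term. The collection size
# changes for every term, so every stored weight is now slightly stale.
exact = weight(term_freq=3, num_docs=101, doc_freq=10)
drift = abs(exact - stored)
```

Here `drift` is small, which is in line with the abstract's claim that weights may drift without much harm, yet bringing `stored` back to `exact` would mean recomputing the weight of every posting in the index after a single insertion.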
Fast Inverted Indexes with On-Line Update, 1994
Cited by 19 (2 self)
We describe data structures and an update strategy for the practical implementation of inverted indexes. The context of our discussion is the construction of a dedicated index engine for a distributed full-text information retrieval system, but the results have wider application. Retrieval operations require a single disk access per query term. The on-line update strategy guarantees the consistency of on-disk data structures. Index compression integrates smoothly.
1 Introduction
1.1 Environment
Our general concern is the construction of a distributed full-text information retrieval system. The basic architecture consists of a group of LAN-connected processors, each managing its own separate disk and memory. Individual processors act as either text servers, storing documents and servicing requests for portions of these documents, or as index engines, identifying the portions of documents that match client-generated search criteria. To external clients, the group of machines appears to ...
Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System, 1995
Cited by 19 (2 self)
We describe data structures and an update strategy for the implementation of dynamic inverted indexes in the context of a dedicated index engine for a distributed full-text retrieval system. Except in rare cases, retrieval operations require a single disk access per query term. The on-line update strategy guarantees the consistency of on-disk data structures across node failures. Index compression integrates smoothly. We examine the performance of the system both experimentally and through an analytical comparison with a competing B-tree based approach.
1 Introduction
1.1 Environment
Our general concern is the construction of a distributed full-text retrieval system. The architecture consists of a group of LAN-connected processors, each managing its own separate disk and memory. Individual processors act as either text servers, storing documents and servicing requests for portions of these documents, or as index engines, identifying the portions of documents that match client-generate...

