Results 1 - 10
of
23
In-Place versus Re-Build versus Re-Merge: Index Maintenance Strategies for . . .
- IN PROCEEDINGS OF THE 27TH CONFERENCE ON AUSTRALASIAN COMPUTER SCIENCE
, 2004
"... Indexes are the key technology underpinning efficient text search. A range of algorithms have been developed for fast query evaluation and for index creation, but update algorithms for high-performance indexes have not been evaluated or even fully described. In this paper, we explore the three main ..."
Abstract
-
Cited by 27 (2 self)
- Add to MetaCart
Indexes are the key technology underpinning efficient text search. A range of algorithms have been developed for fast query evaluation and for index creation, but update algorithms for high-performance indexes have not been evaluated or even fully described. In this paper, we explore the three main alternative strategies for index update: in-place update, index merging, and complete re-build. Our experiments with large volumes of web data show that re-merge is for large numbers of updates the fastest approach, but in-place update is suitable when the rate of update is low or buffer size is limited.
A Security Model for FullText File System Search in Multi-User Environments
- In Proceedings of the FAST
, 2005
"... Most desktop search systems maintain per-user indices to keep track of file contents. In a multi-user environment, this is not a viable solution, because the same file has to be indexed many times, once for every user that may access the file, causing both space and performance problems. Having a si ..."
Abstract
-
Cited by 18 (3 self)
- Add to MetaCart
Most desktop search systems maintain per-user indices to keep track of file contents. In a multi-user environment, this is not a viable solution, because the same file has to be indexed many times, once for every user that may access the file, causing both space and performance problems. Having a single system-wide index for all users, on the other hand, allows for efficient indexing but requires special security mechanisms to guarantee that the search results do not violate any file permissions. We present a security model for full-text file system search, based on the UNIX security model, and discuss two possible implementations of the model. We show that the first implementation, based on a postprocessing approach, allows an arbitrary user to obtain information about the content of files for which he does not have read permission. The second implementation does not share this problem. We give an experimental performance evaluation for both implementations and point out query optimization opportunities for the second one. 1
Fast On-Line Index Construction by Geometric Partitioning
- In CIKM ’05: Proceedings of the 14th ACM international conference on Information and knowledge management
, 2005
"... Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line mergebased methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line index construction – that is, how to build an in ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line mergebased methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line index construction – that is, how to build an inverted index when the underlying data must be continuously queryable, and the documents must be indexed and available for search as soon they are inserted. When straightforward approaches are used, document insertions become increasingly expensive as the size of the database grows. This paper describes a mechanism based on controlled partitioning that can be adapted to suit different balances of insertion and querying operations, and is faster and scales better than previous methods. Using experiments on 100 GB of web data we demonstrate the efficiency of our methods in practice, showing that they dramatically reduce the cost of on-line index construction.
Indexing shared content in information retrieval systems
- In Proc. of the 10th Int. Conf. on Extending Database Technology
, 2006
"... Abstract. Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation mode ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
Abstract. Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it. 1
Trustworthy keyword search for regulatorycompliant record retention
- In Proceedings of the 32nd International Conference on Very Large Data Bases
, 2006
"... Recent litigation and intense regulatory focus on secure retention of electronic records have spurred a rush to introduce Write-Once-Read-Many (WORM) storage devices for retaining business records such as electronic mail. However, simply storing records in WORM storage is insufficient to ensure that ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
Recent litigation and intense regulatory focus on secure retention of electronic records have spurred a rush to introduce Write-Once-Read-Many (WORM) storage devices for retaining business records such as electronic mail. However, simply storing records in WORM storage is insufficient to ensure that the records are trustworthy, i.e., able to provide irrefutable proof and accurate details of past events. Specifically, some form of index is needed for timely access to the records, but unless the index is maintained securely, the records can in effect be hidden or altered, even if stored in WORM storage. In this paper, we systematically analyze the requirements for establishing a trustworthy inverted index to enable keyword-based search queries. We propose a novel scheme for efficient creation of such an index and demonstrate, through extensive simulations and experiments with an enterprise keyword search engine, that the scheme can achieve online update speeds while maintaining good query performance. In addition, we present a secure index structure for multi-keyword queries that supports insert, lookup and range queries in time logarithmic in the number of documents. 1.
Hybrid Index Maintenance for Growing Text Collections
- In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
, 2006
"... We present a new family of hybrid index maintenance strategies to be used in on-line index construction for monotonically growing text collections. These new strategies improve upon recent results for hybrid index maintenance in dynamic text retrieval systems. Like previous techniques, our new metho ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
We present a new family of hybrid index maintenance strategies to be used in on-line index construction for monotonically growing text collections. These new strategies improve upon recent results for hybrid index maintenance in dynamic text retrieval systems. Like previous techniques, our new method distinguishes between short and long posting lists: While short lists are maintained using a merge strategy, long lists are kept separate and are updated in-place. This way, costly relocations of long posting lists are avoided.
UMass at TREC 2004: Notebook
- In TREC 2004
, 2004
"... The retrieval model implemented in the Indri search engine is an enhanced version of the model described in [30], which combines the language modeling [35] and inference network [38] approaches to information retrieval. The resulting model allows structured queries similar to those used in INQUERY [ ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
The retrieval model implemented in the Indri search engine is an enhanced version of the model described in [30], which combines the language modeling [35] and inference network [38] approaches to information retrieval. The resulting model allows structured queries similar to those used in INQUERY [4] to be evaluated using language modeling estimates within the network, rather than tf.idf estimates. Figure 1.1 shows a graphical model representation of the network. As in the original inference network framework, documents are ranked according to P(I|D, α, β), the belief the information need I is met given document D and hyperparameters α and β as evidence. Due to space limitations, a general understanding of the inference network framework is assumed. See [30] and [38] to fill in any missing details. 1.1.1 DOCUMENT REPRESENTATION
A Hybrid Approach to Index Maintenance in Dynamic Text Retrieval Systems
- In ECIR 2006: Proceedings of the 28th European Conference on Information Retrieval
, 2006
"... In-place and merge-based index maintenance are the two main competing strategies for on-line index construction in dynamic information retrieval systems based on inverted lists. Motivated by recent results for both strategies, we investigate possible combinations of in-place and merge-based inde ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
In-place and merge-based index maintenance are the two main competing strategies for on-line index construction in dynamic information retrieval systems based on inverted lists. Motivated by recent results for both strategies, we investigate possible combinations of in-place and merge-based index maintenance. We present a hybrid approach in which long posting lists are updated in-place, while short lists are updated using a merge strategy. Our experimental results show that this hybrid approach achieves better indexing performance than either method (in-place, merge-based) alone.
Efficient online index maintenance for contiguous inverted lists
- Inf. Process. Manage
"... Search engines and other text retrieval systems use high-performance inverted indexes to provide efficient text query evaluation. Algorithms for fast query evaluation and index construction are well-known, but relatively little has been published concerning update. In this paper, we experimentally e ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Search engines and other text retrieval systems use high-performance inverted indexes to provide efficient text query evaluation. Algorithms for fast query evaluation and index construction are well-known, but relatively little has been published concerning update. In this paper, we experimentally evaluate the two main alternative strategies for index maintenance in the presence of insertions, with the constraint that inverted lists remain contiguous on disk for fast query evaluation. The in-place and re-merge strategies are benchmarked against the baseline of a complete re-build. Our experiments with large volumes of web data show that re-merge is the fastest approach if large buffers are available, but that even a simple implementation of in-place update is suitable when the rate of insertion is low or memory buffer size is limited. We also show that with careful design of aspects of implementation such as free-space management, in-place update can be improved by around an order of magnitude over a naïve implementation. Keywords: Text indexing, search engines, index construction, index update. This paper incorporates and extends material from “In-place versus re-build versus re-merge: Index maintenance
Multi-User File System Search
, 2007
"... Information retrieval research usually deals with globally visible, static document collections. Practical applications, in contrast, like file system search and enterprise search, have to cope with highly dynamic text collections and have to take into account user-specific access permissions when ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Information retrieval research usually deals with globally visible, static document collections. Practical applications, in contrast, like file system search and enterprise search, have to cope with highly dynamic text collections and have to take into account user-specific access permissions when generating the results to a search query. The goal of this thesis is to close the gap between information retrieval research and the requirements exacted by these real-life applications. The algorithms and data structures presented in this thesis can be used to implement a file system search engine that is able to react to changes in the file system by updating its index data in real time. File changes (in-sertions, deletions, or modifications) are reflected by the search results within a few seconds,

