Inverted files for text search engines
- ACM Computing Surveys, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
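As a rough illustration of the index representation this tutorial surveys, the following is a minimal in-memory sketch of an inverted file with conjunctive (AND) query evaluation. It is not the paper's core implementation: the function names, the whitespace tokenizer, and the sample documents are all assumptions made for the example, and compression, ranking, and on-disk layout are left out.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def conjunctive_query(index, terms):
    """Return the ids of documents that contain every query term."""
    lists = [index.get(t.lower(), []) for t in terms]
    if not lists:
        return []
    # Start from the shortest postings list to keep the candidate set small.
    lists.sort(key=len)
    result = set(lists[0])
    for plist in lists[1:]:
        result &= set(plist)
    return sorted(result)

# Illustrative documents (not taken from the paper).
docs = [
    "inverted files for text search",
    "signature files for text indexing",
    "inverted files versus signature files",
]
index = build_inverted_index(docs)
print(conjunctive_query(index, ["inverted", "files"]))
```

Processing the shortest postings list first is a common heuristic for AND queries, since the intermediate result can never be larger than the smallest list.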
Inverted files versus signature files for text indexing
- ACM Transactions on Database Systems, 1998
Cited by 74 (3 self)
Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We have examined their relative performance using both experimentation and a refined approach to modeling of signature files, and demonstrate that inverted files are distinctly superior to signature files. Not only can inverted files be used to evaluate typical queries in less time than can signature files, but inverted files require less space and provide greater functionality. Our results also show that a synthetic text database can provide a realistic indication of the behavior of an actual text database. The tools used to generate the synthetic database have been made publicly available.
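For readers unfamiliar with the second method being compared, here is a small sketch of a sequential signature file using superimposed coding. It only illustrates the general idea; the signature width `F`, the bits-per-term `M`, the SHA-256 based hashing, and the sample documents are assumptions for this example and are not the configuration used in the paper.

```python
import hashlib

F = 64  # signature width in bits (assumed value)
M = 3   # number of bits set per term (assumed value)

def term_signature(term):
    """Set M hash-selected bits (possibly coinciding) in an F-bit word."""
    sig = 0
    for i in range(M):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % F)
    return sig

def record_signature(text):
    """Superimpose (bitwise OR) the signatures of all terms in a record."""
    sig = 0
    for term in text.lower().split():
        sig |= term_signature(term)
    return sig

def search(docs, signatures, term):
    """Filter by signature, then verify candidates against the text."""
    term = term.lower()
    q = term_signature(term)
    candidates = [i for i, s in enumerate(signatures) if s & q == q]
    matches = [i for i in candidates if term in docs[i].lower().split()]
    # Candidates that pass the filter but fail verification are false drops.
    false_drops = [i for i in candidates if i not in matches]
    return matches, false_drops

# Illustrative documents (not taken from the paper).
docs = [
    "inverted files for text search",
    "signature files for text indexing",
    "inverted files versus signature files",
]
signatures = [record_signature(d) for d in docs]
matches, false_drops = search(docs, signatures, "signature")
print(matches, false_drops)
```

The filter can never miss a true match, because a record's signature contains every bit of each of its terms; the cost of the scheme lies in scanning all signatures and in verifying false drops.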
Improved Methods for Signature-Tree Construction
- The Computer Journal, 2000
Cited by 14 (4 self)
... we locate a number of reasons for this problem and propose several methods for node splitting and partial-tree restructuring, which lead to improved query-response times. We have implemented all methods and we present experimental results, which indicate that the proposed methods are superior in all cases to the standard one and up to 5-10 times better for medium and higher weights in inclusive (partial match) queries. Additionally, we have developed new functions for the performance estimation of signature trees which, in contrast to a previous estimation function, are able to take into account the outcome of different split methods and to provide more accurate estimation.
Signature-based Structures for Objects with Set-valued Attributes
- 2002
Cited by 8 (1 self)
Aiming at the efficient retrieval of objects with set-valued attributes, we introduce three variations of a new method in order to satisfy subset and superset queries. Our approach is to combine the advantages of two access methods, linear hashing and tree-shaped methods, on which other similar methods have been previously reported as well. Analytical performance-estimation functions for each particular method are presented, followed by a thorough experimental comparison of all investigated structures, where analytical and experimental results deviate by 10% on average. Finally, the results of this performance evaluation are presented and discussed, clearly showing the superiority of the new methods, which reach an improvement of up to 85%.
Iterative-Improvement-Based Declustering Heuristics For Multi-Disk Databases
- 2005
Cited by 5 (1 self)
Data declustering is an important issue for reducing query response times in multi-disk database systems. In this paper, we propose a declustering method that utilizes the available information on query distribution, data distribution, data-item sizes, and disk capacity constraints. The proposed method exploits the natural correspondence between a data set with a given query distribution and a hypergraph. We define an objective function that exactly represents the aggregate parallel query-response time for the declustering problem and adapt the iterative-improvement-based heuristics successfully used in hypergraph partitioning to this objective function. We propose a two-phase algorithm that first obtains an initial K-way declustering by recursively bipartitioning the data set, then applies multi-way refinement on this declustering. We provide effective gain models and efficient implementation schemes for both phases. The experimental results on a wide range of realistic data sets show that the proposed method provides a significant performance improvement compared with the state-of-the-art declustering strategy based on similarity-graph partitioning.
Author Keywords: Parallel database systems
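To make the notion of an aggregate parallel query-response time concrete, the sketch below evaluates one plausible form of such an objective. It rests on assumptions that the abstract does not state: every data item has unit size, a query's parallel response time is the largest number of its items stored on any single disk, and the aggregate is the frequency-weighted sum over queries. The function names and the toy data are hypothetical, and the paper's actual objective function and gain models may differ.

```python
from collections import Counter

def query_response_time(query_items, assignment):
    """Parallel response time of one query: the load on its busiest disk."""
    loads = Counter(assignment[item] for item in query_items)
    return max(loads.values(), default=0)

def aggregate_response_time(queries, assignment):
    """Frequency-weighted sum of per-query parallel response times."""
    return sum(freq * query_response_time(items, assignment)
               for items, freq in queries)

# Each query is (items it requests, relative frequency); toy data only.
queries = [
    (("a", "b", "c", "d"), 2),
    (("a", "b"), 1),
    (("c", "d"), 1),
]

# Two ways of placing four items on two disks (disk ids 0 and 1).
clustered = {"a": 0, "b": 0, "c": 1, "d": 1}
declustered = {"a": 0, "b": 1, "c": 0, "d": 1}

print(aggregate_response_time(queries, clustered))    # 8
print(aggregate_response_time(queries, declustered))  # 6
```

In the hypergraph view mentioned in the abstract, each query corresponds to a net connecting the data items it requests, so moving an item between disks changes the cost only of the queries that contain it.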
Performance Evaluation of Parallel S-trees
- 2000
Cited by 4 (2 self)
The S-tree is a dynamic height-balanced tree similar in structure to B+trees. S-trees store fixed-length bit-strings, which are called signatures. Signatures are used for indexing textbases and relational, object-oriented, and extensible databases, as well as in data mining. In this article, methods of designing multi-disk B-trees are adapted to S-trees and new methods of parallelizing S-trees are developed. The resulting structures aim at achieving performance gain by accessing two or more disks simultaneously. In addition, two different searching techniques that exploit parallel disk accessing are devised. Performance results of experiments based on the new structures and searching techniques are also presented and discussed.
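The following sketch shows how an inclusive (partial match) search over a signature hierarchy of this kind can prune subtrees. It assumes, as is usual for S-trees, that a node's signature is the bitwise OR of the signatures beneath it. Insertion, node splitting, height balancing, and the multi-disk placement studied in the article are all omitted, and the `Node` class and the 4-bit signatures are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)  # child Nodes (internal node)
    entries: list = field(default_factory=list)   # (signature, record_id) pairs (leaf)

    def signature(self):
        """Superimposed (OR-ed) signature of everything below this node.

        Recomputed on every call for brevity; a real structure would
        store it in the parent entry.
        """
        sig = 0
        for child in self.children:
            sig |= child.signature()
        for s, _ in self.entries:
            sig |= s
        return sig

def inclusive_search(node, query):
    """Return ids of records whose signature contains all bits of query."""
    # Prune: if the node's signature lacks a query bit, no descendant has it.
    if node.signature() & query != query:
        return []
    hits = [rid for s, rid in node.entries if s & query == query]
    for child in node.children:
        hits.extend(inclusive_search(child, query))
    return hits

# A static two-level tree built by hand (toy data only).
leaf_a = Node(entries=[(0b0011, "r1"), (0b0110, "r2")])
leaf_b = Node(entries=[(0b1100, "r3"), (0b1001, "r4")])
root = Node(children=[leaf_a, leaf_b])

print(inclusive_search(root, 0b0010))  # ['r1', 'r2']
print(inclusive_search(root, 0b0101))  # []
```

The second query passes the check at both leaves yet matches no record, which illustrates why heavier node signatures weaken pruning.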
On the Signature Tree Construction and Analysis
- 2006
Cited by 3 (0 self)
... well as hypertext and multimedia systems, need to handle complex data structures with set-valued attributes, which can be represented as bit strings, called signatures. A set of signatures can be stored in a file, called a signature file. In this paper, we propose a new method to organize a signature file into a tree structure, called a signature tree, to speed up the signature file scanning and query evaluation. In addition, the average time complexity of searching a signature tree is analyzed and how to maintain a signature tree on disk is discussed. We also conducted experiments, which show that the approach of signature trees provides a promising index structure.
Optimization of Signature File Parameters for Databases with Varying Record Lengths
- 1999
Cited by 2 (2 self)
... this paper we provide two example cases for this purpose and study the performance of IFD in the conventional sequential signature file (SSF) method and a new vertical partitioning environment, the multiframe signature file (MFSF) method, that we introduced in our recent study [10, 11, 18]. For this purpose we developed a test environment and implemented the SSF and MFSF methods. We extended these methods to use IFD and tested their performance with real data. The experiments show that IFD improves the performance of the inspected methods by reducing the observed FD and the (query) response time. (Further experiments with similar results involving a generalized frame sliced signature file approach are reported in Kocberber [18].) The organization of the paper is as follows. In Section 2, the conventional FD estimation method and the proposed FD estimation method, IFD, are explained. Section 3 explains the test environment used in the experiments. In Sections 4 and 5, we apply IFD to the SSF and MFSF methods, respectively, and measure the performance improvements obtained by IFD experimentally with real data. Section 6 provides the conclusion. In the Appendix we provide a formal proof which shows that under certain conditions the number of false drop records (FD) estimated by considering the average number of terms in the records is less than or equal to the FD estimated by considering individual D values of the records.
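The difference between estimating FD from the average number of terms and from individual D values can be illustrated numerically. The sketch below uses the textbook superimposed-coding approximation for a single-term query, in which a record with D terms, signature width F, and m bits set per term yields a false drop with probability (1 - (1 - m/F)^D)^m. This formula, the symbols F and m, the function names, and the sample values are assumptions for the example; the paper's own estimation functions are not reproduced here.

```python
def false_drop_probability(D, F, m):
    """Approximate chance that a record with D terms falsely matches a
    single-term query, for F-bit signatures with m bits set per term."""
    bit_density = 1.0 - (1.0 - m / F) ** D  # fraction of bits expected to be 1
    return bit_density ** m

def fd_from_average(Ds, F, m):
    """Expected false drops when every record is assumed to have average D."""
    avg = sum(Ds) / len(Ds)
    return len(Ds) * false_drop_probability(avg, F, m)

def fd_from_individual(Ds, F, m):
    """Expected false drops when each record's own D value is used."""
    return sum(false_drop_probability(D, F, m) for D in Ds)

# Toy collection with strongly varying record lengths (assumed values).
record_lengths = [10, 90]
F, m = 1000, 10

print(fd_from_average(record_lengths, F, m))
print(fd_from_individual(record_lengths, F, m))
```

For these values the average-based estimate is far smaller than the per-record estimate, because the long record dominates the expected number of false drops. The ordering is not guaranteed for every parameter choice, which is consistent with the abstract's qualification "under certain conditions".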

