Results 1 - 10
of
16
Efficient Single-Pass Index Construction for Text Databases
- Jour. of the American Society for Information Science and Technology
, 2003
"... Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this paper, we review the principal approaches to inversion, analyse their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approa ..."
Abstract
-
Cited by 31 (2 self)
- Add to MetaCart
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this paper, we review the principal approaches to inversion, analyse their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
Searchable Words on the Web
- International Journal of Digital Libraries
, 2001
"... In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space req ..."
Abstract
-
Cited by 14 (6 self)
- Add to MetaCart
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of world-wide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries
"... Ongoing changes in computer performance are affecting the efficiency of string sorting algorithms. The size of main memory in typical computers continues to grow, but memory accesses require increasing numbers of instruction cycles, which is a problem for the most efficient of the existing string-so ..."
Abstract
-
Cited by 10 (4 self)
- Add to MetaCart
Ongoing changes in computer performance are affecting the efficiency of string sorting algorithms. The size of main memory in typical computers continues to grow, but memory accesses require increasing numbers of instruction cycles, which is a problem for the most efficient of the existing string-sorting algorithms as they do not utilise cache particularly well for large data sets. We propose a new sorting algorithm for strings, burstsort, based on dynamic construction of a compact trie in which strings are kept in buckets. It is simple, fast, and efficient. We experimentally compare burstsort to existing string-sorting algorithms on large and small sets of strings with a range of characteristics. These experiments show that, for large sets of strings, burstsort is almost twice as fast as any previous algorithm, due primarily to a lower rate of cache miss.
Efficient online index maintenance for contiguous inverted lists
- Inf. Process. Manage
"... Search engines and other text retrieval systems use high-performance inverted indexes to provide efficient text query evaluation. Algorithms for fast query evaluation and index construction are well-known, but relatively little has been published concerning update. In this paper, we experimentally e ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Search engines and other text retrieval systems use high-performance inverted indexes to provide efficient text query evaluation. Algorithms for fast query evaluation and index construction are well-known, but relatively little has been published concerning update. In this paper, we experimentally evaluate the two main alternative strategies for index maintenance in the presence of insertions, with the constraint that inverted lists remain contiguous on disk for fast query evaluation. The in-place and re-merge strategies are benchmarked against the baseline of a complete re-build. Our experiments with large volumes of web data show that re-merge is the fastest approach if large buffers are available, but that even a simple implementation of in-place update is suitable when the rate of insertion is low or memory buffer size is limited. We also show that with careful design of aspects of implementation such as free-space management, in-place update can be improved by around an order of magnitude over a naïve implementation. Keywords: Text indexing, search engines, index construction, index update. This paper incorporates and extends material from “In-place versus re-build versus re-merge: Index maintenance
Cache-efficient string sorting using copying
- In submission
, 2006
"... Abstract. Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
Abstract. Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a list or array; this approach was found to be up to twice as fast as the previous best string sorts, mostly because of a sharp reduction in out-of-cache references. In this paper we introduce C-burstsort, which copies the unexamined tail of each key to the bucket and discards the original key to improve data locality. On both Intel and PowerPC architectures, and on a wide range of string types, we show that sorting is typically twice as fast as our original burstsort, and four to five times faster than multikey quicksort and previous radixsorts. A variant that copies both suffixes and record pointers to buckets, CP-burstsort, uses more memory but provides stable sorting. In current computers, where performance is limited by memory access latencies, these new algorithms can dramatically reduce the time needed for internal sorting of large numbers of strings. 1
Multi-User File System Search
, 2007
"... Information retrieval research usually deals with globally visible, static document collections. Practical applications, in contrast, like file system search and enterprise search, have to cope with highly dynamic text collections and have to take into account user-specific access permissions when ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Information retrieval research usually deals with globally visible, static document collections. Practical applications, in contrast, like file system search and enterprise search, have to cope with highly dynamic text collections and have to take into account user-specific access permissions when generating the results to a search query. The goal of this thesis is to close the gap between information retrieval research and the requirements exacted by these real-life applications. The algorithms and data structures presented in this thesis can be used to implement a file system search engine that is able to react to changes in the file system by updating its index data in real time. File changes (in-sertions, deletions, or modifications) are reflected by the search results within a few seconds,
Memory Management Strategies for Single-Pass Index Construction in Text Retrieval Systems
, 2005
"... Many text retrieval systems construct their index by accumulating postings in main memory until there is no more memory available and then creating an on-disk index from the in-memory data. When the entire text collection has been read, all on-disk indices are combined into one big index through a m ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Many text retrieval systems construct their index by accumulating postings in main memory until there is no more memory available and then creating an on-disk index from the in-memory data. When the entire text collection has been read, all on-disk indices are combined into one big index through a multiway merge process.
Self-adjusting of ternary search tries using conditional rotations and randomized heuristics
- Comput. J
, 2005
"... A Ternary Search Trie (TST) is a highly efficient dynamic dictionary structure applicable for strings and textual data. The strings are accessed based on a set of access probabilities and are to be arranged using a TST. We consider the scenario where the probabilities are not known a priori, and is ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
A Ternary Search Trie (TST) is a highly efficient dynamic dictionary structure applicable for strings and textual data. The strings are accessed based on a set of access probabilities and are to be arranged using a TST. We consider the scenario where the probabilities are not known a priori, and is time-invariant. Our aim is to adaptively restructure the TST so as to yield the best access or retrieval time. Unlike the case of lists and binary search trees, where numerous methods have been proposed, in the case of the TST, currently, the number of reported adaptive schemes are few. In this paper, we consider various self-organizing schemes that were applied to Binary Search Trees, and apply them to TSTs. Three new schemes, which are the splaying, the conditional rotation and the randomization heuristics, have been proposed, tested and comparatively presented. The results demonstrate that the conditional rotation heuristic is the best when compared to other heuristics that are considered in the paper.
Efficient Trie-Based Sorting of Large Sets of Strings
- Proceedings of the Australasian Computer Science Conference
, 2003
"... Sorting is a fundamental algorithmic task. Many general-purpose sorting algorithms have been developed, but efficiency gains can be achieved by designing algorithms for specific kinds of data, such as strings. In previous work we have shown that our burstsort, a trie-based algorithm for sorting stri ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Sorting is a fundamental algorithmic task. Many general-purpose sorting algorithms have been developed, but efficiency gains can be achieved by designing algorithms for specific kinds of data, such as strings. In previous work we have shown that our burstsort, a trie-based algorithm for sorting strings, is for large data sets more efficient than all previous algorithms for this task. In this paper we re-evaluate some of the implementation details of burstsort, in particular the method for managing buckets held at leaves. We show that better choice of data structures further improves the efficiency, at a small additional cost in memory. For sets of around 30,000,000 strings, our improved burstsort is nearly twice as fast as the previous best sorting algorithm.
Performance of Data Structures for Small Sets of Strings
- Proc. of the Australasian conference on Computer Science
, 2002
"... Fundamental structures such as trees and hash tables are used for managing data in a huge variety of circumstances. Making the right choice of structure is essential to efficiency. In previous work we have explored the performance of a range of data structures -- different forms of trees, tries, and ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Fundamental structures such as trees and hash tables are used for managing data in a huge variety of circumstances. Making the right choice of structure is essential to efficiency. In previous work we have explored the performance of a range of data structures -- different forms of trees, tries, and hash tables -- for the task of managing sets of millions of strings, and have developed new variants of each that are more efficient for this task than previous alternatives. In this paper we test the performance of the same data structures on small sets of strings, in the context of document processing for index construction. Our results show that the new structures, in particular our burst trie, are the most efficient choice for this task, thus demonstrating that they are suitable for managing sets of hundreds to millions of distinct strings, and for input of hundreds to billions of occurrences.

