Lexicographical Indices for Text: Inverted files vs. PAT trees
, 1991
Cited by 23 (0 self)
We survey two indices for text, with emphasis on Pat arrays (also called suffix arrays). A Pat array is an index based on a new model of text which does not use the concept of word and does not need to know the structure of the text. To appear in Information Retrieval: Data Structures and Algorithms, R.A. Baeza-Yates and W. Frakes, eds., Prentice-Hall. 1 Introduction. Text searching methods may be classified as lexicographical indices (indices that are sorted), clustering techniques, and indices based on hashing (for example, signature files [FC87]). In this report we discuss lexicographical indices, in particular, two main data structures: inverted files and Pat trees. Our aim is to build an index for the text of size similar to or smaller than the text. Briefly, the traditional model of text used in information retrieval is that of a set of documents. Each document is assigned a list of keywords (attributes), with optional relevance weights associated with each keyword. This ...
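The Pat array (suffix array) idea in the abstract above can be illustrated with a minimal sketch. This is not the survey's own construction: it assumes a naive sort of all suffixes, which is fine for short strings only, and the function names `build_suffix_array` and `find_occurrences` are hypothetical. It shows how an index with no notion of "word" still answers arbitrary substring queries by binary search.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in lexicographic order.

    Naive construction (assumption for illustration): sorting slices costs
    O(n^2 log n) in the worst case, so this is only suitable for small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all positions where pattern occurs in text, in increasing order."""
    m = len(pattern)

    # First binary search: leftmost suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: leftmost suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [start, lo) begins with pattern.
    return sorted(suffix_array[start:lo])


text = "abracadabra"
suffix_array = build_suffix_array(text)
print(find_occurrences(text, suffix_array, "bra"))  # prints [1, 8]
```

Because the suffixes are sorted, all matches for a pattern occupy one contiguous range of the array, which is why two binary searches are enough.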
Using Difficulty of Prediction to Decrease Computation: Fast Sort, Priority Queue and Convex Hull on Entropy Bounded Inputs
Cited by 15 (4 self)
There has recently been an upsurge of interest in the Markov model, and in more general stationary ergodic stochastic distributions, in the theoretical computer science community (e.g., see [Vitter,Krishnan91], [Karlin,Philips,Raghavan92], [Raghavan92] for the use of Markov models for on-line algorithms, e.g., caching and prefetching). Their results used the fact that compressible sources are predictable (and vice versa), and showed that on-line algorithms can improve their performance by prediction. Actual page access sequences are in fact somewhat compressible, so their predictive methods can be of benefit. This paper investigates the interesting idea of decreasing computation by using learning in the opposite way, namely to determine the difficulty of prediction. That is, we will approximately learn the input distribution, and then improve the performance of the computation when the input is not too predictable, rather than the reverse. To our knowledge,
Faster Searching in Tries and Quadtrees -- An Analysis of Level Compression
Cited by 11 (6 self)
We analyze the behavior of the level-compressed trie, LC-trie, a compact version of the standard trie data structure. Based on this analysis, we argue that level compression improves the performance of both tries and quadtrees considerably in many practical situations. In particular, we show that LC-tries can be of great use for string searching in compressed text. Both tries and quadtrees are extensively used and much effort has been spent obtaining detailed analyses. Since the LC-trie performs significantly better than standard tries, for a large class of common distributions, while still being easy to implement, we believe that the LC-trie is a strong candidate for inclusion in the standard repertoire of basic data structures.
An Experimental Study of Compression Methods for Dynamic Tries
Cited by 9 (1 self)
We study an order-preserving general purpose data structure for binary data, the LPC-trie. The structure is a compressed trie, using both level and path compression. The memory usage is similar to that of a balanced binary search tree, but the expected average depth is smaller. The LPC-trie is well suited to modern language environments with efficient memory allocation and garbage collection. We present an implementation in the Java programming language and show that the structure compares favorably to a balanced binary search tree.
Trie methods for text and spatial data on secondary storage
, 1995
Cited by 7 (2 self)
This thesis presents three trie organizations for various binary tries. The new trie structures have two distinctive features: (1) they store no pointers and require two bits per node in the worst case, and (2) they partition tries into pages and are suitable for secondary storage. We apply trie structures to indexing, storing and querying both text and spatial data on secondary storage. We are interested in practical problems such as storage compactness, I/O efficiency, and large trie construction. We use our tries to index and search arbitrary substrings of a text. For an index of 100 million keys, our trie is 10% to 25% smaller than the best known method. This difference is important since the index size is crucial for trie methods. We provide methods for dynamic tries and allow texts to be changed. We also use our tries to compress and approximately search large dictionaries. Our algorithm can find strings with k mismatches in sublinear time. To our knowledge, no other published sublinear algorithm is known for this problem. In addition, we use our tries to store and query spatial data such as maps. A trie structure is proposed to permit querying and retrieving spatial data at arbitrary levels of resolution, without reading from secondary storage any more data than is needed for the specified resolution. The trie structure also compresses spatial data substantially. The performance results on map data have confirmed our expectations: the querying cost is linear in the amount of data needed and independent of the data size in practice. We give algorithms for a set of sample queries including geometrical selection, geometrical join and the nearest neighbour. We also show how to control query cost by specifying an acceptable resolution.
Trie Methods for Structured Data on Secondary Storage
, 2000
Cited by 1 (0 self)
We apply trie structures to indexing, storing and querying structured data on secondary storage. We are interested in storage compactness, I/O efficiency, order-preserving properties, general orthogonal range queries and exact match queries for very large files and databases. We also apply trie structures to relational joins (set operations). We compare trie structures to various data structures on secondary storage: multipaging and grid files in the direct access method category, R-trees/R*-trees and X-trees in the logarithmic access cost category, as well as some representative join algorithms for performing join operations. Our results show that range queries by the trie method are superior to these competitors in search cost when queries return more than a few records and are competitive with direct access methods for exact match queries. Furthermore, as the trie structure compresses data, it is the winner in terms of storage compared to all other methods mentioned above. We also present a new tidy function for order-preserving key-to-address transformation. Our tidy function is easy to construct and cheaper in access time and storage cost compared to its closest competitor.
On the Average Depth of Asymmetric LC-tries
- Information Processing Letters
, 2005
Cited by 1 (1 self)
Andersson and Nilsson have already shown that the average depth $D_n$ of random LC-tries is only $\Theta(\log^* n)$ when the keys are produced by a symmetric memoryless process, and that $D_n = O(\log \log n)$ when the process is asymmetric. In this paper we refine the second estimate by showing that asymptotically (with $n \to \infty$): $D_n \sim \frac{1}{\eta} \log \log n$, where $n$ is the number of keys inserted in a trie, $\eta = -\log(1 - h/h_{-\infty})$, $h = -p \log p - q \log q$ is the entropy of a binary memoryless source with probabilities $p$, $q = 1 - p$ ($p \neq q$), and $h_{-\infty} = -\log \min(p, q)$. Key words: average case analysis of algorithms, trie, LC-trie.
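To get a feel for the constant in the estimate above, the leading term (1/η) log log n can be evaluated numerically. The sketch below makes several assumptions: it uses natural logarithms (the abstract does not state a base), it ignores all lower-order terms, and the function names are hypothetical. It is meant only to show how η depends on the source probability p, not to predict measured depths.

```python
import math


def eta(p):
    """Constant eta = -log(1 - h / h_minus_inf) for a binary memoryless source.

    Assumes natural logarithms. Requires 0 < p < 1 and p != 1/2, since the
    asymmetric case is the one covered by the estimate.
    """
    q = 1.0 - p
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if p == q:
        raise ValueError("the estimate applies to asymmetric sources only (p != q)")
    h = -p * math.log(p) - q * math.log(q)   # entropy of the source
    h_minus_inf = -math.log(min(p, q))       # h_{-infinity} in the abstract
    return -math.log(1.0 - h / h_minus_inf)


def leading_depth_term(n, p):
    """Leading term (1 / eta) * log log n of the average LC-trie depth."""
    if n <= math.e:
        raise ValueError("n must exceed e so that log log n is positive")
    return math.log(math.log(n)) / eta(p)


for n in (10**3, 10**6, 10**9):
    print(n, round(leading_depth_term(n, 0.25), 3))
```

Since h is a weighted average of -log p and -log q, it is strictly smaller than h_{-∞} whenever p ≠ q, so η is positive and the expression is well defined.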
Using Learning and Difficulty of Prediction to Decrease Computation: A Fast Sort and Priority Queue on Entropy Bounded Inputs
There has recently been an upsurge of interest in the Markov model, and in more general stationary ergodic stochastic distributions, in the theoretical computer science community (e.g., see [Vitter,Krishnan,FOCS91], [Karlin,Philips,Raghavan,FOCS92], [Raghavan92] for the use of Markov models for on-line algorithms, e.g., caching and prefetching). Their results used the fact that compressible sources are predictable (and vice versa), and showed that on-line algorithms can improve their performance by prediction. Actual page access sequences are in fact somewhat compressible, so their predictive methods can be of benefit. This paper investigates the interesting idea of decreasing computation by using learning in the opposite way, namely to determine the difficulty of prediction. That is, we will approximately learn the input distribution, and then improve the performance of the computation when the input is not too predictable, rather than the reverse. To our knowledge, this is the first case of a computational problem where we do not assume any particular fixed input distribution and yet computation is decreased when the input is less predictable, rather than the reverse. We concentrate our investigation on a basic computational problem, sorting, and a basic data structure problem, maintaining a priority queue. We present the first known case of sorting and priority queue algorithms whose complexity depends on the binary entropy H ≤ 1 of the input keys, where we assume that the input keys are generated from an unknown but arbitrary stationary ergodic source. That is, we assume that each of the input keys can be arbitrarily long, but has entropy H. Note that H
Analysis of a Class of Tries with Adaptive Multi-Digit Branching
We study a class of adaptive multi-digit tries, in which the numbers of digits $r_n$ processed by nodes with $n$ incoming strings are such that, in the memoryless model (with $n \to \infty$): $r_n \to \frac{\log n}{\eta}$ (pr.), where $\eta$ is an algorithm-specific constant. Examples of known data structures from this class include LC-tries (Andersson and Nilsson, 1993), "relaxed" LC-tries (Nilsson and Tikkanen, 1998), tries with logarithmic selection of degrees of nodes, etc. We show that the average depth $D_n$ of such tries in the asymmetric memoryless model has the following asymptotic behavior (with $n \to \infty$): $D_n = \frac{\log \log n}{-\log(1 - h/\eta)} (1 + o(1))$, where $n$ is the number of strings inserted in the trie, and $h$ is the entropy of the source. We use this formula to compare the performance of known adaptive trie structures, and to predict properties of other possible implementations of tries in this class.

