Results 1-10 of 13
Lexicographical Indices for Text: Inverted files vs. PAT trees
, 1991
Abstract

Cited by 23 (0 self)
We survey two indices for text, with emphasis on Pat arrays (also called suffix arrays). A Pat array is an index based on a new model of text which does not use the concept of word and does not need to know the structure of the text. To appear in Information Retrieval: Data Structures and Algorithms, R.A. Baeza-Yates and W. Frakes, eds., Prentice-Hall. Text searching methods may be classified as lexicographical indices (indices that are sorted), clustering techniques, and indices based on hashing (for example, signature files [FC87]). In this report we discuss lexicographical indices, in particular two main data structures: inverted files and Pat trees. Our aim is to build an index for the text of size similar to or smaller than the text itself. Briefly, the traditional model of text used in information retrieval is that of a set of documents. Each document is assigned a list of keywords (attributes), with optional relevance weights associated with each keyword. This ...
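As a concrete illustration of the Pat/suffix-array idea described above, here is a minimal sketch (not the survey's implementation): an array of suffix start positions sorted lexicographically, searched by binary search on the query prefix. No word boundaries or document structure are used, matching the model the abstract describes.

```python
def build_suffix_array(text):
    # A Pat/suffix array is the list of suffix start positions,
    # sorted lexicographically by the suffix beginning at each one.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, query):
    # Binary search for the contiguous block of suffixes that
    # start with `query`; returns all occurrence positions.
    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix >= query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix past the block
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "abracadabra"
sa = build_suffix_array(text)
print(find(text, sa, "abra"))  # -> [0, 7]
```

Naive construction as above is O(n^2 log n); practical Pat-array builders use specialized suffix sorting, but the search logic is the same.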
Using Difficulty of Prediction to Decrease Computation: Fast Sort, Priority Queue and Convex Hull on Entropy Bounded Inputs
Abstract

Cited by 17 (4 self)
There has been an upsurge of interest recently in the Markov model, and in more general stationary ergodic stochastic distributions, in the theoretical computer science community (e.g., see [Vitter,Krishnan91], [Karlin,Philips,Raghavan92], [Raghavan92] for the use of Markov models for online algorithms such as caching and prefetching). Their results used the fact that compressible sources are predictable (and vice versa), and showed that online algorithms can improve their performance by prediction. Actual page access sequences are in fact somewhat compressible, so their predictive methods can be of benefit. This paper investigates the interesting idea of decreasing computation by using learning in the opposite way, namely to determine the difficulty of prediction. That is, we approximately learn the input distribution, and then improve the performance of the computation when the input is not too predictable, rather than the reverse. To our knowledge,
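The core premise (gauge how predictable the input is, then adapt) can be illustrated with the simplest possible estimator: order-0 empirical entropy. This is a hypothetical stand-in chosen for illustration; the paper treats general stationary ergodic sources, whose sequential structure order-0 statistics cannot see.

```python
from collections import Counter
from math import log2

def empirical_entropy(symbols):
    # Order-0 empirical entropy in bits per symbol:
    # H = -sum_i p_i * log2(p_i) over observed symbol frequencies.
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A skewed bit stream is compressible, hence predictable:
print(empirical_entropy("0001" * 4))   # about 0.811 bits/symbol
# A balanced stream measures 1.0 here, even though the alternating
# pattern would be perfectly predictable to a higher-order model:
print(empirical_entropy("01" * 8))     # 1.0 bits/symbol
```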
Faster Searching in Tries and Quadtrees: An Analysis of Level Compression
Abstract

Cited by 11 (6 self)
We analyze the behavior of the level-compressed trie (LC-trie), a compact version of the standard trie data structure. Based on this analysis, we argue that level compression improves the performance of both tries and quadtrees considerably in many practical situations. In particular, we show that LC-tries can be of great use for string searching in compressed text. Both tries and quadtrees are extensively used, and much effort has been spent obtaining detailed analyses. Since the LC-trie performs significantly better than standard tries for a large class of common distributions, while still being easy to implement, we believe that the LC-trie is a strong candidate for inclusion in the standard repertoire of basic data structures.
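Level compression itself is easy to state: whenever the top k levels below a node form a complete binary tree, replace them with a single 2^k-way branch. A toy sketch under simplifying assumptions (distinct, equal-length bit-string keys; no path compression), not the paper's implementation:

```python
def build_lc_trie(keys, depth=0):
    # keys: distinct, equal-length bit strings (e.g. "0110").
    if len(keys) <= 1:
        return keys[0] if keys else None        # leaf holds the key
    # Find the largest k such that every length-k bit pattern occurs
    # at this depth: those k binary levels collapse into one node.
    k = 1
    while len({key[depth:depth + k + 1] for key in keys}) == 2 ** (k + 1):
        k += 1
    buckets = {}
    for key in keys:
        buckets.setdefault(key[depth:depth + k], []).append(key)
    return {prefix: build_lc_trie(group, depth + k)
            for prefix, group in buckets.items()}

# All four 2-bit prefixes occur, so the root becomes a 4-way node:
root = build_lc_trie(["0000", "0001", "0110", "1011", "1101"])
print(sorted(root))  # -> ['00', '01', '10', '11']
```

The actual LC-trie applies path compression as well; this sketch shows the level-compression step in isolation.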
An Experimental Study of Compression Methods for Dynamic Tries
Abstract

Cited by 9 (1 self)
We study an order-preserving, general-purpose data structure for binary data, the LPC-trie. The structure is a compressed trie, using both level and path compression. The memory usage is similar to that of a balanced binary search tree, but the expected average depth is smaller. The LPC-trie is well suited to modern language environments with efficient memory allocation and garbage collection. We present an implementation in the Java programming language and show that the structure compares favorably to a balanced binary search tree.
Trie methods for text and spatial data on secondary storage
, 1995
Abstract

Cited by 7 (2 self)
This thesis presents three trie organizations for various binary tries. The new trie structures have two distinctive features: (1) they store no pointers and require two bits per node in the worst case, and (2) they partition tries into pages and are suitable for secondary storage. We apply trie structures to indexing, storing and querying both text and spatial data on secondary storage. We are interested in practical problems such as storage compactness, I/O efficiency, and large trie construction. We use our tries to index and search arbitrary substrings of a text. For an index of 100 million keys, our trie is 10%-25% smaller than the best known method. This difference is important since the index size is crucial for trie methods. We provide methods for dynamic tries and allow texts to be changed. We also use our tries to compress and approximately search large dictionaries. Our algorithm can find strings with k mismatches in sublinear time. To our knowledge, no other published sublinear algorithm is known for this problem. In addition, we use our tries to store and query spatial data such as maps. A trie structure is proposed to permit querying and retrieving spatial data at arbitrary levels of resolution, without reading from secondary storage any more data than is needed for the specified resolution. The trie structure also compresses spatial data substantially. The performance results on map data have confirmed our expectations: the querying cost is linear in the amount of data needed and independent of the data size in practice. We give algorithms for a set of sample queries including geometrical selection, geometrical join and the nearest neighbour. We also show how to control query cost by specifying an acceptable resolution.
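The two-bits-per-node figure has a classic realization: a level-order bitmap with one bit per potential child, navigated by rank instead of pointers. A small sketch of that general idea (an illustration of pointerless encoding, not the thesis's paged layout):

```python
from collections import deque

def encode_trie(root):
    # Level-order traversal emitting two bits per node:
    # (has a 0-child, has a 1-child). Nodes are nested dicts
    # keyed by bit label, e.g. {"0": {...}, "1": {}}.
    bits, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        for label in "01":
            kid = node.get(label)
            bits.append(1 if kid is not None else 0)
            if kid is not None:
                queue.append(kid)
    return bits

def child(bits, node_index, label):
    # Pointerless navigation: in level order, the j-th 1-bit marks
    # the j-th node, so rank over the bitmap replaces a pointer.
    pos = 2 * node_index + (0 if label == "0" else 1)
    if not bits[pos]:
        return None
    return sum(bits[:pos + 1])  # rank1(pos) = child's node index

# Trie for the bit strings {"00", "01", "1"} (5 nodes, 10 bits):
trie = {"0": {"0": {}, "1": {}}, "1": {}}
bits = encode_trie(trie)
print(bits)                 # -> [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(child(bits, 0, "0"))  # -> 1 (root's 0-child is node 1)
```

A real implementation would answer rank in O(1) via a small auxiliary index instead of summing a prefix; partitioning the bitmap into pages is the part the thesis addresses.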
A growth model for rna secondary structures
 Journal of Statistical Mechanics: Theory and Experiment
Abstract

Cited by 7 (2 self)
A hierarchical model for the growth of planar arch structures for RNA secondary structures is presented, and shown to be equivalent to a tree-growth model. Both models can be solved analytically, giving access to scaling functions for large molecules, and corrections to scaling, checked by numerical simulations of up to 6500 bases. The equivalence of both models should be helpful in understanding more general tree-growth processes. PACS numbers: 87.14.gn, 87.15.bd, 02.10.Ox, 02.50.Ey
On the Average Depth of Asymmetric LC-tries
 Information Processing Letters
, 2005
Abstract

Cited by 2 (2 self)
Andersson and Nilsson have shown that the average depth Dn of random LC-tries is only Θ(log* n) when the keys are produced by a symmetric memoryless process, and that Dn = O(log log n) when the process is asymmetric. In this paper we refine the second estimate by showing that asymptotically (as n → ∞): Dn ∼ (1/η) log log n, where n is the number of keys inserted in the trie, η = −log(1 − h/h_{−∞}), h = −p log p − q log q is the entropy of a binary memoryless source with probabilities p and q = 1 − p (p ≠ q), and h_{−∞} = −log min(p, q). Key words: average case analysis of algorithms, trie, LC-trie.
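The constant in the depth estimate is concrete enough to compute. A small sketch evaluating η from the abstract's formulas (note η does not depend on the logarithm base, since only the ratio h/h_{−∞} enters):

```python
from math import log

def eta(p):
    # eta = -log(1 - h / h_{-inf}) for a binary memoryless source
    # with symbol probabilities p and q = 1 - p (p != q), where
    # h = -p log p - q log q and h_{-inf} = -log min(p, q).
    q = 1.0 - p
    h = -p * log(p) - q * log(q)
    h_inf = -log(min(p, q))
    return -log(1.0 - h / h_inf)

# The asymptotic average LC-trie depth is then (1/eta) * log log n:
print(eta(0.3))  # about 0.708
```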
Trie Methods for Structured Data on Secondary Storage
, 2000
Abstract

Cited by 1 (0 self)
We apply trie structures to indexing, storing and querying structured data on secondary storage. We are interested in storage compactness, I/O efficiency, order-preserving properties, general orthogonal range queries and exact match queries for very large files and databases. We also apply trie structures to relational joins (set operations). We compare trie structures to various data structures on secondary storage: multipaging and grid files in the direct access method category, R-trees/R*-trees and X-trees in the logarithmic access cost category, as well as some representative join algorithms for performing join operations. Our results show that range queries by the trie method are superior to these competitors in search cost when queries return more than a few records, and are competitive with direct access methods for exact match queries. Furthermore, as the trie structure compresses data, it is the winner in terms of storage compared to all other methods mentioned above. We also present a new tidy function for order-preserving key-to-address transformation. Our tidy function is easy to construct and cheaper in access time and storage cost compared to its closest competitor.
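The term "tidy function" may be unfamiliar: it is an order-preserving key-to-address transformation, so records land on disk in key order. A minimal sketch of one hypothetical construction, based on a key's rank in a sorted sample; the paper's own tidy function differs:

```python
import bisect

def make_tidy(sample_keys, table_size):
    # Build an order-preserving key-to-address transform from a
    # sorted sample: the address grows with the key's rank, so
    # key1 <= key2 always implies tidy(key1) <= tidy(key2).
    sample = sorted(sample_keys)
    def tidy(key):
        rank = bisect.bisect_left(sample, key)
        return rank * table_size // (len(sample) + 1)
    return tidy

tidy = make_tidy([10, 20, 30, 40], 100)
print([tidy(k) for k in (5, 15, 25, 35, 45)])  # -> [0, 20, 40, 60, 80]
```

Monotonicity is what distinguishes a tidy function from an ordinary hash: range queries map to contiguous address ranges.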
Using Learning and Difficulty of Prediction to Decrease Computation: A Fast Sort and Priority Queue on Entropy Bounded Inputs
Abstract
There has been an upsurge of interest recently in the Markov model, and in more general stationary ergodic stochastic distributions, in the theoretical computer science community (e.g., see [Vitter,Krishnan,FOCS91], [Karlin,Philips,Raghavan,FOCS92], [Raghavan92] for the use of Markov models for online algorithms such as caching and prefetching). Their results used the fact that compressible sources are predictable (and vice versa), and showed that online algorithms can improve their performance by prediction. Actual page access sequences are in fact somewhat compressible, so their predictive methods can be of benefit. This paper investigates the interesting idea of decreasing computation by using learning in the opposite way, namely to determine the difficulty of prediction. That is, we approximately learn the input distribution, and then improve the performance of the computation when the input is not too predictable, rather than the reverse. To our knowledge, this is the first case of a computational problem where we do not assume any particular fixed input distribution and yet computation is decreased when the input is less predictable, rather than the reverse. We concentrate our investigation on a basic computational problem, sorting, and a basic data structure problem, maintaining a priority queue. We present the first known sorting and priority queue algorithms whose complexity depends on the binary entropy H ≤ 1 of the input keys, where we assume that the keys are generated from an unknown but arbitrary stationary ergodic source. That is, we assume that each input key can be arbitrarily long but has entropy H. Note that H