Results 1-10 of 14
Opportunistic Data Structures with Applications
, 2000
Abstract

Cited by 178 (11 self)
In this paper we address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic since its space occupancy decreases when the input is compressible, and this space reduction is achieved at no significant slowdown in query performance. More precisely, its space occupancy is optimal in an information-content sense because a text T[1, u] is stored using O(H_k(T)) + o(1) bits per input symbol in the worst case, where H_k(T) is the k-th order empirical entropy of T (the bound holds for any fixed k). Given an arbitrary string P[1, p], the opportunistic data structure allows searching for the occ occurrences of P in T in O(p + occ · log^ε u) time (for any fixed ε > 0). If the data are incompressible we achieve the best space bound currently known [12]; on compressible data our solution improves the succinct suffix array of [12] and the classical suffix tree and suffix array data structures in space, in query time, or both.
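The space bound above is stated in terms of the k-th order empirical entropy H_k(T): the average uncertainty of a symbol given its k preceding symbols. As a rough illustration only (this is not the paper's index structure, and the function name is hypothetical), H_k can be computed like this:

```python
import math
from collections import Counter, defaultdict

def empirical_entropy_k(text, k):
    """k-th order empirical entropy H_k(T), in bits per input symbol:
    for each length-k context, accumulate the entropy of the symbol
    that follows it, then average over the whole text."""
    n = len(text)
    contexts = defaultdict(Counter)
    for i in range(k, n):
        contexts[text[i - k:i]][text[i]] += 1
    total = 0.0
    for counts in contexts.values():
        m = sum(counts.values())
        total += sum(-c * math.log2(c / m) for c in counts.values())
    return total / n

# A repetitive text has H_1 = 0 even though H_0 = 1 bit/symbol.
print(empirical_entropy_k("abababababab", 0))  # -> 1.0
print(empirical_entropy_k("abababababab", 1))  # -> 0.0
```

This is exactly the sense in which the structure is "opportunistic": the more predictable each symbol is from its context, the smaller H_k and hence the smaller the index.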
Faster Deterministic Sorting and Searching in Linear Space
, 1995
Abstract

Cited by 37 (7 self)
We present a significant improvement on linear-space deterministic sorting and searching. On a unit-cost RAM with word size w, an ordered set of n w-bit keys (viewed as binary strings or integers) can be maintained in O(min(√(log n), log n/log w + log log n, log w · log log n)) time per operation, including insert, delete, member search, and neighbour search. The cost for searching is worst-case while the cost for updates is amortized. For range queries, there is an additional cost of reporting the found keys. As an application, n keys can be sorted in linear space at a worst-case cost of O(n √(log n)). The best previous method for deterministic sorting and searching in linear space was the fusion tree, which supports queries in O(log n/log log n) amortized time and sorting in O(n log n/log log n) worst-case time. We also make two minor observations on adapting our data structure to the input distribution and on the complexity of perfect hashing.
A New Efficient Radix Sort
, 1994
Abstract

Cited by 30 (7 self)
We present new, improved algorithms for the sorting problem. The algorithms are not only efficient but also clear and simple. First, we introduce Forward Radix Sort, which combines the advantages of traditional left-to-right and right-to-left radix sort in a simple manner. We argue that this algorithm will work very well in practice. Adding a preprocessing step, we obtain an algorithm with attractive theoretical properties. For example, n binary strings can be sorted in Θ(n log(B/(n log n) + 2)) time, where B is the minimum number of bits that have to be inspected to distinguish the strings. This is an improvement over the previously best known result, by Paige and Tarjan. The complexity may also be expressed in terms of H, the entropy of the input: n strings from a stationary ergodic process can be sorted in Θ(n log(1/H + 1)) time, an improvement over the result recently presented by Chen and Reif.
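For contrast with the paper's Forward Radix Sort (which is not reproduced here), a minimal textbook MSD, i.e. left-to-right, radix sort on binary strings; the function name and bucket handling are illustrative:

```python
def msd_radix_sort(strings, pos=0):
    """Left-to-right (MSD) radix sort for strings over {'0', '1'}.
    Plain textbook variant: partition on the bit at position `pos`,
    recurse on each bucket. Strings exhausted at `pos` sort first."""
    if len(strings) <= 1:
        return strings
    done, zeros, ones = [], [], []
    for s in strings:
        if len(s) == pos:
            done.append(s)       # shorter string precedes its extensions
        elif s[pos] == '0':
            zeros.append(s)
        else:
            ones.append(s)
    return done + msd_radix_sort(zeros, pos + 1) + msd_radix_sort(ones, pos + 1)

print(msd_radix_sort(["101", "0", "11", "100"]))  # -> ['0', '100', '101', '11']
```

Note that this naive variant inspects bits of a bucket even after the bucket has shrunk to distinguishable strings; the paper's B parameter captures exactly the bits that genuinely must be inspected.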
Dynamic Ordered Sets with Exponential Search Trees
 Combination of results presented in FOCS 1996, STOC 2000 and SODA
, 2001
Abstract

Cited by 26 (1 self)
We introduce exponential search trees as a novel technique for converting static polynomial-space search structures for ordered sets into fully-dynamic linear-space data structures. This leads to an optimal bound of O(√(log n/log log n)) for searching and updating a dynamic set of n integer keys in linear space. Here searching for an integer y means finding the maximum key in the set which is smaller than or equal to y. This problem is equivalent to the standard textbook problem of maintaining an ordered set (see, e.g., Cormen, Leiserson, Rivest, and Stein: Introduction to Algorithms, 2nd ed., MIT Press, 2001). The best previous deterministic linear-space bound was O(log n/log log n), due to Fredman and Willard from STOC 1990. No better deterministic search bound was known using polynomial space.
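To make the search operation concrete: "searching an integer y" is the predecessor problem. A minimal sketch on a static sorted array, using ordinary binary search rather than exponential search trees (which additionally support updates within the O(√(log n/log log n)) bound):

```python
import bisect

def predecessor(sorted_keys, y):
    """Return the largest key <= y in a sorted list, or None if no
    key qualifies. O(log n) here; exponential search trees achieve
    O(sqrt(log n / log log n)) per search AND update in linear space."""
    i = bisect.bisect_right(sorted_keys, y)
    return sorted_keys[i - 1] if i else None

keys = [2, 7, 13, 42]
print(predecessor(keys, 10))  # -> 7
print(predecessor(keys, 1))   # -> None
```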
Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text
, 1998
Abstract

Cited by 15 (5 self)
We discuss a family of estimators for the entropy rate of a stationary ergodic process and prove their pointwise and mean consistency under a Doeblin-type mixing condition. The estimators are Cesàro averages of longest match-lengths, and their consistency follows from a generalized ergodic theorem due to Maker. We provide examples of their performance on English text, and we generalize our results to countable-alphabet processes and to random fields.
Efficient Lossless Compression of Trees and Graphs
 In IEEE Data Compression Conference (DCC)
, 1996
Abstract

Cited by 6 (0 self)
In this paper, we study the problem of compressing a data structure (e.g. a tree, or an undirected or directed graph) in an efficient way while keeping a similar structure in the compressed form. To date, there has been no proven optimal algorithm for this problem. We use the idea of building the LZW tree in LZW compression to compress a binary tree generated by a stationary ergodic source in an optimal manner. We also extend our tree compression algorithm to compress undirected graphs and directed acyclic graphs.
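The LZW dictionary tree the authors build on can be illustrated by plain textbook LZW compression of a string (the paper's tree-compression algorithm itself is not reproduced; this sketch only shows the dictionary-growing idea being adapted):

```python
def lzw_compress(data):
    """Textbook LZW: maintain a dictionary of previously seen phrases
    (conceptually a trie); emit the code of the longest known prefix
    and add that prefix extended by one symbol as a new entry."""
    table = {chr(i): i for i in range(256)}  # single characters pre-seeded
    w, out = "", []
    for c in data:
        if w + c in table:
            w += c                  # extend the current phrase
        else:
            out.append(table[w])    # emit longest known phrase
            table[w + c] = len(table)
            w = c
    if w:
        out.append(table[w])
    return out

print(lzw_compress("ababab"))  # phrases a, b, ab, ab -> [97, 98, 256, 256]
```

The dictionary entries form a tree (each phrase is a child of its longest proper prefix); the paper exploits that same trie structure on input trees rather than strings.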
The Complexity and Entropy of Literary Styles
, 1996
Abstract

Cited by 5 (1 self)
Since Shannon's original experiment in 1951, several methods have been applied to the problem of determining the entropy of English text. These methods were based either on prediction by human subjects, or on computer-implemented parametric models for the data of a certain Markov order. We ask why computer-based experiments almost always yield much higher entropy estimates than the ones produced by humans. We argue that there are two main reasons for this discrepancy. First, the long-range correlations of English text are not captured by Markovian models and, second, computer-based models only take advantage of the text statistics without being able to "understand" the contextual structure and the semantics of the given text. The second question we address is what the "entropy" of a text says about the author's literary style. In particular, is there an intuitive notion of "complexity of style" that is captured by the entropy? We present preliminary results, based on a nonparametric entropy estimation algorithm, that offer partial answers to these questions. These results indicate that taking long-range correlations into account significantly improves the entropy estimates. We get an estimate of 1.77 bits per character for a one-million-character sample taken from Jane Austen's works. Comparing the estimates obtained from several different texts also provides some insight into the interpretation of the notion of "entropy" when applied to English text rather than to random processes, and into the relationship between the entropy and the "literary complexity" of an author's style. Advantages of this entropy estimation method are that it does not require prior training, it is uniformly good over different styles and languages, and it seems to converge reasonably fast.
Estimating the Entropy of Binary Time Series: Methodology, Some Theory and a Simulation Study
Stationary Entropy Estimation via String Matching (Extended Abstract)
, 1996
Abstract

Cited by 2 (2 self)
Submitted to DCC 1996, Snowbird, Utah. Ioannis Kontoyiannis, Yurii M. Suhov. September 1995, revised March 1996. We prove an asymptotic relationship between certain longest match-lengths along a single realization of a stationary process and its entropy rate: Given a process X = {X_n; n ∈ Z} and a realization x from X, we define N_i(x) as the length of the shortest substring starting at x_i that does not appear as a contiguous substring of (x_{i−N}, x_{i−N+1}, …, x_{i−1}). We show that, for a class of stationary processes with finite state-space (including all i.i.d. and mixing Markov processes of all orders), the following limiting relation holds:

lim_{N→∞} [Σ_{i=1}^{N} N_i(x)] / (N log N) = 1/H,   (1)

almost surely and in L^1, where H > 0 is the entropy rate of the process. We generalize this result to the cases where the alphabet of the process is countably infinite, and to random fields in several dimensions. Beginning with Wyner and Ziv's 198...
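Relation (1) suggests a simple estimator: sum the match-lengths and invert. The sketch below is an illustrative simplification, not the authors' exact construction: the past window (x_{i−N}, …, x_{i−1}) is replaced by the full prefix x[:i], and all names are hypothetical.

```python
import math

def match_length(x, i):
    """N_i: length of the shortest substring starting at position i
    that does NOT occur as a contiguous substring of the past x[:i]
    (full prefix used as the window, for simplicity)."""
    past = x[:i]
    L = 1
    while i + L <= len(x) and x[i:i + L] in past:
        L += 1
    return L

def entropy_estimate(x):
    """Rough entropy-rate estimate obtained by inverting relation (1):
    H ~ N log2 N / sum_i N_i(x)."""
    N = len(x)
    s = sum(match_length(x, i) for i in range(1, N))
    return (N * math.log2(N)) / s
```

On a highly predictable sequence the match-lengths grow quickly, driving the estimate toward 0; on a memoryless uniform binary sequence it approaches 1 bit per symbol.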
Fast Pattern Matching for Entropy Bounded Text
 In Proceedings of DCC'95 Data Compression Conference, Snowbird
, 1995
Abstract

Cited by 1 (0 self)
We present the first known one-dimensional and two-dimensional string-matching algorithms for text with bounded entropy. Let n be the length of the text and m the length of the pattern. We show that the expected complexity of the algorithms is related to the entropy of the text under various assumptions on the distribution of the pattern. For the case of uniformly distributed patterns, our one-dimensional matching algorithm runs in O(n log m/(pm)) expected time, where H is the entropy of the text and p = 1 − (1 − H²)^{H/(1+H)}. The worst-case running time T can also be bounded by n log m/(p(m + √V)) ≤ T ≤ n log m/(p(m − √V)), where V is the variance of the source from which the pattern is generated. Our algorithm utilizes data structures and probabilistic analysis techniques that are found in certain lossless data compression schemes.
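The paper's entropy-sensitive matcher is not reproduced here. As a point of comparison, a standard Boyer-Moore-Horspool matcher, whose expected scanning cost likewise shrinks as the text becomes less predictable (mismatches occur sooner, allowing longer shifts):

```python
def horspool(text, pattern):
    """Boyer-Moore-Horspool string matching: align the pattern, compare,
    then shift by how far the text character under the pattern's last
    position is from its rightmost occurrence in the pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = {c: m for c in set(text)}          # default: full-pattern shift
    for j, c in enumerate(pattern[:-1]):
        shift[c] = m - 1 - j                   # rightmost occurrence wins
    hits, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            hits.append(i)
        i += shift[text[i + m - 1]]
    return hits

print(horspool("abracadabra", "abra"))  # -> [0, 7]
```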