A general limit theorem for recursive algorithms and combinatorial structures
- ANN. APPL. PROB
, 2004
- Cited by 36 (21 self)
Limit laws are proven by the contraction method for random vectors of a recursive nature as they arise as parameters of combinatorial structures such as random trees or recursive algorithms, where we use the Zolotarev metric. In comparison to previous applications of this method, a general transfer theorem is derived which allows us to establish a limit law on the basis of the recursive structure and to use the asymptotics of the first and second moments of the sequence. In particular, a general asymptotic normality result is obtained by this theorem which typically cannot be handled by the more common ℓ2 metrics. As applications we derive quite automatically many asymptotic limit results ranging from the size of tries or m-ary search trees and path lengths in digital structures to mergesort and parameters of random recursive trees, which were previously shown by different methods one by one. We also obtain a related local density approximation result as well as a global approximation result. For the proofs of these results we establish that a smoothed density distance as well as a smoothed total variation distance can be estimated from above by the Zolotarev metric, which is the main tool in this article.
Burst Tries: A Fast, Efficient Data Structure for String Keys
- ACM Transactions on Information Systems
, 2002
- Cited by 21 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
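To make the idea concrete, here is a minimal Python sketch of a burst-trie-like structure. It is not the authors' implementation: the container type (a sorted list of suffixes), the fixed `burst_limit` threshold, and all class and method names are assumptions chosen for illustration. It only shows the general mechanism of trie nodes at the top, small containers at the leaves, and a container being "burst" into a new trie node once it grows too large, while keeping keys retrievable in sorted order.

```python
import bisect


class BurstTrie:
    """Toy burst trie: trie nodes (dicts) on top, sorted-list containers below."""

    def __init__(self, burst_limit=4):
        # burst_limit is a hypothetical tuning parameter for this sketch.
        self.burst_limit = burst_limit
        # A trie node maps one character to a child, which is either another
        # trie node (dict) or a container (sorted list of remaining suffixes).
        # The key "" marks that a string ends exactly at this node.
        self.root = {}

    def insert(self, word):
        node, i = self.root, 0
        while i < len(word):
            ch = word[i]
            child = node.get(ch)
            if child is None:
                node[ch] = [word[i + 1:]]          # start a new container
                return
            if isinstance(child, dict):            # descend through the trie
                node, i = child, i + 1
                continue
            suffix = word[i + 1:]
            pos = bisect.bisect_left(child, suffix)
            if pos == len(child) or child[pos] != suffix:
                child.insert(pos, suffix)          # keep the container sorted
            if len(child) > self.burst_limit:
                node[ch] = self._burst(child)      # replace container by a node
            return
        node[""] = True                            # word ends at a trie node

    @staticmethod
    def _burst(container):
        """Turn a container into a trie node with one container per first character."""
        new_node = {}
        for suffix in container:                   # already in sorted order
            if suffix == "":
                new_node[""] = True
            else:
                new_node.setdefault(suffix[0], []).append(suffix[1:])
        return new_node

    def __contains__(self, word):
        node = self.root
        for i, ch in enumerate(word):
            child = node.get(ch)
            if child is None:
                return False
            if isinstance(child, list):
                suffix = word[i + 1:]
                pos = bisect.bisect_left(child, suffix)
                return pos < len(child) and child[pos] == suffix
            node = child
        return bool(node.get("", False))

    def __iter__(self):
        yield from self._walk(self.root, "")

    def _walk(self, node, prefix):
        if node.get(""):
            yield prefix
        for ch in sorted(key for key in node if key):
            child = node[ch]
            if isinstance(child, dict):
                yield from self._walk(child, prefix + ch)
            else:
                for suffix in child:
                    yield prefix + ch + suffix
```

Iterating over the structure visits trie characters and container entries in lexicographic order, which is the sorted-access property the abstract contrasts with hash tables. A real implementation would likely use more compact containers and a more careful bursting heuristic.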
Precise Minimax Redundancy and Regret
- IEEE TRANS. INFORMATION THEORY
, 2004
- Cited by 19 (8 self)
Recent years have seen a resurgence of interest in redundancy of lossless coding. The redundancy (regret) of universal fixed-to-variable length coding for a class of sources determines by how much the actual code length exceeds the optimal (ideal over the class) code length. In a minimax scenario one finds the best code for the worst source either in the worst case (also called maximal minimax) or on average. We first study the worst case minimax redundancy over a class of stationary ergodic sources and replace Shtarkov's bound by an exact formula. Among others, we prove that a generalized Shannon code minimizes the worst case redundancy, derive asymptotically its redundancy, and establish some general properties. This allows us to obtain precise redundancy rates for memoryless, Markov and renewal sources. For example, we derive the exact constant of the redundancy rate for memoryless and Markov sources by showing that an integer nature of coding contributes $\log(\log m/(m-1))/\log m + o(1)$, where $m$ is the size of the alphabet. Then we deal with the average minimax redundancy and regret. Our approach
Source coding, large deviations, and approximate pattern matching
- IEEE Trans. Inform. Theory
, 2002
- Cited by 17 (8 self)
Dedicated to the memory of Aaron Wyner, a valued friend and colleague. In this review paper, we present a development of parts of rate-distortion theory and pattern-matching algorithms for lossy data compression, centered around a lossy version of the asymptotic equipartition property (AEP). This treatment closely parallels the corresponding development in lossless compression, a point of view that was advanced in an important paper of Wyner and Ziv in 1989. In the lossless case, we review how the AEP underlies the analysis of the Lempel–Ziv algorithm by viewing it as a random code and reducing it to the idealized Shannon code. This also provides information about the redundancy of the Lempel–Ziv algorithm and about the asymptotic behavior of several relevant quantities. In the lossy case, we give various versions of the statement of the generalized AEP and we outline the general methodology of its proof via large deviations. Its relationship with Barron and Orey’s generalized AEP is also discussed. The lossy AEP is applied to i) prove strengthened versions of Shannon’s direct source-coding theorem and universal coding theorems; ii) characterize the performance of “mismatched” codebooks in lossy data compression; iii) analyze the performance of pattern-matching algorithms for lossy compression (including Lempel–Ziv schemes); and iv) determine the first-order asymptotics of waiting times between stationary processes. A refinement to the lossy AEP is then presented, and it is used to i) prove second-order (direct and converse) lossy source-coding theorems, including universal coding theorems; ii) characterize which sources are quantitatively easier to compress; iii) determine the second-order asymptotics of waiting times between stationary processes; and iv) determine the precise asymptotic behavior of longest match-lengths between stationary processes. Finally, we discuss extensions of the above framework and results to random fields.
Index Terms: Data compression, large deviations, pattern matching, rate-distortion theory.
An Universal Predictor Based on Pattern Matching
- IEEE Trans. Inform. Theory
, 2000
- Cited by 17 (1 self)
We consider here a universal predictor based on pattern matching. For a given string $x_1, x_2, \ldots, x_n$, the predictor will guess the next symbol $x_{n+1}$ in such a way that the prediction error tends to zero as $n \to \infty$, provided the string $x_1^n = x_1, x_2, \ldots, x_n$ is generated by a mixing source. We shall prove that the rate of convergence of the prediction error is $O(n^{-\varepsilon})$ for any $\varepsilon > 0$. In this preliminary version, we only prove our results for memoryless sources and give a sketch for mixing sources. However, we indicate that our algorithm can predict equally successfully the next $k$ symbols as long as $k = O(1)$. Introduction: Prediction is important in communication, control, forecasting, investment and other areas. We understand how to do optimal prediction when the data model is known, but one needs to design a universal prediction algorithm that will perform well no matter what the underlying probabilistic model is. More precisely, let $X_1, X_2, \ldots$ be an infinite ...
Reliable Detection of Episodes in Event Sequences
- Knowledge and Information Systems
, 2004
- Cited by 17 (3 self)
Suppose one wants to detect “bad” or “suspicious” subsequences in event sequences. Whether an observed pattern of activity (in the form of a particular subsequence) is significant and should be a cause for alarm depends on how likely it is to occur fortuitously. A long enough sequence of observed events will almost certainly contain any subsequence, and setting thresholds for alarm is an important issue in a monitoring system that seeks to avoid false alarms. Suppose a long sequence T of observed events contains a suspicious subsequence pattern S within it, where the suspicious subsequence S consists of m events and spans a window of size w within T. We address the fundamental problem: is a certain number of occurrences of a particular subsequence unlikely to be generated by randomness itself (i.e., indicative of suspicious activity)? If the probability of an occurrence generated by randomness is high and an automated monitoring system flags it as suspicious anyway, then such a system will suffer from generating too many false alarms. This paper quantifies the probability of such an S occurring in T within a window of size w, the number of distinct windows containing S as a subsequence, the expected number of such occurrences, and its variance, and establishes its limiting distribution, which allows one to set an alarm threshold so that the probability of false alarms is very small. We report on experiments confirming the theory and showing that we can detect bad subsequences with a low false alarm rate.
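The quantity at the center of this abstract, the number of windows of size w in T that contain S as a subsequence, can be computed directly for small inputs. The Python sketch below is only an illustration of that definition, not the paper's analysis; the function names are hypothetical, and it assumes windows are the contiguous length-w slices of T.

```python
def contains_subsequence(window, pattern):
    """Return True if `pattern` occurs in `window` as a (not necessarily
    contiguous) subsequence, i.e. its symbols appear in the same order."""
    remaining = iter(window)
    # `symbol in remaining` advances the iterator until the symbol is found,
    # so each pattern symbol must occur after the previous match.
    return all(symbol in remaining for symbol in pattern)


def count_windows_with_episode(events, pattern, w):
    """Count the contiguous windows of size `w` in `events` that contain
    `pattern` as a subsequence."""
    return sum(
        contains_subsequence(events[start:start + w], pattern)
        for start in range(len(events) - w + 1)
    )
```

An observed count from a function like this is what a monitoring system would compare against a threshold derived from the limiting distribution discussed in the abstract.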
Multicast Tree Structure and the Power Law
- IN PROCEEDINGS OF ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS (SODA)
, 2002
- Cited by 16 (0 self)
One of the main benefits of multicast communication is the overall reduction of network load. To quantify this reduction, when compared to traditional unicast, experimental studies by Chuang and Sirbu indicated the so-called power law, which asserts that the ratio $R(m)$ of the average number of links in a multicast delivery tree connecting a source to $m$ (distinct) sites to the average number of links in a unicast path satisfies $R(m) \approx c\,m^{0.8}$, where $c$ is a constant. In order to explain this behavior theoretically, Phillips, Shenker, and Tangmunarunkit examined approximately $R(m)$ for a $V$-ary complete tree topology, and concluded that $R(m)$ grows nearly linearly with $m$, thus not obeying the power law. We first re-examine the analysis by Phillips et al. and provide a precise asymptotic expansion for $R(m)$ that confirms the nearly linear (with some wobbling) growth. Claiming that the essence of the problem lies in the modeling assumptions, we replace the $V$-ary complete tree topology by a $V$-ary self-similar tree with similarity factor $0 \le \theta < 1$. In such a tree a node at level $k$ is replicated $C V^{(D-k)\theta}$ times, where $D$ is the depth of the tree and $C$ is a constant. Under this assumption, we analyze $R(m)$ again and prove that $R(m) \approx c\,m^{1-\theta}$ for a constant $c$. Hence self-similar trees provide a plausible explanation of the multicast power law. Next, we examine more general conditions for general trees, under which the power law still holds. We also discuss some experimental results in real networks that reaffirm the power law and show that in these networks the general conditions hold. In particular, our experiments show that for the tested networks $\theta \approx 0.12$.
Engineering a Fast Online Persistent Suffix Tree Construction
- In 20th Int’l Conference on Data Engineering
, 2004
- Cited by 15 (3 self)
Online persistent suffix tree construction has been considered impractical due to its excessive I/O costs. However, these prior studies have not taken into account the effects of the buffer management policy and the internal node structure of the suffix tree on the I/O behavior of construction and subsequent retrievals over the tree. In this paper, we study these two issues in detail in the context of large genomic DNA and protein sequences. In particular, we make the following contributions: (i) a novel, low-overhead buffering policy called TOP-Q, which improves the on-disk behavior of suffix tree construction and subsequent retrievals, and (ii) empirical evidence that the space-efficient linked-list representation of suffix tree nodes provides significantly inferior performance when compared to the array representation. These results demonstrate that a careful choice of implementation strategies can make online persistent suffix tree construction considerably more scalable, in terms of the length of sequences indexed with a fixed memory budget, than currently perceived.
Rounding of continuous random variables and oscillatory asymptotics
- Ann. Probab
- Cited by 14 (7 self)
We study the characteristic function and moments of the integer-valued random variable ⌊X + α⌋, where X is a continuous random variable. The results can be regarded as exact versions of Sheppard’s correction. Rounded variables of this type often occur as subsequence limits of sequences of integer-valued random variables. This leads to oscillatory terms in asymptotics for these variables, something that has often been observed, for example in the analysis of several algorithms. We give some examples, including applications to tries, digital search trees and Patricia tries.
Markov Types and Minimax Redundancy for Markov Sources
- IEEE Trans. Information Theory
, 2003
- Cited by 11 (6 self)
Redundancy of universal codes for a class of sources determines by how much the actual code length exceeds the optimal code length. In the minimax scenario one designs the best code for the worst source within the class. Such minimax redundancy comes in two flavors: either on average or for individual sequences. The latter is also known as the maximal or the worst case minimax redundancy. We study the maximal minimax redundancy of universal block codes for Markovian sources of any order. We prove that the maximal minimax redundancy for Markov sources of order $r$ is asymptotically equal to $\frac{1}{2} m^r (m-1) \log_2 n + \log_2 A_m^r - (\ln \ln m^{1/(m-1)})/\ln m + o(1)$, where $n$ is the length of a source sequence, $m$ is the size of the alphabet and $A_m^r$ is an explicit constant (e.g., we find that for a binary alphabet $m = 2$ and Markov of order $r = 1$ the constant $A_2^1 = 16 \cdot G \approx 14.655449504$, where $G$ is Catalan's constant). Unlike previous attempts, we view the redundancy problem as an asymptotic evaluation of certain sums over a set of matrices representing Markov types. The enumeration of Markov types is accomplished by reducing it to counting Eulerian paths in a multigraph. In particular, we propose an asymptotic formula for the number of strings of a given Markov type. All of these findings are obtained by analytic and combinatorial tools of analysis of algorithms. Index terms: Minimax redundancy, Markov sources, Markov types, Eulerian paths, multidimensional generating functions, analytic information theory. A preliminary version of this paper was presented at Colloquium on Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities, Versailles, 2002.
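As a small illustration of what a Markov type is, the Python sketch below computes the matrix of transition counts of a string for order 1 and shows two different strings sharing it. This is only a sketch under stated assumptions: it uses plain (non-cyclic) pair counts, which may differ from the exact convention in the paper, and the function name `markov_type` is hypothetical.

```python
from collections import Counter


def markov_type(sequence):
    """Order-1 Markov type (assumed non-cyclic convention): how often each
    ordered pair of consecutive symbols occurs in `sequence`."""
    return Counter(zip(sequence, sequence[1:]))


# Two different strings with identical transition counts belong to the same type.
same_type = markov_type("aabab") == markov_type("abaab")
```

Each string of a given type traces a path that uses every counted transition exactly as many times as the type prescribes, which is why counting such strings relates to counting Eulerian paths in a multigraph, as the abstract states.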

