A general limit theorem for recursive algorithms and combinatorial structures
 ANN. APPL. PROB
, 2004
Cited by 53 (25 self)
Limit laws are proven by the contraction method for random vectors of a recursive nature as they arise as parameters of combinatorial structures such as random trees or recursive algorithms, where we use the Zolotarev metric. In comparison to previous applications of this method, a general transfer theorem is derived which allows us to establish a limit law on the basis of the recursive structure and to use the asymptotics of the first and second moments of the sequence. In particular, a general asymptotic normality result is obtained by this theorem which typically cannot be handled by the more common $\ell_2$ metrics. As applications we derive quite automatically many asymptotic limit results ranging from the size of tries or m-ary search trees and path lengths in digital structures to mergesort and parameters of random recursive trees, which were previously shown by different methods one by one. We also obtain a related local density approximation result as well as a global approximation result. For the proofs of these results we establish that a smoothed density distance as well as a smoothed total variation distance can be estimated from above by the Zolotarev metric, which is the main tool in this article.
Precise Minimax Redundancy and Regret
 IEEE TRANS. INFORMATION THEORY
, 2004
Cited by 33 (13 self)
Recent years have seen a resurgence of interest in redundancy of lossless coding. The redundancy (regret) of universal fixed-to-variable length coding for a class of sources determines by how much the actual code length exceeds the optimal (ideal over the class) code length. In a minimax scenario one finds the best code for the worst source either in the worst case (also called maximal minimax) or on average. We first study the worst case minimax redundancy over a class of stationary ergodic sources and replace Shtarkov's bound by an exact formula. Among others, we prove that a generalized Shannon code minimizes the worst case redundancy, derive asymptotically its redundancy, and establish some general properties. This allows us to obtain precise redundancy rates for memoryless, Markov and renewal sources. For example, we derive the exact constant of the redundancy rate for memoryless and Markov sources by showing that an integer nature of coding contributes $\log(\log m/(m-1))/\log m + o(1)$, where $m$ is the size of the alphabet. Then we deal with the average minimax redundancy and regret. Our approach
Burst Tries: A Fast, Efficient Data Structure for String Keys
 ACM Transactions on Information Systems
, 2002
Cited by 28 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
Reliable Detection of Episodes in Event Sequences
 Knowledge and Information Systems
, 2004
Cited by 24 (3 self)
Suppose one wants to detect "bad" or "suspicious" subsequences in event sequences. Whether an observed pattern of activity (in the form of a particular subsequence) is significant and should be a cause for alarm depends on how likely it is to occur fortuitously. A long enough sequence of observed events will almost certainly contain any subsequence, and setting thresholds for alarm is an important issue in a monitoring system that seeks to avoid false alarms. Suppose a long sequence T of observed events contains a suspicious subsequence pattern S within it, where the suspicious subsequence S consists of m events and spans a window of size w within T. We address the fundamental problem: is a certain number of occurrences of a particular subsequence unlikely to be generated by randomness itself (i.e., indicative of suspicious activity)? If the probability of an occurrence generated by randomness is high and an automated monitoring system flags it as suspicious anyway, then such a system will suffer from generating too many false alarms. This paper quantifies the probability of such an S occurring in T within a window of size w, the number of distinct windows containing S as a subsequence, the expected number of such occurrences, and its variance, and establishes its limiting distribution, which allows one to set an alarm threshold so that the probability of false alarms is very small. We report on experiments confirming the theory and showing that we can detect bad subsequences with a low false alarm rate.
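The quantity "number of distinct windows containing S as a subsequence" can be illustrated with a brute-force count. This is only a minimal sketch for intuition, not the method of the paper: the function names `contains_subsequence` and `count_windows` are hypothetical, and the sliding-window convention (every contiguous block of w events in T is one window) is an assumption.

```python
def contains_subsequence(window, pattern):
    """Return True if `pattern` occurs in `window` as a subsequence.

    The symbols of `pattern` must appear in order, but not
    necessarily contiguously.
    """
    it = iter(window)
    # `symbol in it` advances the iterator until the symbol is found,
    # so each later symbol is searched only to the right of the earlier one.
    return all(symbol in it for symbol in pattern)


def count_windows(text, pattern, w):
    """Count the length-w windows of `text` that contain `pattern`
    as a subsequence (assumed convention: one window per start index)."""
    return sum(
        contains_subsequence(text[i:i + w], pattern)
        for i in range(len(text) - w + 1)
    )
```

Comparing such an observed count against its expected value and variance under a random model is what would turn it into an alarm threshold; that statistical part is not sketched here.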
An Universal Predictor Based on Pattern Matching
 IEEE Trans. Inform. Theory
, 2000
Cited by 23 (1 self)
We consider here a universal predictor based on pattern matching. For a given string $x_1, x_2, \ldots, x_n$, the predictor will guess the next symbol $x_{n+1}$ in such a way that the prediction error tends to zero as $n \to \infty$, provided the string $x_1^n = x_1, x_2, \ldots, x_n$ is generated by a mixing source. We shall prove that the rate of convergence of the prediction error is $O(n^{-\varepsilon})$ for any $\varepsilon > 0$. In this preliminary version, we only prove our results for memoryless sources and give a sketch for mixing sources. However, we indicate that our algorithm can predict equally successfully the next $k$ symbols as long as $k = O(1)$.
1 Introduction
Prediction is important in communication, control, forecasting, investment and other areas. We understand how to do optimal prediction when the data model is known, but one needs to design a universal prediction algorithm that will perform well no matter what the underlying probabilistic model is. More precisely, let $X_1, X_2, \ldots$ be an infinite ...
Source coding, large deviations, and approximate pattern matching
 IEEE Trans. Inform. Theory
, 2002
Cited by 23 (10 self)
Dedicated to the memory of Aaron Wyner, a valued friend and colleague. Abstract: In this review paper, we present a development of parts of rate-distortion theory and pattern-matching algorithms for lossy data compression, centered around a lossy version of the asymptotic equipartition property (AEP). This treatment closely parallels the corresponding development in lossless compression, a point of view that was advanced in an important paper of Wyner and Ziv in 1989. In the lossless case, we review how the AEP underlies the analysis of the Lempel–Ziv algorithm by viewing it as a random code and reducing it to the idealized Shannon code. This also provides information about the redundancy of the Lempel–Ziv algorithm and about the asymptotic behavior of several relevant quantities. In the lossy case, we give various versions of the statement of the generalized AEP and we outline the general methodology of its proof via large deviations. Its relationship with Barron and Orey’s generalized AEP is also discussed. The lossy AEP is applied to i) prove strengthened versions of Shannon’s direct source-coding theorem and universal coding theorems; ii) characterize the performance of “mismatched” codebooks in lossy data compression; iii) analyze the performance of pattern-matching algorithms for lossy compression (including Lempel–Ziv schemes); and iv) determine the first-order asymptotic of waiting times between stationary processes. A refinement to the lossy AEP is then presented, and it is used to i) prove second-order (direct and converse) lossy source-coding theorems, including universal coding theorems; ii) characterize which sources are quantitatively easier to compress; iii) determine the second-order asymptotic of waiting times between stationary processes; and iv) determine the precise asymptotic behavior of longest match-lengths between stationary processes. Finally, we discuss extensions of the above framework and results to random fields.
Index Terms: Data compression, large deviations, pattern-matching, rate-distortion theory.
Profile of Tries
, 2006
Cited by 18 (8 self)
Tries (from retrieval) are one of the most popular data structures on words. They are pertinent to the (internal) structure of stored words and several splitting procedures used in diverse contexts. The profile of a trie is a parameter that represents the number of nodes (either internal or external) with the same distance from the root. It is a function of the number of strings stored in a trie and the distance from the root. Several, if not all, trie parameters such as height, size, depth, shortest path, and fill-up level can be uniformly analyzed through the (external and internal) profiles. Although profiles represent one of the most fundamental parameters of tries, they have hardly been studied in the past. The analysis of profiles is surprisingly arduous, but once it is carried out it reveals unusually intriguing and interesting behavior. We present a detailed study of the distribution of the profiles in a trie built over random strings generated by a memoryless source. We first derive recurrences satisfied by the expected profiles and solve them asymptotically for all possible ranges of the distance from the root. It appears that profiles of tries exhibit several fascinating phenomena. When moving from the root to the leaves of a trie, the growth of the expected profiles varies. Near the root, the external profiles tend to zero at an exponential rate, then the rate gradually rises to being logarithmic; the external profiles then abruptly tend to infinity, first logarithmically
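To make the definition of the profile concrete, the following minimal sketch computes the external and internal profiles of the trie built over a small set of strings. It is an illustration of the definition only, not the asymptotic analysis of the paper. The function name `trie_profiles` is hypothetical, and the sketch assumes the strings are distinct and that none is a prefix of another (as is the case for distinct infinite strings from a random source).

```python
from collections import Counter, defaultdict


def trie_profiles(strings):
    """Return (external, internal) profiles of the trie over `strings`.

    external[k] = number of external nodes (leaves holding one string) at depth k
    internal[k] = number of internal (branching) nodes at depth k

    Assumption: strings are distinct and prefix-free.
    """
    external, internal = Counter(), Counter()

    def split(group, depth):
        if len(group) == 1:
            # A single remaining string is stored in an external node.
            external[depth] += 1
            return
        # Two or more strings share this prefix: an internal node.
        internal[depth] += 1
        buckets = defaultdict(list)
        for s in group:
            buckets[s[depth]].append(s)
        for bucket in buckets.values():
            split(bucket, depth + 1)

    if strings:
        split(list(strings), 0)
    return external, internal
```

The external profile always sums to the number of stored strings, which is a convenient sanity check when experimenting with randomly generated inputs.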
Markov Types and Minimax Redundancy for Markov Sources
 IEEE Trans. Information Theory
, 2003
Cited by 18 (10 self)
Redundancy of universal codes for a class of sources determines by how much the actual code length exceeds the optimal code length. In the minimax scenario one designs the best code for the worst source within the class. Such minimax redundancy comes in two flavors: either on average or for individual sequences. The latter is also known as the maximal or the worst case minimax redundancy. We study the maximal minimax redundancy of universal block codes for Markovian sources of any order. We prove that the maximal minimax redundancy for Markov sources of order $r$ is asymptotically equal to $\frac{1}{2} m^r (m-1) \log_2 n + \log_2 A_m^r - (\ln \ln m^{1/(m-1)})/\ln m + o(1)$, where $n$ is the length of a source sequence, $m$ is the size of the alphabet and $A_m^r$ is an explicit constant (e.g., we find that for a binary alphabet $m = 2$ and Markov of order $r = 1$ the constant $A_2^1 = 16 \cdot G \approx 14.655449504$, where $G$ is the Catalan number). Unlike previous attempts, we view the redundancy problem as an asymptotic evaluation of certain sums over a set of matrices representing Markov types. The enumeration of Markov types is accomplished by reducing it to counting Eulerian paths in a multigraph. In particular, we propose an asymptotic formula for the number of strings of a given Markov type. All of these findings are obtained by analytic and combinatorial tools of analysis of algorithms.
Index terms: Minimax redundancy, Markov sources, Markov types, Eulerian paths, multidimensional generating functions, analytic information theory.
A preliminary version of this paper was presented at Colloquium on Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities, Versailles, 2002.
Multicast Tree Structure and the Power Law
 In Proc. of SODA
, 2002
Cited by 17 (0 self)
One of the main benefits of multicast communication is the overall reduction of network load. To quantify this reduction, when compared to traditional unicast, experimental studies by Chuang and Sirbu indicated the so-called power law, which asserts that the number of links $L(m)$ in a multicast delivery tree connecting a source to $m$ (distinct) sites satisfies $L(m) \approx c m^{0.8}$, where $c$ is a constant. In order to explain this behavior theoretically, Phillips, Shenker, and Tangmunarunkit examined approximately $L(m)$ for a $V$-ary complete tree topology, and concluded that $L(m)$ grows nearly linearly with $m$, thus not obeying the power law. We first reexamine the analysis by Phillips et al. and provide a precise asymptotic expansion for $L(m)$ that confirms the nearly linear (with some wobbling) growth. Claiming that the essence of the problem lies in the modeling assumptions, we replace the $V$-ary complete tree topology by a $V$-ary self-similar tree with similarity factor $0 \le \theta < 1$. In such a tree a node at level $k$ is replicated $C V^{(D-k)\theta}$ times, where $D$ is the depth of the tree and $C$ is a constant. Under this assumption, we analyze again $L(m)$ and prove that $L(m) \sim c m^{1-\theta}$, where $c$ is an explicitly computable constant. Hence self-similar trees provide a plausible explanation of the multicast power law. Next, we examine more general conditions for general trees, under which the power law still holds. We also discuss some experimental results in real networks that reaffirm the power law and show that in these networks the general conditions hold. In particular, our experiments show that for the tested networks $\theta \approx 0.12$.
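For intuition about how the number of links grows in the complete-tree model, the expected value of L(m) can be computed exactly under a simple random model. The sketch below is an illustration under stated assumptions, not the derivation of the paper: it assumes the source sits at the root of a complete V-ary tree of depth D and that the m distinct sites are chosen uniformly at random among the V^D leaves. The function name `expected_links` is hypothetical. A link entering level k is used exactly when at least one chosen site lies among the V^(D-k) leaves below it.

```python
from math import comb


def expected_links(V, D, m):
    """Expected number of links in the delivery tree from the root of a
    complete V-ary tree of depth D to m distinct, uniformly chosen leaves.

    Assumed model (not taken from the source): sites are leaves, sampled
    without replacement.
    """
    n_leaves = V ** D
    if not 0 <= m <= n_leaves:
        raise ValueError("m must be between 0 and the number of leaves")
    total = 0.0
    for k in range(1, D + 1):
        below = V ** (D - k)  # leaves in the subtree under one level-k link
        # Probability that none of the m chosen leaves falls below this link.
        p_unused = comb(n_leaves - below, m) / comb(n_leaves, m)
        # There are V**k links entering level k, each used with the same probability.
        total += V ** k * (1.0 - p_unused)
    return total
```

Evaluating `expected_links(V, D, m)` over a range of m for fixed V and D, and plotting the result against m on a log-log scale, gives a direct way to compare this model's growth with a power law.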
Assessing significance of connectivity and conservation in protein interaction networks
 Journal of Computational Biology
, 2006
Cited by 17 (4 self)
Computational and comparative analysis of protein-protein interaction (PPI) networks enables understanding of the modular organization of the cell through identification of functional modules and protein complexes. These analysis techniques generally rely on topological features such as connectedness, based on the premise that functionally related proteins are likely to interact densely and that these interactions follow similar evolutionary trajectories. Significant recent work in our lab and in other labs has focused on efficient algorithms for identification of modules and their conservation. Application of these methods to a variety of networks has yielded novel biological insights. In spite of algorithmic advances, development of a comprehensive infrastructure for interaction databases is in relative infancy compared to corresponding sequence analysis tools such as BLAST and CLUSTAL. One critical component of this infrastructure is a measure of the statistical significance of a match or a dense subcomponent. Corresponding sequence-based measures such as E-values are key components of sequence matching tools. In the absence of an analytical measure, conventional methods rely on computer simulations based on ad hoc models for quantifying significance. This paper presents the first such effort, to the best of our knowledge, aimed at analytically quantifying statistical significance