Results 11–20 of 172
A Generalized Suffix Tree and Its (Un)Expected Asymptotic Behaviors
 SIAM J. Computing
, 1996
Abstract

Cited by 52 (29 self)
Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compression and codes. Despite this, very little is known about their typical behaviors. In a probabilistic framework, we consider a family of suffix trees, further called b-suffix trees, built from the first n suffixes of a random word. In this family a noncompact suffix tree (i.e., such that every edge is labeled by a single symbol) is represented by b = 1, and a compact suffix tree (i.e., without unary nodes) is asymptotically equivalent to b → ∞ as n → ∞. We study several parameters of b-suffix trees, namely: the depth of a given suffix, the depth of insertion, the height and the shortest feasible path. Some new results concerning typical (i.e., almost sure) behaviors of these parameters are established. These findings are used to obtain several insights into certain algorithms on words, molecular biology and universal data compression schemes. …
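As an illustration of the parameters studied above, the noncompact case (b = 1, every edge labeled by a single symbol) is just a character trie over the suffixes; `insertion_depth` and `height` below are illustrative names, not the paper's notation:

```python
def insertion_depth(root, suffix):
    """Insert `suffix` into the trie rooted at `root` (a dict of dicts) and
    return the depth of insertion: the number of symbols matched along
    already-existing edges before the new branch is created."""
    node, depth = root, 0
    symbols = iter(suffix)
    for ch in symbols:
        if ch in node:
            node, depth = node[ch], depth + 1
        else:
            # branch off: create the remaining single-symbol edges
            node = node.setdefault(ch, {})
            for rest in symbols:
                node = node.setdefault(rest, {})
            break
    return depth

def height(node):
    """Height of the trie: the length of the longest root-to-leaf path."""
    return 1 + max(map(height, node.values())) if node else 0

root = {}
word = "ababb"
depths = [insertion_depth(root, word[i:]) for i in range(len(word))]
```

For "ababb" the depths of insertion come out as [0, 0, 2, 1, 1], and the height of a noncompact suffix trie is simply the length of the longest suffix.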
Lempel-Ziv parsing and sublinear-size index structures for string matching (Extended Abstract)
 Proc. 3rd South American Workshop on String Processing (WSP'96)
, 1996
Abstract

Cited by 48 (1 self)
String matching over a long text can be significantly sped up with an index structure formed by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinear-size index structure. The new structure is based on Lempel-Ziv parsing of the text and has size linear in N, the size of the Lempel-Ziv parse. For a text of length n, N = O(n / log n) and can be still smaller if the text is compressible. With the new index structure, all occurrences of a pattern string of length m can be found in time O(m² …
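A minimal sketch of the greedy parse such an index is built on (LZ78-style: each phrase extends a previously seen phrase by one symbol); the function and variable names are illustrative, not from the paper:

```python
def lz78_parse(text):
    """Greedy LZ78-style parse. Returns a list of (phrase_index, symbol)
    pairs; the number of phrases N is the quantity the index size is
    linear in, and N = O(n / log n) for a text of length n."""
    dictionary = {"": 0}          # phrase -> index
    phrases = []
    cur = ""
    for ch in text:
        if cur + ch in dictionary:
            cur += ch             # keep extending the current phrase
        else:
            phrases.append((dictionary[cur], ch))
            dictionary[cur + ch] = len(dictionary)
            cur = ""
    if cur:                       # leftover: an already-known phrase
        phrases.append((dictionary[cur], ""))
    return phrases

phrases = lz78_parse("abababab")
```

Here the 8-symbol text parses into 5 phrases; on longer, repetitive texts the gap between N and n is what makes a sublinear-size index possible.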
Computational mechanics: Pattern and prediction, structure and simplicity
 Journal of Statistical Physics
, 1999
Abstract

Cited by 43 (8 self)
Computational mechanics, an approach to structural complexity, defines a process's causal states and gives a procedure for finding them. We show that the causal-state representation, an ε-machine, is the minimal one consistent with …
JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support
 In IEEE Fourth International Conference on Cluster Computing
, 2002
Abstract

Cited by 43 (6 self)
A distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications. Most existing DJVMs suffer from the slow Java execution in interpretive mode and thus may not be efficient enough for solving computation-intensive problems. We present JESSICA2, a new DJVM running in JIT compilation mode that can execute multithreaded Java applications transparently on clusters. JESSICA2 provides a single system image (SSI) illusion to Java applications via an embedded global object space (GOS) layer. It implements a cluster-aware Java execution engine that supports transparent Java thread migration for achieving dynamic load balancing. We discuss the issues of supporting transparent Java thread migration in a JIT compilation environment and propose several lightweight solutions. An adaptive migrating-home protocol used in the implementation of the GOS is introduced. The system has been implemented on x86-based Linux clusters, and significant performance improvements over the previous JESSICA system have been observed.
Compressing relations and indexes
 In proceedings of IEEE International Conference on Data Engineering
, 1998
Abstract

Cited by 42 (0 self)
We propose a new compression algorithm that is tailored to database applications. It can be applied to a collection of records, and is especially effective for records with many low to medium cardinality fields and numeric fields. In addition, this new technique supports very fast decompression. Promising application domains include decision support systems (DSS), since "fact tables", which are by far the largest tables in these applications, contain many low and medium cardinality fields and typically no text fields. Further, our decompression rates are faster than typical disk throughputs for sequential scans; in contrast, gzip is slower. This is important in DSS applications, which often scan large ranges of records. An important distinguishing characteristic of our algorithm, in contrast to compression algorithms proposed earlier, is that we can decompress individual tuples (even individual fields), rather than a full page (or an entire relation) at a time. Also, all the information needed for tuple decompression resides on the same page with the tuple. This means that a page can be stored in the buffer pool and used in compressed form, simplifying the job of the buffer manager and improving memory utilization. Our compression algorithm also improves index structures such as B-trees and R-trees significantly by reducing the number of leaf pages and compressing index entries, which greatly increases the fanout. We can also use lossy compression on the internal nodes of an index.
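The low-cardinality-field idea can be sketched with plain dictionary encoding (a simplification for illustration, not the paper's exact algorithm); note that a single tuple decodes without touching its neighbours:

```python
import math

def dictionary_encode(column):
    """Replace each field value by a small integer code. With cardinality c,
    a value needs only ceil(log2(c)) bits instead of a full-width field."""
    decode = sorted(set(column))                 # per-page dictionary
    codes = {v: i for i, v in enumerate(decode)}
    encoded = [codes[v] for v in column]
    bits_per_value = max(1, math.ceil(math.log2(len(decode))))
    return encoded, decode, bits_per_value

def decode_field(encoded, decode, row):
    """Per-tuple decompression: only the code and the dictionary stored on
    the same page are needed, no neighboring tuples."""
    return decode[encoded[row]]

states = ["CA", "NY", "CA", "TX", "NY", "CA"]
encoded, decode, bits = dictionary_encode(states)
```

A cardinality-3 field fits in 2 bits per value, and `decode_field` works on any single row in isolation, mirroring the per-tuple (even per-field) decompression property described above.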
Asymptotic Properties Of Data Compression And Suffix Trees
 IEEE Trans. Inform. Theory
, 1993
Abstract

Cited by 40 (11 self)
Recently, Wyner and Ziv have proved that the typical length of a repeated subword found within the first n positions of a stationary ergodic sequence is (1/h) log n in probability, where h is the entropy of the alphabet. This finding was used to obtain several insights into certain universal data compression schemes, most notably the Lempel-Ziv data compression algorithm. Wyner and Ziv have also conjectured that their result can be extended to a stronger almost sure convergence. In this paper, we settle this conjecture in the negative in the so-called right domain asymptotic, that is, during a dynamic phase of expanding the database. We prove, under an additional assumption involving mixing conditions, that the length of a typical repeated subword oscillates almost surely (a.s.) between (1/h₁) log n and (1/h₂) log n, where 0 < h₂ ≤ h ≤ h₁ < ∞. We also show that the length of the nth block in the Lempel-Ziv parsing algorithm reveals a similar behavior. We relate our findings to …
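The quantity in question, the length of the longest subword repeated within the first n positions, can be computed naively for small inputs (an illustrative O(n³) check, unrelated to the paper's analysis):

```python
def longest_repeat_length(s):
    """Length of the longest subword occurring at least twice in s."""
    n = len(s)
    for length in range(n - 1, 0, -1):   # try longest candidates first
        seen = set()
        for i in range(n - length + 1):
            sub = s[i:i + length]
            if sub in seen:              # second occurrence found
                return length
            seen.add(sub)
    return 0
```

For a random sequence this length grows like (1/h) log n in probability; the result above says that almost surely it does not converge, oscillating between (1/h₁) log n and (1/h₂) log n.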
Interval and Recency Rank Source Coding: Two On-Line Adaptive Variable-Length Schemes
, 1987
Abstract

Cited by 40 (0 self)
In the schemes presented, the encoder maps each message into a codeword in a prefix-free codeword set. In interval encoding the codeword is indexed by the interval since the last previous occurrence of that message, and the codeword set must be countably infinite. In recency rank encoding the codeword is indexed by the number of distinct messages in that interval, and there must be no fewer codewords than messages. The decoder decodes each codeword on receipt. Users need not know message probabilities, but must agree on indexings of the codeword set in an order of increasing length and of the message set in some arbitrary order. The average codeword length over a communications bout is never much larger than the value for an off-line scheme which maps the jth most frequent message in the bout into the jth shortest codeword in the given set, and is never too much larger than the value for off-line Huffman encoding of messages into the best codeword set for the bout message frequencies.
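The two indexings can be sketched directly from the description above (first occurrences are marked None in this sketch; the schemes' actual handling of first occurrences is more careful):

```python
def interval_indices(messages):
    """Interval encoding: index each message by the interval (time elapsed)
    since its last previous occurrence."""
    last, out = {}, []
    for t, m in enumerate(messages):
        out.append(t - last[m] if m in last else None)
        last[m] = t
    return out

def recency_ranks(messages):
    """Recency rank encoding: index each message by the number of distinct
    messages seen since its last occurrence (a move-to-front rank)."""
    mtf, out = [], []
    for m in messages:
        if m in mtf:
            r = mtf.index(m)
            out.append(r)
            mtf.pop(r)
        else:
            out.append(None)     # first occurrence
        mtf.insert(0, m)         # move (or insert) to front
    return out
```

A recency rank never exceeds the corresponding interval, since at most that many distinct messages can occur in it, which is why recency rank coding gets by with a codeword set no larger than the message set.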
Automatic inference of models for statistical code compression
 In Proceedings of the ACM Conference on Programming Language Design and Implementation
, 1999
Abstract

Cited by 35 (1 self)
This paper describes experiments that apply machine learning to compress computer programs, formalizing and automating decisions about instruction encoding that have traditionally been made by humans in a more ad hoc manner. A program accepts a large training set of program material in a conventional compiler intermediate representation (IR) and automatically infers a decision tree that separates IR code into streams that compress much better than the undifferentiated whole. Driving a conventional arithmetic compressor with this model yields code 30% smaller than the previous record for IR code compression, and 24% smaller than an ambitious optimizing compiler feeding an ambitious general-purpose data compressor.
Keywords: abstract machines, code compaction, code compression, compiler intermediate languages and representations, data compression, decision trees, machine learning, statistical models, virtual machines.
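The core effect, separated streams compressing better than the undifferentiated whole, can be illustrated with zero-order entropies standing in for an arithmetic coder (a toy sketch; the stream, predicate, and names are invented for illustration):

```python
import math
from collections import Counter

def coded_bits(stream):
    """Idealized output size, in bits, of an arithmetic coder driven by a
    zero-order model of the stream: n * H(stream)."""
    counts, n = Counter(stream), len(stream)
    return -sum(c * math.log2(c / n) for c in counts.values())

def split_gain(stream, predicate):
    """Bits saved by coding the two predicate-separated streams instead of
    the whole, mimicking a single decision-tree split over IR tokens."""
    a = [x for x in stream if predicate(x)]
    b = [x for x in stream if not predicate(x)]
    return coded_bits(stream) - (coded_bits(a) + coded_bits(b))

# Opcodes interleaved with operands: pulling them apart helps the model.
ir = ["op", "0", "op", "1", "op", "0", "op", "1"]
gain = split_gain(ir, lambda tok: tok == "op")
```

Each sub-stream is more predictable than the mix, so a per-stream model spends fewer total bits; the inferred decision tree repeats this kind of split recursively.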
System Identification, Approximation and Complexity
 International Journal of General Systems
, 1977
Abstract

Cited by 34 (23 self)
This paper is concerned with establishing broadly based, system-theoretic foundations and practical techniques for the problem of system identification that are rigorous, intuitively clear and conceptually powerful. A general formulation is first given in which two order relations are postulated on a class of models: a constant one of complexity; and a variable one of approximation induced by an observed behaviour. An admissible model is such that any less complex model is a worse approximation. The general problem of identification is that of finding the admissible subspace of models induced by a given behaviour. It is proved under very general assumptions that, if deterministic models are required, then nearly all behaviours require models of nearly maximum complexity. A general theory of approximation between models and behaviour is then developed, based on subjective probability concepts and semantic information theory. The roles of structural constraints such as causality, locality, finite memory, etc., are then discussed as rules of the game. These concepts and results are applied to the specific problem of stochastic automaton, or grammar, inference. Computational results are given to demonstrate that the theory is complete and fully operational. Finally, the formulation of identification proposed in this paper is analysed in terms of Klir's epistemological hierarchy, and both are discussed in terms of the rich philosophical literature on the acquisition of knowledge.
Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String
 ALGORITHMS ON STRINGS, TREES, AND SEQUENCES: COMPUTER SCIENCE AND COMPUTATIONAL BIOLOGY
, 1998
Abstract

Cited by 34 (2 self)
A tandem repeat (or square) is a string αα, where α is a nonempty string. We present an O(|S|)-time algorithm that operates on the suffix tree T(S) for a string S, finding and marking the endpoint in T(S) of every tandem repeat that occurs in S. This decorated suffix tree implicitly represents all occurrences of tandem repeats in S, and can be used to efficiently solve many questions concerning tandem repeats and tandem arrays in S. This improves and generalizes several prior efforts to efficiently capture large subsets of tandem repeats.
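For contrast with the O(|S|) suffix-tree algorithm above, the definition admits a direct cubic-time scan (illustrative only):

```python
def tandem_repeats(s):
    """All tandem repeats αα in s, reported as (start, len(α)) pairs.
    Naive O(n^3) comparison; the decorated suffix tree represents the
    same set implicitly in linear space."""
    n, out = len(s), []
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                out.append((i, half))
    return out
```

The number of occurrences can be quadratic in |s| (consider "aaaa…"), which is exactly why an implicit linear-size representation is the interesting object.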