A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices, 2002
Cited by 46 (3 self)
The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2 / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn^2 / log n) where h ≤ 1 is the entropy of the text.
Author affiliations: Institut Gaspard-Monge, Universite de Marne-la-Vallee, Cite Descartes, Champs-sur-Marne, 77454 Marne-la-Vallee Cedex 2, France, email: mac@univ-mlv.fr. Department of Computer Science, Haifa University, Haifa 31905, Israel, phone: (972-4) 824-0103, FAX: (972-4) 824-9331; Department of Computer and Information Science, Polytechnic University, Six MetroTech Center, Brooklyn, NY 11201-3840; email: landau@poly.edu; partially supported by NSF grant CCR-0104307, by NATO Science Programme grant PST.CLG.977017, by the Israel Science Foundation (grants 173/98 and 282/01), by the FIRST Foundation of the Israel Academy of Science and Humanities, and by IBM Faculty Partnership Award. Department of Computer Science, Haifa University, Haifa 31905, Israel; on education leave from the IBM T.J.W. Research Center; email: michal@cs.haifa.il; partially supported by the Israel Science Foundation (grants 173/98 and 282/01), and by the FIRST Foundation of the Israel Academy of Science ...
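For orientation, the classical quadratic-time dynamic program that the abstract above takes as its baseline can be sketched as follows. This is a minimal illustration of the textbook global-alignment recurrence, not the sub-quadratic algorithm of the paper; the names `global_alignment_score` and `simple_score`, and the linear gap penalty, are assumptions made for this example.

```python
def global_alignment_score(a, b, score, gap):
    """Classical O(len(a) * len(b)) dynamic program for global similarity.

    `score(x, y)` plays the role of an unrestricted scoring matrix;
    `gap` is a (hypothetical) linear penalty added per inserted gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] holds the best score for aligning a[:i] with b[:j].
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # match or mismatch
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    return dp[-1][-1]


def simple_score(x, y):
    # Illustrative scoring only: +1 for equal symbols, -1 otherwise.
    return 1 if x == y else -1
```

Every cell of the table is filled exactly once, which is where the O(n^2) cost for two strings of size n comes from.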
A Generalized Suffix Tree and Its (Un)Expected Asymptotic Behaviors - SIAM J. Computing, 1996
Cited by 45 (27 self)
Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compressions and codes. Despite this, very little is known about their typical behaviors. In a probabilistic framework, we consider a family of suffix trees -- further called b-suffix trees -- built from the first n suffixes of a random word. In this family a noncompact suffix tree (i.e., such that every edge is labeled by a single symbol) is represented by b = 1, and a compact suffix tree (i.e., without unary nodes) is asymptotically equivalent to b → ∞ as n → ∞. We study several parameters of b-suffix trees, namely: the depth of a given suffix, the depth of insertion, the height and the shortest feasible path. Some new results concerning typical (i.e., almost sure) behaviors of these parameters are established. These findings are used to obtain several insights into certain algorithms on words, molecular biology and universal data compression schemes. Key Wo...
Compressing relations and indexes - In Proceedings of the IEEE International Conference on Data Engineering, 1998
Cited by 39 (0 self)
We propose a new compression algorithm that is tailored to database applications. It can be applied to a collection of records, and is especially effective for records with many low to medium cardinality fields and numeric fields. In addition, this new technique supports very fast decompression. Promising application domains include decision support systems (DSS), since "fact tables", which are by far the largest tables in these applications, contain many low and medium cardinality fields and typically no text fields. Further, our decompression rates are faster than typical disk throughputs for sequential scans; in contrast, gzip is slower. This is important in DSS applications, which often scan large ranges of records. An important distinguishing characteristic of our algorithm, in contrast to compression algorithms proposed earlier, is that we can decompress individual tuples (even individual fields), rather than a full page (or an entire relation) at a time. Also, all the information needed for tuple decompression resides on the same page with the tuple. This means that a page can be stored in the buffer pool and used in compressed form, simplifying the job of the buffer manager and improving memory utilization. Our compression algorithm also improves index structures such as B-trees and R-trees significantly by reducing the number of leaf pages and compressing index entries, which greatly increases the fan-out. We can also use lossy compression on the internal nodes of an index.
JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support - In IEEE Fourth International Conference on Cluster Computing, 2002
Cited by 39 (6 self)
A distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multi-threaded Java applications. Most existing DJVMs suffer from the slow Java execution in interpretive mode and thus may not be efficient enough for solving computation-intensive problems. We present JESSICA2, a new DJVM running in JIT compilation mode that can execute multi-threaded Java applications transparently on clusters. JESSICA2 provides a single system image (SSI) illusion to Java applications via an embedded global object space (GOS) layer. It implements a cluster-aware Java execution engine that supports transparent Java thread migration for achieving dynamic load balancing. We discuss the issues of supporting transparent Java thread migration in a JIT compilation environment and propose several lightweight solutions. An adaptive migrating-home protocol used in the implementation of the GOS is introduced. The system has been implemented on x86-based Linux clusters, and significant performance improvements over the previous JESSICA system have been observed.
Asymptotic Properties Of Data Compression And Suffix Trees - IEEE Trans. Inform. Theory, 1993
Cited by 36 (10 self)
Recently, Wyner and Ziv have proved that the typical length of a repeated subword found within the first n positions of a stationary ergodic sequence is (1/h) log n in probability where h is the entropy of the alphabet. This finding was used to obtain several insights into certain universal data compression schemes, most notably the Lempel-Ziv data compression algorithm. Wyner and Ziv have also conjectured that their result can be extended to a stronger almost sure convergence. In this paper, we settle this conjecture in the negative in the so-called right domain asymptotic, that is, during a dynamic phase of expanding the data base. We prove -- under an additional assumption involving mixing conditions -- that the length of a typical repeated subword oscillates almost surely (a.s.) between (1/h_1) log n and (1/h_2) log n where 0 < h_2 < h ≤ h_1 < ∞. We also show that the length of the nth block in the Lempel-Ziv parsing algorithm reveals a similar behavior. We relate our findings to...
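To make "the length of the nth block in the Lempel-Ziv parsing algorithm" concrete, here is a minimal sketch of incremental parsing. It assumes the LZ78-style variant, in which each new block is the shortest prefix of the remaining text that has not yet appeared as a block; the abstract does not say which variant is meant, and the function name `lz78_phrases` is hypothetical.

```python
def lz78_phrases(text):
    """Split `text` into LZ78-style blocks (incremental parsing).

    Each block is the shortest prefix of the unparsed remainder that has
    not already occurred as an earlier block.
    """
    phrases = []
    seen = set()
    current = ""
    for ch in text:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # The input ended in the middle of a block; keep the leftover,
        # which necessarily repeats an earlier block.
        phrases.append(current)
    return phrases


# Block lengths are the quantity whose growth the abstract discusses.
block_lengths = [len(p) for p in lz78_phrases("abababab")]
```

On a highly repetitive input the blocks grow quickly, while on a high-entropy input they stay short, which is the intuition behind relating block length to log n and the entropy h.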
Automatic inference of models for statistical code compression - In Proceedings of the ACM Conference on Programming Language Design and Implementation, 1999
Cited by 35 (1 self)
This paper describes experiments that apply machine learning to compress computer programs, formalizing and automating decisions about instruction encoding that have traditionally been made by humans in a more ad hoc manner. A program accepts a large training set of program material in a conventional compiler intermediate representation (IR) and automatically infers a decision tree that separates IR code into streams that compress much better than the undifferentiated whole. Driving a conventional arithmetic compressor with this model yields code 30% smaller than the previous record for IR code compression, and 24% smaller than an ambitious optimizing compiler feeding an ambitious general-purpose data compressor. Keywords: abstract machines, code compaction, code compression, compiler intermediate languages and representations, data compression, decision trees, machine learning, statistical models, virtual machines.
Interval and Recency Rank Source Coding: Two On-Line Adaptive Variable-Length Schemes, 1987
Cited by 31 (0 self)
In the schemes presented the encoder maps each message into a codeword in a prefix-free codeword set. In interval encoding the codeword is indexed by the interval since the last previous occurrence of that message, and the codeword set must be countably infinite. In recency rank encoding the codeword is indexed by the number of distinct messages in that interval, and there must be no fewer codewords than messages. The decoder decodes each codeword on receipt. Users need not know message probabilities, but must agree on indexings, of the codeword set in an order of increasing length and of the message set in some arbitrary order. The average codeword length over a communications bout is never much larger than the value for an off-line scheme which maps the jth most frequent message in the bout into the jth shortest codeword in the given set, and is never too much larger than the value for off-line Huffman encoding of messages into the best codeword set for the bout message frequencies.
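The two indexings described in the abstract above can be sketched as follows. This is only an illustration of the index computation, not of the prefix-free codeword sets themselves. Several conventions here are assumptions: ranks are 1-based positions in a most-recent-first list (so a rank counts the message itself plus the distinct messages seen since its last occurrence), the agreed initial message order is passed in as `alphabet`, and a first occurrence has no interval and is reported as `None`. The function names are hypothetical.

```python
def recency_ranks(messages, alphabet):
    """Return the recency rank of each message (1 = most recently used).

    `alphabet` is the arbitrary initial order that encoder and decoder
    must agree on before any message has been seen.
    """
    recency = list(alphabet)  # most recently used message first
    ranks = []
    for m in messages:
        pos = recency.index(m)
        ranks.append(pos + 1)
        # Move the message to the front so it becomes rank 1.
        recency.insert(0, recency.pop(pos))
    return ranks


def intervals(messages):
    """Return the distance to the previous occurrence of each message."""
    last_seen = {}
    out = []
    for t, m in enumerate(messages):
        out.append(t - last_seen[m] if m in last_seen else None)
        last_seen[m] = t
    return out
```

In both schemes a frequently repeated message tends to receive a small index, which is why mapping small indices to short codewords adapts to the source without knowing message probabilities in advance.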
Computational mechanics: Pattern and prediction, structure and simplicity - Journal of Statistical Physics, 1999
Cited by 31 (7 self)
Computational mechanics, an approach to structural complexity, defines a process's causal states and gives a procedure for finding them. We show that the causal-state representation, an ε-machine, is the minimal one consistent with
Query Optimization In Compressed Database Systems - In ACM SIGMOD, 2001
Cited by 28 (0 self)
Over the last decades, improvements in CPU speed have outpaced improvements in main memory and disk access rates by orders of magnitude, enabling the use of data compression techniques to improve the performance of database systems. Previous work describes the benefits of compression for numerical attributes, where data is stored in compressed format on disk. Despite the abundance of string-valued attributes in relational schemas there is little work on compression for string attributes in a database context. Moreover, none of the previous work suitably addresses the role of the query optimizer: During query execution, data is either eagerly decompressed when it is read into main memory, or data lazily stays compressed in main memory and is decompressed on demand only. In this paper, we present an effective approach for database compression based on lightweight, attribute-level compression techniques. We propose a Hierarchical Dictionary Encoding strategy that intelligently selects the most effective compression method for string-valued attributes. We show that eager and lazy decompression strategies produce suboptimal plans for queries involving compressed string attributes. We then formalize the problem of compression-aware query optimization and propose one provably optimal and two fast heuristic algorithms for selecting a query plan for relational schemas with compressed attributes; our algorithms can easily be integrated into existing cost-based query optimizers. Experiments using TPC-H data demonstrate the impact of our string compression methods and show the importance of compression-aware query optimization. Our approach results in up to an order speed up over existing approaches.
Self-Alignment in Words and their Applications - J. Algorithms, 1992
Cited by 27 (8 self)
Some quantities associated with periodicities in words are analyzed within the Bernoulli probabilistic model. In particular, the following problem is addressed. Assume that a string X is given, with symbols emitted randomly but independently according to some known distribution of probabilities. Then, for each pair (W, Z) of distinct suffixes of X, the expected length of the longest common prefix of W and Z is sought. The collection of these lengths, which are called here self-alignments, plays a crucial role in several algorithmic problems on words, such as building suffix trees or inverted files, detecting squares and other regularities, computing substring statistics, etc. The asymptotically best algorithms for these problems are quite complex and thus risk being impractical. The present analysis of self-alignments and related measures suggests that, in a variety of cases, more straightforward algorithmic solutions may yield comparable or even better performances. Key words and ph...
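The quantity analyzed in the abstract above can be illustrated on a concrete string. The sketch below computes, by brute force, the longest-common-prefix length for every pair of distinct suffixes of one fixed string; the paper studies the expected value of these lengths under a random model, which this code does not attempt. The names `lcp_length` and `self_alignments` are hypothetical.

```python
def lcp_length(x, i, j):
    """Length of the longest common prefix of the suffixes x[i:] and x[j:]."""
    n = len(x)
    k = 0
    while i + k < n and j + k < n and x[i + k] == x[j + k]:
        k += 1
    return k


def self_alignments(x):
    """Map each pair (i, j), i < j, of suffix start positions to its LCP length."""
    n = len(x)
    return {
        (i, j): lcp_length(x, i, j)
        for i in range(n)
        for j in range(i + 1, n)
    }


# Example on a short periodic word: suffixes "abab" and "ab" share "ab".
pairs = self_alignments("abab")
```

This direct approach is quadratic in the number of suffix pairs, with additional work per pair proportional to the match length, so it is meant only to show what is being measured.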

