Results 1 - 10 of 29
Towards parameter-free data mining
 In: Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining
, 2004
Cited by 118 (18 self)
Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.
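The abstract above says the approach needs only an off-the-shelf compressor and about a dozen lines of code. As a rough illustration, here is a minimal Python sketch of one compression-based dissimilarity: the compressed size of the concatenation divided by the sum of the individual compressed sizes. The abstract does not state the formula, so treat this ratio, the choice of zlib, and the function names `compressed_size` and `cdm` as assumptions.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity (assumed form): C(xy) / (C(x) + C(y)).

    Values near 1 suggest x and y share little structure;
    smaller values suggest that one helps compress the other.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))
```

The measure has no tunable parameters beyond the choice of compressor, which is the point the abstract makes. Any compressor exposing a "compressed length" could be swapped in.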
Offline compression by greedy textual substitution
 PROC. IEEE
, 2000
Cited by 25 (1 self)
Greedy offline textual substitution refers to the following approach to compression or structural inference. Given a long textstring x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted textstring until substrings capable of producing contractions can no longer be found. This paper examines computational issues arising in the implementation of this paradigm and describes some applications and experiments.
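The greedy loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation. It replaces every occurrence of the chosen substring with a reference to a stored rule, rather than keeping one instance in place and using pointer pairs. The gain formula, the `max_len` bound, and all function names are assumptions made for the example.

```python
def occurrences(seq, pat):
    """Count left-to-right, non-overlapping matches of pat in seq."""
    count, i, m = 0, 0, len(pat)
    while i + m <= len(seq):
        if tuple(seq[i:i + m]) == pat:
            count += 1
            i += m
        else:
            i += 1
    return count


def best_substring(seq, max_len):
    """Return the substring with the largest positive contraction, or None."""
    best, best_gain, seen = None, 0, set()
    for m in range(2, max_len + 1):
        for i in range(len(seq) - m + 1):
            pat = tuple(seq[i:i + m])
            if pat in seen:
                continue
            seen.add(pat)
            k = occurrences(seq, pat)
            # Assumed cost model: k copies of m tokens become k references,
            # plus one stored copy of the substring (m tokens).
            gain = k * m - (k + m)
            if gain > best_gain:
                best, best_gain = pat, gain
    return best


def greedy_compress(text, max_len=8):
    """Repeat the best substitution until no substring yields a contraction."""
    seq, rules = list(text), []
    while True:
        pat = best_substring(seq, max_len)
        if pat is None:
            return seq, rules
        rule_id, m = len(rules), len(pat)
        rules.append(pat)
        out, i = [], 0
        while i < len(seq):
            if tuple(seq[i:i + m]) == pat:
                out.append(rule_id)  # integer tokens are rule references
                i += m
            else:
                out.append(seq[i])
                i += 1
        seq = out


def expand(seq, rules):
    """Invert greedy_compress by recursively expanding rule references."""
    out = []
    for tok in seq:
        if isinstance(tok, int):
            out.extend(expand(rules[tok], rules))
        else:
            out.append(tok)
    return "".join(out)
```

Each round strictly reduces the total size (sequence plus stored rules), so the loop terminates. The brute-force search is only suitable for short inputs.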
DNA-Based Cryptography
 5th DIMACS Workshop on DNA Based Computers, MIT
, 1999
Cited by 18 (4 self)
Recent research has considered DNA as a medium for ultra-scale computation and for ultra-compact information storage. One potential key application is DNA-based, molecular cryptography systems. We present some procedures for DNA-based cryptography based on one-time-pads that are in principle unbreakable. Practical applications of cryptographic systems based on one-time-pads are limited in conventional electronic media by the size of the one-time-pad; however DNA provides a much more compact storage medium, and an extremely small amount of DNA suffices even for huge one-time-pads. We detail procedures for two DNA one-time-pad encryption schemes: (i) a substitution method using libraries of distinct pads, each of which defines a specific, randomly generated, pairwise mapping; and (ii) an XOR scheme utilizing molecular computation and indexed, random key strings. These methods can be applied either for the encryption of natural DNA or for artificial DNA encoding binary data. In the latter case, we also present a novel use of chip-based DNA microarray technology for 2D data input and output.
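The XOR scheme in (ii) is carried out with molecular computation in the paper. The Python sketch below is only an in-software analogue of the idea, for binary data encoded as DNA. It assumes a two-bits-per-base mapping (A=00, C=01, G=10, T=11), which the abstract does not specify, and all function names are hypothetical.

```python
import secrets

BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(BASES[(b >> shift) & 0b11] for b in data for shift in (6, 4, 2, 0))


def dna_to_bytes(seq: str) -> bytes:
    """Decode a base string whose length is a multiple of four."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        value = 0
        for ch in seq[i:i + 4]:
            value = (value << 2) | BASES.index(ch)
        out.append(value)
    return bytes(out)


def new_pad(n_bases: int) -> str:
    """Draw a random pad; a one-time pad must never be reused."""
    return "".join(secrets.choice(BASES) for _ in range(n_bases))


def xor_dna(seq: str, pad: str) -> str:
    """XOR two base strings position by position on their 2-bit codes."""
    if len(pad) < len(seq):
        raise ValueError("pad must be at least as long as the message")
    return "".join(BASES[BASES.index(s) ^ BASES.index(p)] for s, p in zip(seq, pad))
```

Because XOR is its own inverse, applying `xor_dna` with the same pad a second time recovers the plaintext strand.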
Computationally Inspired Biotechnologies: Improved DNA Synthesis and Associative Search Using Error-Correcting Codes and Vector-Quantization
 Sixth International Meeting on DNA Based Computers (DNA6), DIMACS Series in Discrete Mathematics and Theoretical Computer Science
, 2000
Cited by 15 (7 self)
The main theme of this paper is to take inspiration from methods used in computer science and related disciplines, and to apply these to develop improved biotechnology. In particular, our proposed improvements are made by adapting various information theoretic coding techniques which originate in computational and information processing disciplines, but which we retailor to work in the biotechnology context. (a) We apply Error-Correcting Codes, developed to correct transmission errors in electronic media, to decrease (in certain contexts, optimally) error rates in optically-addressed DNA synthesis (e.g., of DNA chips). (b) We apply Vector-Quantization (VQ) Coding techniques (which were previously used to cluster, quantize, and compress data such as speech and images) to improve I/O rates (in certain contexts, optimally) for transformation of electronic data to and from DNA with bounded error. (c) We also apply VQ Coding techniques, some of which hierarchically cluster ...
Implementing the Context Tree Weighting Method for Text Compression
 In Data Compression Conference
, 2000
Cited by 14 (2 self)
Context tree weighting method is a universal compression algorithm for FSMX sources. Though we expect that it will have good compression ratio in practice, it is difficult to implement it and in many cases the implementation is only for estimating compression ratio. Though Willems and Tjalkens showed practical implementation using not block probabilities but conditional probabilities, it is used for only binary alphabet sequences. We extend the method for multi-alphabet sequences and show a simple implementation using PPM techniques. We also propose a method to optimize a parameter of the context tree weighting for binary alphabet case. Experimental results on texts and DNA sequences show that the performance of PPM can be improved by combining the context tree weighting and that DNA sequences can be compressed in less than 2.0 bpc.
A Simple Statistical Algorithm for Biological Sequence Compression
 DATA COMPRESSION CONFERENCE
, 2007
Cited by 11 (0 self)
This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
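The "panel of experts" idea can be illustrated with a small Python sketch. It is not the paper's algorithm. The experts here are plain order-k frequency models with add-one smoothing, and the weights are updated by a Bayesian rule; both are assumptions chosen for brevity. Instead of running an arithmetic coder, the sketch sums -log2 of the mixed probability of each symbol, which is the code length an ideal arithmetic coder would approach.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextExpert:
    """Order-k frequency expert with add-one smoothing (an assumed expert type)."""

    def __init__(self, order):
        self.order = order
        self.counts = defaultdict(lambda: {s: 1 for s in ALPHABET})

    def _context(self, history):
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c.values())
        return {s: c[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1


def code_length_bits(seq, orders=(0, 2, 4)):
    """Ideal code length of seq, in bits, under a weighted mixture of experts."""
    experts = [ContextExpert(k) for k in orders]
    weights = [1.0 / len(experts)] * len(experts)
    max_order = max(orders)
    bits = 0.0
    for i, sym in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        preds = [e.predict(history) for e in experts]
        # Final distribution: weighted sum of the expert distributions.
        p = sum(w * pr[sym] for w, pr in zip(weights, preds))
        bits += -math.log2(p)
        # Reward experts in proportion to the probability they gave the symbol.
        weights = [w * pr[sym] / p for w, pr in zip(weights, preds)]
        for e in experts:
            e.update(history, sym)
    return bits
```

The per-symbol terms `-math.log2(p)` form an information sequence of the kind the abstract mentions: positions where the value stays high are the ones the experts find hard to predict.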
Sequence Complexity for Biological Sequence Analysis
, 2000
Cited by 8 (0 self)
A new statistical model for DNA considers a sequence to be a mixture of regions with little structure and regions that are approximate repeats of other subsequences, i.e. instances of repeats do not need to match each other exactly. Both forward and reverse-complementary repeats are allowed. The model has a small number of parameters which are fitted to the data. In general there are many explanations for a given sequence and how to compute the total probability of the data given the model is shown. Computer algorithms are described for these tasks. The model can be used to compute the information content of a sequence, either in total or base by base. This amounts to looking at sequences from a data-compression point of view and it is argued that this is a good way to tackle intelligent sequence analysis in general.
DNA sequence compression using the Burrows-Wheeler Transform
 Proc. IEEE Bioinformatics Conference, Stanford University, CA, 2002: 303
Cited by 7 (4 self)
We investigate offline dictionary oriented approaches to DNA sequence compression, based on the Burrows-Wheeler Transform (BWT). The preponderance of short repeating patterns is an important phenomenon in biological sequences. Here, we propose offline methods to compress DNA sequences that exploit the different repetition structures inherent in such sequences. Repetition analysis is performed based on the relationship between the BWT and important pattern matching data structures, such as the suffix tree and suffix array. We discuss how the proposed approach can be incorporated in the BWT compression pipeline. Index terms: DNA sequence compression, repetition structures, Burrows-Wheeler Transform, BWT.
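For readers unfamiliar with the transform, the following Python sketch shows a textbook BWT and its inverse, built from a sorted list of suffixes to make the link to the suffix array visible. It is a minimal illustration and not the paper's method. It assumes a sentinel character `$` that does not occur in the input and sorts before every other symbol.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via a naively sorted suffix array."""
    s = text + sentinel
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol for each suffix is the character that precedes it.
    return "".join(s[i - 1] for i in suffix_array)


def inverse_bwt(last, sentinel="$"):
    """Invert the BWT by walking through the rows of the sorted matrix."""
    n = len(last)
    # A stable sort of positions by character pairs each first-column
    # symbol with the same occurrence in the last column.
    order = sorted(range(n), key=lambda i: last[i])
    row = last.index(sentinel)  # the row holding the original text
    out = []
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return "".join(out)[:-1]  # drop the trailing sentinel
```

For example, `bwt("banana")` gives `annb$aa`: the transform groups equal symbols that precede similar contexts, which is what later stages of a BWT compression pipeline exploit.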
Grammar-based Compression of DNA Sequences
, 2004
Cited by 7 (0 self)
Grammar-based compression algorithms infer context-free grammars to represent the input data. The grammar is then transformed into a symbol stream and finally encoded in binary. We explore the utility of grammar-based compression of DNA sequences. We strive to optimize the three stages of grammar-based compression to work optimally for DNA. DNA is notoriously hard to compress, and ultimately, our algorithm fails to achieve better compression than the best competitor.
Unifying Text Search And Compression - Suffix Sorting, Block Sorting and Suffix Arrays
, 2000
Cited by 6 (0 self)
Today many electronic documents are available such as articles of newspapers, dictionaries, books, DNA sequences, etc. and they are stored in databases. We also have many documents on the Internet and have many email documents. Therefore, fast queries on such huge amount of documents and their compression to reduce costs for storing or transferring them are important. In this thesis, a unified method for improving efficiency of search and compression for huge text data is proposed. All search methods and compression methods used in this thesis are related to a data structure called suffix array. The suffix array is a text search data structure and it is used in a text compression method called block sorting. Both are promising search method and compression method and there are many studies on the methods. Now a data structure called inverted file is used for queries from huge amount of documents. Though it is widely used, query unit is a document in order to reduce disk space to sto...
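Since the abstract names the suffix array as the data structure behind both search and compression, a minimal Python sketch of it may help. The construction below simply sorts suffixes, which is far slower than the suffix-sorting algorithms a thesis on the topic would study. The function names are hypothetical and the code is only meant to show what the structure is and how a binary search over it answers substring queries.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern, found by two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])
```

All suffixes that begin with the pattern sit next to each other in the array, so a query costs a logarithmic number of string comparisons regardless of how many documents the text contains.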