A New Challenge for Compression Algorithms: Genetic Sequences
Information Processing & Management, 1994
Cited by 55 (0 self)
Universal data compression algorithms fail to compress genetic sequences, owing to the specific nature of this kind of "text". We analyze in some detail the properties of the sequences that cause classical algorithms to fail. We then present a lossless algorithm, biocompress-2, that compresses the information contained in DNA and RNA sequences by detecting regularities such as palindromes. The algorithm combines substitutional and statistical methods and, to the best of our knowledge, leads to the highest compression of DNA. The results, although not satisfactory, give insight into the necessary correlation between compression and comprehension of genetic sequences.
1 Introduction
There are plenty of specific types of data which need to be compressed, for ease of storage and communication. Among them are texts (such as natural language and programs), images, sounds, etc. In this paper, we focus on the compression of a specific kin...
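The abstract above names palindromes as one regularity a DNA compressor can exploit. The sketch below is not biocompress-2 itself. It only illustrates the idea, assuming that "palindrome" is meant in the genetic sense (a segment whose reverse complement occurred earlier in the sequence). The function names and the `min_len` threshold are hypothetical.

```python
# Map each DNA base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_palindrome_match(seq, pos, min_len=4):
    """Find the longest segment starting at pos whose reverse complement
    occurs entirely before pos.

    Returns (start, length) of the earlier occurrence, or None when no
    match of at least min_len bases exists. min_len is an illustrative
    threshold, not a value taken from the paper.
    """
    best = None
    length = min_len
    while pos + length <= len(seq):
        target = reverse_complement(seq[pos:pos + length])
        start = seq.find(target, 0, pos)  # search only in seq[0:pos]
        if start == -1:
            break  # a longer segment cannot match if this one does not
        best = (start, length)
        length += 1
    return best


print(longest_palindrome_match("GATTACATGTAATC", 7))  # (0, 7)
```

A substitutional coder could then emit a (start, length) pointer instead of the literal bases whenever such a match is long enough to pay for the pointer.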
A Natural Law of Succession
1995
Cited by 33 (4 self)
We present a new solution to multinomial estimation and demonstrate that our solution outperforms standard solutions both in theory and in practice. The novelty of our approach lies in our use of combinatorial priors on strings.
I. Natural Strings
An alphabet represents the set of logically possible events. In this world, all strings are finite and most are very short. For this basic reason, natural strings do not include all the symbols in the alphabet. This claim is tautological for short strings, but it is also true for long strings. To model this phenomenon, we propose a uniform prior on the cardinalities of all nonempty subsets of the alphabet. Such a prior on an alphabet of size k entails the probability $p_N(x^n \mid n) = \left[ \min(k, n) \binom{k}{q} \binom{n-1}{q-1} \binom{n}{\{n_i\}} \right]^{-1}$ for strings $x^n$ of length n with cardinality q. This probability is not Kolmogorov compatible. To obtain a conditional probability, we must use $p(i \mid x^n, n+1)$ instead of the more o...
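The string probability stated in this abstract can be evaluated directly. The sketch below assumes the formula reads as the reciprocal of min(k, n) times the number of q-subsets of the alphabet, times the number of ways to split n into q positive counts, times the multinomial coefficient of the symbol counts. The function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import comb, factorial


def natural_law_probability(s, k):
    """Probability of string s over an alphabet of size k, assuming
    p = 1 / (min(k, n) * C(k, q) * C(n - 1, q - 1) * multinomial(n; n_i)),
    where n = len(s), q = number of distinct symbols, n_i = symbol counts.
    """
    n = len(s)
    if n == 0:
        raise ValueError("string must be nonempty")
    counts = Counter(s)
    q = len(counts)
    if q > k:
        raise ValueError("string uses more symbols than the alphabet has")
    # Multinomial coefficient n! / (n_1! * ... * n_q!), kept exact.
    multinomial = factorial(n)
    for c in counts.values():
        multinomial //= factorial(c)
    denominator = min(k, n) * comb(k, q) * comb(n - 1, q - 1) * multinomial
    return Fraction(1, denominator)


print(natural_law_probability("aab", 2))  # 1/12
```

Summing this value over all eight binary strings of length 3 gives exactly 1, which is a quick sanity check on the reading of the formula used here.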
Easy Sets and Hard Certificate Schemes
1995
Cited by 15 (4 self)
Can easy sets only have easy certificate schemes? In this paper, we study the class of sets that, for all NP certificate schemes (i.e., NP machines), always have easy acceptance certificates (i.e., accepting paths) that can be computed in polynomial time. We also study the class of sets that, for all NP certificate schemes, infinitely often have easy acceptance certificates. We give structural conditions that control the size of these classes.
1 Introduction
Borodin and Demers [BD76] proved the following result.
Theorem 1.1 [BD76] If $\mathrm{NP} \cap \mathrm{coNP} \neq \mathrm{P}$, then there exists a set L such that 1. $L \in \mathrm{P}$, 2. $L \subseteq \mathrm{SAT}$, and 3. for no polynomial-time computable function f does it hold that: for each $F \in L$, $f(F)$ outputs a satisfying assignment of F.
That is, under a hypothesis most theoreticians would guess to be true, it follows that there is a set of satisfiable formulas for which it is trivial to determine they are satisfiable, yet it is hard to determine why (i.e., via what satisfying assignm...
Language Acquisition in the MDL Framework
In Eric Sven Ristad, Language Computation. American Mathematical Society, Philadelphia, 1994
Cited by 11 (0 self)
The Minimum Description Length (MDL) principle provides guidance to the fundamental question of determining what a given set of observed data tells us about the underlying data generating machinery. Hence, in the broadest sense the MDL principle relates to the central question of all science, although its most useful applications have been to the more practical problem of fitting statistical models to data. In this article, we review the MDL principle and demonstrate how it may be profitably applied to the logical problem of language acquisition.
Measuring Sets in Infinite Groups
2002
Cited by 10 (6 self)
We are now witnessing a rapid growth of a new part of group theory which has become known as "statistical group theory". A typical result in this area would say something like "a random element (or a tuple of elements) of a group G has a property P with probability p". The validity of a statement like that does, of course, heavily depend on how one defines probability on groups, or, equivalently, how one measures sets in a group (in particular, in a free group). We hope that new approaches to defining probabilities on groups as outlined in this paper create, among other things, an appropriate framework for the study of the "average case" complexity of algorithms on groups.
Unsupervised Lexical Learning as Inductive Inference
2000
Cited by 8 (4 self)
To learn a language, learners must first learn its words, the essential building blocks of utterances. The difficulty in learning words lies in the unavailability of explicit word boundaries in speech input. Learners have to infer lexical items with some innately endowed learning mechanism(s) for regularity detection; regularities in the speech normally indicate word patterns. With respect to Zipf's least-effort principle and Chomsky's thoughts on the minimality of grammar for human language, we hypothesise a cognitive mechanism underlying language learning that seeks the least-effort representation for input data. Accordingly, lexical learning is to infer the minimal-cost representation for the input under the constraint of permissible representation for lexical items. The main theme of this thesis is to examine how far this learning mechanism can go in unsupervised lexical learning from real language data without any pre-defined (e.g., prosodic and phonotactic) cues, resting entirely on statistical induction of structural patterns for the most economical representation of the data. We first review
A Goodness Measure for Phrase Learning via Compression with the MDL Principle
In The ESSLLI-98 Student Session, Chapter 13, 1998
Cited by 4 (2 self)
This paper reports our ongoing research on unsupervised language learning via compression within the MDL paradigm. It formulates an empirical information-theoretical measure, description length gain, for evaluating the goodness of guessing a sequence of words (or characters) as a phrase (or a word), which can be calculated easily following classic information theory. The paper also presents a best-first learning algorithm based on this measure. Experiments on phrase and lexical learning from POS tag and character sequences, respectively, show promising results.
1 Introduction
Grammar induction from a naturally-occurring text corpus is in the general domain of inferring a theory (or model) to account for observed data. Practical techniques for grammar induction have many important applications for a wide range of natural language (NL) and speech processing tasks. Researchers have devoted tremendous effort to this research over the past decades. Inferring a probabilistic grammar from NL d...
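The abstract does not spell out how description length gain is computed, so the following is only one plausible reading, not necessarily the paper's definition. It assumes the description length of a token sequence is its empirical Shannon code length, and that the gain of a candidate phrase is the drop in that length after replacing every occurrence with a new symbol and appending the phrase once as a dictionary entry. All function names and the `symbol` and `sep` placeholders are hypothetical.

```python
from collections import Counter
from math import log2


def description_length(tokens):
    """Empirical code length in bits: -sum over types of c * log2(c / n)."""
    n = len(tokens)
    counts = Counter(tokens)
    return -sum(c * log2(c / n) for c in counts.values())


def replace_phrase(tokens, phrase, symbol):
    """Replace non-overlapping occurrences of phrase (left to right)."""
    if not phrase:
        raise ValueError("phrase must be nonempty")
    out, i, m = [], 0, len(phrase)
    while i < len(tokens):
        if tokens[i:i + m] == phrase:
            out.append(symbol)
            i += m
        else:
            out.append(tokens[i])
            i += 1
    return out


def description_length_gain(tokens, phrase, symbol="<r>", sep="<sep>"):
    """Bits saved by coding phrase as one symbol plus a dictionary entry.

    Assumed reading: DL(original) - DL(rewritten + separator + phrase).
    """
    rewritten = replace_phrase(tokens, phrase, symbol) + [sep] + phrase
    return description_length(tokens) - description_length(rewritten)


corpus = list("abcabcabcabc")
print(description_length_gain(corpus, list("abc")) > 0)  # True
```

Under this reading, a best-first learner would repeatedly pick the candidate with the largest positive gain, rewrite the corpus, and stop when no candidate saves any bits.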
The complexity of information extraction
IEEE Transactions on Information Theory, 1986
Cited by 3 (0 self)
How difficult are decision problems based on natural data, such as pattern recognition? To answer this question, decision problems are characterized by introducing four measures defined on a Boolean function f of N variables: the implementation cost C(f), the randomness R(f), the deterministic entropy H(f), and the complexity K(f). The highlights and main results are roughly as follows. 1) C(f) = R(f) = H(f) = K(f), all measured in bits. 2) Decision problems based on natural data are partially random (in the Kolmogorov sense) and have low entropy with respect to their dimensionality, and the relations between the four measures translate to lower and upper bounds on the cost of solving these problems. 3) Allowing small errors in the implementation of f saves a lot in the low-entropy case but saves nothing in the high-entropy case. If f is partially structured, the implementation cost is reduced substantially.
Symmetry of Information and One-Way Functions
Inform. Proc. Letters, 1993
Cited by 2 (1 self)
Symmetry of information (in Kolmogorov complexity) is a concept that comes from formalizing the idea of how much information about a string y is contained in a string x. The situation is symmetric because it can be shown that the amount of information contained in the string y about the string x is almost exactly the same as that contained in x about y. In this paper we address symmetry of information in resource bounded environments. While we show that symmetry still holds in space bounded environments, it probably doesn't hold in time bounded environments. We show that if it holds for polynomial time bounds, then one-way functions cannot exist.
Keywords: computational complexity, Kolmogorov complexity, one-way functions.
1 Introduction
In probability theory, the phenomenon of dependence between random variables is well known. Cast in terms of classical Shannon entropy [Sha48, Sha49], the quantity of information in a random variable Y about another random variable X is I(X; Y) =...

