Results 1-10 of 14
A New Challenge for Compression Algorithms: Genetic Sequences
Information Processing & Management, 1994
Cited by 69 (0 self)
Universal data compression algorithms fail to compress genetic sequences, owing to the specificity of this particular kind of "text". We analyze in some detail the properties of the sequences that cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities such as the presence of palindromes. The algorithm combines substitutional and statistical methods and, to the best of our knowledge, leads to the highest compression of DNA. The results, although not satisfactory, give insight into the necessary correlation between compression and comprehension of genetic sequences.

1 Introduction There are plenty of specific types of data which need to be compressed, for ease of storage and communication. Among them are texts (such as natural language and programs), images, sounds, etc. In this paper, we focus on the compression of a specific kin...
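As an illustration of the kind of regularity the abstract mentions, the sketch below looks for a "palindrome" in the biological sense, that is, a later segment equal to the reverse complement of an earlier one. It is a minimal sketch and not the paper's algorithm: the function names, the fixed word length, and the restriction to the DNA alphabet ACGT are all assumptions made for this example.

```python
# Hypothetical helpers for illustration only; not taken from the paper.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base (A<->T, C<->G), then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_reverse_complement_repeat(seq: str, min_len: int):
    """Return (i, j, min_len) for the first word at position j whose
    reverse complement already occurred, without overlap, at position i.
    Returns None if no such pair exists."""
    first_seen = {}  # word -> earliest start position
    for j in range(len(seq) - min_len + 1):
        word = seq[j:j + min_len]
        i = first_seen.get(reverse_complement(word))
        if i is not None and i + min_len <= j:
            return (i, j, min_len)
        first_seen.setdefault(word, j)
    return None


if __name__ == "__main__":
    print(reverse_complement("GATTACA"))
    print(find_reverse_complement_repeat("GATTACACCTGTAATC", 7))
```

A substitutional compressor could, in principle, encode the second segment as a pointer to the first plus a "reversed and complemented" flag instead of spelling out its bases; how the paper actually encodes such matches is not stated in the abstract.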
A Natural Law of Succession
1995
Cited by 35 (3 self)
Consider the following problem. You are given an alphabet of k distinct symbols and are told that the i-th symbol occurred exactly n_i times in the past. On the basis of this information alone, you must now estimate the conditional probability that the next symbol will be i. In this report, we present a new solution to this fundamental problem in statistics and demonstrate that our solution outperforms standard approaches, both in theory and in practice.
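For context, the problem setup can be made concrete with one of the standard approaches the abstract compares against. The sketch below implements additive (add-lambda) smoothing, which gives Laplace's law of succession for lambda = 1. It does not implement the paper's new solution, which the abstract does not spell out; the function name and parameters are assumptions for this example.

```python
from fractions import Fraction


def additive_estimate(counts, i, lam=Fraction(1)):
    """Estimate P(next symbol = i) from past counts n_1..n_k using
    additive smoothing: (n_i + lam) / (n + lam * k).
    lam = 1 gives Laplace's law of succession."""
    n = sum(counts)   # total number of past observations
    k = len(counts)   # alphabet size
    return (counts[i] + lam) / (n + lam * k)


if __name__ == "__main__":
    counts = [3, 1, 0]  # k = 3 symbols, n = 4 observations
    for i in range(len(counts)):
        print(i, additive_estimate(counts, i))
```

Exact fractions are used so the estimates can be checked by hand: with counts (3, 1, 0) and lambda = 1, the three estimates are 4/7, 2/7 and 1/7, and they sum to 1.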
Easy sets and hard certificate schemes
Acta Informatica, 1997
Cited by 16 (4 self)
Can easy sets only have easy certificate schemes? In this paper, we study the class of sets that, for all NP certificate schemes (i.e., NP machines), always have easy acceptance certificates (i.e., accepting paths) that can be computed in polynomial time. We also study the class of sets that, for all NP certificate schemes, infinitely often have easy acceptance certificates. In particular, we provide equivalent characterizations of these classes in terms of relative generalized Kolmogorov complexity, showing that they are robust. We also provide structural conditions, regarding immunity and class collapses, that put upper and lower bounds on the sizes of these two classes. Finally, we provide negative results showing that some of our positive claims are optimal with regard to being relativizable. Our negative results are proven using a novel observation: we show that the classical “wide spacing” oracle construction technique yields instant non-bi-immunity results. Furthermore, we establish a result that improves upon Baker, Gill, and Solovay’s classical result that NP ≠ P = NP ∩ coNP holds in some relativized world.
Measuring Sets in Infinite Groups
2002
Cited by 13 (6 self)
We are now witnessing a rapid growth of a new part of group theory which has become known as "statistical group theory". A typical result in this area would say something like "a random element (or a tuple of elements) of a group G has a property P with probability p". The validity of a statement like that does, of course, heavily depend on how one defines probability on groups, or, equivalently, how one measures sets in a group (in particular, in a free group). We hope that new approaches to defining probabilities on groups as outlined in this paper create, among other things, an appropriate framework for the study of the "average case" complexity of algorithms on groups.
Language Acquisition in the MDL Framework
In Eric Sven Ristad, Language Computation. American Mathematical Society, Philadelphia, 1994
Cited by 12 (0 self)
The Minimum Description Length (MDL) principle provides guidance to the fundamental question of determining what a given set of observed data tells us about the underlying data generating machinery. Hence, in the broadest sense the MDL principle relates to the central question of all science, although its most useful applications have been to the more practical problem of fitting statistical models to data. In this article, we review the MDL principle and demonstrate how it may be profitably applied to the logical problem of language acquisition.
Unsupervised Lexical Learning as Inductive Inference
2000
Cited by 10 (5 self)
To learn a language, learners must first learn its words, the essential building blocks for utterances. The difficulty in learning words lies in the unavailability of explicit word boundaries in speech input. The learners have to infer lexical items with some innately endowed learning mechanism(s) for regularity detection: regularities in the speech normally indicate word patterns. With respect to Zipf's least-effort principle and Chomsky's thoughts on the minimality of grammar for human language, we hypothesise a cognitive mechanism underlying language learning that seeks the least-effort representation for input data. Accordingly, lexical learning is to infer the minimal-cost representation for the input under the constraint of permissible representation for lexical items. The main theme of this thesis is to examine how far this learning mechanism can go in unsupervised lexical learning from real language data without any predefined (e.g., prosodic and phonotactic) cues, resting entirely on statistical induction of structural patterns for the most economic representation for the data. We first review
The complexity of information extraction
IEEE Transactions on Information Theory, 1986
Cited by 5 (0 self)
How difficult are decision problems based on natural data, such as pattern recognition? To answer this question, decision problems are characterized by introducing four measures defined on a Boolean function f of N variables: the implementation cost C(f), the randomness R(f), the deterministic entropy H(f), and the complexity K(f). The highlights and main results are roughly as follows. 1) C(f) = R(f) = H(f) = K(f), all measured in bits. 2) Decision problems based on natural data are partially random (in the Kolmogorov sense) and have low entropy with respect to their dimensionality, and the relations between the four measures translate to lower and upper bounds on the cost of solving these problems. 3) Allowing small errors in the implementation of f saves a lot in the low-entropy case but saves nothing in the high-entropy case. If f is partially structured, the implementation cost is reduced substantially.
A Goodness Measure for Phrase Learning via Compression with the MDL Principle
In The ESSLLI98 Student Session, Chapter 13, 1998
Cited by 4 (2 self)
This paper reports our ongoing research on unsupervised language learning via compression within the MDL paradigm. It formulates an empirical information-theoretical measure, description length gain, for evaluating the goodness of guessing a sequence of words (or characters) as a phrase (or a word), which can be calculated easily following classic information theory. The paper also presents a best-first learning algorithm based on this measure. Experiments on phrase and lexical learning from POS tag and character sequence, respectively, show promising results. 1 Introduction Grammar induction from a naturally occurring text corpus is in the general domain of inferring a theory (or model) to account for observed data. Practical techniques for grammar induction have many important applications for a wide range of natural language (NL) and speech processing tasks. Researchers have devoted tremendous effort to the research in the past decades. Inferring a probabilistic grammar from NL d...
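The abstract does not give the formula for description length gain, so the following is only a sketch under stated assumptions. It takes the description length of a token sequence to be its empirical Shannon code length, and the gain of a candidate phrase to be the reduction obtained by replacing every occurrence with one new symbol and appending a single copy of the phrase after a delimiter. The function names, the `<r>` and `<#>` symbols, and this exact definition are assumptions for illustration, not the authors' stated method.

```python
import math
from collections import Counter


def description_length(tokens):
    """Empirical code length in bits: -sum over types of c * log2(c / n)."""
    n = len(tokens)
    return -sum(c * math.log2(c / n) for c in Counter(tokens).values())


def replace_all(tokens, phrase, new_symbol):
    """Replace non-overlapping occurrences of phrase, left to right."""
    if not phrase:
        raise ValueError("phrase must be non-empty")
    out, i, m = [], 0, len(phrase)
    while i < len(tokens):
        if tokens[i:i + m] == phrase:
            out.append(new_symbol)
            i += m
        else:
            out.append(tokens[i])
            i += 1
    return out


def description_length_gain(tokens, phrase, new_symbol="<r>", delimiter="<#>"):
    """Assumed definition: DL(corpus) minus DL(rewritten corpus followed by
    a delimiter and one copy of the phrase). Positive means the phrase pays
    for itself as a unit."""
    rewritten = replace_all(tokens, phrase, new_symbol) + [delimiter] + phrase
    return description_length(tokens) - description_length(rewritten)


if __name__ == "__main__":
    corpus = list("thecat" * 20)
    print(description_length_gain(corpus, list("thecat")))
    print(description_length_gain(corpus, list("ec")))
```

Under this assumed definition, a best-first learner of the kind the abstract mentions could repeatedly pick the candidate with the largest positive gain; the paper's actual search procedure is not described in the excerpt.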