A New Challenge for Compression Algorithms: Genetic Sequences
 Information Processing & Management
, 1994
Cited by 73 (0 self)
Universal data compression algorithms fail to compress genetic sequences. This is due to the specificity of this particular kind of "text". We analyze in some detail the properties of the sequences that cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods and, to the best of our knowledge, leads to the highest compression of DNA. The results, although not satisfactory, give insight into the necessary correlation between compression and comprehension of genetic sequences. 1 Introduction. There are plenty of specific types of data which need to be compressed, for ease of storage and communication. Among them are texts (such as natural language and programs), images, sounds, etc. In this paper, we focus on the compression of a specific kin...
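To make the kind of regularity mentioned in this abstract concrete, here is a minimal Python sketch. It is not the biocompress-2 algorithm. It assumes that "palindrome" is meant in the molecular-biology sense (a segment equal to its own reverse complement) and that sequences use only the letters A, C, G, T. The function names and the 2-bit packing baseline are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only; this is NOT the biocompress-2 algorithm.
# Assumption: sequences contain only the four letters A, C, G, T.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_complemented_palindrome(seq: str) -> bool:
    """True if the segment equals its own reverse complement."""
    return seq == reverse_complement(seq)


def pack_2bit(seq: str) -> bytes:
    """Pack bases at 2 bits each (4 bases per byte).

    This is the trivial fixed-length baseline for a four-letter alphabet;
    a short final chunk is left-aligned and padded with zero bits.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)
```

A compressor is only useful on DNA if it beats the 2 bits per base of `pack_2bit`, which is one way to read the abstract's claim that general-purpose algorithms "fail" on this kind of text.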
A Generalized Suffix Tree and Its (Un)Expected Asymptotic Behaviors
 SIAM J. Computing
, 1996
Cited by 53 (29 self)
Suffix trees find several applications in computer science and telecommunications, most notably in algorithms on strings, data compression and codes. Despite this, very little is known about their typical behaviors. In a probabilistic framework, we consider a family of suffix trees, further called b-suffix trees, built from the first n suffixes of a random word. In this family a noncompact suffix tree (i.e., such that every edge is labeled by a single symbol) is represented by b = 1, and a compact suffix tree (i.e., without unary nodes) is asymptotically equivalent to $b \to \infty$ as $n \to \infty$. We study several parameters of b-suffix trees, namely: the depth of a given suffix, the depth of insertion, the height and the shortest feasible path. Some new results concerning typical (i.e., almost sure) behaviors of these parameters are established. These findings are used to obtain several insights into certain algorithms on words, molecular biology and universal data compression schemes. Key Wo...
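As a rough illustration of the noncompact case (b = 1), the following Python sketch inserts the first n suffixes of a word into a dictionary-based trie. For each suffix it records how many symbols were matched against earlier suffixes before a new node had to be created. The function name and this particular notion of depth are assumptions made for illustration; the paper's formal definitions of depth and depth of insertion may differ.

```python
from typing import List


def shared_prefix_depths(word: str, n: int) -> List[int]:
    """Insert the first n suffixes of `word` into a noncompact trie.

    Returns, for each suffix, the length of the longest prefix that was
    already present in the trie when the suffix was inserted.
    """
    root: dict = {}
    depths = []
    for i in range(n):
        node = root
        depth = 0
        diverged = False
        for ch in word[i:]:
            if not diverged and ch in node:
                depth += 1  # still walking along an existing path
            else:
                diverged = True  # from here on, new nodes are created
            node = node.setdefault(ch, {})
        depths.append(depth)
    return depths
```

For example, `shared_prefix_depths("banana", 6)` gives `[0, 0, 0, 3, 2, 1]`: the suffix "ana" shares three symbols with the earlier suffix "anana".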
Dynamic Dictionary Matching
, 1993
Cited by 47 (8 self)
We consider the dynamic dictionary matching problem. We are given a set of pattern strings (the dictionary) that can change over time; that is, we can insert a new pattern into the dictionary or delete a pattern from it. Moreover, given a text string, we must be able to find all occurrences of any pattern of the dictionary in the text. Let $D_0$ be the empty dictionary. We present an algorithm that performs any sequence of the following operations in the given time bounds: (1) $\mathrm{insert}(p, D_{i-1})$: Insert pattern $p[1, m]$ into the dictionary $D_{i-1}$. $D_i$ is the dictionary after the operation. The time complexity is $O(m \log |D_i|)$. (2) $\mathrm{delete}(p, D_{i-1})$: Delete pattern $p[1, m]$ from the dictionary $D_{i-1}$. $D_i$ is the dictionary after the operation. The time complexity is $O(m \log |D_{i-1}|)$. (3) $\mathrm{search}(t, D_i)$: Search text $t[1, n]$ for all occurrences of the patterns of dictionary $D_i$. The time complexity is $O((n + tocc) \log |D_i|)$, where $tocc$ is the total number of occurrences of patterns i...
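The three operations listed in this abstract can be pictured with a deliberately naive Python sketch. It shows only the interface. It does not use the paper's data structure and does not attain the stated time bounds. The class name is hypothetical, and positions are reported 0-indexed, whereas the abstract indexes strings from 1.

```python
class NaiveDynamicDictionary:
    """Interface sketch for insert / delete / search.

    Assumption: a plain set plus repeated str.find scans, which is far
    slower than the bounds quoted in the abstract.
    """

    def __init__(self):
        self.patterns = set()

    def insert(self, pattern: str) -> None:
        if not pattern:
            raise ValueError("empty patterns are not supported")
        self.patterns.add(pattern)

    def delete(self, pattern: str) -> None:
        self.patterns.discard(pattern)

    def search(self, text: str):
        """Return sorted (start, pattern) pairs, with 0-indexed starts."""
        hits = []
        for pattern in self.patterns:
            start = text.find(pattern)
            while start != -1:
                hits.append((start, pattern))
                start = text.find(pattern, start + 1)  # allow overlaps
        return sorted(hits)
```

A short usage example: after inserting "ab" and "b", searching "abab" reports four occurrences; after deleting "b", only the two occurrences of "ab" remain.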
Genomics via Optical Mapping II: Ordered Restriction Maps
 Journal of Computational
, 1997
The mixture transition distribution model for high-order Markov chains and non-Gaussian time series
 Statistical Science
, 2002
Cited by 24 (2 self)
Abstract. The mixture transition distribution model (MTD) was introduced in 1985 by Raftery for the modeling of high-order Markov chains with a finite state space. Since then it has been generalized and successfully applied to a range of situations, including the analysis of wind directions, DNA sequences and social behavior. Here we review the MTD model and the developments since 1985. We first introduce the basic principle and then we present several extensions, including general state spaces and spatial statistics. Following that, we review methods for estimating the model parameters. Finally, a review of different types of applications shows the practical interest of the MTD model. Key words and phrases: Mixture transition distribution (MTD) model, Markov chains, high-order dependences, time series, GMTD model, EM algorithm,
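The basic principle referred to in this abstract can be sketched in a few lines of Python. In the original MTD formulation, the probability of the next state given the last l states is a weighted mixture of one-step transition probabilities: P(X_t = i_0 | X_{t-1} = i_1, ..., X_{t-l} = i_l) = sum over g of lambda_g * q[i_g][i_0], with a single transition matrix Q and lag weights lambda_g that sum to 1. The function name, argument layout and numbers below are illustrative assumptions.

```python
def mtd_probability(history, next_state, lambdas, Q):
    """Conditional probability of `next_state` under a basic MTD model.

    history[0] is the most recent state X_{t-1}, history[1] is X_{t-2},
    and so on; lambdas[g] is the weight of lag g + 1 (weights sum to 1);
    Q is a row-stochastic matrix given as a list of rows.
    """
    if len(history) != len(lambdas):
        raise ValueError("need exactly one weight per lag")
    return sum(lam * Q[past][next_state]
               for lam, past in zip(lambdas, history))


# Hypothetical two-state example with two lags.
Q = [[0.9, 0.1],
     [0.2, 0.8]]
lambdas = [0.7, 0.3]
p_next_is_0 = mtd_probability([0, 1], 0, lambdas, Q)  # 0.7*0.9 + 0.3*0.2
```

Because every lag reuses the same matrix Q, the result is a proper distribution over the next state whenever the rows of Q and the weights each sum to 1.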
Suffix Trees and their Applications in String Algorithms
, 1993
Cited by 17 (0 self)
The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others. In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching. Work partially supported by the ESPRIT BRA ALCOM II under contract no. 7141 and by the Italian MURST Project "Algoritmi, Modelli di Calcolo e Strutture Informative". Part of this work was done while the author was visiting AT&T Bell Laboratories. Email: grossi@di.uni...
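The pattern-matching use of this structure can be illustrated with a small Python sketch. It builds an uncompacted suffix trie, which takes quadratic space, rather than a true suffix tree, which compacts unary paths and can be built in linear time. The function names are assumptions for illustration only.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of `text` into a nested-dict trie.

    Simplification: no path compaction, so this is a suffix trie, not
    the compacted suffix tree surveyed in the paper.
    """
    root: dict = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def occurs(trie: dict, pattern: str) -> bool:
    """True if `pattern` is a substring of the indexed text.

    Every substring is a prefix of some suffix, so it suffices to walk
    down from the root along the pattern's symbols.
    """
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Once the index is built, a query touches at most one node per pattern symbol, independent of the text length, which is the property that makes suffix structures attractive for repeated searches.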
String Editing and Longest Common Subsequences
 In Handbook of Formal Languages
, 1996
Cited by 16 (2 self)
this paper, in view of the particularly rich variety of algorithmic solutions that have been devised for this problem over the past two decades or so, which made it susceptible to some degrees of unification and systematization of independent and general interest. Our discussion starts with the exposition of two basic approaches to LCS computation, due respectively to Hirschberg [1978] and Hunt and Szymanski [1977]. We then discuss faster implementations of this second paradigm, and the data structures that support them. In Section 5 we discuss algorithms that use only linear space to compute an LCS and yet do not necessarily take $\Theta(nm)$ time. One final such algorithm is presented in Section 6, where many of the ideas and tools accumulated in the course of our discussion find employment together. In Section 7 we return to string editing in its general formulation and discuss some of its efficient solutions within a parallel model of computation.
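For orientation, the baseline that these approaches improve upon is the textbook dynamic program, sketched below in Python. It computes only the length of an LCS in $\Theta(nm)$ time while keeping two rows of the table, so it uses linear space. It is neither Hirschberg's nor Hunt and Szymanski's algorithm, and the function name is an assumption.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b.

    Textbook row-by-row dynamic program: Theta(len(a) * len(b)) time,
    O(min(len(a), len(b))) extra space. Recovering the subsequence
    itself in linear space needs more machinery than shown here.
    """
    if len(b) > len(a):
        a, b = b, a  # keep the shorter string along the row
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For instance, `lcs_length("ABCBDAB", "BDCABA")` is 4, with "BCBA" as one longest common subsequence.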
In Search of the Lost Schema
 Proc. of the 7th International Conference on Database Theory
, 1999
Cited by 15 (1 self)
We study the problem of rediscovering the schema of nested relations that have been encoded as strings for storage purposes. We consider various classes of encoding functions, and in particular the markup encodings, which allow the schema to be found without knowledge of the encoding function, under reasonable assumptions on the input data. Depending upon the encoding of empty sets, we propose two polynomial online algorithms (with different buffer sizes) solving the schema finding problem. We also prove that, with high probability, both algorithms find the schema after examining a fixed number of tuples, thus leading in practice to linear-time behavior with respect to the database size for wrapping the data. Finally, we show that the proposed techniques are well-suited for practical applications, such as structuring and wrapping HTML pages and Web sites.
RNA Movies - Visualizing RNA Secondary Structure Spaces
, 1997
Cited by 13 (7 self)
RNA Movies is a system for the visualization of RNA secondary structure spaces. Its input is a script consisting of primary and secondary structure information. From this script, the system generates animated graphical structure representations. In this way, it creates the impression of an RNA molecule exploring its own 2D structure space. RNA Movies has been used to generate animations of a switching structure in the spliced leader RNA of L. collosoma and sequential foldings of PSTV transcripts. Purpose: Finding the secondary structure of an RNA molecule is an important step towards understanding its function in many cases, as the 2D folding constrains the set of possible tertiary structures [1]. While the computed minimum free energy structure (mfe) can certainly often provide clues as to what the correct structure might be, it is not the answer to all questions and must generally be used with caution. Over the last years the RNA folding programs have been modified (Mfold a...
SAMBA - Systolic Accelerators For Molecular Biological Applications
 IRISA, 35042 Rennes Cedex
, 1996
Cited by 12 (1 self)
Samba is a full-custom systolic array dedicated to the comparison of biological sequences. This hardware accelerator implements a parameterized version of the Smith and Waterman algorithm allowing the computation of local or global alignments with or without gap penalty. The speedup provided by Samba over standard workstations ranges from 50 to 500, depending on the application. Keywords: biological sequence comparison, Smith and Waterman algorithm, dedicated hardware, linear systolic array. This work was partially funded by the French Research Group GREG (Groupement de Recherches et d'Etudes sur les Génomes) and the French Coordinated Research Program ANM (Architectures Nouvelles de Machines).
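The recurrence that such an accelerator evaluates can be sketched in software as follows. This is a plain Python version of the Smith and Waterman local-alignment score with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -1) and the function name are assumptions chosen for illustration and are not Samba's parameters.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -1) -> int:
    """Best local alignment score of a and b (linear gap penalty).

    Each cell depends only on its left, upper and upper-left neighbours,
    so cells on the same anti-diagonal are independent of one another;
    this software version simply fills the table row by row.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Clamping each cell at 0 is what makes the alignment local; dropping that clamp and initialising the borders with gap penalties would give the global variant mentioned in the abstract.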