Suffix arrays: A new method for online string searches
, 1991
A new and conceptually simple data structure, called a suffix array, for online string searches is introduced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they use three to five times less space. From a complexity standpoint, suffix arrays permit online string searches of the type, "Is W a substring of A?" to be answered in time O(P + log N), where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, suffix trees can be constructed in O(N) time in the worst case, versus O(N log N) time for suffix arrays. However, we give an augmented algorithm that, regardless of the alphabet size, constructs suffix arrays in O(N) expected time, albeit with lesser space efficiency. We believe that suffix arrays will prove to be better in practice than suffix trees for many applications.
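As a rough illustration of the "sort and search" idea in the abstract above, the following Python sketch builds a suffix array by plain sorting and answers "Is W a substring of A?" with a binary search. It is an assumption-laden simplification, not the paper's algorithm: the construction is naive (not the O(N log N) method described), and the query costs O(P log N) because it omits the auxiliary longest-common-prefix information needed for the O(P + log N) bound. The function names `build_suffix_array` and `is_substring` are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in lexicographic order.

    Naive construction by direct sorting; an assumption for illustration,
    not the O(N log N) construction from the paper.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def is_substring(text, suffix_array, pattern):
    """Answer "is pattern a substring of text?" by binary search.

    Each comparison looks at up to P = len(pattern) characters, so this
    plain version runs in O(P log N) time.
    """
    if not pattern:
        return True  # the empty word occurs in every text
    p = len(pattern)
    lo, hi = 0, len(suffix_array)
    # Find the first suffix whose length-p prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(suffix_array):
        return False
    start = suffix_array[lo]
    return text[start:start + p] == pattern


text = "banana"
sa = build_suffix_array(text)
print(sa)
print(is_substring(text, sa, "nan"), is_substring(text, sa, "nab"))
```

The array stores only one integer per text position, which is the source of the space advantage over suffix trees that the abstract mentions.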
Approaches to the Automatic Discovery of Patterns in Biosequences
, 1995
This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis presented of the ways in which an assessment can be made of the significance and usefulness of the discovered patterns. It is shown that this problem is related to problems studied in the field of machine learning. The largest part of this paper comprises a review of a number of existing methods developed to solve this problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered...
Galaxy of News: An Approach to Visualizing and Understanding Expansive News Landscapes
, 1994
The Galaxy of News system embodies an approach to visualizing large quantities of independently authored pieces of information, in this case news stories. At the heart of this system is a powerful relationship construction engine that constructs an associative relation network to automatically build implicit links between related articles. To visualize these relationships, and hence the news information space, the Galaxy of News uses pyramidal structuring and visual presentation, semantic zooming and panning, animated visual cues that are dynamically constructed to illustrate relationships between articles, and fluid interaction in a three dimensional information space to browse and search through large databases of news articles. The result is a tool that allows people to quickly gain a broad understanding of a news base by providing an abstracted presentation that covers the entire information base, and through interaction, progressively refines the details of the information space. ...
Spelling Approximate Repeated Or Common Motifs Using a Suffix Tree
, 1998
We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet Σ. For instance, Σ may be equal to {A, C, G, T} and the sequence represents an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N ≥ 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in 1 ≤ q ≤ N distinct sequences of the set. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, repeated in a sequence or common to a set of them, as being "external" objects and denote them by the expression "valid models" if they verify the quorum constraint q. The approach we introduce here for finding all valid models corr...
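To make the notion of a "valid model" concrete, here is a minimal brute-force Python sketch for the single-sequence case. It is not the suffix-tree algorithm the paper presents: it simply enumerates every word of a fixed length k over the alphabet and counts the positions where it matches with at most e mismatches. The function name `valid_models`, the fixed length k, and the default alphabet are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def valid_models(sequence, k, e, q, alphabet="ACGT"):
    """Return all words of length k that occur at least q times in
    sequence with at most e mismatches per occurrence.

    Brute force over all |alphabet|**k candidate models; only practical
    for small k. The models need not appear exactly in the sequence.
    """
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    models = []
    for letters in product(alphabet, repeat=k):
        model = "".join(letters)
        occurrences = sum(1 for w in windows if hamming(model, w) <= e)
        if occurrences >= q:
            models.append(model)
    return models


seq = "ACGTACGA"
models = valid_models(seq, k=4, e=1, q=2)
# "ACGC" is within one mismatch of both ACGT and ACGA,
# yet it never occurs exactly in the sequence.
print("ACGC" in models, "ACGC" in seq)
```

In this example the model ACGC satisfies the quorum although it is absent from the sequence, which is the sense in which the abstract calls models "external" objects.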
Monotony of Surprise and Large-Scale Quest for Unusual Words
 In Proceedings of the 6th Int’l Conference on Research in Computational Molecular Biology
, 2002
The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In Molecular Biology, exceptionally frequent or rare words in biosequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous studies, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation.
A Double Combinatorial Approach to Discovering Patterns in Biological Sequences
 Combinatorial Pattern Matching, volume 1075 of Lecture Notes in Computer Science
We present in this paper an algorithm for finding degenerated common features by multiple comparison of a set of biological sequences (nucleic acids or proteins). The features that are of interest to us are words in the sequences. The algorithm uses the concept of a model we introduced earlier for locating these features. A model can be seen as a generalization of a consensus pattern as defined by Waterman [42]. It is an object against which the words in the sequences are compared and which serves as an identifier for the groups of similar ones. The algorithm given here innovates in relation to our previous work in that the models are defined over what we call a weighted combinatorial cover. This is a collection of sets among all possible subsets of the alphabet Σ of nucleotides or amino acids, including the wild card {Σ}, with a weight attached to each of these sets indicating the number of times it may appear in a model. In this way, we explore both the space of models and ...
Direct construction of Compact Directed Acyclic Word Graphs
 COMBINATORIAL PATTERN MATCHING (AARHUS, 1997), FRANCE
, 1997
The Directed Acyclic Word Graph (DAWG) is an efficient data structure to treat and analyze repetitions in a text, especially in DNA genomic sequences. Here, we consider the Compact Directed Acyclic Word Graph of a word. We give the first direct algorithm to construct it. It runs in time linear in the length of the string on a fixed alphabet. Our implementation requires half the memory space used by DAWGs.
Identifying Satellites and Periodic Repetitions in Biological Sequences
, 1998
We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ up to ε = 15-20% from a consensus model of the repeating unit (implying individual units may vary by 2ε from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 10^4 when ε = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repe...
On Compact Directed Acyclic Word Graphs
 Structures in Logic and Computer Science
, 1997
The Directed Acyclic Word Graph (DAWG) is a space-efficient data structure to treat and analyze repetitions in a text, especially in DNA genomic sequences. Here, we consider the Compact Directed Acyclic Word Graph of a word. We give the first direct algorithm to construct it. It runs in time linear in the length of the string on a fixed alphabet. Our implementation requires half the memory space used by DAWGs.