Incremental Paradigms of Motif Discovery
 Journal of Computational Biology
, 2004
Cited by 19 (4 self)
We examine the problem of extracting maximal irredundant motifs from a string. A combinatorial argument poses a linear bound on the total number of such motifs, thereby opening the way to the quest for the fastest and most efficient methods of extraction. The basic paradigm explored here is that of iterated updates of the set of irredundant motifs in a string under consecutive unit symbol extensions of the string itself. This approach exposes novel characterizations for the base set of motifs in a string, hinged on notions of partial order. Such properties support the design of ad hoc data structures and constructs, and lead to an O(n^3)-time incremental discovery algorithm.
Greedy Mixture Learning for Multiple Motif Discovery in Biological Sequences
, 2003
Cited by 16 (2 self)
Motivation: This paper studies the problem of discovering subsequences, known as motifs, that are common to a given collection of related biosequences, by proposing a greedy algorithm for learning a mixture of motifs model through likelihood maximization. The approach sequentially adds a new motif to a mixture model by performing a combined scheme of global and local search for appropriately initializing its parameters. In addition, a hierarchical partitioning scheme based on kd-trees is presented for partitioning the input dataset in order to speed up the global searching procedure. The proposed method compares favorably with the well-known MEME approach and successfully addresses several drawbacks of MEME.
Gene Expression Profiling of DNA Microarray Data using Peano Count Trees (PTrees)
 In Online Proceedings of the First Virtual Conference on Genomics and Bioinformatics, October 2001. URL: http://midas10.cs.ndsu.nodak.edu/bio
, 2001
Cited by 9 (1 self)
The explosion of genomic data made possible by advances in parallel, high-throughput technologies in the area of molecular biology has ushered in a new era in the area of Bioinformatics. Over the last several years, efforts concentrated on sequencing the genomes of organisms. Current emphasis lies in extracting meaningful information from this huge body of DNA sequence and expression data. The techniques currently employed to analyze microarray expression data are clustering and classification. These techniques present their own limitations as to the amount of useful information that can be derived. In this paper, we propose a new approach to mining microarray data using a new data mining technology called Peano Count Tree (Ptree). This technology employs Association Rule Mining as a means of mining the microarray data.
A similar fragments merging approach to learn automata on proteins
 In: Machine Learning: ECML
, 2005
Cited by 9 (3 self)
Internal publication no. 1735, July 2005, 18 pages. We propose here to learn automata for the characterization of protein families to overcome the limitations of the position-specific characterizations classically used in Pattern Discovery. We introduce a new heuristic approach learning nondeterministic automata based on selection and ordering of significantly similar fragments to be merged and on identification of physicochemical properties. Quality of the characterization of the major intrinsic protein (MIP) family is assessed by leave-one-out cross-validation for a large range of model specificity. Keywords: grammatical inference, automata, proteins. Goulven Kerbellec is supported by a PhD research grant from Région Bretagne.
Hidden Word Statistics
Cited by 9 (3 self)
We consider the sequence comparison problem, also known as "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is...
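The distinction drawn in this abstract, between a subsequence and a string of consecutive symbols, can be illustrated with a small sketch. The function name `count_hidden_occurrences` and the choice to count occurrences are assumptions made for illustration; the truncated abstract does not say which parameter the paper studies.

```python
def count_hidden_occurrences(pattern: str, text: str) -> int:
    """Count the ways `pattern` occurs in `text` as a subsequence.

    Illustrative helper (hypothetical name, not taken from the paper).
    """
    # ways[j] = number of ways to match the first j pattern symbols
    # using the text symbols read so far.
    ways = [1] + [0] * len(pattern)
    for symbol in text:
        # Traverse right to left so one text symbol is never used
        # twice within the same occurrence.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


if __name__ == "__main__":
    # "ab" is hidden in "aabb" in 4 ways, but appears only once
    # as a block of consecutive symbols.
    print(count_hidden_occurrences("ab", "aabb"), "aabb".count("ab"))
```

The dynamic program runs in time proportional to the product of the two lengths, which is enough to show how quickly hidden occurrences outnumber contiguous ones.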
Pattern discovery and the algorithmics of surprise
 Proceedings of the NATO ASI on Artificial Intelligence and Heuristic Methods for Bioinformatics
, 2003
A Bayesian network classification methodology for gene expression data
 Journal of Computational Biology
, 2004
Cited by 7 (1 self)
We present new techniques for the application of a Bayesian network learning framework to the problem of classifying gene expression data. The focus on classification permits us to develop techniques that address in several ways the complexities of learning Bayesian nets. Our classification model reduces the Bayesian network learning problem to the problem of learning multiple subnetworks, each consisting of a class label node and its set of parent genes. We argue that this classification model is more appropriate for the gene expression domain than are other structurally similar Bayesian network classification models, such as Naive Bayes and Tree Augmented Naive Bayes (TAN), because our model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest. Within this framework, we consider two different approaches to identifying parent sets which are supported by the gene expression observations and any other currently available evidence. One approach employs a simple greedy algorithm to search the universe of all genes; the second approach develops and applies a gene selection algorithm whose results are incorporated as a prior to enable an exhaustive search for parent sets over a restricted universe of genes. Two other significant contributions are the construction of classifiers from multiple, competing Bayesian network hypotheses and algorithmic methods for normalizing and binning gene expression data in the
An Extension and Novel Solution to the (l,d)-Motif Challenge Problem
, 2004
Cited by 7 (0 self)
The (l,d)-motif challenge problem, as introduced by Pevzner and Sze [12], is a mathematical abstraction of the DNA functional site discovery task. Here we expand the (l,d)-motif problem to more accurately model this task and present a novel algorithm to solve this extended problem. This algorithm is guaranteed to find all (l,d)-motifs in a set of input sequences with unbounded support and length. We demonstrate the performance of the algorithm on publicly available datasets and show that the algorithm deterministically enumerates the optimal motifs.
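For readers unfamiliar with the underlying problem, the following is a minimal brute-force sketch of the original (l,d)-motif formulation: find every string of length l that occurs in each input sequence with at most d mismatches. It is not the algorithm of this paper, and it does not cover the extended problem the abstract mentions; the names `hamming`, `occurs_within`, and `ld_motifs` are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate: str, sequence: str, d: int) -> bool:
    """True if some window of `sequence` is within d mismatches of `candidate`."""
    l = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def ld_motifs(sequences, l, d, alphabet="ACGT"):
    """Enumerate all length-l strings that occur, with at most d
    mismatches, in every input sequence (exhaustive search)."""
    motifs = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        if all(occurs_within(candidate, s, d) for s in sequences):
            motifs.append(candidate)
    return motifs


if __name__ == "__main__":
    print(ld_motifs(["ACGT", "ACGA"], l=3, d=0))
```

The search tries every one of the |alphabet|^l candidates, so it is only practical for very small l; it serves to pin down the definition rather than to suggest a usable method.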
SLiMDisc: short, linear motif discovery, correcting for common evolutionary descent
, 2006
Pattern matching statistics on correlated sources
 In Proc. of LATIN’06
, 2006
Cited by 6 (1 self)
In pattern matching algorithms, two characteristic parameters play an important rôle: the number of occurrences of a given pattern, and the number of positions where a pattern occurrence ends. Since there may exist many occurrences which end at the same position, these two parameters may differ in a significant way. Here, we consider a general framework where the text is produced by a probabilistic source, which can be built by a dynamical system. Such “dynamical sources” encompass the classical sources (memoryless sources and Markov chains), and may possess a high degree of correlations. We are mainly interested in two situations: either the pattern is a general word of a regular expression, and we study the number of occurrence positions; or the pattern is a finite set of strings, and we study the number of occurrences. In both cases, we determine the mean and the variance of the parameter, and prove that its distribution is asymptotically Gaussian. In this way, we extend methods and results which have been already obtained for classical sources [for instance in [9] and in [6]] to this general “dynamical” framework. Our methods use various techniques: formal languages, and generating functions, as in previous works. However, in this correlated model, it is not possible to use a direct transfer into generating functions, and we mainly deal with generating operators which generate... generating functions.
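The two parameters contrasted at the start of this abstract can be made concrete with a short sketch for the case where the pattern is a finite set of strings. The function name `occurrence_stats` is hypothetical, and the naive scan is only meant to show how the two counts diverge, not to reflect the paper's probabilistic analysis.

```python
def occurrence_stats(patterns, text):
    """Return (number of occurrences, number of distinct end positions)
    of a finite set of pattern strings in `text`.

    Overlapping occurrences are counted; empty patterns are ignored.
    """
    occurrences = 0
    end_positions = set()
    for word in set(patterns):
        if not word:
            continue
        start = text.find(word)
        while start != -1:
            occurrences += 1
            # Record the index just past the last matched symbol.
            end_positions.add(start + len(word))
            # Advance by one symbol so overlapping matches are found.
            start = text.find(word, start + 1)
    return occurrences, len(end_positions)


if __name__ == "__main__":
    # "a" and "ba" both end at the last symbol of "aba", so there are
    # three occurrences but only two end positions.
    print(occurrence_stats({"a", "ba"}, "aba"))
```

Whenever one pattern is a suffix of another, several occurrences share an end position, which is the situation the abstract points to when it says the two parameters may differ significantly.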