Incremental Paradigms of Motif Discovery
Journal of Computational Biology, 2004
Cited by 16 (4 self)
We examine the problem of extracting maximal irredundant motifs from a string. A combinatorial argument poses a linear bound on the total number of such motifs, thereby opening the way to the quest for the fastest and most efficient methods of extraction. The basic paradigm explored here is that of iterated updates of the set of irredundant motifs in a string under consecutive unit symbol extensions of the string itself. This approach exposes novel characterizations for the base set of motifs in a string, hinged on notions of partial order. Such properties support the design of ad hoc data structures and constructs, and lead to an O(n^3) time incremental discovery algorithm.
Greedy Mixture Learning for Multiple Motif Discovery in Biological Sequences
2003
Cited by 10 (2 self)
Motivation: This paper studies the problem of discovering subsequences, known as motifs, that are common to a given collection of related biosequences, by proposing a greedy algorithm for learning a mixture of motifs model through likelihood maximization. The approach sequentially adds a new motif to a mixture model by performing a combined scheme of global and local search to initialize its parameters appropriately. In addition, a hierarchical partitioning scheme based on kd-trees is presented for partitioning the input dataset in order to speed up the global search procedure. The proposed method compares favorably with the well-known MEME approach and successfully addresses several drawbacks of MEME.
A similar fragments merging approach to learn automata on proteins
In: Machine Learning: ECML, 2005
Cited by 7 (2 self)
Internal publication no. 1735, July 2005, 18 pages. Abstract: We propose here to learn automata for the characterization of protein families, to overcome the limitations of the position-specific characterizations classically used in Pattern Discovery. We introduce a new heuristic approach for learning non-deterministic automata, based on the selection and ordering of significantly similar fragments to be merged and on the identification of physico-chemical properties. The quality of the characterization of the major intrinsic protein (MIP) family is assessed by leave-one-out cross-validation over a large range of model specificity. Keywords: grammatical inference, automata, proteins. Goulven Kerbellec is supported by a PhD research grant from Région Bretagne.
Pattern discovery and the algorithmics of surprise
Proceedings of the NATO ASI on Artificial Intelligence and Heuristic Methods for Bioinformatics, 2003
Gene Expression Profiling of DNA Microarray Data using Peano Count Trees (P-Trees)
In Online Proceedings of the First Virtual Conference on Genomics and Bioinformatics, October 2001. URL: http://midas10.cs.ndsu.nodak.edu/bio
Cited by 6 (1 self)
The explosion of genomic data, made possible by advances in parallel, high-throughput technologies in molecular biology, has ushered in a new era in Bioinformatics. In recent years, efforts concentrated on sequencing the genomes of organisms. Current emphasis lies in extracting meaningful information from this huge volume of DNA sequence and expression data. The techniques currently employed to analyze microarray expression data are clustering and classification. These techniques have their own limitations as to the amount of useful information that can be derived. In this paper, we propose a new approach to mining microarray data using a new data mining technology called the Peano Count Tree (P-tree). This technology employs Association Rule Mining as a means of mining the microarray data.
Analysis Of An Associative Memory Neural Network For Pattern Identification In Gene Expression Data
2001
Cited by 5 (0 self)
DNA microarrays are becoming a standard tool for determining the role of genes in the regulation of any biological process in an organism. The application of this technology to the analysis of gene expression creates enormous opportunities for accelerating progress towards the understanding of living systems and for the identification of target genes and pathways for drug development. However, equally efficient methods need to be developed for upgrading the information content of the large amounts of data generated by microarray experiments. A procedure for extracting patterns of gene expression through the analysis of the architecture of an associative memory neural network is described. Such patterns contain critical information about the gene-networking relationships observed during changes in cell physiology and the onset of diseases. The proposed method has been tested on two different microarray data sets, namely DeRisi's experiment on yeast cultures [10] and Golub's analysis of acute human leukemia molecular profiles [17]. Using these data sets, the neural network structure has been examined to extract relationships among different genes involved in major metabolic pathways and to relate specific genes to different classes of leukemia. Keywords: gene expression data, cDNA microarrays, neural networks, pattern recognition, data mining.
Pattern matching statistics on correlated sources
In Proc. of LATIN’06, 2006
Cited by 5 (0 self)
In pattern matching algorithms, two characteristic parameters play an important rôle: the number of occurrences of a given pattern, and the number of positions where a pattern occurrence ends. Since many occurrences may end at the same position, these two parameters may differ significantly. Here, we consider a general framework where the text is produced by a probabilistic source, which can be built by a dynamical system. Such “dynamical sources” encompass the classical sources (memoryless sources and Markov chains) and may possess a high degree of correlations. We are mainly interested in two situations: (i) the pattern is a general word of a regular expression, and we study the number of occurrence positions; (ii) the pattern is a finite set of strings, and we study the number of occurrences. In both cases, we determine the mean and the variance of the parameter, and prove that its distribution is asymptotically Gaussian. In this way, we extend methods and results already obtained for classical sources (for instance in [9] and [6]) to this general “dynamical” framework. Our methods use various techniques, including formal languages and generating functions, as in previous works. However, in this correlated model, it is not possible to use a direct transfer into generating functions, and we mainly deal with generating operators which generate... generating functions.
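To make the distinction between the two parameters concrete, here is a minimal sketch for the case where the pattern is a finite set of strings. The function name `occurrence_stats` and the example inputs are illustrative assumptions, not taken from the paper, and the code only counts on a fixed text; it does not model the probabilistic sources studied there.

```python
def occurrence_stats(text, patterns):
    """Return (number of occurrences, number of distinct end positions).

    An occurrence is a pair (pattern, end) such that the pattern is a
    suffix of text[:end]. Several occurrences can share one end position.
    """
    distinct_patterns = {p for p in patterns if p}  # ignore empty strings
    occurrences = 0
    end_positions = set()
    for end in range(1, len(text) + 1):
        for p in distinct_patterns:
            # str.endswith(p, 0, end) tests whether text[0:end] ends with p
            if text.endswith(p, 0, end):
                occurrences += 1
                end_positions.add(end)
    return occurrences, len(end_positions)


if __name__ == "__main__":
    # "ab" and "b" both end at positions 2 and 4 of "abab"
    print(occurrence_stats("abab", ["ab", "b"]))
```

In this example the pattern set yields four occurrences but only two occurrence end positions, which is the kind of gap between the two parameters that the abstract describes.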
Discovering biological motifs with genetic programming
In Genetic And Evolutionary Computation Conference. Proceedings of the 2005 conference on Genetic and evolutionary computation, 2005
Cited by 4 (1 self)
Choosing the right representation for a problem is important. In this article we introduce a linear genetic programming approach for motif discovery in protein families, and we also present a thorough comparison between our approach and Koza-style genetic programming using ADFs. In a study of 45 protein families, we demonstrate that our algorithm, given equal processing resources and no prior knowledge in shaping of datasets, consistently generates motifs that are of significantly better quality than those we found by using trees as representation. For several of the studied protein families we evolve motifs comparable to those found in Prosite, a manually curated database of protein motifs. Our linear genome gave better results than Koza-style genetic programming for 37 of 45 families. The difference is statistically significant for 24 of the families at the 99% confidence level.
An Extension and Novel Solution to the (l,d)-Motif Challenge Problem
2004
Cited by 4 (0 self)
The (l,d)-motif challenge problem, as introduced by Pevzner and Sze [12], is a mathematical abstraction of the DNA functional site discovery task. Here we expand the (l,d)-motif problem to more accurately model this task and present a novel algorithm to solve this extended problem. This algorithm is guaranteed to find all (l,d)-motifs in a set of input sequences with unbounded support and length. We demonstrate the performance of the algorithm on publicly available datasets and show that the algorithm deterministically enumerates the optimal motifs.
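For orientation, the basic (l,d)-motif problem can be sketched as a brute-force search: find every string of length l that occurs in each input sequence with at most d mismatches. This is a naive illustration of the basic problem statement only, not the extended problem or the algorithm of the paper; the function names and the default DNA alphabet are assumptions made for the example.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def neighbors(kmer, d, alphabet="ACGT"):
    """All strings over the alphabet within Hamming distance d of kmer."""
    result = {kmer}
    for k in range(1, d + 1):
        for positions in combinations(range(len(kmer)), k):
            for letters in product(alphabet, repeat=k):
                chars = list(kmer)
                for pos, ch in zip(positions, letters):
                    chars[pos] = ch
                result.add("".join(chars))
    return result


def ld_motifs(sequences, l, d, alphabet="ACGT"):
    """Return the sorted list of all (l,d)-motifs of the given sequences."""

    def occurs(motif, seq):
        # True if some window of length l is within d mismatches of motif
        return any(hamming(motif, seq[i:i + l]) <= d
                   for i in range(len(seq) - l + 1))

    # Any motif must lie within distance d of some l-mer of the first
    # sequence, so those neighborhoods form a complete candidate set.
    first = sequences[0]
    candidates = set()
    for i in range(len(first) - l + 1):
        candidates |= neighbors(first[i:i + l], d, alphabet)
    return sorted(m for m in candidates
                  if all(occurs(m, s) for s in sequences))


if __name__ == "__main__":
    print(ld_motifs(["ACGT", "TCGA"], l=3, d=1))
```

The candidate set grows quickly with l and d, which is why exhaustive enumeration of this kind is only practical for small parameters.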
Hidden Word Statistics
Cited by 4 (1 self)
We consider the sequence comparison problem, also known as "hidden" pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is...
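As an illustration of searching for a pattern as a subsequence rather than as a block of consecutive symbols, the sketch below counts the ways a pattern can be embedded in a text. The choice of this quantity and the function name `count_subsequence_occurrences` are assumptions for the example; the truncated abstract above does not say which parameter the paper analyzes.

```python
def count_subsequence_occurrences(text, pattern):
    """Count the position tuples i1 < i2 < ... < im in text that spell pattern."""
    m = len(pattern)
    # ways[j] = number of embeddings of pattern[:j] in the text read so far
    ways = [1] + [0] * m
    for ch in text:
        # go right to left so each text symbol is used at most once per embedding
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[m]


if __name__ == "__main__":
    # embeddings of "ab" in "abab": positions (0,1), (0,3), (2,3)
    print(count_subsequence_occurrences("abab", "ab"))
```

The dynamic program runs in time proportional to the product of the text and pattern lengths and uses memory linear in the pattern length.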

