Indexing and Retrieval for Genomic Databases
 IEEE Transactions on Knowledge and Data Engineering
, 2002
Cited by 45 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
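As a rough illustration of index-based candidate selection, the sketch below builds an inverted index from fixed-length substrings (k-mers) to sequence identifiers and ranks database sequences by how many distinct query k-mers they share. The abstract does not describe the actual index structure or ranking scheme, so the k-mer index, the function names, and the toy data are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_kmer_index(sequences, k):
    """Map each k-mer to the set of sequence ids that contain it (assumed index layout)."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def rank_candidates(index, query, k, top_n=2):
    """Rank sequences by the number of distinct query k-mers they share."""
    hits = Counter()
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        # .get avoids adding empty entries to the defaultdict
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return [seq_id for seq_id, _ in hits.most_common(top_n)]


# Hypothetical toy database; real collections are far larger.
database = {
    "seqA": "ACGTACGTGACC",
    "seqB": "TTTTGGGGCCCC",
    "seqC": "ACGTTCGTGACA",
}
index = build_kmer_index(database, k=4)
candidates = rank_candidates(index, "ACGTACGT", k=4, top_n=2)
```

Only the sequences returned in `candidates` would then be passed to the expensive local alignment step, which is where the savings described in the abstract would come from.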
VOTING ALGORITHMS FOR DISCOVERING LONG MOTIFS
Cited by 28 (6 self)
Pevzner and Sze [14] have introduced the Planted (l,d)-Motif Problem to find similar patterns (motifs) in sequences which represent the promoter region of co-regulated genes. l is the length of the motif and d is the maximum Hamming distance around the similar patterns. Many algorithms have been developed to solve this motif problem. However, these algorithms either have long running times or do not guarantee the motif can be found. In this paper, we introduce new algorithms to solve the motif problem. Our algorithms can find motifs in reasonable time for not only the challenging (9,2), (11,3), (15,5)-motif problems but for even longer motifs, say (20,7), (30,11) and (40,15), which have never been seriously attempted by other researchers because of heavy time and space requirements.
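The problem statement can be made concrete with a small voting-style sketch. Here every length-l substring of a sequence "votes" for all strings within Hamming distance d of it, each sequence votes at most once per candidate, and candidates supported by every sequence are reported. This is a brute-force illustration inferred from the paper's title, not the authors' algorithm; the function names and the example sequences are hypothetical, and the approach is only practical for small l and d.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def neighbors(lmer, d):
    """Return all strings within Hamming distance d of lmer."""
    result = set()
    for r in range(d + 1):
        for positions in combinations(range(len(lmer)), r):
            for letters in product(ALPHABET, repeat=r):
                candidate = list(lmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result


def find_motifs_by_voting(sequences, l, d):
    """Return candidates that receive a vote from every input sequence."""
    votes = {}
    for seq in sequences:
        voted = set()  # a sequence votes at most once per candidate
        for i in range(len(seq) - l + 1):
            voted |= neighbors(seq[i:i + l], d)
        for cand in voted:
            votes[cand] = votes.get(cand, 0) + 1
    return sorted(c for c, v in votes.items() if v == len(sequences))


# Hypothetical input: "GATTACA" planted with at most one substitution per sequence.
sequences = ["CCGATTACACC", "TTGACTACATT", "AAGATTGCAAA"]
motifs = find_motifs_by_voting(sequences, l=7, d=1)
```

The neighborhood of an l-mer grows quickly with d, which is consistent with the time and space concerns the abstract raises for long motifs.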
Computational identification of transcriptional regulatory elements in DNA sequence
, 2006
Cited by 24 (0 self)
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
An efficient algorithm for the extended (l,d)motif problem with unknown number of binding sites
 Proc. BIBE
, 2005
Cited by 7 (2 self)
Finding common patterns, or motifs, from a set of DNA sequences is an important problem in molecular biology. Most motif-discovering algorithms/software require the length of the motif as input. Motivated by the fact that the motif’s length is usually unknown in practice, Styczynski et al. introduced the Extended (l,d)-Motif Problem (EMP), where the motif’s length is not an input parameter. Unfortunately, the algorithm given by Styczynski et al. to solve EMP can take an unacceptably long time to run, e.g. over 3 months to discover a length-14 motif. This paper makes two main contributions. First, we eliminate another input parameter from EMP: the minimum number of binding sites in the DNA sequences. Fewer input parameters not only reduce the burden on the user, but may also give more realistic/robust results, since restrictions on length or on the number of binding sites make little sense when the best motif may be neither the longest nor the one with the largest number of binding sites. Second, we develop an efficient algorithm to solve our redefined problem. The algorithm is also a fast solution for EMP (without any sacrifice in accuracy), making EMP practical.
Generalized Planted (l,d)Motif Problem with Negative Set
 WABI
, 2005
Cited by 5 (4 self)
Finding similar patterns (motifs) in a set of sequences is an important problem in Computational Molecular Biology. Pevzner and Sze [18] defined the planted (l,d)-motif problem as trying to find a length-l pattern that occurs in each input sequence with at most d substitutions. When d is large, this problem is difficult to solve because the input sequences do not contain enough information on the motif. In this paper, we propose a generalized planted (l,d)-motif problem which considers as input an additional set of sequences without any substring similar to the motif (negative set) as extra information. We analyze the effects of this negative set on the finding of motifs, and define a set of unsolvable problems and another set of most difficult problems, known as “challenging generalized problems”. We develop an algorithm called VANS based on voting and other novel techniques, which can solve the (9,3), (11,4), (15,6) and (20,8)-motif problems which were unsolvable before as well as challenging problems of the planted (l,d)-motif problem such as (9,2), (11,3), (15,5) and (20,7)-motif problems.
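The definition above can be sketched as a simple verifier: a candidate pattern qualifies if it occurs in every input (positive) sequence with at most d substitutions and in no negative sequence. The sketch assumes that "similar to the motif" in the negative set also means "within d substitutions"; that reading, the function names, and the sample sequences are assumptions, and this is not the VANS algorithm, which searches for motifs rather than checking a given one.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(pattern, sequence, d):
    """True if some substring of sequence is within d substitutions of pattern."""
    l = len(pattern)
    return any(
        hamming(pattern, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def is_generalized_motif(pattern, positives, negatives, d):
    """Check the pattern against both the positive and the negative set."""
    in_all_positives = all(occurs_within(pattern, s, d) for s in positives)
    in_no_negative = not any(occurs_within(pattern, s, d) for s in negatives)
    return in_all_positives and in_no_negative


# Hypothetical data for illustration.
positives = ["CCGATTACACC", "TTGACTACATT", "AAGATTGCAAA"]
clean_negatives = ["CCCCCCCCCCC", "TTTTTTTTTTT"]
bad_negatives = ["CCCCCCCCCCC", "AAGATTACAAA"]

accepted = is_generalized_motif("GATTACA", positives, clean_negatives, d=1)
rejected = is_generalized_motif("GATTACA", positives, bad_negatives, d=1)
```

The negative set acts as a filter: a candidate that would pass on the positive sequences alone is discarded as soon as any negative sequence contains a close match.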
OPTIMAL ALGORITHM FOR FINDING DNA MOTIFS WITH NUCLEOTIDE ADJACENT DEPENDENCY
Finding motifs and the corresponding binding sites is a critical and challenging problem in studying the process of gene expression. String and matrix representations are two popular models to represent a motif. However, both representations share an important weakness: they assume that the occurrence of a nucleotide in a binding site is independent of other nucleotides. More complicated representations, such as HMMs or regular expressions, can capture the nucleotide dependency. Unfortunately, these models are not practical (they have too many parameters and require many known binding sites). Recently, Chin and Leung introduced the SPSP representation, which overcomes the limitations of these complicated models. However, discovering novel motifs in the SPSP representation is still an NP-hard problem. In this paper, based on our observations in real binding sites, we propose the Dependency Pattern Sets (DPS) representation, which is simpler than the SPSP model but can still capture the nucleotide dependency. We develop a branch and bound algorithm (DPS-Finder) for finding optimal DPS motifs. Experimental results show that DPS-Finder can discover a length-10 motif from 22 length-500 DNA sequences within a few minutes and that the DPS representation performs similarly to the SPSP representation.
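The independence weakness of the matrix representation can be seen in a small example. The sketch below builds a position probability matrix from a hypothetical set of binding sites in which the first two positions always co-vary with the last two, then scores a site that was observed and one that never occurs. It illustrates only the matrix model; it does not implement the DPS or SPSP representations, and the pseudocount, function names, and site data are assumptions.

```python
import math


def build_matrix(sites, pseudocount=1.0):
    """Per-position nucleotide probabilities, estimated independently per column."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in "ACGT"})
    return matrix


def log_likelihood(matrix, site):
    """Sum of per-position log probabilities (the independence assumption)."""
    return sum(math.log(column[base]) for column, base in zip(matrix, site))


# Hypothetical sites: "AA" is always followed by "GG", and "TT" by "CC".
sites = ["AAGG", "AAGG", "TTCC", "TTCC"]
matrix = build_matrix(sites)

seen = log_likelihood(matrix, "AAGG")    # combination present in the data
unseen = log_likelihood(matrix, "AACC")  # combination never observed
```

Because each column is scored on its own, `seen` and `unseen` come out equal, so the matrix cannot tell a real site from a mixture of two site types. A representation that captures dependency between positions is meant to separate such cases.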
Predicting rules on organization of cis-regulatory elements, taking the order of elements into account
Goro Terai and Toshihisa Takagi
Motivation: In eukaryotes, rules regarding organization of cis-regulatory elements are complex. They sometimes govern multiple kinds of elements and positional restrictions on elements.
Results: We propose a method for detecting rules by which the order of elements is restricted. The order restriction is expressed as element patterns. We extract all the element patterns that occur in promoter regions of at least the specified number of genes. Then, we find significant patterns based on the expression similarity of genes with promoter regions containing each of the extracted patterns. When we applied our method to Saccharomyces cerevisiae, we detected significant patterns overlooked by previous methods, thus demonstrating the utility of our method for analyses of eukaryotic gene regulation. We also suggest that several types of element organization exist: (i) those in which only the order of elements is important, (ii) those in which both order and distance are important, and (iii) those in which only the combination of elements is important.
Bioinformatics
, 2003
Selection of significant genes via expression patterns is an important problem in microarray experiments. Owing to small sample size and the large number of variables (genes), the selection process can be unstable. This paper proposes a hierarchical Bayesian model for gene (variable) selection. We employ latent variables to specialize the model to a regression setting and use a Bayesian mixture prior to perform the variable selection. We control the size of the model by assigning a prior distribution over the dimension (number of significant genes) of the model. The posterior distributions of the parameters are not in explicit form, so we use a combination of truncated sampling and Markov Chain Monte Carlo (MCMC) based computation techniques to simulate the parameters from the posteriors. The Bayesian model is flexible enough to identify significant genes as well as to perform future predictions. The method is applied to cancer classification via cDNA microarrays where the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the method is used to identify a set of significant genes. The method is also applied successfully to the leukemia data.
Identification of functional elements in unaligned nucleic acid sequences by a novel tuple search algorithm