Whole-genome comparative annotation and regulatory motif discovery in multiple yeast species; 2003
- Proceedings of the 7th International Conference on Research in Computational Molecular Biology 2003, 7
, 2003
Cited by 5 (0 self)
In [13] we reported the genome sequences of S. paradoxus, S. mikatae and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and non-coding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. We developed methods for the automatic comparative annotation of the four species and the determination of orthologous genes and intergenic regions. The algorithms enabled the automatic identification of orthologs for more than 90% of genes despite the large number of duplicated genes in the yeast genome, and the discovery of recent gene family expansions and genome rearrangements. We also developed a test to validate
Family Pairwise Search with Embedded Motif Models
, 1999
Cited by 4 (0 self)
Motivation: Statistical models of protein families, such as position-specific scoring matrices, profiles and hidden Markov models, have been used effectively to find remote homologs when given a set of known protein family members. Unfortunately, training these models typically requires a relatively large set of training sequences. Recent work [Grundy, 1998] has shown that, when only a few family members are known, several theoretically justified statistical modeling techniques fail to provide homology detection performance on a par with Family Pairwise Search (FPS), an algorithm that combines scores from a pairwise sequence similarity algorithm such as BLAST. Results: This paper provides a model-based algorithm that improves FPS by incorporating hybrid motif-based models of the form generated by Cobbler [Henikoff and Henikoff, 1997]. For the 73 protein families investigated here, this cobbled FPS algorithm provides better homology detection performance than either Cobbler or FPS alo...
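The core idea of Family Pairwise Search, combining pairwise similarity scores between a query and each known family member, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `pairwise_score` is a hypothetical toy stand-in for a real tool such as BLAST (it simply counts shared 3-mers), and the choice of `max` or `mean` as the combining rule is an assumption made for the example.

```python
from statistics import mean


def pairwise_score(a: str, b: str) -> int:
    """Toy similarity: number of distinct 3-mers shared by two sequences.

    Hypothetical stand-in for a real pairwise score such as a BLAST bit score.
    """
    kmers_a = {a[i:i + 3] for i in range(len(a) - 2)}
    kmers_b = {b[i:i + 3] for i in range(len(b) - 2)}
    return len(kmers_a & kmers_b)


def family_score(query: str, family: list[str], combine=max) -> float:
    """Score a query against a family by combining its pairwise scores.

    `combine` is any function that reduces an iterable of scores to one number
    (assumed here to be max or mean).
    """
    return combine(pairwise_score(query, member) for member in family)


family = ["ACDEFGHIK", "ACDEFGHLL"]
print(family_score("ACDEFG", family))        # query resembling the family
print(family_score("WWWWWW", family))        # unrelated query
print(family_score("ACDEFG", family, mean))  # alternative combining rule
```

Ranking database sequences by `family_score` would then give a family-level search without training a statistical model, which is the setting the abstract describes for families with few known members.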
Three-Dimensional Shape-Structure Comparison Method for Protein Classification
- IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
, 2006
Cited by 4 (0 self)
In this paper, a 3D shape-based approach is presented for the efficient search, retrieval, and classification of protein molecules. The method relies primarily on the geometric 3D structure of the proteins, which is produced from the corresponding PDB files and secondarily on their primary and secondary structure. After proper positioning of the 3D structures, in terms of translation and scaling, the Spherical Trace Transform is applied to them so as to produce geometry-based descriptor vectors, which are completely rotation invariant and perfectly describe their 3D shape. Additionally, characteristic attributes of the primary and secondary structure of the protein molecules are extracted, forming attribute-based descriptor vectors. The descriptor vectors are weighted and an integrated descriptor vector is produced. Three classification methods are tested. A part of the FSSP/DALI database, which provides a structural classification of the proteins, is used as the ground truth in order to evaluate the classification accuracy of the proposed method. The experimental results show that the proposed method achieves more than 99 percent classification accuracy while remaining much simpler and faster than the DALI method.
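The weighting and integration of the two descriptor vectors can be illustrated with a short sketch. Everything specific here is an assumption for illustration: the weights, the concatenation scheme, and the nearest-neighbor classifier are not taken from the paper, which does not name its three classification methods in this abstract.

```python
import math


def integrate(geometry: list[float], attributes: list[float],
              w_geom: float = 0.7, w_attr: float = 0.3) -> list[float]:
    """Scale each descriptor vector by its (assumed) weight and concatenate."""
    return [w_geom * g for g in geometry] + [w_attr * a for a in attributes]


def nearest_label(query: list[float],
                  labelled: list[tuple[list[float], str]]) -> str:
    """Return the label of the stored descriptor closest to the query
    (Euclidean distance), i.e. a 1-nearest-neighbor classifier."""
    return min(labelled, key=lambda item: math.dist(query, item[0]))[1]


# Hypothetical toy descriptors: two geometry values and one attribute value.
labelled = [
    (integrate([1.0, 0.0], [0.5]), "A"),
    (integrate([0.0, 1.0], [0.2]), "B"),
]
query = integrate([0.9, 0.1], [0.4])
print(nearest_label(query, labelled))
```

In practice the geometry vector would come from a rotation-invariant transform of the 3D structure and the attribute vector from sequence and secondary-structure features; the sketch only shows how the two are fused before classification.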
A Combinatorial Approach for Motif Discovery in Unaligned DNA Sequences
, 2004
Cited by 2 (0 self)
Motif (conserved pattern) modelling and finding in unaligned DNA sequences is a fundamental problem in computational biology with important applications in understanding gene regulation. Biological approaches for this problem are tedious and time-consuming. Large amounts of genome sequence data and gene expression micro-array data let us solve this problem computationally. Most computer science problems of this sort are NP-complete. Many heuristic approaches have been developed in the last decade, but the problem is far from being solved. Practical solutions must have good models to describe motifs, a sensitive scoring function to distinguish functional motifs, and a reliable algorithm to recognize these motifs. We discuss these ideas in this thesis and develop a combinatorial approach for profile motif models. Our approach combines greedy signal search and statistical refinement approaches. It tries to catch the motif signal through pairwise signals and to provide the statistical approach with good starting points. Comparative experiments on simulated challenge problems and on real biological samples demonstrate that our approach performs better than other widely used profile motif finding approaches.
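A profile motif model and a scoring function of the kind the abstract mentions can be sketched as below. This is a generic position-frequency profile with log-odds scoring, assuming a uniform background and a pseudocount of 1; it is not the thesis's combinatorial algorithm, and all function names are hypothetical.

```python
import math

BASES = "ACGT"


def build_profile(sites: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build per-position base frequencies from equal-length aligned sites."""
    profile = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in BASES})
    return profile


def log_odds(profile: list[dict[str, float]], window: str,
             background: float = 0.25) -> float:
    """Log-odds score of a window against a uniform background (assumed)."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(profile, window))


def best_window(profile: list[dict[str, float]], sequence: str) -> int:
    """Return the start index of the highest-scoring window in the sequence."""
    width = len(profile)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda s: log_odds(profile, sequence[s:s + width]))


profile = build_profile(["TATAAT", "TATAAT", "TATGAT"])
print(best_window(profile, "GGCCTATAATGGCC"))
```

A motif finder would alternate between choosing sites (for example, seeded by strong pairwise matches) and rebuilding the profile from them; the sketch covers only the model and the scoring step.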
A self-organizing neural network structure for motif identification in DNA sequences
- In Proceedings of the IEEE International Conference on Networking, Sensing and Control, 129–134
, 2005
Cited by 2 (0 self)
In this paper, we study the problem of subtle signal discoveries in unaligned DNA and protein sequences. Motifs, also known as approximate common substrings, are good examples of subtle signals in DNA and protein sequences. The problem of motif identification in DNA and protein sequences has been studied for many years in the literature. Major hurdles at this point include computational complexity and reliability of the searching algorithms. We will develop a self-organizing neural network for solving the problem of motif identification in DNA and protein sequences. Our network contains several layers with each layer performing classifications at different levels. The top layer divides the input space into a small number of regions and the bottom layer classifies all input patterns into motifs and non-motif patterns. Depending on the number of input patterns to be classified, several layers between the top layer and the bottom layer are needed to perform intermediate classification. We maintain a low computational complexity through the use of the layered structure so that each pattern’s classification is performed with respect to a small subspace of the whole input space. We also maintain a high reliability using our self-organizing neural network since the network will grow as needed to make sure all input patterns are considered and are given the same amount of attention. Finally, simulation results show that our algorithm significantly outperforms existing algorithms, especially in the reliability aspect. Our algorithm can identify motifs with higher accuracy than existing algorithms.
ES: Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery
- J Comput Biol
, 2004
Cited by 2 (0 self)
In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and non-coding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously annotated genes) and refining the gene structure of hundreds of genes. We present novel methods for the systematic de novo identification of regulatory motifs. The methods do not rely on previous knowledge of gene function and in that way differ from the current literature on computational motif discovery. Based
Zhang H: Motif discoveries in unaligned molecular sequences using self-organizing neural network
- IEEE Transactions on Neural Networks
Cited by 1 (0 self)
In this paper, we study the problem of motif discoveries in unaligned DNA and protein sequences. The problem of motif identification in DNA and protein sequences has been studied for many years in the literature. Major hurdles at this point include computational complexity and reliability of the search algorithms. We propose a self-organizing neural network structure for solving the problem of motif identification in DNA and protein sequences. Our network contains several layers with each layer performing classifications at different levels. The top layer divides the input space into a small number of regions and the bottom layer classifies all input patterns into motifs and non-motif patterns. Depending on the number of input patterns to be classified, several layers between the top layer and the bottom layer are needed to perform intermediate classifications. We maintain a low computational complexity through the use of the layered structure so that each pattern’s classification is performed with respect to a small subspace of the whole input space. Our self-organizing neural network will grow as needed (e.g., when more motif patterns are classified). It will give the same amount of attention to each input pattern and it will not omit any potential motif patterns. Finally, simulation results show that our algorithm outperforms existing algorithms in certain aspects. In particular, simulation results show that our algorithm can identify motifs with more mutations than existing algorithms and our algorithm works well for long DNA sequences as well. Index Terms: DNA sequences, motif finding, neural networks, protein sequences, self-organization, subtle signals.
Higher Order Hidden Markov Models for DNA-binding Site Identification
Cited by 1 (0 self)
ast weeks of writing, and took the time to proofread and offer general criticism in the final days.
Contents
1 Introduction
1.1 Background on Molecular Biology
1.2 The Problem: DNA Binding Site Identification
1.3 Hidden Markov Models (HMMs) - a New Computational Technique
2 Methods
2.1 HMM Design
2.2 HMM Decoding
2.3 Implementation Details
3 Results and Discussion
3.1 Identifying Sites: HMMs vs. Prior Techniques
3.2 Posterior Probability vs. Actual Binding Affinity
3.3 Conclusion
Glossary
Automated Semantic Analysis of Schematic Data
Cited by 1 (0 self)
Content in numerous Web data sources, designed primarily for human consumption, is not directly amenable to machine processing. Automated semantic analysis of such content facilitates its transformation into machine-processable and richly structured semantically annotated data. This paper describes a learning-based technique for semantic analysis of schematic data, which are characterized by being template-generated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of Web pages, the technique learns statistical models of these concepts using light-weight content features. These models direct the annotation of diverse Web pages possessing similar content semantics. The principles behind the technique find application in information retrieval and extraction problems. Focused Web browsing activities require only selective fragments of particular Web pages but are often performed using bookmarks, which fetch the contents of the entire page. This results in information overload for users of constrained interaction modality devices such as small-screen handheld devices. Fine-grained information extraction from Web pages, which is typically performed using page-specific, syntactic expressions known as wrappers, suffers from a lack of scalability and robustness. We report on the application of our technique in developing semantic bookmarks for retrieving targeted browsing content and semantic wrappers for robust and scalable information extraction from Web pages sharing a semantic domain.

