Results 1 - 5 of 5
Hidden Markov models for detecting remote protein homologies
Bioinformatics, 1998
Cited by 306 (12 self)
A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (hmm) from the sequence and homologs found using the hmm for database search. SAM-T98 is also used to construct model libraries automatically from sequences in structural databases. We evaluate the SAM-T98 method with four datasets. Three of the test sets are fold-recognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against wu-blastp and against double-blast, a two-step method similar to ISS, but using blast instead of fasta. Results: SAM-T98 had the fewest errors in all tests, dramatically so for the fold-recognition tests. At the minimum-error point on the SCOP-domains test, SAM-T98 got 880 true positives and 68 false positives, double-blast got 533 true positives with 71 false positives, and wu-blastp got 353 true positives with 24 false positives. The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to find family or fold relationships. One key to the performance of the hmm method is a new score-normalization technique that compares the score to the score with a reversed model rather than to a uniform null model. Availability: A World Wide Web server, as well as information on obtaining the Sequence Alignment and
PREPRINT to appear in Bioinformatics, 1999
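The score-normalization idea in this abstract (scoring against a reversed model instead of a uniform null model) can be illustrated with a minimal Python sketch. It assumes a toy ungapped profile with one emission distribution per position, which is far simpler than the hmms the abstract describes; the function names `log_prob` and `reverse_null_score` and the probabilities are hypothetical and are not taken from SAM.

```python
import math

def log_prob(seq, columns):
    """Log-probability of seq under an ungapped profile (one column per residue)."""
    assert len(seq) == len(columns)
    return sum(math.log(col[ch]) for ch, col in zip(seq, columns))

def reverse_null_score(seq, columns):
    """Log-odds of the profile against the same profile with its columns reversed."""
    return log_prob(seq, columns) - log_prob(seq, columns[::-1])

# Toy profile over a two-letter alphabet (hypothetical numbers).
profile = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.5, "B": 0.5},
    {"A": 0.2, "B": 0.8},
]

score_fwd = reverse_null_score("AAB", profile)  # matches the profile's orientation
score_rev = reverse_null_score("BAA", profile)  # matches the reversed profile
score_pal = reverse_null_score("ABA", profile)  # palindrome: both models agree
```

In this toy setting a sequence that fits the model in the forward direction scores positive, its mirror image scores equally negative, and a palindromic sequence scores zero, because the reversed model has the same composition and length as the original.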
Scoring Hidden Markov Models
Cited by 37 (5 self)
Motivation: Statistical sequence comparison techniques, such as hidden Markov models and generalized profiles, calculate the probability that a sequence was generated by a given model. Log-odds scoring is a means of evaluating this probability by comparing it to a null hypothesis, usually a simpler statistical model intended to represent the universe of sequences as a whole, rather than the group of interest. Such scoring leads to two immediate questions: what should the null model be, and what threshold of log-odds score should be deemed a match to the model. Results: This paper experimentally analyses these two issues. Within the context of the Sequence Alignment and Modeling software suite (SAM), we consider a variety of null models and suitable thresholds. Additionally, we consider HMMer's log-odds scoring and SAM's original Z-scoring method. Among the null model choices, a simple looping null model that emits characters according to the geometric mean of the character probabilities in the columns modeled by the HMM performs well or best across all four discrimination experiments.
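As a rough illustration of log-odds scoring against a geometric-mean null model, the Python sketch below uses a toy ungapped profile. The renormalization of the geometric means so that they sum to one, the function names, and the probabilities are all assumptions made for this example; the abstract does not specify these details, and this is not SAM's or HMMer's implementation.

```python
import math

def geometric_mean_null(columns):
    """Null distribution: per-character geometric mean over columns, renormalized."""
    alphabet = columns[0].keys()
    n = len(columns)
    gm = {ch: math.exp(sum(math.log(col[ch]) for col in columns) / n)
          for ch in alphabet}
    total = sum(gm.values())
    return {ch: value / total for ch, value in gm.items()}

def log_odds(seq, columns, null):
    """Sum over positions of log(model emission / null emission)."""
    assert len(seq) == len(columns)
    return sum(math.log(col[ch] / null[ch]) for ch, col in zip(seq, columns))

# Toy profile over a two-letter alphabet (hypothetical numbers).
profile = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.5, "B": 0.5},
    {"A": 0.2, "B": 0.8},
]

null = geometric_mean_null(profile)
score_match = log_odds("AAB", profile, null)     # follows the profile
score_mismatch = log_odds("BBA", profile, null)  # goes against the profile
```

A threshold on such a score then decides what counts as a match, which is the second question the abstract raises.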
Detection of Protein Coding Sequences Using a Mixture Model for Local Protein Amino Acid Sequence
BIOKDD01: Workshop on Data Mining in Bioinformatics (with SIGKDD01 Conference), 2000
Cited by 4 (0 self)
Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene finding approaches. A finite mixture model incorporating these patterns can partially discriminate protein sequences which have no (detectable) known homologs from randomized versions of these sequences, and from short ( 50 amino acids) noncoding segments extracted from the S. cerevisiae genome. The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.
Dealing with Size Limits in a Hardware Encoding of Weighted Finite Automata
2004
the maximum incoming degree. Prototype Implementation. Our practical implementation uses the Rdisk prototype, a parallel architecture designed for mass data filtering [6]. Data is distributed among several nodes linked by an Ethernet network, and each node houses a hard disk drive and a reconfigurable processor (an FPGA) which filters data on the fly. A host computer sends queries and collects results (Figure 2). With the current reconfigurable chip, the size constraint is p jj 600. Therefore, one can use WFA with 75 transitions and 8-bit weights. That covers common biological patterns like those of [1]. The speed constraint is less restrictive, as the clock runs at 40 MHz, and each board can filter data at 16 MB/s. That flow on a single board is more than 4 times faster than software simulation of WFA [4] on a 2 GHz PC. Massive parallelism is achieved through parallelization of several boards. Additional Transitions. The size limit becomes crucial in some applications, for
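The figures quoted in this excerpt can be checked with a small Python sketch. The exact form of the size constraint is garbled in the text, so the sketch assumes it is a product of transition count and weight width bounded by 600, a reading that is merely consistent with 75 transitions at 8 bits. The helper names are hypothetical, and the linear scaling across boards is an idealization, not a measured result.

```python
def fits_size_budget(n_transitions, weight_bits, budget=600):
    """Assumed reading of the size limit: transitions * bits per weight <= budget."""
    return n_transitions * weight_bits <= budget

def scan_seconds(data_mb, boards=1, mb_per_second_per_board=16.0):
    """Idealized scan time, assuming data is split evenly and boards scale linearly."""
    return data_mb / (mb_per_second_per_board * boards)

ok_at_limit = fits_size_budget(75, 8)    # 75 * 8 = 600, exactly at the limit
ok_over_limit = fits_size_budget(76, 8)  # 608 exceeds the assumed budget
one_board = scan_seconds(1600)           # 1600 MB at 16 MB/s
four_boards = scan_seconds(1600, boards=4)
```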