Results 1 - 10
of
43
Hidden Markov models for detecting remote protein homologies
- Bioinformatics
, 1998
"... A new hidden Markov model method (SAM-T98) for nding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (hmm) from the sequence and homologs found using the hmm for database search. SAM-T98 is ..."
Abstract
-
Cited by 229 (12 self)
- Add to MetaCart
A new hidden Markov model method (SAM-T98) for nding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (hmm) from the sequence and homologs found using the hmm for database search. SAM-T98 is also used to construct model libraries automatically from sequences in structural databases. We evaluate the SAM-T98 method with four datasets. Three of the test sets are fold-recognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against wu-blastp and against double-blast, a two-step method similar to ISS, but using blast instead of fasta. Results SAM-T98 had the fewest errors in all tests| dramatically so for the fold-recognition tests. At the minimum-error point on the SCOP-domains test, SAM-T98 got 880 true positives and 68 false positives, double-blast got 533 true positives with 71 false positives, and wu-blastp got 353 true positives with 24 false positives. The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to nd family or fold relationships. One key to the performance of the hmm method is a new score-normalization technique that compares the score to the score with a reversed model rather than to a uniform null model. Availability A World Wide Web server, as well as information on obtaining the Sequence Alignment and PREPRINT to appear in Bioinformatics, 1999
Duplicate record detection: A survey
- TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2007
"... Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a dif cult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard ..."
Abstract
-
Cited by 155 (4 self)
- Add to MetaCart
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a dif cult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar eld entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the ef ciency and scalability of approximate duplicate detection algorithms. We conclude with a coverage of existing tools and with a brief discussion of the big open problems in the area.
Combining Pairwise Sequence Similarity and Support Vector Machines for Remote Protein Homology Detection
- J. Comput. Biol
, 2002
"... One key element in understanding the molecular machinery of the cell is to understand the meaning, or function, of each protein encoded in the genome. A very successful means of inferring the function of a previously unannotated protein is via sequence similarity with one or more proteins whose func ..."
Abstract
-
Cited by 116 (12 self)
- Add to MetaCart
One key element in understanding the molecular machinery of the cell is to understand the meaning, or function, of each protein encoded in the genome. A very successful means of inferring the function of a previously unannotated protein is via sequence similarity with one or more proteins whose functions are already known. Currently, one of the most powerful such homology detection methods is the SVM-Fisher method of Jaakkola, Diekhans and Haussler (ISMB 2000). This method combines a generative, profile hidden Markov model (HMM) with a discriminative classification algorithm known as a support vector machine (SVM). The current work presents an alternative method for SVMbased protein classification. The method, SVM-pairwise, uses a pairwise sequence similarity algorithm such as SmithWaterman in place of the HMM in the SVM-Fisher method. The resulting algorithm, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better remote protein homology detection than SVM-Fisher, profile HMMs and PSI-BLAST.
A Database Index to Large Biological Sequences
- In VLDB
, 2001
"... We present an approach to searching genetic DNA sequences using an adaptation of the suffix tree data structure deployed on the general purpose persistent Java platform, PJama. Our implementation technique is novel, in that it allows us to build suffix trees on disk for arbitrarily large sequences, ..."
Abstract
-
Cited by 50 (3 self)
- Add to MetaCart
We present an approach to searching genetic DNA sequences using an adaptation of the suffix tree data structure deployed on the general purpose persistent Java platform, PJama. Our implementation technique is novel, in that it allows us to build suffix trees on disk for arbitrarily large sequences, for instance for the longest human chromosome consisting of 263 million letters. We propose to use such indexes as an alternative to the current practice of serial scanning. We describe our tree creation algorithm, analyse the performance of our index, and discuss the interplay of the data structure with object store architectures. Early measurements are presented.
Indexing and Retrieval for Genomic Databases
- IEEE Transactions on Knowledge and Data Engineering
, 2002
"... Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationall ..."
Abstract
-
Cited by 40 (6 self)
- Add to MetaCart
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in signi cant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
Applications of generalized pair hidden Markov models to alignment and gene finding problems
- J. Comput. Biol
, 2002
"... Hidden Markov models (HMMs) have been successfully applied to a variety of problems in molecular biology, ranging from alignment problems to gene � nding and annotation. Alignment problems can be solved with pair HMMs, while gene � nding programs rely on generalized HMMs in order to model exon lengt ..."
Abstract
-
Cited by 23 (2 self)
- Add to MetaCart
Hidden Markov models (HMMs) have been successfully applied to a variety of problems in molecular biology, ranging from alignment problems to gene � nding and annotation. Alignment problems can be solved with pair HMMs, while gene � nding programs rely on generalized HMMs in order to model exon lengths. In this paper, we introduce the generalized pair HMM (GPHMM), which is an extension of both pair and generalized HMMs. We show how GPHMMs, in conjunction with approximate alignments, can be used for cross-species gene � nding and describe applications to DNA–cDNA and DNA–protein alignment. GPHMMs provide a unifying and probabilistically sound theory for modeling these problems. Key words: hidden Markov model, alignment, gene � nding, comparative genomics. 1.
Kestrel: A Programmable Array for Sequence Analysis
, 1998
"... Kestrel is a programmable linear array processor designed for sequence analysis. Among other features, Kestrel includes an 8-bit word, a single-cycle add-and-minimize instruction, a multiplier and efficient communication using shared registers. This paper describes Kestrel’s functional units in de ..."
Abstract
-
Cited by 20 (8 self)
- Add to MetaCart
Kestrel is a programmable linear array processor designed for sequence analysis. Among other features, Kestrel includes an 8-bit word, a single-cycle add-and-minimize instruction, a multiplier and efficient communication using shared registers. This paper describes Kestrel’s functional units in detail, and examines each of their effects on system performance. With functional prototype chips completed, we will assemble a full single-board Kestrel array, with 512 processing elements on eight chips, in early 1998.
Classifying Molecular Sequences Using a Linkage Graph With Their Pairwise Similarities
- Theoretical Computer Science
, 1999
"... This paper presents a method for classifying a large and mixed set of uncharacterized sequences provided by genome projects. As the measure of sequence similarity, we use similarity score computed by a method based on the dynamic programming (DP), such as the Smith--Waterman local alignment algorith ..."
Abstract
-
Cited by 15 (5 self)
- Add to MetaCart
This paper presents a method for classifying a large and mixed set of uncharacterized sequences provided by genome projects. As the measure of sequence similarity, we use similarity score computed by a method based on the dynamic programming (DP), such as the Smith--Waterman local alignment algorithm. Although comparison by DP based method is very sensitive, when given sequences include a family of sequences that are much diverged in evolutionary process, similarity among some of them may be hidden behind spurious similarity of some unrelated sequences. Also the distance derived from the similarity score may not be metric (i.e., triangle inequality may not hold) when some sequences have multi-domain structure. To cope with these problems, we introduce a new graph structure called p-quasi complete graph for describing a family of sequences with a con#dence measure. We prove that a restricted version of the p- quasi complete graph problem (given a positive integer k, whether a graph cont...
Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models
- J. COMP. BIOL
, 2001
"... The score statistics of probabilistic gapped local alignment of random sequences is investigated both analytically and numerically. The full probabilistic algorithm (e.g., the “local” version of maximum-likelihood or hidden Markov model method) is found to have anomalous statistics. A modified “semi ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
The score statistics of probabilistic gapped local alignment of random sequences is investigated both analytically and numerically. The full probabilistic algorithm (e.g., the “local” version of maximum-likelihood or hidden Markov model method) is found to have anomalous statistics. A modified “semi-probabilistic” alignment consisting of a hybrid of Smith–Waterman and probabilistic alignment is then proposed and studied in detail. It is predicted that the score statistics of the hybrid algorithm is of the Gumbel universal form, with the key Gumbel parameter l taking on a fixed asymptotic value for a wide variety of scoring systems and parameters. A simple recipe for the computation of the “relative entropy,” and from it the finite size correction to l, is also given. These predictions compare well with direct numerical simulations for sequences of lengths between 100 and 1,000 examined using various PAM substitution scores and affine gap functions. The sensitivity of the hybrid method in the detection of sequence homology is also studied using correlated sequences generated from toy mutation models. It is found to be comparable to that of the Smith–Waterman alignment and significantly better than the Viterbi version of the probabilistic alignment.
Evaluation Measures of Multiple Sequence Alignments
- J Comput Biol
, 1996
"... Multiple sequence alignments (MSAs) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for the understanding of structure, functionality and ultimately evolution of proteins. A new algorithm, the Circular Sum (CS) method, is present ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
Multiple sequence alignments (MSAs) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for the understanding of structure, functionality and ultimately evolution of proteins. A new algorithm, the Circular Sum (CS) method, is presented for formally evaluating the quality of an MSA. It is based on the use of a solution to the Traveling Salesman Problem, which identifies a circular tour through an evolutionary tree connecting the sequences in a protein family. With this approach the calculation of an evolutionary tree and the errors that it would introduce can be avoided altogether. The algorithm gives an upper bound, the best score that can possibly be achieved by any MSA for a given set of protein sequences. Alternatively, if presented with a specific MSA, the algorithm provides a formal score for the MSA, which serves as an absolute measure of the quality of the MSA. The CS measure yields a direct connection be...

