Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space (2000)

by A. Apostolico, G. Bejerano

Results 1 - 10 of 19

Protein family classification using sparse markov transducers

by Eleazar Eskin, William Stafford Noble, Yoram Singer - PROC. 8TH INT. CONF. INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY, 2003
Cited by 23 (9 self)
We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMT). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing for wild-cards in the conditioning sequences. Since substitutions of amino acids are common in protein families, incorporating wild-cards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. As protein databases become larger, data driven learning algorithms for probabilistic models such as SMTs will require vast amounts of memory. We therefore describe and use efficient data structures to improve the memory usage of SMTs. We evaluate SMTs by building protein family classifiers using the Pfam and SCOP databases and compare our results to previously published results and state-of-the-art protein homology detection methods. SMTs outperform previous probabilistic suffix tree methods and under certain conditions perform comparably to state-of-the-art protein homology methods.

CLUSEQ: Efficient and Effective Sequence Clustering

by Jiong Yang, Wei Wang - In ICDE, 2003
Cited by 21 (2 self)
Analyzing sequence data has become increasingly important recently in the area of biological sequences, text documents, web access logs, etc. In this paper, we investigate the problem of clustering sequences based on their structural features. As a widely recognized technique, clustering has proven to be very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being performed extensively on sequence data (in categorical domain) is the lack of an effective yet efficient similarity measure. Therefore, we propose a novel model (CLUSEQ) for sequence clustering by exploring significant statistical properties possessed by the sequences. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize sequence behavior and to support the similarity measure. A variation of the suffix tree, namely probabilistic suffix tree, is employed to organize (the significant portion of) the CPD in a concise way. A novel algorithm is devised to efficiently discover clusters with high quality and is able to automatically adjust the number of clusters to its optimal range via a unique combination of successive new cluster generation and cluster consolidation. The performance of CLUSEQ has been demonstrated via extensive experiments on several real and synthetic sequence databases.
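
As a rough illustration of the conditional probability distribution mentioned in this abstract, the sketch below estimates P(next symbol | preceding segment) from raw counts. It is not the CLUSEQ algorithm: the function name `build_cpd`, the `max_context` parameter, and the plain maximum-likelihood estimate are assumptions made for the example, and a flat dictionary stands in for the probabilistic suffix tree.

```python
from collections import Counter, defaultdict


def build_cpd(sequences, max_context=2):
    """Estimate P(next symbol | preceding segment) from counts.

    Hypothetical helper: every preceding segment of length
    0..max_context seen in the training sequences becomes a key.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            # k = 0 gives the empty context, i.e. the symbol frequencies.
            for k in range(0, min(max_context, i) + 1):
                context = seq[i - k:i]
                counts[context][symbol] += 1

    # Normalize the counts of each context into a distribution.
    cpd = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        cpd[context] = {s: c / total for s, c in counter.items()}
    return cpd


if __name__ == "__main__":
    cpd = build_cpd(["abab", "abb"], max_context=1)
    print(cpd["a"])  # distribution of the symbol that follows "a"
    print(cpd["b"])  # distribution of the symbol that follows "b"
```

A real implementation would keep only the statistically significant contexts, as the abstract indicates, rather than every observed segment.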

Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources

by Yevgeny Seldin, Gill Bejerano, Naftali Tishby - In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), 2001
Cited by 8 (5 self)
We present a novel information theoretic algorithm for unsupervised segmentation of sequences into alternating Variable Memory Markov sources. The algorithm is based on competitive learning between Markov models, when implemented as Prediction Suffix Trees (Ron et al., 1996) using the MDL principle. By applying a model clustering procedure, based on rate distortion theory combined with deterministic annealing, we obtain a hierarchical segmentation of sequences between alternating Markov sources. The algorithm seems to be self regulated and automatically avoids over segmentation. The method is applied successfully to unsupervised segmentation of multilingual texts into languages where it is able to infer correctly both the number of languages and the language switching points. When applied to protein sequence families, we demonstrate the method's ability to identify biologically meaningful sub-sequences within the proteins, which correspond to important functional sub-units called domains.

The power of selective memory: self-bounded learning of prediction suffix trees

by Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer - In Advances in Neural Information Processing Systems 17, 2004
Cited by 7 (2 self)
Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight.

Pattern discovery and the algorithmics of surprise

by Alberto Apostolico - Proceedings of the NATO ASI on Artificial Intelligence and Heuristic Methods for Bioinformatics, 2003
Cited by 6 (2 self)
Abstract not found

Markovian domain fingerprinting: statistical segmentation of protein sequences

by Gill Bejerano, Yevgeny Seldin, Hanah Margalit, Naftali Tishby, 2001
Cited by 6 (1 self)
Motivation: Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional Multiple Sequence Alignment (MSA) based methods find difficulties when faced with heterogeneous groups of proteins. However, even many families of proteins that do share a common domain contain instances of several other domains, without any common underlying linear ordering. Ignoring this modularity may lead to poor or even false classification results. An automated method that can analyze a group of proteins into the sequence domains it contains is therefore highly desirable. Results: We apply a novel method to the problem of protein domain detection. The method takes as input an unaligned group of protein sequences. It segments them and clusters the segments into groups sharing the same underlying statistics. A Variable Memory Markov (VMM) model is built using a Prediction Suffix Tree (PST) data structure for each group of segments. Refinement is achieved by letting the PSTs compete over the segments, and a deterministic annealing framework infers the number of underlying PST models while avoiding many inferior solutions. We show that regions of similar statistics correlate well with protein sequence domains, by matching a unique signature to each domain. This is done in a fully automated manner, and does not require or attempt an MSA. Several representative cases are analyzed. We identify a protein fusion event, refine an HMM superfamily classification into the underlying families the HMM cannot separate, and detect all 12 instances of a short domain in a group of 396 sequences.

Local Prediction Approach for Protein Classification Using Probabilistic Suffix Trees

by Zhaohui Sun, Jitender S. Deogun
Cited by 5 (0 self)
Probabilistic suffix tree (PST) is a stochastic model that uses a suffix tree as an index structure to store conditional probabilities associated with subsequences. PST has been successfully used to model and predict protein families following a global approach. That approach takes into account the entire sequence, and thus is not suitable for partially conserved families. We develop two variants of PST for local prediction: multiple-domain prediction and best-domain prediction. The multiple-domain method predicts the probability that a protein belongs to a family based on one or more significant conserved regions, while the best-domain method does so based on the most conserved region in the query sequence. The time complexity of both of our approaches is the same as that of the global prediction, that is, O(Lm), where L is the depth bound of the tree and m is the size of the query sequence. We tested our algorithms on the Pfam database of protein families and compared the results with the global prediction method. The experimental results show that our approaches have higher prediction accuracy than the global approach. We also show that our local prediction approach is an effective way to extract motifs/domains. Our approaches employ a linear time method for building PST by adapting the linear time construction of Probabilistic Automata reported by A. Apostolico et al.
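
To make the O(Lm) bound for global prediction concrete, here is a minimal sketch of scoring a query against a depth-bounded context tree. It is not the authors' implementation: the `Node` class, the `score` function, and the `floor` probability used for unseen symbols are all assumptions made for illustration. For each of the m query positions the code walks back over at most L preceding symbols, which gives the O(Lm) running time.

```python
import math


class Node:
    """Hypothetical tree node keyed by the symbols preceding a position."""

    def __init__(self, probs):
        self.probs = probs    # next symbol -> probability
        self.children = {}    # preceding symbol -> deeper Node


def score(query, root, depth_bound, floor=1e-6):
    """Return the log-probability of `query` under the tree.

    `floor` is an assumed fallback probability for symbols that a
    node has never seen; a real model would use proper smoothing.
    """
    total = 0.0
    for i, symbol in enumerate(query):
        node = root
        # Walk backwards through the preceding symbols, at most
        # depth_bound steps, stopping at the deepest stored context.
        for k in range(1, min(depth_bound, i) + 1):
            child = node.children.get(query[i - k])
            if child is None:
                break
            node = child
        total += math.log(node.probs.get(symbol, floor))
    return total


if __name__ == "__main__":
    root = Node({"a": 0.5, "b": 0.5})
    root.children["a"] = Node({"a": 0.1, "b": 0.9})
    print(score("ab", root, depth_bound=2))
```

The local variants described in the abstract would score windows of the query rather than the whole sequence; that step is not shown here.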

SEQUENCE-BASED PROTEIN FUNCTION PREDICTION

by Brett Poulin, 2004
Cited by 4 (1 self)
Abstract not found

String Pattern Matching For A Deluge Survival Kit

by Alberto Apostolico, Maxime Crochemore, 2000
Cited by 4 (1 self)
String Pattern Matching concerns itself with algorithmic and combinatorial issues related to matching and searching on linearly arranged sequences of symbols, arguably the simplest possible discrete structures. As unprecedented volumes of sequence data are amassed, disseminated and shared at an increasing pace, effective access to, and manipulation of such data depend crucially on the efficiency with which strings are structured, compressed, transmitted, stored, searched and retrieved. This paper samples from this perspective, and with the authors' own bias, a rich arsenal of ideas and techniques developed in more than three decades of history.

Using Mixtures of Common Ancestors for Estimating the Probabilities of Discrete Events in Biological Sequences

by Eleazar Eskin, William Noble Grundy, Yoram Singer - In Proceedings of the Ninth International Conference on Intelligent Systems for Molecular Biology, 2002
Cited by 3 (1 self)
Accurately estimating probabilities from observations is important for probabilistic-based approaches to problems in computational biology. In this paper we present a biologically-motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method is an extension of substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation and has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets.