Results 1 - 3 of 3
The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length
Machine Learning, 1996
Cited by 173 (16 self)
We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second ...
Using Markov models and Hidden Markov Models to find repetitive extragenic palindromic sequences in Escherichia coli
1994
Cited by 6 (2 self)
This paper presents a technique for using simple Markov models and hidden Markov models (HMMs) to search for interesting sequences in a database of DNA sequences. The models are used to create a cost map for each sequence in the database. These cost maps can be searched rapidly for subsequences that have significantly lower costs than a null model. Milosavljevic's algorithmic significance test is used to determine when a subsequence is significantly found. The sequences reported are trimmed to maximize the signal-to-noise ratio (cost savings / √length). Methods are given for automatically constructing simple Markov models and hidden Markov models from small training sets. The techniques are illustrated by searching a database of E. coli genomic DNA, EcoSeq6, for clusters of Repetitive Extragenic Palindromic sequences (REPs). Of the known REPs, 91 % are found with simple Markov models starting with a single REP cluster as a seed, and 95 % are found by a hidden Markov model built from the results of the simple Markov model search. There are no false positives from the simple Markov models, and the few extra sequences found by the HMMs may be genuinely related sequences.
Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli
influenzae chromosome, the first completed genome sequence of a cellular life form, has been recently reported. Approximately 75 % of the 4.7 Mb genome sequence of Escherichia coli is also available. The life styles of the two bacteria are very different: H. influenzae is an obligate parasite that lives in human upper respiratory mucosa and can be cultivated only on rich media, whereas E. coli is a saprophyte that can grow on minimal media. A detailed comparison of the protein products encoded by these two genomes is expected to provide valuable insights into bacterial cell physiology and genome evolution. Results: We describe the results of computer analysis of the amino acid sequences of 1703 putative proteins encoded by the complete genome of H. influenzae. We detected sequence similarity to proteins in current databases for 92 % of the H. influenzae protein sequences, and at least a general functional prediction was possible for 83 %. A comparison of the H. influenzae protein sequences with those of 3010 proteins encoded by the sequenced 75 % of the E. coli genome revealed 1128 pairs of apparent orthologs, with an average of