Species-specific typing of DNA based on palindrome frequency patterns
 DNA Res
, 2011
Cited by 5 (0 self)
DNA in its natural, double-stranded form may contain palindromes, sequences which read the same from either side because they are identical to their reverse complement on the sister strand. Short palindromes are underrepresented in all kinds of genomes. The frequency distribution of short palindromes exhibits more than twice the inter-species variance of non-palindromic sequences, which renders palindromes optimally suited for the typing of DNA. Here, we show that based on palindrome frequency, DNA sequences can be discriminated to the level of species of origin. By plotting the ratios of actual occurrence to expectancy, we generate palindrome frequency patterns that allow to cluster different sequences of the same genome and to assign plasmids, and in some cases even viruses to their respective host genomes. This finding will be of use in the growing field of metagenomics. Key words: comparative genomics; DNA palindrome; hierarchical clustering
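The "ratio of actual occurrence to expectancy" mentioned in this abstract can be illustrated with a minimal Python sketch. The abstract does not say how expectancy is computed, so this sketch assumes independent bases with the sequence's own base frequencies; the function name `palindrome_ratios` and the default word length `k=4` are hypothetical choices made here, not taken from the paper.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def palindrome_ratios(seq, k=4):
    """Observed/expected ratio for every palindromic k-mer.

    Assumption (not from the abstract): the expected count of a word is
    (number of windows) * product of its base frequencies, i.e. an
    independent-bases model fitted to the sequence itself.
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return {}
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    ratios = {}
    for word in map("".join, product("ACGT", repeat=k)):
        if word != reverse_complement(word):
            continue  # keep only words equal to their reverse complement
        expected = float(n_windows)
        for base in word:
            expected *= base_freq[base]
        if expected > 0:
            ratios[word] = counts[word] / expected
    return ratios
```

The resulting dictionary (one ratio per palindromic word) is the kind of vector that could be fed to a clustering routine; a ratio below 1 indicates under-representation relative to the assumed model.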
RNAVLab: A unified environment for computational RNA structure analysis based on grid computing technology
 In Proc. of the 6th IEEE Int. Workshop on High Performance Comp. Biology
, 2007
Palindromes in SARS and Other Coronaviruses
© 2004 INFORMS
With the identification of a novel coronavirus associated with the severe acute respiratory syndrome (SARS), computational analysis of its RNA genome sequence is expected to give useful clues to help elucidate the origin, evolution, and pathogenicity of the virus. In this paper, we study the collective counts of palindromes in the SARS genome along with all the completely sequenced coronaviruses. Based on a Markov-chain model for the genome sequence, the mean and standard deviation for the number of palindromes at or above a given length are derived. These theoretical results are complemented by extensive simulations to provide empirical estimates. Using a z score obtained from these mathematical and empirical means and standard deviations, we have observed that palindromes of length four are significantly underrepresented in all the coronaviruses in our data set. In contrast, length-six palindromes are significantly underrepresented only in the SARS coronavirus. Two other features are unique to the SARS sequence. First, there is a length-22 palindrome TCTTTAACAAGCTTGTTAAAGA spanning positions 25962–25983. Second, there are two repeating length-12 palindromes TTATAATTATAA spanning positions 22712–22723 and 22796–22807. Some further investigations into possible biological implications of these palindrome features are proposed. Key words: Markov chain; palindrome counts; simulation; RNA viral genome; severe acute respiratory syndrome. History: Accepted by Harvey J. Greenberg, Guest Editor; received August 2003; accepted January 2004.
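A small Python sketch can make the palindrome definition and the z-score idea in this abstract concrete. It checks that the two sequences quoted above equal their own reverse complements, and it estimates a z score by simulation. Note the swap: the paper derives the mean and standard deviation under a Markov-chain model, whereas this sketch simply shuffles the bases (an independent-bases null), which is an assumption made here for brevity. All function names are hypothetical.

```python
import random
import statistics

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_dna_palindrome(seq):
    """True if seq (A/C/G/T only) equals its own reverse complement."""
    seq = seq.upper()
    return set(seq) <= set("ACGT") and seq == seq.translate(_COMPLEMENT)[::-1]


def count_palindromes(seq, length):
    """Number of windows of the given length that are palindromic.

    A palindrome longer than `length` contains a palindromic window of
    exactly `length` around the same centre, so this counts palindrome
    centres whose palindrome is at or above `length`.
    """
    return sum(
        is_dna_palindrome(seq[i:i + length])
        for i in range(len(seq) - length + 1)
    )


def empirical_z_score(seq, length, n_sim=200, seed=0):
    """z score of the observed count against shuffled copies of seq.

    Assumption: the null model is a random shuffle of the bases, not the
    Markov-chain model used in the paper. Returns None if the simulated
    counts have zero spread.
    """
    rng = random.Random(seed)
    observed = count_palindromes(seq, length)
    letters = list(seq.upper())
    simulated = []
    for _ in range(n_sim):
        rng.shuffle(letters)
        simulated.append(count_palindromes("".join(letters), length))
    spread = statistics.pstdev(simulated)
    if spread == 0:
        return None
    return (observed - statistics.mean(simulated)) / spread
```

For example, `is_dna_palindrome("TCTTTAACAAGCTTGTTAAAGA")` and `is_dna_palindrome("TTATAATTATAA")` both return `True`. A strongly negative z score for a given length would correspond to the under-representation reported in the abstract.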
Estimating the Occurrence Rate of DNA Palindromes
A DNA palindrome is a segment of letters along a DNA sequence with inversion symmetry that one strand is identical to its complementary one running in the opposite direction. Searching nonrandom clusters of DNA palindromes, an interesting bioinformatic problem, relies on the estimation of the null palindrome occurrence rate. The most commonly used approach for estimating this number is the average rate method. However, we observed that the average rate could exceed the actual rate by 50% when inserting 5,000 bp hotspot regions with 15-fold rate in a simulated 150,000 bp genome sequence. Here, we propose a Markov-based estimator to avoid counting the number of palindromes directly, and thus to reduce the impact from the hotspots. Our simulation shows that this method is more robust against the hotspot effect than the average rate method. Furthermore, this method can be generalized to either a higher order Markov model or a segmented Markov model, and extended to calculate the occurrence rate for palindromes with gaps. We also provide a p-value approximation for various scan statistics to test nonrandom palindrome clusters under a Markov model.
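One way to read "avoid counting the number of palindromes directly" is to fit a Markov model to the sequence and compute the probability that a window is palindromic from the fitted parameters. The Python sketch below does this for a first-order chain. It is an illustration under stated assumptions, not the authors' estimator: the initial distribution is taken to be the empirical base frequencies, and the names `fit_first_order` and `palindrome_rate` are hypothetical.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def fit_first_order(seq):
    """Estimate base frequencies and first-order transition probabilities."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    start = {b: seq.count(b) / len(seq) for b in BASES}
    trans = {a: {b: 0.0 for b in BASES} for a in BASES}
    for a, b in zip(seq, seq[1:]):
        if a in trans and b in trans[a]:
            trans[a][b] += 1
    for a in BASES:
        total = sum(trans[a].values())
        for b in BASES:
            trans[a][b] = trans[a][b] / total if total else 0.0
    return start, trans


def palindrome_rate(seq, half_length):
    """Probability that a window of length 2*half_length is a palindrome.

    Sums the fitted Markov probability of every palindromic word instead
    of counting palindromes in the sequence. Assumption: the first base
    of the window follows the empirical base frequencies.
    """
    start, trans = fit_first_order(seq)
    rate = 0.0
    for left in product(BASES, repeat=half_length):
        # A palindrome is its left half followed by that half's reverse complement.
        word = list(left) + [COMPLEMENT[b] for b in reversed(left)]
        prob = start[word[0]]
        for a, b in zip(word, word[1:]):
            prob *= trans[a][b]
        rate += prob
    return rate
```

Because the estimate depends only on base and dinucleotide frequencies, a short hotspot rich in palindromes shifts it less than it shifts a direct count, which is the robustness property the abstract describes. The enumeration grows as 4 to the power of `half_length`, so this sketch is only practical for short palindromes.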
Counterintuitive answers to some questions concerning minimal-palindromic extensions of binary words
In [Š. Holub & K. Saari, On highly palindromic words, Discrete Appl. Math. 157 (2009), 953-959] the authors proposed to measure the degree of "palindromicity" of a binary word w by the ratio |rws|/|w|, where the word rws is minimal-palindromic (that is, does not contain palindromic subwords of length greater than ⌈|w|/2⌉) and the length |r|+|s| is as small as possible. It was asked whether the words of a given length n which reach the maximal possible ratio |rws|/|w| among the words of length n are always palindromes. It was further asked whether it can be assumed, w.l.o.g., that r and s are of form 0* or 1*, or at least 0*1* or 1*0*. We negatively answer these questions, and also one further question of a similar kind. Mathematics Subject Classification (2010): 68R15
Importance Sampling of Word Patterns in DNA and Protein Sequences
, 2007
The use of Monte Carlo evaluation to compute p-values of pattern counting test statistics is especially attractive when an asymptotic theory is absent or when the search sequence or the word pattern is too short for an asymptotic formula to be accurate. The drawback of applying Monte Carlo simulations directly is its inefficiency when p-values are small, which precisely is the situation of importance. In this paper, we provide a general importance sampling algorithm for efficient Monte Carlo evaluation of small p-values of pattern counting test statistics and apply it on word patterns of biological interest, in particular palindromes and inverted repeats, patterns arising from position specific weight matrices, as well as co-occurrences of pairs of motifs. We also show that our importance sampling technique satisfies a log efficient criterion. Key words: importance sampling, biological sequence analysis, motif analysis.
STEIN’S METHOD, PALM THEORY AND POISSON PROCESS APPROXIMATION
, 2002
The framework of Stein’s method for Poisson process approximation is presented from the point of view of Palm theory, which is used to construct Stein identities and define local dependence. A general result (Theorem 2.3) in Poisson process approximation is proved by taking the local approach. It is obtained without reference to any particular metric, thereby allowing wider applicability. A Wasserstein pseudometric is introduced for measuring the accuracy of point process approximation. The pseudometric provides a generalization of many metrics used so far, including the total variation distance for random variables and the Wasserstein metric for processes as in Barbour and Brown [Stochastic Process. Appl. 43 (1992) 9–31]. Also, through the pseudometric, approximation for certain point processes on a given carrier space is carried out by lifting it to one on a larger space, extending an idea of Arratia, Goldstein and Gordon [Statist. Sci. 5 (1990) 403–434]. The error bound in the general result is similar in form to that for Poisson approximation. As it yields the Stein factor 1/λ as in Poisson approximation, it provides good approximation, particularly in cases where λ is large. The general result is applied to a number of problems including Poisson process modeling of rare words in a DNA sequence.
Importance Sampling of Word Patterns in DNA and Protein Sequences
, 2008
Monte Carlo methods can provide accurate p-value estimates of word counting test statistics and are easy to implement. They are especially attractive when an asymptotic theory is absent or when either the search sequence or the word pattern is too short for the application of asymptotic formulae. Naive direct Monte Carlo is undesirable for the estimation of small probabilities because the associated rare events of interest are seldom generated. We propose instead efficient importance sampling algorithms that use controlled insertion of the desired word patterns on randomly generated sequences. The implementation is illustrated on word patterns of biological interest: palindromes and inverted repeats, patterns arising from position specific weight matrices and co-occurrences of pairs of motifs.