## Probabilistic and Statistical Properties of Words: An Overview (2000)

Venue: Journal of Computational Biology

Citations: 91 (1 self)

### BibTeX

@ARTICLE{Reinert00probabilisticand,
    author = {Gesine Reinert and Sophie Schbath and Michael S. Waterman},
    title = {Probabilistic and Statistical Properties of Words: An Overview},
    journal = {Journal of Computational Biology},
    year = {2000},
    volume = {7},
    pages = {1--46}
}

### Abstract

In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein’s method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests.

Key words: word counts, renewal counts, Markov model, exact distribution, normal approximation, Poisson process approximation, compound Poisson approximation, occurrences of multiple words, sequencing by hybridization, martingales, moment generating functions, Stein’s method, Chen-Stein method.
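The distinction the abstract draws between counts of occurrence, counts of clumps, and renewal counts can be illustrated with a small Python sketch. It is not taken from the paper; it assumes the usual conventions that an occurrence is identified by its start position, that two occurrences overlap when their start positions differ by less than the word length, that a clump is a maximal run of overlapping occurrences, and that a renewal is an occurrence that does not overlap the previously counted renewal. The function names and the toy sequence are illustrative assumptions.

```python
def occurrences(seq, w):
    """Start positions (0-based) of all occurrences of w in seq, overlaps included."""
    return [i for i in range(len(seq) - len(w) + 1) if seq.startswith(w, i)]


def clump_count(positions, word_len):
    """Number of clumps: a new clump starts whenever an occurrence
    does not overlap the occurrence immediately before it."""
    count = 0
    prev = None
    for p in positions:
        if prev is None or p - prev >= word_len:
            count += 1
        prev = p  # compare with the previous occurrence, counted or not
    return count


def renewal_count(positions, word_len):
    """Number of renewals: scanning left to right, an occurrence is counted
    only if it does not overlap the last counted renewal."""
    count = 0
    last_renewal = None
    for p in positions:
        if last_renewal is None or p - last_renewal >= word_len:
            count += 1
            last_renewal = p  # compare only with counted renewals
    return count


if __name__ == "__main__":
    seq, w = "ATATATACGATA", "ATA"  # toy data, assumed for illustration
    pos = occurrences(seq, w)
    print(pos)                          # [0, 2, 4, 9]
    print(len(pos))                     # 4 occurrences
    print(clump_count(pos, len(w)))     # 2 clumps
    print(renewal_count(pos, len(w)))   # 3 renewals
```

On this toy sequence the self-overlapping word ATA gives three different counts, which is the dependence structure the overview sets out to disentangle: the occurrences at positions 0, 2, and 4 form a single clump, yet the one at position 4 is a renewal because it does not overlap the renewal at position 0.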

### Citations

674 | Algebraic Combinatorics on Words - Lothaire - 2002 |

658 | Biological Sequence Analysis - Durbin - 1998 |

Citation Context: ...on); the independent case is a particular case with $m = 0$. Hidden Markov models (HMMs) reveal, however, that the composition of a DNA sequence may vary over the sequence (Churchill, 1989; Muri, 1998; Durbin et al., 1998) and can be studied with HMMs. However, no statistical properties of words have been yet derived in such heterogeneous models. DNA sequences code for amino acid sequences (proteins) by nonoverlapping...

348 | A course in combinatorics - Lint, Wilson - 1992 |

323 | Mathematical Statistics and Data Analysis - Rice - 1994 |

Citation Context: ...quence of length n, the most straightforward test is a Chi-square test, which can be viewed as a generalized likelihood ratio test. Most well-known is the Chi-square test for independence (see, e.g., Rice, 1995). In general, suppose we have a sample of size n cross-classified in a table with U rows and V columns. For instance, we could have four rows labeled A, C, G, T, and four columns labeled A, C, G, T, ...

217 | Poisson Approximation - Barbour, Holst, et al. - 1992 |

199 | Large Deviations Techniques in Decision, Simulation and Estimation - Bucklew - 1990 |

168 | Approximate string matching with q-grams and maximal matches - Ukkonen - 1992 |

141 | A bound for the error in the normal approximation to the distribution of a sum of dependent random variables - Stein - 1972 |

Citation Context: ...(Z)) $= \sup_{B \subset E,\ \text{measurable}} |P(Y \in B) - P(Z \in B)| = \sup_{h: E \to [0,1],\ \text{measurable}} |E h(Y) - E h(Z)|$. First published by Chen (1975) as the Poisson analog to Stein’s method for normal approximations (Stein, 1972), it has found widespread application; word counts being just one of them. A friendly exposition is found in Arratia et al. (1989) and a description with many examples can be found in Arratia et al. ...

125 | Stochastic models for heterogeneous DNA sequences - Churchill - 1989 |

Citation Context: ...quence (and not on the position); the independent case is a particular case with $m = 0$. Hidden Markov models (HMMs) reveal, however, that the composition of a DNA sequence may vary over the sequence (Churchill, 1989; Muri, 1998; Durbin et al., 1998) and can be studied with HMMs. However, no statistical properties of words have been yet derived in such heterogeneous models. DNA sequences code for amino acid seque...

87 | Poisson approximation for dependent trials - Chen - 1975 |

65 | A Course in Probability Theory. 2nd ed - Chung - 1974 |

Citation Context: ...reover, the asymptotic normality of $(N(w) - \hat{N}_{\ell-2}(w))/\sqrt{n}$ and the asymptotic variance can be obtained in an elegant way using martingale techniques. (For an introduction to martingales, see, e.g., Chung, 1974.) Indeed, $\hat{N}_{\ell-2}(w)$ is a natural estimator of $N(w^-)\,\pi({}^-w^-, w_\ell)$, and $N(w) - N(w^-)\,\pi({}^-w^-, w_\ell)$ is approximately a martingale as it is shown below. We introduce the martingale $M_n = \sum_{i=\ell}^{n}$ ...

65 | A first course in stochastic processes, 2nd ed - Karlin, Taylor - 1975 |

62 | Poisson approximation and the Chen-Stein method - Arratia, Goldstein, et al. - 1990 |

61 | On a new law of large numbers - Erdös, Rényi - 1970 |

59 | Algorithms on strings, trees, and sequences - Gusfield - 1997 |

34 | On coupling constructions and rates in the CLT for dependent summands with applications to the antivoter model and weighted U-statistics - Rinott, Rotar - 1997 |

33 | Statistical analyses of counts and distributions of restriction sites in DNA sequences - Karlin, Burge, et al. - 1992 |

Citation Context: ...o genome stability. Well-known examples of words with exceptional frequencies in DNA sequences are certain biological palindromes corresponding to restriction sites avoided, for instance, in E. coli (Karlin et al., 1992), and the cross-over hotspot instigator sites in several bacteria (see Biaudet et al., 1998; Chedin et al., 1998; Sourice 1 King’s College and Statistical Laboratory, Cambridge CB2 1ST, UK. 2 Unité d...

29 | Compound Poisson approximation for non-negative random variables via Stein's method - Barbour, Chen, et al. - 1992 |

29 | Linguistics of nucleotide sequences: morphology and comparison of vocabularies - Brendel, Beckmann, et al. - 1986 |

Citation Context: ...estimator $\hat{N}_m(w)$ of $E N(w)$:

$$\hat{N}_m(w) = \frac{N(w_1 \cdots w_{m+1}) \cdots N(w_{\ell-m} \cdots w_\ell)}{N(w_2 \cdots w_{m+1}) \cdots N(w_{\ell-m} \cdots w_{\ell-1})}. \qquad (5.1)$$

Let us first consider the maximal model ($m = \ell - 2$) that is mainly used to find exceptional words (Brendel et al., 1986; Leung and Speed, 1996; Rocha et al., 1998). We introduce the following notation: $w^- = w_1 \cdots w_{\ell-1}$ (prefix of $w$ with length $\ell - 1$), ${}^-w = w_2 \cdots w_\ell$ (suffix of $w$ with length $\ell - 1$), ${}^-w^- = w_2 \cdots w_{\ell-1}$. Under the ...

28 | Long repetitive patterns in random sequences - Guibas, Odlyzko - 1980 |

Citation Context: ... that $\mathcal{P}(w)$ is the set of periods of $w$ and that $w^{(p)} = w_1 w_2 \cdots w_p$ denotes the word composed of the first $p$ letters of $w$. We consider the overlap-matching polynomial $Q(z)$ associated with $w$ (see, e.g., Guibas and Odlyzko, 1980; Li, 1980; Biggins and Cannings, 1987) defined by

$$Q(z) = \sum_{p \in \mathcal{P}(w) \cup \{0\}} \frac{\mu(w)}{\mu(w^{(\ell-p)})}\, z^p.$$

When the Markov process is in stationarity, we have from renewal theory that

$$\mu_R(w) = \frac{\mu(w)}{Q(1)}. \qquad (6.1)$$

...

28 | A martingale approach to the study of occurrence of sequence patterns in repeated experiments, The Annals of Probability - Li - 1980 |

Citation Context: ...eriods of $w$ and that $w^{(p)} = w_1 w_2 \cdots w_p$ denotes the word composed of the first $p$ letters of $w$. We consider the overlap-matching polynomial $Q(z)$ associated with $w$ (see, e.g., Guibas and Odlyzko, 1980; Li, 1980; Biggins and Cannings, 1987) defined by

$$Q(z) = \sum_{p \in \mathcal{P}(w) \cup \{0\}} \frac{\mu(w)}{\mu(w^{(\ell-p)})}\, z^p.$$

When the Markov process is in stationarity, we have from renewal theory that

$$\mu_R(w) = \frac{\mu(w)}{Q(1)}. \qquad (6.1)$$

To u...

23 | Poisson approximations for r-scan processes - Dembo, Karlin - 1988 |

Citation Context: ...e of w in the sequence, since we use an homogeneous model. It may be useful to study the distance $D^{(r)}$ between the $j$-th and $(j + r)$-th occurrence of $w$, called r-scan by Karlin and colleagues (e.g., Dembo and Karlin, 1992). The distance $D^{(r)}$ is the sum of $r$ independent and identically distributed random variables with the same distribution as $D$. So we have $\Phi_{D^{(r)}}(t) = \left(\Phi_D(t)\right)^r$. We get the exact distribution of...

23 | Finding words with unexpected frequencies in DNA sequences - Prum, Rodolphe, et al. - 1995 |

22 | Exact distribution of word occurrences in a random sequence of letters - Robin, Daudin - 1999 |

Citation Context: ...n, we say that a word $w$ occurs at position $i$ if an occurrence of $w$ ends at position $i$; it happens with probability $\mu(w)$ given in (3.2). The probability $f(d)$ can be obtained via a recursive formula (Robin and Daudin, 1999), as first proposed for independent and uniformly distributed letters by Blom and Thorburn (1982). It is clear that, if $1 \le d \le \ell - 1$ and $d \notin \mathcal{P}(w)$, then $f(d) = 0$. If $d \in \mathcal{P}(w)$ or if $d \ge \ell$, then we d...

21 | Compound Poisson approximation of word counts in DNA sequences - Schbath - 1995 |

Citation Context: .... Proposition 1.3.2 from Lothaire (1983) says that two words commute if and only if they are powers of the same word. Thus, we would obtain the contradiction that the minimal root is not minimal (see Schbath, 1995a, for more details). It follows from Equation (3.4) that the probability $\tilde{\mu}(w)$ that a clump of $w$ starts at a given position in $X$ is given by

$$\tilde{\mu}(w) = \mu(w) - \sum_{p \in \mathcal{P}'(w)} \mu(w^{(p)} w). \qquad (3.5)$$

The nu...

20 | Poisson perturbations - Barbour, Xia - 1999 |

20 | Over- and underrepresentation of short DNA words in herpesvirus genomes - Leung, Marsh, et al. - 1996 |

20 | DNA Physical Mapping and Alternating Eulerian Cycles in Colored Graphs - Pevzner - 1995 |

18 | Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains - Reinert, Schbath - 1998 |

Citation Context: ...sition 1 and would stop after position $\ell - 1$. The difference $\tilde{N}(w) - \tilde{N}_{\inf}(w)$ is either equal to 0 or equal to 1. In fact, it can be shown that $P(\tilde{N}(w) \ne \tilde{N}_{\inf}(w)) \le (\ell - 1)(\mu(w) - \tilde{\mu}(w))$ (see Reinert and Schbath, 1998). 3.3. k-clump and number of k-clumps. A k-clump of $w$ starts at position $i$ in $X$ if and only if there is an occurrence of a concatenated word $c \in \mathcal{C}_k(w)$ starting at position $i$ that does not overlap any...

17 | How many random digits are required until given sequences are obtained - Blom, Thorburn |

16 | First and second moment of counts of words in random texts generated by Markov chains - Kleffe, Borodovsky - 1992 |

16 | Oligonucleotide bias in Bacillus Subtilis: general trends and taxonomic comparisons - Eduardo, Rocha, et al. - 1998 |

Citation Context: ...Science, USC, Los Angeles, CA 90089. ... et al., 1998). Several papers aim at identifying over- and under-represented words in a particular genome (for instance, Leung and Speed, 1996; Rocha et al., 1998). Statistical methods to study the distribution of the word locations along a sequence and word frequencies have also been an active field of research. Because DNA sequences are long, asymptotic dist...

14 | A five-nucleotide sequence protects DNA from exonucleolytic degradation by AddAB, the RecBCD analogue of Bacillus subtilis - Chedin, Noirot, et al. - 1998 |

Citation Context: ...ical palindromes corresponding to restriction sites avoided, for instance, in E. coli (Karlin et al., 1992), and the cross-over hotspot instigator sites in several bacteria (see Biaudet et al., 1998; Chedin et al., 1998; Sourice 1 King’s College and Statistical Laboratory, Cambridge CB2 1ST, UK. 2 Unité de Biométrie, INRA, 78352 Jouy-en-Josas, France. 3 Department of Mathematics, Department of Biological Sciences, a...

13 | Euler circuits and DNA sequencing by hybridization - Arratia, Bollobás, et al. |

12 | Poisson Approximations for Runs and Patterns of Rare Events - Godbole - 1991 |

Citation Context: ...isson and compound Poisson approximations for $N(w)$ have been widely studied in the literature (Chryssaphinou and Papastavridis, 1988a,b; Chryssaphinou and Papastavridis, 1988b; Arratia et al., 1990; Godbole, 1991; Hirano and Aki, 1993; Godbole and Schaffner, 1993; Fu, 1993). Markovian models under different conditions have then been considered (Rajarshi, 1974; Godbole, 1991; Godbole and Schaffner, 1993; Hiran...

12 | Kerstan’s method for compound Poisson approximation - Roos - 2003 |

11 | Two moments suffice for Poisson approximations: the Chen-Stein method - Arratia, Goldstein, et al. - 1989 |

11 | Markov renewal processes, counters and repeated sequences - Biggins, Cannings - 1987 |

Citation Context: ...w and that $w^{(p)} = w_1 w_2 \cdots w_p$ denotes the word composed of the first $p$ letters of $w$. We consider the overlap-matching polynomial $Q(z)$ associated with $w$ (see, e.g., Guibas and Odlyzko, 1980; Li, 1980; Biggins and Cannings, 1987) defined by

$$Q(z) = \sum_{p \in \mathcal{P}(w) \cup \{0\}} \frac{\mu(w)}{\mu(w^{(\ell-p)})}\, z^p.$$

When the Markov process is in stationarity, we have from renewal theory that

$$\mu_R(w) = \frac{\mu(w)}{Q(1)}. \qquad (6.1)$$

To understand this formula, note...

11 | Renewal theory for several patterns - Breen, Waterman, et al. - 1985 |

Citation Context: ...Central Limit Theorem. First we derive the expected renewal count. If the $I_i(w)$’s had the same expectation, say $\mu_R(w)$, then $E R_n(w) = (n - \ell + 1)\,\mu_R(w)$. This is the commonly used expectation (see Breen et al., 1985, or Tanushev, 1996, for instance), but it ignores the end effect. For $i > \ell$, the $I_i(w)$’s are effectively identically distributed by stationarity of the Markov process, but this is not the case for 1...

11 | Algorithms on Strings, Trees and Sequences - Gusfield - 1997 |

10 | Exact convergence rate in the limit theorems of Erdös-Rényi and - Deheuvels, Devroye, et al. |

10 | The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability - Gentleman, Mullin - 1989 |

9 | A limit theorem for the number of non-overlapping occurrences of a pattern in a sequence of independent trials - Chryssaphinou, Papastavridis - 1988 |

Citation Context: ...epending on the asymptotic behavior of the expected value. When the sequence letters are independent, Poisson and compound Poisson approximations for $N(w)$ have been widely studied in the literature (Chryssaphinou and Papastavridis, 1988a,b; Chryssaphinou and Papastavridis, 1988b; Arratia et al., 1990; Godbole, 1991; Hirano and Aki, 1993; Godbole and Schaffner, 1993; Fu, 1993). Markovian models under different conditions have then be...

8 | Modelling bacterial genomes using hidden Markov models - Muri - 1998 |

Citation Context: ... on the position); the independent case is a particular case with $m = 0$. Hidden Markov models (HMMs) reveal however that the composition of a DNA sequence may vary over the sequence (Churchill (1989), Muri (1998), Durbin et al. (1998)) and can be studied with HMMs. However, no statistical properties of words have been yet derived in such heterogeneous models. DNA sequences code for amino acid sequences (prot...

7 | Probabilités et Statistiques. 2. Problèmes à temps mobile. 2e édition - Dacunha-Castelle, Duflo - 1994 |

7 | Stochastic models and statistical methods for DNA sequence data - Lundstrom - 1990 |

6 | Solving the Stein equation in compound Poisson approximation - Barbour, Utev - 1998 |

6 | Expected frequencies of DNA patterns using Whittle’s formula - Cowan - 1991 |

6 | Exact computation of pattern probabilities in random sequences generated by Markov chains - Kleffe, Langbecker - 1990 |

5 | Annotated Statistical Indices for Sequence Analysis (invited paper) - Apostolico, Bock, et al. - 1998 |