## DNA segmentation as a model selection process

Venue: | International Conference on Research in Computational Molecular Biology (RECOMB) |

Citations: | 10 - 1 self |

### BibTeX

@INPROCEEDINGS{Li_dnasegmentation,

author = {Wentian Li},

title = {DNA segmentation as a model selection process},

booktitle = {International Conference on Research in Computational Molecular Biology (RECOMB)},

year = {},

pages = {204--210}

}

### Abstract

Previous divide-and-conquer segmentation analyses of DNA sequences do not provide a satisfactory stopping criterion for the recursion. This paper proposes that segmentation be considered as a model selection process. Using the tools in model selection, a limit for the stopping criterion on the relaxed end can be determined. The Bayesian information criterion, in particular, provides a much more stringent stopping criterion than what is currently used. Such a stringent criterion can be used to delineate larger DNA domains. A relationship between the stopping criterion and the average domain size is empirically determined, which may aid in the determination of isochore borders.
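The recursion the abstract describes can be illustrated with a minimal Python sketch. It is not the paper's code: the function names (`entropy`, `best_split`, `segment`) are hypothetical, and it assumes the Jensen-Shannon divergence and BIC expressions quoted in the citation contexts on this page, with natural logarithms. Under those expressions, splitting a sequence of length N lowers the BIC when 2N·D_JS exceeds (K2 − K1)·log(N), where K2 − K1 = 7 − 3 = 4.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (natural log) of base counts that sum to n."""
    return -sum(c / n * math.log(c / n) for c in counts.values() if c > 0)


def best_split(seq):
    """Return (i, D_JS) for the cut point with the largest Jensen-Shannon divergence.

    The left part is seq[:i] and the right part is seq[i:], so both are non-empty.
    Returns (None, 0.0) if no cut gives a positive divergence.
    """
    n = len(seq)
    total = Counter(seq)
    e_all = entropy(total, n)
    left = Counter()
    best_i, best_d = None, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        right = total - left  # Counter subtraction keeps only positive counts
        d = e_all - i / n * entropy(left, i) - (n - i) / n * entropy(right, n - i)
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d


def segment(seq, offset=0, extra_params=4):
    """Recursively split seq and return the sorted cut positions (0-based).

    A split is accepted only if it lowers the BIC, i.e. if
    2 * N * D_JS > extra_params * log(N), with extra_params = K2 - K1 = 4.
    """
    n = len(seq)
    if n < 2:
        return []
    i, d = best_split(seq)
    if i is None or 2 * n * d <= extra_params * math.log(n):
        return []  # stopping criterion: the one-segment model is preferred
    left_borders = segment(seq[:i], offset, extra_params)
    right_borders = segment(seq[i:], offset + i, extra_params)
    return left_borders + [offset + i] + right_borders


if __name__ == "__main__":
    print(segment("A" * 200 + "G" * 200))  # → [200]
```

The sketch favors readability over speed: each call scans every cut point and copies substrings, which is fine for short test strings but would need prefix counts and an explicit stack for chromosome-scale input.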

### Citations

4934 |
C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context ...imilar recursion processes are also discussed in statistics and machine learning under the names of “classification and regression tree” [10], “recursive partitioning” [38], “decision tree induction” **[30, 31]**, etc). The DNA sequence is first segmented into two subsequences so that base compositions on two sides of the partition are maximized. Then, the same procedure is carried out on both the left and th... |

3354 | Induction of Decision Trees
- Quinlan
- 1986
Citation Context ...imilar recursion processes are also discussed in statistics and machine learning under the names of “classification and regression tree” [10], “recursive partitioning” [38], “decision tree induction” **[30, 31]**, etc). The DNA sequence is first segmented into two subsequences so that base compositions on two sides of the partition are maximized. Then, the same procedure is carried out on both the left and th... |

1840 |
A new look at the statistical model identification
- Akaike
- 1974
Citation Context ... by the Kullback-Leibler distance (divergence) [20], and the Akaike Information Criterion (AIC) is one approximation of this distance (with the constant term removed, and multiplied by a factor of 2) **[2]**: $\mathrm{AIC} = -2\log(\hat{L}) + 2K + O(1/N)$ (5), where $\hat{L}$ is the maximized likelihood of the model, $K$ is the number of free parameters in the model. A model with the lowest AIC is closest to the true model, th... |

1240 |
On information and sufficiency
- Kullback, Leibler
- 1951
Citation Context ...the “merit” of a model. One measure of such a “merit” is whether the model is close (a better approximation) to the true model. The closeness is measured by the Kullback-Leibler distance (divergence) **[20]**, and the Akaike Information Criterion (AIC) is one approximation of this distance (with the constant term removed, and multiplied by a factor of 2) [2]: $\mathrm{AIC} = -2\log(\hat{L}) + 2K + O(1/N)$ (5), where $\hat{L}$... |

411 | Divergence measures based on the Shannon entropy
- Lin
- 1991
Citation Context ...means that the quantity is estimated from the data) $\hat{E}(\{N_\alpha\}) = -\sum_{\alpha=a,c,g,t} \frac{N_\alpha}{N} \log\frac{N_\alpha}{N}$ (1). Given a partition point $i$ ($1 < i < N$), an entropy-based quantity called Jensen-Shannon distance (divergence) **[26]** is defined as $\hat{D}_{JS} = \hat{E}(\{N_\alpha\}) - \frac{i}{N}\hat{E}(\{N_{\alpha,l}\}) - \frac{N-i}{N}\hat{E}(\{N_{\alpha,r}\})$ (2), where $\{N_{\alpha,l}\}$ and $\{N_{\alpha,r}\}$ are the base counts of the left (from position 1 to $i$) and the right (from position $i+1$ to position $N$) sub... |

202 |
The genome sequence of Drosophila melanogaster
- Adams, Celniker, et al.
- 2000
Citation Context ... criterion has to be used (to be discussed later). 3.3 Left-arm of Drosophila chromosome 2 The last sequence to be segmented is the left arm of Drosophila melanogaster chromosome 2 (N = 22,075,671 b) **[1]**. There is 1.78% of the sequence that is not determined (symbol “n” or “N”). To preserve the location information, these undetermined symbols are replaced randomly by the four nucleotides (according to... |

183 |
Regression and time series model selection in small samples
- Hurvich, Tsai
- 1989
Citation Context ...probability. Note that AIC emphasizes an approximation of the true model, and BIC emphasizes the selection of the true model from the space of all models. The high-order terms in AIC are discussed in **[37, 19]**, and the derivation of BIC can... [Figure: lambda phage (N = 48.5k); panels: BIC-based ACGT segmentation, BIC-based SW segmentation, AIC-based ACGT segmentation; axis: CG%] |

152 |
Estimating the dimension of a model. Annals of Statistics
- Schwarz
- 1978
Citation Context ...sterior probability of the model is the “integrated likelihood” [32]. An asymptotic approximation of minus-twice the logarithm of the integrated likelihood is the Bayesian Information Criterion (BIC) **[36]**: $\mathrm{BIC} = -2\log(\hat{L}) + \log(N)K + O(1) + O(1/\sqrt{N}) + O(1/N)$ (6), where $N$ is the sample size. A model with the lowest BIC has the largest integrated likelihood, and this translates to the largest posterior pr... |

115 |
Stochastic models for heterogeneous DNA sequences
- Churchill
- 1989
Citation Context ...er of types of domains (e.g. C+G rich and C+G poor represent two types of domains, whereas C+G high, intermediate, and low specify three types). Segmentation analysis of DNA sequences can be found in **[18, 15, 9, 33]**. One particularly attractive segmentation method is a divide-and-conquer approach [4, 28] (similar recursion processes are also discussed in statistics and machine learning under the names of “classif... |

53 |
Further analysis of the data by Akaike’s information criterion and the finite corrections
- Sugiura
- 1978
Citation Context ...probability. Note that AIC emphasizes an approximation of the true model, and BIC emphasizes the selection of the true model from the space of all models. The high-order terms in AIC are discussed in **[37, 19]**, and the derivation of BIC can... [Figure: lambda phage (N = 48.5k); panels: BIC-based ACGT segmentation, BIC-based SW segmentation, AIC-based ACGT segmentation; axis: CG%] |

50 |
Compositional segmentation and long-range fractal correlations in DNA sequences, Phys
- Bernaola-Galván, Román-Roldán, et al.
- 1996
Citation Context ...gh, intermediate, and low specify three types). Segmentation analysis of DNA sequences can be found in [18, 15, 9, 33]. One particularly attractive segmentation method is a divide-and-conquer approach **[4, 28]** (similar recursion processes are also discussed in statistics and machine learning under the names of “classification and regression tree” [10], “recursive partitioning” [38], “decision tree inductio... |

47 |
Model Selection and Inference
- Burnham, Anderson
- 1998
Citation Context ...tee objectivity [3]. We provide a stopping criterion based on the framework of model selection (for a detailed discussion of the hypothesis testing framework versus the model selection framework, see **[12]**). This new stopping criterion offers a minimum condition for the recursive segmentation to continue. On the other hand, in the hypothesis testing framework, no such minimum condition exists; for... |

43 | The study of correlation structures of DNA sequences: A critical review
- Li
- 1997
Citation Context ...t (i.e. for BIC to decrease) not only leads to large, 100kb-plus domains, but also leads to smaller-scaled base composition fluctuation. This “domains-within-domains” phenomenon has been discussed in **[25, 4, 22, 23]**. If one is only interested in isochores, i.e., large DNA segments with usually 300 kb or longer that have relatively homogeneous base composition [7, 8], a more stringent criterion has to be used (to... |

39 |
The human genome: organization and evolutionary history
- Bernardi
- 1995
Citation Context ...ins” phenomenon has been discussed in [25, 4, 22, 23]. If one is only interested in isochores, i.e., large DNA segments with usually 300 kb or longer that have relatively homogeneous base composition **[7, 8]**, a more stringent criterion has to be used (to be discussed later). 3.3 Left-arm of Drosophila chromosome 2 The last sequence to be segmented is the left arm of Drosophila melanogaster chromosome 2 (... |

32 |
Finding borders between coding and noncoding DNA regions by an entropic segmentation method, Phys
- Bernaola-Galván, Grosse, et al.
- 2000
Citation Context ...wo models are $K_2 = 7$ (the partition point $i$ is also a free parameter) and $K_1 = 3$. So $2N\hat{D}_{JS}$ under the null hypothesis should obey the $\chi^2_{df=4}$ distribution (the same conclusion was reached before, see **[6]** and (I Grosse, et al. in preparation), only the df used there is 3, instead of 4). 2.3 The divide-and-conquer segmentation as a model selection There are many shortcomings in the hypothesis testing fram... |

28 |
Statistical methods for DNA sequence segmentation, Statist
- Braun, Muller
- 1998
Citation Context ...er of types of domains (e.g. C+G rich and C+G poor represent two types of domains, whereas C+G high, intermediate, and low specify three types). Segmentation analysis of DNA sequences can be found in **[18, 15, 9, 33]**. One particularly attractive segmentation method is a divide-and-conquer approach [4, 28] (similar recursion processes are also discussed in statistics and machine learning under the names of “classif... |

21 |
Sequence compositional complexity of DNA through an entropic segmentation method
- Román-Roldán, Bernaola-Galván, et al.
- 1998
Citation Context ...s difference is that the MHC is more “complex” than other sequences in Fig.4, in the sense of the existence of a huge number of domains. Plots like Fig.4 are similar to the “compositional complexity” **[34, 23, 5]**. The difference is that in [34, 23, 5], not only the number of domains, but also the base composition difference between domains is part of the measure of complexity. In Fig.4, it is purely the numbe... |

18 | New stopping criteria for segmenting DNA sequences
- Li
- 2001 |

9 |
Understanding long-range correlations
- Li, Marr, et al.
- 1994
Citation Context ...t (i.e. for BIC to decrease) not only leads to large, 100kb-plus domains, but also leads to smaller-scaled base composition fluctuation. This “domains-within-domains” phenomenon has been discussed in **[25, 4, 22, 23]**. If one is only interested in isochores, i.e., large DNA segments with usually 300 kb or longer that have relatively homogeneous base composition [7, 8], a more stringent criterion has to be used (to... |

8 |
Theoretical models for heterogeneity for base composition in DNA
- Elton
- 1974
Citation Context ...er of types of domains (e.g. C+G rich and C+G poor represent two types of domains, whereas C+G high, intermediate, and low specify three types). Segmentation analysis of DNA sequences can be found in **[18, 15, 9, 33]**. One particularly attractive segmentation method is a divide-and-conquer approach [4, 28] (similar recursion processes are also discussed in statistics and machine learning under the names of “classif... |

6 |
The complexity of DNA: the measure of compositional heterogeneity in DNA sequences and measures of complexity
- Li
- 1997
Citation Context ...t (i.e. for BIC to decrease) not only leads to large, 100kb-plus domains, but also leads to smaller-scaled base composition fluctuation. This “domains-within-domains” phenomenon has been discussed in **[25, 4, 22, 23]**. If one is only interested in isochores, i.e., large DNA segments with usually 300 kb or longer that have relatively homogeneous base composition [7, 8], a more stringent criterion has to be used (to... |

5 |
The isochore organization of the human genome, Annual Review of Genetics 23
- Bernardi
- 1989
Citation Context ...ins” phenomenon has been discussed in [25, 4, 22, 23]. If one is only interested in isochores, i.e., large DNA segments with usually 300 kb or longer that have relatively homogeneous base composition **[7, 8]**, a more stringent criterion has to be used (to be discussed later). 3.3 Left-arm of Drosophila chromosome 2 The last sequence to be segmented is the left arm of Drosophila melanogaster chromosome 2 (... |

4 |
Nucleotide sequence of bacteriophage λ DNA
- Sanger
- 1982
Citation Context ...is testing framework. We illustrate the BIC-based segmentation by three DNA sequences with a wide range of sequence lengths. 3.1 Lambda phage Fig.1 shows the result for λ bacteriophage (N = 48,502 b) **[35]**. This sequence has been tested with various segmentation methods in [9]. There are several pieces of information displayed in Fig.1: the domain borders obtained by the BIC-based segmentation on the or... |

3 |
Analyzing data: is objectivity possible? American Scientist
- Berger, Berry
- 1988
Citation Context ... of the recursion in our case) is decided by a pre-set “significance level”. Usually, the significance level can be 0.05, 0.01, or 0.001. These levels are arbitrary and will not guarantee objectivity **[3]**. We provide a stopping criterion based on the framework of model selection (for a detailed discussion of the hypothesis testing framework versus the model selection framework, see [12]). This new sto... |

2 |
Image Segmentation and Compression Using Hidden Markov Models
- Li, Gray
- 2000
Citation Context ...rmation, it is actually possible to determine the border exactly by certain mathematical criterion. These mathematical approaches to delineate regional homogeneous domains are known as “segmentation” **[21]**, “parti... |

2 |
DNA segmentation through the Bayesian approach
- Ramensky, Markeev, Roytberg, Tumanyan
- 2000
Citation Context |