Results 1 - 5 of 5
Large scale sequencing by hybridization
- J. of Computational Biology
, 2002
Cited by 15 (7 self)
Sequencing by Hybridization is a method for reconstructing a DNA sequence based on its k-mer content. This content, called the spectrum of the sequence, can be obtained from hybridization with a universal DNA chip. However, even with a sequencing chip containing all $4^9$ 9-mers and assuming no hybridization errors, only sequences about 400 bases long can be reconstructed unambiguously. Drmanac et al. suggested sequencing long DNA targets by obtaining spectra of many short overlapping fragments of the target, inferring their relative positions along the target, and then computing spectra of subfragments that are short enough to be uniquely recoverable. Drmanac et al. do not treat the realistic case of errors in the hybridization process. In this paper we study the effect of such errors. We show that the probability of ambiguous reconstruction in the presence of (false negative) errors is close to the probability in the errorless case. More precisely, the ratio between these probabilities is $1 + O\left(\frac{p}{(1-p)^4} \cdot \frac{1}{d}\right)$, where d is the average length of subfragments and p is the probability of a false negative. We also obtain lower and upper bounds for the probability of unambiguous reconstruction based on an errorless spectrum. For realistic chip sizes, these bounds are tighter than those given by Arratia et al. Finally, we report results on simulations with real DNA sequences, showing that even in the presence of 50% false negative errors, a target of cosmid length can be recovered with less than 0.1% miscalled bases.
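To make the notions of spectrum, ambiguity, and false negatives in this abstract concrete, here is a minimal Python sketch. The function names, the toy sequences, and the use of k = 3 are illustrative assumptions and do not come from the paper; the false-negative model (each k-mer missed independently with probability p) is likewise only an assumed simplification.

```python
import random

def spectrum(seq, k):
    """Return the set of all k-mers occurring in seq (an errorless spectrum)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def drop_false_negatives(spec, p, rng):
    """Assumed error model: each k-mer is missed independently with probability p."""
    return {kmer for kmer in spec if rng.random() >= p}

# Two different toy sequences that share the same 3-mer spectrum.
# The repeated 2-mer "CG" lets the segments between its copies be swapped.
s1 = "TCGACGTCGG"
s2 = "TCGTCGACGG"

same = spectrum(s1, 3) == spectrum(s2, 3)   # True: reconstruction is ambiguous
noisy = drop_false_negatives(spectrum(s1, 3), 0.5, random.Random(0))
```

The pair `s1`, `s2` shows why long targets cannot be recovered from a spectrum alone: once a (k-1)-mer repeats often enough, several sequences become consistent with the same k-mer content.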
A preprocessor for shotgun assembly of large genomes
- Journal of Computational Biology
, 2004
Cited by 2 (1 self)
The whole-genome shotgun (WGS) assembly technique has been remarkably successful in efforts to determine the sequence of bases that make up a genome. WGS assembly begins with a large collection of short fragments that have been selected at random from a genome. The sequence of bases at each end of the fragment is determined, albeit imprecisely, resulting in a sequence of letters called a “read”. Each letter in a read is assigned a quality value, which estimates the probability that a sequencing error occurred in determining that letter. Reads are typically cut off after about 500 letters, where sequencing errors become endemic. We report on a set of procedures that (1) corrects most of the sequencing errors, (2) changes quality values accordingly, and (3) produces a list of “overlaps”, i.e., pairs of reads that plausibly come from overlapping parts of the genome. Our procedures, which we call collectively the “UMD Overlapper”, can be run iteratively and as a preprocessor for other assemblers. We tested the UMD Overlapper on Celera’s Drosophila reads. When we replaced Celera’s overlap procedures in the front end of their assembler, it was able to produce a significantly improved genome.
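The idea of an "overlap" list can be illustrated with a deliberately naive Python sketch. It is not the UMD Overlapper's procedure: it finds only exact suffix-prefix matches, ignores quality values and sequencing errors, and compares every pair of reads. The names `overlap_length`, `find_overlaps`, and `min_len` are hypothetical.

```python
def overlap_length(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if below min_len."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def find_overlaps(reads, min_len):
    """Return (i, j, n) for ordered read pairs whose exact overlap n is >= min_len."""
    overlaps = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_len)
                if n:
                    overlaps.append((i, j, n))
    return overlaps

# Toy reads: read 0 overlaps read 1 on "TAC", read 1 overlaps read 2 on "GGA".
reads = ["ACGTAC", "TACGGA", "GGATTT"]
pairs = find_overlaps(reads, 3)
```

A practical overlapper would have to tolerate mismatches and avoid the quadratic all-pairs comparison, which this sketch does not attempt.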
Bioinformatics
, 2003
Selection of significant genes via expression patterns is an important problem in microarray experiments. Owing to small sample size and the large number of variables (genes), the selection process can be unstable. This paper proposes a hierarchical Bayesian model for gene (variable) selection. We employ latent variables to specialize the model to a regression setting and use a Bayesian mixture prior to perform the variable selection. We control the size of the model by assigning a prior distribution over the dimension (number of significant genes) of the model. The posterior distributions of the parameters are not in explicit form, and we need to use a combination of truncated sampling and Markov Chain Monte Carlo (MCMC) based computation techniques to simulate the parameters from the posteriors. The Bayesian model is flexible enough to identify significant genes as well as to perform future predictions. The method is applied to cancer classification via cDNA microarrays where the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the method is used to identify a set of significant genes. The method is also applied successfully to the leukemia data.
Serafim Batzoglou,
In this paper, we describe a computer system for performing WGS assembly of complex genomes
DNA Sequence Assembly . . .
, 2003
We describe an Eulerian path approach to DNA fragment assembly that was originated by Idury and Waterman (1995) and then advanced by Pevzner et al. (2001b). This combinatorial approach bypasses the traditional “overlap-layout-consensus” approach and successfully resolved some of the troublesome repeats in practical assembly projects. The assembly results by the Eulerian path approach are accurate, and its computation is significantly more efficient than other assembly programs. As an extension, we use the Eulerian path idea to address the multiple sequence alignment problem. In particular, we have as a goal aligning thousands of sequences simultaneously, which is computationally exorbitant for all existing alignment algorithms. As a beginning, we focus on DNA sequence alignment. Our method can align hundreds of DNA sequences within minutes with high accuracy, and its computational
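The Eulerian path idea named in this abstract can be sketched in a few lines of Python. This is a textbook-style illustration under simplifying assumptions (error-free k-mers, a single strand, and an Eulerian path that is known to exist), not the authors' assembler; the function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Build a de Bruijn multigraph: one edge (k-1)-prefix -> (k-1)-suffix per k-mer."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Return one Eulerian path via Hierholzer's algorithm, assuming one exists."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one, else at any node (a cycle).
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    edges = {u: list(vs) for u, vs in graph.items()}  # copy, edges are consumed
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(kmers):
    """Spell the sequence along one Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn(kmers))
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""

contig = assemble(["ATG", "TGC", "GCA"])
```

Each k-mer is used exactly once as an edge, so the result always has the same k-mer content as the input; when (k-1)-mers repeat, more than one Eulerian path may exist and this sketch simply returns one of them.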

