Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads
Cited by 6 (0 self)
Abstract. Next Generation Sequencing (NGS) technologies are capable of reading millions of short DNA sequences both quickly and cheaply. While these technologies are already being used for resequencing individuals once a reference genome exists, it has not been shown whether they can be used for ab initio genome assembly. In this paper, we give a novel network flow-based algorithm that, by taking advantage of the high coverage provided by NGS, accurately estimates the copy counts of repeats in a genome. We also give a second algorithm that combines the predicted copy counts with mate-pair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from E. coli and predict copy counts with extremely high accuracy, while assembling long contigs.
IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler
Cited by 1 (0 self)
Abstract. The de Bruijn graph assembly approach breaks reads into k-mers before assembling them into contigs. The string graph approach forms contigs by connecting two reads with k or more overlapping nucleotides. Both approaches face the problem of false-positive vertices from erroneous reads, missing vertices due to non-uniform coverage, and branching due to erroneous reads and repeat regions. A proper choice of k is crucial, but for any single k there is always a trade-off: a small k copes better with erroneous reads and non-uniform coverage, while a large k copes better with short repeat regions. We propose an iterative de Bruijn graph approach that iterates from small to large k, capturing the merits of all values in between. With real and simulated data, our IDBA algorithm is superior to all existing algorithms, constructing longer contigs with similar accuracy while using less memory. The running time of IDBA is comparable with existing algorithms. Availability: IDBA is available at
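The trade-off in k described in this abstract can be illustrated with a minimal sketch. This is not the IDBA implementation: it builds a plain single-k de Bruijn graph with k-mers as vertices, and it assumes error-free reads. The function names `build_de_bruijn` and `branching_vertices` and the toy read are hypothetical, chosen only for illustration.

```python
def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read.

    Vertices are k-mers; an edge u -> v exists when v starts one position
    after u in at least one read (so u and v overlap by k - 1 nucleotides).
    """
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            successors = graph.setdefault(kmer, set())
            if i + k < len(read):
                successors.add(read[i + 1:i + k + 1])
    return graph


def branching_vertices(graph):
    """Return the k-mers with more than one successor, in sorted order."""
    return sorted(v for v, succ in graph.items() if len(succ) > 1)


# Toy read containing the 3-nucleotide repeat "GCA" twice.
reads = ["TGCAAGCAT"]

small_k = build_de_bruijn(reads, 3)
large_k = build_de_bruijn(reads, 4)

print(branching_vertices(small_k))  # the repeat causes a branch at k = 3
print(branching_vertices(large_k))  # no branch once k exceeds the repeat length
```

In this toy case the repeat is shorter than 4 nucleotides, so the branch at `GCA` disappears when k is raised from 3 to 4. With real data a larger k also loses vertices in low-coverage regions, which is the other side of the trade-off the abstract describes.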
An Approach to Transcriptome Analysis of Non-Model Organisms Using Short-Read Sequences
Genome Informatics 21:3-14 (2008)
Transcriptome analysis using high-throughput short-read sequencing technology is straightforward when the sequenced genome is the same species or extremely similar to the reference genome. We present an analysis approach for when the sequenced organism does not have an already sequenced genome that can be used for a reference, as will be the case of many non-model organisms. As proof of concept, data from Solexa sequencing of the polyploid plant Pachycladon enysii was analysed using our approach with its nearest model reference genome being the diploid plant Arabidopsis thaliana. By using a combination of mapping and de novo assembly tools we could determine duplicate genes belonging to one or other of the genome copies. Our approach demonstrates that transcriptome analysis using high-throughput short-read sequencing need not be restricted to the genomes of model organisms.
Computational Genomic Signatures and Metagenomics
2011
Mathematical characterizations of biological sequences form one of the main elements of bioinformatics. In this work, a class of DNA sequence characterization, namely computational genomics signatures, which capture global features of these sequences is used to address emerging computational biology challenges. Because of the species specificity and pervasiveness of genome signatures, it is possible to use these signatures to characterize and identify a genome or a taxonomic unit using a short genome fragment from that source. However, the identification accuracy is generally poor when the sequence model and the sequence distance measure are not selected carefully. We show that the use of relative distance measures instead of absolute metrics makes it possible to obtain better detection accuracy. Furthermore, the use of relative metrics can create opportunities for using more complex models to develop genome signatures, which cannot be used efficiently when conventional distance measures are used. Using a relative distance measure and a model based on the relative abundance
Scaling Short Read De Novo DNA Sequence Assembly to Gigabase Genomes
2011
The recent advent of massively parallel sequencing technologies has drastically reduced the cost of sequencing, sparking a revolution in whole genome de novo sequencing. However, these new technologies sample much shorter segments of DNA, called short reads, than conventional but more costly long read sequencing technologies, and suffer from higher and more varied error rates. Modern genome assembly tools compensate for these shortcomings by using de Bruijn graph based assembly techniques; however, for large genomes, the physical memory required to efficiently build and manipulate the de Bruijn graph generally far exceeds that which is available on modern commodity workstations. This dissertation develops novel out-of-core algorithms that permit conservative assembly of the de Bruijn graph using one to three orders of magnitude less memory than is required by the naïve approach. These algorithms are implemented in an open source genome assembly tool that

