Substantial biases in ultra-short read data sets from high-throughput DNA sequencing
- Nucleic Acids Res
, 2008
Cited by 121 (0 self)
Novel sequencing technologies permit the rapid production of large sequence data sets. These technologies are likely to revolutionize genetics and biomedical research, but a thorough characterization of the ultra-short read output is necessary. We generated and analyzed two Illumina 1G ultra-short read data sets, i.e. 2.8 million 27mer reads from a Beta vulgaris genomic clone and 12.3 million 36mers from the Helicobacter acinonychis genome. We found that error rates range from 0.3% at the beginning of reads to 3.8% at the end of reads. Wrong base calls are frequently preceded by base G. Base substitution error frequencies vary by 10- to 11-fold, with A>C transversion being among the most frequent and C>G transversions among the least frequent substitution errors. Insertions and deletions of single bases occur at very low rates. When simulating re-sequencing we found a 20-fold sequencing coverage to be sufficient to compensate errors by correct reads. The read coverage of the sequenced regions is biased; the highest read density was found in intervals with elevated GC content. High Solexa quality scores are over-optimistic and low scores underestimate the data quality. Our results show different types of biases and ways to detect them. Such biases have implications on the use and interpretation of Solexa data, for de novo sequencing, re-sequencing, the identification of single nucleotide polymorphisms and DNA methylation sites, as well as for transcriptome analysis.
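The position-dependent error rate described in this abstract can be illustrated with a minimal Python sketch. It assumes ungapped alignments in which every read is paired with the reference segment it maps to; the function name `per_position_error_rate` and the toy reads are hypothetical and are not taken from the paper, whose own pipeline is not described here.

```python
def per_position_error_rate(pairs):
    """Return the mismatch rate at each read position.

    pairs: iterable of (read, reference_segment) strings of equal length,
    i.e. ungapped alignments (an assumption of this sketch).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one aligned read")
    read_len = len(pairs[0][0])
    mismatches = [0] * read_len
    for read, ref in pairs:
        if len(read) != read_len or len(ref) != read_len:
            raise ValueError("all reads and reference segments must share one length")
        # Count a mismatch wherever the called base differs from the reference.
        for i, (called, expected) in enumerate(zip(read, ref)):
            if called != expected:
                mismatches[i] += 1
    return [m / len(pairs) for m in mismatches]


# Hypothetical toy data: errors accumulate toward the 3' end of the reads.
toy_pairs = [
    ("ACGT", "ACGT"),
    ("ACGA", "ACGT"),
    ("ACTT", "ACGT"),
    ("ACGC", "ACGT"),
]
rates = per_position_error_rate(toy_pairs)
```

On real data the same tally would be run over millions of aligned reads, and insertions or deletions would need a gapped alignment rather than this position-by-position comparison.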
De novo fragment assembly with short mate-paired reads: Does the read length matter?
, 2009
Cited by 54 (1 self)
Increasing read length is currently viewed as the crucial condition for fragment assembly with next-generation sequencing technologies. However, introducing mate-paired reads (separated by a gap of length, GapLength) opens a possibility to transform short mate-pairs into long mate-reads of length GapLength, and thus raises the question as to whether the read length (as opposed to GapLength) even matters. We describe a new tool, EULER-USR, for assembling mate-paired short reads and use it to analyze the question of whether the read length matters. We further complement the ongoing experimental efforts to maximize read length by a new computational approach for increasing the effective read length. While the common practice is to trim the error-prone tails of the reads, we present an approach that substitutes trimming with error correction using repeat graphs. An important and counterintuitive implication of this result is that one may extend sequencing reactions that degrade with length "past their prime" to where the error rate grows above what is normally acceptable for fragment assembly.
Next-generation sequencing: from basic research to diagnostics
- Clinical Chemistry,
, 2009
De novo assembly of a 40 Mb eukaryotic genome from short sequence reads: Sordaria macrospora, a model organism for fungal morphogenesis
- PLoS Genet 2010; 6(4): e1000891
Cited by 12 (2 self)
Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30-90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in ~4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group.
Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative studies to address basic questions of fungal biology.
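The N50 values quoted in this abstract (117 kb, then 498 kb) summarize how contiguous an assembly is. As a rough illustration, the standard N50 definition can be sketched in Python as follows; the function name `n50` and the contig lengths are hypothetical and do not come from the Sordaria macrospora assembly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in bases), for illustration only.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_contigs)
```

A higher N50 at the same total assembly size means that fewer, longer contigs carry most of the sequence, which is the improvement the abstract reports after the comparative analysis with Neurospora genomes.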
Rangwala H: Evaluation of short read metagenomic assembly
- BMC Genomics 2011, 12(Suppl 2):S8
Cited by 11 (0 self)
Advances in sequencing technologies have equipped researchers with the ability to sequence the collective genome of entire microbial communities, commonly referred to as metagenomics. These microbes are omnipresent within the human body and environments across the world. As such, characterizing and understanding their roles is crucial for improving human health and the environment. The problem of using short reads obtained from current next generation sequencing technologies to assemble the genomes within the community sample is challenging for several reasons. In this study we assess the performance of a state-of-the-art Eulerian-based graph assembler on a series of simulated datasets with varying complexity. We evaluate the feasibility of metagenomic assembly with reads restricted to 36 base pairs obtained from the Solexa/Illumina platform. We developed a pipeline to evaluate the quality of assembly based on contig length statistics and accuracy. We studied the effect of overlap parameters used for the metagenomic assembly and developed a clustering solution to pool the contigs obtained from different runs of the assembly algorithm, which allowed us to obtain longer contigs. We also computed an entropy/impurity metric to assess how mixed the assembled contigs were. Ideally a contig should be assembled from reads obtained from the same organism. We also compared the metagenomic assemblies to the best possible solution that could be obtained by assembling individual source genomes. Our results show that accuracy was better than expected for the metagenomic samples with a few dominant organisms and was especially poor in samples containing many closely related strains.
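The abstract does not give the formula behind its entropy/impurity metric. One plausible reading, shown here purely as an assumption, is the Shannon entropy of the source-organism labels of the reads placed in a contig: 0 bits when every read comes from one organism, higher when sources are mixed. The function name `contig_entropy` and the labels are hypothetical.

```python
from collections import Counter
from math import log2


def contig_entropy(read_sources):
    """Shannon entropy (in bits) of the organism labels of a contig's reads.

    read_sources: one label per read, naming the genome the read was
    simulated from. This formula is an assumed stand-in for the paper's
    metric, which the abstract does not specify.
    """
    counts = Counter(read_sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("contig has no reads")
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * log2(p)
    return entropy


# Hypothetical contigs: one pure, one evenly mixed between two organisms.
pure = contig_entropy(["orgA", "orgA", "orgA", "orgA"])
mixed = contig_entropy(["orgA", "orgA", "orgB", "orgB"])
```

With simulated data the source of each read is known, so a score like this can be computed for every contig and averaged per sample to compare assemblies.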
Sequencing by Cyclic Ligation and Cleavage (CycLiC) directly on a microarray captured template
, 2008
Cited by 3 (0 self)
Next generation sequencing methods that can be applied to both the resequencing of whole genomes and to the selective resequencing of specific parts of genomes are needed. We describe (i) a massively scalable biochemistry, Cyclical Ligation and Cleavage (CycLiC) for contiguous base sequencing and (ii) apply it directly to a template captured on a microarray. CycLiC uses four color-coded DNA/RNA chimeric oligonucleotide libraries (OL) to extend a primer, a base at a time, along a template. The cycles comprise the steps: (i) ligation of OLs, (ii) identification of extended base by label detection, and (iii) cleavage to remove label/terminator and undetermined bases. For proof-of-principle, we show that the method conforms to design and that we can read contiguous bases of sequence correctly from a template captured by hybridization from solution to a microarray probe. The method is amenable to massive scale-up, miniaturization and automation. Implementation on a microarray format offers the potential for both selection and sequencing of a large number of genomic regions on a single platform. Because the method uses commonly available reagents it can be developed further by a community of users.
Read Length and Repeat Resolution: Exploring Prokaryote Genomes Using Next-Generation Sequencing Technologies
, 2010
Cited by 2 (0 self)
Background: There are a growing number of next-generation sequencing technologies. At present, the most cost-effective options also produce the shortest reads. However, even for prokaryotes, there is uncertainty concerning the utility of these technologies for the de novo assembly of complete genomes. This reflects an expectation that short reads will be unable to resolve small, but presumably abundant, repeats. Methodology/Principal Findings: Using a simple model of repeat assembly, we develop and test a technique that, for any read length, can estimate the occurrence of unresolvable repeats in a genome, and thus predict the number of gaps that would need to be closed to produce a complete sequence. We apply this technique to 818 prokaryote genome sequences. This provides a quantitative assessment of the relative performance of various lengths. Notably, unpaired reads of only 150 nt can reconstruct approximately 50% of the analysed genomes with fewer than 96 repeat-induced gaps. Nonetheless, there is considerable variation amongst prokaryotes. Some genomes can be assembled to near contiguity using very short reads while others require much longer reads. Conclusions: Given the diversity of prokaryote genomes, a sequencing strategy should be tailored to the organism under
Visualising the repeat structure of genomic sequences
- Complex Systems
, 2008
Cited by 1 (0 self)
Repeats are a common feature of genomic sequences and much remains to be understood of their origin and structure. The identification of repeated strings in genomic sequences is therefore of importance for a variety of applications in biology. In this paper a new method for finding all repeats and visualising them in a two-dimensional plot is presented. The method is first applied to a set of constructed sequences in order to develop a comparative framework. Several complete genomes are then analysed, including the whole human genome. The technique reveals the complex repeat structure of genomic sequences. In particular, interesting differences in the repeat character of the coding and non-coding regions of bacterial genomes are noted. The method allows fast identification of all repeats and easy inter-genome comparison. In doing this the plot effectively creates a signature of a sequence which allows some classes of repeat present in a sequence to be identified by simple visual inspection. To our knowledge this is the first time all exact repeats have been visualised in a single plot that highlights the degree to which repeats occur within a genomic sequence, giving an indication of the important
Sequencing a bacterial genome: an overview
Cited by 1 (0 self)
genome sequences have been determined. DNA sequencing technology has dramatically improved from the first generation, automated Sanger DNA sequencing, which dominated this field for almost two decades, to the current next-generation sequencing protocols. This newer technology dramatically reduces both the time and cost of DNA sequencing, making it possible for a small laboratory to completely sequence the genome of their favorite bacterium. With the enormous amount of information obtained from whole genome sequencing, scientists can readily address a wide range of biological questions that were hitherto beyond their capabilities. In this chapter, strategies of how to sequence genomic DNA as well as how to assemble and annotate a bacterial genome are reviewed and discussed. Keywords: bacterial genomics; next generation sequencing; de novo assembly; bacterial genome annotation. 1. Genome sequencing. As of May 2010, 1,072 complete published bacterial genomes have been reported in the Genomes Online Database and another 4,289 bacterial genome projects are known to be ongoing (www.genomesonline.org). The underlying reasons for sequencing the genome of various bacteria are either because they are highly virulent to humans, animals or plants, or they can be applied to bioremediation or bioenergy production. In 2009, a new initiative called ‘Genomic Encyclopedia of Bacteria and Archaea’ (GEBA) was reported by Eisen and colleagues [1]. The project aims to provide a more complete picture of bacterial and archaeal genomic diversity by systematically filling in the gaps in the tree of
Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels
Motivation: High-throughput sequencing has made the analysis of new model organisms more affordable. Although assembling a new genome can still be costly and difficult, it is possible to use RNA-seq to sequence mRNA. In the absence of a known genome, it is necessary to assemble these sequences de novo, taking into account possible alternative isoforms and the dynamic range of expression values. Results: We present a software package named Oases designed to heuristically assemble RNA-seq reads in the absence of a reference genome, across a broad spectrum of expression values and in the presence of alternative isoforms. It achieves this by using an array of hash lengths, a dynamic filtering of noise, a robust resolution of alternative splicing events, and the efficient merging of multiple assemblies. It was tested on human and mouse RNA-seq data and is shown to improve significantly on the transABySS and Trinity de novo transcriptome assemblers. Availability: Oases is freely available under the GPL license at www.ebi.ac.uk/~zerbino/oases/