Rajiv C McCoy, Ryan W Taylor, Timothy A Blauwkamp, Joanna L Kelley, Michael Kertesz, Dmitry Pushkarev, Dmitri A Petrov, Anna-Sophie Fiston-Lavier.
Abstract
High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats, which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and which achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (approximately 0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp), achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of the genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long reads, offer a powerful approach to improve de novo assemblies of whole genomes.
Year: 2014 PMID: 25188499 PMCID: PMC4154752 DOI: 10.1371/journal.pone.0106689
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Characteristics of TruSeq synthetic long-reads.
A: Read length distribution. B, C, & D: Position-dependent profiles of B: mismatches, C: insertions, and D: deletions compared to the reference genome. Error rates presented in these figures represent all differences with the reference genome, and can be due to errors in the reads, mapping errors, errors in the reference genome, or accurate sequencing of residual polymorphism.
Figure 2. Depth of synthetic long-read coverage per chromosome arm.
The suffix “Het” indicates the heterochromatic portion of the corresponding chromosome. M refers to the mitochondrial genome of the y; cn, bw, sp strain. U and Uextra are additional scaffolds in the reference assembly that could not be mapped to chromosomes.
Size and correctness metrics for de novo assembly.
| Metric | Value |
| Number of contigs | 5598 |
| Total size of contigs | 147445959 |
| Longest contig | 567504 |
| Shortest contig | 1506 |
| Number of contigs >10 Kbp | 2805 |
| Number of contigs >100 Kbp | 331 |
| Mean contig size | 26339 |
| Median contig size | 10079 |
| N50 contig length | 69692 |
| L50 contig count | 554 |
| NG50 contig length | 48552 |
| LG50 contig count | 833 |
| Contig GC content | 42.26% |
| Genome fraction | 96.86% (92.24%) |
| Duplication ratio | 1.15 (1.14) |
| NA50 | 60103 (63010) |
| LA50 | 623 (618) |
| Mismatches per 100 Kbp | 7.77 (21.9) |
| Short indels (≤5 bp) | 5.10 (7.93) |
| Long indels (>5 bp) | 0.46 (1.05) |
| Fully-unaligned contigs | 377 (179) |
| Partially unaligned contigs | 1214 (70) |
The N50 length metric measures the length of the contig for which 50% of the total assembly length is contained in contigs of that size or larger, while the L50 metric is the rank order of that contig if all contigs are ordered from longest to shortest. NG50 and LG50 are similar, but based on the expected genome size of 180 Mbp rather than the assembly length. QUAST [39] metrics are based on alignment of contigs to the euchromatic reference chromosome arms (which also contain most of the centric heterochromatin). NA50 and LA50 are analogous to N50 and L50, respectively, but in this case the lengths of aligned blocks rather than contigs are considered.
Values in parentheses represent metrics calculated upon inclusion of the heterochromatic reference scaffolds (XHet, 2LHet, 2RHet, 3LHet, 3RHet, YHet, and U), which contain gaps of arbitrary size and are in some cases not oriented with respect to one another [72]. Values outside of parentheses represent comparison of the assembly only to high-quality reference scaffolds X, 2L, 2R, 3L, 3R, and 4.
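The N50/L50 and NG50/LG50 definitions in the table notes can be illustrated with a minimal Python sketch. The function name `nx_lx` and the toy contig lengths are hypothetical and chosen only for illustration; this is not the QUAST implementation, and it assumes the convention that the threshold is reached when the running total is greater than or equal to the target.

```python
from typing import Optional, Sequence, Tuple


def nx_lx(lengths: Sequence[int],
          fraction: float = 0.5,
          total: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Return (Nx, Lx) for a set of contig lengths.

    With total=None the target is a fraction of the assembly length
    (N50/L50 when fraction=0.5). Passing an expected genome size as
    `total` gives the NG/LG variants instead.
    Returns None if the contigs never reach the target.
    """
    ordered = sorted(lengths, reverse=True)        # longest contig first
    target = fraction * (sum(ordered) if total is None else total)
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= target:                      # this contig crosses the target
            return length, rank
    return None


# Toy example (hypothetical lengths, not data from the paper).
toy = [10, 8, 6, 4, 2]
n50 = nx_lx(toy)                 # target = 15 of 30 assembled bases
ng50 = nx_lx(toy, total=40)      # target = 20 of a 40-base "genome"
too_short = nx_lx(toy, total=100)
```

Because the expected genome size (180 Mbp) exceeds the total assembly length here, the NG50 target is reached later in the sorted list than the N50 target, which is why NG50 is smaller than N50 and LG50 is larger than L50 in the table.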
Alignment statistics for Celera Assembler contigs aligned to the reference genome.
| Reference | Aligned contigs | Alignment gaps | Length aligned (bp) | Percent aligned |
| X | 1141 | 797 | 20720725 | 92.4% |
| 2L | 547 | 271 | 22354714 | 97.1% |
| 2R | 586 | 291 | 20645481 | 97.6% |
| 3L | 712 | 349 | 23835623 | 97.1% |
| 3R | 657 | 304 | 27453817 | 98.3% |
| 4 | 74 | 40 | 1232723 | 91.2% |
| XHet | 32 | 8 | 153247 | 75.1% |
| 2LHet | 41 | 10 | 278753 | 75.6% |
| 2RHet | 278 | 68 | 2497813 | 75.9% |
| 3LHet | 206 | 75 | 2233661 | 87.4% |
| 3RHet | 231 | 74 | 2100876 | 83.5% |
| YHet | 29 | 38 | 151545 | 43.7% |
| M | 0 | 1 | 0 | 0% |
| U | 1158 | 1198 | 4512500 | 44.9% |
Alignment was performed with NUCmer [36], [37], filtering to extract only the optimal placement of each draft contig on the reference (see Supplemental Materials in File S1). Note that the number of gaps can be substantially fewer than the number of aligned contigs because alignments may partially overlap or be perfectly adjacent with respect to the reference. The number of gaps can also exceed the number of aligned contigs due to multiple partial alignments of contigs to the reference sequence.
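The relationship between aligned contigs and gaps described in this note can be sketched as follows. This is an illustrative Python example, not the authors' pipeline: the function name `aligned_length_and_gaps`, the half-open interval convention, and the choice to count uncovered stretches at the ends of the reference as gaps are all assumptions.

```python
from typing import Iterable, List, Tuple


def aligned_length_and_gaps(intervals: Iterable[Tuple[int, int]],
                            ref_length: int) -> Tuple[int, int]:
    """Return (aligned_bp, n_gaps) for alignments on one reference sequence.

    `intervals` are half-open (start, end) reference coordinates.
    Overlapping or perfectly adjacent alignments are merged, so they
    contribute no gap between them.
    """
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:      # overlaps or abuts previous block
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    aligned_bp = sum(end - start for start, end in merged)

    n_gaps = 0
    position = 0                                   # end of the last covered block
    for start, end in merged:
        if start > position:                       # uncovered stretch before this block
            n_gaps += 1
        position = end
    if position < ref_length:                      # uncovered stretch at the end
        n_gaps += 1
    return aligned_bp, n_gaps


# Four alignments that overlap or abut leave a single gap.
few_gaps = aligned_length_and_gaps([(0, 30), (30, 50), (45, 70), (80, 100)], 100)
# Two partial alignments (e.g. from one contig) leave three gaps.
many_gaps = aligned_length_and_gaps([(10, 20), (40, 50)], 60)
# No alignments at all: the whole reference is one gap.
no_alignments = aligned_length_and_gaps([], 100)
```

Under these assumptions, a reference with no aligned contigs yields zero aligned bases and one gap, which is the pattern shown for M in the table. Percent aligned is then `aligned_bp / ref_length`.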
Figure 3. Results of generalized linear mixed model describing probability of accurate TE assembly.
Predictor variables include TE length, GC content, divergence, and the number of high-identity (<0.01 substitutions per base compared to the canonical sequence) copies within the family. Black lines represent predicted values from the GLMM fit to the binary data (colored points). The upper set of points represents TEs that were perfectly assembled, while the lower set represents TEs that are absent from the assembly or were mis-assembled with respect to the reference. The exact positions of the colored points along the Y-axis should therefore be disregarded. Colors indicate different TE families (122 total). To visualize the interaction between divergence and the number of high-identity copies, we plotted predicted values both for families with low numbers of high-identity copies (dashed line) and for families with high numbers of high-identity copies (solid line).
Figure 4. Assembly metrics as a function of depth of coverage of TruSeq synthetic long-reads.
A: NG(X) contig length for full and down-sampled coverage data sets. This metric represents the size of the contig for which X% of the genome length (180 Mbp) lies in contigs of that size or longer. B: The proportion of genes and transposable elements accurately assembled (100% length and sequence identity) for full and down-sampled coverage data sets.