| Literature DB >> 24653210 |
Aleksey Zimin1, Kristian A Stevens, Marc W Crepeau, Ann Holtz-Morris, Maxim Koriabine, Guillaume Marçais, Daniela Puiu, Michael Roberts, Jill L Wegrzyn, Pieter J de Jong, David B Neale, Steven L Salzberg, James A Yorke, Charles H Langley.
Abstract
Conifers are the predominant gymnosperm. The size and complexity of their genomes has presented formidable technical challenges for whole-genome shotgun sequencing and assembly. We employed novel strategies that allowed us to determine the loblolly pine (Pinus taeda) reference genome sequence, the largest genome assembled to date. Most of the sequence data were derived from whole-genome shotgun sequencing of a single megagametophyte, the haploid tissue of a single pine seed. Although that constrained the quantity of available DNA, the resulting haploid sequence data were well-suited for assembly. The haploid sequence was augmented with multiple linking long-fragment mate pair libraries from the parental diploid DNA. For the longest fragments, we used novel fosmid DiTag libraries. Sequences from the linking libraries that did not match the megagametophyte were identified and removed. Assembly of the sequence data were aided by condensing the enormous number of paired-end reads into a much smaller set of longer "super-reads," rendering subsequent assembly with an overlap-based assembly algorithm computationally feasible. To further improve the contiguity and biological utility of the genome sequence, additional scaffolding methods utilizing independent genome and transcriptome assemblies were implemented. The combination of these strategies resulted in a draft genome sequence of 20.15 billion bases, with an N50 scaffold size of 66.9 kbp.Entities:
Keywords: (see cocoa genome); Pinus taeda L.; conifer; de novo assembly; fosmid; genome; loblolly pine; placeholder
Mesh:
Year: 2014 PMID: 24653210 PMCID: PMC3948813 DOI: 10.1534/genetics.113.159715
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
Overview of WGS sequence obtained for this project
| Library type | Instrument | Fragment size (bp) | Read length (bp) | Coverage |
|---|---|---|---|---|
| Illumina Paired-end | GAIIx | 200–657 | 156–160 | 22× |
| Illumina Paired-end | HiSeq | 200–657 | 100–128 | 42× |
| Illumina Paired-end | MiSeq | 350–657 | 255 | <1× |
| Illumina Mate Pair | GAIIx | 1300–5500 | 156–160 | 13× |
| Fosmid DiTag | GAIIx | 35,000–40,000 | 156–160 | <1× |
The final column reports the nonredundant depth of coverage that was obtained for each library and instrument type based on a genome size of 22 Gbp.
Summary of outcomes from preprocessing stages for long fragment libraries, with lengths ranging from 1000 to 5500 bp
| Estimated jump size (bp) | Library count | Reads sequenced | After error correction and mapping | Nonjunction pairs | Redundant reads | Final reads with both >63 bp | Estimated clone coverage |
|---|---|---|---|---|---|---|---|
| 1000–1999 | 5 | 127.3 | 85.3 (67%) | 5.6 (7%) | 9.7 (12%) | 42.1 | 1.5× |
| 2000–2999 | 16 | 651.9 | 430.0 (66%) | 26.6 (6%) | 43.0 (11%) | 207.4 | 11.9× |
| 3000–3999 | 18 | 705.4 | 474.4 (67%) | 26.4 (6%) | 88.1 (20%) | 213.2 | 16.2× |
| 4000–4999 | 6 | 186.6 | 127.8 (69%) | 6.6 (5%) | 13.6 (11%) | 63.8 | 6.3× |
| 5000–5500 | 3 | 55.3 | 38.2 (69%) | 5.7 (15%) | 19.8 (61%) | 57.6 | 2.1× |
Read counts are given in millions.
Reads that successfully passed error correction and mapped to the haplotype of the target megagametophyte.
Figure 1Partition library construction. (A) Scheme to partition the single megagametophyte DNA sample into multiple libraries. Megagametophyte DNA was sonicated and then run out on an agarose gel; the target size range was excised and partitioned into equal spaced slices. The goal was to set the number and sizes of slices such that the coefficient of variation on the insert-size distribution within each fraction was small enough to support high quality de novo assembly. (B) The empirical fragment size distributions of partitioned libraries made from our target megagametophyte.
Figure 2An overview of the pFosDT5.4 fosmid cloning vector used for DiTag creation. The cloning site (where the genomic DNA is inserted) is flanked by forward and reverse Illumina primers to enable end sequencing. Two Nb.BbvCl nicking endonuclease sites are located 5′ of the Illumina primers on both strands to allow for creation of DiTags by a nick translation method. The vector backbone has had all FspBl and Csp6l sites removed to allow for creation of DiTags by endonuclease digestion.
Figure 3Schematic for our two methods for converting a fosmid library into an Illumina compatible DiTag library using the fosmid vector created for this project. (A) A nick translation approach, similar to the approach used in Williams , was implemented for approximately one-half of the libraries. (B) An endonuclease digestion protocol was also used for approximately one-half of the libraries. Although the location of the junction sites is more constrained, in practice, we obtained higher yields from this method.
Summary of DiTag sequencing
| Library | Lanes sequenced | Median reads sequenced per lane (millions) | Median reads after error correction and mapping (millions) | Median nonjunction reads (millions) | Median redundant reads (millions) | Median final reads with both >63 bp (millions) | Median clone coverage per lane (sum) |
|---|---|---|---|---|---|---|---|
| N1 | 6 | 2.5 | 1.5 (57%) | 0.2 (16%) | 0.5 (37%) | 0.3 (12%) | 0.3× (1.8×) |
| N2 | 6 | 1.8 | 1.2 (68%) | 0.03 (3%) | 0.6 (53%) | 0.3 (15%) | 0.2× (1.2×) |
| N3 | 6 | 0.9 | 0.6 (59%) | 0.05 (8%) | 0.2 (46%) | 0.1 (13%) | 0.1× (0.6×) |
| N4 | 3 | 1.5 | 1.0 (66%) | 0.1 (5%) | 0.3 (37%) | 0.3 (22%) | 0.3× (0.9×) |
| R1 | 2 | 4.1 | 1.8 (44%) | 0.7 (39%) | 0.9 (85%) | 0.2 (4%) | 0.1× (0.2×) |
| R2 | 2 | 2.9 | 1.1 (39%) | 0.4 (33%) | 0.5 (63%) | 0.1 (5%) | 0.1× (0.2×) |
| R3 | 2 | 5.7 | 2.2 (39%) | 1.0 (45%) | 1.2 (97%) | 0.2 (3%) | 0.1× (0.2×) |
| All lanes | 40 | 93 | 39 (42%) | 7 (18%) | 15 (47%) | 9 (10%) | (7.5×) |
Four nick translation libraries (N1–N4) and three endonuclease digestion libraries (R1–R3) are detailed. Each library was sequenced in replicate in a number of lanes. All values reported other than the totals are medians. To allow for a more direct comparison between libraries, median values per lane are used.
Comparison of P. taeda whole-genome assemblies before and after additional scaffolding
| Total sequence in contigs (bp) | 20,148,103,497 | 20,148,103,497 |
| Total span of scaffolds (bp) | 22,564,679,219 | 23,180,477,227 |
| N50 contig size (bp) | 8,206 | 8,206 |
| N50 scaffold size (bp) | 30,681 | 66,920 |
| No. of contigs >500 bp | 4,047,642 | 4,047,642 |
| No. of scaffolds >500 bp | 2,319,749 | 2,158,326 |
The contigs are the same for both assemblies. N50 statistics use a genome size of 22 × 109.
Figure 4Library complexity curves quantify library complexity and the diminishing returns as sequencing progresses. Library complexity is an estimate of the number of distinct molecules in the library (Daley and Smith 2013). The convexly shaped complexity curve plots the number of distinct molecules observed against the number of sequenced reads. Only our deeply sequenced library exhibited diminishing returns. Libraries sequenced to lower depth are shown in the inset.
Figure 5Plot of k-mer counts in the haploid paired-end WGS sequence. After counting all 24- and 31-mers in the QuORUM error-corrected reads, we plotted the number of distinct k-mers (y-axis) that occur exactly X times (x-axis). For our haploid data, each plot has a single primary peak at the X-value corresponding to the depth of coverage of the 1N genome.
Haploid (1N) genome size estimates
| k-mer length | ||
|---|---|---|
| 31 | 24 | |
| Total k-mers | 1.08 × 1012 | 1.16 × 1012 |
| 53.76 occurrences | 56.99 occurrences | |
| Genome size (Gbp) | 20.12 | 20.42 |
Selected statistics for haploid paired-end sequence data by platform and fragment size
| Platform | Median fragment size (bp) | Sequenced Gbp | Error-corrected Gbp | Read 1 expected error-corrected length (bp) | Read 2 expected error-corrected length (bp) | Expected gap size (bp) | Gaps filled (%) | Maximal super-reads (%) |
|---|---|---|---|---|---|---|---|---|
| GAIIx 160+156 | 209 | 2.3 | 2.2 | 157.0 | 151.3 | −99.3 | 96 | 6 |
| 220 | 2.9 | 2.8 | 157.1 | 151.3 | −88.4 | 96 | 7 | |
| 234 | 3.3 | 3.1 | 157.6 | 149.7 | −73.3 | 95 | 8 | |
| 246 | 2.3 | 2.2 | 157.8 | 151.7 | −63.5 | 94 | 9 | |
| 260 | 2.4 | 2.3 | 157.8 | 150.9 | −48.7 | 94 | 9 | |
| 273 | 263 | 252 | 157.5 | 150.0 | −34.5 | 92 | 12 | |
| 285 | 2.4 | 2.3 | 157.8 | 151.4 | −24.2 | 92 | 12 | |
| 325 | 2.2 | 2.1 | 157.5 | 149.3 | 18.3 | 91 | 15 | |
| 441 | 1.9 | 1.8 | 157.1 | 141.7 | 142.2 | 84 | 18 | |
| 565 | 1.5 | 1.4 | 156.9 | 140.5 | 267.6 | 76 | 22 | |
| 637 | 24.0 | 1.1 | 156.6 | 134.8 | 345.6 | 71 | 23 | |
| HiSeq 2x125 | 273 | 326.0 | 315.4 | 123.8 | 122.0 | 27.1 | 83 | 9 |
| HiSeq 2x128 | 220 | 96.6 | 92.9 | 126.5 | 124.1 | −30.6 | 90 | 6 |
| 234 | 59.5 | 57.7 | 127.2 | 125.4 | −18.6 | 87 | 6 | |
| 246 | 61.6 | 59.6 | 126.6 | 125.3 | −5.9 | 87 | 7 | |
| 260 | 100.2 | 97.1 | 127.1 | 125.1 | −2.3 | 84 | 9 | |
| 285 | 95.6 | 91.7 | 126.4 | 123.2 | 35.4 | 86 | 10 | |
| 325 | 92.3 | 88.9 | 126.2 | 124.3 | 74.5 | 80 | 12 | |
| 441 | 35.8 | 34.4 | 126.2 | 124.0 | 190.8 | 73 | 15 | |
| 565 | 43.4 | 41.4 | 125.7 | 122.2 | 317.1 | 68 | 19 | |
| MiSeq 2x255 | 325 | 2.5 | 2.38 | 248.2 | 238.8 | −162.0 | 95 | 14 |
| 441 | 2.0 | 1.85 | 247.5 | 232.2 | −38.6 | 89 | 18 | |
| 565 | 1.6 | 1.44 | 246.9 | 246.9 | 96.0 | 83 | 23 | |
| 637 | 1.2 | 1.03 | 246.6 | 246.6 | 177.9 | 79 | 24 |
The percentage of reads that are gap filled is from the total number of reads entering gap fill, and the percentage of reads becoming maximal super-reads is from those passing error correction. As fragment size increases, so does the expected gap size. Both the error-corrected read length and the rate at which the gaps are successfully filled are decreasing functions of the fragment size. Nevertheless, the overall trend is that participation in the formation of super-reads increases with insert size. Only the paired-end reads that are successfully transformed into maximal super-reads are passed to subsequent assembly steps.
Libraries and read statistics for sequences contributing to our largest fosmid pool assembly
| Library | Pool size (fosmids) | Read lengths (bp) | Mean fragment length (bp) | No. of pairs | No. of pairs after cleaning |
|---|---|---|---|---|---|
| Paired-end | 4600 ± 10% | 160, 156 | 261 | 90,392,267 | 73,325,963 |
| Mate pair | 6400 ± 10% | 160, 156 | 3345 ± 151 | 11,978,560 | 8,390,844 |
Figure A1The percentage of identity of long contigs (>20,000 bp) from the fosmid pool assembly when aligned to the WGS assembly. A total of 109 contigs were 100% identical, and another 2011 were between 99.5 and 100% identical.
Assembly statistics for the largest pool of fosmids
| No. | Sum of lengths | Vector on either end | |
|---|---|---|---|
| Contigs >20,000 bp | 3798 | 109,412,316 | 623 |
| Contigs >30,000 bp | 1651 | 56,742,427 | 585 |
| Scaffolds >20,000 bp | 5323 | 171,852,787 | 2001 |
| Scaffolds >30,000 bp | 3719 | 131,752,852 | 1911 |
The final column shows the number of contigs/scaffolds for which one or both ends contained fosmid vector sequence, indicating that the contig extended all the way to the end of the insert.
Loblolly pine V1.01 assembly compared to contemporary draft conifer genomes
| Species | Loblolly pine ( | Norway spruce ( | White spruce ( |
|---|---|---|---|
| Cytometrically estimated genome size (Gbp) | 21.6 | 19.6 | 15.8 |
| Total scaffold span (Gbp) | 22.6 | 12.3 | 23.6 |
| Total contig span | 20.1 | 12.0 | 20.8 |
| Referenced genome-size estimate (Gbp) | 22 | 18 | 20 |
| N50 contig size (kbp) | 8.2 | 0.6 | 5.4 |
| N50 scaffold size (kbp) | 66.9 | 0.72 | 22.9 |
| No. of scaffolds | 14,412,985 | 10,253,693 | 7,084,659 |
| Annotation of 248 conserved CEGMA genes ( | 185 (74%) complete 203 (82%) complete + partial, 91% annotated full length | 124 (50%) complete 189 (76%) complete + partial, 66% annotated full length | 95 (38%) complete 184 (74%) complete + partial, 52% annotated full length |
N50 contig and scaffold sizes are based on the estimated genome size listed in the table.
O’Brien .
Fuchs .
Bai .
Determined as the number of non-N characters in the published reference sequence.