David B Neale, Patrick E McGuire, Nicholas C Wheeler, Kristian A Stevens, Marc W Crepeau, Charis Cardeno, Aleksey V Zimin, Daniela Puiu, Geo M Pertea, U Uzay Sezen, Claudio Casola, Tomasz E Koralewski, Robin Paul, Daniel Gonzalez-Ibeas, Sumaira Zaman, Richard Cronn, Mark Yandell, Carson Holt, Charles H Langley, James A Yorke, Steven L Salzberg, Jill L Wegrzyn.
Abstract
A reference genome sequence for Pseudotsuga menziesii var. menziesii (Mirb.) Franco (Coastal Douglas-fir) is reported, thus providing a reference sequence for a third genus of the family Pinaceae. The contiguity and quality of the genome assembly far exceed those of other conifer reference genome sequences (contig N50 = 44,136 bp and scaffold N50 = 340,704 bp). Incremental improvements in sequencing and assembly technologies are in part responsible for the higher-quality reference genome, but it may also be due to a slightly lower exact repeat content in Douglas-fir vs. pine and spruce. Comparative genome annotation with angiosperm species reveals gene-family expansion and contraction in Douglas-fir and other conifers, which may account for some of the major morphological and physiological differences between the two major plant groups. Notable differences in the size of the NDH-complex gene family and genes underlying the functional basis of shade tolerance/intolerance were observed. This reference genome sequence not only provides an important resource for Douglas-fir breeders and geneticists but also sheds additional light on the evolutionary processes that have led to the divergence of modern angiosperms from the more ancient gymnosperms.
Keywords: annotation; conifer; genome assembly; gymnosperm; mega-genome; shade tolerance
Year: 2017 PMID: 28751502 PMCID: PMC5592940 DOI: 10.1534/g3.117.300078
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Summary statistics for Douglas-fir assembly version Pm_1.0
| Feature | Contigs | Scaffolds | Singletons |
|---|---|---|---|
| Number of elements | 3,541,350 | 2,814,118 | 6,349,354 |
| Maximum size (bp) | 692,189 | 3,893,220 | 150 |
| Total size (bp) | 14,647,181,470 | 14,947,693,430 | 746,567,730 |
| N50 size (bp) | 44,136 | 340,704 | |
All contigs and scaffolds longer than 150 bp were included in the statistics shown here. N50 sizes are computed based on an estimated total genome size of 16 Gbp. Singletons are small contigs that did not overlap with the rest of the assembly.
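The note above says the N50 values are computed against an estimated genome size rather than the assembled length. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are hypothetical and are not taken from the Douglas-fir assembly or its pipeline.

```python
def n50(lengths, genome_size=None):
    """Return the N50 of a list of sequence lengths.

    If `genome_size` is given, the 50% threshold is taken from that estimate
    instead of the summed assembly length (this variant is often called NG50).
    """
    total = genome_size if genome_size is not None else sum(lengths)
    threshold = total / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0  # the assembly covers less than half of the target size


# Toy contig lengths (hypothetical values for illustration only).
toy = [100, 80, 60, 40, 20]
print(n50(toy))                   # threshold is half of the summed length
print(n50(toy, genome_size=400))  # threshold is half of the estimated size
```

Because the estimated genome size (16 Gbp) is larger than the total assembled size in the table, using it as the denominator gives an N50 that is no larger than one computed from the assembly length alone.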
Figure 1. Comparison of the k-mer uniqueness ratio in the assemblies of Pseudotsuga menziesii (Douglas-fir), Pinus taeda (loblolly pine), Oryza sativa (rice), and Bos taurus (domestic cow). Shown is the percentage of sequences of length k (k-mers) in each genome that occur exactly once (i.e., that are unique) as a function of the value of k. Curves that are lower in the figure are relatively more repetitive and thus more difficult to assemble.
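The uniqueness ratio described in the caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the denominator is taken to be the number of distinct k-mers, reverse complements are not merged, and the function name `kmer_uniqueness` and the toy sequence are hypothetical.

```python
from collections import Counter


def kmer_uniqueness(seq, k):
    """Percentage of distinct k-mers in `seq` that occur exactly once.

    Assumption: only the forward strand is counted; a genome-scale tool would
    typically also handle reverse complements and ambiguous bases.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if not counts:
        return 0.0  # sequence shorter than k
    unique = sum(1 for c in counts.values() if c == 1)
    return 100.0 * unique / len(counts)


# Toy sequence (hypothetical): uniqueness rises as k grows.
toy = "ACGTACGTTT"
for k in (2, 4, 6):
    print(k, round(kmer_uniqueness(toy, k), 1))
```

For real genomes, an in-memory `Counter` is impractical at this scale; a dedicated k-mer counter would be needed, and the sketch only shows what the plotted quantity means.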
Summary of Douglas-fir gene models (assembly Pm_1.0)
| Category | Gene-model count |
|---|---|
| A – Single gene-model features used for classification | |
| Multi-exonic | 47,874 |
| Full-length | 28,403 |
| Expression | 17,411 |
| InterPro | 36,243 |
| Protein evidence | 41,609 |
| Mono-exonic | 6956 |
| Full-length | 6874 |
| Expression | 1649 |
| InterPro | 6956 |
| Protein evidence | 6944 |
| B – Gene-model classification based on combined features | |
| Multi-exonic | |
| Total high quality (expression/InterPro/protein evidence) | 34,239 |
| Full-length | 20,616 |
| Partial | 13,623 |
| Mono-exonic | |
| Total high quality (expression/InterPro/protein evidence) | |
| Full-length | 1641 |
| Total high quality full-length (Multi- and mono-exonic) | 22,257 |
| Total high quality (Multi- and mono-exonic) | 35,880 |
| Total low quality (Multi- and mono-exonic) | 18,950 |
InterPro, gene model containing at least one recognizable protein domain; expression, gene model supported by RNA-Seq data corresponding to de novo assembled Douglas-fir transcripts; protein evidence, gene model with protein sequence similar to existing plant proteins in public databases (used as input for MAKER-P).
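The totals in part B of the gene-model table can be cross-checked against part A with simple arithmetic. The sketch below copies the counts from the table; the dictionary keys are hypothetical labels introduced only for this check.

```python
# Counts copied from the gene-model summary table (assembly Pm_1.0).
counts = {
    "multi_total": 47874,       # all multi-exonic gene models (part A)
    "mono_total": 6956,         # all mono-exonic gene models (part A)
    "multi_hq": 34239,          # high-quality multi-exonic
    "multi_hq_full": 20616,
    "multi_hq_partial": 13623,
    "mono_hq_full": 1641,       # high-quality mono-exonic, full-length
    "hq_full_total": 22257,
    "hq_total": 35880,
    "lq_total": 18950,
}

checks = {
    "multi HQ = full-length + partial":
        counts["multi_hq_full"] + counts["multi_hq_partial"] == counts["multi_hq"],
    "HQ full-length = multi + mono":
        counts["multi_hq_full"] + counts["mono_hq_full"] == counts["hq_full_total"],
    "HQ total = multi HQ + mono HQ full-length":
        counts["multi_hq"] + counts["mono_hq_full"] == counts["hq_total"],
    "HQ + LQ = all gene models":
        counts["hq_total"] + counts["lq_total"]
        == counts["multi_total"] + counts["mono_total"],
}

for label, ok in checks.items():
    print(f"{label}: {'ok' if ok else 'MISMATCH'}")
```

All four identities hold for the published counts, giving 54,830 gene models in total. The third identity is consistent with every high-quality mono-exonic model being full-length, although the table does not state this explicitly.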
Assessment of proteome completeness (%) using DOGMA and BUSCO approaches
| Approach/domain | DF HQ | DF All | PT HQC | PT HQ All | PL HQC | PL HQ All | PA MC | PA HC |
|---|---|---|---|---|---|---|---|---|
| DOGMA | | | | | | | | |
| CDA Found1 | 464 | 774 | 149 | 344 | 525 | 741 | 580 | 550 |
| CDA Found2 | 272 | 471 | 97 | 229 | 280 | 431 | 141 | 287 |
| CDA Found3 | 172 | 291 | 40 | 115 | 123 | 242 | 29 | 188 |
| Total Found CDA | 908 | 1536 | 286 | 688 | 928 | 1414 | 750 | 1025 |
| Total % completeness | 45 | 76 | 14 | 34 | 46 | 70 | 37 | 51 |
| BUSCO | | | | | | | | |
| Complete | 299 | 523 | 216 | 321 | 466 | 593 | 107 | 455 |
| Single | 184 | 355 | 161 | 236 | 383 | 468 | 76 | 318 |
| Multi | 115 | 168 | 55 | 85 | 83 | 125 | 31 | 137 |
| Fragment | 203 | 283 | 144 | 193 | 148 | 188 | 242 | 230 |
| Missing | 938 | 634 | 1080 | 926 | 826 | 659 | 1091 | 755 |
DOGMA values are based on 965 single-domain and 1052 multiple-domain CDAs (Conserved Domain Arrangements) across eukaryotes; BUSCO is Benchmarking Universal Single-Copy Orthologs. Explanation of headings: DF, Douglas-fir; PT, Pinus taeda; PL, Pinus lambertiana; PA, Picea abies; HQ, High Quality; HQC, High-Quality Complete; MC, Medium Content; HC, High Content.
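The percentage row of the DOGMA block and the "Complete" row of the BUSCO block can be reproduced from the other rows. The sketch below assumes that the DOGMA denominator is the sum of the two CDA sets named in the note (965 + 1052 = 2017) and that BUSCO "Complete" is the sum of "Single" and "Multi"; both assumptions reproduce the tabulated values, but neither is stated explicitly in the source. The function names are hypothetical.

```python
DOGMA_TOTAL_CDAS = 965 + 1052  # single-domain + multiple-domain CDAs

# (CDA Found1, CDA Found2, CDA Found3) per column, copied from the table.
dogma_found = {
    "DF HQ": (464, 272, 172), "DF All": (774, 471, 291),
    "PT HQC": (149, 97, 40), "PT HQ All": (344, 229, 115),
    "PL HQC": (525, 280, 123), "PL HQ All": (741, 431, 242),
    "PA MC": (580, 141, 29), "PA HC": (550, 287, 188),
}

# (Single, Multi, Fragment, Missing) per column, copied from the table.
busco = {
    "DF HQ": (184, 115, 203, 938), "DF All": (355, 168, 283, 634),
    "PT HQC": (161, 55, 144, 1080), "PT HQ All": (236, 85, 193, 926),
    "PL HQC": (383, 83, 148, 826), "PL HQ All": (468, 125, 188, 659),
    "PA MC": (76, 31, 242, 1091), "PA HC": (318, 137, 230, 755),
}


def dogma_completeness(found):
    """Percent of the assumed 2017 CDAs that were found, rounded."""
    return round(100 * sum(found) / DOGMA_TOTAL_CDAS)


def busco_summary(single, multi, fragment, missing):
    """Return (complete, total groups) derived from the four BUSCO rows."""
    complete = single + multi
    return complete, complete + fragment + missing


for name in dogma_found:
    pct = dogma_completeness(dogma_found[name])
    complete, total = busco_summary(*busco[name])
    print(f"{name:10s} DOGMA {pct:3d}%  BUSCO complete {complete}/{total}")
```

Every column sums to the same BUSCO total (1440 groups), which is a useful sanity check that the rows were transcribed consistently.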
Figure 2. Phylogenetic placement of the PAL genes across land plants (dicots, monocots, lycophytes, and gymnosperms, with bryophytes as an outgroup). The green clade represents monocots (angiosperms), orange represents dicots (angiosperms), and the thick black line represents gymnosperms (signature of lineage-specific duplication). The conifer clade is bisected by the lycophyte Selaginella moellendorffii and the basal angiosperm Amborella trichopoda (cyan), indicative of early divergence and subsequent convergence in gene functions. The bryophyte Physcomitrella patens (outgroup) is shown in red. An interactive version of the tree is available at http://itol.embl.de/tree/1379989172117161489685231#.
Figure 3. Gene-family evolution in 16 land plants. The red, blue, and green branches correspond to dicots, monocots, and Pinaceae, respectively. The paired numbers separated by a slash, positioned on or near each branch, indicate gene duplications (left of slash) and gene losses (right of slash). The scale bar is in millions of years. Athaliana, A. thaliana; Tcacao, Theobroma cacao; Ppersica, Prunus persica; Rcommunis, Ricinus communis; Mesculenta, Manihot esculenta; Ptrichocarpa, Populus trichocarpa; Osativa, O. sativa; Bdistachyon, B. distachyon; Sbicolor, Sorghum bicolor; Sitalica, Setaria italica; Macuminata, M. acuminata; Pita, Pinus taeda; Pila, Pinus lambertiana; Pabies, Picea abies; Psme, Pseudotsuga menziesii; Ppatens, Physcomitrella patens.
Figure 4. Enlarged excerpts from a phylogenetic tree comparing Douglas-fir and other conifers together with select angiosperm light-harvesting-complex proteins. The full tree is presented in Figure S4 in File S2 and an interactive version is available at http://itol.embl.de/tree/1379989172344871490022208#. Members of the genera Pinus (pink) and Picea (blue) are shown in the two innermost data bars in both excerpts, circling the labels. Featured here are antenna PSII protein LHCb3 (purple clade) in (A) and PSII LHCb6 (red clade) and PSI LHCa5 (olive green clade) in (B). (A) Neither the members of the genus Pinus nor Douglas-fir have proteins that cluster with the LHCb3 clade. (B) The same trend is observed for the LHCb6 and PSI LHCa5 proteins. These proteins may have been lost in most conifers, but this is not a pan-conifer event.