Alison D Scott, Aleksey V Zimin, Daniela Puiu, Rachael Workman, Monica Britton, Sumaira Zaman, Madison Caballero, Andrew C Read, Adam J Bogdanove, Emily Burns, Jill Wegrzyn, Winston Timp, Steven L Salzberg, David B Neale.
Abstract
The giant sequoia (Sequoiadendron giganteum) of California are massive, long-lived trees that grow along the U.S. Sierra Nevada mountains. Genomic data are limited in giant sequoia, and producing a reference genome sequence has been an important goal to allow marker development for restoration and management. Using deep-coverage Illumina and Oxford Nanopore sequencing, combined with Dovetail chromosome conformation capture libraries, the genome was assembled into eleven chromosome-scale scaffolds containing 8.125 Gbp of sequence. Iso-Seq transcripts, assembled from three distinct tissues, were used as evidence to annotate a total of 41,632 protein-coding genes. The genome was found to contain over 900 complete or partial predicted NLR genes, distributed unevenly across all 11 chromosomes and in 63 orthogroups, of which 375 are supported by annotation derived from protein evidence and gene modeling. This giant sequoia reference genome sequence represents the first genome sequenced in the Cupressaceae family and lays a foundation for using genomic tools to aid in giant sequoia conservation and management.
Keywords: Sequoiadendron giganteum; conifer; disease resistance genes; genome assembly; giant sequoia; gymnosperm
Year: 2020 PMID: 32948606 PMCID: PMC7642918 DOI: 10.1534/g3.120.401612
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Flowchart of inputs and processing steps contributing to the giant sequoia v2.0 assembly.
Assembly statistics for the initial and final scaffolded assembly of giant sequoia
| Assembly | Total sequence (bp) | N50 contig size (bp) | N50 scaffold size (bp) | Number of contigs | Number of scaffolds |
|---|---|---|---|---|---|
| Giant sequoia 1.0 | 8,122,145,191 | 347,954 | 490,521 | 49,651 | 39,821 |
| Giant sequoia 2.0 | 8,125,622,286 | 347,954 | 690,549,816 | 52,886 | 8,215 |
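The scaffold N50 reported for giant sequoia 2.0 can be checked by hand. The following is a minimal sketch (not the authors' code) of the usual N50 definition: sort sequences by length and report the length at which the running total first reaches half of the assembly. The function name `n50` and its `total` parameter are assumptions made for this illustration. The lengths are the twelve largest v2.0 scaffolds from the scaffold summary table in this record, and the total is the v2.0 "Total sequence" value; the remaining scaffolds are not needed because the halfway point is reached within the listed ones.

```python
# Lengths (bp) of the twelve largest giant sequoia 2.0 scaffolds,
# copied from the scaffold summary table in this record.
TOP_SCAFFOLDS = [
    986_618_365, 873_713_311, 843_110_718, 722_823_090,
    690_549_816, 676_903_824, 659_235_867, 649_867_199,
    641_211_466, 632_191_860, 443_565_592, 171_454_409,
]

# "Total sequence (bp)" for giant sequoia 2.0 from the assembly statistics table.
TOTAL_BP = 8_125_622_286


def n50(lengths, total=None):
    """Return the length L at which sequences of length >= L first cover
    at least half of `total` (defaults to the sum of `lengths`)."""
    ordered = sorted(lengths, reverse=True)
    if total is None:
        total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("listed lengths cover less than half of the total")


if __name__ == "__main__":
    print(n50(TOP_SCAFFOLDS, total=TOTAL_BP))
```

With these inputs the running total passes the halfway mark at the fifth-largest scaffold (chr5), which matches the 690,549,816 bp scaffold N50 listed for giant sequoia 2.0.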
Summary of largest scaffolds in giant sequoia 2.0
| Scaffold ID | Length (bp) | Centromere? | Number of telomeres | Number of gaps | Total gap length (bp, estimated) |
|---|---|---|---|---|---|
| chr1 | 986,618,365 | Y | 1 | 4415 | 441,500 |
| chr2 | 873,713,311 | Y | 2 | 3812 | 877,827 |
| chr3 | 843,110,718 | Y | 1 | 3788 | 378,800 |
| chr4 | 722,823,090 | Y | 2 | 3028 | 666,733 |
| chr5 | 690,549,816 | Y | 2 | 2902 | 382,479 |
| chr6 | 676,903,824 | Y | 1 | 3005 | 1,306,128 |
| chr7 | 659,235,867 | Y | 2 | 2790 | 279,000 |
| chr8 | 649,867,199 | Y | 2 | 2953 | 295,300 |
| chr9 | 641,211,466 | Y | 1 | 2707 | 1,748,814 |
| chr10 | 632,191,860 | Y | 2 | 2642 | 339,803 |
| chr11 | 443,565,592 | Y | 2 | 1885 | 1,006,377 |
| Sc7zsyj_3574 | 171,454,409 | N | 1 | 731 | 1,052,509 |
Summary of the largest scaffolds in giant sequoia 2.0, showing that the 11 largest scaffolds represent near-complete chromosomes. All scaffolds other than these top 12 were less than 1 Mbp in length. Number of gaps and total gap length are shown in the final two columns; small gaps of unknown size were assigned a size of 100 bp. Where all gaps fell into this category, the total gap length is the number of gaps × 100 bp.
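The 100 bp convention makes it possible to tell, from the table alone, which scaffolds contain only gaps of unknown size. The sketch below is an illustrative check under that stated convention, not part of the published pipeline; the names `GAPS`, `DEFAULT_GAP_BP`, and `all_gaps_default` are assumptions introduced here.

```python
# (number of gaps, total estimated gap length in bp) per scaffold,
# copied from the scaffold summary table.
GAPS = {
    "chr1": (4415, 441_500),
    "chr2": (3812, 877_827),
    "chr3": (3788, 378_800),
    "chr4": (3028, 666_733),
    "chr5": (2902, 382_479),
    "chr6": (3005, 1_306_128),
    "chr7": (2790, 279_000),
    "chr8": (2953, 295_300),
    "chr9": (2707, 1_748_814),
    "chr10": (2642, 339_803),
    "chr11": (1885, 1_006_377),
    "Sc7zsyj_3574": (731, 1_052_509),
}

# Size assigned to small gaps of unknown length, per the table caption.
DEFAULT_GAP_BP = 100


def all_gaps_default(n_gaps, total_gap_bp, default=DEFAULT_GAP_BP):
    """True when the total gap length equals n_gaps * default, i.e. every
    gap on the scaffold was assigned the placeholder size."""
    return total_gap_bp == n_gaps * default


only_default = [name for name, (n, total) in GAPS.items()
                if all_gaps_default(n, total)]

if __name__ == "__main__":
    print(only_default)
```

For the values in the table, the check holds for chr1, chr3, chr7, and chr8; every other listed scaffold has a total gap length above the number of gaps × 100 bp, so it contains at least one gap with a larger estimated size.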
BUSCO completeness of giant sequoia 2.0 assembly and annotation
| | Giant sequoia v2.0 | Giant sequoia v2.0 (≥3 kbp) | Transcriptome | Transcriptome mapped to genome | High-confidence gene set |
|---|---|---|---|---|---|
| Number of input sequences | 8215 | 8120 | 25859 | 22697 | 41633 |
| Complete BUSCOs (C) | 612 | 613 | 1377 | 1184 | 806 |
| Complete and single-copy BUSCOs (S) | 576 | 577 | 1333 | 1140 | 751 |
| Complete and duplicated BUSCOs (D) | 36 | 36 | 44 | 44 | 55 |
| Fragmented BUSCOs (F) | 192 | 191 | 95 | 84 | 260 |
| Missing BUSCOs (M) | 810 | 810 | 142 | 346 | 548 |
| Total BUSCO groups searched | 1614 | 1614 | 1614 | 1614 | 1614 |
| Percentage found | 37.92% | 37.98% | 85.32% | 73.36% | 49.94% |
Completeness of giant sequoia 2.0 assembly and gene sets assessed with BUSCO v4.0.2. Giant sequoia v2.0 is the entire assembly, and giant sequoia v2.0 (≥3 kbp) includes only scaffolds at least 3 kbp in length.
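The rows of the BUSCO table are related by simple arithmetic: complete BUSCOs are the sum of single-copy and duplicated ones, the complete, fragmented, and missing counts add up to the groups searched, and the "Percentage found" row agrees with complete BUSCOs divided by the total. The following sketch reproduces that arithmetic from the table values; it is an illustration only, and the names `BUSCO`, `TOTAL_GROUPS`, and `summarize` are assumptions made here.

```python
# Total BUSCO groups searched, from the table.
TOTAL_GROUPS = 1614

# Single-copy (S), duplicated (D), fragmented (F), and missing (M) counts
# per column of the BUSCO table.
BUSCO = {
    "Giant sequoia v2.0": {"S": 576, "D": 36, "F": 192, "M": 810},
    "Giant sequoia v2.0 (>=3 kbp)": {"S": 577, "D": 36, "F": 191, "M": 810},
    "Transcriptome": {"S": 1333, "D": 44, "F": 95, "M": 142},
    "Transcriptome mapped to genome": {"S": 1140, "D": 44, "F": 84, "M": 346},
    "High-confidence gene set": {"S": 751, "D": 55, "F": 260, "M": 548},
}


def summarize(counts, total=TOTAL_GROUPS):
    """Return (complete, percent_complete) after checking that the
    categories account for every BUSCO group searched."""
    complete = counts["S"] + counts["D"]
    if complete + counts["F"] + counts["M"] != total:
        raise ValueError("BUSCO categories do not sum to the total")
    return complete, round(100 * complete / total, 2)


if __name__ == "__main__":
    for name, counts in BUSCO.items():
        complete, pct = summarize(counts)
        print(f"{name}: C={complete} ({pct:.2f}%)")
```

Run on the table values, this gives 37.92%, 37.98%, 85.32%, 73.36%, and 49.94%, the same figures as the "Percentage found" row, so that row reflects complete BUSCOs only and excludes fragmented ones.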
Comparison of giant sequoia v2.0 assembly and annotation to selected gymnosperm genome projects
| A | Giant sequoia v2.0 | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Reference | | | | | | | | |
| Genome size (Mbp) | 8,114 | 18,167 | 20,000 | 31,000 | 20,613 | 15,700 | 10,610 | 4,110 |
| Chromosomes | 11 | 12 | 12 | 12 | 12 | 12 | 12 | 22 |
| TE content (%) | 79 | 78 | N/A | 79 | 81 | 72 | 77 | 86 |
| N50 scaffold size (kb) | 690,549 | 14.05 | 71.50 | 246 | 107 | 340 | 1,360 | 475 |
Assembly (A) and annotation (B) statistics for giant sequoia v2.0 compared to recent gymnosperm genome projects. (A) Genome size, TE content, and N50 scaffold size are as reported in the literature. (B) Number of genes, average coding sequence (CDS) size, average intron size, and maximum intron length as calculated by gFACs.
Gene models proposed by BRAKER2, before and after filtering
| | Initial model set | Intermediate filtered set | High-confidence set |
|---|---|---|---|
| Total Genes | 1,460,545 | 32,360 | 41,632 |
| Average CDS length (bp) | 613.90 | 1099.08 | 1146.4 |
| Average number of exons | 2.78 | 4.22 | 4.48 |
| Average intron length (bp) | 2,362 | 2,233 | 3,894 |
| Max intron length (bp) | 385,133 | 159,979 | 1,399,110 |
| Total monoexonics | 941,659 | — | 5,165 |
| Total multiexonics | 518,886 | 32,360 | 36,466 |
The intermediate set was filtered by removing monoexonic models, models with more than 50% of their length in a masked region, models annotated as retrodomains, and models lacking functional annotation with EnTAP. The high-confidence set includes the intermediate set plus monoexonic and multiexonic models derived from transcript evidence, with any fully nested gene models removed.
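The four exclusion criteria for the intermediate set amount to a single predicate applied to each gene model. The sketch below shows one way to express it; it is a hypothetical illustration, not the authors' pipeline (which relied on tools such as gFACs and EnTAP). The `GeneModel` class, its field names, and `passes_intermediate_filter` are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Hypothetical per-model summary; field names are illustrative only."""
    gene_id: str
    n_exons: int
    masked_fraction: float          # fraction of model length in masked sequence
    is_retrodomain: bool            # annotated as a retrodomain
    has_functional_annotation: bool # received a functional annotation


def passes_intermediate_filter(model, max_masked=0.5):
    """Keep a model only if it is multiexonic, has at most `max_masked` of
    its length masked, is not a retrodomain, and is functionally annotated."""
    return (
        model.n_exons > 1
        and model.masked_fraction <= max_masked
        and not model.is_retrodomain
        and model.has_functional_annotation
    )


# Toy records, invented purely to exercise each criterion.
toy_models = [
    GeneModel("g1", 4, 0.10, False, True),   # kept
    GeneModel("g2", 1, 0.00, False, True),   # monoexonic
    GeneModel("g3", 3, 0.75, False, True),   # mostly masked
    GeneModel("g4", 5, 0.20, True, True),    # retrodomain
    GeneModel("g5", 2, 0.05, False, False),  # no functional annotation
]

kept = [m.gene_id for m in toy_models if passes_intermediate_filter(m)]

if __name__ == "__main__":
    print(kept)
```

Because monoexonic models are excluded at this stage, the intermediate set in the table contains only multiexonic models (32,360); monoexonic models reappear in the high-confidence set only when supported by transcript evidence.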
Figure 2. Repeat and gene density of giant sequoia 2.0. Gene density is shown in green and repeat density in purple, both plotted in 1 Mb windows. Locations of the consensus NLR genes are indicated by black bars.
Figure 3. Gene family evolution along a gymnosperm cladogram. Numbers of expanded (bright blue, above branches) and contracted (light blue, below branches) orthogroups are indicated along each branch. Giant sequoia (Segi) experienced an overall expansion, with 3,671 orthogroups expanding and 843 contracting.
Figure 4. Rapid evolution along a gymnosperm cladogram. Numbers on each branch indicate the number of rapidly evolving gene families. Giant sequoia (Segi) has experienced rapid evolution in 363 gene families.
Figure 5. Maximum likelihood tree of encoded NB-ARC domains of the 300 consensus NLR genes detected in the giant sequoia 2.0 assembly. Red branches indicate bootstrap support greater than 80%. The inner ring indicates predicted N-terminal TIR (blue) or CC (orange) domains. One of the 300 NLRs contains motifs present in both TIR and CC NLR proteins (pink). The outer ring indicates the presence of an RPW8 motif, which is found in the RNL sub-group of CC-NLRs. The tree is available at: http://itol.embl.de/shared/acr242