David B Neale1, Aleksey V Zimin2,3, Sumaira Zaman4,5, Alison D Scott1, Bikash Shrestha4, Rachael E Workman6, Daniela Puiu2,3, Brian J Allen1, Zane J Moore1, Manoj K Sekhwal7, Amanda R De La Torre7, Patrick E McGuire1, Emily Burns8, Winston Timp2,3,6, Jill L Wegrzyn4,9, Steven L Salzberg2,3,10,11.
Abstract
Sequencing, assembly, and annotation of the 26.5 Gbp hexaploid genome of coast redwood (Sequoia sempervirens) were completed, leading toward discovery of genes related to climate adaptation and investigation of the origin of the hexaploid genome. Deep-coverage short-read Illumina sequencing data from haploid tissue from a single seed were combined with long-read Oxford Nanopore Technologies sequencing data from diploid needle tissue to create an initial assembly, which was then scaffolded using proximity ligation data to produce a highly contiguous final assembly, SESE 2.1, with a scaffold N50 size of 44.9 Mbp. The assembly included several scaffolds that span entire chromosome arms, confirmed by the presence of telomere and centromere sequences on the ends of the scaffolds. The structural annotation produced 118,906 genes, 113 of which contain introns that exceed 500 Kbp in length, with one intron reaching 2 Mbp. Nearly 19 Gbp of the genome represented repetitive content, the vast majority of it characterized as long terminal repeats, with a 2.9:1 ratio of Copia to Gypsy elements that may aid in gene expression control. Comparison of coast redwood to other conifers revealed species-specific expansions for a plethora of abiotic and biotic stress response genes, including those involved in fungal disease resistance, detoxification, and physical injury/structural remodeling, and others supporting flavonoid biosynthesis. Analysis of multiple genes that exist in triplicate in coast redwood but only once in its diploid relative, giant sequoia, supports a previous hypothesis that the hexaploidy is the result of autopolyploidy rather than any hybridizations with separate but closely related conifer species.
Keywords: Sequoia sempervirens; coast redwood; conifer; genome assembly and annotation; gymnosperm; hexaploid genome
Year: 2022 PMID: 35100403 PMCID: PMC8728005 DOI: 10.1093/g3journal/jkab380
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.542
Sequence data generated for the coast redwood genome
| Data type | Average read length (bp) | Number of reads | Genome coverage | Source of tissue |
|---|---|---|---|---|
| Illumina paired end | 2 × 150 | 21,588,293,516 | 122x | 1 |
| Dovetail Chicago | 2 × 151 | 6,924,430,790 | 39x | 2 |
| Dovetail Hi-C | 2 × 151 | 6,918,097,308 | 39x | 2 |
| ONT | 7,775 | 74,815,884 | 22x | 2 |
Coverage is computed using the estimated genome size of 26.5 Gbp as described in the following text.
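The coverage column can be reproduced from the other two columns as read length × number of reads ÷ genome size. The sketch below is only an arithmetic check on the tabulated values, not the authors' pipeline; it assumes that the read counts for paired-end libraries are per read rather than per pair, which is the interpretation that matches the reported coverage. The names `LIBRARIES` and `coverage` are illustrative.

```python
GENOME_SIZE_BP = 26.5e9  # estimated genome size stated in the text

# (data type, average read length in bp, number of reads), taken from the table.
# Assumption: counts are individual reads, so a 2 x 150 library contributes
# 150 bp per counted read.
LIBRARIES = [
    ("Illumina paired end", 150, 21_588_293_516),
    ("Dovetail Chicago", 151, 6_924_430_790),
    ("Dovetail Hi-C", 151, 6_918_097_308),
    ("ONT", 7_775, 74_815_884),
]


def coverage(read_length_bp, n_reads, genome_size_bp=GENOME_SIZE_BP):
    """Return fold coverage: total sequenced bases divided by genome size."""
    return read_length_bp * n_reads / genome_size_bp


COVERAGE = {name: coverage(length, n) for name, length, n in LIBRARIES}

for name, cov in COVERAGE.items():
    print(f"{name}: {cov:.0f}x")
```

Rounded to whole numbers, these values agree with the 122x, 39x, 39x, and 22x listed in the table.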
Results for processing of Illumina reads into super-reads and mega-reads
| Derived data type | Average read length (bp) | Number of reads | Genome coverage |
|---|---|---|---|
| Super-reads | 352 | 212,922,309 | 2.8x |
| Mega-reads | 6670 | 71,380,616 | 18x |
MaSuRCA reduced the data from over 21 billion Illumina reads into ∼213 million super-reads, preserving the information contained in the Illumina data. The super-reads were then used to transform ONT reads into highly accurate mega-reads.
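To put the reduction in numbers, the following sketch computes the ratio of Illumina reads to super-reads and rechecks the coverage of the derived data against the 26.5 Gbp genome size. It works only from the tabulated counts and does not model how MaSuRCA builds super-reads or mega-reads; the variable names are illustrative.

```python
GENOME_SIZE_BP = 26.5e9          # estimated genome size stated in the text
ILLUMINA_READS = 21_588_293_516  # from the sequence data table

# Derived data type -> (average read length in bp, number of reads)
DERIVED = {
    "Super-reads": (352, 212_922_309),
    "Mega-reads": (6_670, 71_380_616),
}

# How many Illumina reads are represented, on average, by one super-read.
reduction_factor = ILLUMINA_READS / DERIVED["Super-reads"][1]

# Fold coverage of each derived data type.
derived_coverage = {
    name: length * count / GENOME_SIZE_BP
    for name, (length, count) in DERIVED.items()
}

print(f"Illumina reads per super-read: {reduction_factor:.1f}")
for name, cov in derived_coverage.items():
    print(f"{name}: {cov:.1f}x")
```

The read count drops by roughly two orders of magnitude, and the computed coverage is consistent with the 2.8x and 18x in the table.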
Figure 1. The distribution of 31-mers in coast redwood Illumina short-read data collected from a haploid sample. The primary peak is at X = 57. The number of 31-mers is estimated using the area under the curve, excluding the low-count k-mers that likely are due to errors in base-calling. The three peaks (X = 57, ∼118, and ∼180) reflect the hexaploid nature of the genome; the second and third peaks contain 31-mers that are identical in two or three of the subgenomes.
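A common way to turn such a k-mer histogram into a genome size estimate is to sum the k-mer occurrences above an error cutoff and divide by the depth of the primary peak. The sketch below illustrates that general idea on a toy histogram; it is an assumption that this matches the authors' exact procedure, and the function name, the cutoff, and the toy counts are all hypothetical.

```python
def estimate_genome_size(histogram, min_count):
    """Estimate genome size (in k-mers) from a k-mer count histogram.

    histogram maps multiplicity -> number of distinct k-mers seen that often.
    K-mers with multiplicity below min_count are treated as sequencing errors.
    """
    kept = {mult: n for mult, n in histogram.items() if mult >= min_count}
    # Weighted area under the curve: total k-mer occurrences in the reads.
    total_occurrences = sum(mult * n for mult, n in kept.items())
    # Primary peak: the multiplicity shared by the most distinct k-mers.
    peak = max(kept, key=kept.get)
    return total_occurrences / peak, peak


# Toy data only: an error tail at low counts, a primary peak at 57, and
# smaller peaks at 2x and 3x for k-mers shared by two or three subgenomes.
toy_histogram = {1: 1000, 2: 400, 3: 50, 57: 100, 114: 20, 171: 10}

size, peak = estimate_genome_size(toy_histogram, min_count=10)
print(peak, size)
```

In the toy example, k-mers at the 2x and 3x peaks contribute two and three genomic positions each, which is why shared k-mers across subgenomes still count toward the total size.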
Assembly statistics for coast redwood, for the initial v1.0 assembly, the scaffolded v2.1 assembly, and the final v2.2 assembly
| Assembly version | Total sequence (bp) | N50 contig size (bp) | N50 scaffold size (bp) | Number of contigs | Number of scaffolds |
|---|---|---|---|---|---|
| v1.0 | 26,454,318,454 | 96,840 | 109,446 | 548,924 | 517,860 |
| v2.1 | 26,454,315,051 | 96,840 | 52,996,825 | 548,924 | 393,401 |
| v2.2 | 26,454,268,401 | 96,840 | 44,944,384 | 548,915 | 393,407 |
For consistency, we computed the N50 sizes for contigs and scaffolds using 26.5 Gbp as the estimated genome size.
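Computing N50 against a fixed genome size rather than against each assembly's own total length keeps the values comparable across versions. A minimal sketch of that calculation follows, using a toy list of lengths; the function name `n50` and its `genome_size` parameter are illustrative and not taken from the authors' tooling.

```python
def n50(lengths, genome_size=None):
    """Return the N50 of a list of sequence lengths.

    If genome_size is given, half of that value is used as the target
    instead of half of the summed lengths. Returns 0 when the sequences
    together cover less than half of the target size.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


toy_lengths = [10, 8, 6, 4, 2]       # toy data, sums to 30
print(n50(toy_lengths))              # target is half of 30
print(n50(toy_lengths, 40))          # target is half of a fixed size of 40
```

With the fixed size larger than the assembled total, more sequences are needed to reach the halfway point, so the reported N50 is smaller than the conventional one.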
Transcriptome statistics: Illumina short-read and PacBio Iso-Seq needle tissue assemblies
| Statistic | Value |
|---|---|
| Total transcripts | 798,384 |
| N50 (bp) | 1,382 |
| BUSCO completeness | 94.4% |
| Transcripts (frame selected) | 256,322 |
| Transcripts (filtered for completeness) | 202,414 |
| N50 (bp) | 918 |
| BUSCO completeness | 90.8% |
| Total full-length | 132,195 |
| Aligned transcripts | 114,113 |
| N50 (bp) | 933 |
| BUSCO completeness | 68.9% |
| Max intron length (bp) | 1,977,916 |
| Total multiexon transcripts | 73,578 |
| Total single-exon transcripts | 40,535 |
Structural genome annotation summary
| Statistic | Predicted gene space (Braker) | Filtered gene space (structure and function) | Final gene space (predictions and transcriptome) |
|---|---|---|---|
| Total genes | 3,657,738 | 108,231 | 118,906 |
| Average gene size (bp) | 1,317 | 6,157 | 11,894 |
| Average CDS length (bp) | 663 | 1,151 | 1,150 |
| Average number of exons | 2.57 | 3.89 | 4.17 |
| Average intron lengths (bp) | 2,496 | 2,422 | 4,640 |
| Maximum intron length (bp) | 389,578 | 245,956 | 1,950,289 |
| Total single-exon genes | 3,048,838 | 31,110 | 32,122 |
| Total multiexon genes | 608,900 | 77,121 | 86,784 |
| BUSCO completeness (%) | 44.7 | 32.9 | 65.5 |
Figure 2. Frequency of different tree topologies when comparing multicopy genes in coast redwood to orthologous genes in giant sequoia, dawn redwood, and other species.