Peter M Thielen, Amanda L Pendleton, Robert A Player, Kenneth V Bowden, Thomas J Lawton, Jennifer H Wisecaver.
Abstract
Setaria viridis (green foxtail) is an important model system for improving cereal crops due to its diploid genome, ease of cultivation, and use of C4 photosynthesis. The S. viridis accession ME034V is exceptionally transformable, but the lack of a sequenced genome for this accession has limited its utility. We present a 397 Mb highly contiguous de novo assembly of ME034V using ultra-long nanopore sequencing technology (read N50 = 41 kb). We estimate that this genome is largely complete based on our updated k-mer-based genome size estimate of 401 Mb for S. viridis. Genome annotation identified 37,908 protein-coding genes and >300k repetitive elements comprising 46% of the genome. We compared the ME034V assembly with two other previously sequenced Setaria genomes as well as with a diversity panel of 235 S. viridis accessions. We found the genome assemblies to be largely syntenic, but numerous unique polymorphic structural variants were discovered. Several ME034V deletions may be associated with recent retrotransposition of copia and gypsy LTR repeat families, as evidenced by their low genotype frequencies in the sampled population. Lastly, we performed a phylogenomic analysis to identify gene families that have expanded in Setaria, including those involved in specialized metabolism and plant defense response. The high continuity of the ME034V genome assembly validates the utility of ultra-long DNA sequencing to improve genetic resources for emerging model organisms. Structural variation present in Setaria illustrates the importance of obtaining the proper genome reference for genetic experiments. Thus, we anticipate that the ME034V genome will be of significant utility for the Setaria research community.
Keywords: MinION; Oxford Nanopore Technologies; long-read assembly; structural variation
Year: 2020 PMID: 32694197 PMCID: PMC7534418 DOI: 10.1534/g3.120.401345
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Multistage assembly pipeline. ONT long reads were assembled de novo to generate assembly v0.1. The resulting assembly was polished first with ONT reads and then with Illumina short reads to create assemblies v0.2 and v0.3, respectively. The assembly contigs were scaffolded into chromosome-level pseudochromosomes to generate the final assembly v1.0.
Nuclear genome assembly characteristics of ME034V v1.0
| | Total Length (bp) | Contig Count | Contig N50 (bp) | Unspanned Gaps | % Masked |
|---|---|---|---|---|---|
| All | 397,031,521 | 44 | 19,521,898 | 35 | 46.02% |
| Chromosome 1 | 42,132,932 | 3 | 24,807,925 | 2 | 43.18% |
| Chromosome 2 | 48,726,069 | 3 | 22,686,334 | 2 | 43.03% |
| Chromosome 3 | 49,814,079 | 3 | 26,462,178 | 2 | 47.83% |
| Chromosome 4 | 39,642,072 | 2 | 21,223,292 | 1 | 53.00% |
| Chromosome 5 | 46,382,547 | 4 | 25,487,884 | 3 | 39.71% |
| Chromosome 6 | 36,113,639 | 6 | 11,701,730 | 5 | 54.03% |
| Chromosome 7 | 35,147,422 | 8 | 8,095,101 | 7 | 41.38% |
| Chromosome 8 | 42,437,421 | 13 | 7,531,332 | 12 | 56.60% |
| Chromosome 9 | 56,635,340 | 2 | 29,513,894 | 1 | 39.28% |
Unspanned Gaps: gaps without ONT read support following reference (A10)-guided assignment of contiguous contigs.
% Masked: determined by RepeatMasker.
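The table reports contig N50 values and per-chromosome totals. As a minimal sketch of how such figures relate, the Python below defines N50 in the usual way and checks that the per-chromosome rows sum to the "All" row. The toy contig lengths and the names `n50`, `toy_contigs`, and `chromosomes` are illustrative assumptions, not part of the authors' pipeline; the individual ME034V contig lengths are not listed here, so the reported N50 values cannot be recomputed from this table alone.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical contig lengths (not from the paper), for illustration only.
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)  # cumulative 50, then 90 >= 75

# Values copied from the table rows: (total length in bp, contigs, unspanned gaps).
chromosomes = {
    1: (42_132_932, 3, 2),
    2: (48_726_069, 3, 2),
    3: (49_814_079, 3, 2),
    4: (39_642_072, 2, 1),
    5: (46_382_547, 4, 3),
    6: (36_113_639, 6, 5),
    7: (35_147_422, 8, 7),
    8: (42_437_421, 13, 12),
    9: (56_635_340, 2, 1),
}

total_length = sum(row[0] for row in chromosomes.values())
total_contigs = sum(row[1] for row in chromosomes.values())
total_gaps = sum(row[2] for row in chromosomes.values())

# Assembly length relative to the 401 Mb k-mer-based estimate in the abstract.
completeness = total_length / 401_000_000

print(toy_n50, total_length, total_contigs, total_gaps, f"{completeness:.1%}")
```

In every row the number of unspanned gaps is one less than the contig count, which is what one would expect if each chromosome's contigs are laid end to end.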
Summary statistics of ME034V v1.0 primary gene models
| Gene model statistics | |
|---|---|
| No. protein-coding genes | 37,908 |
| No. transcripts | 49,829 |
| Mean gene length | 2,436 bp |
| Avg. no. exons per gene | 4.06 |
| Mean exon length | 389.04 bp |
| No. genes supported by RNA-Seq | 23,724 (62.58%) |
| No. genes with functional annotation | 25,628 (67.61%) |
| No. genes assigned to an orthogroup | 36,521 (96.34%) |
Supported by RNA-Seq: TPM > 1 from merged ME034V RNA-Seq data.
Functional annotation: assigned one or more InterPro or GO terms.
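The percentages in the gene model table follow directly from the gene counts. The short Python check below reproduces them from the table's own numbers; the variable names are illustrative assumptions, and the transcripts-per-gene ratio is a derived figure that the table itself does not report.

```python
# Counts copied from the gene model statistics table.
TOTAL_GENES = 37_908
TOTAL_TRANSCRIPTS = 49_829

counts = {
    "supported by RNA-Seq": 23_724,
    "with functional annotation": 25_628,
    "assigned to an orthogroup": 36_521,
}

# Percentage of all protein-coding genes in each category, to two decimals.
percentages = {
    name: round(100 * n / TOTAL_GENES, 2) for name, n in counts.items()
}

# Derived ratio (not stated in the table): mean transcripts per gene.
transcripts_per_gene = TOTAL_TRANSCRIPTS / TOTAL_GENES

for name, pct in percentages.items():
    print(f"Genes {name}: {pct:.2f}%")
print(f"Transcripts per gene: {transcripts_per_gene:.2f}")
```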
Figure 2. ME034V genome assembly gene and repeat content. a) Gene and repeat density across the genome assembly. b) Repeat abundance by repeat type and genome location. Repeats present in genic regions are further broken down based on whether the genic repeat is on the same strand or different strand compared to the gene.
Figure 3. Whole genome alignments of three different Setaria genome assemblies. a) ME034V vs. A10; b) ME034V vs. S. italica; c) S. italica vs. A10. Numbers along axes indicate chromosomes. MUMmer (C = 100, L = 1000) alignment matches in the forward and reverse orientation are provided as red and blue circles, respectively.
Figure 4. Length distribution of deletions in ME034V assembly compared to average size of common transposable elements. Histogram of length distributions of predicted deletions (gray bars) overlapped by density plots depicting the size distribution of annotated copia (orange) and gypsy (blue) retrotransposons in the ME034V assembly.
Figure 5. Exemplar structural variants in the ME034V genome. a) Synteny plot revealing a copia insertion at window Chr03:33,177,187-33,245,787 in ME034V that is missing in the homologous locus in A10, window Chr03:34,798,544-34,868,544 (see also Figure S5; Figure S6). b) Copia insertion unique to A10 (window Chr_03:28,599,689-28,645,328) and absent in ME034V (window Chr03:29,091,552-29,132,959) (see also Figure S7; Figure S8). c) Presence of a gypsy element shared between ME034V (window Chr02:32,419,004-32,465,022) and A10 (window Chr_02:31,649,962-31,696,029) is indicated by near-perfect alignment in the synteny plots (see also Figure S9; Figure S10). For all synteny plots, blue-gray bars connect the two genomes when DNA sequence with >70% identity is observed (red line indicates threshold). Blue and cyan bars above the top track indicate sequence identity along the chromosomal segment from 0–100%, with the color indicating whether the match is single copy (blue) or double copy (cyan). Purple rectangles indicate genes and green rectangles indicate LTRs (alternating hues aid in distinguishing between unique elements). The strandedness of the genes and LTRs is indicated by placing elements encoded on the forward strand higher relative to elements encoded on the reverse strand. Putative target site duplications (TSD) are indicated in collinear regions in (a) and (b). Orange rectangles in part c indicate the 1:1 homologous region absent in accessions Estep_ME018 and Feldman_MF156; support for this deletion is visualized by split reads (cyan) at the left (d) and right (e) breakpoints of the read alignment. Read pairs are connected with gray lines, and reads on the forward and reverse strand are colored pink and purple, respectively.
Figure 6. Analysis of gene families in Setaria. a) Comparison of orthogroups in ME034V, A10, and S. italica. Conserved orthogroups (green) were present one or more times in all 27 genomes in the analysis. Monocot-specific orthogroups (blue) were present in two or more monocot genomes and absent from all others. b) The species phylogeny was taken from the PLAZA 4.0 monocots online database. Branch thicknesses and colors are scaled based on the number of duplication events predicted to have occurred at the descendant node; thinner, blue branches indicate fewer duplications; thicker, red branches indicate more (see Table S10). The 742 duplications predicted at the Setaria ancestral node are indicated with the gray arrow. Tests for enrichment of functional categories were performed on this gene set: c) top ten most significantly enriched GO categories (see Table S11); d) all significantly enriched plantiSMASH specialized metabolism enzyme classes (see Table S12).