Cassandra L Ettinger, Frank J Byrne, Matthew A Collin, Derreck Carter-House, Linda L Walling, Peter W Atkinson, Rick A Redak, Jason E Stajich.
Abstract
Homalodisca vitripennis (Hemiptera: Cicadellidae), known as the glassy-winged sharpshooter, is a xylem-feeding leafhopper and an important agricultural pest as a vector of Xylella fastidiosa, which causes Pierce's disease in grapes and a variety of other scorch diseases. The current H. vitripennis reference genome from the Baylor College of Medicine's i5k pilot project is a 1.4-Gb assembly with 110,000 scaffolds, which still has significant gaps, making identification of genes difficult. To improve on this effort, we combined Oxford Nanopore long-read sequencing with Illumina sequencing reads to generate a better assembly and first-pass annotation of the whole genome sequence of a wild-caught Californian (Tulare County) individual of H. vitripennis. The improved reference genome assembly for H. vitripennis is 1.93 Gb in length (21,254 scaffolds, N50 = 650 kb, BUSCO completeness = 94.3%), with 33.06% of the genome masked as repetitive. In total, 108,762 gene models were predicted, including 98,296 protein-coding genes and 10,466 tRNA genes. As an additional community resource, we identified 27 orthologous candidate genes of interest for future experimental work, including phenotypic marker genes like white. Furthermore, as part of the assembly process, we generated four endosymbiont metagenome-assembled genomes (MAGs), including a high-quality, near-complete 1.7-Mb Wolbachia sp. genome (1 scaffold, CheckM completeness = 99.4%). The improved genome assembly and annotation for H. vitripennis, curated set of candidate genes, and endosymbiont MAGs will be invaluable resources for future research on H. vitripennis.
Keywords: Glassy-winged sharpshooter; Hemiptera; Homalodisca vitripennis; Wolbachia; endosymbionts; genome annotation; genome assembly; insect vector; leafhopper
Year: 2021 PMID: 34568917 PMCID: PMC8496328 DOI: 10.1093/g3journal/jkab255
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Genome assembly assessment and comparison. (A) k-mer frequency histogram output from findGSE using k = 21. The gray line represents the observed k-mer frequency, the teal line represents the fit for the heterozygous k-mer peak, the blue line represents the fitted model without k-mer correction, and the red line represents the fitted model with k-mer correction, which is used to estimate the genome size. (B) Plot depicting cumulative sequence length (y-axis) as the number of scaffolds increases (x-axis), comparing the H. vitripennis draft genome in this study to the reference genome from the i5k project. (C) Stacked bar charts depicting BUSCO analyses for the eukaryota_odb10 and hemiptera_odb10 gene sets for both the H. vitripennis genome reported here and the i5k reference genome. Bars show the genes found in each assembly as a percentage of the total gene set and are colored by BUSCO status (missing = gray, fragmented = yellow, complete and duplicated = green, and complete and single-copy = blue).
Estimates of genome heterozygosity, length, and repeat content
| k | Estimate | GenomeScope | findGSE |
|---|---|---|---|
| 19 | Heterozygosity (%) | 1.65 | 1.26 |
| 19 | Genome haploid size (Gb) | 1.74 | 1.9 |
| 19 | Repeat (%) | 45.86 | 37.79 |
| 21 | Heterozygosity (%) | 1.68 | 1.29 |
| 21 | Genome haploid size (Gb) | 1.74 | 1.96 |
| 21 | Repeat (%) | 36.91 | 31.3 |
| 23 | Heterozygosity (%) | 1.65 | 1.29 |
| 23 | Genome haploid size (Gb) | 1.75 | 1.93 |
| 23 | Repeat (%) | 34.28 | 28.93 |
| 25 | Heterozygosity (%) | 1.6 | 1.27 |
| 25 | Genome haploid size (Gb) | 1.75 | 1.89 |
| 25 | Repeat (%) | 33.14 | 27.55 |
| 27 | Heterozygosity (%) | 1.56 | 1.16 |
| 27 | Genome haploid size (Gb) | 1.75 | 1.96 |
| 27 | Repeat (%) | 32.37 | 27.03 |
These estimates include the percentage heterozygosity, haploid genome size and percentage of repeat content based on k-mer analysis using GenomeScope and findGSE for a range of k-mers (k = 19, 21, 23, 25, 27).
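As a rough illustration of the input these tools work from, the sketch below builds a k-mer frequency spectrum (how many distinct k-mers occur once, twice, and so on) from a handful of reads. It is a toy, not the GenomeScope or findGSE procedure: the function name `kmer_spectrum` and the example reads are hypothetical, and, unlike typical k-mer counters, it does not merge reverse-complement (canonical) k-mers.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases.
            if "N" not in kmer:
                counts[kmer] += 1
    # Collapse per-k-mer counts into a histogram of multiplicities.
    return Counter(counts.values())


# Hypothetical toy reads; real analyses use k = 19 to 27 on full read sets.
reads = ["ACGTACGT", "CGTACGTA"]
spectrum = kmer_spectrum(reads, 3)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

Model-fitting tools then fit heterozygous and homozygous peaks to a histogram of this kind to estimate genome size, heterozygosity, and repeat content.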
Assembly statistics and assessment
| Tool | Statistic | This study | i5k |
|---|---|---|---|
| QUAST | # contigs | 34,952 | 149,799 |
| | # scaffolds (≥0 bp) | 21,254 | 111,110 |
| | # scaffolds (≥1000 bp) | 19,715 | 59,570 |
| | # scaffolds (≥5000 bp) | 14,959 | 13,241 |
| | # scaffolds (≥10,000 bp) | 12,524 | 7,359 |
| | # scaffolds (≥25,000 bp) | 8,796 | 4,438 |
| | # scaffolds (≥50,000 bp) | 5,168 | 3,132 |
| | Total length (≥0 bp) | 1,930,946,379 | 1,445,215,006 |
| | Total length (≥1000 bp) | 1,929,918,132 | 1,418,424,409 |
| | Total length (≥5000 bp) | 1,916,091,697 | 1,325,420,810 |
| | Total length (≥10,000 bp) | 1,898,148,486 | 1,285,066,097 |
| | Total length (≥25,000 bp) | 1,833,358,540 | 1,240,043,308 |
| | Total length (≥50,000 bp) | 1,703,319,989 | 1,194,181,890 |
| | Largest contig | 7,378,560 | 7,131,305 |
| | GC (%) | 32.87 | 32.65 |
| | N50 | 650,435 | 656,130 |
| | N75 | 171,660 | 211,051 |
| | L50 | 750 | 542 |
| | L75 | 2,178 | 1,423 |
| | # N's per 100 kbp | 71.13 | 3,005.46 |
| BUSCO: hemiptera_odb10 | Complete BUSCOs (C) | 2,367 (94.3%) | 2,306 (91.9%) |
| | Complete and single-copy BUSCOs (S) | 2,152 (85.7%) | 2,247 (89.5%) |
| | Complete and duplicated BUSCOs (D) | 215 (8.6%) | 59 (2.4%) |
| | Fragmented BUSCOs (F) | 108 (4.3%) | 150 (6.0%) |
| | Missing BUSCOs (M) | 35 (1.4%) | 54 (2.1%) |
| | Total BUSCO groups searched | 2,510 | 2,510 |
| BUSCO: eukaryota_odb10 | Complete BUSCOs (C) | 218 (85.5%) | 236 (92.6%) |
| | Complete and single-copy BUSCOs (S) | 191 (74.9%) | 230 (90.2%) |
| | Complete and duplicated BUSCOs (D) | 27 (10.6%) | 6 (2.4%) |
| | Fragmented BUSCOs (F) | 29 (11.4%) | 10 (3.9%) |
| | Missing BUSCOs (M) | 8 (3.1%) | 9 (3.5%) |
| | Total BUSCO groups searched | 255 | 255 |
Various statistics calculated by QUAST for the assembly in this study and the i5k reference assembly are provided here including the number of contigs in the assembly, the number of scaffolds of various lengths, the total assembly length, percent GC, the N50, and the L50. All statistics from QUAST are based on contigs of size ≥3000 bp, unless specifically noted (e.g., “# contigs (≥0 bp)” and “Total length (≥0 bp)” include all contigs in each assembly). We also report here the results of the BUSCO assessment of both assemblies using the hemiptera_odb10 and eukaryota_odb10 gene sets.
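For readers unfamiliar with the contiguity statistics in this table, the following minimal sketch shows the standard definitions: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly, and L50 is the number of sequences in that set (N75 and L75 use 75% instead). The function name `nx_lx` and the toy lengths are hypothetical, and this is not QUAST's own code.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a collection of sequence lengths.

    fraction=0.5 gives (N50, L50); fraction=0.75 gives (N75, L75).
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # `length` is the shortest sequence needed to reach the threshold.
            return length, count
    raise ValueError("lengths must be non-empty")


# Hypothetical toy assembly (lengths in bp), total 300 bp.
toy = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(toy)
n75, l75 = nx_lx(toy, fraction=0.75)
print(n50, l50, n75, l75)
```

A lower L50 at a similar N50, as reported for the i5k assembly, means fewer sequences are needed to reach half of that assembly's own total length. The two totals differ by roughly 0.5 Gb, so the values are not directly comparable across the two assemblies.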
Figure 2. Endosymbiont assessment in genome and identification. (A) BlobTools2 visualization of H. vitripennis scaffolds showing taxa-colored GC coverage plot. Each circle represents a scaffold in the assembly, scaled by length, and colored by superkingdom (eukaryota = blue, bacteria = orange, viruses = yellow, and unidentified = gray). On the x-axis is the average GC content of each scaffold and on the y-axis is the average coverage of each scaffold to the draft assembly. The marginal histograms show cumulative genome length (Mb) for coverage (y-axis) and GC content bins (x-axis). (B) Placement of Wolbachia sp. GWSS-01 (colored in orange) in the GTDB phylogenetic tree. (C) Placement of Ca. Baumannia cicadellinicola GWSS-02 (colored in orange) in the GTDB phylogenetic tree. (D) Placement of Ca. Sulcia muelleri GWSS-03 and GWSS-04 (colored in orange) in the GTDB phylogenetic tree.
Genome feature summary for endosymbiont MAGs
| MAG ID | Taxonomy | Total length (bp) | Number of scaffolds | N50 | Mean coverage | GC (%) | Number of genes | 16S rRNA copy present | Completion (%) | Redundancy (%) | Reference alignment (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GWSS-01 | Wolbachia sp. | 1,712,771 | 1 | 1,712,771 | 93.10 | 33.66 | 1,691 | Yes | 99.36 | 1.71 | NA |
| GWSS-02 | Ca. Baumannia cicadellinicola | 610,888 | 12 | 78,712 | 1280.76 | 32.65 | 531 | Yes | 66.46 | 1.25 | 66.40 |
| GWSS-03 | Ca. Sulcia muelleri | 209,259 | 1 | 209,259 | 1592.13 | 24.95 | 199 | Yes | 25.86 | 0 | 70.55 |
| GWSS-04 | Ca. Sulcia muelleri | 179,112 | 6 | 41,952 | 786.73 | 26.84 | 148 | No | 17.76 | 1.34 | 33.10 |
Genomic characteristics are summarized for each MAG, including putative taxonomic identity, length (bp), number of scaffolds, N50, mean coverage, percent GC content, number of genes, presence of 16S ribosomal RNA gene, completion and contamination estimates as generated by CheckM, and alignment to an existing reference genome using D-GENIES. MAGs are sorted by percent completion.
Genome annotation statistics
| Statistic | Value |
|---|---|
| Total gene models | 108,762 |
| Total number protein-coding genes | 98,296 |
| Total number of tRNAs | 10,466 |
| Total number of complete CDS | 89,929 |
| Total number of exons | 351,975 |
| Total number of CDS | 322,333 |
| Mean CDS AED | 0.002 |
| Mean mRNA AED | 0.024 |
| Mean gene size (bp) | 2,958.4 |
| Mean exon length (bp) | 214.9 |
| Mean CDS length (bp) | 193.9 |
| Mean 5'UTR length (bp) | 148.9 |
| Mean 3'UTR length (bp) | 810.9 |
| Mean tRNA length (bp) | 70.2 |
| Total number of gene models with 2 isoforms | 628 |
| Total number of gene models with 3 isoforms | 52 |
| Total number of gene models with 4 isoforms | 5 |
| Proteins with PFAM domain (%) | 14.4 |
| Proteins with InterProScan Hit (%) | 23 |
| Proteins with EggNog Hit (%) | 24.3 |
A summary of genome annotation results is reported here, including the total number of gene models, protein-coding genes, tRNAs, complete (i.e., having both a start and stop codon) coding sequences (CDS), exons, and CDS regions; the mean CDS and mRNA annotation edit distances (AED); the mean gene size (bp), exon length (bp), CDS length (bp), 5'-UTR length (bp), 3'-UTR length (bp), and tRNA length (bp); the total number of gene models with 2, 3, or 4 isoforms; and the percentage of proteins with a PFAM domain, InterProScan match, or EggNog match.
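The counts in this table can be cross-checked with simple arithmetic. The sketch below uses only values reported in the table. The derived ratios are illustrative and are not statistics reported by the authors, and dividing exons by protein-coding genes ignores tRNA exons and alternative isoforms, which is a simplifying assumption.

```python
# Values copied from the annotation statistics table.
total_gene_models = 108_762
protein_coding = 98_296
trnas = 10_466
complete_cds = 89_929
exons = 351_975

# Protein-coding genes and tRNAs should account for every gene model.
assert protein_coding + trnas == total_gene_models

# Illustrative derived ratios (not reported in the source).
complete_fraction = complete_cds / protein_coding
exons_per_gene = exons / protein_coding

print(f"Complete CDS: {complete_fraction:.1%} of protein-coding genes")
print(f"Approximate exons per protein-coding gene: {exons_per_gene:.2f}")
```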
Figure 3. Repetitive element diversity and divergence landscape. (A) A barplot representing the percent of the genome composed of elements from each repeat class. (B) A stacked barplot representing the percent of the genome made of repeat elements from each repeat class binned by 1% sequence divergence (CpG adjusted Kimura divergence). Bars are colored by repeat class (LINE = pink, SINE = orange, LTR = green, DNA = light blue, RC = yellow, and Unknown = dark blue). Abbreviations: long interspersed nuclear element (LINE), short interspersed nuclear element (SINE), long terminal repeat retrotransposon (LTR), DNA transposon (DNA), and rolling-circle transposon (RC).
Orthologous candidate genes identified for use in genetic analyses
| Gene name | Gene ID | Category | Scaffold | Start | Stop | Strand |
|---|---|---|---|---|---|---|
| | J6590_063422 | Eye color marker | scaffold_912 | 460005 | 478693 | − |
| | J6590_023567 | Eye color marker | scaffold_152 | 394070 | 408336 | + |
| | J6590_025764 | Eye color marker | scaffold_175 | 597341 | 620522 | − |
| | J6590_079319 | Eye color marker | scaffold_1776 | 27869 | 36915 | − |
| | J6590_010106 | Eye color marker | scaffold_46 | 401764 | 405463 | + |
| | J6590_030756 | Eye color marker | scaffold_237 | 304451 | 312309 | + |
| | J6590_021669 | Eye color marker | scaffold_136 | 1442727 | 1477619 | − |
| | J6590_059208 | Eye color marker | scaffold_778 | 21807 | 32946 | + |
| | J6590_086284 | Eye color marker | scaffold_2636 | 59559 | 69160 | + |
| | J6590_055645 | Body color marker | scaffold_679 | 520340 | 534402 | − |
| | J6590_045190 | Wing shape marker | scaffold_458 | 853125 | 882297 | + |
| | J6590_040001 | Wing shape marker | scaffold_363 | 1027789 | 1033916 | + |
| | J6590_019057 | Wing shape marker | scaffold_113 | 220253 | 229632 | + |
| | J6590_017333 | Eye shape marker | scaffold_97 | 1592445 | 1593704 | − |
| | J6590_029566 | Promoter of interest | scaffold_221 | 1237702 | 1242922 | + |
| | J6590_045793 | Promoter of interest | scaffold_469 | 433018 | 434887 | − |
| | J6590_054039 | Promoter of interest | scaffold_640 | 212856 | 215547 | − |
| | J6590_054038 | Promoter of interest | scaffold_640 | 189320 | 196363 | − |
| | J6590_108590 | Promoter of interest | scaffold_4772 | 8657 | 13082 | + |
| | J6590_108371 | Promoter of interest | scaffold_193 | 446565 | 453062 | + |
| | J6590_010109 | Promoter of interest | scaffold_46 | 468344 | 476975 | − |
| | J6590_020497 | Promoter of interest | scaffold_126 | 1164356 | 1180159 | − |
| β- | J6590_031071 | Promoter of interest | scaffold_241 | 149582 | 155900 | + |
| β- | J6590_027853 | Promoter of interest | scaffold_199 | 1301426 | 1304526 | + |
| β- | J6590_005648 | Promoter of interest | scaffold_20 | 2635108 | 2648607 | − |
| β- | J6590_073055 | Promoter of interest | scaffold_1324 | 191354 | 201726 | − |
| | J6590_064570 | Promoter of interest | scaffold_950 | 15066 | 25700 | − |
For each identified gene, we provide the gene name, gene ID (i.e., the locus name provided to NCBI), scaffold, start and stop positions, and strand. We also report the category of interest for each gene. Broadly, these fall into two larger groupings: (1) promoter of interest or (2) a morphological marker category based on phenotype from the literature (e.g., eye color, body color, wing shape, and eye shape).
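For the "promoter of interest" entries, a typical next step is to pull the region upstream of each gene, and the side that counts as upstream depends on the strand. The sketch below shows that coordinate logic only. It assumes the table's coordinates are 1-based and inclusive with start < stop on both strands (as in GFF-style annotation), and the 2,000 bp flank is an arbitrary illustrative choice. The `Gene` class and `upstream_window` function are hypothetical names and are not part of the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str
    scaffold: str
    start: int   # assumed 1-based, inclusive
    stop: int    # assumed 1-based, inclusive, start < stop
    strand: str  # "+" or "-" (the Unicode minus "−" is also accepted)


def upstream_window(gene, flank=2000):
    """Return (scaffold, lo, hi) for the flank immediately upstream of a gene.

    The window is not clamped at the scaffold end, because scaffold
    lengths are not available in this sketch.
    """
    if gene.strand == "+":
        lo, hi = gene.start - flank, gene.start - 1
    elif gene.strand in ("-", "−"):
        # On the reverse strand, upstream lies beyond the stop coordinate.
        lo, hi = gene.stop + 1, gene.stop + flank
    else:
        raise ValueError(f"unknown strand: {gene.strand!r}")
    if hi < 1:
        raise ValueError("gene starts at the scaffold edge; no upstream sequence")
    return gene.scaffold, max(lo, 1), hi


# Two rows from the table above.
plus_gene = Gene("J6590_029566", "scaffold_221", 1237702, 1242922, "+")
minus_gene = Gene("J6590_045793", "scaffold_469", 433018, 434887, "−")
print(upstream_window(plus_gene))
print(upstream_window(minus_gene))
```

Sequence extracted for a reverse-strand window would also need to be reverse-complemented before use, which this sketch leaves out.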