Eka Giorgashvili, Katja Reichel, Calvinna Caswara, Vuqar Kerimov, Thomas Borsch, Michael Gruenstaeudl.
Abstract
Most plastid genome sequences are assembled from short-read whole-genome sequencing data, yet the impact that sequencing coverage and the choice of assembly software can have on the accuracy of the resulting assemblies is poorly understood. In this study, we test the impact of both factors on plastid genome assembly in the threatened and rare endemic shrub Calligonum bakuense. We aim to characterize the differences across plastid genome assemblies generated by different assembly software tools and levels of sequencing coverage and to determine if these differences are large enough to affect the phylogenetic position inferred for C. bakuense compared to congeners. Four assembly software tools (FastPlast, GetOrganelle, IOGA, and NOVOPlasty) and seven levels of sequencing coverage across the plastid genome (original sequencing depth, 2,000x, 1,000x, 500x, 250x, 100x, and 50x) are compared in our analyses. The resulting assemblies are evaluated with regard to reproducibility, contig number, gene complement, inverted repeat length, and computation time; the impact of sequence differences on phylogenetic reconstruction is assessed. Our results show that software choice can have a considerable impact on the accuracy and reproducibility of plastid genome assembly and that GetOrganelle produces the most consistent assemblies for C. bakuense. Moreover, we demonstrate that a sequencing coverage between 500x and 100x can reduce both the sequence variability across assembly contigs and computation time. When comparing the most reliable plastid genome assemblies of C. bakuense, a sequence difference in only three nucleotide positions is detected, which is less than the difference potentially introduced through software choice.
Keywords: Calligonum; assembly software; genome assembly; nucleotide differences; phylogenetic position; plastid genome; reproducibility; sequencing coverage
Year: 2022 PMID: 35874012 PMCID: PMC9296850 DOI: 10.3389/fpls.2022.779830
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 6.627
Figure 1. Habit and natural environment (A) and the current distribution area (B) of C. bakuense. The map indicates the localities of all sampled natural populations of C. bakuense, including those that individuals Cb01A and Cb04B were sampled from.
Assembly statistics for the plastid genomes of the two individuals of C. bakuense under study regarding the impact of assembly software choice, run replication, and seed selection.
| Software | Cov. | Repl. | Seed | Contigs | | | | Equal. | Comp. time |
|---|---|---|---|---|---|---|---|---|---|
| Cb01A | | | | | | | | | |
| FaPl | orig. | repl1 | | 1 | 200,694 | 118,168 | 1 | No | 05 h 20 min |
| FaPl | orig. | repl2 | | 1 | 143,261 | 135,202 | 1 | No | 06 h 40 min |
| FaPl | 2,000x | | | 1 | 162,404 | 162,128 | 1 | Yes | 01 h 16 min |
| FaPl | 500x | | | 1 | 163,292 | 162,896 | 1 | Yes | 24 min |
| GetO | orig. | repl1 | | 1 | 162,128 | 162,128 | 1 | Yes | 44 min |
| GetO | orig. | repl2 | | 1 | 162,128 | 162,128 | 1 | Yes | 44 min |
| GetO | 2,000x | | | 2 | 118,241 | 118,215 | 1 | Yes | 01 h 08 min |
| | | | | | | | | | |
| IOGA | orig. | repl1 | | 21 | 89,039 | 88,068 | 1 | Yes | 09 h 42 min |
| IOGA | orig. | repl2 | | 21 | 89,039 | 88,068 | 1 | No | 06 h 43 min |
| IOGA | 2,000x | | | 83 | 129,550 | 118,520 | 1 | No | 07 h 50 min |
| IOGA | 500x | | | 51 | 91,976 | 89,718 | 1 | No | 02 h 22 min |
| NOVO | orig. | repl1 | seed1 | 1 | 170,093 | 131,660 | 1 | No | 01 h 05 min |
| NOVO | orig. | repl2 | seed1 | 1 | 170,099 | 170,099 | 1 | Yes | 57 min |
| NOVO | 2,000x | | seed1 | 1 | 162,128 | 162,128 | 1 | Yes | 23 min |
| NOVO | 500x | | seed1 | 1 | 162,128 | 162,128 | 1 | Yes | 07 min |
| NOVO | orig. | repl1 | seed2 | 1 | 162,128 | 162,128 | 1 | Yes | 01 h 05 min |
| NOVO | orig. | repl2 | seed2 | 1 | 170,106 | 170,106 | 1 | Yes | 01 h 00 min |
| NOVO | 2,000x | | seed2 | 1 | 162,128 | 162,128 | 1 | Yes | 23 min |
| NOVO | 500x | | seed2 | 1 | 162,128 | 162,128 | 1 | Yes | 07 min |
| Cb04B | | | | | | | | | |
| FaPl | orig. | repl1 | | 1 | 175,272 | 175,272 | 1 | Yes | 09 h 24 min |
| FaPl | orig. | repl2 | | 1 | 192,943 | 118,215 | 1 | No | 03 h 36 min |
| FaPl | 2,000x | | | 1 | 163,890 | 162,129 | 1 | Yes | 01 h 16 min |
| FaPl | 500x | | | 1 | 163,292 | 163,292 | 1 | Yes | 24 min |
| GetO | orig. | repl1 | | 1 | 162,129 | 162,129 | 1 | Yes | 04 h 16 min |
| GetO | orig. | repl2 | | 1 | 162,129 | 162,129 | 1 | Yes | 01 h 04 min |
| GetO | 2,000x | | | 2 | 118,238 | 118,215 | 1 | Yes | 01 h 28 min |
| | | | | | | | | | |
| IOGA | orig. | repl1 | | 54 | 90,241 | 88,240 | 1 | Yes | 11 h 14 min |
| IOGA | orig. | repl2 | | 85 | 90,630 | 87,507 | 1 | Yes | 07 h 42 min |
| IOGA | 2,000x | | | 102 | 55,966 | 27,790 | 2 | No | 08 h 17 min |
| IOGA | 500x | | | 40 | 75,285 | 74,394 | 1 | No | 03 h 18 min |
| NOVO | orig. | repl1 | seed1 | 1 | 162,129 | 162,129 | 1 | Yes | 01 h 23 min |
| NOVO | orig. | repl2 | seed1 | 1 | 162,129 | 162,129 | 1 | Yes | 01 h 24 min |
| NOVO | 2,000x | | seed1 | 1 | 162,129 | 162,129 | 1 | Yes | 20 min |
| NOVO | 500x | | seed1 | 1 | 162,129 | 162,129 | 1 | Yes | 06 min |
| NOVO | orig. | repl1 | seed2 | 1 | 162,129 | 162,129 | 1 | Yes | 01 h 00 min |
| NOVO | orig. | repl2 | seed2 | 1 | 162,129 | 162,129 | 1 | Yes | 01 h 32 min |
| NOVO | 2,000x | | seed2 | 1 | 162,129 | 162,129 | 1 | Yes | 20 min |
| NOVO | 500x | | seed2 | 1 | 162,129 | 162,129 | 1 | Yes | 06 min |
The assemblies that represent the final genome sequences are highlighted in bold. The assembly software tools compared are abbreviated as “FaPl” (for FastPlast), “GetO” (for GetOrganelle), “IOGA,” and “NOVO” (for NOVOPlasty). Run replicates are abbreviated as “repl1” or “repl2,” the original sequencing depth as “orig.” Other abbreviations used: asmb., assembly; comp., computation; cov., coverage; equal., equality in sequence; repl., replicate.
Assembly statistics for the plastid genomes of the two individuals of C. bakuense under study regarding the impact of different levels of sequencing coverage.
| Software | Cov. | Compl. | Contigs | | | | IR length | Equal. | Comp. time |
|---|---|---|---|---|---|---|---|---|---|
| Cb01A | | | | | | | | | |
| GetO | orig. | Yes | 1 | 162,128 | 162,128 | 1 | 30,526 | Yes | 44 min |
| GetO | 2,000x | No | 2 | 118,241 | 118,215 | 1 | 30,526 | Yes | 01 h 08 min |
| GetO | 1,000x | No | 1 | 62,295 | n.s.d. | - | n.a. | n.a. | 06 min |
| | | | | | | | | | |
| GetO | 250x | Yes | 1 | 162,128 | 162,128 | 1 | 30,526 | Yes | 02 min |
| GetO | 100x | Yes | 1 | 162,128 | 162,128 | 1 | 30,526 | Yes | 01 min |
| GetO | 50x | No | 2 | 118,241 | 118,220 | 1 | 28,610 | Yes | 01 min |
| NOVO | orig. | Yes | 1 | 170,093 | 131,660 | 1 | 44,559 | No | 01 h 05 min |
| NOVO | 2,000x | Yes | 1 | 162,128 | 162,128 | 1 | 30,526 | Yes | 23 min |
| NOVO | 1,000x | Yes | 1 | 162,128 | 162,128 | 1 | 30,526 | Yes | 11 min |
| NOVO | 500x | Yes | 1 | 162,128 | 162,128 | 1 | 30,526 | Yes | 07 min |
| NOVO | 250x | Yes | 1 | 162,128 | 162,128 | 1 | 30,526 | Yes | 05 min |
| NOVO | 100x | Yes | 1 | 162,128 | 162,128 | 1 | 30,526 | Yes | 02 min |
| NOVO | 50x | No | 1 | 117,861 | 117,849 | 1 | n.a. | n.a. | 09 min |
| Cb04B | | | | | | | | | |
| GetO | orig. | Yes | 1 | 162,129 | 162,129 | 1 | 30,526 | Yes | 04 h 16 min |
| GetO | 2,000x | No | 2 | 118,238 | 118,215 | 1 | 30,526 | Yes | 01 h 28 min |
| GetO | 1,000x | No | 1 | 67,160 | n.s.d. | - | n.a. | n.a. | 06 min |
| | | | | | | | | | |
| GetO | 250x | Yes | 1 | 162,129 | 162,129 | 1 | 30,526 | Yes | 02 min |
| GetO | 100x | Yes | 1 | 162,129 | 162,129 | 1 | 30,526 | Yes | 01 min |
| GetO | 50x | No | 2 | 118,236 | 118,215 | 1 | 30,526 | Yes | 01 min |
| NOVO | orig. | Yes | 1 | 162,129 | 162,129 | 1 | 30,526 | Yes | 01 h 23 min |
| NOVO | 2,000x | Yes | 1 | 162,129 | 162,129 | 1 | 30,526 | Yes | 20 min |
| NOVO | 1,000x | Yes | 1 | 162,129 | 162,129 | 1 | 30,526 | Yes | 15 min |
| NOVO | 500x | Yes | 1 | 162,129 | 162,129 | 1 | 30,526 | Yes | 06 min |
| NOVO | 250x | Yes | 1 | 162,129 | 162,129 | 1 | 30,476 | Yes | 04 min |
| NOVO | 100x | No | 4 | 112,054 | 112,054 | 1 | 30,526 | Yes | 07 min |
| NOVO | 50x | No | 1 | 75,891 | n.s.d. | - | n.a. | n.a. | 24 min |
For assemblies under the original sequencing depth, only the first run replicate is displayed; for all assemblies performed with NOVOPlasty, seed sequence 1 was employed. Abbreviations used: compl., complete genome assembled; n.a., not applicable; n.s.d., no similarity detected by QUAST; all other abbreviations are as in the preceding table.
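The coverage levels compared above imply that the read set is reduced until the mean depth across the plastid genome reaches a given cap. The entry does not state how this reduction was performed, so the following is only a minimal sketch of the arithmetic behind one common approach, random subsampling. The original depth of 4,000x and the read length of 150 bp are hypothetical values chosen for illustration; the genome length of 162,128 bp is taken from the table.

```python
import math


def subsample_fraction(original_depth: float, target_depth: float) -> float:
    """Fraction of reads to keep so that mean depth drops to target_depth.

    Assumes depth scales linearly with the number of reads retained.
    """
    if original_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    # Never ask for more reads than are available.
    return min(1.0, target_depth / original_depth)


def read_pairs_needed(genome_length: int, target_depth: float, read_length: int) -> int:
    """Read pairs required for target_depth, assuming paired reads of equal length."""
    if genome_length <= 0 or read_length <= 0 or target_depth <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(genome_length * target_depth / (2 * read_length))


if __name__ == "__main__":
    genome_length = 162_128   # bp, from the table
    original_depth = 4_000.0  # hypothetical; not reported in this entry
    read_length = 150         # hypothetical; not reported in this entry
    for cap in (2_000, 1_000, 500, 250, 100, 50):
        frac = subsample_fraction(original_depth, cap)
        pairs = read_pairs_needed(genome_length, cap, read_length)
        print(f"{cap:>5}x: keep {frac:.4f} of reads, about {pairs} read pairs")
```

Because the fraction depends on the original depth, the same cap corresponds to a different subsampling fraction for each individual.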
Figure 2. Map of the complete plastid genome of individual Cb01A of C. bakuense as assembled by GetOrganelle under a coverage cap of 500x. This assembly represents the final plastid genome sequence for Cb01A.
Figure 3. Comparisons of the number of SNPs and the lengths of the four genome regions across the plastid genome assemblies of C. bakuense as generated by different assembly software and levels of sequencing coverage. Subplot (A) displays the results of PCoAs, subplots (B,C) the results of comparisons between a target assembly and the final plastid genome sequence, and subplot (D) the results of assembly comparisons between the two individuals of C. bakuense under study. In the PCoA plots, the percentages indicate the variance explained by the first (x-axis) and second (y-axis) principal coordinate, and the integers express the range of the data. The abbreviations for the four distance metrics are: “SNPcount” for the total number of SNPs between two assemblies; “LSClendif,” “SSClendif,” and “IRlendif” for the differences in sequence length in the LSC, SSC, and IR between two assemblies, respectively.
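The four distance metrics named in this caption can be illustrated with a short sketch. It assumes that the two assemblies have already been aligned to each other and that the region lengths are known; the class `AlignedAssembly`, the choice to skip gaps and ambiguous bases when counting SNPs, and the toy values are assumptions made for illustration and are not taken from the study.

```python
from dataclasses import dataclass

VALID_BASES = set("ACGT")


@dataclass
class AlignedAssembly:
    """One assembly as a row of a pairwise alignment, plus region lengths in bp."""
    aligned_seq: str  # '-' marks alignment gaps
    lsc_len: int
    ssc_len: int
    ir_len: int


def snp_count(seq_a: str, seq_b: str) -> int:
    """Count mismatching columns, skipping gaps and ambiguous bases (an assumption)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def distance_metrics(x: AlignedAssembly, y: AlignedAssembly) -> dict[str, int]:
    """Return the four pairwise metrics using the names given in the caption."""
    return {
        "SNPcount": snp_count(x.aligned_seq, y.aligned_seq),
        "LSClendif": abs(x.lsc_len - y.lsc_len),
        "SSClendif": abs(x.ssc_len - y.ssc_len),
        "IRlendif": abs(x.ir_len - y.ir_len),
    }


if __name__ == "__main__":
    # Toy data only; real assemblies are about 162 kb long.
    target = AlignedAssembly("ACGT-ATTG", lsc_len=5, ssc_len=2, ir_len=1)
    final = AlignedAssembly("ACCTTATNG", lsc_len=6, ssc_len=2, ir_len=1)
    print(distance_metrics(target, final))
```

A matrix of such pairwise values, one per metric, is the kind of input on which a PCoA operates.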
Figure 4. Overview of the relative lengths of the LSC, the SSC, and the two IRs across the plastid genome assemblies of the individuals Cb01A (A) and Cb04B (B) of C. bakuense as generated by different assembly software and levels of sequencing coverage.
Overview of incorrect or missing annotations among the plastid genome assemblies of C. bakuense as generated under different assembly software, sequencing coverage, seed sequences, and run replicates.
| Software | Cov. | Repl. | Seed | Incorrect or missing annotations | |
|---|---|---|---|---|---|
| Cb01A | | | | | |
| FaPl | orig. | repl1 | | rpl23a,b, rrn16a,b | |
| FaPl | orig. | repl2 | | rpl2a,b, ycf2a,b, rpl23a | |
| FaPl | 2,000x | | | | |
| FaPl | 500x | | | | |
| IOGA | orig. | repl1 | | psbA, rpl23a,b, ycf2a,b, rrn16a,b | |
| IOGA | orig. | repl2 | | psbA, ycf2a,b | |
| IOGA | 2,000x | | | psbA, ycf2a,b, ycf1a,b, ndhH | |
| IOGA | 500x | | | psbA, rps2a,b, ycf2a,b, ndhH | trnH-GUG |
| NOVO | orig. | repl1 | seed1 | trnH-GUG, psbA, trnK-UUU, matK, rps16 | |
| NOVO | orig. | repl2 | seed1 | trnH-GUG, psbA, trnK-UUU, matK, rps16, trnQ-UUG, psbK, psbI, trnS-GCU, trnG-UCC, trnR-UCU, atpA, atpF, atpH, atpI, rps2, rpoC2 | |
| NOVO | 2,000x | | seed1 | | |
| NOVO | 500x | | seed1 | | |
| NOVO | orig. | repl1 | seed2 | trnH-GUG, psbA, trnK-UUU, matK, rps16, trnQ-UUG, psbK, psbI, trnS-GCU, trnG-UCC, trnR-UCU, atpA, atpF, atpH, atpI, rps2, rpoC2 | |
| NOVO | orig. | repl2 | seed2 | trnH-GUG, psbA, trnK-UUU, matK, rps16, trnQ-UUG, psbK, psbI, trnS-GCU, trnG-UCC, trnR-UCU, atpA, atpF, atpH, atpI, rps2, rpoC2 | |
| NOVO | 2,000x | | seed2 | | |
| NOVO | 500x | | seed2 | | |
| Cb04B | | | | | |
| FaPl | orig. | repl1 | | | |
| FaPl | orig. | repl2 | | rpl23a | |
| FaPl | 2,000x | | | | |
| FaPl | 500x | | | ndhF | |
| IOGA | orig. | repl1 | | psbA, rpl23b, rpl2a | |
| IOGA | orig. | repl2 | | psbA, ndhH, rpl23a, rpl2a,b | |
| IOGA | 2,000x | | | psbA, petB | |
| IOGA | 500x | | | psbA, rps23a,b | |
All plastid genome assemblies generated with GetOrganelle for both individuals and with NOVOPlasty for Cb04B exhibited a complete gene complement and a full genome size and are, thus, not listed. The last column denotes cases of incomplete genomes despite the assembly being circular and indicated as complete by the assembly software. The suffix “a” after a gene name indicates a location in IRa.
Figure 5. Visualization of the sequencing coverage across the plastid genome of individual Cb01A as generated with FastPlast and the location of SNPs of assemblies generated under different levels of sequencing coverage. Red bars in the visualization of sequencing coverage indicate calculation windows with a depth equal to, or less than, 50% of genome-wide sequencing depth. The four rings beneath the coverage visualization indicate the location of SNPs relative to the final genome sequence for the following assemblies: replicate run 1 (A) and 2 (B) under the original sequencing depth; a coverage cap of 2,000x (C); a coverage cap of 500x (D). Black bars within each ring represent the occurrence of three SNPs per 100 bp.
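The rule behind the red bars in this figure (windows with a depth equal to, or less than, 50% of the genome-wide depth) can be sketched as follows. The caption does not give the window size or say whether the genome-wide depth is a mean or a median, so this sketch assumes non-overlapping windows and the mean of per-base depths; the function name and the toy depth values are hypothetical.

```python
def low_coverage_windows(depths, window_size, threshold_fraction=0.5):
    """Return (start, end, mean_depth) for windows at or below the cutoff.

    depths: per-base sequencing depths along the genome (0-based positions).
    The cutoff is threshold_fraction times the genome-wide mean depth.
    Windows are non-overlapping; the last one may be shorter than window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not depths:
        return []
    genome_mean = sum(depths) / len(depths)
    cutoff = threshold_fraction * genome_mean
    flagged = []
    for start in range(0, len(depths), window_size):
        window = depths[start:start + window_size]
        window_mean = sum(window) / len(window)
        if window_mean <= cutoff:
            flagged.append((start, start + len(window), window_mean))
    return flagged


if __name__ == "__main__":
    # Toy profile: a 5-bp dip in an otherwise evenly covered 20-bp stretch.
    toy_depths = [100] * 10 + [20] * 5 + [100] * 5
    for start, end, mean_depth in low_coverage_windows(toy_depths, window_size=5):
        print(f"low-coverage window {start}-{end}: mean depth {mean_depth:.1f}")
```

Per-base depths of this kind are typically obtained by mapping the reads back to the assembly; that step is outside the scope of this sketch.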
Figure 6. Phylogenetic position of C. bakuense among other species of Calligonum. C. bakuense is represented by the final plastid genomes of individuals Cb01A and Cb04B, which are highlighted in bold. The displayed phylogenetic tree represents the best tree inferred under ML, visualized as (A) cladogram with bootstrap node support (given above branches) and (B) the corresponding phylogram with exact branch lengths.