Agnes Scheunert, Marco Dorfner, Thomas Lingl, Christoph Oberprieler.
Abstract
The chloroplast genome harbors plenty of valuable information for phylogenetic research. Illumina short-read data is generally used for de novo assembly of whole plastomes. PacBio or Oxford Nanopore long reads are additionally employed in hybrid approaches to enable assembly across the highly similar inverted repeats of a chloroplast genome. Unlike for PacBio, plastome assemblies based solely on Nanopore reads are rarely found, due to their high error rate and non-random error profile. However, the actual quality decline connected to their use has rarely been quantified. Furthermore, no study has employed reference-based assembly using Nanopore reads, which is common with Illumina data. Using Leucanthemum Mill. as an example, we compared the sequence quality of seven chloroplast genome assemblies of the same species, using combinations of two sequencing platforms and three analysis pipelines. In addition, we assessed the factors which might influence Nanopore assembly quality during sequence generation and bioinformatic processing. The consensus sequence derived from de novo assembly of Nanopore data had a sequence identity of 99.59% compared to Illumina short-read de novo assembly. Most of the errors detected were indels (81.5%), and a large majority of them are part of homopolymer regions. The quality of reference-based assembly is heavily dependent upon the choice of a close-enough reference. When using a reference with 0.83% sequence divergence from the studied species, mapping of Nanopore reads results in a consensus comparable to that from Nanopore de novo assembly, and of only slightly inferior quality compared to a reference-based assembly with Illumina data. For optimal de novo assembly of Nanopore data, appropriate filtering of contaminants and chimeric sequences, as well as employing moderate read coverage, is essential.
Based on these results, we conclude that Nanopore long reads are a suitable alternative to Illumina short reads in plastome phylogenomics. Few errors remain in the finalized assembly, which can be easily masked in phylogenetic analyses without loss in analytical accuracy. The easily applicable and cost-effective technology might warrant more attention by researchers dealing with plant chloroplast genomes.
Year: 2020 PMID: 32208422 PMCID: PMC7092973 DOI: 10.1371/journal.pone.0226234
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Schematic bioinformatic workflow for comparative assembly of chloroplast genomes.
Tools used for each processing step are given in italics. Leucanthemum virgatum was assembled de novo with Illumina data and subsequently used as reference for mapping.
Read statistics for Illumina and Nanopore sequencing runs.
| raw reads (300+300 run / 80+80 run) | 381,932 (25,290 / 356,642) | 373,518 (22,810 / 350,708) | 52,223 |
| PhiX spike-in reads removed (%) | 2,620 (0.7) | 5,560 (1.5) | n.a. |
| quality- and length-filtered reads (%) | 167,126 (43.8) | 119,502 (32.0) | 217 (0.4) |
| remaining reads for analysis (%) | 212,186 (55.6) | 248,456 (66.5) | 52,006 (99.6) |
The number of raw reads is given for each of the two MiSeq paired-end sequencing runs (300 bp and 80 bp); for Nanopore it refers to reads after processing by Porechop. Length-filtered reads include those that were too short after trimming. Read numbers for Nanopore filtered reads as well as remaining reads refer to processed, not fully processed reads (for information on those see text).
L., Leucanthemum; n.a., not applicable.
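The read accounting in this table can be checked with a few lines of Python: raw reads minus PhiX reads minus quality- and length-filtered reads should equal the remaining reads. The sketch below is only a consistency check on the numbers printed above. The column keys are placeholders, since the column headers are not shown here, and the Nanopore PhiX count is set to zero as an assumption for the "n.a." entry.

```python
# Read counts transcribed from the table above, one entry per column.
# Column keys are placeholders; "phix": 0 for Nanopore is an assumption for "n.a.".
READ_STATS = {
    "illumina_col1": {"raw": 381_932, "phix": 2_620, "filtered": 167_126, "remaining": 212_186},
    "illumina_col2": {"raw": 373_518, "phix": 5_560, "filtered": 119_502, "remaining": 248_456},
    "nanopore": {"raw": 52_223, "phix": 0, "filtered": 217, "remaining": 52_006},
}


def check_read_accounting(stats):
    """Verify raw - phix - filtered == remaining; return remaining reads as % of raw."""
    result = {}
    for name, s in stats.items():
        expected = s["raw"] - s["phix"] - s["filtered"]
        if expected != s["remaining"]:
            raise ValueError(f"{name}: expected {expected}, table says {s['remaining']}")
        result[name] = round(100 * s["remaining"] / s["raw"], 1)
    return result


remaining_pct = check_read_accounting(READ_STATS)
print(remaining_pct)
```

The percentages returned are relative to the raw read count of each column, which is how the values in parentheses in the table appear to be computed.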
Mapping statistics for reference-based assembly of Illumina and Nanopore data.
| | Illumina (L. virgatum) | Illumina (A. frigida) | Nanopore (L. virgatum) | Nanopore (A. frigida) |
|---|---|---|---|---|
| input reads (after splitting) | 248,456 | 248,456 | 12,404 (39,080) | 12,404 (39,080) |
| mapped reads (%) | 238,574 (96.02) | 237,686 (95.67) | 38,377 (98.20) | 38,363 (98.17) |
| ambiguously mapped (%) | 36,641 (14.75) | 35,934 (14.46) | 2,341 (5.99) | 2,325 (5.95) |
| read lengths | 50–300 bp | 50–300 bp | 1–3,000 bp | 1–3,000 bp |
| mean coverage (stdev) | 143.41x (88.36) | 143.43x (87.92) | 754.34x (473.68) | 756.68x (472.80) |
| fraction of reference at coverage x or higher | 30x: 99.74% 20x: 99.88% | 30x: 99.66% 20x: 99.81% | 100x: 99.72% 50x: 99.76% | 100x: 99.72% 50x: 99.76% |
| mean mapping quality | 36.5 | 33.4 | 22.4 | 19.6 |
| duplicate reads removed | 7,644 | 7,646 | n.a. | n.a. |
| remaining reads after deduplication (%) | 230,930 (96.8) | 230,040 (96.8) | n.a. | n.a. |
The respective reference used for mapping (the L. virgatum Illumina de novo assembly or A. frigida as obtained from Genbank, both without the IRA) is given in parentheses. Nanopore input reads are given before and after splitting by mapPacBio.sh into 3,000-bp chunks. Mapped reads include ambiguously mapped reads. The percentages given for both refer to all input reads; the fraction of remaining reads after deduplication is relative to mapped reads. No deduplication was performed for Nanopore reads.
L., Leucanthemum; stdev, standard deviation; n.a., not applicable.
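The denominators described in the table note can be made explicit in code. The following Python sketch recomputes the percentages for the first Illumina column and the first Nanopore column: mapped and ambiguously mapped reads are taken relative to all input reads (for Nanopore, the count after splitting), while reads remaining after deduplication are taken relative to mapped reads. The function name `mapping_summary` is illustrative and not part of any tool used in the study.

```python
def mapping_summary(input_reads, mapped, ambiguous, duplicates=None):
    """Recompute mapping percentages using the denominators stated in the table note."""
    summary = {
        # Mapped and ambiguously mapped reads are relative to all input reads.
        "mapped_pct": round(100 * mapped / input_reads, 2),
        "ambiguous_pct": round(100 * ambiguous / input_reads, 2),
    }
    if duplicates is not None:
        # Reads remaining after deduplication are relative to mapped reads.
        remaining = mapped - duplicates
        summary["remaining"] = remaining
        summary["remaining_pct"] = round(100 * remaining / mapped, 1)
    return summary


# First Illumina column of the table.
illumina = mapping_summary(248_456, 238_574, 36_641, duplicates=7_644)
# First Nanopore column; input is the read count after splitting (39,080).
nanopore = mapping_summary(39_080, 38_377, 2_341)
print(illumina)
print(nanopore)
```

No deduplication values are computed for Nanopore, matching the "n.a." entries in the table.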
De novo assembly statistics.
| input reads (post-trimming in brackets) | 212,186 | 248,456 | 14,351 (10,422) | 8,338 (4,632) |
| best k-mer | 65 | 55 | n.a. | n.a. |
| no. of contigs | 5 | 3 | 5 (7) | 5 (6) |
| contig sizes | 621–82,641 bp | 18,400–82,675 bp | 6,819–54,677 bp | 6,655–53,891 bp |
| mean coverage (stdev) | 128.30x (109.88) | 152.28x (113.00) | 559.39x (323.29) | 321.43x (167.74) |
| fraction at coverage x or higher | 30x: 93.38% 20x: 98.66% | 30x: 99.74% 20x: 99.91% | 100x: 99.71% 50x: 99.77% | 100x: 91.17% 50x: 99.83% |
Three de novo assemblies were produced for L. vulgare, using Illumina data, Nanopore data or Nanopore data improved by Illumina data (hybrid approach using Nanocorr). Assembly was done with Unicycler for Illumina reads and Canu for Nanopore reads. Nanopore input reads are also given after Canu correction and trimming steps (in brackets). Number of contigs is given before and after splitting by Exonerate for further use. Values for "fraction at coverage x or higher" refer to the final assembled sequence without the IRA and denote the percentage with a certain read coverage after mapping of input reads.
L., Leucanthemum; stdev, standard deviation; no, number; n.a., not applicable.
Fig 2. Chloroplast genome map for Leucanthemum vulgare.
Genes on the outside of the outer circle are transcribed counterclockwise, genes on the inside are transcribed clockwise. Introns are illustrated with white color within genes; genes containing an intron are additionally marked with *. Pseudogenes are preceded by a ψ. The trans-spliced rps12 gene is marked with °. Color-coding of genes depicts their affiliation to the functional groups given. The inner circle indicates the borders of the large single-copy (LSC) and small single-copy (SSC) regions as well as the inverted repeats (IR). The innermost gray shaded area shows the G+C content of the cp genome. The gene order is identical in L. virgatum (see S1 Fig), whereas their exact positions and the extent of the inverted repeat slightly differ (Fig 3).
Fig 3. Comparison of the inverted repeat (IR) boundaries in Leucanthemum vulgare and L. virgatum.
Genes above the green bars are transcribed in reverse direction, those below in forward direction. For both taxa, the IR extends into the ycf1 and rps19 genes, resulting in two pseudogenes (denoted by a ψ) at the single-copy / IR junctions. Lengths are given for whole genes and their duplicated fragments as well as the large single-copy (LSC), small single-copy (SSC) and IR regions (IRA and IRB). Arrows show basepair (bp) distance from the junctions for the ndhF and trnH genes. As an example, the SSC in L. virgatum is shown reverse-complemented relative to that in L. vulgare; both configurations exist in individual plants according to Palmer (1983) [73]. The figure is not to scale.
Genes present in the sequenced Leucanthemum vulgare and L. virgatum genomes.
| Category | Group | Gene name |
|---|---|---|
| Self-replication | Large subunit of ribosome | rpl2 |
| | Small subunit of ribosome | rps2, 3, 4, 7 |
| | DNA-dependent RNA polymerase | rpoA, B, C1 |
| | rRNA genes | rrn4.5 |
| | tRNA genes | H-GUG, K-UUU |
| Photosynthesis (light reaction) | Photosystem I + assembly factors | psaA, B, C, I, J, ycf3 |
| | Photosystem II | psbA, B, C, D, E, F, H, I, J, K, L, M, N, T, Z |
| | NADH dehydrogenase | ndhA |
| | Cytochrome b6/f complex | petA, B |
| | F-type ATP synthase | atpA, B, E, F |
| Photosynthesis (dark reaction) | Envelope membrane protein | cemA |
| | Cytochrome c synthesis gene | ccsA |
| | RubisCO large subunit | rbcL |
| Other genes | Maturase | matK |
| | Subunit of the acetyl-coA-carboxylase | accD |
| | Proteolytic subunit of Clp-protease | clpP |
| | Translation initiation factor A | infA |
| Genes of unknown function / ORFs | Hypothetical chloroplast reading frame | ycf1, 2 |
A genes containing an intron;
B genes containing two introns;
C two gene copies in the genome due to the inverted repeat. ORF, Open Reading Frame.
Fig 4. Nucleotide variant density in L. vulgare and L. virgatum chloroplast genomes.
Illumina reads from L. virgatum were mapped to the Illumina de novo assembled sequence of L. vulgare (lacking the second inverted repeat) and variants were called. Plot data on variant density were generated by VCFtools in 1000-bp windows. The position of the first inverted repeat (IRB) is indicated by red lines. Peaks depicting potentially useful marker candidates for Leucanthemum are highlighted by arrows.
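The idea behind a windowed variant density plot can be illustrated in plain Python. The sketch below bins variant positions into fixed 1000-bp windows and flags windows with more than eight variants, the threshold used for marker selection in this study. It is not the VCFtools implementation, whose window boundaries may be defined differently, and the example positions are invented for illustration.

```python
from collections import Counter


def variant_density(positions, window=1000):
    """Count variants per fixed-size window; keys are 1-based window start coordinates."""
    counts = Counter()
    for pos in positions:
        start = ((pos - 1) // window) * window + 1
        counts[start] += 1
    return dict(sorted(counts.items()))


def high_variability_windows(density, more_than=8):
    """Return start coordinates of windows holding more than `more_than` variants."""
    return [start for start, n in density.items() if n > more_than]


# Hypothetical variant positions (1-based), not taken from the study.
example_positions = [1, 500, 1000, 1001, 2500]
density = variant_density(example_positions)
print(density)
print(high_variability_windows(density, more_than=2))
```

With real data, the positions would come from the POS column of the called variants.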
Chloroplast molecular markers with high variability between L. vulgare and L. virgatum.
| Marker | Position [bp] | Length [bp] | Variants | Variation [%] |
|---|---|---|---|---|
| | 11,909–12,778 | 869 | 12 | 1.38 |
| | 107,238–109,469 | 2,231 | 19 | 0.85 |
| (hotspot) | 107,249–107,999 | 750 | 12 | 1.60 |
| | 109,470–110,488 | 1,018 | 8 | 0.79 |
| | 110,654–111,640 | 986 | 11 | 1.12 |
| | 121,133–126,226 | 5,093 | 28 | 0.55 |
| (hotspot) | 122,119–122,913 | 794 | 11 | 1.39 |
| (hotspot) | 125,081–125,569 | 488 | 9 | 1.84 |
Markers were chosen based on high-variability windows (> eight variants) in a mapping of Leucanthemum virgatum Illumina reads to the L. vulgare Illumina de novo assembly as shown in Fig 4. For each marker, the position on the L. vulgare genome and the variable fraction of the marker relative to its length are given. For long markers, hotspots are given which pertain to the peak regions in Fig 4; their position is given based on the first and last variant in the region.
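The "Variation [%]" column is the number of variants divided by the marker length. The short Python sketch below recomputes both columns from the positions and variant counts in the table. Note that the tabulated lengths correspond to end minus start, without adding one, and the code follows that convention.

```python
# (start, end, variants) for each row of the table above, in table order.
MARKER_ROWS = [
    (11_909, 12_778, 12),
    (107_238, 109_469, 19),
    (107_249, 107_999, 12),
    (109_470, 110_488, 8),
    (110_654, 111_640, 11),
    (121_133, 126_226, 28),
    (122_119, 122_913, 11),
    (125_081, 125_569, 9),
]


def variation_percent(start, end, variants):
    """Return (length, variation %) with length = end - start, as tabulated."""
    length = end - start
    return length, round(100 * variants / length, 2)


for row in MARKER_ROWS:
    print(row[0], row[1], *variation_percent(*row))
```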
Comparison of six assemblies of the Leucanthemum vulgare plastome, obtained using different methods and data types, with a "gold-standard" Illumina de novo assembly.
| | Illumina | Illumina map (virg) | Illumina map (Art) | Nanopore hybrid | Nanopore | Nanopore map (virg) | Nanopore map (Art) |
|---|---|---|---|---|---|---|---|
| assembly length without IRA [bp] | 125,633 | 125,383 | 124,851 | 125,609 | 125,270 | 125,451 | 125,165 |
| % GC content | 36.35 | 36.38 | 36.45 | 36.36 | 36.37 | 36.36 | 36.43 |
| % identity to Illumina de novo assembly | n.a. | 99.74 | 99.27 | 99.98 | 99.59 | 99.51 | 98.29 |
| % alignment positions with N | n.a. | 0 | 0 | 0 | 0.12 | 0 | 0 |
| total mismatches | n.a. | 329 | 922 | 25 | 362 | 619 | 2,164 |
| no. of substitutions (% total mismatches) | n.a. | 19 (5.8) | 56 (6.1) | 1 (4.0) | 67 (18.5) | 39 (6.3) | 198 (9.1) |
| no. of gaps (% total mismatches) | n.a. | 280 (85.1) | 824 (89.4) | 24 (96.0) | 253 (69.9) | 381 (61.6) | 1,217 (56.2) |
| deletion events | n.a. | 21 | 47 | 5 | 221 | 57 | 149 |
| no. of inserted bases (% total mismatches) | n.a. | 30 (9.1) | 42 (4.6) | 0 (0.0) | 42 (11.6) | 199 (32.1) | 749 (34.6) |
| insertion events | n.a. | 5 | 3 | 0 | 42 | 42 | 67 |
Two data types (Illumina, Nanopore reads) and three analysis pipelines were used to assemble the L. vulgare plastome: de novo assembly, reference-based (mapping) assembly and hybrid de novo assembly using Nanopore reads corrected by Illumina data (via Nanocorr). Comparisons were made based on alignment of the respective assembly to L. vulgare de novo (Illumina). Percentages of identity and "alignment positions with N" refer to the length of the alignment; the latter denotes the number of Ns required for alignment to the Illumina de novo assembly.
L., Leucanthemum; IRA, inverted repeat A; n.a., not applicable; d.n., de novo; map, mapping assembly; virg, L. virgatum Illumina de novo assembly used as reference for mapping; Art, Artemisia frigida as obtained from Genbank used for mapping; both references without the IRA.
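The mismatch categories in this table add up to the total mismatches, and the 81.5% indel share quoted in the abstract appears to correspond to gaps plus inserted bases in the Nanopore de novo column (69.9% + 11.6%). The Python sketch below recomputes that breakdown from the tabulated counts. The function name and dictionary keys are illustrative.

```python
# Counts from the "Nanopore" (de novo) column of the table above.
NANOPORE_DE_NOVO = {"substitutions": 67, "gaps": 253, "inserted_bases": 42}


def mismatch_breakdown(counts):
    """Return total mismatches, per-category shares (%), and the combined indel share (%)."""
    total = sum(counts.values())
    shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}
    # Indels are taken here as gaps plus inserted bases.
    indel_share = round(100 * (counts["gaps"] + counts["inserted_bases"]) / total, 1)
    return total, shares, indel_share


total, shares, indel_share = mismatch_breakdown(NANOPORE_DE_NOVO)
print(total, shares, indel_share)
```

The same function can be applied to the other columns; in each case the three categories sum to the "total mismatches" row.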
Fig 5. Chloroplast genome read coverage of the Nanopore de novo assembly.
Leucanthemum vulgare Nanopore long reads used for assembly (post-correction and post-trimming) were mapped to the assembled sequence lacking the second inverted repeat (IRA) and coverage-across-reference plot data were extracted with Qualimap. Black lines represent the long-range PCR fragments from which the reads are derived. Note the uneven coverage across fragments and coverage peaks at fragment overlaps. bp, basepairs.