Weiwen Wang, Miriam Schalamun, Alejandro Morales-Suarez, David Kainer, Benjamin Schwessinger, Robert Lanfear.
Abstract
BACKGROUND: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in chloroplast genomes is widely used in agriculture and in studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10-30 kb). Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure of two single-copy regions separated by a pair of inverted repeats. The advent of long-read sequencing technologies should remove the need to make this assumption by providing sufficient information to completely span the inverted repeat regions. Yet long-reads tend to have higher error rates than short-reads, and relatively little is known about the best way to combine long- and short-reads to obtain the most accurate chloroplast genome assemblies. Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, such as different coverage of long (Oxford Nanopore) and short (Illumina) reads, different long-read lengths, and different assembly pipelines, with a view to determining the most accurate and efficient approach to chloroplast genome assembly.
Keywords: Chloroplast genome; Genome assembly; Illumina; Long-reads; Nanopore; Polishing
Year: 2018 PMID: 30594129 PMCID: PMC6311037 DOI: 10.1186/s12864-018-5348-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Comparison of chloroplast genome assemblies. The coverage of the long- and short-reads is shown along the top and left-hand side of each panel, respectively. The left-hand side also shows which assembler was used for each row of assemblies in that panel. Hinge and Canu are long-read-only assemblers, whereas Unicycler is a short-read-only and hybrid assembler. The Hinge and Canu results in b, c, d were polished by Racon+Nanopolish, and the Unicycler results used Karect-corrected short-reads. a. The total coverage of the chloroplast genome across all contigs output by the assembler. Panels marked with a red ‘x’ contained a single contig covering the whole chloroplast genome. The heatmap indicates the chloroplast genome coverage. b. The assembly length of different assemblies after manual curation (e.g. removing duplicate regions). Panels marked with an ‘x’ denote assemblies with the expected length, in the range 155,938 bp–155,945 bp. c. The mapping rate of validation reads to the assemblies after manual curation. Assemblies with the highest mapping rate (99.43%) are marked with a red ‘x’. d. The average per-base error rate of validation reads mapped to each manually curated genome assembly. Assemblies with the lowest error rate (0.0007) are marked with a red ‘x’.
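The three quality measures in panels b, c and d can be illustrated with a minimal Python sketch. The function names are hypothetical, and the error-rate definition used here (mismatches plus inserted and deleted bases, divided by aligned bases) is an assumption; the caption does not state how the per-base error rate was computed.

```python
# Hypothetical helpers illustrating the metrics named in the Fig. 1 caption.
# The error-rate formula is an assumed definition, not the paper's stated method.

def has_expected_length(assembly_length: int,
                        low: int = 155_938,
                        high: int = 155_945) -> bool:
    """Panel b: is the curated assembly within the expected length range (bp)?"""
    return low <= assembly_length <= high


def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Panel c: fraction of validation reads that map to the assembly."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def per_base_error_rate(mismatches: int, insertions: int,
                        deletions: int, aligned_bases: int) -> float:
    """Panel d (assumed definition): alignment errors per aligned base."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return (mismatches + insertions + deletions) / aligned_bases
```

Under these definitions, an assembly of 155,940 bp passes the length check, and lower values of `per_base_error_rate` together with higher values of `mapping_rate` indicate a better assembly.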
The minimum coverage of long-/short-reads required to assemble one contig spanning the entire chloroplast genome
| Long-read length | 5-10 kb | 10-20 kb | 20-30 kb | 30-40 kb | 40-50 kb | |
|---|---|---|---|---|---|---|
| Long-read coverage | N/A | N/A | 20x | 5x | 5x | 5x |
| Short-read coverage | N/A | N/A | 8x | 8x | 8x | 8x |
N/A: no single contig spanning the entire chloroplast genome could be assembled with long-reads in this length range.
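Coverage values such as 5x or 20x relate to raw read data through coverage = total read bases / genome length. The Python sketch below shows one way to bin reads by length and subsample them to a target coverage. It is an illustration under stated assumptions, not the authors' pipeline: the genome length of 155,940 bp is an assumed value taken from inside the expected range given in the Fig. 1 caption, and all function names are hypothetical.

```python
import random

# Assumption: an approximate chloroplast genome length (bp) chosen from the
# expected range reported in the Fig. 1 caption.
GENOME_LENGTH = 155_940


def coverage(read_lengths, genome_length=GENOME_LENGTH):
    """Mean depth of coverage: total sequenced bases / genome length."""
    return sum(read_lengths) / genome_length


def filter_by_length(read_lengths, min_len, max_len):
    """Keep reads whose length falls in [min_len, max_len), e.g. 20-30 kb."""
    return [n for n in read_lengths if min_len <= n < max_len]


def subsample_to_coverage(read_lengths, target, genome_length=GENOME_LENGTH, seed=0):
    """Randomly pick reads until their total bases reach target coverage.

    Returns all reads (shuffled) if the input cannot reach the target.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    shuffled = list(read_lengths)
    rng.shuffle(shuffled)
    needed_bases = target * genome_length
    picked, total = [], 0
    for n in shuffled:
        if total >= needed_bases:
            break
        picked.append(n)
        total += n
    return picked


# Usage: 200 simulated reads of 25 kb, reduced to roughly 20x coverage.
reads = filter_by_length([25_000] * 200 + [5_000], 20_000, 30_000)
subset = subsample_to_coverage(reads, target=20)
```

Because whole reads are added until the target is met, the resulting coverage is at least the target and overshoots it by less than one read length.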
Fig. 2 Comparison of hybrid assembly sequences with ≥20x long- and short-read coverage. Short_20x indicates that 20x coverage of short-reads was used in these assemblies (with ≥20x long-read coverage). For hybrid assemblies with ≥20x short-read coverage, if the long-read coverage was ≥20x, all assemblies with the same short-read coverage were identical. The number at the top is the position.
Fig. 3 a. Annotated E. pauciflora chloroplast genome. Genes shown on the inside of the circle are transcribed clockwise, whereas genes shown on the outside of the circle are transcribed counterclockwise. The grey region in the inside circle shows the GC content across the chloroplast genome. This figure was produced by OGDraw v1.2 [93]. b. A phylogenetic tree of 32 Eucalyptus taxa based on analysis of full chloroplast genomes.