Seung Chul Shin, Hyun Kim, Jun Hyuck Lee, Han-Woo Kim, Joonho Park, Beom-Soon Choi, Sang-Choon Lee, Ji Hee Kim, Hyoungseok Lee, Sanghee Kim.
Abstract
Parochlus steinenii is a winged midge from King George Island. It is cold-tolerant and endures the harsh Antarctic winter. We previously reported the genome of this midge, but the short-read assembly had limited contiguity, which reduced the completeness of both the genome assembly and the annotated gene sets. Recently, assembly contiguity has been increased using nanopore technology, and a number of methods for correcting the low base quality of such assemblies have been reported, including long-read-based (e.g. Nanopolish) and short-read-based (e.g. Pilon) methods. Based on these advances, we used nanopore technologies to upgrade the draft genome sequence of P. steinenii. The final assembled genome was 145,366,448 bases in length. The contig number decreased from 9,132 to 162, and the N50 contig size increased from 36,946 to 1,989,550 bases. The BUSCO completeness of the assembly increased from 87.8 to 98.7%. The improved assembly statistics helped predict more genes from the draft genome of P. steinenii. The completeness of the predicted gene model increased from 79.5 to 92.1%, but the numbers and types of the predicted repeats were similar to those observed in the short-read assembly, with the exception of long interspersed nuclear elements. In the present study, we markedly improved the P. steinenii genome assembly statistics using nanopore sequencing, but found that genome polishing with high-quality reads was essential for improving genome annotation. The number of genes predicted and the lengths of the genes were greater than before, and nanopore technology readily improved genome information.
Year: 2019 PMID: 30911035 PMCID: PMC6434015 DOI: 10.1038/s41598-019-41549-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Library preparation.
| | After DNA repair | After end repair | After ligation |
|---|---|---|---|
| PicoGreen assay (ng/μL) | 16 | 29 | 62 |
| Total amount (ng) | 1,600 | 870 | 930 |
Summary of nanopore read statistics.
| | Raw data | Corrected read |
|---|---|---|
| Total read number | 1,999,088 | 341,108 |
| Total read bases (bp) | 10,970,289,711 | 5,742,044,883 |
| Mean read length (bp) | 5487.61 (10.4) | 16,986 |
| Max length (bp) | 96,705 | 87,202 |
| Read length N50 (bp) | 12,381 | 17,615 |
| Number above 5 kbp/total length (bp)/percentage of the total reads (%) | 692,507/8,819,419,598/80 | 340,083/5,739,314,651/100 |
| Number above 10 kbp/total length (bp)/percentage of the total reads (%) | 378,620/6,548,956,539/60 | 327,418/5,616,993,576/96 |
| Number above 20 kbp/total length (bp)/percentage of the total reads (%) | 101,037/2,638,003,734/24 | 81,947/2,110,920,760/39 |
kbp = kilo base pairs. The raw data were base-called using Guppy software, and Canu, with its default settings, was used to correct the longest reads up to 40× coverage.
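The read length N50 reported in the table can be computed from a list of read lengths. The sketch below is a minimal illustration, assuming the common definition (the length at which reads of that size or longer contain at least half of all bases); the function name `n50` and the toy lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: the length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy read lengths (hypothetical, not data from the study).
toy_reads = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_reads))
```

The same function applies unchanged to contig or scaffold lengths, which is how the assembly N50 values are usually derived.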
Genome assembly statistics.
| | IR | NR |
|---|---|---|
| Number of scaffolds | 4,127 | 162 |
| Number of contigs | 9,132 | 162 |
| Total scaffold sequence (bp) | 138,124,775 | 145,366,448 |
| Total contig sequence (bp) | 130,756,571 | 145,366,448 |
| Length of N50 scaffold (bp) | 176,193 | 1,989,550 |
| Length of N50 contig (bp) | 36,946 | 1,989,550 |
| Max scaffold length (bp) | 655,752 | 9,644,260 |
| Max contig length (bp) | 320,332 | 9,644,260 |
IR = the draft genome sequence assembled from the Illumina reads; NR = the draft genome sequence assembled with nanopore reads. The Illumina short reads were assembled using ALLPATHS-LG, and gaps were filled using GapFiller. The nanopore reads were corrected with Canu and then assembled using SMARTdenovo.
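The improvement in contiguity can be quantified directly from the values in the table above. The following is a small worked calculation, not part of the original analysis; the dictionary keys are hypothetical names, and treating the difference between total scaffold and total contig sequence as gap bases is an assumption.

```python
# Values copied from the genome assembly statistics table.
ir = {"contigs": 9132, "contig_n50": 36946,
      "scaffold_bp": 138124775, "contig_bp": 130756571}
nr = {"contigs": 162, "contig_n50": 1989550,
      "scaffold_bp": 145366448, "contig_bp": 145366448}

# Fold change in contig N50 and in contig count between the two assemblies.
n50_fold = nr["contig_n50"] / ir["contig_n50"]
contig_fold = ir["contigs"] / nr["contigs"]

# Assumption: scaffold length minus contig length approximates gap bases.
ir_gap_bp = ir["scaffold_bp"] - ir["contig_bp"]
nr_gap_bp = nr["scaffold_bp"] - nr["contig_bp"]

print(f"Contig N50 fold increase: {n50_fold:.1f}")
print(f"Contig count fold decrease: {contig_fold:.1f}")
print(f"Gap bases, IR vs NR: {ir_gap_bp:,} vs {nr_gap_bp:,}")
```

Under that assumption, the NR assembly contains no gap bases, which is consistent with its scaffold and contig statistics being identical.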
Figure 1. Data analysis overview. We used Albacore (ver. 2.3.1) to base-call the nanopore sequencing reads, and used Canu (ver. 1.7.1) to correct the nanopore reads. We assembled the resulting corrected reads into contigs using SMARTdenovo, and genome polishing was performed using Pilon (ver. 1.22) and Nanopolish (ver. 0.10.1).
Summary of genome polishing.
| Assembly | Assembler | Genome polishing | Identity between aligned regions |
|---|---|---|---|
| IR | ALLPATHS-LG | None | |
| NR | SMARTdenovo | None | 98.15% |
| NR + np | SMARTdenovo | Nanopolish | 98.68% |
| NR + pl | SMARTdenovo | Pilon | 98.90% |
| NR + np + pl | SMARTdenovo | Nanopolish + Pilon | 98.93% |
| NR + np + pl × 2 | SMARTdenovo | Nanopolish + Pilon × 2 | |
IR = the draft genome sequence assembled from the Illumina reads; NR = the draft genome sequence assembled from nanopore reads. The identity values between aligned regions were calculated using nucmer and dnadiff. The bold characters indicate the best identity.
Figure 2. Benchmarking universal single-copy orthologs (BUSCO) analysis of draft genome sequences. The genome completeness values of the six draft genome sequences were calculated using BUSCO against Eukaryota odb9, Insecta odb9, and Diptera odb9. Before genome polishing, the low base quality of the NR reduced the completeness of the genome and increased the numbers of “Fragmented BUSCOs” and “Missing BUSCOs.” Genome polishing of the NR improved the completeness of the genome, and adding Illumina reads markedly improved on polishing with signal-level data in the BUSCO analysis.
BUSCO completeness assessments for genomes.
| Database | Assemblies and genome polishing | Complete BUSCOs | Duplicated BUSCOs | Fragmented BUSCOs | Missing BUSCOs | Total BUSCO groups searched |
|---|---|---|---|---|---|---|
| Eukaryota | IR | 87.8% | 5.3% | 3.0% | 9.2% | 303 |
| Eukaryota | NR | 67.7% | 1.3% | 22.4% | 12.2% | 303 |
| Eukaryota | NR + np | 93.4% | 1.7% | 5.3% | 1.3% | 303 |
| Eukaryota | NR + pl | | | | | 303 |
| Eukaryota | NR + np + pl | 98.7% | 2.3% | 1.0% | 0.3% | 303 |
| Eukaryota | NR + np + pl × 2 | 98.7% | 2.0% | 1.0% | 0.3% | 303 |
| Insecta | IR | 86.6% | 5.2% | 5.1% | 8.3% | 1,658 |
| Insecta | NR | 72.2% | 1.4% | 16.2% | 11.6% | 1,658 |
| Insecta | NR + np | 92.3% | 1.4% | 4.8% | 2.9% | 1,658 |
| Insecta | NR + pl | 97.9% | 2.2% | 1.4% | 0.7% | 1,658 |
| Insecta | NR + np + pl | | | | | 1,658 |
| Insecta | NR + np + pl × 2 | 98.3% | 2.2% | 0.8% | 0.8% | 1,658 |
| Diptera | IR | 77.7% | 3.7% | 10.6% | 11.7% | 2,799 |
| Diptera | NR | 48.8% | 1.1% | 22.9% | 28.3% | 2,799 |
| Diptera | NR + np | 78.5% | 1.3% | 13.6% | 8.0% | 2,799 |
| Diptera | NR + pl | 91.3% | 2.0% | 6.0% | 2.7% | 2,799 |
| Diptera | NR + np + pl | | | | | 2,799 |
| Diptera | NR + np + pl × 2 | 92.0% | 2.3% | 5.5% | 2.6% | 2,799 |
IR = the draft genome sequence assembled from the Illumina reads; NR = the draft genome sequence assembled from nanopore reads. The bold characters indicate the best statistics of genome completeness assessment using BUSCO.
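BUSCO percentages can be converted back to approximate ortholog counts, which makes the small differences between polishing strategies easier to read. The sketch below assumes the usual BUSCO convention that Complete, Fragmented, and Missing partition the searched groups (Duplicated being a subset of Complete); the function name `busco_counts` is hypothetical, and the counts are rounded estimates rather than values reported in the study.

```python
def busco_counts(complete_pct, fragmented_pct, missing_pct, total_groups):
    """Estimate BUSCO group counts from reported percentages.

    Assumes Complete + Fragmented + Missing cover all searched groups.
    Returned counts are rounded, so they are approximations.
    """
    return tuple(
        round(pct / 100 * total_groups)
        for pct in (complete_pct, fragmented_pct, missing_pct)
    )


# Eukaryota odb9, NR + np + pl row: 98.7% complete, 1.0% fragmented,
# 0.3% missing, out of 303 groups searched.
complete, fragmented, missing = busco_counts(98.7, 1.0, 0.3, 303)
print(complete, fragmented, missing)
```

Because the published percentages are rounded to one decimal place, the estimated counts may not always sum exactly to the number of groups searched.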
Major repetitive content and tRNAs.
| | IR | NR | NR + np | NR + pl | NR + np + pl | NR + np + pl × 2 |
|---|---|---|---|---|---|---|
| Interspersed repeats | 7,639,658 (26,042) | 14,540,409 (32,830) | 14,662,939 (33,009) | 14,547,597 (32,603) | 14,751,532 (33,069) | 14,754,452 (33,063) |
| Simple repeats | 1,165,508 | 1,225,771 | 1,208,581 | 1,219,354 | 1,217,748 | 1,218,017 |
| Low complexity | 438,219 | 433,317 | 430,197 | 430,290 | 430,938 | 432,152 |
| tRNA | 13,137 (172) | 11,529 (151) | 11,306 (151) | 11,411 (153) | 11,328 (152) | 11,328 (152) |
IR = the draft genome sequence assembled from the Illumina reads; NR = the draft genome sequence assembled from nanopore reads. The total lengths of the repeats and tRNAs were calculated using RepeatMasker[30] and tRNAscan-SE[35], respectively, and the number of elements is given in parentheses.
Statistics of interspersed repeats contents.
| | IR | NR | NR + np | NR + pl | NR + np + pl | NR + np + pl × 2 |
|---|---|---|---|---|---|---|
| SINE | 68,267 (88) | 100,381 (97) | 101,304 (97) | 101,569 (98) | 102,052 (98) | 102,006 (98) |
| LINE | 524,538 (1,291) | 942,262 (1,600) | 959,395 (1,614) | 949,814 (1,593) | 963,093 (1,610) | 963,118 (1,609) |
| LTR | 279,691 (568) | 1,595,603 (1,087) | 1,600,930 (1,102) | 1,596,730 (1,097) | 1,604,972 (1,108) | 1,605,234 (1,104) |
| DNA | 267,157 (1,038) | 370,673 (1,234) | 375,621 (1,250) | 375,886 (1,239) | 378,520 (1,253) | 378,616 (1,251) |
| Unclassified | 6,500,005 (23,057) | 11,531,490 (28,812) | 11,625,779 (28,946) | 11,523,598 (28,576) | 11,702,895 (29,000) | 11,705,478 (29,001) |
| Total interspersed repeats | 7,639,658 | 14,540,409 | 14,662,939 | 14,547,597 | 14,751,532 | 14,754,452 |
IR = the draft genome sequence assembled from the Illumina reads; NR = the draft genome sequence assembled from nanopore reads. The total lengths of repeats and tRNAs were calculated using RepeatMasker, and the number of elements is given in parentheses. Long terminal repeats (LTRs) are retrotransposons, and non-LTR retrotransposons comprise long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs).
Summary of MAKER2 annotation.
| | | IR | NR | NR + np | NR + pl | NR + np + pl | NR + np + pl × 2 |
|---|---|---|---|---|---|---|---|
| gene | numbera | 11690 | | 11971 | 12074 | 11938 | 11935 |
| lengthb | 51671609 (4420.2) | 47351244 (2792.6) | 59346690 (4957.5) | 59414543 (4920.9) | 59995550 (5026.9) | ||
| CDS | number | 90583 (7.7) | 72775 (4.3) | 104540 (8.7) | 103425 (8.6) | 103928 (8.7) | |
| Length | 19208721 (1643.2) | 11638566 (686.4) | 18935550 (1581.8) | 21627003 (1811.6) | 21615393 (1811.1) | ||
| exon | number | 91886 (7.9) | 87307 (5.1) | 104883 (8.7) | 105527 (8.8) | 105335 (8.8) | |
| Length | 21402569 (1830.8) | 20493668 (1208.6) | 21782057 (1819.6) | 23810842 (1994.5) | 23809534 (1994.9) | ||
| intron | number | 80196 (6.9) | 70351 (4.1) | 95491 (8.0) | 92809 (7.7) | 93400 (7.8) | |
| Length | 30269040 (2589.32) | 26857576 (1584.0) | 35294728 (2923.2) | 36459217 (3054.0) | 36186016 (3031.9) | ||
| 5′-UTR | number | 4514 (1.3) | 5399 (1.5) | 4627 (1.3) | 4537 (1.3) | 4581 (1.3) | |
| Length | 471401 (134.4) | 807804 (219.2) | 484738 (136.2) | 484557 (138.5) | 484432 (136.7) | ||
| 3′-UTR | number | 4117 (1.1) | 5049 (1.3) | 4394 (1.1) | 4255 (1.1) | 4274 (1.1) | |
| Length | 1722447 (447.0) | 2038703 (525.4) | 1785240 (441.6) | 1699282 (432.8) | 1709709 (433.5) |
CDS = coding sequence; IR = the draft genome sequence assembled from the Illumina reads; NR = the draft genome sequence assembled from nanopore reads; UTR = untranslated region. The numbers and total lengths of the genes, CDSs, exons, introns, and UTRs were calculated from a GFF3 file generated by MAKER2[21,36], and the unit averages are given in parentheses. In each row, the best results are shown in bold.
a: Denotes the number of elements.
b: Denotes the total length of the elements.
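The unit averages in parentheses can be reproduced for some rows by dividing a total by the gene number. The check below is a minimal sketch using the IR values for gene length and CDS number; the assumption that these averages are taken per gene is ours, and other rows (for example the UTR rows) may use a different denominator.

```python
# IR column values copied from the MAKER2 annotation table.
ir_gene_number = 11690
ir_gene_total_length = 51671609
ir_cds_number = 90583

# Assumed per-gene averages: total divided by the number of genes.
mean_gene_length = ir_gene_total_length / ir_gene_number
mean_cds_per_gene = ir_cds_number / ir_gene_number

print(f"Mean gene length (bp): {mean_gene_length:.1f}")
print(f"Mean CDS segments per gene: {mean_cds_per_gene:.1f}")
```

Both results agree with the values given in parentheses in the IR column for those two rows.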
Figure 3. Annotation edit distance (AED) metric for controlling the quality of annotation for the final gene predictions of the six draft genome sequences. (A) The cumulative AED distribution for all six draft genomes. (B) Box plot of AED scores for all six draft genomes.
Figure 4. Gene set completeness of the predicted gene models of the draft genome sequences using benchmarking universal single-copy orthologs (BUSCO) analysis. The gene set completeness of the six draft genome sequences was calculated using BUSCO against Eukaryota odb9, Insecta odb9, and Diptera odb9. Before genome polishing, the low-quality bases of the NR reduced the accuracy of gene model prediction by MAKER2. Therefore, the gene set completeness was reduced and the numbers of “Fragmented BUSCOs” and “Missing BUSCOs” increased. Genome polishing of the NR improved the gene set completeness, and adding Illumina reads markedly improved on polishing with signal-level data in the BUSCO analysis.
BUSCO completeness assessments for gene sets.
| Database | Assemblies and genome polishing | Complete BUSCOs | Duplicated BUSCOs | Fragmented BUSCOs | Missing BUSCOs | Total BUSCO groups searched |
|---|---|---|---|---|---|---|
| Eukaryota odb9 | IR | 79.5% | 3.0% | 7.6% | 12.9% | 303 |
| Eukaryota odb9 | NR | 45.2% | 1.0% | 39.6% | 15.2% | 303 |
| Eukaryota odb9 | NR + np | 86.1% | 1.0% | 6.6% | 7.3% | 303 |
| Eukaryota odb9 | NR + pl | 89.4% | 1.7% | 4.0% | 6.6% | 303 |
| Eukaryota odb9 | NR + np + pl | | | | | 303 |
| Eukaryota odb9 | NR + np + pl × 2 | 89.4% | 1.7% | 4.3% | 6.3% | 303 |
| Insecta odb9 | IR | 79.7% | 4.5% | 6.4% | 13.9% | 1,658 |
| Insecta odb9 | NR | 44.8% | 1.6% | 30.1% | 25.1% | 1,658 |
| Insecta odb9 | NR + np | 84.1% | 1.9% | 6.2% | 9.7% | 1,658 |
| Insecta odb9 | NR + pl | 89.5% | 2.5% | 3.2% | 7.3% | 1,658 |
| Insecta odb9 | NR + np + pl | | | | | 1,658 |
| Insecta odb9 | NR + np + pl × 2 | 90.0% | 2.6% | 3.0% | 6.9% | 1,658 |
| Diptera odb9 | IR | 67.8% | 3.5% | 13.0% | 16.3% | 2,799 |
| Diptera odb9 | NR | 25.2% | 0.6% | 24.6% | 50.2% | 2,799 |
| Diptera odb9 | NR + np | 73.1% | 1.7% | 13.2% | 13.7% | 2,799 |
| Diptera odb9 | NR + pl | 83.6% | 2.6% | 8.4% | 8.0% | 2,799 |
| Diptera odb9 | NR + np + pl | | | | | 2,799 |
| Diptera odb9 | NR + np + pl × 2 | 83.9% | 2.4% | 8.1% | 8.0% | 2,799 |
IR = the draft genome sequence assembled from the Illumina reads; NR = the draft genome sequence assembled from nanopore reads. The bold characters indicate the best statistics of gene sets completeness using BUSCO.