Jin Young Lee, Minyoung Kong, Jinjoo Oh, JinSoo Lim, Sung Hee Chung, Jung-Min Kim, Jae-Seok Kim, Ki-Hwan Kim, Jae-Chan Yoo, Woori Kwak.
Abstract
Assembling high-quality microbial genomes using only cost-effective Nanopore long-read systems such as Flongle is important for accelerating microbial genome research, and the most critical step in doing so is polishing. In this study, we evaluated eight state-of-the-art Nanopore polishing tools, and the available combinations of them, for microbial genome assembly using BUSCO and Prokka gene prediction. Among the individual tools, Homopolish, PEPPER, and Medaka gave better results than the others. In combination polishing, using Homopolish as the second round, and the PEPPER × Medaka combination, also gave better results than the others. However, individual tools and combinations each have specific limitations in usage and results. Depending on the target organism and the purpose of the downstream research, some difficulties remain in fully replacing hybrid polishing, which relies on additional short-read data. Nevertheless, through continued improvement of the protein pores, the related base-calling algorithms, and polishing tools based on improved error models, a high-quality microbial genome can be achieved using only Nanopore reads, without producing additional short-read data. The polishing strategy proposed in this study is expected to provide useful information for assembling microbial genomes from Nanopore reads alone, depending on the target microorganism and the purpose of the research.
Year: 2021 PMID: 34671046 PMCID: PMC8528807 DOI: 10.1038/s41598-021-00178-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of the sequencing data generated for the E. coli genome used in this study.
| Library name | Sequencing platform | Read type | Read count | Bases (bp) |
|---|---|---|---|---|
| Short-read | Illumina MiSeq | Paired-end | 2,302,658 | 341,783,000 |
| Long-read | Nanopore MinION | Single-end | 2,350,791 | 10,392,655,168 |
Summary statistics of the circularized initial E. coli genome assembly from Canu.
| Chromosome | | Plasmid | |
|---|---|---|---|
| Number of sequences | 1 | Number of sequences | 1 |
| Number of A's | 1,356,990 (24.72%) | Number of A's | 23,515 (25.49%) |
| Number of C's | 1,389,721 (25.32%) | Number of C's | 23,810 (25.81%) |
| Number of G's | 1,387,128 (25.27%) | Number of G's | 20,254 (21.96%) |
| Number of T's | 1,355,108 (24.69%) | Number of T's | 24,671 (26.74%) |
| Total | 5,488,947 | Total | 92,250 |
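
As a rough sanity check on the data volumes above, mean sequencing depth can be estimated as total sequenced bases divided by assembly length. The sketch below is illustrative only: it assumes the Canu chromosome plus plasmid total (5,488,947 + 92,250 bp) as the genome size, and the name `mean_depth` is hypothetical rather than part of any tool used in the study.

```python
# Illustrative depth estimate; figures copied from the two tables above.
# Assumption: genome size = chromosome + plasmid length of the Canu assembly.
GENOME_BP = 5_488_947 + 92_250

LIBRARIES = {
    "Short-read (Illumina MiSeq)": 341_783_000,
    "Long-read (Nanopore MinION)": 10_392_655_168,
}


def mean_depth(total_bases: int, genome_bp: int) -> float:
    """Return mean depth: sequenced bases per assembled base."""
    return total_bases / genome_bp


depths = {name: mean_depth(bases, GENOME_BP) for name, bases in LIBRARIES.items()}

for name, depth in depths.items():
    print(f"{name}: ~{depth:.0f}x")
```

Under these assumptions the short-read library works out to roughly 61x and the long-read library to roughly 1,860x.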
List of the polishing tools used in this study.
| Tools | Authors | Published Year |
|---|---|---|
| Nanopolish | Nicholas J Loman | 2015 |
| Racon | Robert Vaser et al. | 2017 |
| Medaka | Oxford Nanopore | 2018 |
| NextPolish | Jiang Hu et al. | 2019 |
| PEPPER | Kishwar Shafin et al. | 2020 |
| Apollo | Can Firtina et al. | 2020 |
| Homopolish | Yao-Ting Huang et al. | 2021 |
| NeuralPolish | Neng Huang et al. | 2021 |
Figure 1. BUSCO evaluation results using enterobacterales_odb10 for each polishing tool. Bars indicate the number of complete single-copy BUSCO genes, and the yellow line indicates the number of complete duplicated BUSCO genes.
Figure 2. BUSCO evaluation results for 10 rounds of iterative polishing with four polishing tools.
Figure 3. Prokka gene prediction results for 10 rounds of iterative polishing with four polishing tools. (a) Number of predicted genes in each round. (b) Number of estimated pseudogenes in each round.
Figure 4. BUSCO evaluation results for combination polishing using enterobacterales_odb10. Bars indicate BUSCO completeness. Green indicates the result of short-read-based Pilon polishing; light green indicates the highest accuracy obtained from Nanopore-read-based polishing.
Figure 5. Gene prediction results from Prokka and exact read alignment rate from Bowtie2 for each polishing combination. (a) Number of predicted genes and estimated pseudogenes from Prokka. (b) Alignment rate of perfectly matched Illumina short reads to the polished genome.
Number and type of differences in Prokka-predicted genes for each polishing combination, compared with short-read Pilon polishing.
| Combination | Total Mismatch | Gene Merged | Gene Split | Additional Prediction | Loss Prediction |
|---|---|---|---|---|---|
| Medaka × Homo | 26 | 18 | 1 | 5 | 2 |
| PEPPER × Homo | 24 | 19 | 2 | 3 | 0 |
| Next × Homo | 28 | 20 | 0 | 6 | 2 |
| Racon × Medaka | 54 | 16 | 33 | 5 | 0 |
| PEPPER × Medaka | 50 | 15 | 28 | 5 | 2 |
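
The category counts in the table above can be cross-checked against the Total Mismatch column, and the dominant difference type for each combination can be read off programmatically. This is a small sketch over the table values only; the names `ROWS`, `CATEGORIES`, and `dominant_category` are illustrative and do not come from the study.

```python
# Rows copied from the table above:
# (total mismatch, gene merged, gene split, additional prediction, loss prediction)
ROWS = {
    "Medaka × Homo": (26, 18, 1, 5, 2),
    "PEPPER × Homo": (24, 19, 2, 3, 0),
    "Next × Homo": (28, 20, 0, 6, 2),
    "Racon × Medaka": (54, 16, 33, 5, 0),
    "PEPPER × Medaka": (50, 15, 28, 5, 2),
}
CATEGORIES = ("Gene Merged", "Gene Split", "Additional Prediction", "Loss Prediction")


def dominant_category(counts):
    """Return the name of the category with the largest count."""
    return max(zip(CATEGORIES, counts), key=lambda pair: pair[1])[0]


summary = {}
for combo, (total, *counts) in ROWS.items():
    # Each reported total should equal the sum of its four categories.
    assert total == sum(counts), combo
    summary[combo] = dominant_category(counts)
    print(f"{combo}: {total} differences, mostly {summary[combo]}")
```

In these figures, the three combinations that end with Homopolish differ from the Pilon result mainly through merged genes, whereas the two that end with Medaka differ mainly through split genes.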
Differences between PEPPER × Homopolish and PEPPER × Homopolish × Pilon.
| Position | Type | Description | Difference |
|---|---|---|---|
| 2,229,653–2,231,971 | CDS | Ion-translocating oxidoreductase complex subunit C | Split |
| 2,276,362–2,276,416 | CDS | putative oxidoreductase YdhV | Merged |
| 2,339,610–2,339,601 | CDS | Chitooligosaccharide deacetylase ChbG | Merged |
| 4,057,083–4,057,490 | CDS | putative fimbrial-like protein YraK | Missing |
| 2,518,544–2,517,555 | CDS | Tyrosine recombinase XerC | 195 bp |
| 2,881,870–2,882,727 | CDS | Nickel/cobalt efflux system RcnA | −6 bp |
| 3,112,312–3,111,914 | CDS | L-rhamnonate dehydratase | 72 bp |
| 3,113,114–3,112,350 | CDS | L-rhamnonate dehydratase | 21 bp |
| 4,169,438–4,170,403 | CDS | tRNA-dihydrouridine synthase B | −105 bp |
| 5,017,871–5,018,644 | CDS | Acetylglutamate kinase | 3 bp |
| 5,061,026–5,061,718 | CDS | NADH pyrophosphatase | 81 bp |
Seven predicted hypothetical proteins that do not match between the two polished assemblies are not listed.
Figure 6. Read alignments to the polished genome, visualized using IGV. The genome polished with Homopolish did not contain the sample-specific variation, whereas read-based PEPPER × Medaka polishing successfully reflected it.
Summary of the sequencing data generated for the L. lactis and S. thermophilus genomes used in this study.
| Library name | Sequencing platform | Read type | Read count | Bases (bp) |
|---|---|---|---|---|
| L_lactis | Nanopore Flongle | Single-end | 212,379 | 846,604,405 |
| S_thermo | Nanopore Flongle | Single-end | 153,886 | 1,228,787,058 |
Summary statistics of the L. lactis and S. thermophilus genomes after circularization using Circlator.
| L. lactis | | S. thermophilus | |
|---|---|---|---|
| Number of sequences | 1 | Number of sequences | 1 |
| Number of A's | 807,167 (32.31%) | Number of A's | 560,550 (30.25%) |
| Number of C's | 440,839 (17.65%) | Number of C's | 362,323 (19.55%) |
| Number of G's | 444,943 (17.81%) | Number of G's | 361,434 (19.50%) |
| Number of T's | 804,929 (32.22%) | Number of T's | 569,043 (30.70%) |
| Total | 2,497,878 | Total | 1,853,350 |
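
GC content follows directly from the base counts in the table above as (G + C) / total. The sketch below recomputes it for both genomes. It assumes the left-hand columns describe L. lactis and the right-hand columns S. thermophilus, following the caption order; `gc_percent` is a hypothetical helper name.

```python
# Base counts copied from the table above.
# Assumption: column-to-species mapping follows the order given in the caption.
COUNTS = {
    "L. lactis": {"A": 807_167, "C": 440_839, "G": 444_943, "T": 804_929},
    "S. thermophilus": {"A": 560_550, "C": 362_323, "G": 361_434, "T": 569_043},
}


def gc_percent(counts):
    """Return GC content as a percentage of all counted bases."""
    total = sum(counts.values())
    return 100.0 * (counts["G"] + counts["C"]) / total


gc = {species: gc_percent(counts) for species, counts in COUNTS.items()}

for species, value in gc.items():
    total = sum(COUNTS[species].values())
    print(f"{species}: GC = {value:.2f}% of {total:,} bp")
```

The recomputed totals match the Total row, and the GC values agree with the sum of the C and G percentages reported per genome.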
Figure 7. BUSCO evaluation results using enterobacterales_odb10 for two probiotic species.