Mao Qin1, Shigang Wu1, Alun Li1, Fengli Zhao1, Hu Feng1, Lulu Ding1, Jue Ruan2.
Abstract
BACKGROUND: The advent of third-generation sequencing (TGS) technologies opens the door to improved genome assembly. Long reads are promising for enhancing the quality of fragmented draft assemblies constructed from next-generation sequencing (NGS) data. To date, a few algorithms capable of improving draft assemblies have been released, including SSPACE-LongRead, OPERA-LG, SMIS, npScarf, DBG2OLC, Unicycler, and LINKS. However, hybrid assembly of large genomes remains challenging.
Keywords: LRScaf; Nanopore; PacBio; Scaffolding algorithm; Third generation sequencing technologies
Year: 2019 PMID: 31818249 PMCID: PMC6902338 DOI: 10.1186/s12864-019-6337-2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Descriptive statistics of datasets for the experiment
| Species | Type | Total bases (bp) | Coverage | Median (bp) | Longest (bp) | Source |
|---|---|---|---|---|---|---|
| | Illumina | 350,000,031 | 70.0 X | 100 | 100 | ERA000206 |
| | Illumina a | 256,927,500 | 54.9 X | 300 | 300 | SRR826442; SRR826444; SRR826446; SRR826450 |
| | PacBio | 93,994,356 | 20.1 X | 8712 | 41,331 | SRX669475; SRX533603 |
| | Nanopore (2D) a | 136,895,083 | 29.2 X | 6153 | 43,600 | |
| | Nanopore (Full) b | 21,972,483 | 4.7 X | 5743 | 47,422 | ERX708228 |
| | Nanopore (All) b | 158,867,566 | 34.0 X | 6086 | 47,422 | ERX708228 |
| | Nanopore (Raw) b | 311,558,723 | 66.5 X | 3557 | 94,116 | ERX708228 |
| | Illumina | 1,268,786,706 | 105.1 X | 202 | 202 | SRR527545; SRR527546 |
| | PacBio | 249,319,042 | 20.7 X | 4554 | 27,575 | SRX533604 |
| | Nanopore (Nanocorr) b | 526,588,732 | 43.6 X | 5512 | 72,879 | SRP055987 |
| | Nanopore (Raw) b | 2,392,848,698 | 198.2 X | 5059 | 191,145 | SRP055987 |
| | Illumina | 8,420,975,500 | 70.3 X | 250 | 250 | ERR2173372 |
| | Nanopore | 3,421,779,258 | 28.6 X | 7543 | 269,087 | ERR2173373 |
| | Illumina a | 6,919,422,000 | 59.0 X | 300 | 300 | |
| | PacBio a | 2,400,246,920 | 20.5 X | 15,357 | 41,753 | |
| | Illumina | 43,519,132,800 | 111.5 X | 150 | 150 | PRJNA515358 c |
| | PacBio | 7,999,992,602 | 20.5 X | 4117 | 50,493 | PRJNA318714 |
| | Illumina | 39,007,839,296 | 42.6 X | 311 | 311 | PRJEB19787 |
| | Nanopore | 27,483,806,911 | 30.0 X | 13,061 | 15,387 | PRJEB19787 |
| | PacBio | 49,999,992,839 | 23.7 X | 1347 | 17,784 | PRJNA10769 |
| | PacBio | 59,999,995,767 | 20.0 X | 1569 | 208,628 | SAMN02744161 |
| | Nanopore | 114,380,310,980 | 35.0 X | 4569 | 1,537,349 | PRJEB23027 |
Note: a refers to DBG2OLC dataset; b refers to LINKS dataset; c the dataset was sequenced in this study
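The Coverage column relates total sequenced bases to genome size (coverage = total bases / genome size). The table does not state the genome sizes that were used, so the short sketch below only inverts that relationship for the first Illumina row. The function names are illustrative and do not come from the paper.

```python
def coverage(total_bases: int, genome_size: int) -> float:
    """Sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def implied_genome_size(total_bases: int, reported_coverage: float) -> float:
    """Genome size implied by a reported coverage value.

    The result is approximate because the reported coverage is rounded
    to one decimal place.
    """
    return total_bases / reported_coverage


# First Illumina row of the table: 350,000,031 bp at 70.0 X.
size = implied_genome_size(350_000_031, 70.0)
print(f"implied genome size: {size / 1e6:.2f} Mbp")
```

Because the reported coverage is rounded, rows for the same species can imply slightly different genome sizes.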
The summary of draft assemblies of E. coli, S. cerevisiae, A. thaliana, O. sativa, S. pennellii, Z. mays and H. sapiens
| Species | Method/Source | Sum | NG50 | NGA50 | Longest | Misassemblies (#) | BUSCO (Complete) |
|---|---|---|---|---|---|---|---|
| | SOAPdenovo2 | 4.6 Mbp | 25.2 kbp | 25.2 kbp | 91.7 kbp | 0 | 97.3% |
| | SPAdes | 4.6 Mbp | 112.4 kbp | 105.6 kbp | 265.2 kbp | 2 | 98.6% |
| | ABySS a | 5.2 Mbp | 179.7 kbp | 146.9 kbp | 358.7 kbp | 5 | 98.6% |
| | SparseAssembler b | 4.4 Mbp | 3.0 kbp | 3.0 kbp | 14.9 kbp | 2 | 64.9% |
| | SOAPdenovo2 | 12.1 Mbp | 18.7 kbp | 18.6 kbp | 146.7 kbp | 3 | 96.2% |
| | SPAdes | 11.8 Mbp | 104.1 kbp | 85.7 kbp | 451.4 kbp | 22 | 97.2% |
| | Celera Assembly a | 14.9 Mbp | 58.8 kbp | 54.7 kbp | 257.3 kbp | 19 | 98.7% |
| | DISCOVAR | 117.9 Mbp | 323.0 kbp | 314.6 kbp | 2.5 Mbp | 67 | 98.5% |
| | MaSuRCA | 119.5 Mbp | 413.2 kbp | 356.5 kbp | 1.7 Mbp | 145 | 98.3% |
| | Platanus | 113.0 Mbp | 145.5 kbp | 143.7 kbp | 800.8 kbp | 31 | 98.3% |
| | SOAPdenovo2 | 115.1 Mbp | 236.7 kbp | 227.0 kbp | 1.5 Mbp | 39 | 98.3% |
| | SparseAssembler | 93.0 Mbp | 12.8 kbp | 12.7 kbp | 114.5 kbp | 1 | 94.7% |
| | SparseAssembler b | 74.7 Mbp | 4.4 kbp | 4.2 kbp | 35.8 kbp | 90 | 74.6% |
| | DISCOVAR | 313.8 Mbp | 27.1 kbp | 23.6 kbp | 262.5 kbp | 1343 | 96.9% |
| | MaSuRCA | 339.2 Mbp | 30.6 kbp | 29.1 kbp | 219.4 kbp | 1288 | 96.7% |
| | Platanus | 307.9 Mbp | 16.8 kbp | 16.6 kbp | 154.3 kbp | 367 | 95.6% |
| | SOAPdenovo2 | 301.2 Mbp | 18.5 kbp | 18.3 kbp | 207.7 kbp | 91 | 97.1% |
| | SparseAssembler | 155.3 Mbp | – | – | 43.0 kbp | 2 | 85.8% |
| | DISCOVAR | 851.9 Mbp | 66.4 kbp | 59.6 kbp | 1.3 Mbp | 4235 | 94.2% |
| | MaSuRCA | 884.2 Mbp | 61.3 kbp | 54.9 kbp | 617.2 Mbp | 6621 | 94.9% |
| | Platanus | 641.3 Mbp | 15.4 kbp | 15.2 kbp | 270.1 kbp | 115 | 91.7% |
| | SOAPdenovo2 | 768.5 Mbp | 28.2 kbp | 26.8 kbp | 323.3 kbp | 632 | 92.6% |
| | SparseAssembler | 305.2 Mbp | – | – | 51.1 kbp | 11 | 76.5% |
| | PhredPhrap+ABySS (GCA_000005005.5) | 2.0 Gbp | 40.0 kbp | 36.2 kbp | 849.5 kbp | 15,133 | 91.9% |
| | SRPRISM+ARGO (GCF_000306695.2) | 2.8 Gbp | 127.5 kbp | 127.1 kbp | 1.0 Mbp | 106 | 80.3% |
| | DISCOVAR (GCA_001517065.1) | 2.8 Gbp | 115.7 kbp | 115.3 kbp | 961.2 kbp | 336 | 83.7% |
Note: a refers to LINKS dataset; b refers to DBG2OLC dataset; “-”: Not available
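For readers unfamiliar with the NG50 column, the following is a minimal sketch of the standard definition: the contig length at which the running total of contigs, sorted from longest to shortest, first reaches half of the estimated genome size. The function name and the toy lengths are illustrative assumptions, not values from the table. NGA50 applies the same calculation to aligned blocks after contigs are broken at misassemblies, which this sketch does not attempt.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Return NG50, or None if the assembly covers less than half the genome."""
    target = genome_size / 2
    running_total = 0
    # Walk contigs from longest to shortest until half the genome is covered.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None


# Toy example with made-up contig lengths (bp).
contigs = [80, 70, 50, 40, 30, 20]
print(ng50(contigs, genome_size=300))
print(ng50(contigs, genome_size=1000))
```

The second call returns `None` because the toy contigs sum to less than half of the assumed genome size, a case in which NG50 is undefined.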
Fig. 1 A validating model of alignment. P1 and P2 are the two points that break a long read into three regions (R1, R2, and R3).
Fig. 2 The construction of a link using a long read lr_i and two contigs c_i and c_j. a A basic schematic of a long read building a link between contigs. b The distance distribution of links.
Fig. 3 Schematic illustration of traversing a complex region.
Fig. 4 Performance of the tested scaffolders over five NGS de novo assemblers on A. thaliana. The value in parentheses next to each NGS de novo assembler is the CPU time of the scaffolder run on that assembly.
Fig. 5 Performance of the tested scaffolders over five NGS de novo assemblers on O. sativa. The value in parentheses next to each NGS de novo assembler is the CPU time of the scaffolder run on that assembly. The number of misassemblies for DBG2OLC on SparseAssembler is not available from QUAST and is shown in grey.
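The link described in the Fig. 2 caption can be illustrated with a small sketch. It shows one common way to estimate the gap between two contigs that align to the same long read. This is an assumption made for illustration, not the LRScaf implementation: the `Alignment` record, its field names, the forward-strand-only simplification, and the use of a median to summarize several reads are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Alignment:
    """Hypothetical record of one contig aligned to a long read (forward strand)."""
    contig_len: int    # full contig length
    contig_start: int  # first aligned contig base (0-based)
    contig_end: int    # one past the last aligned contig base
    read_start: int    # first aligned read base (0-based)
    read_end: int      # one past the last aligned read base


def link_distance(left: Alignment, right: Alignment) -> int:
    """Estimated gap between two contigs bridged by a single long read.

    A negative value means the contigs are predicted to overlap.
    """
    read_gap = right.read_start - left.read_end
    left_tail = left.contig_len - left.contig_end  # unaligned end of left contig
    right_head = right.contig_start                # unaligned start of right contig
    return read_gap - left_tail - right_head


def summarize_links(distances: list[int]) -> float:
    """Collapse per-read distances for one contig pair into a single estimate."""
    return median(distances)


# Made-up coordinates for two contigs on one read.
ci = Alignment(contig_len=1000, contig_start=600, contig_end=990,
               read_start=100, read_end=490)
cj = Alignment(contig_len=2000, contig_start=20, contig_end=800,
               read_start=1000, read_end=1780)
print(link_distance(ci, cj))
```

The median is only a placeholder choice. The caption says that link distances follow a distribution (panel b) but does not say how that distribution is summarized.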
The performances of tested scaffolders for O. sativa, Z. mays, and H. sapiens using PacBio long reads
| Species | Method | Sum | NG50 | NGA50 | Longest | Mis (#) | BUSCO (Complete) | CPU Time (Hours) | Peak Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| | DBG2OLC (SA) | 561.6 Mbp | 93.5 kbp | – | 488.0 kbp | – | 39.9% | 8.1 | 50.6 |
| | MaSuRCA-Hy (MSR) | 390.5 Mbp | 1469 | 1315.8 (495.1) | 353.9 | | | | |
| | OPERA-LG (SOAP) | 346.1 Mbp | 143.6 kbp | 98.5 kbp | 1.1 Mbp | 1346 | 96.7% | 149.1 | 66.4 |
| | SMIS (MSR) | 352.9 Mbp | 204.1 kbp | 117.5 kbp | 1.2 Mbp | 1971 | 97.2% | 3118.2 | 15.7 |
| | SSPACE-LR (DIS) | 324.1 Mbp | 96.5 kbp | 61.1 kbp | 1.0 Mbp | 3247 | 97.7% | 56.1 | |
| | LRScaf (SOAP) | 354.6 Mbp | 137.7 kbp | 102.1 kbp | 1.1 Mbp | 97.9% | 28.4 | | |
| | SSPACE-LR | 2.0 Gbp | 48.0 kbp | 40.6 kbp | 849.5 kbp | 92.7% | 479.6 | | |
| | LRScaf | 2.4 Gbp | 21,689 | 103.8 | | | | | |
| | DBG2OLC a | 2.8 Gbp | 5.5 Mbp | 4.5 Mbp | 27.3 Mbp | 767 | 93.9% | – | |
| | LRScaf | 2.8 Gbp | 0.8 | 20.3 | | | | | |
Note: The best genomic assembly metrics are highlighted in bold. Mis: the number of misassemblies; SA: SparseAssembler; MSR: MaSuRCA de novo; SOAP: SOAPdenovo2; DIS: DISCOVAR de novo; PLA: Platanus; SSPACE-LR: SSPACE-LongRead; MaSuRCA-Hy: MaSuRCA hybrid pipeline; '–': not available. Two CPU times are reported for MaSuRCA-Hy: the first value is the time for the hybrid pipeline, and the second value, in parentheses, is the time for the de novo pipeline. OPERA-LG and SMIS are excluded on Z. mays because both fail to run. SMIS and SSPACE-LongRead are excluded on H. sapiens (CHM1) because the run time exceeds the one-month time limit. OPERA-LG is excluded from H. sapiens (CHM1) because of the lack of NGS data. a The assembly was provided by Dr. Chengxi Ye (the developer of DBG2OLC).
The performances of tested scaffolders for A. thaliana, S. pennellii, and H. sapiens using Nanopore long reads
| Species | Method | Sum | NG50 | NGA50 | Longest | Mis. (#) | BUSCO (Complete) | CPU Time (Hours) | Peak Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| | DBG2OLC (SA) | 150.4 Mbp | 2.6 Mbp | – | 12.2 Mbp | 25.9% | 2.9 | 18.9 | |
| | MaSuRCA-Hy (MSR) | 123.2 Mbp | 2.5 Mbp | 2.3 Mbp | 9.2 Mbp | 211 | 98.3% | 145.6 (17.9) | 60.3 |
| | OPERA-LG (SA) | 116.4 Mbp | 7.3 Mbp | 2.8 Mbp | 14.3 Mbp | 97.2% | 188.9 | 45.9 | |
| | SMIS (SOAP) | 116.2 Mbp | 2.4 Mbp | 180 | 98.2% | 2391.7 | 26.8 | | |
| | SSPACE-LR (PLA) | 120.6 Mbp | 3.2 Mbp | 2.2 Mbp | 6.8 Mbp | 178 | 104.6 | | |
| | LRScaf (DIS) | 123.1 Mbp | 9.0 Mbp | 12.4 Mbp | 115 | 98.3% | 21.4 | | |
| | DBG2OLC (SA) | 1.5 Gbp | 243.7 kbp | – | 1.7 Mbp | 18.9% | 27.0 | 21.6 | |
| | MaSuRCA-Hy (MSR) | 950.4 Mbp | 331.1 kbp | 159.6 kbp | 3.1 Mbp | 17,957 | 95.6% | 1389.1 (202.8) | 239.7 |
| | OPERA-LG (DIS) | 952.0 Mbp | 730.0 kbp | 280.1 kbp | 3.5 Mbp | 11,404 | 92.1% | 286.6 | 25.4 |
| | SSPACE-LR (DIS) | 871.1 Mbp | 82.9 kbp | 69.5 kbp | 1.3 Mbp | 6796 | 94.7% | 650.0 | |
| | LRScaf (DIS) | 952.8 Mbp | 95.8% | 36.8 | | | | | |
| | LRScaf | 2.9 Gbp | | | | | | | |
Note: The best genomic assembly metrics are highlighted in bold. Mis.: the number of misassemblies; SA: SparseAssembler; MSR: MaSuRCA de novo; SOAP: SOAPdenovo2; DIS: DISCOVAR de novo; PLA: Platanus; SSPACE-LR: SSPACE-LongRead; MaSuRCA-Hy: MaSuRCA hybrid pipeline; '–': not available. Two CPU times are reported for MaSuRCA-Hy: the first value is the time for the hybrid pipeline, and the second value, in parentheses, is the time for the de novo pipeline. SMIS is excluded on S. pennellii because the run time exceeds the one-month time limit. SMIS and SSPACE-LongRead are excluded on H. sapiens (NA12878) because the run time exceeds the one-month time limit. OPERA-LG is excluded from H. sapiens (NA12878) because of the lack of NGS data.