Lucian Ilie, Bahlul Haider, Michael Molnar, Roberto Solis-Oba.
Abstract
BACKGROUND: De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of a significant amount of work in this area, better solutions are still very much needed.
Year: 2014 PMID: 25225118 PMCID: PMC4174676 DOI: 10.1186/1471-2105-15-302
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Estimated genome sizes for the datasets in Table 3
| Dataset | Actual size | Estimated size | Difference (%) |
|---|---|---|---|
| 1 | 4,215,606 | 4,225,613 | 0.24% |
| 2 | 1,042,519 | 1,353,267 | 29.81% |
| 3 | 2,190,731 | 2,210,896 | 0.92% |
| 4 | 1,892,775 | 1,887,639 | -0.27% |
| 5 | 4,277,185 | 4,717,507 | 10.29% |
| 6 | 2,343,476 | 2,342,227 | -0.05% |
| 7 | 4,639,675 | 4,681,050 | 0.89% |
| 8 | 3,843,301 | 4,204,996 | 9.41% |
| 9 | 100,286,070 | 107,544,824 | 7.24% |
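The "Difference (%)" column appears to be the signed relative error of the estimate with respect to the actual genome size. The following is a minimal sketch under that assumption; the function name `percent_difference` is illustrative and does not come from the paper.

```python
def percent_difference(actual: int, estimated: int) -> float:
    """Signed relative error of the estimate, as a percentage of the actual size."""
    return (estimated - actual) / actual * 100.0


# (actual size, estimated size) for datasets 1, 2 and 4, copied from the table above.
rows = {
    1: (4_215_606, 4_225_613),
    2: (1_042_519, 1_353_267),
    4: (1_892_775, 1_887_639),
}

for dataset, (actual, estimated) in rows.items():
    print(f"dataset {dataset}: {percent_difference(actual, estimated):+.2f}%")
```

A negative value means the genome size was underestimated, as for dataset 4.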
The datasets used for evaluation
| Dataset | Organism | Accession number | Reference genome | Genome length | Read length | Number of reads | Number of base pairs | Coverage |
|---|---|---|---|---|---|---|---|---|
| 1 | | DRR000852 | NC_000964.3 | 4,215,606 | 75 | 3,519,504 | 263,962,800 | 62.62 |
| 2 | | ERR021957 | NC_000117.1 | 1,042,519 | 37 | 7,825,944 | 289,559,928 | 277.75 |
| 3 | | SRR387784 | NC_015875.1 | 2,190,731 | 100 | 4,407,248 | 440,724,800 | 201.18 |
| 4 | | SRR063416 | NC_006570.2 | 1,892,775 | 101 | 6,907,220 | 697,629,220 | 368.57 |
| 5 | | SRR397962 | NC_005823.1 | 4,277,185 | 100 | 7,127,250 | 712,725,000 | 166.63 |
| 6 | | SRR413299 | NC_002950.2 | 2,343,476 | 100 | 9,497,946 | 949,794,600 | 405.29 |
| 7 | | SRR072099 | NC_000913.2 | 4,639,675 | 36 | 30,355,432 | 1,092,795,552 | 235.53 |
| 8 | | SRR400550 | NC_009012.1 | 3,843,301 | 36 | 31,994,160 | 1,151,789,760 | 299.69 |
| 9 | | SRR065390 | WS222 | 100,286,070 | 100 | 67,617,092 | 6,761,709,200 | 67.42 |
The datasets are sorted in increasing order of the total number of base pairs. All datasets and reference genome sequences were obtained from the NCBI, except C. elegans, which is from http://www.wormbase.org.
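The last two columns of the dataset table can be derived from the others: the number of base pairs is the read length times the number of reads, and the coverage is that total divided by the genome length. The sketch below assumes all reads in a dataset have the same length, as the single "Read length" value suggests; the function names are illustrative.

```python
def total_base_pairs(read_length: int, num_reads: int) -> int:
    """Total sequenced bases, assuming every read has the same length."""
    return read_length * num_reads


def coverage(read_length: int, num_reads: int, genome_length: int) -> float:
    """Average number of times each genome position is covered by a read."""
    return total_base_pairs(read_length, num_reads) / genome_length


# Dataset 1 (DRR000852) and dataset 9 (SRR065390), values copied from the table.
print(total_base_pairs(75, 3_519_504))                   # total bases, dataset 1
print(round(coverage(75, 3_519_504, 4_215_606), 2))      # coverage, dataset 1
print(round(coverage(100, 67_617_092, 100_286_070), 2))  # coverage, dataset 9
```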
Predicted read copy count comparison for the datasets in Table 3
| Dataset | MB09 | SAGE |
|---|---|---|
| 1 | 3.19 | |
| 2 | 4.85 | |
| 3 | 7.24 | |
| 4 | 9.37 | |
| 5 | 5.31 | |
| 6 | 9.40 | |
| 7 | - | |
| 8 | - | |
| 9 | - | |
Predicted read copy count comparison between the algorithm of [26], denoted MB09, and the procedure used by SAGE. The values given are the percentages of correctly predicted copy counts. The MB09 algorithm could not process the last 3 datasets from Table 3.
NGA50 comparison; best results in bold
| NGA50 | ABySS | SGA | SOAP2 | SPAdes | SAGE |
|---|---|---|---|---|---|
| 1 | 423,890 | 68,419 | 551,507 | 441,472 | |
| 2 | 301,840 | 97,593 | 225,668 | | 669,089 |
| 3 | 23,245 | 21,876 | 26,356 | 18,167 | |
| 4 | | 23,314 | 23,294 | 23,762 | 23,961 |
| 5 | 117,711 | 83,128 | 132,993 | 38,735 | |
| 6 | 35,564 | 37,013 | 42,835 | 32,926 | |
| 7 | | 10,038 | 98,665 | 36,300 | 96,980 |
| 8 | 52,944 | 23,747 | 54,744 | 52,142 | |
| 9 | 18,210 | 20,436 | 31,973 | 20,468 | |
| Avg. | 122,322 | 42,840 | 132,004 | 151,137 | |
NGA75 comparison; best results in bold
| NGA75 | ABySS | SGA | SOAP2 | SPAdes | SAGE |
|---|---|---|---|---|---|
| 1 | 162,208 | 40,124 | 306,202 | | 306,386 |
| 2 | 160,704 | 51,570 | 125,082 | | 307,765 |
| 3 | 9,847 | 7,570 | 6,785 | 6,052 | |
| 4 | | 13,117 | 13,117 | 13,638 | 13,377 |
| 5 | 58,556 | 40,333 | 64,594 | 22,071 | |
| 6 | 20,005 | 18,062 | 19,982 | 15,716 | |
| 7 | | 5,270 | 54,790 | 18,706 | 54,784 |
| 8 | 28,805 | 8,618 | 25,243 | 25,255 | |
| 9 | 7,126 | 7,596 | 13,232 | 8,122 | |
| Avg. | 57,632 | 21,362 | 69,892 | 83,926 | |
Max alignment; best results in bold
| Max | ABySS | SGA | SOAP2 | SPAdes | SAGE |
|---|---|---|---|---|---|
| 1 | 800,991 | 241,307 | 1,014,436 | | 1,016,322 |
| 2 | 359,339 | 210,791 | 339,457 | | 669,089 |
| 3 | | | 125,563 | 74,151 | |
| 4 | 87,729 | 87,426 | 87,417 | 87,801 | |
| 5 | 413,583 | 319,895 | 320,270 | 137,901 | |
| 6 | | 167,699 | 167,686 | 154,317 | 172,565 |
| 7 | 326,073 | 54,214 | 325,634 | 162,291 | |
| 8 | 186,547 | 106,016 | 186,433 | | 186,424 |
| 9 | 213,835 | 239,959 | 382,096 | 171,314 | |
| Avg. | 298,476 | 172,547 | 327,666 | 301,886 | |
Genome coverage (%); best results in bold
| Coverage | ABySS | SGA | SOAP2 | SPAdes | SAGE |
|---|---|---|---|---|---|
| 1 | | 98.67 | 98.63 | 98.63 | 98.95 |
| 2 | 98.57 | 98.04 | 94.65 | 99.36 | |
| 3 | | 82.82 | 81.54 | 82.01 | 83.19 |
| 4 | | 93.07 | 92.58 | 93.80 | 93.56 |
| 5 | 99.49 | 98.77 | 98.75 | 95.15 | |
| 6 | | 95.08 | 95.62 | 95.95 | 97.77 |
| 7 | | 94.10 | 94.80 | 95.09 | 95.42 |
| 8 | 95.78 | 92.63 | 92.81 | 94.08 | |
| 9 | 95.49 | 95.19 | 95.28 | 95.48 | |
| Avg. | | 94.26 | 93.85 | 94.40 | |
Unaligned contigs; best results in bold
| Bad contigs | ABySS | SGA | SOAP2 | SPAdes | SAGE |
|---|---|---|---|---|---|
| 1 | 1 | | 8 | 3 | |
| 2 | | 16 | 16 | | 19 |
| 3 | | 26 | 29 | 30 | 14 |
| 4 | | 1 | 2 | 11 | 1 |
| 5 | | 1 | | 266 | |
| 6 | | | | 423 | |
| 7 | | 4 | | 2 | 2 |
| 8 | 1 | | 11 | 1 | 3 |
| 9 | 978 | 304 | 272 | 363 | |
Misassemblies; best results in bold
| Misassemblies | ABySS | SGA | SOAP2 | SPAdes | SAGE |
|---|---|---|---|---|---|
| 1 | 1 | | | 1 | 10 |
| 2 | | | | | |
| 3 | 149 | | 118 | 115 | 169 |
| 4 | 94 | 35 | 34 | 28 | |
| 5 | 8 | | 7 | 114 | 11 |
| 6 | 10 | | 15 | 8 | 17 |
| 7 | 26 | | 5 | 22 | 25 |
| 8 | 18 | | 9 | 6 | 32 |
| 9 | 1399 | | 246 | 713 | 1196 |
Local misassemblies; best results in bold
| Local misassemblies | ABySS | SGA | SOAP2 | SPAdes | SAGE |
|---|---|---|---|---|---|
| 1 | | | 13 | 1 | 11 |
| 2 | 11 | | 41 | 9 | 26 |
| 3 | 100 | 87 | 85 | | 96 |
| 4 | | | 47 | | 56 |
| 5 | 29 | 42 | 58 | 159 | |
| 6 | 11 | 17 | 40 | | 14 |
| 7 | 80 | | 71 | 17 | 83 |
| 8 | 9 | | 69 | 6 | 164 |
| 9 | 6282 | 2209 | 2703 | | 1474 |
Average number of indels and mismatches per 100 kbp; best results in bold
| Indel/mm | ABySS | SGA | SOAP2 | SPAdes | SAGE |
|---|---|---|---|---|---|
| 1 | 8.65 | | 7.14 | 5.78 | 5.61 |
| 2 | 825.09 | 833.32 | | 834.67 | 893.48 |
| 3 | 2407.49 | 2417.37 | 2387.42 | | 2402.27 |
| 4 | 518.38 | 522.02 | 527.28 | 509.66 | |
| 5 | 19.01 | 18.89 | 45.86 | 581.17 | |
| 6 | | 35.81 | 13.83 | 22.33 | 26.75 |
| 7 | 36.20 | | 29.77 | 17.77 | 47.86 |
| 8 | 14.45 | | 73.19 | 28.18 | 46.04 |
| 9 | 50.08 | | 34.84 | 76.92 | 47.63 |
Time and space comparison
| Data | ABySS | SGA | SOAP2 | SPAdes | SAGE |
|---|---|---|---|---|---|
| 1 | 2.07 / 3.77 | 11.14 / 8.37 | 0.68 / 17.15 | 6.79 / 5.92 | |
| 2 | 2.56 / 7.75 | 13.99 / 12.00 | | 12.26 / 22.88 | 0.57 / |
| 3 | 1.03 / | 12.87 / 9.23 | | 6.68 / 2.98 | 0.37 / 1.78 |
| 4 | 1.41 / | 13.34 / 9.77 | 0.53 / 13.68 | 6.67 / 1.74 | |
| 5 | 1.13 / | 13.69 / 9.87 | | 7.03 / 3.11 | 0.43 / 1.90 |
| 6 | 2.05 / | 12.69 / 13.32 | 0.52 / 9.52 | 5.43 / 3.73 | |
| 7 | 1.78 / | 11.03 / 11.57 | 0.45 / 8.28 | 4.72 / 3.24 | |
| 8 | 2.57 / 1.40 | 4.21 / 4.45 | | 2.88 / | 3.04 / 5.98 |
| 9 | 3.99 / | 19.88 / 9.90 | 1.47 / 4.94 | 19.72 / 7.17 | |
| Avg. | 2.07 / | 12.54 / 9.83 | | 8.02 / 5.74 | 0.78 / 2.86 |
The results are presented in the format “time/space”, with time in seconds and space in megabytes, both per mega base pair of input. The best results are shown in bold. The last row gives the average values.
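The normalization described in this note can be sketched as follows, assuming that one mega base pair means 10^6 base pairs and that the input size is the "Number of base pairs" value of the dataset. The function name `per_mbp` and the run time and memory figures in the example are hypothetical and are not measurements from the paper.

```python
from typing import Tuple


def per_mbp(seconds: float, megabytes: float, total_bp: int) -> Tuple[float, float]:
    """Normalize run time (s) and memory (MB) by the input size in mega base pairs."""
    mbp = total_bp / 1_000_000  # assumption: 1 Mbp = 10**6 base pairs
    return seconds / mbp, megabytes / mbp


# Hypothetical run on dataset 1 (263,962,800 input base pairs):
# 528 s of run time and 1,056 MB of memory are made-up illustration values.
time_per_mbp, space_per_mbp = per_mbp(528.0, 1056.0, 263_962_800)
print(f"{time_per_mbp:.2f} / {space_per_mbp:.2f}")  # same "time / space" layout as the table
```

Normalizing by input size is what makes rows comparable across datasets that differ in size by more than an order of magnitude.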