Shin-Hung Lin, Yu-Chieh Liao.
Abstract
Many assemblers have been proposed for the de novo assembly of genomes; however, no individual assembler guarantees the optimal assembly for diverse species. The parameters of an assembler are often tuned to obtain the best assembly it can produce, but few efforts have been made to combine multiple assemblies into a single assembly of higher accuracy. In this study, we employ various state-of-the-art assemblers to generate different sets of contigs for bacterial genomes. A tool named CISA has been developed to integrate these assemblies into a hybrid set of contigs, yielding assemblies of superior contiguity and accuracy compared with both the assemblies generated by the individual state-of-the-art assemblers and the hybrid assemblies merged by existing tools. The tool is implemented in Python and requires MUMmer and BLAST+ to be installed on the local machine. The source code of CISA and examples of its use are available at http://sb.nhri.org.tw/CISA/.
Year: 2013 PMID: 23556006 PMCID: PMC3610655 DOI: 10.1371/journal.pone.0060843
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
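The abstract states that CISA requires MUMmer and BLAST+ on the local machine. A minimal sketch of a pre-flight check is shown below. It assumes the relevant executables are `nucmer` (shipped with MUMmer) and `blastn` and `makeblastdb` (shipped with BLAST+); which of these CISA actually invokes is an assumption here, and `missing_tools` is a hypothetical helper, not part of CISA.

```python
import shutil

# Assumed executable names: nucmer comes with MUMmer, blastn and
# makeblastdb come with BLAST+. Adjust to whatever the tool really calls.
REQUIRED_TOOLS = ("nucmer", "blastn", "makeblastdb")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.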
Figure 1. A schematic overview of CISA.
Phase (1): employing the largest contig as a representative contig and identifying the contigs which were aligned to the ends of the representative contig with more than 80% alignment to perform possible extension (hollow ellipses and solid arrows represent before and after extension, respectively). Phase (2): removing and splitting misassembled contigs. Two misassembled contigs are shown in black. The left element represents a misjoined error because it was assembled in two different contigs in two individual assemblies (blue and red); the right element represents an insertion error (dot), which was never seen in other assemblies. Phase (3): iteratively merging contigs with a proper end-to-end overlap and estimating the size of repetitive regions. Phase (4): merging two contigs with a small overlap greater than the maximum size of repetitive regions.
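The overlap-based merging in phases (3) and (4) can be illustrated with a deliberately simplified sketch. It assumes exact string matching between the end of one contig and the start of the next, whereas CISA relies on sequence alignments; the function names and the `max_repeat_size` parameter are hypothetical and chosen only for illustration.

```python
def suffix_prefix_overlap(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    `min_overlap` is expected to be at least 1.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_if_overlapping(left, right, max_repeat_size):
    """Merge two contigs only if their end-to-end overlap is strictly
    longer than `max_repeat_size`; otherwise return None."""
    size = suffix_prefix_overlap(left, right, min_overlap=max_repeat_size + 1)
    if size == 0:
        return None
    # Keep the overlapping bases once, then append the rest of `right`.
    return left + right[size:]
```

Requiring the overlap to exceed the estimated repeat size reflects the idea in phase (4): an overlap no longer than a known repeat could join two unrelated regions that merely share that repeat.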
Evaluation of sequence assemblies for E. coli.
| Name | Num Contigs | Assembly Bases | DCJ Distance | %Missed | %Extra | Intact CDS | Max Contig | N50 | Blast-based Intact CDS | Assembly errors (Indel ≥ 5, Inversion, Relocation) | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Abyss | 133 | 4626205 | 107 | 1.25 | 0.64 | 4263 | 222425 | 96511 | 4249 | 8 (6, 0, 2) | |
| CLC | 379 | 4546926 | 304 | 2.81 | 0.07 | 4258 | 107342 | 29905 | 4228 | 2 (0, 1, 1) | |
| Edena | 211 | 4569446 | 154 | 1.87 | 0.05 | 4254 | 186686 | 57790 | 4191 | 4 (2, 1, 1) | |
| SOAPdenovo | 553 | 4547211 | 475 | 2.68 | 0.15 | 4220 | 103369 | 17944 | 4131 | 1 (0, 0, 1) | |
| Velvet | 283 | 4550675 | 207 | 2.51 | 0.06 | 4246 | 166094 | 54359 | 4194 | 8 (5, 0, 3) | |
| GAA | 314 | 4578451 | 245 | 2.09 | 0.23 | 4248 | 157184 | 51218 | 4205 | 4 (2, 0, 2) | |
| GAA | 311 | 4602917 | 224 | 2.01 | 0.26 | 4244 | 163308 | 51085 | 4208 | 5 (3, 0, 2) | |
| MAIA | 110 | 4513348 | 96 | 2.80 | 0.02 | 4272 | 312145 | 126075 | 4212 | 5 (2, 0, 3) | |
| minimus2 | 155 | 4598769 | 133 | 1.94 | 0.71 | 4262 | 202745 | 86241 | 4243 | 8 (5, 1, 2) | |
| minimus2A | 74 | 4608653 | 68 | 1.66 | 0.77 | 4270 | 417704 | 134584 | 4262 | 10 (7, 1, 2) | |
| minimus2B | 69 | 4215087 | 69 | 11.82 | 2.68 | 4269 | 312145 | 126233 | 3855 | 10 (5, 2, 3) | |
| minimus2 | 73 | 4597392 | 67 | 3.36 | 2.25 | 4268 | 296685 | 129557 | 4199 | 11 (7, 1, 3) | |
| | | | | | | | | | | | |
| Abyss | 190 | 4642915 | 119 | 1.08 | 0.51 | 4244 | 222412 | 68544 | 4243 | 3 (0, 0, 3) | |
| Edena | 421 | 4584984 | 331 | 1.87 | 0.15 | 4175 | 104739 | 24632 | 4048 | 3 (1, 1, 1) | |
| SOAPdenovo | 560 | 4596003 | 272 | 2.41 | 0.14 | 4209 | 105615 | 31837 | 4105 | 3 (1, 0, 2) | |
| Velvet | 264 | 4569720 | 191 | 2.22 | 0.09 | 4228 | 161713 | 47550 | 4159 | 16 (4, 7, 5) | |
| GAA | 354 | 4616496 | 223 | 1.63 | 0.30 | 4218 | 153242 | 46435 | 4152 | 6 (2, 1, 3) | |
| GAA | 311 | 4636486 | 222 | 1.28 | 0.49 | 4213 | 162326 | 50871 | 4152 | 6 (3, 1, 2) | |
| MAIA | 263 | 4351338 | 239 | 9.10 | 0.48 | 4226 | 161713 | 47928 | 3882 | 1 (0, 0, 1) | |
| minimus2 | 164 | 4593120 | 141 | 2.49 | 0.76 | 4241 | 184547 | 63799 | 4177 | 10 (4, 3, 3) | |
| minimus2 | 94 | 4588207 | 88 | 6.95 | 4.52 | 4244 | 225809 | 88309 | 4077 | 13 (6, 3, 4) | |
Please note that GAA and minimus2 were designed to merge two assemblies at a time. All 2-combinations were thus performed and the average scores were taken.
Please note that GAA and minimus2 were performed iteratively in ten random orders and the obtained scores were averaged.
A: The order of combination is Abyss + CLC + Edena + SOAPdenovo + Velvet
B: The order of combination is SOAPdenovo + CLC + Velvet + Edena + Abyss
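The N50 column in the table can be read with the usual definition in mind: the contig length at which contigs of that length or longer account for at least half of the assembled bases. The sketch below assumes this standard definition computed against the total assembly size; the evaluation behind the table may have used a different denominator (for example the reference genome size), so it will not necessarily reproduce the reported values.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed contig lengths. Returns 0 for an empty collection.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0
```

For example, for contigs of lengths 8, 8, 4, 3, 3, 2, 2 and 2 (32 bases in total), the two longest contigs already cover 16 bases, so the N50 is 8.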
Evaluation of sequence assemblies for S. aureus.
| Name | Num Contigs | Assembly Bases | DCJ Distance | %Missed | %Extra | Intact CDS | Max Contig | N50 | Blast-based Intact CDS | Assembly errors (Indel ≥ 5, Inversion, Relocation) | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ABySS | 929 | 2769174 | 898 | 3.36 | 0.31 | 2480 | 32717 | 7810 | 2337 | 2 (0, 0, 2) | ||
| Edena | 931 | 2757686 | 882 | 3.49 | 0.25 | 2463 | 37100 | 6969 | 2293 | 2 (0, 0, 2) | ||
| SOAPdenovo | 944 | 2781524 | 917 | 2.80 | 0.42 | 2485 | 26967 | 6427 | 2348 | 3 (0, 0, 3) | ||
| Velvet | 1152 | 2775301 | 1124 | 3.14 | 0.54 | 2421 | 22892 | 5348 | 2238 | 2 (0, 0, 2) | ||
| GAA | 1015 | 2783335 | 959 | 2.90 | 0.45 | 2464 | 29922 | 6634 | 2312 | 2 (0, 0, 2) | ||
| GAA | 1046 | 2794625 | 956 | 2.77 | 0.53 | 2463 | 30079 | 6750 | 2314 | 2 (0, 0, 2) | ||
| MAIA | 769 | 2776022 | 767 | 3.66 | 0.69 | 2474 | 51874 | 8610 | 2360 | 1 (0, 0, 1) | ||
| minimus2 | 739 | 2770378 | 723 | 3.09 | 0.32 | 2516 | 35867 | 9006 | 2401 | 2 (0, 0, 2) | ||
| minimus2 | 568 | 2769000 | 560 | 3.02 | 0.26 | 2548 | 42022 | 11094 | 2450 | 2 (0, 0, 2) | ||
| | | | | | | | | | | | | |
| ABySS | 659 | 2854631 | 590 | 2.41 | 0.25 | 2486 | 35459 | 9229 | 2305 | 7 (2, 0, 5) | ||
| Edena | 3287 | 2557545 | 3143 | 13.46 | 0.62 | 1909 | 8680 | 1256 | 1053 | 4 (1, 0, 3) | ||
| SOAPdenovo | 674 | 2872327 | 522 | 2.45 | 0.36 | 2539 | 47607 | 9762 | 2361 | 3 (2, 0, 1) | ||
| Velvet | 502 | 2858949 | 432 | 2.39 | 0.27 | 2556 | 54726 | 13005 | 2421 | 19 (10, 4, 5) | ||
| GAA | 1287 | 2798306 | 1166 | 4.96 | 0.47 | 2371 | 36637 | 8374 | 2045 | 9 (4, 1, 4) | ||
| GAA | 1150 | 2827068 | 1022 | 4.25 | 0.62 | 2401 | 38358 | 9026 | 2123 | 10 (4, 2, 4) | ||
| MAIA | 505 | 2859291 | 498 | 3.57 | 0.86 | 2552 | 52790 | 12800 | 2376 | 2 (1, 0, 1) | ||
| minimus2 | 421 | 2863142 | 399 | 2.42 | 0.54 | 2552 | 50951 | 13159 | 2425 | 13 (5, 2, 6) | ||
| minimus2 | 302 | 2852733 | 302 | 3.12 | 0.92 | 2579 | 54766 | 16835 | 2468 | 22 (6, 7, 9) | ||
Please note that GAA and minimus2 were designed to merge two assemblies at a time. All 2-combinations were thus performed and the average scores were taken.
Please note that GAA and minimus2 were performed iteratively in ten random orders and the obtained scores were averaged.
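The two averaging schemes described in these notes can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used in the study: `merge` stands for any hypothetical two-input merging tool wrapper, `score` for any hypothetical metric function, and the way random orders were drawn (here, independent shuffles with a fixed seed) is assumed.

```python
import itertools
import random
from functools import reduce
from statistics import mean


def average_pairwise(assemblies, merge, score):
    """Average `score` over the merges of all 2-combinations of assemblies."""
    return mean(
        score(merge(first, second))
        for first, second in itertools.combinations(assemblies, 2)
    )


def average_random_orders(assemblies, merge, score, n_orders=10, seed=0):
    """Average `score` over iterative merges performed in random orders.

    Each order is a random permutation of `assemblies` (a list); the
    assemblies are merged left to right, two inputs at a time.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_orders):
        order = rng.sample(assemblies, k=len(assemblies))
        scores.append(score(reduce(merge, order)))
    return mean(scores)
```

Passing `merge` and `score` as arguments keeps the sketch independent of any particular merger or metric, so the same driver could wrap either of the pairwise tools named in the notes.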
Evaluation of sequence assemblies for H. volcanii.
| Name | Num Contigs | Assembly Bases | DCJ Distance | %Missed | %Extra | Intact CDS | Max Contig | N50 | Blast-based Intact CDS | Assembly errors (Indel ≥ 5, Inversion, Relocation) | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| assembly1 | 157 | 3920004 | 117 | 2.94 | 0.03 | 3981 | 217295 | 127504 | 3953 | 8 (3, 0, 5) | |
| assembly2 | 1555 | 3855484 | 1674 | 4.93 | 0.43 | 3557 | 55518 | 9092 | 3144 | 201 (154, 1, 46) | |
| assembly3 | 580 | 3871717 | 602 | 4.05 | 0.51 | 3575 | 53121 | 12830 | 3411 | 499 (479, 9, 11) | |
| GAA | 693 | 3934772 | 688 | 3.94 | 1.37 | 3730 | 122155 | 54582 | 3593 | 237 (213, 3, 21) | |
| MAIA | 893 | 3619301 | 875 | 24.19 | 6.95 | 3671 | 265643 | 16602 | 2946 | 59 (56, 0, 3) | |
| minimus2 | 179 | 4168210 | 192 | 5.35 | 8.03 | 3754 | 224742 | 113003 | 3855 | 212 (200, 4, 8) | |
| minimus2 | 71 | 4195273 | 77 | 4.46 | 7.87 | 3796 | 332251 | 167497 | 3951 | 186 (177, 4, 5) | |
Please note that GAA and minimus2 were designed to merge two assemblies at a time. All 2-combinations were thus performed and the average scores were taken.
Please note that GAA and minimus2 were performed iteratively in ten random orders and the obtained scores were averaged.
Figure 2. Integration of five assemblies of 36 bp paired-end reads for E. coli using CISA and minimus2.
(A) From left to right, CISA integrates the five assemblies (Abyss, CLC, Edena, SOAPdenovo and Velvet, from the outermost circle to the innermost) and sequentially generates the processed contigs after phases (1), (2), (3), and (4). Each contig color is randomly assigned. The white and grey segments in the inner circle show, respectively, where contigs are missing from and laid onto the genome. The dark-grey segments represent overlaps between contigs. (B) From left to right, minimus2 first merges SOAPdenovo (the inner) with CLC (the outer), then merges the output (the inner) with Velvet (the outer) in the second run, with Edena (the outer) in the third run, and finally with Abyss (the outer) in the fourth run to generate a hybrid assembly.