| Literature DB >> 25253180 |
James Lindsay, Hamed Salooti, Ion Măndoiu, Alex Zelikovsky.
Abstract
BACKGROUND: Interest in de novo genome assembly has been renewed in the past decade due to rapid advances in high-throughput sequencing (HTS) technologies which generate relatively short reads resulting in highly fragmented assemblies consisting of contigs. Additional long-range linkage information is typically used to orient, order, and link contigs into larger structures referred to as scaffolds. Due to library preparation artifacts and erroneous mapping of reads originating from repeats, scaffolding remains a challenging problem. In this paper, we provide a scalable scaffolding algorithm (SILP2) employing a maximum likelihood model capturing read mapping uncertainty and/or non-uniformity of contig coverage which is solved using integer linear programming. A Non-Serial Dynamic Programming (NSDP) paradigm is applied to render our algorithm useful in the processing of larger mammalian genomes. To compare scaffolding tools, we employ novel quantitative metrics in addition to the extant metrics in the field. We have also expanded the set of experiments to include scaffolding of low-complexity metagenomic samples.Entities:
Mesh:
Year: 2014 PMID: 25253180 PMCID: PMC4168704 DOI: 10.1186/1471-2105-15-S9-S9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1SILP2 Flowchart.
Figure 2Solving the maximum likelihood ILP via graph decomposition. (a) Graph decomposition into 2-connected components: Red (1-cut) node splits the graph into two 2-connected components A and B. The ILP is solved for each component separately. If the direction of the cut node in the ILP solution for B is opposite to the one in the solution for A, then the solution of B is inverted. Then ILP solutions for A and B are collapsed into the parent solution. (b) Graph decomposition into two 3-connected components: Red and yellow (2-cut) nodes split the graph into two 3-connected components A and B. The ILP is solved for component A twice - for the same and the opposite directions assigned to two 2-cut nodes. Then these two solutions are used in the objective for the ILP of component B. Finally, ILP solutions for A and B are collapsed into the parent solution.
Figure 3SILP2's NSDP algorithm for processing 3-connected components.
Figure 4MCC comparison. MCC of SILP2, OPERA, and MIP across bundle sizes for staph, rhodo, and chr14 datasets. At bundle size 1 OPERA exceeded the allowed runtime of 2 days for the chr14 dataset.
Figure 5TPN50, ALN50 and NA50 as a function of bundle size for single-genome datasets. TPN50 is obtained by breaking incorrect scaffolds, ALN50 is the post-alignment metric developed by us, and NA50 is the QUAST equivalent. The colors indicated in the legend correspond to the bundle size 1 through 7. OPERA was unable to complete on bundle size 1 for chr14 dataset.
The number of reconstructed genes found in the corrected scaffolds for single-genome datasets.
| genome | bundle | SILP2 | OPERA | MIP | total |
|---|---|---|---|---|---|
| staph | 1 | 1,727.70 | 1,168.50 | 1,545.00 | 2692 |
| 2 | 1,727.70 | 1,168.50 | 1,559.50 | ||
| 3 | 1,727.70 | 1,210.60 | 1,575.30 | ||
| 5 | 1,727.70 | 1,262.70 | 1,584.60 | ||
| 7 | 1,727.40 | 1,280.40 | 1,588.50 | ||
| rhodo | 1 | 2022.7 | 1618.6 | 1897.3 | 3067 |
| 2 | 2022.7 | 1618.6 | 1907 | ||
| 3 | 2022.6 | 1751.1 | 1894 | ||
| 5 | 2022.6 | 1834.2 | 1921.3 | ||
| 7 | 2022.6 | 1853.3 | 1933.3 | ||
| chr14 | 1 | 350.9 | - | 349.6 | 529 |
| 2 | 352.00 | 330.10 | 350.40 | ||
| 3 | 352.40 | 336.90 | 350.40 | ||
| 5 | 352.40 | 337.50 | 351.70 | ||
| 7 | 352.40 | 337.60 | 3.00 | ||
| NA12878 2x | 1 | 30817 | - | 30817 | 34039 |
| 2 | 30850 | 30809 | 30849 | ||
In order for a gene to be considered reconstructed 90% of its sequence must be found in a contiguous scaffold. Dashes (-) indicate the method was unable to complete and therefore the gene count could not be computed.
Runtime (in seconds) for scaffolding single-genome datasets.
| genome | bundle | SILP1 | SILP2 | OPERA | MIP |
|---|---|---|---|---|---|
| staph | 1 | 1237 | 6.4 | 2538.1 | 35.8 |
| 2 | 738 | 4.5 | 1456.5 | 17 | |
| 3 | 305 | 4 | 878.5 | 12.834 | |
| 5 | 142 | 3.9 | 386.9 | 10.54 | |
| 7 | 51 | 4.3 | 241 | 10.115 | |
| rhodo | 1 | 1134 | 10 | 2297 | 118.953 |
| 2 | 632 | 4.1 | 455.2 | 25.3 | |
| 3 | 486 | 3.6 | 5.7 | 10.995 | |
| 5 | 86 | 3.4 | 2 | 8.778 | |
| 7 | 75 | 3 | 1.6 | 8.217 | |
| chr14 | 1 | - | 64.7 | - | 706.3 |
| 2 | - | 27.6 | 99.25 | 189.685 | |
| 3 | 629 | 25.5 | 11 | 137.67 | |
| 5 | 370 | 21.5 | 12 | 107.85 | |
| 7 | 400 | 19.25 | 10.75 | 94.9875 | |
| NA12878 2x | 1 | - | 55.2 | - | 89.3 |
| 2 | - | 1670 | 76.49 | 53.28 | |
| 3 | 37751 | 3878 | 7875 | 121.61 | |
| 5 | 27341 | 3183 | 4270 | 134.6 | |
| 7 | 27470 | 3626 | 2180 | 125.66 | |
All timing was captured only during the scaffolding phase of each tool, all read alignment and formatting procedures were excluded from timing. The number is the average of 10 runs for each genome. A dash (-) indicated the tool was unable to complete in the allotted time of 2 days for staph,rhodo,chr14 and 3 days for NA12878.
Combined runtime and accuracy results for the simulated low-complexity metagenome datasets.
| METHOD | FRAC STAPH | RUNTIME | SCFN50 | ALN50 | TPN50 | MCC |
|---|---|---|---|---|---|---|
| SILP2 0 | 1.00 | 14.3 | 51,775.0 | 20,647 | 34,495 | 67.6 |
| SILP2 0 | 0.50 | 13.3 | 50,450.0 | 20,103 | 36,356 | 69.1 |
| SILP2 0 | 0.25 | 12.7 | 47,731.0 | 20,761 | 35,323 | 69.3 |
| SILP2 0 | 0.00 | 11.5 | 21,753.0 | 13,948 | 15,649 | 39.9 |
| SILP2 1 | 1.00 | 14.3 | 52,557.0 | 20,655 | 33,683 | 67.5 |
| SILP2 1 | 0.50 | 13.8 | 48,701.0 | 20,337 | 35,750 | 69.0 |
| SILP2 1 | 0.25 | 13.4 | 49,766.0 | 20,752 | 35,146 | 69.0 |
| SILP2 1 | 0.00 | 11.0 | 21,925.0 | 13,847 | 15,511 | 39.3 |
| SILP2 2 | 1.00 | 14.3 | 43,144.0 | 20,631 | 31,160 | 66.3 |
| SILP2 2 | 0.50 | 13.5 | 42,244.0 | 20,198 | 32,477 | 67.5 |
| SILP2 2 | 0.25 | 13.2 | 45,137.0 | 21,161 | 31,562 | 67.6 |
| SILP2 2 | 0.00 | 10.8 | 22,190.0 | 13,813 | 16,205 | 41.6 |
| SILP2 3 | 1.00 | 14.1 | 43,646.0 | 19,998 | 28,856 | 65.3 |
| SILP2 3 | 0.50 | 13.2 | 41,893.0 | 19,790 | 30,504 | 66.6 |
| SILP2 3 | 0.25 | 13.0 | 42,188.0 | 19,945 | 30,449 | 66.4 |
| SILP2 3 | 0.00 | 11.5 | 21,820.0 | 13,781 | 15,635 | 40.0 |
| OPERA | 1.00 | 2247.2 | 15,573.0 | 13,082 | 10,386 | 10.1 |
| OPERA | 0.50 | 1567.6 | 13,928.0 | 12,006 | 10,440 | 10.7 |
| OPERA | 0.25 | 884.0 | 14,786.0 | 12,617 | 10,507 | 10.5 |
| OPERA | 0.00 | 544.3 | 11,121.0 | 10,720 | 10,273 | 4.9 |
| MIP | 1.00 | 129.9 | 20,104.0 | 12,861 | 18,672 | 18.4 |
| MIP | 0.50 | 121.3 | 19,807.0 | 12,488 | 17,613 | 17.4 |
| MIP | 0.25 | 114.0 | 18,520.0 | 12,269 | 16,680 | 17.2 |
| MIP | 0.00 | 114.1 | 12,690.0 | 10,894 | 12,434 | 8.7 |
| BAMBUS2 | 1.00 | 1025.89 | 11,251.0 | 11,238 | - | - |
| BAMBUS2 | 0.50 | 1452.75 | 10,781.0 | 10,822 | - | - |
| BAMBUS2 | 0.25 | 1676.75 | 10,806.0 | 10,834 | - | - |
| BAMBUS2 | 0.00 | 2272 | 11,526.0 | 11,698 | - | - |
The second column indicates the percentage of total read pairs used from the staph genome testcase, all of the rhodo pairs were used. SCFN50 is the uncorrected N50 reported by each scaffolding tool. The N50 of the contigs alone is 10,339 bp. The integer appended to SILP2 indicates the bundle weight; 0: none, 1: coverage, 2: repeat, 3:coverage * repeat. All methods were run at bundle size 1, the reported number is the average of 10 runs except for OPERA where 6 of the test cases exceeded runtime limits. TPN50 and MCC were unable to be computed for BAMBUS2 because the generated AGP had a non-standard format.