Abstract
BACKGROUND: Next-generation sequencing (NGS) techniques have been around for over a decade. Many of their fundamental applications rely on the ability to compute good genome assemblies. As the technology evolves, assembly algorithms and tools have to continuously adjust and improve. The currently dominant technology, Illumina, produces reads that are too short to bridge many repeats, which limits what can be successfully assembled. The emerging SMRT (Single Molecule, Real-Time) sequencing technique from Pacific Biosciences produces uniform coverage and long reads of up to sixty thousand base pairs, enabling significantly better genome assemblies. However, SMRT reads are much more expensive than Illumina reads and have a much higher error rate, around 10-15%, mostly due to indels. New algorithms are needed to take advantage of the long reads while mitigating the effect of the high error rate and lowering the required coverage.
Keywords: Genome assembly; PacBio sequencing; Read aligner; Read overlapper
Year: 2017 PMID: 29258419 PMCID: PMC5735879 DOI: 10.1186/s12859-017-1953-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 All k-mer matches between reads q and r before (a) and after (b) clustering
Fig. 2 Computing the alignment. The dark grey region contains all k-mer matches and is extended by the light grey ones using k′-mer matches
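The k-mer matches that both figure captions refer to can be illustrated with a minimal sketch. It only enumerates exact k-mer matches between two reads; the clustering and extension steps shown in the figures are not reproduced. The function name `kmer_matches`, the example reads, and the choice of a smaller second k-mer length are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

def kmer_matches(q, r, k):
    """Return all (i, j) pairs such that q[i:i+k] == r[j:j+k]."""
    # Index every k-mer of r by its start positions.
    index = defaultdict(list)
    for j in range(len(r) - k + 1):
        index[r[j:j + k]].append(j)
    # Look up every k-mer of q in the index.
    matches = []
    for i in range(len(q) - k + 1):
        for j in index.get(q[i:i + k], ()):
            matches.append((i, j))
    return matches

# Hypothetical toy reads, far shorter than real SMRT reads.
q = "ACGTACGTGA"
r = "TTACGTACGA"
long_matches = kmer_matches(q, r, 5)   # stricter seeds
short_matches = kmer_matches(q, r, 3)  # shorter seeds match more often
print(long_matches)
print(len(short_matches) >= len(long_matches))
```

Every match of length k contains a match of any shorter length at the same positions, so shorter seeds can only add matches. This is one plausible reason to use a second, shorter seed length when extending a region.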
SMRT datasets used for evaluation
| Genome | Reference number | Coverage | Chemistry | Genome size (Mbp) |
|---|---|---|---|---|
| E.coli | NC_000913 | 85x | P5C3 | 4.64 |
| S.cerevisiae | NC_001133.9 | 117x | P4C2 | 12.16 |
| C.elegans | WS222 | 80x | P6C4 | 100.2 |
| A.thaliana | TAIR10 | 110x | P4C2 | 134.6 |
| D.melanogaster | Ref v5 | 90x | P5C3 | 129.7 |
Comparison for the 1Gbp datasets (coverage levels in parentheses)
Sensitivity, specificity, precision, and F1-score are given as percentages. A dash means that the program crashed with a segmentation fault. The best values are shown in bold. The bottom of the table shows the average values, each computed from the five corresponding values in the table
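The four measures named in this footnote follow the standard confusion-matrix definitions. The sketch below assumes those standard definitions and nonzero denominators; the function names and the toy counts are hypothetical, and how true and false overlaps are counted in the paper is not shown here.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (same units in, same units out)."""
    return 2 * precision * sensitivity / (precision + sensitivity)

def overlap_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, precision and F1-score as percentages.

    tp, fp, tn, fn are counts of true/false positives/negatives.
    Assumes every denominator is nonzero.
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    precision = 100.0 * tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1_score(precision, sensitivity),
    }

# Hypothetical counts, for illustration only.
print(overlap_metrics(tp=80, fp=20, tn=880, fn=20))

# Consistency check against values reported in the MHAP sketch-size table:
# sensitivity 83.74 and precision 97.15 are listed with F1-score 89.95.
print(round(f1_score(97.15, 83.74), 2))
```

The last line is a quick way to check that a reported F1-score agrees with the sensitivity and precision in the same column.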
Time and memory comparison for the 1Gbp datasets
| Genome | Time (h) / Memory (GB) | BLASR | DALIGNER | GraphMap | MHAP | Minimap | HISEA |
|---|---|---|---|---|---|---|---|
| E.coli | Time | 113.0 | 3.0 | 0.3 | 3.0 | | 4.0 |
| | Memory | | 124.6 | 42.3 | 210.0 | 8.8 | 25.5 |
| S.cerevisiae | Time | 283.2 | – | 0.6 | 10.6 | | 23.5 |
| | Memory | | – | 71.0 | 210.0 | 15.1 | 56.5 |
| C.elegans | Time | 333.6 | 4.1 | 0.6 | 4.3 | | 23.6 |
| | Memory | 14.5 | 248.2 | 59.0 | 210.0 | | 46.4 |
| A.thaliana | Time | 43.2 | 8.1 | 0.6 | 5.9 | | 12.2 |
| | Memory | 10.3 | 248.2 | 60.0 | 210.0 | | 45.3 |
| D.melanogaster | Time | 355.2 | 12.5 | 0.4 | 4.8 | | 95.1 |
| | Memory | 16.7 | 204.2 | 59.0 | 210.0 | | 48.1 |
CPU time is in hours and the memory in GB. The best results are in bold
Comparison of several types of sensitivity computations on the 1Gbp datasets
For each dataset, four types of sensitivity computations are used: “presence” only checks for the read pair, “length” also checks the correct length, “bounds” checks for correct alignment bounds (the one used in this paper), and the last one is from Berlin et al. [11]
Testing larger sketch sizes for MHAP. Starting with the value used for testing, 1256, the sketch size is increased in increments of 512 up to 3816
| Genome | Parameter | MHAP sketch size | | | | | |
|---|---|---|---|---|---|---|---|
| | | 1256 | 1768 | 2280 | 2792 | 3304 | 3816 |
| | Sensitivity | 83.74 | 85.75 | 86.52 | 86.87 | 87.05 | 87.16 |
| | Specificity | 99.90 | 99.86 | 99.84 | 99.82 | 99.81 | 99.80 |
| | Precision | 97.15 | 96.99 | 96.82 | 96.89 | 96.88 | 97.70 |
| | F1-score | 89.95 | 91.02 | 91.38 | 91.61 | 91.70 | 92.13 |
| | Sensitivity | 62.08 | 63.67 | 64.32 | 64.52 | 64.62 | 64.69 |
| | Specificity | 99.77 | 99.72 | 99.66 | 99.63 | 99.58 | 99.56 |
| | Precision | 89.29 | 88.79 | 88.69 | 88.62 | 88.55 | 88.30 |
| | F1-score | 73.24 | 74.16 | 74.56 | 74.67 | 74.72 | 74.67 |
| | Sensitivity | 80.43 | 81.81 | 82.37 | 82.62 | 82.69 | 82.73 |
| | Specificity | 99.97 | 99.93 | 99.90 | 99.88 | 99.85 | 99.82 |
| | Precision | 45.46 | 35.71 | 29.32 | 25.80 | 23.75 | 22.13 |
| | F1-score | 58.09 | 49.72 | 43.25 | 39.32 | 36.90 | 34.92 |
| | Sensitivity | 76.19 | 77.05 | 77.38 | 77.49 | 77.55 | 77.57 |
| | Specificity | 99.91 | 99.87 | 99.86 | 99.85 | 99.84 | 99.83 |
| | Precision | 88.78 | 88.50 | 88.68 | 88.35 | 88.55 | 88.33 |
| | F1-score | 82.00 | 82.38 | 82.65 | 82.56 | 82.69 | 82.60 |
| | Sensitivity | 71.86 | 73.36 | 73.89 | 74.12 | 74.24 | 74.30 |
| | Specificity | 99.94 | 99.92 | 99.91 | 99.88 | 99.87 | 99.86 |
| | Precision | 72.47 | 72.00 | 72.07 | 72.46 | 71.45 | 71.62 |
| | F1-score | 72.16 | 72.67 | 72.97 | 73.28 | 72.82 | 72.94 |
Note that the results for the first column (sketch size 1256) appear also in Table 2. They are repeated here for comparison convenience
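To make the sketch-size parameter concrete, here is a minimal, generic MinHash sketch in which the sketch size is the number of hash functions used. MHAP is built on MinHash sketches, but this is not MHAP's implementation: the function names, the use of salted BLAKE2b as the hash family, and the toy sequences are assumptions for illustration. The code assumes each sequence has at least k bases.

```python
import hashlib

# Sketch sizes listed in the table: start at 1256, step 512, stop at 3816.
SKETCH_SIZES = list(range(1256, 3816 + 1, 512))

def kmer_set(seq, k):
    """Set of distinct k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(seq, k, sketch_size):
    """One minimum hash value per seeded hash function (sketch_size in total)."""
    kmers = kmer_set(seq, k)
    sketch = []
    for seed in range(sketch_size):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        sketch.append(min(
            hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest()
            for km in kmers
        ))
    return sketch

def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates k-mer set similarity."""
    equal = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return equal / len(sketch_a)

print(SKETCH_SIZES)
a = minhash_sketch("ACGTACGTTGCAAGTC", 4, 64)
b = minhash_sketch("ACGTACGTTGCTAGTC", 4, 64)
print(estimated_jaccard(a, b))
```

A larger sketch gives a lower-variance similarity estimate at the cost of more hashing and more memory per read, which is the trade-off the table explores.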
Testing higher number of minimizers for Minimap. Starting with the value we have used for testing, w=5, we increase the number of minimizers by decreasing w all the way to the smallest value w=1. Note that the results for the first column (w=5) appear also in Table 2. They are repeated here for comparison convenience
| Genome | Parameter | Minimap window size | | | | |
|---|---|---|---|---|---|---|
| | | 5 | 4 | 3 | 2 | 1 |
| | Sensitivity | 91.80 | 93.08 | 94.13 | 95.24 | 96.29 |
| | Specificity | 99.93 | 99.92 | 99.93 | 99.92 | 99.91 |
| | Precision | 97.13 | 97.22 | 97.42 | 97.51 | 97.58 |
| | F1-score | 94.39 | 95.10 | 95.75 | 96.36 | 96.93 |
| | Sensitivity | 9.35 | 9.64 | 9.94 | 10.36 | 11.00 |
| | Specificity | 99.98 | 99.98 | 99.97 | 99.97 | 99.97 |
| | Precision | 94.30 | 94.18 | 93.28 | 91.90 | 88.58 |
| | F1-score | 17.01 | 17.49 | 17.97 | 18.62 | 19.57 |
| | Sensitivity | 85.38 | 86.63 | 87.63 | 88.77 | 89.80 |
| | Specificity | 99.98 | 99.98 | 99.98 | 99.98 | 99.97 |
| | Precision | 89.80 | 89.77 | 89.05 | 88.11 | 85.76 |
| | F1-score | 87.53 | 88.17 | 88.33 | 88.44 | 87.73 |
| | Sensitivity | 23.55 | 26.90 | 31.21 | 37.08 | 45.56 |
| | Specificity | 99.97 | 99.98 | 99.96 | 99.96 | 99.96 |
| | Precision | 84.00 | 84.77 | 85.48 | 86.43 | 87.94 |
| | F1-score | 36.79 | 40.84 | 45.73 | 51.90 | 60.02 |
| | Sensitivity | 40.72 | 42.82 | 45.51 | 49.11 | 54.00 |
| | Specificity | 99.99 | 99.98 | 99.98 | 99.98 | 99.97 |
| | Precision | 83.93 | 83.12 | 82.87 | 81.85 | 81.25 |
| | F1-score | 54.84 | 56.52 | 58.75 | 61.39 | 64.88 |
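The relationship between the window size w and the number of minimizers can be sketched as follows. This is a textbook (w, k)-minimizer selection that uses plain lexicographic order with leftmost tie-breaking; Minimap's actual ordering and implementation differ, and the function name and the toy sequence are assumptions for illustration.

```python
def minimizers(seq, k, w):
    """Select, for every window of w consecutive k-mers, the smallest k-mer.

    Returns a sorted list of (position, k-mer) pairs. Lexicographic order is
    an assumption of this sketch; ties are broken by the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda t: window[t])  # leftmost minimum
        selected.add((start + offset, window[offset]))
    return sorted(selected)

# Hypothetical toy sequence; k is fixed while w decreases from 5 to 1.
seq = "ACGTTGCAAGTCCGATAGGCTTAACGGTCA"
for w in (5, 4, 3, 2, 1):
    print(w, len(minimizers(seq, 4, w)))
```

With w=1 every k-mer is its own window, so all k-mers are kept. That is why decreasing w raises the number of minimizers and, in the table, tends to raise sensitivity.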
Fig. 3 Sensitivity as a function of mean overlap length
Sensitivity, specificity, precision, and F1-score for the HISEA and MHAP program output within the Canu pipeline
Two coverage levels are considered for each dataset: 30x and 50x. The best values are shown in bold. The bottom of the table shows the average values, each computed from the five corresponding values in the table. All values are percentages
Pipeline assembly comparison; Canu assembler is used with MHAP and HISEA as read aligners
| Genome | Parameter | Canu + MHAP | | Canu + HISEA | |
|---|---|---|---|---|---|
| | | 30x | 50x | 30x | 50x |
| | Contig # | | 3 | 8 | |
| | NG50 | | 3,969,196 | 1,223,211 | |
| | Max contig | | 3,969,196 | 1,525,215 | |
| | % Ref | | 99.97 | 99.82 | |
| | Avg idy | | | | |
| | Breakpoints | | | | |
| | Contig # | 43 | 31 | | |
| | NG50 | 540,299 | 687,498 | | |
| | Max contig | 964,505 | 1,534,125 | | |
| | % Ref | 98.90 | 99.35 | | |
| | Avg idy | 99.81 | | | |
| | Breakpoints | | | | |
| | Contig # | 393 | 170 | | |
| | NG50 | 636,401 | 1,987,017 | | |
| | Max contig | 2,648,207 | 4,224,025 | | |
| | % Ref | 96.00 | | | 99.80 |
| | Avg idy | 99.76 | | | |
| | Breakpoints | 431 | | | 435 |
| | Contig # | 159 | | | 122 |
| | NG50 | 3,331,858 | 6,715,370 | | |
| | Max contig | | 14,177,369 | 12,890,806 | |
| | % Ref | 92.22 | | | 92.51 |
| | Avg idy | | | | |
| | Breakpoints | | | 2,680 | 2,704 |
| | Contig # | 597 | 390 | | |
| | NG50 | 1,933,939 | 4,983,913 | | |
| | Max contig | 8,238,062 | 17,900,724 | | |
| | % Ref | 95.08 | 98.55 | | |
| | Avg idy | | | | 99.87 |
| | Breakpoints | | | 1,254 | 1,461 |
Two coverage levels, 30x and 50x, are used for each genome. The best results are shown in bold
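For readers unfamiliar with the NG50 rows, the sketch below uses the usual definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the reference genome size. The function name and the toy contig lists are hypothetical; the assembly evaluation tool used in the paper may handle edge cases differently.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which cumulative length reaches half the genome size.

    Contigs are visited from longest to shortest. Returns 0 when the contigs
    together cover less than half of the genome (NG50 is undefined there).
    """
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0

# Hypothetical contig lengths for a genome of size 200.
print(ng50([50, 30, 20, 10], 200))

# A single contig of 3,969,196 bp (a value appearing in the table) already
# covers more than half of a 4.64 Mbp genome, so NG50 equals its length.
print(ng50([3_969_196], 4_640_000))
```

Unlike N50, which is relative to the total assembly length, NG50 is relative to the reference genome size, so assemblies of the same genome can be compared directly.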
Assembly time and space comparison; the time is wall clock time in hours, the space is in GB
| Genome | Canu + MHAP | | | | MHAP | | Canu + HISEA | | | | HISEA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 30x | | 50x | | 30x | 50x | 30x | | 50x | | 30x | 50x |
| | Time | Space | Time | Space | Time | Time | Time | Space | Time | Space | Time | Time |
| | | 210 | | 210 | | | | | 0.7 | | | |
| | | 210 | | 210 | 0.3 | | 1.2 | | 2.9 | | 0.2 | 0.6 |
| | | 210 | | 210 | | | 37.7 | | 75.5 | | 11.5 | 17.1 |
| | | 210 | | 210 | | | 42.3 | | 98.0 | | 15.3 | 35.0 |
| | | 210 | | 210 | | | 51.8 | | 112.8 | | 19.7 | 33.6 |
The same setup as in Table 8 is used. The best values are in bold