Abstract
Third-generation sequencing technologies produce sequence reads of 1000 bp or more that may contain rich polymorphism information. However, most currently available sequence analysis tools were developed specifically for analyzing short sequence reads. While the traditional Smith-Waterman (SW) algorithm can be used to map long sequence reads, its naive implementation is computationally infeasible. We have developed a new Sequence mapping and Analyzing Program (SAP) that implements a modified version of SW to speed up the alignment process. In benchmarks with simulated and real exon sequencing data and with real E. coli genome sequence data generated by third-generation sequencing technologies, SAP outperforms currently available tools for mapping short and long sequence reads in both speed and proportion of captured reads. In addition, it achieves high accuracy in detecting SNPs and InDels in the simulated data. SAP is available at https://github.com/davidsun/SAP.
Year: 2012 PMID: 22880129 PMCID: PMC3413671 DOI: 10.1371/journal.pone.0042887
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
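To make the cost argument in the abstract concrete, the following is a minimal sketch of the classic (unmodified) Smith-Waterman scoring recurrence. It is not SAP's modified algorithm, which this page does not describe. The function name and the scoring parameters (`match`, `mismatch`, `gap`) are illustrative assumptions.

```python
def smith_waterman_score(read, ref, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of `read` against `ref`.

    Classic Smith-Waterman with a linear gap penalty. The scoring
    values are illustrative assumptions, not parameters used by SAP.
    """
    prev = [0] * (len(ref) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            # Score for aligning read[i-1] with ref[j-1].
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of a `len(read)` by `len(ref)` matrix is visited once, so the work grows with the product of read length and reference length. That product is what makes the naive approach impractical for long reads mapped against a whole genome.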
Time consumed (in seconds) by different algorithms in mapping reads from simulated and real data.
| Data type | Read length | Read coverage | SAP FastMap | SAP SlowMap | MAQ | SOAP2 | Blat | SHRiMP |
| Simulated data | 75 bp | 5× | 17.80 | 51.77 | 206.75 | 19.90 | 89.63 | - |
| Simulated data | 75 bp | 10× | 24.75 | 89.62 | 222.43 | 41.70 | 172.38 | - |
| Simulated data | 75 bp | 20× | 47.00 | 159.47 | 252.06 | 79.11 | 337.49 | - |
| Simulated data | 150 bp | 5× | 20.93 | 33.97 | 184.47 | 22.08 | 179.50 | - |
| Simulated data | 150 bp | 10× | 33.73 | 51.80 | 186.24 | 46.61 | 303.91 | - |
| Simulated data | 150 bp | 20× | 56.39 | 100.43 | 193.43 | 81.39 | 679.51 | - |
| Simulated data | 1000 bp | 5× | 22.07 | 28.27 | 127.63 | 31.17 | 1436.59 | - |
| Simulated data | 1000 bp | 10× | 37.40 | 41.14 | 131.97 | 55.92 | 3326.57 | - |
| Simulated data | 1000 bp | 20× | 57.95 | 70.61 | 148.68 | 84.05 | 5506.53 | - |
| Real data | ∼76 bp | ∼40× | 124.13 | 625.5 | 339.6 | 327.88 | 541.38 | 642609.1 |
: time-consumption was averaged across only seven individuals' exon-sequencing data.
Percentage of captured reads by different algorithms.
| Data type | Read length | Read coverage | SAP FastMap | SAP SlowMap | MAQ | SOAP2 | Blat | SHRiMP |
| Simulated data | 75 bp | 5× | 99.33 | 99.95 | 99.21 | 81.03 | 99.89 | - |
| Simulated data | 75 bp | 10× | 99.33 | 99.96 | 99.22 | 81.07 | 99.88 | - |
| Simulated data | 75 bp | 20× | 99.33 | 99.96 | 99.23 | 81.09 | 99.89 | - |
| Simulated data | 150 bp | 5× | 99.87 | 99.99 | 0.00 | 45.92 | 100.00 | - |
| Simulated data | 150 bp | 10× | 99.86 | 99.99 | 0.00 | 45.94 | 100.00 | - |
| Simulated data | 150 bp | 20× | 99.86 | 99.99 | 0.00 | 45.92 | 100.00 | - |
| Simulated data | 1000 bp | 5× | 99.24 | 99.53 | 0.00 | 3.18 | 99.08 | - |
| Simulated data | 1000 bp | 10× | 99.20 | 99.49 | 0.00 | 3.13 | 99.04 | - |
| Simulated data | 1000 bp | 20× | 99.23 | 99.52 | 0.00 | 3.18 | 99.07 | - |
| Real data | ∼76 bp | ∼40× | 92.05 | 93.10 | 92.79 | 88.33 | 92.54 | 95.00 |
: percentage was averaged across only seven individuals' exon-sequencing data.
Performance of different algorithms on SNP detection in simulated data.
| Read length | Read coverage | SAP FastMap Cov (%) | SAP FastMap Acc (%) | SAP SlowMap Cov (%) | SAP SlowMap Acc (%) | MAQ Cov (%) | MAQ Acc (%) | SOAP2 Cov (%) | SOAP2 Acc (%) | Blat Cov (%) | Blat Acc (%) |
| 75 bp | 5× | 53.26 | 98.98 | 55.23 | 99.02 | 15.69 | 100.00 | 19.31 | 97.22 | 49.87 | 99.13 |
| 75 bp | 10× | 92.17 | 99.40 | 92.44 | 99.48 | 39.67 | 100.00 | 72.25 | 97.31 | 91.15 | 99.69 |
| 75 bp | 20× | 94.02 | 98.78 | 94.02 | 98.78 | 81.55 | 100.00 | 93.96 | 95.37 | 93.75 | 99.68 |
| 150 bp | 5× | 58.54 | 96.17 | 58.84 | 96.46 | 0.00 | 0.00 | 1.08 | 98.49 | 55.67 | 98.20 |
| 150 bp | 10× | 92.93 | 97.99 | 92.93 | 98.24 | 0.00 | 0.00 | 13.38 | 96.44 | 91.97 | 99.51 |
| 150 bp | 20× | 93.97 | 97.76 | 93.98 | 97.97 | 0.00 | 0.00 | 55.28 | 97.25 | 93.47 | 99.51 |
| 1000 bp | 5× | 80.70 | 95.25 | 81.13 | 97.96 | 0.00 | 0.00 | 0.00 | 0.00 | 76.09 | 98.98 |
| 1000 bp | 10× | 94.23 | 97.11 | 94.28 | 98.33 | 0.00 | 0.00 | 0.00 | 0.00 | 92.12 | 99.79 |
| 1000 bp | 20× | 95.43 | 98.60 | 95.43 | 98.87 | 0.00 | 0.00 | 0.00 | 0.00 | 92.90 | 99.86 |
Cov: prediction coverage; Acc: prediction accuracy.
Performance of different algorithms on detecting SNPs in real data.
| Individual number | SAP FastMap Stat | SAP FastMap Per (%) | SAP SlowMap Stat | SAP SlowMap Per (%) | MAQ Stat | MAQ Per (%) | SOAP2 Stat | SOAP2 Per (%) | Blat Stat | Blat Per (%) |
| NA12878 | 577/659 | 87.56 | 700/815 | 85.89 | 510/897 | 56.86 | 515/554 | 92.96 | 502/515 | 97.48 |
| NA18870 | 649/750 | 86.53 | 680/798 | 85.21 | 562/1217 | 46.18 | 563/624 | 90.22 | 542/559 | 96.96 |
| NA18501 | 657/751 | 87.48 | 589/696 | 84.63 | 581/1107 | 52.48 | 581/633 | 91.79 | 534/551 | 96.92 |
| NA15510 | 630/722 | 87.26 | 639/761 | 83.97 | 558/951 | 58.68 | 563/622 | 90.51 | 529/542 | 97.60 |
| NA19240 | 754/840 | 89.76 | 670/791 | 84.70 | 665/1060 | 62.74 | 676/723 | 93.50 | 637/648 | 98.30 |
| NA18507 | 760/871 | 87.27 | 773/924 | 83.66 | 671/1133 | 59.22 | 633/691 | 91.61 | 622/637 | 97.65 |
| NA19137 | 654/727 | 89.96 | 796/922 | 86.33 | 571/1078 | 52.97 | 567/616 | 92.05 | 548/561 | 97.68 |
| NA18861 | 711/815 | 87.24 | 693/803 | 86.30 | 645/1083 | 59.56 | 629/684 | 91.96 | 605/615 | 98.37 |
| NA18956 | 666/747 | 89.16 | 686/790 | 86.84 | 574/982 | 58.45 | 555/601 | 92.35 | 552/566 | 97.53 |
| NA12156 | 670/763 | 87.81 | 710/852 | 83.33 | 584/932 | 62.66 | 583/628 | 92.83 | 548/559 | 98.03 |
| NA19129 | 710/812 | 87.44 | 662/793 | 83.48 | 617/1096 | 56.30 | 613/665 | 92.18 | 596/606 | 98.35 |
| NA18856 | 676/757 | 89.30 | 667/771 | 86.51 | 606/1045 | 57.99 | 620/672 | 92.26 | 575/584 | 98.46 |
| NA19143 | 611/741 | 82.46 | 721/845 | 85.33 | 511/1143 | 44.71 | 504/560 | 90.00 | 510/528 | 96.59 |
| NA18517 | 779/868 | 89.75 | 661/758 | 87.20 | 682/1191 | 57.26 | 653/709 | 92.10 | 626/637 | 98.27 |
| NA18555 | 687/777 | 88.42 | 629/789 | 79.72 | 608/981 | 61.98 | 586/627 | 93.46 | 595/604 | 98.51 |
| NA10851 | 680/772 | 88.08 | 756/874 | 86.50 | 601/914 | 65.76 | 608/666 | 91.29 | 574/590 | 97.29 |
| Average | 679.4/773.3 | 87.87 | 689.5/811.4 | 84.98 | 596.6/1050.6 | 56.79 | 590.6/642.2 | 91.96 | 568.4/581.4 | 97.78 |
Individual number: adapted from [16].
Stat: the number of “valid” SNPs (left) over the total number of predicted SNPs (right).
Per (%): percentage of “valid” SNPs among the predicted SNPs.
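As a worked check of how the percentage column relates to the statistics column in the real-data SNP table, the sketch below recomputes the percentage from a `valid/total` entry. The helper name `valid_percentage` is hypothetical and used only for illustration.

```python
def valid_percentage(stat):
    """Convert a 'valid/total' entry into a percentage with two decimals.

    `stat` is a string such as '577/659': the number of "valid" SNPs
    over the total number of predicted SNPs.
    """
    valid, total = (int(part) for part in stat.split("/"))
    return round(100.0 * valid / total, 2)


# NA12878 row: SAP FastMap and Blat entries from the table.
fastmap = valid_percentage("577/659")
blat = valid_percentage("502/515")
print(fastmap, blat)
```

For the NA12878 row this reproduces the tabulated values, 87.56 for SAP FastMap and 97.48 for Blat.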
Performance of SAP on detecting InDels in simulated data.
| Read length | Read coverage | Deletions: SAP FastMap Cov (%) | Deletions: SAP FastMap Acc (%) | Deletions: SAP SlowMap Cov (%) | Deletions: SAP SlowMap Acc (%) | Insertions: SAP FastMap Cov (%) | Insertions: SAP FastMap Acc (%) | Insertions: SAP SlowMap Cov (%) | Insertions: SAP SlowMap Acc (%) |
| 75 bp | 5× | 12.38 | 98.04 | 21.02 | 98.77 | 5.41 | 100.00 | 13.19 | 95.15 |
| 75 bp | 10× | 50.41 | 99.09 | 69.06 | 98.91 | 44.57 | 87.09 | 60.75 | 87.12 |
| 75 bp | 20× | 83.34 | 99.83 | 86.49 | 99.66 | 83.41 | 89.63 | 86.11 | 89.11 |
| 150 bp | 5× | 29.62 | 97.09 | 37.94 | 96.01 | 21.84 | 94.82 | 28.74 | 95.16 |
| 150 bp | 10× | 81.20 | 99.40 | 85.24 | 98.50 | 67.06 | 93.13 | 72.67 | 90.90 |
| 150 bp | 20× | 86.14 | 99.14 | 87.19 | 98.81 | 82.04 | 91.36 | 82.73 | 91.29 |
| 1000 bp | 5× | 54.76 | 97.76 | 70.35 | 97.31 | 39.85 | 97.72 | 49.99 | 97.04 |
| 1000 bp | 10× | 88.86 | 97.94 | 92.82 | 97.50 | 73.04 | 89.98 | 75.12 | 90.08 |
| 1000 bp | 20× | 91.24 | 99.42 | 92.27 | 99.43 | 76.20 | 93.96 | 76.17 | 93.75 |
Cov: prediction coverage; Acc: prediction accuracy.