Abstract
BACKGROUND: Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validation.
Year: 2013 PMID: 24044377 PMCID: PMC3848615 DOI: 10.1186/1471-2105-14-274
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Preprocessing steps in each of the four algorithms
| 1.03 | 1.2 | 1.1.18 | 1.6 | |
| SOAP output | SAM/BAM | BAM | SAM/BAM | |
| Penalty | Remove using Atlas-SNP-mapper | Removed | Remove using Picard | |
| Remove | Keep all hits | Keep all hits | Keep all hits | |
| Yes | No | Yes | Yes | |
| No | No | Yes | Yes |
Metrics considered in calling SNPs by each of the four algorithms
| Recalibrated | Raw | Recalibrated | Recalibrated | |
| Yes | Yes | No | Yes | |
| Yes | No | No | Yes | |
| Penalty in quality score | No | No | No | |
| No | Yes | No | No | |
| No | Yes | No | No | |
| No | Yes | No | No | |
| No | Yes | No | No | |
| Yes | No | Yes | No | |
| No | No | Yes | No |
Criteria for calling a SNP in each of the four algorithms
| No | Yes | Yes | Yes | |
| No | Both strands must be covered by variant allele | Yes | Yes | |
| No | Variant allele coverage ≥ 3; upper limits for coverage | Yes | No | |
| No | Heterozygous: ≥ 10%; homozygous variant: ≥ 90% | No | No | |
| No | No | No | No |
Figure 1. Box plots of sequencing quality scores (generated by FastQC). The blue line represents the mean quality score at each base position. Red lines represent medians. Yellow boxes span the 25th to 75th percentiles. The lower and upper whiskers represent the 10th and 90th percentiles, respectively.
Figure 2. The overall workflow of comparing the four SNP calling algorithms.
Key metrics in each of the four algorithms
| Consensus score [0, 99] | |
| Posterior Probability | |
| Genotype quality [0,99], QUAL | |
| Genotype quality [0,99], QUAL, FisherStrand, HaplotypeScore, MappingQualityRankSumTest, ReadPosRankSumTest |
Number of SNVs called by each of the four algorithms using raw and trimmed data
| | |||
| 940 | 545 | 395 | |
| 432 | 315 | 117 | |
| 532 | 376 | 156 | |
| 669 | 444 | 225 | |
| | |||
| 968 | 564 | 404 | |
| 448 | 321 | 127 | |
| 570 | 398 | 172 | |
| 729 | 478 | 251 | |
* Atlas-SNP2 requires at least 3X coverage to call an SNV. For the other three algorithms, we chose the called SNVs with ≥ 3X coverage.
Figure 3. Comparison results for the trimmed data without any post-output filters. All SNVs require ≥ 3X coverage.
Number of SNVs called by SOAPsnp with different cutoffs of consensus score
| 41 | 10 | 31 | 968 | |
| 8 | 2 | 6 | 927 | |
| 11 | 5 | 6 | 919 | |
| 6 | 2 | 4 | 908 | |
| 25 | 8 | 17 | 902 | |
| 261 | 179 | 82 | 877 | |
| 125 | 101 | 24 | 616 | |
| 21 | 13 | 8 | 491 | |
| 58 | 36 | 22 | 470 | |
| 24 | 12 | 12 | 412 | |
| 13 | 4 | 9 | 388 |
* Number of SNVs with a consensus score ≥ the cutoff value.
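The cumulative counts in the footnoted column can be reproduced from the per-row counts. The sketch below is only an illustration and rests on an assumption, because the column headers were lost in this copy: it treats the first column as the number of SNVs at each consensus-score level and the last column as the number of SNVs at or above each cutoff. The function name `remaining_at_cutoffs` is hypothetical.

```python
def remaining_at_cutoffs(total, counts_per_row):
    """Return the number of SNVs remaining at each successive cutoff.

    Assumption (headers are missing in this copy): counts_per_row[i] is the
    number of SNVs that drop out when moving from cutoff i to cutoff i + 1,
    and `total` is the number of SNVs at or above the lowest cutoff.
    """
    remaining = [total]
    # The last row's count is not subtracted: no further cutoff follows it.
    for count in counts_per_row[:-1]:
        remaining.append(remaining[-1] - count)
    return remaining


# Values copied from the first and last columns of the SOAPsnp table above.
total_called = 968
counts_per_row = [41, 8, 11, 6, 25, 261, 125, 21, 58, 24, 13]

remaining = remaining_at_cutoffs(total_called, counts_per_row)
print(remaining)
# [968, 927, 919, 908, 902, 877, 616, 491, 470, 412, 388]
```

Under this reading the computed values match the last column of the table row for row, which supports the interpretation but does not confirm the original header names.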
Number of SNVs called by Atlas-SNP2 with different cutoffs of the posterior probability
| 448 | 321 | 127 | |
| 476 | 342 | 134 | |
| 539 | 393 | 146 |
Number of SNVs called by GATK-UGT with different cutoffs of genotype quality
| 729 | 478 | 251 | |
| 724 | 476 | 248 | |
| 723 | 476 | 247 | |
| 681 | 450 | 231 | |
| 681 | 450 | 231 | |
| 676 | 446 | 230 | |
| 476 | 217 | 259 |
Number of SNVs called by GATK-UGT with different cutoffs of HaplotypeScore
| 613 | 419 | 194 | |
| 638 | 431 | 207 | |
| 653 | 437 | 216 | |
| 680 | 448 | 232 | |
| 693 | 453 | 240 | |
| 703 | 459 | 244 | |
| 707 | 462 | 245 | |
| 718 | 468 | 250 | |
| 729 | 478 | 251 |
Number of SNVs called by SAMtools with different cutoffs of genotype quality
| 570 | 398 | 172 | |
| 567 | 397 | 170 | |
| 565 | 396 | 169 | |
| 564 | 395 | 169 | |
| 563 | 395 | 168 | |
| 559 | 393 | 166 | |
| 558 | 393 | 165 |
Number of SNVs called by each of the four algorithms with different coverage cutoffs
| 877 (537, 340) | 539 (393, 146) | 650 (427, 223) | 570 (398, 172) | |
| 397 (230, 167) | 291 (195, 96) | 309 (187, 122) | 270 (174, 96) | |
| 280 (162, 118) | 218 (138, 80) | 223 (127, 96) | 203 (121, 82) | |
| 222 (130, 92) | 187 (116, 71) | 186 (105, 81) | 167 (100, 67) | |
| 194 (115, 79) | 160 (99, 61) | 156 (93, 63) | 145 (87, 58) | |
| 168 (99, 69) | 145 (93, 52) | 134 (81, 53) | 127 (81, 46) | |
| 153 (88, 65) | 138 (87, 51) | 126 (75, 51) | 115 (73, 42) | |
| 137 (78, 59) | 126 (82, 44) | 111 (65, 46) | 100 (64, 36) |
Comparing four algorithms using different coverage cutoffs for dbSNPs and non-dbSNPs
| 592 | 108 (18.24%) | 82 (13.85%) | 125 (21.11%) | 277 (46.79%) | |
| 276 | 68 (24.64%) | 32 (11.59%) | 50 (18.12%) | 126 (45.65%) | |
| 201 | 61 (30.35%) | 20 (9.95%) | 33 (16.42%) | 87 (43.28%) | |
| 169 | 54 (31.95%) | 15 (8.88%) | 33 (19.53%) | 67 (39.64%) | |
| 153 | 53 (34.64%) | 15 (9.80%) | 29 (18.95%) | 56 (36.60%) | |
| 134 | 43 (32.09%) | 12 (8.96%) | 29 (21.64%) | 50 (37.31%) | |
| 123 | 38 (30.89%) | 15 (12.20%) | 25 (20.33%) | 45 (36.59%) | |
| 110 | 34 (30.91%) | 11 (10.00%) | 27 (24.55%) | 38 (34.55%) | |
| 402 | 151 (37.56%) | 99 (24.63%) | 76 (18.91%) | 76 (18.91%) | |
| 211 | 76 (36.02%) | 41 (19.43%) | 53 (25.12%) | 41 (19.43%) | |
| 161 | 57 (35.40%) | 30 (18.63%) | 37 (22.98%) | 37 (22.98%) | |
| 127 | 38 (29.92%) | 27 (21.26%) | 29 (22.83%) | 33 (25.98%) | |
| 106 | 33 (31.13%) | 21 (19.81%) | 22 (20.75%) | 30 (28.30%) | |
| 93 | 32 (34.41%) | 17 (18.28%) | 22 (23.66%) | 22 (23.66%) | |
| 87 | 28 (32.18%) | 16 (18.39%) | 23 (26.44%) | 20 (22.99%) | |
| 79 | 25 (31.65%) | 18 (22.78%) | 20 (25.32%) | 16 (20.25%) | |
“Total” means the total number of SNVs called by at least one of the four algorithms. “By 1” means the number (percentage) of SNVs called by only one of the four algorithms. “By 2” means the number (percentage) of SNVs called by exactly two algorithms. “By 3” means the number (percentage) of SNVs called by exactly three algorithms. “By 4” means the number (percentage) of SNVs called by all four algorithms.
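The agreement categories defined in this footnote can be tallied with a few set operations. The following is a minimal sketch, not the authors' pipeline: the function name `agreement_counts`, the `(chromosome, position, alternate allele)` key format, and the toy call sets are all hypothetical.

```python
from collections import Counter


def agreement_counts(calls):
    """Tally how many SNVs are called by exactly k of the given callers.

    calls: dict mapping a caller name to an iterable of hashable SNV keys,
    here (chromosome, position, alternate allele) tuples (an assumed format).
    Returns (total, {k: (count, percentage)}) for k = 1..number of callers.
    """
    support = Counter()
    for snvs in calls.values():
        support.update(set(snvs))  # set(): each caller counts once per SNV
    total = len(support)           # SNVs called by at least one caller
    by_k = Counter(support.values())
    table = {}
    for k in range(1, len(calls) + 1):
        n = by_k.get(k, 0)
        pct = 100.0 * n / total if total else 0.0
        table[k] = (n, round(pct, 2))
    return total, table


# Hypothetical toy call sets, for illustration only.
toy_calls = {
    "SOAPsnp": {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")},
    "Atlas-SNP2": {("chr1", 100, "A")},
    "SAMtools": {("chr1", 100, "A"), ("chr1", 200, "T")},
    "GATK-UGT": {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr3", 7, "C")},
}

total, table = agreement_counts(toy_calls)
print(total, table)
```

Because every SNV falls into exactly one "By k" category, the four counts always sum to the total, which is a convenient consistency check on each row of the table.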
Figure 4. The agreement of dbSNPs with different coverage cutoffs in each of the four algorithms.
Figure 5. The agreement of non-dbSNPs with different coverage cutoffs in each of the four algorithms.
Positive calling rate and sensitivity
| called as SNV | A | B | |
| called as Reference (i.e., not SNV) | C | -- | |
For a specific calling program (e.g., SOAPsnp), A is the number of SNVs that are an empirical truth (i.e., called by at least three calling programs) and are also called by this program; B is the number of SNVs called by this program that are not an empirical truth; C is the number of SNVs that are an empirical truth but are not called by this program. The positive calling rate is calculated as A/(A + B); sensitivity is calculated as A/(A + C).
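As an illustration of the two formulas, the sketch below follows the layout of the table above, where A and B sit in the "called as SNV" row and C sits in the "called as Reference" row. It is a sketch under stated assumptions rather than the authors' code: the function names, the SNV key format, and the toy call sets are hypothetical.

```python
from collections import Counter


def empirical_truth(calls, min_support=3):
    """SNVs called by at least `min_support` of the calling programs."""
    support = Counter()
    for snvs in calls.values():
        support.update(set(snvs))
    return {snv for snv, n in support.items() if n >= min_support}


def calling_metrics(called, truth):
    """Return (positive calling rate, sensitivity) for one calling program.

    A: called by this program and in the empirical truth
    B: called by this program but not in the empirical truth
    C: in the empirical truth but not called by this program
    """
    a = len(called & truth)
    b = len(called - truth)
    c = len(truth - called)
    positive_calling_rate = a / (a + b) if (a + b) else float("nan")
    sensitivity = a / (a + c) if (a + c) else float("nan")
    return positive_calling_rate, sensitivity


# Hypothetical toy call sets, for illustration only.
toy_calls = {
    "SOAPsnp": {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")},
    "Atlas-SNP2": {("chr1", 100, "A")},
    "SAMtools": {("chr1", 100, "A"), ("chr1", 200, "T")},
    "GATK-UGT": {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr3", 7, "C")},
}

truth = empirical_truth(toy_calls)
metrics = {name: calling_metrics(snvs, truth) for name, snvs in toy_calls.items()}
for name, (pcr, sens) in metrics.items():
    print(f"{name}: positive calling rate={pcr:.3f}, sensitivity={sens:.3f}")
```

In the toy data, a program that calls many sites (SOAPsnp) reaches full sensitivity at the cost of a lower positive calling rate, while a conservative program (Atlas-SNP2) shows the opposite trade-off.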
Positive calling rates of the four calling programs under different coverage cutoffs for dbSNPs and non-dbSNPs
| ≥ 3X | 0.734 | 0.888 | 0.902 | 0.892 |
| ≥ 4X | 0.735 | 0.867 | 0.882 | 0.868 |
| ≥ 5X | 0.704 | 0.841 | 0.874 | 0.876 |
| ≥ 6X | 0.723 | 0.819 | 0.867 | 0.870 |
| ≥ 7X | 0.696 | 0.808 | 0.817 | 0.862 |
| ≥ 8X | 0.747 | 0.796 | 0.864 | 0.852 |
| ≥ 9X | 0.727 | 0.782 | 0.813 | 0.849 |
| ≥ 10X | 0.769 | 0.780 | 0.862 | 0.828 |
| ≥ 3X | 0.438 | 0.863 | 0.628 | 0.628 |
| ≥ 4X | 0.545 | 0.896 | 0.713 | 0.615 |
| ≥ 5X | 0.602 | 0.875 | 0.708 | 0.610 |
| ≥ 6X | 0.641 | 0.873 | 0.691 | 0.627 |
| ≥ 7X | 0.620 | 0.852 | 0.730 | 0.672 |
| ≥ 8X | 0.594 | 0.827 | 0.736 | 0.674 |
| ≥ 9X | 0.615 | 0.824 | 0.745 | 0.690 |
| ≥ 10X | 0.559 | 0.818 | 0.674 | 0.667 |
Sensitivities of the four calling programs under different coverage cutoffs for dbSNPs and non-dbSNPs
| ≥ 3X | 0.980 | 0.868 | 0.958 | 0.883 |
| ≥ 4X | 0.960 | 0.960 | 0.938 | 0.858 |
| ≥ 5X | 0.950 | 0.967 | 0.925 | 0.883 |
| ≥ 6X | 0.940 | 0.950 | 0.910 | 0.870 |
| ≥ 7X | 0.941 | 0.941 | 0.894 | 0.882 |
| ≥ 8X | 0.937 | 0.937 | 0.886 | 0.873 |
| ≥ 9X | 0.914 | 0.971 | 0.871 | 0.886 |
| ≥ 10X | 0.923 | 0.985 | 0.862 | 0.815 |
| ≥ 3X | 0.912 | 0.546 | 0.837 | 0.570 |
| ≥ 4X | 0.968 | 0.915 | 0.926 | 0.628 |
| ≥ 5X | 0.959 | 0.946 | 0.919 | 0.676 |
| ≥ 6X | 0.952 | 1.000 | 0.903 | 0.677 |
| ≥ 7X | 0.942 | 1.000 | 0.885 | 0.750 |
| ≥ 8X | 0.932 | 0.977 | 0.886 | 0.705 |
| ≥ 9X | 0.930 | 0.977 | 0.884 | 0.674 |
| ≥ 10X | 0.917 | 1.000 | 0.861 | 0.667 |