Jing Chen, Jun-Tao Guo.
Abstract
BACKGROUND: Insertions and deletions (indels) are one of the major variation types in human genomes. Accurate annotation of indels is essential for genetic variation analysis and for investigating their roles in human diseases. Previous studies revealed a high number of false positives from existing indel calling methods, which limits downstream analyses of the effects of indels in both healthy and disease genomes. In this study, we evaluated seven commonly used general indel calling programs for germline indels and four somatic indel calling programs through comparative analysis, to investigate their common features and differences and to explore ways to improve indel annotation accuracy.
Keywords: Cancer; Deletion; Germline variants; Indel; Insertion; Somatic variants
Year: 2020 PMID: 33167946 PMCID: PMC7653722 DOI: 10.1186/s12920-020-00818-6
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
A list of indel calling programs
| Programs | General/somatic | Type of variants | Core algorithms | Notes and references |
|---|---|---|---|---|
| Dindel | General | Indel | Alignment-based | Bayesian approach [ |
| GATK_HC | General | SNP + Indel | Haplotype-based | Collection of candidate haplotypes [ |
| GATK_UG | General | SNP + Indel | Alignment-based | Bayesian genotype likelihood model [ |
| Pindel | General | Indel | Split read mapping | A pattern growth approach [ |
| Platypus | General | SNP + Indel | Haplotype-based | Collection of candidate haplotypes [ |
| SAMTools | General | SNP + Indel | Alignment-based | Bayesian model [ |
| Varscan | General | SNP + Indel | Alignment-based | Heuristic method [ |
| GATK Mutect2 | Somatic | SNP + Indel | Allele frequency | Re-assembly of haplotypes methods [ |
| Strelka | Somatic | SNP + Indel | Allele frequency | Bayesian approach [ |
| Strelka2 | Somatic | SNP + Indel | Allele frequency | A mixture model [ |
| Varscan2 | Somatic | SNP + Indel | Heuristic methods | Heuristic and statistical methods [ |
Fig. 1 Comparison of different methods regarding false negative indels. A schematic comparison between the single-sample based method (a) and the pooled-sample based method (b) with a pooled reference benchmark. FN: false negative
Performance of different general indel annotation programs
| Tool | TP indels | FP indels | Recall | Precision | F |
|---|---|---|---|---|---|
| Varscan | 533,101 | | 0.42 | | |
| GATK_UG | 884,763 | 1,802,477 | 0.69 | 0.33 | 0.45 |
| GATK_HC | 948,738 | 2,026,903 | 0.74 | 0.32 | 0.45 |
| Pindel | 446,622 | 619,846 | 0.35 | 0.42 | 0.38 |
| Dindel | | 3,097,117 | | 0.24 | 0.37 |
| Platypus | 941,046 | 3,403,565 | 0.74 | 0.22 | 0.33 |
| SAMTools | 930,860 | 15,083,658 | 0.73 | 0.06 | 0.11 |
Bold indicates the highest value in each column
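The Recall, Precision and F columns can be recomputed from the TP and FP counts once the benchmark size is known. The following is a minimal sketch, assuming that F is the harmonic mean of precision and recall (F1) and that the benchmark holds roughly 1.28 million indels. That size is back-calculated here from TP / Recall in the table and is not stated in the source; the function name `indel_metrics` is hypothetical.

```python
def indel_metrics(tp: int, fp: int, benchmark_size: int) -> dict:
    """Return recall, precision and F1 for one indel call set."""
    recall = tp / benchmark_size        # TP / (TP + FN)
    precision = tp / (tp + fp)          # TP / (TP + FP)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f": f1}


# Assumed benchmark size, inferred from the GATK_UG row (884,763 / 0.69).
BENCHMARK_SIZE = 1_282_000

gatk_ug = indel_metrics(tp=884_763, fp=1_802_477, benchmark_size=BENCHMARK_SIZE)
pindel = indel_metrics(tp=446_622, fp=619_846, benchmark_size=BENCHMARK_SIZE)

for name, m in (("GATK_UG", gatk_ug), ("Pindel", pindel)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Under these assumptions the rounded values agree with the GATK_UG and Pindel rows of the table.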
Fig. 2 Comparisons of indels from seven general indel calling programs. a Indel size distribution. b Indel type distribution. c Coding indel type distribution. FS: frame shift; NFS: non-frame shift
Pair-wise comparison between general indel calling programs
| Recall | Varscan | GATK_UG | GATK_HC | Pindel | Dindel | Platypus | SAMTools |
|---|---|---|---|---|---|---|---|
| Varscan | – | 0.40 | 0.41 | 0.30 | 0.41 | 0.39 | 0.36 |
| GATK_UG | 0.57 | – | 0.65 | 0.33 | 0.66 | 0.64 | 0.60 |
| GATK_HC | 0.59 | 0.43 | – | 0.34 | 0.72 | 0.68 | 0.65 |
| Pindel | 0.62 | 0.59 | 0.57 | – | 0.34 | 0.33 | 0.31 |
| Dindel | 0.58 | 0.41 | 0.38 | 0.56 | – | 0.70 | 0.68 |
| Platypus | 0.57 | 0.34 | 0.38 | 0.59 | 0.35 | – | 0.66 |
| SAMTools | 0.55 | 0.40 | 0.40 | 0.60 | 0.27 | 0.29 | – |
Fig. 3 Comparison of different methods for common indels from different programs. A schematic comparison between the single-sample based method (a) and the pooled-sample based method (b) with a pooled reference benchmark. Green represents true positives. Red represents false positive predictions. Blue blocks are the benchmark indels
Performance comparison of different program combinations (showing average values)
| # of Tools | TP indels | FP indels | Recall | Precision | F |
|---|---|---|---|---|---|
| 1 | 811,440 | 3,779,758 | 0.64 | 0.31 | 0.37 |
| 2 | 639,772 | 899,660 | 0.51 | 0.48 | 0.45 |
| 3 | 528,467 | 496,588 | 0.41 | 0.56 | 0.45 |
| 4 | 450,280 | 322,289 | 0.37 | 0.60 | 0.44 |
| 5 | 394,064 | 230,561 | 0.31 | 0.64 | 0.41 |
| 6 | 354,111 | 179,699 | 0.28 | 0.67 | 0.38 |
| 7 | 326,184 | 150,069 | 0.26 | 0.68 | 0.37 |
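The trend in this table (recall falls and precision rises as more programs must agree) can be illustrated with set intersections. The following is a minimal sketch on made-up toy data, assuming that an indel counts as shared only when chromosome, position, reference allele and alternate allele match exactly. The source does not state its matching rule, and the names `consensus_metrics` and `average_by_k` are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Toy call sets keyed by (chrom, pos, ref, alt); every value here is invented.
calls = {
    "tool_a": {("chr1", 100, "A", "AT"), ("chr1", 250, "GC", "G"), ("chr2", 40, "T", "TA")},
    "tool_b": {("chr1", 100, "A", "AT"), ("chr2", 40, "T", "TA"), ("chr2", 90, "CA", "C")},
    "tool_c": {("chr1", 100, "A", "AT"), ("chr1", 250, "GC", "G"), ("chr3", 7, "G", "GT")},
}
benchmark = {("chr1", 100, "A", "AT"), ("chr1", 250, "GC", "G"), ("chr2", 55, "C", "CT")}


def consensus_metrics(tools, calls, benchmark):
    """Recall and precision of the indels called by every tool in `tools`."""
    shared = set.intersection(*(calls[t] for t in tools))
    tp = len(shared & benchmark)
    fp = len(shared - benchmark)
    recall = tp / len(benchmark)
    precision = tp / (tp + fp) if shared else 0.0
    return recall, precision


def average_by_k(calls, benchmark, k):
    """Average recall and precision over all combinations of k tools."""
    results = [consensus_metrics(combo, calls, benchmark)
               for combo in combinations(sorted(calls), k)]
    return mean(r for r, _ in results), mean(p for _, p in results)


for k in range(1, len(calls) + 1):
    recall, precision = average_by_k(calls, benchmark, k)
    print(k, round(recall, 2), round(precision, 2))
```

With seven programs the same loop would average over 7, 21, 35, 35, 21, 7 and 1 combinations for k = 1 to 7.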
Top 3 indel annotation program combinations (2 programs and 3 programs)
| F rank | Combination of 2 tools | TP | FP | Recall | Precision | F |
|---|---|---|---|---|---|---|
| 1 | GATK_UG + GATK_HC | 822,516 | 1,107,610 | 0.65 | 0.43 | 0.51 |
| 2 | GATK_UG + Dindel | 839,132 | 1,226,226 | 0.66 | 0.41 | 0.50 |
| 3 | GATK_HC + Platypus | 871,596 | 1,403,334 | 0.68 | 0.38 | 0.49 |
Fig. 4 Overlapped indels by GATK_UG, GATK_HC and Dindel. a All indels; b coding indels only
Performance comparison of different somatic indel annotation programs
| Tools | Total indels | Cancer type | COSMIC indels | Potential germline indels and rate |
|---|---|---|---|---|
| Strelka | 2,186 | Bladder | 5 | 884 (0.40) |
| Strelka | 5,521 | Breast | 5 | 2,536 (0.46) |
| Strelka | 14,174 | Colon | 11 | 5,227 (0.37) |
| Strelka2 | 867 | Bladder | 0 | 225 (0.26) |
| Strelka2 | 2,162 | Breast | 0 | 768 (0.36) |
| Strelka2 | 9,920 | Colon | 2 | 3,583 (0.36) |
| Varscan2 | 1,804 | Bladder | 2 | 438 (0.24) |
| Varscan2 | 3,796 | Breast | 4 | 879 (0.23) |
| Varscan2 | 6,286 | Colon | 8 | 831 (0.13) |
| Mutect2 | 19,124 | Bladder | 10 | 761 (0.04) |
| Mutect2 | 44,373 | Breast | 16 | 1,708 (0.04) |
| Mutect2 | 30,503 | Colon | 31 | 4,971 (0.16) |
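The rate in the last column appears to be the number of potential germline indels divided by the total number of called indels, a reading that is consistent with the tabulated values. The following is a minimal sketch that recomputes it for the bladder rows; the helper name `germline_rate` is hypothetical.

```python
# (tool, cancer type) -> (total indels called, potential germline indels),
# copied from the bladder rows of the table.
counts = {
    ("Strelka", "Bladder"): (2186, 884),
    ("Strelka2", "Bladder"): (867, 225),
    ("Varscan2", "Bladder"): (1804, 438),
    ("Mutect2", "Bladder"): (19124, 761),
}


def germline_rate(total: int, germline: int) -> float:
    """Fraction of called somatic indels flagged as potential germline."""
    return germline / total


rates = {key: round(germline_rate(*pair), 2) for key, pair in counts.items()}

for (tool, cancer), rate in rates.items():
    print(f"{tool:9s} {cancer:8s} {rate:.2f}")
```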
Fig. 5 Somatic indel size distribution. a Program based; and b cancer type based
Fig. 6 Overlapped indel annotations of different cancer types. a Bladder cancer; b breast cancer; and c colon cancer
Performance of different numbers of combined somatic indel calling programs (values shown are averages)
| Cancer types | # of Tools | Total indels | COSMIC indels | Potential germline indels and rate |
|---|---|---|---|---|
| Bladder | 1 | 5,995 | 4 | 577 (0.24) |
| Bladder | 2 | 285 | 1 | 64 (0.22) |
| Bladder | 3 | 92 | 1 | 20 (0.24) |
| Bladder | 4 | 22 | 0 | 6 (0.27) |
| Breast | 1 | 13,963 | 6 | 1,463 (0.27) |
| Breast | 2 | 616 | 1 | 185 (0.25) |
| Breast | 3 | 181 | 1 | 47 (0.19) |
| Breast | 4 | 36 | 0 | 5 (0.14) |
| Colon | 1 | 15,221 | 13 | 6,666 (0.26) |
| Colon | 2 | 3,142 | 3 | 948 (0.23) |
| Colon | 3 | 1,051 | 2 | 300 (0.18) |
| Colon | 4 | 161 | 1 | 14 (0.09) |