Yu-Chin Hsu, Yu-Ting Hsiao, Tzu-Yuan Kao, Jan-Gowth Chang, Grace S Shieh.
Abstract
Because normal samples are often unavailable in clinical diagnosis, and to reduce costs, detection of small-scale mutations from tumor-only samples is required but remains relatively unexplored. We developed an algorithm (GATKcan) that augments GATK with two statistics and machine learning to detect mutations in cancer. Averaged over ten experiments, GATKcan outperformed GATK in detecting mutations in 231 tumors randomly sampled from 241 TCGA endometrial cancer (EC) tumors. In external validations, GATKcan outperformed GATK on TCGA breast cancer (BC), ovarian cancer (OC) and melanoma tumors in terms of Matthews correlation coefficient (MCC) and precision, where MCC takes both sensitivity and specificity into account. Further, GATKcan removed a high fraction of the false positives detected by GATK. In detecting mutations among somatic variants (the called variants in BC, OC and melanoma that VarScan 2 and MuTect both classified as somatic), GATKcan ranked first by adjusted MCC (and adjusted precision), followed by MuTect, VarScan 2 and GATK. Importantly, GATKcan enables detection of mutations when alternate alleles exist in normal samples. These results suggest that GATKcan trained on one cancer can detect mutations in future patients with the same type of cancer and is likely applicable to other cancers with similar mutations.
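The performance measures named in the abstract can be illustrated with a minimal sketch. It uses the textbook definitions of TPR, FPR, precision and MCC computed from confusion-matrix counts; the function name `confusion_metrics` and the example counts are hypothetical, and the paper's conditional FPR (cFPR) and adjusted measures are not defined in this record, so they are not reproduced here.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard definitions; NOT the paper's cFPR or adjusted measures."""
    tpr = tp / (tp + fn)            # sensitivity (true positive rate)
    fpr = fp / (fp + tn)            # 1 - specificity (false positive rate)
    precision = tp / (tp + fp)      # fraction of calls that are true mutations
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC uses all four cells, so it reflects sensitivity and specificity together.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"TPR": tpr, "FPR": fpr, "precision": precision, "MCC": mcc}


# Hypothetical counts, for illustration only.
m = confusion_metrics(tp=90, fp=10, tn=80, fn=20)
print({k: round(100 * v, 1) for k, v in m.items()})  # values in %
```

Because MCC penalizes false positives and false negatives jointly, a caller with high TPR but many false positives can still score a low MCC.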
Year: 2017 PMID: 29162841 PMCID: PMC5698426 DOI: 10.1038/s41598-017-14896-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Exome-seq datasets summary.
| | EC (Exome-seq) | OC (Exome-seq, WUGSC) | OC (Exome-seq, BI) |
|---|---|---|---|
| No. of samples | 248 | 79 | 136 |
| Sequencing technology | Illumina GAIIx or HiSeq 2000 | Illumina GAIIx or ABI 3730 | Illumina GAIIx |
| Coverage per sample | at least 20x | at least 20x | at least 20x |
| Read architecture | 100 bp paired end | 100 bp paired end | 76 bp paired end |
| Target area | whole exome | whole exome | whole exome |
| Data set source | TCGA Research Network | TCGA Research Network | TCGA Research Network |
| Aligner | BWA | BWA | Picard |

| | BC (Exome-seq) | Melanoma (Exome-seq) |
|---|---|---|
| No. of samples | 503 | 342 |
| Sequencing technology | Illumina HiSeq 2000 | Illumina HiSeq 2000 |
| Coverage per sample | ~20x | ~82x |
| Read architecture | 100 bp paired end | 76 bp paired end |
| Target area | whole exome | whole exome |
| Data set source | TCGA Research Network | TCGA Research Network |
| Aligner | BWA | BWA/Picard |
The averaged performances of GATK and GATKcan in detecting mutations from (A) ~61,507 variants of randomly sampled 231 endometrial tumors in ten repeats, and (B) ~52,291 variants of randomly sampled 197 endometrial tumors in ten repeats, checked against the 184,824 reported mutations in EC of TCGA.
| Panel | Set | Measure | GATK | | GATKcan |
|---|---|---|---|---|---|
| A | Training | TPR (s.e.) | — | * | 99.0 (1.0) |
| A | Training | cFPR (s.e.) | — | * | 11.9 (1.3) |
| A | Test | TPR (s.e.) | 88.2 (0.6) | * | 96.1 (1.9) |
| A | Test | cFPR (s.e.) | 65.2 (0.1) | * | 12.2 (1.6) |
| A | Test | precision (s.e.) | 12.6 (0.4) | * | 46.0 (3.2) |
| A | Test | MCC (s.e.) | 14.5 (0.5) | * | 61.8 (2.0) |
| B | Training | TPR (s.e.) | — | * | 98.8 (0.5) |
| B | Training | cFPR (s.e.) | — | * | 12.1 (1.6) |
| B | Test | TPR (s.e.) | 88.1 (1.3) | * | 96.1 (2.1) |
| B | Test | cFPR (s.e.) | 64.9 (0.4) | * | 12.3 (1.6) |
| B | Test | precision (s.e.) | 12.6 (0.9) | * | 45.6 (2.9) |
| B | Test | MCC (s.e.) | 14.6 (1.0) | * | 61.4 (1.8) |
§All performance measures and their s.e.'s are in %. *Denotes that the P value of the two-sample t-test is < 10⁻⁷.
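As a rough plausibility check on the starred comparisons, a two-sample t statistic can be recomputed from the reported means and s.e.'s. The sketch below rests on assumptions not stated in this record: that each reported s.e. is the standard error of the mean over the ten repeats, and that the unequal-variance (Welch) form of the test applies. The record only says "two sample t-test", and the helper name `welch_from_se` is hypothetical.

```python
import math


def welch_from_se(mean_a, se_a, mean_b, se_b, n_a, n_b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom.

    Assumes se_a and se_b are standard errors of the mean, so se**2
    already equals (sample variance / n).
    """
    var_a, var_b = se_a ** 2, se_b ** 2
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Test-set MCC from panel A: GATK 14.5 (0.5) vs GATKcan 61.8 (2.0), ten repeats each.
t, df = welch_from_se(14.5, 0.5, 61.8, 2.0, n_a=10, n_b=10)
print(f"t = {t:.2f}, df = {df:.2f}")
```

Under these assumptions the statistic is large relative to its degrees of freedom, which is in line with the very small P value the footnote reports; the exact P value depends on how the authors ran the test.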
Figure 1. The averaged performance of GATK and GATKcan listed by allelic fractions when applied to exome-seq of 231 randomly sampled endometrial tumors in the ten repeats, where Fig. 1(a) and (b) illustrate TPR and conditional FPR of the two algorithms, respectively.
The averaged performances of GATK, GATKcan (trained by ten & 44 EC tumors), VarScan 2 and MuTect in detecting mutations from ~2,102 (~1,741) somatic variants of 231 (197) randomly sampled endometrial tumors in ten repeats, checked against the TCGA reported mutations.
| Test set | Measure | GATK | GATKcan | | MuTect | VarScan 2 |
|---|---|---|---|---|---|---|
| 231 samples | TPR† (s.e.) | 98.8 (0.0) | 98.6 (0.6) | * | 94.3 (0.1) | 99.6 (0.0) |
| 231 samples | cFPR† (s.e.) | 81.0 (0.4) | 5.7 (0.4) | * | 37.9 (0.5) | 60.5 (0.7) |
| 231 samples | precision† (s.e.) | 54.7 (0.3) | 94.4 (0.3) | * | 71.1 (0.2) | 61.9 (0.2) |
| 231 samples | MCC† (s.e.) | 29.4 (0.4) | 92.9 (0.5) | * | 59.5 (0.4) | 48.9 (0.5) |
| 197 samples | TPR† (s.e.) | 98.8 (0.3) | 98.5 (1.0) | * | 94.1 (1.0) | 99.6 (0.1) |
| 197 samples | cFPR† (s.e.) | 80.2 (2.3) | 6.1 (0.9) | * | 36.8 (5.2) | 59.6 (4.5) |
| 197 samples | precision† (s.e.) | 55.0 (0.7) | 94.2 (0.6) | * | 71.9 (2.1) | 62.4 (0.9) |
| 197 samples | MCC† (s.e.) | 30.3 (1.5) | 92.5 (0.3) | * | 60.3 (3.5) | 49.5 (3.1) |
†Adjusted performance measures. All performance measures and their s.e.'s are in %. *The P value of the two-sample t-test between GATKcan and MuTect is < 10⁻⁸.
The performance of GATK and GATKcan in detecting mutations of 50,799 variants from exome-seq of 503 TCGA BC tumors.
| Measure | GATK | GATKcan trained by 10 tumors (s.e.) | GATKcan trained by 44 tumors (s.e.) | MuTect | VarScan 2 |
|---|---|---|---|---|---|
| TPR | 85.1 | 70.6 (0.3) | 70.7 (0.2) | — | — |
| cFPR | 55.1 | 3.2 (0.2) | 3.4 (0.2) | — | — |
| precision | 5.3 | 44.3 (1.2) | 43.4 (1.6) | — | — |
| MCC | 11.2 | 53.9 (0.7) | 53.4 (1.0) | — | — |
| TPR† | 99.3 | 99.5 (0.3) | 99.5 (0.3) | 99.3 | 98.6 |
| cFPR† | 68.5 | 3.2 (1.0) | 4.7 (0.3) | 27.0 | 10.9 |
| precision† | 40.7 | 93.8 (1.8) | 90.9 (0.5) | 63.5 | 81.0 |
| MCC† | 35.0 | 94.9 (1.3) | 92.7 (0.2) | 67.5 | 83.9 |
Next, the four algorithms identified mutations from 458 somatic variants, on which TPR†, cFPR†, precision† and MCC† were computed. All performance measures and their s.e.'s are in %.
The averaged performance of GATK and GATKcan in detecting mutations of 27,167 called variants (of 432 genes) from exome-seq of 215 TCGA OC tumors, checked against the reported mutations by TCGA.
| Measure | GATK | GATKcan trained by 10 tumors (s.e.) | GATKcan trained by 44 tumors (s.e.) | MuTect | VarScan 2 |
|---|---|---|---|---|---|
| TPR | 85.2 | 89.1 (1.1) | 89.7 (0.7) | — | — |
| cFPR | 70.4 | 5.0 (0.2) | 5.1 (0.3) | — | — |
| precision | 2.9 | 30.5 (1.1) | 30.1 (1.4) | — | — |
| MCC | 5.0 | 50.4 (1.2) | 50.2 (1.4) | — | — |
| TPR† | 98.2 | 98.2 (0.0) | 98.2 (0.0) | 100.0 | 100.0 |
| cFPR† | 86.9 | 4.1 (1.2) | 3.9 (0.8) | 34.5 | 23.8 |
| precision† | 34.2 | 91.7 (2.1) | 92.0 (1.4) | 62.5 | 65.9 |
| MCC† | 17.9 | 92.5 (1.7) | 92.7 (1.1) | 64.0 | 70.9 |
Next, the four algorithms identified mutations from 178 somatic variants, on which TPR†, cFPR†, precision† and MCC† were computed. All performance measures and their s.e.'s are in %.
The averaged performance of GATK and GATKcan in detecting mutations of 33,053 variants (of 498 genes) from exome-seq of 342 TCGA melanoma tumors.
| Measure | GATK | GATKcan trained by 10 tumors (s.e.) | GATKcan trained by 44 tumors (s.e.) | MuTect | VarScan 2 |
|---|---|---|---|---|---|
| TPR | 89.9 | 98.7 (0.4) | 98.9 (0.3) | — | — |
| cFPR | 64.1 | 4.4 (0.3) | 4.6 (0.2) | — | — |
| precision | 27.6 | 85.9 (0.9) | 85.3 (0.6) | — | — |
| MCC | 22.9 | 89.8 (0.7) | 89.5 (0.5) | — | — |
| TPR† | 98.5 | 99.2 (0.3) | 99.2 (0.3) | 99.4 | 99.3 |
| cFPR† | 76.6 | 8.8 (1.2) | 8.7 (1.1) | 57.0 | 47.9 |
| precision† | 80.3 | 97.3 (0.3) | 97.3 (0.3) | 84.7 | 86.8 |
| MCC† | 37.3 | 92.4 (0.3) | 92.5 (0.2) | 58.4 | 65.3 |
Next, the four algorithms identified mutations from 1,784 somatic variants, on which TPR†, cFPR†, precision† and MCC† were computed. All performance measures and their s.e.'s are in %.
(A) The six thresholds of GATKcan trained by randomly sampled 10 TCGA EC tumors and performances of GATKcan in the ten training experiments. (B) The four thresholds of GATKcan trained by ~10% of 539 reported indels and 112 artifacts (from 241 TCGA EC tumors), and performances of GATKcan in the ten training experiments.
(A)

| Repeat | α | dNM | FS | MQ | MQRankSum | QD | Mann-Whitney test (P value) | Training TPR | Training cFPR |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 991.1 | 46.1 | 50.0 | −7.85 | 0.11 | 0.010 | 98.6 | 10.7 |
| 2 | 0.5 | 988.1 | 43.6 | 49.5 | −7.90 | 0.12 | 0.010 | 98.1 | 10.7 |
| 3 | 0.5 | 990.5 | 45.8 | 39.2 | −10.10 | 0.05 | 0.010 | 99.3 | 14.3 |
| 4 | 0.3 | 985.6 | 49.3 | 48.2 | −10.14 | 0.08 | 0.010 | 98.6 | 12.2 |
| 5 | 0.3 | 984.3 | 51.8 | 48.0 | −10.37 | 0.09 | 0.009 | 98.9 | 11.6 |
| 6 | 0.4 | 981.5 | 50.9 | 50.0 | −9.59 | 0.09 | 0.010 | 98.7 | 10.8 |
| 7 | 0.4 | 989.3 | 45.7 | 39.9 | −8.97 | 0.05 | 0.082 | 99.3 | 14.0 |
| 8 | 0.3 | 982.3 | 55.2 | 50.0 | −8.36 | 0.18 | 0.010 | 99.6 | 11.9 |
| 9 | 0.5 | 987.9 | 47.0 | 49.9 | −10.22 | 0.11 | 0.010 | 98.2 | 10.5 |
| 10 | 0.5 | 989.1 | 45.6 | 39.9 | −9.58 | 0.04 | 0.010 | 99.2 | 14.3 |

(B)

| Repeat | | | | | | Training TPR | Training cFPR |
|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 753.4 | 86.3 | 0.3 | 0.055 | 100.0 | 0.0 |
| 2 | 0.5 | 405.0 | 133.3 | 0.4 | 0.090 | 100.0 | 0.0 |
| 3 | 0.3 | 581.6 | 91.6 | 0.3 | 0.055 | 100.0 | 0.0 |
| 4 | 0.5 | 395.2 | 136.6 | 0.8 | 0.064 | 100.0 | 0.0 |
| 5 | 0.5 | 658.8 | 102.7 | 0.4 | 0.013 | 100.0 | 0.0 |
| 6 | 0.6 | 464.0 | 121.0 | 0.2 | 0.037 | 100.0 | 0.0 |
| 7 | 0.3 | 397.6 | 95.1 | 0.2 | 0.027 | 100.0 | 0.0 |
| 8 | 0.3 | 619.2 | 99.1 | 0.4 | 0.046 | 100.0 | 0.0 |
| 9 | 0.5 | 566.9 | 135.7 | 0.5 | 0.100 | 100.0 | 0.0 |
| 10 | 0.5 | 633.6 | 139.9 | 0.8 | 0.003 | 100.0 | 0.0 |
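To make concrete how learned thresholds of this kind might be applied, here is a minimal sketch that filters a variant on the four standard GATK annotations from panel (A): FS, MQ, MQRankSum and QD. Everything about the decision rule is an assumption: this record does not state how GATKcan combines its thresholds, so the directions below simply follow the usual GATK hard-filtering convention (high FS is bad; low MQ, MQRankSum and QD are bad). The α, dNM and Mann-Whitney columns are omitted because their definitions are not given here, and the names `Thresholds`, `REPEAT_1` and `passes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    fs: float           # FisherStrand: assumed upper bound
    mq: float           # RMS mapping quality: assumed lower bound
    mq_rank_sum: float  # MQRankSum: assumed lower bound
    qd: float           # QualByDepth: assumed lower bound


# Values copied from repeat 1 of panel (A).
REPEAT_1 = Thresholds(fs=46.1, mq=50.0, mq_rank_sum=-7.85, qd=0.11)


def passes(variant: dict, t: Thresholds) -> bool:
    """Return True if the variant clears every threshold (assumed AND rule)."""
    return (
        variant["FS"] <= t.fs
        and variant["MQ"] >= t.mq
        and variant["MQRankSum"] >= t.mq_rank_sum
        and variant["QD"] >= t.qd
    )


# Hypothetical annotation values, for illustration only.
good = {"FS": 10.0, "MQ": 60.0, "MQRankSum": 0.0, "QD": 5.0}
strand_biased = {**good, "FS": 80.0}
print(passes(good, REPEAT_1), passes(strand_biased, REPEAT_1))
```

The small spread of each threshold across the ten repeats in panel (A) suggests that a rule of this shape would behave similarly regardless of which training sample produced it.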