Ítalo Faria do Valle1,2, Enrico Giampieri1, Giorgia Simonetti3, Antonella Padella3, Marco Manfrini3, Anna Ferrari3, Cristina Papayannidis3, Isabella Zironi1, Marianna Garonzi4, Simona Bernardi5, Massimo Delledonne4,6, Giovanni Martinelli3, Daniel Remondini7, Gastone Castellani1.
Abstract
BACKGROUND: Detecting somatic mutations in whole exome sequencing data of cancer samples has become a popular approach for profiling cancer development, progression and chemotherapy resistance. Several studies have proposed software packages, filters and parametrizations. However, many research groups have reported low concordance among different methods. We aimed to develop a pipeline that detects a wide range of single nucleotide mutations with high validation rates. We combined two standard tools, Genome Analysis Toolkit (GATK) and MuTect, to create the GATK-LODN method. As proof of principle, we applied our pipeline to exome sequencing data of hematological (Acute Myeloid and Acute Lymphoblastic Leukemias) and solid (Gastrointestinal Stromal Tumor and Lung Adenocarcinoma) tumors. We performed experiments on simulated data to test the sensitivity and specificity of our pipeline.
Keywords: Cancer; Somatic single nucleotide variants; Whole exome sequencing
Year: 2016 PMID: 28185561 PMCID: PMC5123378 DOI: 10.1186/s12859-016-1190-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Pipeline of SNV detection in sequencing data of cancer samples. Summary of steps and their respective tools in the detection of SNVs in paired normal-cancer sequencing data
Artificial tumor samples. Coordinate list of the single nucleotide variants inserted in the artificial tumor samples and their variant allelic frequencies
| Chromosome | Position | REF > ALT | Artificial tumor variant allelic frequency (0.02 – 0.26) | Artificial tumor variant allelic frequency (0.5 – 0.86) | Artificial tumor variant allelic frequency (0.97 – 1) | Normal variant allelic frequency |
|---|---|---|---|---|---|---|
| 11 | 19854088 | G > A | 0.03 | 0.69 | 1.00 | 0 |
| 11 | 36484167 | C > T | 0.08 | 0.62 | 1.00 | 0.027 |
| 11 | 4608116 | T > C | 0.13 | 0.71 | 1.00 | 0.020 |
| 11 | 4661826 | T > C | 0.11 | 0.60 | 0.97 | 0.028 |
| 11 | 4673788 | G > A | 0.26 | 0.64 | 1.00 | 0.021 |
| 11 | 4928841 | T > C | 0.13 | 0.61 | 1.00 | 0 |
| 11 | 5372856 | A > G | 0.24 | 0.69 | 1.00 | 0.023 |
| 11 | 5373562 | C > A | 0.09 | 0.68 | 1.00 | 0.029 |
| 11 | 5443887 | T > C | 0.10 | 0.86 | 1.00 | 0 |
| 11 | 5443893 | G > A | 0.10 | 0.86 | 1.00 | 0 |
| 11 | 5462255 | C > G | 0.16 | 0.56 | 1.00 | 0 |
| 11 | 5906203 | T > G | 0.19 | 0.70 | 1.00 | 0 |
| 11 | 6519642 | G > A | 0.08 | 0.61 | 1.00 | 0 |
| 11 | 824789 | T > C | 0.11 | 0.63 | 1.00 | 0.026 |
| 12 | 25398281 | C > T | 0.12 | 0.63 | 1.00 | 0 |
| 12 | 75715330 | C > A | 0.13 | 0.60 | 1.00 | 0 |
| 22 | 24891418 | A > C | 0.21 | 0.70 | 1.00 | 0.030 |
| 22 | 44083442 | T > C | NA | 0.78 | 1.00 | 0 |
| 13 | 101289801 | C > A | 0.13 | 0.65 | 1.00 | 0 |
| 20 | 61537337 | G > T | 0.13 | 0.65 | 1.00 | 0 |
| 17 | 48557299 | G > T | 0.11 | 0.74 | 1.00 | 0 |
| 5 | 45262378 | G > T | 0.08 | 0.50 | 1.00 | 0 |
| 1 | 94476902 | T > C | 0.15 | 0.65 | 1.00 | 0 |
| 2 | 110372199 | G > T | NA | 0.57 | 1.00 | 0 |
| 5 | 64907465 | C > A | 0.10 | 0.57 | 1.00 | 0 |
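The allelic frequencies in the table above are fractions of reads supporting the alternate allele. As a minimal sketch, assuming read counts are available per site, the frequency and its bin can be computed as follows. The function names and the read counts in the usage lines are hypothetical; only the three ranges are taken from the table header.

```python
from typing import Optional

# Frequency ranges copied from the table header (inclusive bounds).
VAF_BINS = {
    "low": (0.02, 0.26),
    "intermediate": (0.5, 0.86),
    "high": (0.97, 1.0),
}

def variant_allelic_frequency(alt_reads: int, depth: int) -> float:
    """Fraction of reads at a site that support the alternate allele."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    return alt_reads / depth

def vaf_bin(vaf: float) -> Optional[str]:
    """Return the table's bin name for a frequency, or None if it falls in a gap."""
    for name, (low, high) in VAF_BINS.items():
        if low <= vaf <= high:
            return name
    return None

# Hypothetical read counts, for illustration only.
print(vaf_bin(variant_allelic_frequency(3, 100)))
print(vaf_bin(variant_allelic_frequency(69, 100)))
```

Frequencies that fall between the ranges (for example 0.4) return `None` rather than being forced into a bin.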
Fig. 2 The GATK-LODN method reduces the number of GATK false positive calls. Comparison of the number of SNVs between GATK and MuTect before (a) and after (b) applying the GATK-LODN method for each cancer whole exome sequencing dataset. AML: Acute Myeloid Leukemia, ALL: Acute Lymphoblastic Leukemia, GIST: Gastrointestinal Stromal Tumor, LA: Lung Adenocarcinoma
Relaxing MuTect parameters increases the number of false positive calls. Number of variants found by MuTect, before and after relaxing the ΘT and ΘN parameters for six Acute Myeloid Leukemia (AML) normal-cancer sample pairs
| Patients | MuTect | MuTect Adapted (a) |
|---|---|---|
| a1024 | 11 | 39 |
| a1025 | 31 | 41 |
| b1014 | 22 | 54 |
| b2002 | 10 | 25 |
| b2035 | 43 | 419 |
| b2042 | 58 | 338 |
(a) ΘT and ΘN, computed as in the MuTect algorithm, were applied downstream of the GATK analysis with lowered threshold values (4.5 and 3, respectively)
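The thresholding described in the footnote can be pictured as a simple filter on two scores per candidate. The sketch below is not the authors' implementation: it assumes the tumor and normal LOD scores have already been computed elsewhere, and that a candidate is kept when each score reaches its threshold. The `Candidate` class, its field names, and the example scores are hypothetical; only the threshold values 4.5 and 3 come from the footnote.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Candidate:
    chrom: str
    pos: int
    lod_t: float  # tumor LOD score, assumed precomputed (hypothetical field)
    lod_n: float  # normal LOD score, assumed precomputed (hypothetical field)

def passes_lod(c: Candidate, theta_t: float = 4.5, theta_n: float = 3.0) -> bool:
    """Keep a candidate only if both scores reach their thresholds.

    The direction of both comparisons is an assumption of this sketch.
    """
    return c.lod_t >= theta_t and c.lod_n >= theta_n

def filter_candidates(cands: List[Candidate], theta_t: float = 4.5,
                      theta_n: float = 3.0) -> List[Candidate]:
    return [c for c in cands if passes_lod(c, theta_t, theta_n)]

# Made-up candidates for illustration; not data from the study.
candidates = [
    Candidate("11", 100, lod_t=6.0, lod_n=3.5),  # passes both thresholds
    Candidate("11", 200, lod_t=4.6, lod_n=2.0),  # fails the normal threshold
    Candidate("11", 300, lod_t=3.0, lod_n=5.0),  # fails the tumor threshold
]
kept = filter_candidates(candidates)
print([c.pos for c in kept])
```

Lowering either threshold can only enlarge the kept set, which is consistent with the larger counts in the adapted column of the table.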
The GATK-LODN method increases the GATK performance for both mutation detection and classification. Sanger sequencing validation was performed in two rounds: the first tested whether the methods correctly detected the mutation, and the second assessed whether the methods correctly classified the mutations as somatic events. The variant subsets tested (AML dataset) included method-specific variants and variants detected by one or more methods
| Method | Mutation detection (a): tested | Mutation detection (a): validated | Mutation classification (b): tested | Mutation classification (b): validated |
|---|---|---|---|---|
| GATK-LODN - specific | 4 | 1 | 2 | 2 |
| GATK-LODN (All variants) | 9 | 6 | 4 | 3 |
| GATK (without LODN) - specific | 37 | 11 | 9 | 2 |
| GATK (without LODN) (All Variants) | 48 | 18 | 14 | 5 |
| MuTect - specific | 22 | 21 | 8 | 8 |
| MuTect (All Variants) | 29 | 27 | 11 | 10 |
| MuTect & GATK | 7 | 6 | 3 | 2 |
(a) Variants tested for correct mutation detection
(b) Variants tested for correct classification as somatic events
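The row categories in this table (method-specific, all variants, shared between methods) correspond to set operations on the call sets. A minimal sketch, assuming each variant is keyed by chromosome, position, reference and alternate allele; the function name and the toy variants are hypothetical and are not calls from the study.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, REF, ALT)

def partition_calls(calls_a: Set[Variant], calls_b: Set[Variant]) -> Dict[str, Set[Variant]]:
    """Split two call sets into method-specific and shared variants."""
    return {
        "a_specific": calls_a - calls_b,  # called only by method A
        "b_specific": calls_b - calls_a,  # called only by method B
        "shared": calls_a & calls_b,      # called by both methods
    }

# Toy call sets for illustration only.
method_a = {("11", 100, "G", "A"), ("12", 200, "C", "T")}
method_b = {("12", 200, "C", "T"), ("22", 300, "A", "C")}

parts = partition_calls(method_a, method_b)
for name, variants in parts.items():
    print(name, sorted(variants))
```

Keying on all four fields means two calls at the same position with different alternate alleles are treated as different variants.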
The GATK-LODN method presented good performance in artificial tumor samples. Performance of MuTect and GATK-LODN for artificial tumor samples that had variants with diverse allelic frequencies
| Method | Metric | Low Frequency Variants | Intermediate Frequency Variants | High Frequency Variants |
|---|---|---|---|---|
| MuTect | Somatic Candidates | 22 | 25 | 25 |
| MuTect | TP | 19 | 25 | 25 |
| MuTect | FN | 0 | 0 | 0 |
| MuTect | FP | 3 | 0 | 0 |
| MuTect | PPV | 19/22 | 25/25 | 25/25 |
| MuTect | FDR | 3/22 | 0/25 | 0/25 |
| GATK-LODN | Somatic Candidates | 27 | 32 | 33 |
| GATK-LODN | TP | 17 | 23 | 23 |
| GATK-LODN | FN | 5 | 5 | 2 |
| GATK-LODN | FP | 5 | 7 | 8 |
| GATK-LODN | PPV | 17/22 | 23/30 | 23/31 |
| GATK-LODN | FDR | 5/22 | 7/30 | 8/31 |
TP True positives, FN False negatives, FP False positives, PPV Positive Predictive Value (#TP / (#FP + #TP)), FDR False Discovery Rate (#FP / (#FP + #TP)), VAF Variant Allelic Frequency
GATK results are not reported in the table because GATK alone detected more than 2200 candidates against only 22 or 25 true variants
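As a worked check of the two ratios defined above, the following sketch computes PPV and FDR from TP and FP counts using exact fractions, so the results can be compared directly with the table entries. The function names are illustrative; the counts in the usage lines are the MuTect low-frequency and GATK-LODN intermediate-frequency values from the table.

```python
from fractions import Fraction

def ppv(tp: int, fp: int) -> Fraction:
    """Positive predictive value: TP / (FP + TP)."""
    if tp + fp == 0:
        raise ValueError("PPV is undefined when there are no positive calls")
    return Fraction(tp, tp + fp)

def fdr(tp: int, fp: int) -> Fraction:
    """False discovery rate: FP / (FP + TP), the complement of PPV."""
    if tp + fp == 0:
        raise ValueError("FDR is undefined when there are no positive calls")
    return Fraction(fp, tp + fp)

# MuTect, low frequency variants: TP = 19, FP = 3
print(ppv(19, 3), fdr(19, 3))
# GATK-LODN, intermediate frequency variants: TP = 23, FP = 7
print(ppv(23, 7), fdr(23, 7))
```

Because both ratios share the denominator FP + TP, they always sum to 1.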
Fig. 3 Number of False Negatives and True positives at different coverage levels. Three artificial tumors were created with 22, 25 and 25 SNVs, with variant allelic fractions ranging from 0.02 to 0.25, 0.5 to 0.86, and 0.97 to 1.0, respectively. We counted the number of False Negatives (FN) and True positives (TP) for different levels of simulated sequencing coverage