David L Goode, Sally M Hunter, Maria A Doyle, Tao Ma, Simone M Rowley, David Choong, Georgina L Ryland, Ian G Campbell.
Abstract
Differentiating true somatic mutations from artifacts in massively parallel sequencing data is an immense challenge. To develop methods for optimal somatic mutation detection and to identify factors influencing somatic mutation prediction accuracy, we validated predictions from three somatic mutation detection algorithms, MuTect, JointSNVMix2 and SomaticSniper, by Sanger sequencing. Full consensus predictions had a validation rate of >98%, but some partial consensus predictions validated too. In cases of partial consensus, read depth and mapping quality data, along with additional prediction methods, aided in removing inaccurate predictions. Our consensus approach is fast, flexible and provides a high-confidence list of putative somatic mutations.
Year: 2013 PMID: 24073752 PMCID: PMC3978449 DOI: 10.1186/gm494
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Figure 1. Frequency, concordance and read depths of somatic variant predictions in 27 ovarian tumors. (A) The total number of somatic SNVs predicted in each sample by the three algorithms. Bars are colored by tumor grade and histological subtype. (B) Concordance between somatic variant predictions from different algorithms. The minimum, maximum, and median numbers of predictions per 'call set' per sample are shown. The filtering criterion used for each program is indicated. (C) Non-reference allele frequency, measured as the fraction of reads in the germline samples carrying the non-reference allele, at sites predicted to be somatic variants, within each call set. Each box covers the interquartile range, with a horizontal line representing the median. Whiskers indicate the minimum and maximum values. (D) Fraction of reads carrying the non-reference allele in the tumor samples at sites predicted to be somatic variants, within each call set. Call sets were defined by the programs that made a prediction above threshold: M, MuTect; J, JointSNVMix2; S, SomaticSniper.
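The call-set labels used in panels B to D follow mechanically from which programs reported a site above threshold. A minimal Python sketch of that bookkeeping, assuming each program's filtered predictions are available as a set of site keys; the function name `assign_call_sets` and the `(chromosome, position, allele)` key layout are illustrative and not taken from the paper:

```python
from collections import defaultdict

# One-letter codes follow the figure legend: M = MuTect, J = JointSNVMix2,
# S = SomaticSniper. The fixed order yields labels such as "MJS" or "MS".
CALLER_ORDER = ("M", "J", "S")


def assign_call_sets(calls_by_caller):
    """Map each predicted site to the label of the callers that reported it.

    calls_by_caller: dict of caller code -> set of hashable site keys
    (hypothetical layout, e.g. (chrom, pos, alt) tuples).
    """
    callers_at_site = defaultdict(set)
    for caller, sites in calls_by_caller.items():
        if caller not in CALLER_ORDER:
            raise ValueError(f"unknown caller code: {caller!r}")
        for site in sites:
            callers_at_site[site].add(caller)
    return {
        site: "".join(c for c in CALLER_ORDER if c in callers)
        for site, callers in callers_at_site.items()
    }


# Made-up sites, for illustration only.
example = {
    "M": {("chr1", 100, "T"), ("chr2", 200, "A"), ("chr3", 300, "G")},
    "J": {("chr1", 100, "T"), ("chr3", 300, "G")},
    "S": {("chr1", 100, "T"), ("chr2", 200, "A"), ("chr4", 400, "C")},
}
labels = assign_call_sets(example)
```

A site reported by all three programs lands in the full-consensus set `MJS`; a site reported by one program only carries that program's single letter.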
Total variant calls made by each caller, pre-filtering and post-filtering out variants not suitable for validation by Sanger sequencing
| Call setᵃ | MJS | MJ | MS | JS | M | J | S | Total |
|---|---|---|---|---|---|---|---|---|
| All SNVs predicted, n (% of all variants)ᵇ | 1,483 (16.1%) | 83 (0.9%) | 462 (5.0%) | 298 (3.2%) | 1,756 (19.0%) | 2,387 (25.9%) | 2,757 (29.9%) | 9,226 |
| SNVs suitable for validation, n (% of all SNVs suitable for validation)ᶜ | 1,385 (54.4%) | 16 (0.6%) | 370 (14.5%) | 57 (2.2%) | 279 (11.0%) | 80 (3.1%) | 360 (14.1%) | 2,547 |
| Average number (range) of filtered SNVs per sample | 51.3 (1 to 246) | 0.6 (0 to 6) | 13.7 (5 to 42) | 2.1 (0 to 11) | 10.3 (3 to 33) | 3.0 (0 to 23) | 13.3 (5 to 25) | 94.3 (38 to 321) |
| Number (%) of SNVs that could not be used for validation | 98 (6.6%) | 67 (80.7%) | 92 (19.9%) | 241 (80.9%) | 1,477 (84.1%) | 2,307 (96.6%) | 2,397 (86.9%) | 6,679 (72.4%) |
Abbreviations: J, JointSNVMix2; JS, JointSNVMix2 + SomaticSniper; M, MuTect; MJ, MuTect + JointSNVMix2; MJS, MuTect + JointSNVMix2 + SomaticSniper; MS, MuTect + SomaticSniper; NRAF, non-reference allele frequency; S, SomaticSniper; SNV, single nucleotide variant.
ᵃ Call sets are identified as described in Figure 1.
ᵇ Predicted variants were filtered on the basis of read depth and NRAF, as described in the text.
ᶜ 'Suitable for validation' means that the evidence for the SNV met the criteria for validation by Sanger sequencing (tumor NRAF ≥0.2, germline NRAF ≤0.05 and ≥8 reads in both samples).
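The suitability criteria in footnote c amount to three threshold checks per site. A minimal Python sketch, assuming read counts per site are already tabulated; the `SiteEvidence` record and its field names are hypothetical, and only the threshold values (tumor NRAF ≥0.2, germline NRAF ≤0.05, ≥8 reads in both samples) come from the footnote:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteEvidence:
    """Read-level evidence at one predicted SNV (hypothetical record layout)."""
    tumor_depth: int
    tumor_alt_reads: int
    germline_depth: int
    germline_alt_reads: int


def nraf(alt_reads, depth):
    """Non-reference allele frequency; 0.0 when no reads cover the site."""
    return alt_reads / depth if depth > 0 else 0.0


def suitable_for_validation(site, min_tumor_nraf=0.2,
                            max_germline_nraf=0.05, min_depth=8):
    """Apply the thresholds quoted in footnote c to a single site."""
    return (
        site.tumor_depth >= min_depth
        and site.germline_depth >= min_depth
        and nraf(site.tumor_alt_reads, site.tumor_depth) >= min_tumor_nraf
        and nraf(site.germline_alt_reads, site.germline_depth) <= max_germline_nraf
    )


# Made-up example: 12 of 40 tumor reads and 0 of 30 germline reads carry the variant.
ok = suitable_for_validation(SiteEvidence(40, 12, 30, 0))
```

Keeping the thresholds as keyword arguments makes it easy to tighten them later, for example when re-filtering predictions that lack full consensus.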
Validation results by call set
| Call set | MJS | MJ | MS | JS | M | J | S | Total |
|---|---|---|---|---|---|---|---|---|
| Variants assessedᵃ | 181 | 13 | 37 | 28 | 31 | 26 | 48 | 364 |
| Germlineᶜ | 1 | 0 | 5 | 3 | 15 | 4 | 11 | 39 |
| Did not validateᶜ | 1 | 8 | 3 | 15 | 12 | 21 | 36 | 97 |
Abbreviations: J, JointSNVMix2; JS, JointSNVMix2 + SomaticSniper; M, MuTect; MJ, MuTect + JointSNVMix2; MJS, MuTect + JointSNVMix2 + SomaticSniper; MS, MuTect + SomaticSniper; NRAF, non-reference allele frequency; S, SomaticSniper; SNV, single nucleotide variant.
ᵃ Number of sites that were successfully amplified and Sanger sequenced.
ᵇ Indicates the number of SNVs that were confirmed as somatic.
ᶜ False-positive SNVs were either detected in the matched germline (normal) sample ('Germline') or did not validate in the tumor sample ('Did not validate').
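Per-call-set validation rates can be worked out from the counts above. The sketch below rests on two assumptions that the table itself does not state: that the columns run MJS, MJ, MS, JS, M, J, S, and that every assessed site is either confirmed somatic, found in the germline, or not validated, so that confirmed = assessed − germline − did not validate. Variable and function names are illustrative.

```python
CALL_SETS = ("MJS", "MJ", "MS", "JS", "M", "J", "S")

# Per-call-set counts copied from the table (assumed column order, see above).
assessed = dict(zip(CALL_SETS, (181, 13, 37, 28, 31, 26, 48)))
germline = dict(zip(CALL_SETS, (1, 0, 5, 3, 15, 4, 11)))
did_not_validate = dict(zip(CALL_SETS, (1, 8, 3, 15, 12, 21, 36)))

# Assumption: whatever is not a false positive was confirmed as somatic.
confirmed = {
    cs: assessed[cs] - germline[cs] - did_not_validate[cs] for cs in CALL_SETS
}


def consensus_level(call_set):
    """Full consensus = three programs, partial = two, none = one."""
    return {3: "full", 2: "partial", 1: "none"}[len(call_set)]


# Pool confirmed and assessed counts by level of consensus.
by_level = {}
for cs in CALL_SETS:
    tp, n = by_level.get(consensus_level(cs), (0, 0))
    by_level[consensus_level(cs)] = (tp + confirmed[cs], n + assessed[cs])

rates = {level: tp / n for level, (tp, n) in by_level.items()}
```

Under these assumptions the pooled counts come out as 179/181 for full consensus, 44/78 for partial consensus and 6/105 for no consensus, which agrees with the >98% full-consensus validation rate in the abstract and with the base validation rates listed in the additional-filtering table.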
Summary of validation results by program and combination of programs
| Program or group | TPs / predictions assessed | TPs / all TPs |
|---|---|---|
| MuTect | 82.8%ᵃ | 94.8%ᵇ |
| JointSNVMix2 | 78.6%ᵃ | 85.2%ᵇ |
| SomaticSniper | 74.5%ᵃ | 95.6%ᵇ |
| Consensus of 3 programs | 98.9%ᶜ | 78.2%ᵈ |
| All variants | 38.6%ᶜ | 100%ᵈ |
Abbreviations: TP, true positive.
ᵃ Number of TPs predicted by program/all predictions by program for which validation was attempted.
ᵇ Number of TPs predicted by program/all TPs.
ᶜ Number of TPs predicted in the group/all predictions by program for which validation was attempted.
ᵈ Number of TPs predicted in the group/all TPs.
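The two ratios defined in footnotes a and b can be recomputed by pooling call sets per program. This is a hedged sketch, not the authors' code: the per-call-set true-positive counts are an assumption, obtained by subtracting the germline and did-not-validate counts from the variants assessed in the validation table, with columns assumed to run MJS, MJ, MS, JS, M, J, S. The function name `program_metrics` is hypothetical.

```python
CALL_SETS = ("MJS", "MJ", "MS", "JS", "M", "J", "S")

# Sites assessed by Sanger sequencing, per call set (from the validation table).
assessed = dict(zip(CALL_SETS, (181, 13, 37, 28, 31, 26, 48)))
# Assumed true positives: assessed minus germline minus did-not-validate.
true_positives = dict(zip(CALL_SETS, (179, 5, 29, 10, 4, 1, 1)))

TOTAL_TP = sum(true_positives.values())


def program_metrics(code):
    """Return (TPs / predictions assessed, TPs / all TPs) for one program.

    A program's predictions are all call sets whose label contains its code,
    e.g. MuTect ("M") contributes to MJS, MJ, MS and M.
    """
    sets = [cs for cs in CALL_SETS if code in cs]
    tp = sum(true_positives[cs] for cs in sets)
    n = sum(assessed[cs] for cs in sets)
    return tp / n, tp / TOTAL_TP


metrics = {
    name: tuple(round(100 * x, 1) for x in program_metrics(code))
    for name, code in (("MuTect", "M"), ("JointSNVMix2", "J"), ("SomaticSniper", "S"))
}
consensus = (
    round(100 * true_positives["MJS"] / assessed["MJS"], 1),
    round(100 * true_positives["MJS"] / TOTAL_TP, 1),
)
```

Under these assumptions the sketch returns 82.8% and 94.8% for MuTect, 78.6% and 85.2% for JointSNVMix2, 74.5% and 95.6% for SomaticSniper, and 98.9% and 78.2% for the three-program consensus, matching the corresponding rows of the table.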
Sequencing coverage and read-mapping quality features for true-positive and false-positive predictions
| Feature | TPᵃ | All FPsᵇ | DNVᶜ | Germlineᵈ |
|---|---|---|---|---|
| Median read depth, germline | 105 | 25ʰ | 28ʰ | 12ᵍ |
| Median read depth, tumor | 94 | 20ʰ | 21ᵍ | 19ᵍ |
| Mean non-reference allele frequency, germline | 0.08% | 1.4%ᵍ | 1.27%ᵍ | 1.77%ᵍ |
| Mean non-reference allele frequency, tumor | 43.1% | 30.9%ʰ | 27.3%ʰ | 39.8% |
| Mean percentage uniquely mapped reads | 97.1% | 85.3%ᶠ | 83.7%ᶠ | 89.3% |
| Fraction of SNVs with <95% reads mapping uniquely | 8.3% (19/229) | 52.2%ʰ (71/136) | 57.7%ʰ (56/97) | 38.4%ᶠ (15/39) |
| Fraction of predicted SNVs with >5% of reads mate-rescued | 4.4% (10/229) | 39.7%ʰ (54/136) | 46.3%ʰ (45/97) | 23.1%ᵉ (9/39) |
| Mean percentage of reads mapped to multiple locations | 1.9% | 4.7%ᵉ | 4.4% | 5.5%ᶠ |
Abbreviations: DNV, did not validate; FP, false positive; J, JointSNVMix2; JS, JointSNVMix2 + SomaticSniper; M, MuTect; MJ, MuTect + JointSNVMix2; MJS, MuTect + JointSNVMix2 + SomaticSniper; MS, MuTect + SomaticSniper; NRAF, non-reference allele frequency; S, SomaticSniper; SNV, single nucleotide variant; TP, true positive.
ᵃ Includes all SNVs that validated in the tumor sample.
ᵇ Combined value for the two categories of FPs (DNV and germline).
ᶜ Indicates SNVs that were not confirmed in the tumor sample.
ᵈ Indicates SNVs that were also detected in the matched germline.
Significance was tested using the Wilcoxon rank-sum test for continuous variables and Fisher's exact test for the fraction of reads mapping uniquely, by mate-pair rescue or to multiple locations: ᵉ P < 0.05; ᶠ P < 1 × 10⁻⁵; ᵍ P < 1 × 10⁻¹⁰; ʰ P < 1 × 10⁻¹⁵.
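To illustrate the kind of comparison behind the Fisher's exact test markers, the sketch below runs a two-sided test on one row of the table using only the Python standard library. This section does not state which software the authors used, so this is an independent illustration; the function name `fisher_exact_two_sided` is hypothetical, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    lo = max(0, col1 - row2)   # smallest feasible top-left count
    hi = min(col1, row1)       # largest feasible top-left count

    def weight(x):
        # Unnormalised hypergeometric probability, kept as an exact integer.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    tail = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return tail / comb(n, col1)


# SNVs with <95% of reads mapping uniquely: 19 of 229 TPs versus 71 of 136 FPs.
p_unique = fisher_exact_two_sided(19, 229 - 19, 71, 136 - 71)
```

Exact integer weights avoid floating-point underflow when the P value is very small, as it is for this row.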
Figure 2. Sequencing coverage and read-mapping results for validated somatic mutations and false positives. (A) Log2-scaled read depths for sites harboring validated somatic mutations (red dots) and mutations that failed validation (black stars), in the tumor and normal samples from each individual predicted to harbor the mutation. (B) Fraction of reads containing the non-reference allele that mapped uniquely by BWA for validated somatic mutations (Somatic), mutations that were also detected in the germline sample during validation (Germline) and those that were not detected in either the tumor or the germline during validation (Did Not Validate; DNV). (C) Fraction of reads mapped by BWA using mate-pair rescue for validated somatic mutations (Somatic), mutations that were also detected in the germline sample during validation (Germline) and those that were not detected in either the tumor or the germline during validation (DNV). (B,C) Each box covers the interquartile range, with a horizontal line representing the median. Whiskers indicate the minimum and maximum values.
Additional filtering improved specificity of predictions lacking full consensus
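| Filter | Validation rateᵃ: partial consensusᵇ | Validation rateᵃ: no consensusᶜ | Validation rateᵃ: combined | TPs discardedᵈ |
|---|---|---|---|---|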
| Base validation rateᵉ | 44/78 (56%) | 6/105 (6%) | 50/183 (27%) | |
| Percentage of mate-rescued reads <7%ᶠ | 44/64 (69%) | 6/69 (9%) | 50/133 (38%) | 0% |
| RD >10 (tumor and germline)ᵍ | 40/68 (59%) | 5/72 (7%) | 45/140 (32%) | 10% |
| Germline non-reference allele frequency ≤0.02ʰ | 42/65 (65%) | 6/75 (8%) | 48/140 (34%) | 4% |
| GATK prediction for SNV in tumorⁱ | 43/66 (65%) | 5/47 (11%) | 48/113 (42%) | 4% |
| GATK + mate-rescued | 43/56 (77%) | 5/31 (16%) | 48/87 (55%) | 4% |
| GATK + mate-rescued + RD >10 | 39/46 (85%) | 4/19 (21%) | 43/65 (66%) | 14% |
Abbreviations: RD, read depth; SNV, single nucleotide variant; TP, true positive.
ᵃ Validation rate = TPs/total SNVs assessed.
ᵇ Partial-consensus predictions (made by two programs).
ᶜ No-consensus predictions (made by only one program).
ᵈ Percentage of TPs that would be discarded if the indicated set of filters was applied, that is, the loss in sensitivity.
ᵉ Refers to specificity prior to filtering.
ᶠ Filtering on the percentage of reads mapped by mate-pair rescue.
ᵍ Filtering on RD increased from 8 to 10 in tumor and germline.
ʰ Non-reference allele frequency threshold in germline decreased from ≤0.05 to ≤0.02.
ⁱ Filtering for SNVs predicted in the tumor but not the germline by GATK Unified Genotyper.
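The filters in this table can be combined as simple per-site predicates, and the validation rate and loss in sensitivity recomputed after each combination. A minimal Python sketch under stated assumptions: the `Prediction` record, its field names and the toy data are hypothetical, while the thresholds (mate-rescued reads <7%, RD >10 in tumor and germline, germline NRAF ≤0.02, GATK call in the tumor only) are taken from the table and its footnotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One partial- or no-consensus SNV prediction (hypothetical layout)."""
    tumor_depth: int
    germline_depth: int
    germline_nraf: float
    mate_rescued_fraction: float
    gatk_somatic: bool   # GATK calls the SNV in the tumor but not the germline
    validated: bool      # Sanger result, known only for assessed sites


def passes_filters(p, use_mate_rescue=True, use_depth=True,
                   use_gatk=True, use_germline_nraf=False):
    """Apply the selected filters; a site must pass every enabled one."""
    if use_mate_rescue and not p.mate_rescued_fraction < 0.07:
        return False
    if use_depth and not (p.tumor_depth > 10 and p.germline_depth > 10):
        return False
    if use_gatk and not p.gatk_somatic:
        return False
    if use_germline_nraf and not p.germline_nraf <= 0.02:
        return False
    return True


def summarize(predictions, **filter_options):
    """Return (validation rate, fraction of true positives discarded)."""
    kept = [p for p in predictions if passes_filters(p, **filter_options)]
    tp_total = sum(p.validated for p in predictions)
    tp_kept = sum(p.validated for p in kept)
    rate = tp_kept / len(kept) if kept else float("nan")
    lost = 1 - tp_kept / tp_total if tp_total else float("nan")
    return rate, lost


# Made-up sites, for illustration only.
toy = [
    Prediction(40, 35, 0.00, 0.01, True, True),    # passes every filter
    Prediction(9, 30, 0.00, 0.02, True, True),     # fails the depth filter
    Prediction(50, 45, 0.00, 0.20, True, False),   # fails the mate-rescue filter
    Prediction(30, 28, 0.04, 0.00, False, False),  # fails the GATK filter
]
rate_all, lost_all = summarize(toy)
```

The second return value mirrors the last column of the table: stricter filter sets raise the validation rate but discard some genuine somatic mutations, so the choice depends on how much sensitivity can be given up.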
Recommendations for applying consensus results to somatic mutation predictions
| Level of consensus | Recommendation |
|---|---|
| Consensus | Accurate, high-confidence predictions suitable for further analysis |
| Partial consensus | Roughly equal numbers of genuine somatic mutations and false positives. Utility dependent on need to maximize sensitivity. Further filtering can be performed to improve confidence |
| No consensus | Largely false positives; may be disregarded unless compelling biological interest exists. Explore using high-confidence list of variants from consensus |
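These recommendations translate directly into a triage step keyed on how many programs made the prediction. A small Python sketch, assuming call sets are labeled with the one-letter codes M, J and S; the function name `triage` and the wording of the suggested actions are illustrative paraphrases of the table, not the authors' implementation:

```python
RECOMMENDATIONS = {
    3: ("consensus", "keep as high-confidence prediction"),
    2: ("partial consensus", "apply further filtering if sensitivity matters"),
    1: ("no consensus", "disregard unless of compelling biological interest"),
}


def triage(call_set):
    """Return (consensus level, suggested action) for a label such as 'MS'."""
    callers = set(call_set)
    if not callers or not callers <= {"M", "J", "S"}:
        raise ValueError(f"unexpected call-set label: {call_set!r}")
    return RECOMMENDATIONS[len(callers)]


level, action = triage("MJS")
```

Keeping the mapping in a single dictionary makes it straightforward to adjust the actions if a project needs to favor sensitivity over specificity.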