Moritz Gerstung, Elli Papaemmanuil, Peter J Campbell.
Abstract
MOTIVATION: Targeted resequencing of cancer genes in large cohorts of patients is important to understand the biological and clinical consequences of mutations. Cancers are often clonally heterogeneous, and the detection of subclonal mutations is important from a diagnostic point of view, but presents strong statistical challenges.
Year: 2014 PMID: 24443148 PMCID: PMC3998123 DOI: 10.1093/bioinformatics/btt750
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. General illustration of our approach. (a) Distribution of observed and expected VAFs across samples. The histograms denote the forward and reverse VAFs of a recurrent artifact occurring at low frequencies in ∼20% of the samples in the forward, but not in the reverse, orientation. The solid lines denote the expected distribution based on a beta-binomial model, Equation (1), with means defined as the averages across all samples with VAF below a cutoff. The third histogram denotes the SF3B1 K700E variant, present at clonal and subclonal frequencies, with the curve denoting the expected frequency distribution. (b) Heatmap of 1000 nt from five adjacent bait sets targeting the SF3B1 gene in 683 samples. The intensity of each pixel represents the VAF of cytosine in a given sample (y, left axis) and position (x). If the relative frequency is identical, pixels tend to be black. Curves on the bottom indicate the error rates in forward and reverse directions (right y-axis). The black line is the estimated dispersion ρ. The prior π of finding a true variant is derived from the COSMIC database. Circles are drawn around variants whose posterior meets the calling cutoff; the area of each circle is proportional to the VAF. At position 650 resides the K700E hotspot mutation with many variant calls. (c–f) Bayes factors [Equation (7)] as a function of forward (x) and reverse (y) allele counts for different error rates and dispersions ρ. (g) A variant-specific prior π influences the Bayes factor needed to call a variant at a given cutoff on the posterior probability, Equation (9).
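The quantities named in the caption above (a beta-binomial read-count model with a mean and a dispersion ρ, a Bayes factor, and a prior π that shifts the posterior) can be illustrated with a minimal Python sketch. It is not the paper's Equations (1), (7) or (9), which are not reproduced in this record. It assumes the common overdispersion parameterization α = ν(1 − ρ)/ρ, β = (1 − ν)(1 − ρ)/ρ, and it assumes the Bayes factor is defined as P(data | no variant) / P(data | variant), so that small values favor a variant. All function names are hypothetical.

```python
from math import lgamma, exp


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(x, n, nu, rho):
    """Log-probability of x variant reads out of n under a beta-binomial
    with mean nu and dispersion rho (assumed parameterization)."""
    a = nu * (1.0 - rho) / rho
    b = (1.0 - nu) * (1.0 - rho) / rho
    log_choose = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return log_choose + log_beta(x + a, n - x + b) - log_beta(a, b)


def posterior_null(bayes_factor, prior_variant):
    """Posterior probability of 'no variant' from Bayes' theorem, with
    bayes_factor = P(data | null) / P(data | variant) (assumed convention)."""
    odds = bayes_factor * (1.0 - prior_variant) / prior_variant
    return odds / (1.0 + odds)


# Illustrative point-alternative likelihood ratio (a simplification, not Eq. 7):
# 5 variant reads out of 100, error rate 0.001 versus a 5% variant.
log_bf = betabinom_logpmf(5, 100, 0.001, 0.001) - betabinom_logpmf(5, 100, 0.05, 0.001)
print(posterior_null(exp(log_bf), prior_variant=0.01))
```

Under this convention, a larger prior π lowers the posterior probability of the null for the same Bayes factor, which is the qualitative effect panel (g) describes.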
Forty-three genes analyzed in 683 MDS samples with average coverage in parentheses
Fig. 2. Variant calling in control data. (a) Power (true-positive rate) of detecting variants with different frequency and coverage for fixed dispersion ρ. (b) Power of detecting variants when ρ is estimated from the data using a VAF cutoff of 0.1. (c) AUC as a function of cohort size for different variant allele frequencies. The two lines for each VAF refer to the case and to the case , respectively. (d) Specificity of different algorithms on 32 normal control samples. (e) Scatterplot of Bayes factors for 20 replicates. Colors denote variants meeting a posterior threshold of 0.5 in only one of the two replicates. Open circles are known polymorphisms. (f) Concordance of variant calls as a function of the posterior cutoff. Filled segments show the number of variants called in either of the two replicates (top and bottom; left axis) and the overlapping fraction (middle) when a given posterior cutoff is applied. The black line (right axis) shows the relative proportion of overlapping to total calls.
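The replicate concordance described in panel (f) can be sketched as follows. This is only an illustration: it assumes that each candidate site has one posterior value per replicate and that a variant is called when the posterior probability of the null falls below the cutoff. The function name and the example values are hypothetical.

```python
def concordance(post_a, post_b, cutoff):
    """Count calls unique to replicate A, shared calls, and calls unique to
    replicate B, plus the ratio of shared to total calls.

    Assumption: a site is called when its posterior is below `cutoff`.
    """
    called_a = {i for i, p in enumerate(post_a) if p < cutoff}
    called_b = {i for i, p in enumerate(post_b) if p < cutoff}
    both = called_a & called_b
    total = called_a | called_b
    fraction = len(both) / len(total) if total else float("nan")
    return len(called_a - called_b), len(both), len(called_b - called_a), fraction


# Hypothetical posteriors for four candidate sites in two replicates.
only_a, shared, only_b, fraction = concordance(
    [0.01, 0.4, 0.9, 0.2], [0.02, 0.6, 0.8, 0.1], cutoff=0.5
)
```

Evaluating the function over a grid of cutoffs gives the kind of curve the panel plots: counts per replicate, the overlap, and the overlapping proportion.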
Fig. 3. Variants in MDS. (a) Number of non-polymorphic variant calls versus cutoff P0 and prior weight. (b) Ratio of non-silent to silent variant calls. (c) Venn diagram of the distribution of shearwater variants across a normal panel, known SNPs and COSMIC variants. (d) Distribution of variant allele frequencies. (e) Venn diagram of calls from different algorithms. (f) Number of SF3B1 K700E calls as a function of false positives for different variant callers.
Fig. 4. Prognostic effect of different variant callers. (a–d) The fraction of patients without an event (either death or AML transformation) versus the time in months after sampling is shown. Patients are split into groups depending on whether the patient has a non-silent mutation in the given gene, found exclusively by Caveman, by shearwater only or by both. The gray line denotes patients with no mutations. P-values in the caption are from a log-rank test against the wild-type group; C is the corresponding C-statistic. While the Kaplan–Meier curves and N refer to the fraction of patients exclusive to each method, P and C include the joint cases. (e) C-statistic for shearwater for different parameters. (f) C-statistic under permutation tests shuffling all calls in the set of variants exclusive to one variant caller. (g) C-statistic for different AND combinations of genotypes. (h) C-statistic for different OR combinations of genotypes.
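The C-statistic reported in these panels measures how well a predictor, here mutation status, orders patients by time to event. The sketch below uses Harrell's concordance index for right-censored data. That choice is an assumption, since the record does not state which estimator the authors used, and the function name and example data are hypothetical.

```python
def harrell_c(times, events, risk):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an event and
    times[i] < times[j]. It counts as concordant when the patient with the
    earlier event has the higher risk score; ties in risk count one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Hypothetical cohort: months to event or censoring, event indicator (1 = event),
# and mutation status used as a binary risk score (1 = mutated).
c_stat = harrell_c([5, 10, 15, 20], [1, 1, 0, 1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to no prognostic information. With a binary genotype many pairs are tied in risk, which pulls the statistic toward 0.5.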