Kin Chan1, Steven A Roberts1,2, Leszek J Klimczak3, Joan F Sterling1, Natalie Saini1, Ewa P Malc4, Jaegil Kim5, David J Kwiatkowski5,6, David C Fargo3, Piotr A Mieczkowski4, Gad Getz5,7, Dmitry A Gordenin1. 1. Genome Integrity and Structural Biology Laboratory, National Institute of Environmental Health Sciences, US National Institutes of Health, Research Triangle Park, North Carolina, USA. 2. School of Molecular Biosciences, Washington State University, Pullman, Washington, USA. 3. Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, US National Institutes of Health, Research Triangle Park, North Carolina, USA. 4. Department of Genetics, Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, USA. 5. Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. 6. Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA. 7. Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, USA.
Abstract
Elucidation of mutagenic processes shaping cancer genomes is a fundamental problem whose solution promises insights into new treatment, diagnostic and prevention strategies. Single-strand DNA-specific APOBEC cytidine deaminase(s) are major source(s) of mutation in several cancer types. Previous indirect evidence implicated APOBEC3B as the more likely major mutator deaminase, whereas the role of APOBEC3A is not established. Using yeast models enabling the controlled generation of long single-strand genomic DNA substrates, we show that the mutation signatures of APOBEC3A and APOBEC3B are statistically distinguishable. We then apply three complementary approaches to identify cancer samples with mutation signatures resembling either APOBEC. Strikingly, APOBEC3A-like samples have over tenfold more APOBEC-signature mutations than APOBEC3B-like samples. We propose that APOBEC3A-mediated mutagenesis is much more frequent because APOBEC3A itself is highly proficient at generating DNA breaks, whose repair can trigger the formation of single-strand hypermutation substrates.
Elucidation of mutagenic processes shaping cancer genomes is a fundamental problem whose solution promises insights into new treatment, diagnostic, and prevention strategies[1]. Single-strand DNA-specific APOBEC cytidine deaminase(s) are major source(s) of mutations in several cancer types[2-4]. Previous indirect evidence implicated APOBEC3B as the more likely major mutator deaminase, while APOBEC3A's role is not established[5,6]. Using yeast models enabling controlled generation of long single-strand genomic DNA substrates[7], we show the mutation signatures of APOBEC3A and APOBEC3B are statistically distinguishable. We then apply three complementary approaches to identify cancer samples with mutation signatures resembling either APOBEC. Strikingly, APOBEC3A-like samples have over ten-fold more APOBEC-signature mutations than APOBEC3B-like samples. We propose that APOBEC3A mutagenesis is much stronger because APOBEC3A itself is highly proficient at generating DNA breaks[8-10], whose repair can trigger formation of single-strand hypermutation substrates.
Recently, we and others have shown that some cancers have an abundance of apparently simultaneous, closely-spaced mutations, variously referred to as ‘kataegis’[11] and ‘mutation clusters’[12]. Many clusters are strand- and nucleotide-coordinated, consisting entirely of mutations at cytosines on one DNA strand, most frequently within 5′-TCW-3′ motifs (the mutated base is the C; W denotes adenine or thymine)[4]. These characteristics are consistent with the mutagenic properties of several APOBEC cytidine deaminases, which target 5′-TC-3′ motifs in single-strand DNA (ssDNA)[13-17].
Analyses of cancer mutation datasets have implicated APOBEC3B (A3B) as the leading candidate[5,6], with APOBEC3A (A3A) as another possible mutator[4]. Numerous recent reports have linked high A3B expression to various cancers, reflecting a widely held view that A3B is the likely major mutator[3,4,9,18-22].
On the other hand, there is evidence that A3A could be a mutator in cancers[8-10,23,24]. Consistent with this possibility, breast cancers from carriers of a germline A3B deletion allele, fusing the A3A transcript to A3B's 3′ regulatory sequences, actually tend to have higher TC-signature mutation loads than cancers from non-carriers[25]. Such fusion transcripts are more stable, resulting in higher steady-state levels of A3A enzyme[26].
A more conclusive way to distinguish between possible sources of mutagenesis in cancers is to match mutation signatures extracted from statistical analysis of each cancer with well-defined signature(s) of candidate mutagen(s)[27,28]. Thus, we collected large numbers of mutations induced by either A3A or A3B, in a yeast reporter strain (deleted for uracil glycosylase) that generates chromosomal ssDNA upon temperature shift[7]. Telomere uncapping in the presence of ssDNA-damaging mutagens results in selectable mutation clusters inactivating multiple reporter genes[7,29]. Crucially, resection of the complementary strand precludes excision repair, and uracils from cytidine deaminations give rise to C → T transitions[7]. pLogo analysis[30] of mutations identified by whole genome sequencing (WGS) of yeast revealed almost diametrically opposite motif preferences: A3A favored YTCA, while A3B favored RTCA (Y = pyrimidine, R = purine, see Fig. 1a–f and Supplementary Table 1). This was corroborated by our fold enrichment methodology[12] (see Fig. 1g, 1h). Re-analysis of mutation data from Neuberger and colleagues, generated by expressing A3A or A3B in a conventional yeast system[31], yielded similar results (see Supplementary Fig. 1). The motif preferences of APOBECs in yeast should be suitable models for the enzymes’ preferences in human cells, since the local sequence contexts flanking cytosines in both species’ genomes are quite similar, except for depletion of CpG motifs in human[31].
Figure 1
Analyses of mutations induced by APOBECs in ung1Δ yeast. pLogos show overrepresented nucleotides in a motif above the horizontal axis and underrepresented nucleotides below[30]. The size of each letter indicates magnitude of over- or underrepresentation. Fixed positions in each motif are highlighted by a box. n(fg) denotes the number of mutations at C, TC, or TCA, while n(bg) denotes the number of contexts at C, TC, or TCA. (a and b) All C:G → T:A substitutions induced by (a) A3A or (b) A3B, with C fixed at position 0, indicating overrepresentation of TC. (c) A3A and (d) A3B pLogos with fixed TC, revealing overrepresentation of TCA. (e) A3A and (f) A3B pLogos with fixed TCA, revealing near-diametrically opposite preferences at the –2 nucleotide, two positions 5′ of the deamination site. (g and h) Enrichment values for APOBEC-related motifs among genome-wide, scattered, clustered, and C-coordinated clustered mutations induced by (g) A3A or (h) A3B. Note that the similar enrichment values for the same motif (e.g., TCA) among different mutation categories suggest that the APOBECs targeted their cognate motifs with similar specificity, whether the ssDNA was clustered and presumably persistent (i.e., at uncapped telomeres), or scattered and presumably transient (e.g., transcription intermediates).
pLogos also showed that mutations at TCA (a component of TCW) were overrepresented for both APOBECs, while TCT was underrepresented. Then TCA enrichment in cancers should exceed TCW enrichment, if TC mutations were caused by either A3A or A3B. We evaluated 15 cohorts of recently published cancer WGS samples[25,32]. Five cancer types[2-4] (six cohorts, see Fig. 2) had high rates of APOBEC-signature mutagenesis: bladder (BLCA), breast (BRCA), head and neck (HNSC), lung adenocarcinoma (LUAD) and squamous cell (LUSC). In BLCA, BRCA, and HNSC, high TCA enrichment for APOBEC mutations was clearly evident. APOBEC mutagenesis was also detectable in LUAD and LUSC, as shown by high TCA enrichment values in C-coordinated mutation clusters, despite high genome-wide mutation loads from non-APOBEC sources. Low-APOBEC mutagenesis cancer types (e.g. multiple myeloma[33], where only a small percentage of samples exhibit significant APOBEC mutagenesis) are included in Supplementary Table 2.
Figure 2
Enrichment for mutations at various target motifs among all genome samples, and sample-by-sample comparison of genome-wide enrichment at TCA vs. TCW, within six cohorts of highly APOBEC-mutated cancer types. (a–f) Enrichment for mutations at TC, TCW, TCA, RTCA, and YTCA are shown for (a) BLCA, (b) BRCA, (d) HNSC, (e) LUAD, and (f) LUSC cohorts from TCGA, as well as (c) a BRCA cohort from ICGC. High genome-wide non-APOBEC mutation loads obscured the presence of APOBEC mutagenesis in the lung cancers. Nevertheless, APOBEC signature enrichment values in C-coordinated clusters of (e) LUAD and (f) LUSC are similar to those in other cancer types (a–d), confirming that examination of such clusters is the most sensitive means to detect APOBEC mutagenesis. (g–l) Sample-by-sample comparison of enrichment for mutations at TCA vs. TCW for (g) BLCA, (h) BRCA, (i) BRCA ICGC, (j) HNSC, (k) LUAD, and (l) LUSC cohorts. Samples are binned by quartile of TCW enrichment. χ2 tests for trend (p-values in each panel) confirm that as TCW enrichment increases, there is significant skewing toward samples with TCA enrichment > TCW enrichment.
We next examined the relationship between TCW and TCA enrichments on a per-sample basis for the six high-APOBEC cohorts (see Fig. 2). Results for low-APOBEC cohorts are in Supplementary Figure 2. We ordered all samples within each cohort by ascending TCW enrichment and binned them into quartiles (see Fig. 2). As TCW enrichment increased, there was a statistically significant trend toward samples with TCA enrichment > TCW enrichment (χ2 test for trend p-values in Fig. 2). This suggested that A3A, A3B, or both, were mutagenizing cancers with high TCW enrichment. Similar results were obtained when analyzing exomes from BLCA, BRCA, HNSC, LUAD, LUSC, and cervical cancer (CESC) (see Supplementary Fig. 2), bolstering the conclusion that APOBEC(s) preferentially targeting TCA were acting in many cancers. APOBEC-signature mutation loads in cancer exomes are statistically correlated with A3B and A3A transcript abundance[4,9]. However, this was not a reliable metric for distinguishing mutagenicity of specific APOBECs within these cancer genomes (see Supplementary Fig. 3), possibly because mRNA abundance in excised tumors need not correlate with mRNA (or protein) abundance at the time of mutagenesis.
We next sub-categorized TCA-enriched samples into A3A- and A3B-like subsets by comparing YTCA enrichment vs. RTCA enrichment (see Fig. 3). Samples with a non-random ratio of YTCA vs. RTCA mutations (see “Y/RTCA enrichment analysis” in Online Methods) were binned by quartile of TCA enrichment. χ2 tests for trend (p-values in Fig. 3) indicated significant skewing toward A3A-like signatures as TCA enrichment increased. Results for other cohorts are in Supplementary Figure 4. We estimated the minimal number of TCA mutations attributable to an APOBEC in each sample (see Fig. 3g and Supplementary Table 3), which revealed the overall A3A-like median value (1,480) was over 11-fold greater than the A3B-like median (133). Thus, A3A is a much more prolific mutator than A3B.
Figure 3
Y/RTCA analysis and estimated mutation load of highly APOBEC-mutated cohorts. (a–f) Only samples significantly enriched for mutations at TCA (q < 0.05) are shown. Samples with YTCA mutation to RTCA mutation ratio statistically different from random (q < 0.05) are binned by quartile of TCA enrichment and plotted as filled symbols. Samples with YTCA > RTCA enrichment are considered A3A-like. Those with RTCA > YTCA enrichment are A3B-like. p-values from χ2 test for trend are shown. Samples significantly enriched at TCA, but with YTCA vs. RTCA ratio not statistically different from random (q > 0.05), are plotted as unfilled, gray-bordered symbols and not included in tests for trend. (g) The minimal estimated number of TCA mutations attributable to APOBEC mutagenesis, for A3A- (pink) or A3B-like (green) samples in each cohort, and for all six cohorts combined, are shown along with medians and Kolmogorov-Smirnov p-values. See Online Methods for analytical details.
To verify these findings, we compared proportions of mutations at each NTCA in cancers vs. each yeast model, using root mean square deviation (RMSD) calculations (see “NTCA proportion analysis” in Online Methods), and generated corresponding pLogos for the BRCA ICGC cohort (see Fig. 4). Results for five other high-APOBEC WGS cohorts are in Supplementary Figure 5. NTCA and pLogo analyses concurred with Y/RTCA results: lower TCA enrichment quartile samples were usually A3B-like (smaller RMSD vs. A3B model), transitioning to A3A-like samples (smaller RMSD vs. A3A model) in the upper quartiles.
Figure 4
Three-way comparison of Y/RTCA, NTCA, and pLogo methodologies for identifying samples with A3A- or A3B-like signatures in the BRCA ICGC cohort. Samples are binned by quartile of TCA enrichment, with 1st (lowest) quartile in (a), 2nd quartile in (b), 3rd quartile in (c), and 4th (highest) quartile in (d). All samples in the figure passed statistical filtering for significant TCA-signature enrichment (q < 0.05). Samples in Y/RTCA analysis also passed statistical filtering for non-random ratio of YTCA vs. RTCA, by χ2 test for goodness of fit. Samples in NTCA analysis passed analogous filtering for non-random proportion of four NTCA's. In each RMSD graph (middle panels), samples are arranged by increasing TCA enrichment. All three analyses indicated that lower TCA enrichment samples predominantly have A3B-like signatures, while high enrichment samples in the upper quartiles are all A3A-like. See Online Methods for analytical details.
Recent publications reported that A3B germline deletion carriers are at higher risk for breast cancer[34,35], and tumors from these patients have higher APOBEC-signature mutagenesis[25]. Thus, we investigated possible relationships between A3B germline copy number variation and prevalence of A3A- or A3B-like mutation signatures. By all three analyses, A3B deletion samples from the BRCA ICGC cohort were predominantly A3A-like (see Fig. 5a). In contrast, A3B wild-type samples showed a roughly equal split between A3A- (Fig. 5b) and A3B-like (Fig. 5c) signatures. Fisher's exact tests (p = 0.0024 by Y/RTCA and p = 0.0277 by NTCA analyses) confirmed significant skewing toward A3A-like signatures among A3B deletion samples. Similar results were obtained when the other high-APOBEC cohorts were evaluated (see Supplementary Fig. 6).
Figure 5
Relationship between A3B germline copy number and mutation signatures in the BRCA ICGC cohort. Samples passed same filtering criteria as in Figure 4. (a) A3B deletion samples (one homozygous, denoted by arrowhead; remainder heterozygous) skew toward A3A-like signatures. (b) A3A-like and (c) A3B-like subsets of samples with wild-type (WT) A3B copy number.
Our results (summarized in Fig. 6) strongly suggest that, in general, A3A is the predominant mutagenic deaminase in cancers. In cancers, APOBEC signatures were clearly detectable because abasic sites from uracil excision in ssDNA were not repaired. Instead, they were likely bypassed by error-prone translesion DNA polymerases to create mutations (see ref. 36 and references therein). Our approach relies on the supposition that, with respect to the motif preferences of APOBECs, cytosines in yeast ssDNA are suitable models for cytosines in ssDNA of human cancers. Since the molecular machinery of DNA transactions is not identical between the two species, we do not rule out the possibility that APOBEC motif preferences might be at least somewhat different between yeast and human. As sequencing technologies mature, it should become feasible to put this question to a rigorous test, by analyzing APOBEC motif preferences at thousands of mutated cytosines in human tissue culture models and comparing to our results in yeast.
Figure 6
Summary of data analyses and conclusions. (a and b) Sets of mutations induced by either (a) A3A or (b) A3B in yeast are successively more enriched at TC, TCW, and TCA. (a) For A3A, mutations at YTCA are more enriched than mutations at RTCA. (b) In contrast for A3B, mutations at RTCA are more enriched than mutations at YTCA. (c) By Y/RTCA enrichment analysis, out of 243 cancer genome samples with significant TCA mutagenesis, 101 (41.6%) are A3A-like and 63 (25.9%) are A3B-like. The remaining 79 (32.5%) are indeterminate. (d) By NTCA proportion analysis, 124 cancer samples (51.0%) are A3A-like, 75 (30.9%) are A3B-like, and 44 (18.1%) are indeterminate. (e) In A3B-like cancer samples, background A3B mutagenesis results in low overall TCA enrichment signatures with higher RTCA (especially ATCA) enrichment. (f) In A3A-like cancer samples, background A3B mutagenesis is dwarfed by A3A mutagenic activity, leading to high TCA enrichment signatures with even higher YTCA (especially CTCA) enrichment.
The finding that A3A-signature mutagenesis is more prominent in cancers might seem surprising, since A3B mRNA abundance tends to be higher than A3A's in cancer samples (see Supplementary Fig. 3). However, A3A is a much more potent inducer of DNA damage, likely via strand breakage as demonstrated by staining for γ-H2AX (a marker for double-strand breaks) and/or comet assay[8-10,23]. This is also consistent with observations that APOBEC-signature mutations and clusters are frequently co-localized with rearrangement breakpoints in cancers[11,12,37]. We propose that A3A-signature mutagenesis is more prominent, at least in part, because A3A itself can trigger homology-directed repair-mediated generation of ssDNA substrates (by end resection[38] or break-induced replication[39]) much more readily than A3B can.
As clinical cancer genetics progresses toward genomic analysis of each cancer sample, we have recently integrated sample-specific APOBEC-signature mutation analysis into a standard platform for analysis of large cancer genome datasets[40-42]. Analyses to distinguish between A3A- vs. A3B-like signatures will be incorporated into future pipeline updates, since this might prove important when weighing treatment options, given the substantially higher genotoxic and mutagenic potential of A3A. Moreover, early detection of APOBEC-signature mutation enrichment, e.g. in cell-free circulating DNA, could have important diagnostic or prognostic value, especially for individuals at higher risk, such as A3B deletion carriers.
When detected in a tumor sample, a high prevalence of APOBEC mutagenesis might be exploited for therapeutic purposes. It has been suggested that hypermutation could enhance the effectiveness of immune stimulation therapy to treat cancer, by generating tumor-specific neoantigens (proteins with new epitopes), that might trigger targeted destruction by the immune system[43,44].
There are two immune therapies for bladder cancers[45,46], which often have high APOBEC enrichment (see Fig. 2) and A3A-like signatures (see Fig. 3 and Supplementary Fig. 5). These clinical observations raise the intriguing possibility that hypermutation in bladder cancers (mainly by A3A) could contribute substantially to the success of immune therapies. Likewise, other A3A-like, high-APOBEC mutagenesis cancers could be promising candidates for similar immune stimulation treatments.
Online Methods
Construction of integrated A3A- and A3B-expressing yeast strains
Human A3A or A3B open reading frames (ORFs) with appended 5′ ClaI and 3′ StuI restriction sites were codon optimized for expression in yeast, and purchased from DNA 2.0 as inserts within the pJ201 vector. Each ORF was released from the vector backbone by ClaI and StuI double digestion, and ligated into the multi-cloning site of a tetracycline-regulatable pCM252-derived vector[47], to create plasmids pSR435 (bearing A3A) and pSR440 (A3B) with hph (hygromycin resistance) as the selectable marker instead of TRP1. A fragment of each plasmid containing the APOBEC ORF, the tetracycline-regulated promoter, and the hph marker, was amplified by PCR with primers (see Supplementary Table 4 for primer sequences) to add flanks with homology to either side of the LEU2 gene on Chromosome III.
Purified PCR product was transformed[48] into a yeast host strain descended from CG379[49], with the following genotype: MATα his7-2 leu2-3,112 trp1-289 cdc13-1 ung1::NAT. CAN1, URA3, and ADE2 were deleted from their native loci and reintroduced into a closely-spaced triple reporter gene array near the de novo telomere on the left arm of Chromosome V[7]. Transformants with an APOBEC-hph cassette stably integrated into the LEU2 locus target (by homologous recombination) were selected by replica plating onto hygromycin plates, and verified by diagnostic replicas on single-colony isolates, followed by DNA sequencing of the insert.
Mutagenesis by A3A and A3B in yeast
Yeast were inoculated into 5 mL of YPDA media (1% yeast extract, 2% peptone, 2% dextrose, 0.01% adenine sulfate, filter-sterilized) and grown at 23°C for 72 hours. Yeast then were diluted ten-fold into 5 mL of fresh YPDA with 20 μg/mL doxycycline hyclate (Sigma-Aldrich) and shifted to 37°C for 6 hours. Cells then were washed into 5 mL of phosphate-buffered saline and held at 37°C for 42 hours more. Appropriate dilutions were plated onto synthetic complete to verify viability, and onto arginine dropout plates with 60 mg/mL canavanine sulfate and 20 mg/mL adenine sulfate to identify Canr Ade− double mutants, i.e., colonies with mutation clusters.
Whole-genome sequencing of yeast
Yeast colonies with mutation clusters were streaked onto YPDA. A single-colony isolate from each streak was verified for Can, Ura, Ade, and respiratory competency phenotypes by replica plating. Genomic DNA was purified from isolates of interest using a QIAcube robot, per manufacturer's instructions (QIAGEN). 100-nucleotide paired-end reads were obtained from a HiSeq 2000 sequencer (Illumina). Reads were mapped to the ySR127 reference genome and mutations were identified using the fixed ploidy caller in CLC Genomics Workbench 7.5 (QIAGEN). To minimize the possibility of analyzing mutations that were accumulated during routine passaging and culture growth, only unique mutations were included in mutation signature analyses. Illumina reads were uploaded to the NCBI Sequence Read Archive.
Cancer and other yeast sequencing data
Cancer genome and exome datasets were obtained from publications[25,32] or from the dbGaP TCGA controlled access Data Portal. hg19 was the human genome reference for our analyses. Cancer mutation catalogues were filtered to remove calls that overlapped with entries in dbSNP or the UCSC Genome Browser simpleRepeat track. Data from multiple myeloma genomes were from[12]. Additional yeast data were obtained from[31] and re-analyzed, as described in detail below and previously in[4], using the sacCer3 reference genome. Only mutations from the ung1Δ background were analyzed, as these were the closest equivalents to our yeast data.
APOBEC mRNA abundance and A3B germline copy number data
APOBEC RNAseq data for 5,868 tumor and 834 normal samples across 17 cancer types (bladder, breast, cervical, colorectal, glioblastoma multiforme, head and neck, kidney chromophobe and renal clear cell, acute myeloid leukemia, lower grade glioma, lung adenocarcinoma and squamous cell carcinoma, ovarian, prostate, melanoma, thyroid, and uterine corpus endometrial) were downloaded from the Broad GDAC Firehose standard data run of Feb. 15, 2014. Segmented copy number (CN) data for 7,191 tumor-normal pairs from these same cancer types were downloaded also. 5,526 samples had both RNAseq and CN data. These data were available for 17 bladder, 95 breast, 25 head and neck, 44 lung adenocarcinoma, and 44 lung squamous cell genomic samples (225 total), which allowed mRNA abundance vs. TCA minimal mutation load correlation, and mutation signature vs. A3B CN, analyses in this study. A3B CN data for the breast cancer ICGC cohort were obtained from[25].
A3B copy number annotation
Examination of the segmented CN data revealed that most A3B germline deletion events were localized between chr22: 39,363,650 and 39,375,350. Some samples had a short deletion within, or multiple discontinuous segmentation events overlapping, this region. This necessitated binning of the region into twelve 1-kb windows and identification of all segmental copy number variation (CNV) events overlapping any window. Cutoffs for classification were determined by examination of the histogram of inferred A3B CN values (see Supplementary Fig. 6f): A3B CN ≤ 0.7, homozygous deletion (homo.del); 0.7 < A3B CN ≤ 1.69, heterozygous deletion (het.del); 1.69 < A3B CN ≤ 2.29, wild type (WT); and A3B CN > 2.29, amplification (amp). 7,061 samples each had a unique segmental CNV. Among the remaining 130 samples that had more than one segmental CNV, classification was based on the segmental CN farthest removed from the wild-type value of 2. CN call totals were: 99 homo.del (1.38%), 998 het.del (13.88%), 5699 WT (79.25%), and 395 amp (5.49%).
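The classification rule above can be sketched in a few lines of Python. This is only an illustration of the stated cutoffs and of the "farthest from 2" rule for samples with several segmental CNVs; the function names are hypothetical and are not taken from the authors' code.

```python
def classify_a3b_cn(cn: float) -> str:
    """Map an inferred A3B copy number to a call, using the cutoffs in the text."""
    if cn <= 0.7:
        return "homo.del"   # homozygous deletion
    if cn <= 1.69:
        return "het.del"    # heterozygous deletion
    if cn <= 2.29:
        return "WT"         # wild type
    return "amp"            # amplification


def pick_segment_cn(segment_cns):
    """For a sample with several overlapping segments, keep the CN value
    farthest removed from the wild-type value of 2."""
    return max(segment_cns, key=lambda cn: abs(cn - 2.0))


# Example: three overlapping segments in one (made-up) sample.
call = classify_a3b_cn(pick_segment_cn([1.9, 1.1, 2.2]))
```

Here the segment with CN 1.1 is farthest from 2, so the made-up sample would be called a heterozygous deletion.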
Mutation cluster analysis
Mutation cluster analysis was performed as described previously[4,12]. Mutations spaced ≤ 10 bases apart were treated as a single mutagenic event, since low fidelity translesion DNA synthesis polymerases often synthesize a short tract 3′ of lesion bypass, and mis-incorporate at high frequencies[50,51]. Groups of closely-spaced mutations were identified, such that any pair of adjacent mutations within each group was separated by less than 10 kb. To identify clusters that were unlikely to have formed by random distribution of mutations within a genome, we computed a p-value for each group. Let x = number of bases spanned by a group (from first mutation to last), k = number of mutations in a group, π = number of total mutations divided by number of total bases in a genome, and j = an indexing parameter. Then by the negative binomial distribution[52], the cluster p-value:
$$p = \sum_{j=0}^{x-k} \binom{(k-1)+j-1}{j}\,(1-\pi)^{j}\,\pi^{k-1}$$
π was computed using all mutations (i.e., including those filtered for dbSNP and simpleRepeat), as this could only increase the p-values. Each group with p-value ≤ $10^{-4}$ was considered a bona fide mutation cluster. A recursive approach was applied, i.e., all clusters passing p-value filtering were identified, even if such a cluster was a subset within a larger group that did not pass the p-value filter. Clusters composed of only mutations that originated from cytosines along the same DNA strand were classified as C-coordinated. Mutations not found in a cluster were classified as scattered.
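As a hedged illustration of the cluster p-value, the sketch below assumes the negative binomial form p = Σ_{j=0}^{x−k} C(k−2+j, j) (1−π)^j π^(k−1), i.e., the probability of seeing the remaining k−1 mutations within the x−1 bases that follow the first one. The function name is hypothetical, and the terms are accumulated iteratively only to avoid very large binomial coefficients.

```python
def cluster_p_value(x: int, k: int, pi: float) -> float:
    """Negative binomial p-value for k mutations spanning x bases,
    given a genome-wide per-base mutation density pi (assumed formula)."""
    if k < 2 or x < k:
        raise ValueError("need k >= 2 mutations spanning at least k bases")
    term = pi ** (k - 1)          # j = 0 term: C(k-2, 0) * pi^(k-1)
    total = term
    for j in range(x - k):
        # C(k-2+j+1, j+1) = C(k-2+j, j) * (k-1+j) / (j+1)
        term *= (k - 1 + j) / (j + 1) * (1.0 - pi)
        total += term
    return min(total, 1.0)


# Example with made-up numbers: 5 mutations within 100 bases,
# at a genome-wide density of 1 mutation per 100,000 bases.
p = cluster_p_value(x=100, k=5, pi=1e-5)
is_cluster = p <= 1e-4            # threshold stated in the text
```

For k = 2 the sum reduces to the geometric distribution, 1 − (1 − π)^(x−1), which is a quick sanity check on the implementation.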
Mutation signature analyses
The overall structure of the signature analysis, which uses complementary approaches to identify, statistically evaluate, and compare mutation signatures, is outlined in Figure 6 and detailed in the sections below.
Enrichment calculations
For all analyses, substitutions at C:G base pairs were treated as mutations at C. Enrichment quantifies how frequently C → G or C → T mutations occur at a specific sequence context compared to C → G or C → T mutations at cytosines overall. C → A substitutions were excluded because such mutations are rare due to abasic site bypass[7,36], and to avoid confounding overlap with frequent G → T substitutions in some cancers[53]. To compute enrichment for mutations at TCA, let MutT = number of TCA → TGA or TCA → TTA mutations and ConTCA = number of occurrences of TCA (and reverse complement TGA) contexts within the set of 41-mers centered on each mutation within a sample. Similarly, let Mut = number of C → G or C → T mutations and ConC = number of cytosines or guanines within the set of 41-mers centered on each mutation within a sample. Then the enrichment for mutations at TCA:
$$\mathrm{Enrichment}_{TCA} = \frac{\mathrm{MutT} \times \mathrm{ConC}}{\mathrm{Mut} \times \mathrm{ConTCA}}$$
Enrichments for the other contexts TC, TCW, RTCA, YTCA, and each NTCA, were calculated analogously.
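A minimal Python sketch of this calculation follows. It assumes the enrichment is the ratio (MutT × ConC) / (Mut × ConTCA), and it counts contexts on both strands by matching a motif or its reverse complement within each 41-mer. The helper names and example numbers are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def count_context(windows, motif: str) -> int:
    """Count occurrences of a motif or its reverse complement
    across a collection of sequence windows (e.g., 41-mers)."""
    rc = revcomp(motif)
    n = len(motif)
    total = 0
    for w in windows:
        for i in range(len(w) - n + 1):
            kmer = w[i:i + n]
            if kmer == motif or kmer == rc:
                total += 1
    return total


def enrichment(mut_t: int, mut_c: int, con_tca: int, con_c: int) -> float:
    """Assumed form: (MutT * ConC) / (Mut * ConTCA)."""
    return (mut_t * con_c) / (mut_c * con_tca)


# Made-up counts: 30 of 100 C mutations fall at TCA, while TCA/TGA
# makes up 50 of 1,000 C or G positions in the surrounding windows.
e_tca = enrichment(mut_t=30, mut_c=100, con_tca=50, con_c=1000)
```

With these illustrative counts, mutations at TCA are six times more frequent than expected from the abundance of the motif.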
Identification of samples significantly mutated by APOBEC(s)
Statistical overrepresentation of APOBEC mutagenesis within each sample was evaluated by one-sided Fisher's exact test. Taking TCA as an example, the test computed the p-value for a comparison between the ratio MutT / (Mut - MutT) vs. the ratio ConTCA / (ConC - ConTCA), based on the prediction that the former ratio exceeds the latter. All samples not matching this prediction were assigned p = 1. Benjamini-Hochberg (BH) p-value correction for multiple testing[54] was applied by the p.adjust() function in the R statistical computing package. Samples with these adjusted q-values < 0.05 were considered significant.
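The test and the multiple-testing correction can be sketched with the Python standard library alone. This is an illustrative re-implementation, not the authors' R code: the one-sided Fisher's exact p-value is computed from the hypergeometric distribution, samples not matching the prediction are assigned p = 1 as described, and `bh_adjust` is a hypothetical helper mirroring what a Benjamini-Hochberg adjustment does. In practice, a library routine such as `scipy.stats.fisher_exact(table, alternative="greater")` would likely be preferable for large counts.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] with fixed margins."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    upper = min(row1, col1)
    num = sum(comb(col1, i) * comb(n - col1, row1 - i)
              for i in range(a, upper + 1))
    return num / comb(n, row1)


def tca_p_value(mut_t: int, mut_c: int, con_tca: int, con_c: int) -> float:
    """One-sided test that MutT/(Mut - MutT) exceeds ConTCA/(ConC - ConTCA)."""
    if mut_t * (con_c - con_tca) <= con_tca * (mut_c - mut_t):
        return 1.0  # prediction not matched: assigned p = 1
    return fisher_one_sided(mut_t, mut_c - mut_t, con_tca, con_c - con_tca)


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted q-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    q, running = [0.0] * m, 1.0
    for offset, i in enumerate(order):
        rank = m - offset                      # rank among ascending p-values
        running = min(running, pvals[i] * m / rank)
        q[i] = running
    return q


# Made-up cohort of three samples, each given as (MutT, Mut, ConTCA, ConC).
samples = [(30, 100, 50, 1000), (1, 100, 50, 1000), (8, 100, 50, 1000)]
q_values = bh_adjust([tca_p_value(*s) for s in samples])
significant = [q < 0.05 for q in q_values]
```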
Estimating the number of mutations created by APOBEC(s)
A minimal estimate for the number of TCA mutations created by APOBEC(s) was computed as:
$$\mathrm{MinT} = \mathrm{MutT} \times \frac{\mathrm{Enrichment} - 1}{\mathrm{Enrichment}}$$
Since enrichment = 1 implies TCA mutations are neither more nor less frequent (when corrected for motif abundance) than mutations at C in general, this minimal estimate reports the number of TCA mutations in excess of enrichment = 1. It is only this excess which should be attributed to mutagenesis by an APOBEC. Samples with Fisher's exact test q > 0.05 for enrichment at TCA were assigned a MinT = 0.
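As a worked example, the sketch below assumes the minimal estimate takes the form MinT = MutT × (enrichment − 1) / enrichment, which is the number of TCA mutations remaining after subtracting the count expected at enrichment = 1. The function name and the inputs are hypothetical; treating q exactly equal to 0.05 as non-significant is also an assumption.

```python
def min_apobec_tca_mutations(mut_t: int, enrichment: float,
                             q_value: float, alpha: float = 0.05) -> float:
    """Minimal number of TCA mutations attributable to APOBEC (assumed form)."""
    if q_value >= alpha or enrichment <= 1.0:
        return 0.0  # no significant TCA enrichment: MinT = 0
    # MutT / enrichment mutations would be expected at enrichment = 1,
    # so the excess is MutT - MutT / enrichment.
    return mut_t * (enrichment - 1.0) / enrichment


# Made-up sample: 100 TCA mutations at 4-fold enrichment.
min_t = min_apobec_tca_mutations(mut_t=100, enrichment=4.0, q_value=0.01)
```

In this illustration, 25 of the 100 TCA mutations would be expected without any motif preference, leaving a minimal estimate of 75.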
Y/RTCA enrichment analysis
The χ2 test for goodness of fit was used to identify samples that had a ratio of YTCA to RTCA mutations which differed statistically from random, by comparing observed vs. expected mutation counts. The expected number of YTCA mutations, given the null hypothesis of random mutagenesis, simply scales with the fraction of TCA motifs at YTCA:
$$\mathrm{Exp}_{YTCA} = \mathrm{MutT} \times \frac{\mathrm{Con}_{YTCA}}{\mathrm{ConTCA}}$$
where $\mathrm{Con}_{YTCA}$ is the number of YTCA contexts, counted in the same way as ConTCA. The expected number of RTCA mutations was computed analogously. p-values were corrected by the BH method, with q-values < 0.05 considered significant. Samples within each cohort were filtered first for significant TCA mutagenesis enrichment, then for significant difference from random distribution of YTCA vs. RTCA mutations. Samples passing only the first filter were plotted in the relevant figures as unfilled, gray-bordered circles, while samples passing both filters were plotted in colored circles, and included in χ2 tests for trend toward A3A-like signatures with increasing TCA enrichment.
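The goodness-of-fit step can be illustrated as follows. The sketch assumes expected counts scale with the fraction of TCA contexts at YTCA vs. RTCA, uses no continuity correction, and obtains the p-value for one degree of freedom from the complementary error function. Function names and counts are hypothetical, and the BH correction across samples is omitted here.

```python
from math import erfc, sqrt


def yrtca_test(mut_y: int, mut_r: int, con_y: int, con_r: int):
    """Chi-square goodness-of-fit test (1 d.f.) of YTCA vs. RTCA mutation
    counts against expectations from motif abundance."""
    total_mut = mut_y + mut_r
    exp_y = total_mut * con_y / (con_y + con_r)
    exp_r = total_mut * con_r / (con_y + con_r)
    chi2 = (mut_y - exp_y) ** 2 / exp_y + (mut_r - exp_r) ** 2 / exp_r
    p = erfc(sqrt(chi2 / 2.0))   # survival function of chi-square, 1 d.f.
    return chi2, p


def classify_yrtca(mut_y: int, mut_r: int, con_y: int, con_r: int) -> str:
    """A3A-like if YTCA enrichment exceeds RTCA enrichment, else A3B-like.
    Both enrichments share the factor ConC / Mut, so comparing
    mut_y / con_y with mut_r / con_r is equivalent."""
    return "A3A-like" if mut_y * con_r > mut_r * con_y else "A3B-like"


# Made-up sample: 60 YTCA and 40 RTCA mutations, equal numbers of contexts.
chi2, p = yrtca_test(60, 40, 500, 500)
label = classify_yrtca(60, 40, 500, 500)
```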
NTCA proportion analysis
Similarly, the χ2 test for goodness of fit was used to identify samples that had a proportion of observed ATCA:CTCA:GTCA:TTCA mutations which differed statistically from random. The expected number of mutations at each NTCA:
$$\mathrm{Exp}_{NTCA} = \mathrm{MutT} \times \frac{\mathrm{Con}_{NTCA}}{\mathrm{ConTCA}}$$
p-values from comparing observed vs. expected mutation counts were corrected by the BH method, with q-values < 0.05 considered significant. Only samples passing filtering for both significant TCA mutagenesis enrichment and non-randomness of NTCA proportion were included in root mean square deviation (RMSD, also called root mean square error) comparisons. RMSD is used commonly to quantify the similarity between two corresponding sets of quantities, e.g. the three-dimensional spatial coordinates of alpha-carbon atoms in one protein structure vs. another[55].
RMSD was used to quantify the difference between the normalized enrichment observed in each sample for mutations at each NTCA vs. the corresponding normalized enrichment values in each yeast model. Taking ATCA as an example, the normalized enrichment:
$$\mathrm{NE}_{AT} = \frac{E_{ATCA}}{E_{ATCA} + E_{CTCA} + E_{GTCA} + E_{TTCA}}$$
where $E_{NTCA}$ is the enrichment for mutations at NTCA. Let $y\mathrm{NE}_{NT}$ = normalized enrichment for mutations at NTCA observed in a yeast model. Then the RMSD of a cancer sample vs. a yeast model:
$$\mathrm{RMSD} = \sqrt{\frac{1}{4}\sum_{N \in \{A,C,G,T\}} \left(\mathrm{NE}_{NT} - y\mathrm{NE}_{NT}\right)^{2}}$$
Samples with RMSD vs. A3A < RMSD vs. A3B were considered A3A-like, while those with RMSD vs. A3B < RMSD vs. A3A were A3B-like.
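The RMSD comparison can be sketched as below. Two points are assumptions rather than statements of the authors' exact method: the enrichments at the four NTCA motifs are normalized by dividing each by their sum, and the yeast model values shown are invented placeholders, not the published A3A or A3B profiles.

```python
from math import sqrt

BASES = "ACGT"  # the N in NTCA


def normalize(enrich: dict) -> dict:
    """Scale the four NTCA enrichments so that they sum to 1 (assumed scheme)."""
    total = sum(enrich[b] for b in BASES)
    return {b: enrich[b] / total for b in BASES}


def rmsd(sample_ne: dict, model_ne: dict) -> float:
    """Root mean square deviation over the four NTCA motifs."""
    return sqrt(sum((sample_ne[b] - model_ne[b]) ** 2 for b in BASES) / len(BASES))


def classify_ntca(sample_ne: dict, a3a_ne: dict, a3b_ne: dict) -> str:
    """Assign the label of the yeast model with the smaller RMSD."""
    d_a, d_b = rmsd(sample_ne, a3a_ne), rmsd(sample_ne, a3b_ne)
    if d_a < d_b:
        return "A3A-like"
    if d_b < d_a:
        return "A3B-like"
    return "indeterminate"


# Placeholder model profiles (hypothetical numbers for illustration only).
a3a_model = normalize({"A": 1.0, "C": 3.0, "G": 1.0, "T": 3.0})
a3b_model = normalize({"A": 3.0, "C": 1.0, "G": 3.0, "T": 1.0})
sample = normalize({"A": 2.0, "C": 6.5, "G": 1.5, "T": 6.0})
label = classify_ntca(sample, a3a_model, a3b_model)
```

Because the placeholder sample favors pyrimidines at the −2 position, it lies closer to the placeholder A3A profile.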
pLogo analysis
pLogos identify nucleotides statistically over- or underrepresented in a ‘foreground’ set of sequences, relative to abundances within a ‘background’ set[30]. pLogos were generated using all C → T substitutions from yeast data and all C → G or C → T substitutions from cancer samples. Each element within the set of foreground sequences comprised the two bases immediately 5′ of a mutation, the mutated base itself (always C), and one base immediately 3′. The corresponding background was the set of 41-mers each centered on a mutation included in the foreground. The deaminated C was set to position 0. Nucleotides above the horizontal axis were overrepresented, while those below the axis were underrepresented. The height of each nucleotide denotes the magnitude of over- or underrepresentation. Red lines represent cutoffs for p = 0.05. In rare cases, the number of bases in the background set was apparently greater than could be accommodated by the pLogo online tool, so the set of C → G or C → T substitutions was analyzed separately from the G → C or G → A set. As such pairs of pLogos were always very similar, we reported those generated from C → G or C → T substitutions only.
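The construction of foreground and background sequences described above can be sketched as follows. This Python example is illustrative only: the function names are hypothetical, the reference is assumed to be a plain string with a 0-based mutation position, and mutations at G are assumed to be handled by reverse-complementing the window so that the deaminated C sits at position 0.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def plogo_windows(reference, pos, flank=20):
    """Build the foreground 4-mer and background 41-mer for one mutation.

    reference: uppercase reference sequence (assumed input format)
    pos: 0-based index of the mutated base in `reference`
    Returns (foreground, background), or None if the window runs off the
    sequence or the mutated base is neither C nor G.
    """
    start, end = pos - flank, pos + flank + 1
    if start < 0 or end > len(reference):
        return None
    background = reference[start:end]
    # Orient the window so that the mutated base reads as C.
    if background[flank] == "G":
        background = reverse_complement(background)
    if background[flank] != "C":
        return None
    # Two bases 5' of the mutation, the mutated C, and one base 3'.
    foreground = background[flank - 2:flank + 2]
    return foreground, background
```

The resulting foreground and background sets would then be submitted to the pLogo tool, which performs the over- and underrepresentation statistics itself.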
Additional statistical analyses
Additional statistical analyses, including the Kolmogorov-Smirnov test, Spearman's correlation, the χ2 test with Yates correction, and the χ2 test for trend, were performed using GraphPad Prism 6 (GraphPad Software).
Code availability
The APOBEC mutagenesis pattern was analyzed similarly to the analysis incorporated into the Broad Institute's TCGA GDAC Firehose[42]. R code is available upon request.