Theo A Knijnenburg1, Linghua Wang2, Michael T Zimmermann3, Nyasha Chambwe1, Galen F Gao4, Andrew D Cherniack4, Huihui Fan5, Hui Shen5, Gregory P Way6, Casey S Greene6, Yuexin Liu7, Rehan Akbani7, Bin Feng8, Lawrence A Donehower9, Chase Miller10, Yang Shen11, Mostafa Karimi11, Haoran Chen11, Pora Kim12, Peilin Jia12, Eve Shinbrot10, Shaojun Zhang13, Jianfang Liu14, Hai Hu14, Matthew H Bailey15, Christina Yau16, Denise Wolf17, Zhongming Zhao12, John N Weinstein7, Lei Li18, Li Ding19, Gordon B Mills20, Peter W Laird5, David A Wheeler10, Ilya Shmulevich1, Raymond J Monnat21, Yonghong Xiao22, Chen Wang23. 1. Institute for Systems Biology, Seattle, WA 98109, USA. 2. Department of Genomic Medicine, Division of Cancer Medicine, University of Texas MD Anderson Cancer Center, Houston, TX 77054, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA. 3. Genomic Sciences and Precision Medicine Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226-0509, USA; Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First Street SW, Rochester, MN 55905, USA. 4. The Eli and Edythe L. Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, MA 02142, USA. 5. Center for Epigenetics, Van Andel Research Institute, Grand Rapids, MI 49503, USA. 6. Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19103, USA. 7. Department of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA. 8. TESARO Inc., Waltham, MA 02451, USA. 9. Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX 77030, USA. 10. Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA. 11. Department of Electrical and Computer Engineering, 3128 TAMU, Texas A&M University, College Station, TX 77843, USA. 12. Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA. 13. Department of Genomic Medicine, Division of Cancer Medicine, University of Texas MD Anderson Cancer Center, Houston, TX 77054, USA. 14. Chan Soon-Shiong Institute of Molecular Medicine at Windber, Windber, PA 15963, USA. 15. Division of Oncology, Department of Medicine, Washington University, St. Louis, MO 63110, USA; McDonnell Genome Institute, Washington University, St. Louis, MO 63110, USA. 16. University of California, San Francisco, San Francisco, CA 94115, USA; Buck Institute for Research on Aging, Novato, CA 94945, USA. 17. University of California, San Francisco, San Francisco, CA 94115, USA. 18. Department of Experimental Radiation Oncology, University of Texas MD Anderson Cancer, Houston, TX 77030, USA. 19. Division of Oncology, Department of Medicine, Washington University, St. Louis, MO 63110, USA; McDonnell Genome Institute, Washington University, St. Louis, MO 63110, USA; Department of Genetics, Washington University, St. Louis, MO 63110, USA; Siteman Cancer Center, Washington University, St. Louis, MO 63110, USA. 20. Department of Systems Biology, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA. 21. Departments of Pathology & Genome Sciences, University of Washington, Seattle, WA 98195-7705, USA. Electronic address: monnat@u.washington.edu. 22. TESARO Inc., Waltham, MA 02451, USA. Electronic address: yxiao@genospace.com. 23. Department of Health Sciences Research, Mayo Clinic College of Medicine, 200 First Street SW, Rochester, MN 55905, USA; Department of Obstetrics and Gynecology, Mayo Clinic College of Medicine, 200 First Street SW, Rochester, MN 55905, USA. Electronic address: wang.chen@mayo.edu.
Abstract
DNA damage repair (DDR) pathways modulate cancer risk, progression, and therapeutic response. We systematically analyzed somatic alterations to provide a comprehensive view of DDR deficiency across 33 cancer types. Mutations with accompanying loss of heterozygosity were observed in over 1/3 of DDR genes, including TP53 and BRCA1/2. Other prevalent alterations included epigenetic silencing of the direct repair genes EXO5, MGMT, and ALKBH3 in ∼20% of samples. Homologous recombination deficiency (HRD) was present at varying frequency in many cancer types, most notably ovarian cancer. However, in contrast to ovarian cancer, HRD was associated with worse outcomes in several other cancers. Protein structure-based analyses allowed us to predict functional consequences of rare, recurrent DDR mutations. A new machine-learning-based classifier developed from gene expression data allowed us to identify alterations that phenocopy deleterious TP53 mutations. These frequent DDR gene alterations in many human cancers have functional consequences that may determine cancer progression and guide therapy.
DNA damage repair (DDR) pathways modulate cancer risk, progression, and therapeutic response. We systematically analyzed somatic alterations to provide a comprehensive view of DDR deficiency across 33 cancer types. Mutations with accompanying loss of heterozygosity were observed in over 1/3 of DDR genes, including TP53 and BRCA1/2. Other prevalent alterations included epigenetic silencing of the direct repair genes EXO5, MGMT, and ALKBH3 in ∼20% of samples. Homologous recombination deficiency (HRD) was present at varying frequency in many cancer types, most notably ovarian cancer. However, in contrast to ovarian cancer, HRD was associated with worse outcomes in several other cancers. Protein structure-based analyses allowed us to predict functional consequences of rare, recurrent DDR mutations. A new machine-learning-based classifier developed from gene expression data allowed us to identify alterations that phenocopy deleterious TP53 mutations. These frequent DDR gene alterations in many humancancers have functional consequences that may determine cancer progression and guide therapy.
DNA damage repair (DDR) genes play key roles in maintaining human genomic stability. Loss of DDR function, conversely, is an important determinant of cancer risk, progression, and therapeutic response (Jeggo et al., 2016). DDR genes can be grouped into functional pathways defined by genetic, biochemical, and mechanistic criteria. Proteins in the same pathway often work in concert to repair specific types of DNA damage (Friedberg et al., 2004). Base excision repair (BER), nucleotide excision repair (NER), and the direct damage reversal/repair (DR) pathways repair DNA base damage, while mismatch repair (MMR) corrects base mispairs and small loops often found in repetitive sequence DNA. Homology-dependent recombination (HR), non-homologous end joining (NHEJ), the Fanconi anemia (FA) pathway, and translesion DNA synthesis (TLS) act alone or together to repair DNA strand breaks and complex events like interstrand crosslinks (Friedberg et al., 2004; Kass et al., 2016). All of the major DDR pathways, with the exception of the FA pathway, have been identified in virtually all organisms. This reflects the universal need to counter the chemical instability of DNA and repair additional damage (Aravind et al., 1999; Eisen and Hanawalt, 1999; Friedberg et al., 2004).The consequences of DDR deficiency are becoming better understood through analyses of DDR gene alterations in cancer (Alexandrov et al., 2013; Forbes et al., 2017; Garraway and Lander, 2013; Martincorena and Campbell, 2015). For example, frequent TP53 somatic mutations in many cancer types can disrupt the DNA damage response, apoptosis, or senescence pathways active in many early-stage cancers (Bartkova et al., 2005; Fischer, 2017; Gorgoulis et al., 2005; Pfister and Prives, 2017). DDR deficiency may also lead to specific mutational “signatures,” e.g., the short tandem repeat instability linked to the inactivation or silencing of DNA MMR in colorectal, ovarian, or endometrial cancer (Alexandrov et al., 2013; Helleday et al., 2014; Kass et al., 2016)The therapeutic implications of altered DDR function are becoming better known. Many anti-cancer agents act by generating DNA damage that, if unrepaired, may lead to cell death or senescence. DNA interstrand crosslinks (ICLs) and double strand breaks may be particularly difficult to repair, requiring coordination of the NER, BER, FA, and HR pathways (Duxin and Walter, 2015; Michl et al., 2016; Pearl et al., 2015). The loss of one or more DDR pathway, once recognized, can also be therapeutically targeted through synthetic lethality (Brown et al., 2017; Kaelin, 2005; Srivas et al., 2016). Examples include loss of expression of the DR pathway protein O6-methylguanine-DNA methyltransferase (MGMT), which sensitized cancer cells to alkylating chemotherapy agents (Soll et al., 2017; Weller et al., 2015); and BRCA mutant, HR-deficient breast and ovarian cancers, which are sensitive to inhibition of PARP1, a central protein in the BER pathway (Bryant et al., 2005; Farmer et al., 2005; Lord and Ashworth, 2017). Epigenetic silencing may phenocopy these DNA events.The Cancer Genome Atlas (TCGA) DNA Damage Repair Analysis Working Group (DDR-AWG) used newly standardized Pan-Cancer Atlas (PanCanAtlas) data to systematically analyze potential causes of loss of DDR function and the resulting consequences across 33 different humancancer types. The loss of specific DDR pathways in cancer, in contrast to other cellular “hallmarks” of cancer (Hanahan and Weinberg, 2011), often generates stable—and thus more readily interpretable—”footprints” in cancer genomes, detected as an increased mutation burden, altered mutational signatures, or copy-number alterations including loss of heterozygosity (LOH). We provide below the most comprehensive analysis to date of DDR pathway gene alterations and their consequences in humancancer. Our results provide a useful resource to guide both mechanistic and therapeutic analyses of the role of DDR in cancer.
RESULTS
We performed analyses with a curated list of 276 genes encompassing all major DNA repair pathways: 208 genes were annotated to one or more specific DDR pathway, with an additional 68 genes annotated to key DDR-related pathways such as nucleotide pool maintenance (e.g., RRM1/2, the regulatory and catalytic subunits of ribonucleotide reductase); critical DNA damage response kinases (e.g., ATM, ATR, CHEK1/2, and WEE1); and genes recurrently mutated in cancer that modulate DDR (e.g., TP53, IDH1, and PTEN). We defined a “core DDR” gene set of 71 DNA repair pathway-specific and 9 DNA damage response genes that were used to facilitate parsimonious pathway representations and analyses (Table S1; Figures S1A and S1B).
Prevalent DDR Alterations across Cancer Types
We first determined the prevalence of DDR alterations across PanCanAtlas cancer types by integrating data on somatic truncating and missense mutations (Figures S1C and S1D), deep copy-number deletions defined by GISTIC (Mermel et al., 2011), and epigenetic silencing events. Binary calls for each event class for the 276 DDR genes across 9,125 PanCanAtlas samples are available through Data and Software Availability. The frequency of these somatic DDR gene alterations is shown in Figures 1A–1C (core DDR genes) and Figures S1E–S1G (all 276 DDR genes). For a complete list of TCGA cancer type abbreviations, please see https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations. Cancer types with a higher global mutation burden (e.g., in UCEC [uterine corpus endometrial carcinoma], COAD [colon adenocarcinoma], and READ [rectum adenocarcinoma] cancers) also had a higher mutation frequency in DDR pathways. Similarly, cancer types with a large number of somatic copy-number alterations (SCNAs; e.g., OV [ovarian serous cystadenocarcinoma], SARC [sarcoma], ESCA [esophageal carcinoma], and STAD [stomach adenocarcinoma]) also had a larger number of SCNAs in DDR pathways.
Figure 1
Cancer Types Display Variable DNA Damage Repair Gene Somatic Alterations
(A) DDR gene alterations are frequent and non-uniformly distributed by type and frequency across cancer types. Clustered heatmap indicates the percentage (%) of samples in a cancer type (rows, with cancer types listed right, number of samples between parentheses) altered for at least one core gene in a given DDR pathway (columns, with core gene numbers indicated in parentheses for each pathway, bottom). Color intensity indicates the percentage altered, with the percentage given as a number in each cell. RGB color indicates mutations (red), deep deletions (blue), or epigenetic silencing through methylation (green). Gray scale indicates equal contribution from all three alteration types. A “u” symbol in cells indicates a statistically significant enrichment (FDR [false discovery rate] < 10%, difference in alteration percentages > 2%) in alterations. A “co” or “me” symbol in cells indicates a statistically significant (FDR < 10%) co-occurrence or mutual exclusivity of samples altered by mutation, deep deletion, or silencing. Only “co” relations were observed. The two rightmost columns, Mut.load and SCNA load, indicate average mutation frequency (non-silent mutations/Mb) and copy-number burden (number of copy-number segments) by cancer type.
(B) Mutations and deep deletions contribute disproportionately to alter HR genes across nearly all TCGA cancer types. Color and color intensity provide a visual summary of the relative contribution of alteration types to HR pathway variation. The vertical position of each cancer type symbol indicates the percentage altered samples. M+D, mutation and deletion; D+S, deletion and silencing.
(C) Multiple genes contribute to enrichment of DDR pathway alterations. Heatmap depicts for each core DDR pathway (columns) statistically enriched alteration frequencies for genes with >2% alterations. Color intensity indicates percentage altered, with the percentage given in each cell. Specific cancer examples representing gene and pathway associations are listed under each column.
(D) The top 50 most frequently mutated genes among 276 DDR genes. Genes are listed in order of frequency of non-synonymous mutations (y axis left, blue rectangles), together with the fraction of concurrent mutations and LOH events (y axis right, red bars). See also Figure S1 and S2.
Pathway enrichment analysis based on the core DDR gene list revealed several DDR pathways that were statistically enriched for alterations within a specific cancer type, e.g., HR pathway alterations in OV and BRCA (breast invasive carcinoma) cancers (Figure 1A). Nearly three-quarters (20/28, 71%) of associations among DDR pathways and cancer types were also observed using our more inclusive DDR gene set (Figure S1E). Some DDR genes were affected predominantly by one type of alteration, e.g., ALKBH3 and MGMT by epigenetic silencing. Other genes were altered in two or more ways, e.g., mutations in TP53, PTEN, PER1, and BRCA1/2 were frequently accompanied by LOH (Figure 1D). Gene-level mutation frequencies varied widely by cancer type and subtype. For example, mutations in TP53, PTEN, ERCC5, and IDH1 were highly enriched, while other genes such as SOX4, SLX1A, and GTF2H2 were much less frequently mutated (Figure S1J).We next investigated the association of DDR gene alterations with overall mutation burden. It is already well established that cancers with somatic POLE and POLD1 mutations in their exonuclease domains or with microsatellite instability (MSI) exhibit a substantially higher mutation burden (Barbari and Shcherbakova, 2017; Ionov et al., 1993; Shinbrot et al., 2014; Thibodeau et al., 1993). Here, we observed only two DDR genes (TP53 and POLE) that demonstrated a substantially different association between alterations and overall mutation burden when compared against a background of all other non-DDR genes (Figure S1K).One intriguing question is whether DDR genes would be identified as cancer drivers using mutation frequency-based prediction methods. To address this question, we analyzed 276 DDR genes in non-hypermutated cancer samples using five driver prediction tools: 20/20+ (Tokheim et al., 2016), Mut-Sig2CV (Lawrence et al., 2014), OncoDriveFML (Mularoni et al., 2016), MuSiC2 (Dees et al., 2012), and CompositeDriver (https://github.com/khuranalab/CompositeDriver). Our analysis used data and best practices from the PanCanAtlas Drivers/Essentiality Working Group (Bailey et al., 2018) and identified 48 DDR genes as potential drivers by at least one driver identification algorithm. Among these, 8 putative DDR driver genes were unique to a cancer type, and 18 were identified only as part of a PanCanAtlas analysis. The remaining 23 putative drivers were identified in analyses of one or more individual cancer type as well as in PanCanAtlas analyses (Table S2). These identifications are intriguing, though tentative in light of the challenge of identifying drivers against a background of biological heterogeneity. For example, TCEB1, which forms a complex with VHL, was identified as a prominent driver gene in KIRC (kidney renal clear cell carcinoma) (Sato et al., 2013).In order to assess potential genomic consequences of DDR gene alterations, we performed a mutation signature correlation analysis focusing on loss-of-function alterations in the 48 genes we identified as putative cancer drivers (Table S2). This analysis used 21 mutational signatures derived by the PanCancer Signature group (Covington et al., 2014). Significant associations are displayed in Figure S2A. We confirmed that mutational signature #19 (POLE signature, corresponding to COSMIC [Catalog of Somatic Mutations in Cancer] signature #10) is increased by over 4-fold in POLE-altered cancers (p = 1.0e–6) at the PanCanAtlas level, and in UCEC with greater significance (fold change = 10.1 and p = 4.8e–7, Figure S2B). We also confirmed the association of signature #15 (temozolomide signature, corresponding to COSMIC signature #11) with MGMT alterations (Figure S2C).
Frequent Epigenetic Silencing of DDR Genes and Pathways
Epigenetic silencing was identified as an alternative prominent mechanism leading to recurrent gene deficiency. Stringent calling criteria (detailed in STAR Methods) led to the identification of 12 DDR genes exhibiting strong and consistent methylation-driven transcriptional silencing (Figures 2A and 2B). The most frequently silenced DDR genes were core DR (direct repair) pathway genes MGMT (11% of all the samples) and ALKBH3 (8%), followed by the core MMR genes MLH3 (5%) and MLH1 (4%) (Figure 2C). Epigenetic silencing was less often observed for genes in the HR and FA pathways, e.g., BRCA1, RAD51C, NSMCE3, and FANCF. Methylation silencing was the dominant alteration in MGMT (92.4% of all alterations) and was significantly associated with signature #15 through mutation signature analysis (Figure S2C).
Figure 2
Epigenetic Silencing of DDR Genes and Pathways in Cancer
(A) Gene/probe pairs showing evidence of silencing. Gene expression for gene/probe pairs (x axis) was Z score-transformed based on probe methylation level then plotted as a mean Z score among samples within a methylated group. Negative false discovery rate (FDR)-corrected log10-transformed p values are plotted on the y axis. Green dashed lines indicate the cutoffs for mean Z scores and FDRs. Genes meeting cutoffs for evidence of silencing have red labels, with specific probes listed in parentheses (see STAR Methods for additional details).
(B) Gene expression and methylation are inversely correlated for silenced genes. Scatterplots show silenced gene/probe pairs for MGMT (two probes), EXO5, RAD51C, MLH1, and FANCF. Gene expression level is plotted on the y axis and methylation on the x axis with red dots representing silenced samples.
(C) Silenced genes are variably distributed across cancer types. Left: oncoprint plot displays the overall frequency of deleterious mutations, deletions, and epigenetic silencing events for each significantly silenced DDR gene (rows, with gene names listed to the right) across 8,739 PanCanAtlas samples. Cancer type is shown in the color key to the right. Frequencies were calculated over the entire cohort, with only altered samples plotted. Right top scale indicates the number of events by molecular type, with the distribution of alterations across cancer types.
(D) Heatmap depicting variable frequency of epigenetic silencing events across 33 cancer types and DDR pathways. Cancer types (rows, shown using the same color code as in C) and 12 significantly silenced DDR genes (columns). Bar plots (right) summarize the frequency of silencing events by pathway: DR (ALKBH3 and MGMT), HR(BRCA1, RAD51C, and NSMCE3), and MMR (MLH1, MLH3, and PMS2). Numbers (x axis) below each bar graph indicate the proportion of samples by cancer type with at least one epigenetically silenced gene annotated to that pathway.
(E) EXO5 silencing shows cancer subtype variation. Scatterplots as in (B) display the same silenced samples, now color-coded according to cancer subtypes as indicated by the dot color code bottom left. Grey dots represent samples that were expressed/not silenced. See also Figure S2 and S3.
The high frequency of EXO5 silencing (94.5% of all EXO5 alterations) was both unexpected and intriguing. Loss of function of this single strand-specific DNA exonuclease sensitizes cells to DNA base adduct and crosslink damage and UV, but not ionizing radiation, and leads to chromosomal instability (Sparks et al., 2012). EXO5 silencing was frequent in GBM (glioblastoma multiforme) (~46%) though not in LGG (brain lower grade glioma) (~4.5%) (Figure 2D). When broken down by cancer subtype, EXO5 silencing was exclusively observed in IDH wild-type GBM and LGG (50% and ~27%, respectively) and in a small fraction of HNSC (head and neck squamous cell carcinoma) human papillomavirus (HPV)-negative cancers (~4%). Other cancer subtypes with a high frequency of EXO5 silencing included STAD (HM-indel, i.e., MSI, ~57% and Epstein–Barr virus [EBV]-positive ~57%), HM-indel, i.e., MSI, COAD (44%), CIN (chromosomal instability) ESCA (~29%), and MSI UCEC (~28%) (Figure 2E). Of note, both EBV-positive (Cancer Genome Atlas Research Network, 2014) and MSI (Cancer Genome Atlas Network, 2012; Kandoth et al., 2013) cancer subtypes have been tightly linked to an extensive CpG island methylator (CIMP) phenotype (Hinoue et al., 2012; Toyota et al., 1999), whereas IDH wild-type, CIN, and HPV-negative cancer subtypes were not CIMP associated. Furthermore, EXO5 deficiency was significantly linked to signature #1 (Figure S2E). The consistent observation of EXO5 silencing in IDH wild-type brain cancers suggests EXO5 may play a role in the pathogenesis of subsets of these cancers.Frequencies of epigenetic silencing also varied at the pathway level, as defined by silencing of at least one gene in a pathway. Frequent silencing of core DR, HR, and MMR pathway genes was observed in gastrointestinal cancers (ESCA, COAD, STAD, and READ), HNSC, and GBM (Figure 2D). We also observed frequent DR pathway silencing in DLBCL (lymphoid neoplasm diffuse large B cell lymphoma) and in UCEC (with MGMT + ALKBH3 silencing in >60% or >40%, respectively). SKCM (skin cutaneous melanoma), ACC (adrenocortical carcinoma), LGG, and UCEC were characterized by a high frequency of MMR pathway silencing, whereas OV, UCS (uterine carcinosarcoma), CESC (cervical squamous cell carcinoma and endocervical adenocarcinoma), and TGCT (testicular germ cell cancers) all showed high-frequency DR and HR pathway silencing that was reflected in altered gene expression data (Figures S3A and S3B) and in limited reverse-phase protein array (RPPA) data on 23 DDR genes (Figures S3C–S3E).
Genomic Instability Linked to HR Deficiency and Prognosis
We observed positive correlations among six different SCNA scores that were used to characterize the extent of aneuploidy, LOH, and homologous recombination deficiency in PanCancer Atlas samples (Figure 3A). We also found moderate, statistically significant positive correlations between mutation burden and SCNA scores, contrary to a previously reported inverse correlation between SCNA and mutation frequency in 12 cancer types (Ciriello et al., 2013). A likely explanation for this discrepancy is our use of a much larger and more inclusive set of samples across more cancer types and including all mutation and SCNA data, as opposed to the restricted subset of ~500 selected functional events as in the previous study. SCNA burden (determined from the frequency of SCNA segments) and mutation burden (measured as non-silent mutations per Mb) were positively correlated across all PanCanAtlas samples (ρ = 0.36, p < 1e–16, Figure S4A), while hypermutated samples and hypersegmented samples were largely mutually exclusive.
Figure 3
Somatic Copy-Number Alteration Scores in Relation to Clinical Outcomes and DDR Gene Alterations
(A) Matrix heatmap of mean Spearman correlation between SCNA scores and mutation load across 33 cancer types.
(B) Forest plot of association between homologous recombination deficiency (HRD) score and progression-free interval (PFI). Results are shown for 28 cancer types with valid outcomes data. Cancer type symbols to the left are followed by sample number (N) included in the model and the number of PFI events. Hazard ratios (HR) and HR 95% confidence intervals are shown to the right. The “P Value” represents the Cox proportional hazards model p value for differences in survival between high versus low HRD score samples. *, a statistically significant association after applying a false discovery correction (threshold 10%).
(C) Representative Kaplan-Meier (KM) survival curves for PFI of four cancer types as a function of high versus low HRD Scoring. Cancer samples in GBM, ESCA, PRAD, and ACC were defined as high HRD scoring if the HRD score was above the median within a cancer type. Log rank test p values are displayed in the top right-hand corner of each plot.
(D) Volcano plot of significance and magnitude of DDR gene ridge regression coefficients. Our ridge regression model fitted the alteration status of 276 DDR genes to HRD score across 8,464 cancer samples. Homologous recombination repair (HR) genes above a significance threshold of FDR < 0.2 are plotted and labeled in red.
(E) HRD scores of two cancer types stratified by BRCA1 and RAD51B alteration status. The two cancer types with the largest number of BRCA1 or RAD51B alterations are plotted to show HRD score distributions as a function of gene alteration status. Mann-Whitney U test p values are displayed above the bracket for each cancer-type-specific comparison. See also Figures S4 and S7.
Genomic scarring with large-scale genome instability has been attributed to homologous recombination deficiency (HRD) (Watkins et al., 2014). We calculated a HRD score, combined from HRD-LOH (Abkevich et al., 2012), LST (large-scale state transitions) (Popova et al., 2012), and NtAI (number of telomeric allelic imbalances) scores (Birkbak et al., 2012), for all PanCanAtlas samples using SCNA calls generated from ABSOLUTE (Figure S4B). This analysis also substantially extends earlier analyses using 15 cancer types (Marquard et al., 2015). HRD scores varied widely across cancer types, with the highest scores in ovarian cancer (OV) and the lowest in KICH (kidney chromophobe), KIRP (kidney renal papillary cell carcinoma), LAML (acute myeloid leukemia), and THCA (thyroid carcinoma). Many cancer types also had small subsets of samples with high HRD scores.HRD scores were significantly associated with progression-free interval (PFI) in eight cancer types (Figures 3B and 3C). Higher HRD scores were often associated with shorter PFI (Figure 3C). Notable exceptions were higher HRD scores associated with better clinical outcomes in GBM and OV (Figure 3C), and with overall survival in OV (Figures S7B and S7C). This may reflect the use of potent, DNA-damaging platinum-based compounds as standard-of-care therapy for OV (Figure S7) (Mills et al., 2016). The association of a higher HRD score with better outcomes in GBM may be linked to IDH mutations. These have been associated to higher HRD (data not shown; Sulkowski et al., 2017) and better outcomes (Brat et al., 2015) than IDH wild-type GBM cancers. We identified additional contributions of DDR gene alterations to HRD scores by using a Bayesian ridge regression to model HRD scoring as a function of DDR gene alterations while controlling for cancer type as a covariate (Table S3). This analysis, performed in 8,464 samples, excluded potentially confounding POLE mutants and MSI-high cancers (Figures 3B–3D).HR pathway genes with significant positive HRD associations included BRCA1, BRCA2, RAD51B, and RAD51C. BRCA1 alterations had the largest positive weight for predicting higher HRD scores. Alterations in either BRCA1 or RAD51B increased HRD score in most cancer types including the top two cancer types with the greatest number of alterations in either gene (Figure 3E). TP53 alterations were also associated with higher HRD scores, consistent with previous observations of a higher SCNA burden in TP53-mutated cancers (Ciriello et al., 2013). Other alterations in several MMR and NER genes had significant negative associations with HRD score, e.g., MLH1, TCEB3, and MSH2, suggesting a potential mutually exclusive relationship between HR and MMR or NER deficiencies.Gene fusions with the potential to disrupt DDR function were identified using ChimerDB 3.0 (Lee et al., 2017) and TCGA Fusion Gene Data Portal (Yoshihara et al., 2015) then manually checked for read alignments to identify 188 high-confidence fusion events that involved 108 DDR genes in 205 cancer samples (Figure S4C). The most frequently affected DDR genes were SMARCA4, PTEN, RAD51B, and SMARCA1 (Figure S4D). A majority of the fusions, including 18 in HR and 5 in MMR genes, had breakpoints in DDR functional domains that were predicted to disrupt function with readily predictable therapeutic consequences, e.g., for the HR gene fusions involving RAD51B and BRCA1 (Figure S4E). Fusions also often had multiple partners, as well as breakpoints: e.g., RAD51B had three recurrent breakpoints with five different fusion partner genes (CEP170, ENOX1, NPC2, ZFYVE26, and PCNX) (Figure S4F).
Protein Structure Modeling of Recurrent DDR Mutations
We further examined potential functional consequences of relatively rare, recurrent mutations in 22 DDR genes by protein structural analyses and modeling. These included MGMT, PARP1, TOP3A, BLM, ERCC2, HFM1, POLE, POLD1, and POLQ, which collectively harbored 1,380 unique rare, recurrent non-synonymous mutations. Many of these mutations were found at domain interfaces or along solvent-accessible surfaces, and a substantial fraction (n = 370 or 26.8%) were predicted to be strongly destabilizing (ΔΔGfold ≥ 3 kBT) (Figures 4A and 4B; Figure S5). An additional small number (n = 26 or 1.9%) were predicted to be strongly stabilizing (ΔΔGfold ≤ −3kBT) and thus might affect function by restricting protein conformational changes (Figure 4C).
Figure 4
Rare, Recurrent DDR Gene Mutations Differentially Alter Protein Structural Stability
Key DDR genes (listed in C as column labels) were selected for protein modeling based on the frequency of rare and recurrent mutations and experimentally determined structures that covered a majority of amino acid residues.
(A) POLE mutations are clustered in protein functional domains. Spheres represent mutations colored by mutation frequency (boxed key top right) overlaid on a POLE structural model. 3D mutation hotspots are present in both the exonuclease (blue) and polymerase (green) domains.
(B) Most MGMT somatic mutations likely alter protein structure. The location of mutations shown on a structural model as spheres, colored by predicted effect. Mutations with protein folding energy (ΔΔGfold) ≥ 3 kBT are predicted to be strongly destabilizing, whereas those altering stability by less than |ΔΔGfold| < 1 kBT were considered not significant (NS).
(C) Many DDR gene somatic mutations are predicted to destabilize protein structure. Structure-based calculations of the effect of 1,380 mutations on ΔΔGfold are plotted, with the number of unique mutations/proteins given in parentheses below each protein name.
(D) Altered protein stability is associated with greater burden of genomic alterations. Plot uses standardized Z scores (see STAR Methods) across cancer types to compare samples harboring strongly destabilizing versus non-destabilizing mutations. Association strength depended on the DDR gene, e.g., destabilizing mutations in POLB were associated with lower SCNA and higher mutation burdens, while mutations in PARP1 were associated with a higher SCNA and lower mutation burden.
(E) Altered stability in four proteins was associated with a large shift (Z > 0.5) in SCNA burden. Split violin plots show the different distributions of SCNA burden among samples with a destabilizing versus non-destabilizing mutations in each gene.
(F) Altered stability in five proteins was associated with a large shift (Z > 2.0) in mutation burden. See also Figure S5.
POLE exonuclease domain mutations are positively associated with a higher mutation burden (p = 0.02) and negatively associated with SCNA burden (p = 0.006). Structure-based modeling predicted an additional subset of POLE mutations that may destabilize POLE structure and reduce catalytic efficiency. In similar fashion, most MGMT somatic mutations are likely to affect MGMT activity by destabilizing protein structure. Molecular dynamics simulations of a subset of mutations (n = 86) in six proteins (POLE, POLD1, POLQ, ERCC2, HFM1, and BLM) revealed many additional mutations that were not predicted to be destabilizing but did alter protein dynamics (Figure S5). Thus, molecular modeling and protein dynamics in concert may reveal mechanisms by which somatic mutations alter DDR protein function.In order to link other destabilizing DDR mutations to genomic instability, we compared both mutation and SCNA burden scores for genes with predicted strongly destabilizing versus non-destabilizing mutations (Figure 4D). The majority of predicted destabilizing mutations showed a positive association with either higher mutation and SCNA or mutation burden scores. Many genes displayed an inverse relationship between these two burden scores, with, e.g., destabilizing mutations in POLB associated with higher mutation and lower SCNA burden, whereas destabilizing PARP1 mutations were associated with lower mutation and higher SCNA burden (Figures 4D–4F). Thus, predicted destabilizing mutations may be directly influencing downstream burden scores by altering or abolishing DDR function.
The loss of TP53 function across many cancer types has significant functional consequences as measured by genomic instability in association with a higher SCNA burden and increased HRD scores (Figures 1D and 3D). Cancer-associated TP53 mutations may promote these consequences through simple loss of function, as well as by altering transcription or through dominant-negative, gain-of-function mechanisms (Bouaoun et al., 2016; Kastenhuber and Lowe, 2017; Olivier et al., 2010; Pfister and Prives, 2017; Stracquadanio et al., 2016). A subset of these consequences can also be phenocopied by other genomic alterations. In order to better predict the consequences of TP53 inactivation and identify potential phenocopies of TP53 loss, we constructed a TP53 classifier that predicts inactivation status from RNA sequencing expression data, adjusted for cancer type and mutation burden (details in STAR Methods), then used this to analyze cancer types with comparable numbers of TP53 alterations. The resulting classifier was highly sensitive and specific (Figure 5A), and when trained using PanCanAtlas data, it outperformed individual cancer models in 14 out of 19 cases (Figures S6A and S6B).
Figure 5
Machine Learning to Predict TP53 Inactivating Mutations in Cancer
(A) Robust classifier performance by receiver operating characteristic (ROC) and area under the ROC curve (AUROC). Training data, cross validation assessment, and held out test set (10%) for 19 cancer types were used.
(C) SCNA burden is correlated with known/predicted TP53 status. Plots show SCNA/CNV burden as fraction altered for known or predicted TP53 status. The SCNA profile for TP53 mutation c.375G>T in TP53 exon 4 appears similar to other TP53 loss events.
(D) SCNA in TP53-interacting genes MDM2 and CDKN2A phenocopies TP53 loss. Results shown are for PanCanAtlas TP53 wild-type samples.
(E) TP53 network gene alterations phenocopy TP53 deficiency. Mutations were manually curated and selected a priori. All mutation tests including only TP53 wild-type/non-hypermutated cancers are indicated by orange edges. Node color indicates event class (red, mutation; blue, copy-number loss; and purple, copy-number amplification); edge values indicate Cohen’s d effect size. Thin blue edges indicate predicted interactions from the STRING database. NS is “not significant” with p > 0.005. See also Figure S6.
Individual weights of the TP53 classifier identified 10 top negative-weighted genes, of which 9 are confirmed TP53 target genes (Kastenhuber and Lowe, 2017) (Figure 5B). The remaining gene, MPDU1, may have been identified by virtue of being located ~80 kb downstream of TP53 and thus sensitive to TP53 copy loss. Of note, our classifier was able to predict TP53deficiency independent of cancer type with a high AUROC (area under the receiver operating characteristic curve; 0.94), and in samples initially removed from training. These included cancer types with few TP53 events (THCA and UVM [uveal melanoma] cancers), as well as those dominated by TP53 events (OV and UCS cancers) (Figures S6C–S6F). The classifier was also able to distinguish TP53 mutant from wild-type BRCA and UCEC, with nearly all basal-subtype BRCA cancers predicted to be TP53 deficient (Figures S6G and S6H). An analogous approach has been used to predict RAS pathway activation in PanCanAtlas cancers (Way et al., 2018).The classifier enabled the identification of phenocopying mutations both in TP53 and in other functionally related genes. Consistent with previous pan-cancer analyses (Zack et al., 2013), we observed that predicted TP53 loss-of-function samples, including cancers with synonymous TP53 c.375G>T mutation, had an increased SCNA burden when compared with wild-type samples (Figure 5C). This synonymous mutation may act by altering a splice donor to produce alternatively spliced transcripts that compromise TP53 function (Collado-Torres et al., 2017). Samples with c.375G>T or c.375G>A mutations were also enriched for a 200 base pair truncation in exon 4 when compared with wild-type TP53 samples (OR [odds ratio] = 61.9, p < 2.2e–16). This mutation/truncation pairing was previously observed in a pancreatic cancer cell line and as a SNP (rs55863639) likely pathogenic for Li-Fraumeni syndrome (Leroy et al., 2014).Significantly increased classifier scores were also noted for cancers with MDM2 copy-number amplification and CDK2NA copy-number deletion in an analysis including only non-hypermutated cancers without deleterious TP53 mutation (Figure 5D). We had observed a copy-number dosage effect for CDK2NA copy-number deletions, where loss of the CDKN2A-encoded P14ARF protein can phenocopy TP53 alterations (Sherr, 2001). Among eight other tested genes, MDM4 and PPM1D copy-number amplification and ATM and CREBBP gene mutations were associated with increased TP53 classifier scores, while ATR, CHEK1/2, or RP6SKA3 mutations were not (Figure 5E). These results suggest the general utility of this approach, even in circumstances where a diversity of molecular events and potential downstream consequences might occur.
DDR Alterations Are Associated with Clinical Outcomes
We computed multiple DDR footprint scores based on quantitative estimates of DNA damage and investigated their association with clinical outcomes. These aggregated 43 DDR footprint scores such as mutation burden and copy-number burden and extended other published scores, e.g., repair proficiency scoring (RPS) (see Data and Software Availability for per-sample estimates). We tested DDR footprint score associations with overall survival (OS) and progression-free interval (PFI) across 28 cancer types by fitting Cox proportional hazards models, using survival outcome data generated by Liu et al. (2018).SCNA-based scores were more strongly associated with survival outcomes, compared with mutation- and expression-based DDR footprint scores (Figure S7A). LOH burden (number of genomic segments with LOH), combined HRD score, and HRD component scores (HRD-LOH [Abkevich et al., 2012], LST [Popova et al., 2012], and NtAI [Birkbak et al., 2012]) were significantly associated with OS and PFI for six to ten cancer types (ACC, BRCA, ESCA, GBM, KIRP, MESO [mesothelioma], PRAD [prostate adenocarcinoma], OV, UCEC, and UVM) (Figure S7B). In all but two cancer types (GBM and OV; Figures 3B and 3C), higher HRD or LOH burden scores were associated with poorer prognosis (Figure S7B). These associations remained statistically significant in multivariate Cox proportional hazard models after accounting for known covariates such as patient age, cancer type, and stage (Figure S7C).Fewer mutation-based associations with clinical outcomes were identified: e.g., a high mutation burden was associated with better prognosis in UCEC and STAD cancers, but worse prognosis in LGG and ACC cancers (Figure S7C). Of note, expression-derived DDR footprint scores (e.g., expression Cumulative Density Function trAnsform of Rank Distribution or eCARD scores [Zimmermann et al., 2016]) derived for a single cancer type (OV) had significant prognostic associations in 7 cancer types for OS and for 6 cancer types for PFI (Figure S7A) that were the opposite of associations observed in OV and STAD (Figure S7B). Low repair proficiency scores were associated with a worse OS as previously reported (Pitroda et al., 2014), though for PFI in only a subset of cancer types (Figure S7B). These associations between DNA damage consequences and survival across 28 humancancer types confirm previously reported survival associations and provide a rationale for extending these analyses to additional cancer types.
DISCUSSION
We used TCGA PanCanAtlas data to systematically analyze the prevalence, nature, and consequences of DDR gene and pathway alterations across 9,125 samples representing 33 different cancer types. DDR gene alterations were ubiquitous: approximately 1/3 of TCGA PanCanAtlas cancer types showed significant enrichment of somatic mutations in DDR genes, often accompanied by SCNA/LOH events. The functional consequences of these alterations could often be readily inferred. For example, the HR pathway was altered in nearly 40% of cancers, e.g., in BRCA1/2-mutated ovarian and triple-negative breast cancers, where HR is a key determinant of platinum chemotherapy as well as PARP inhibitor response.DNA methylation-dependent epigenetic silencing was also surprisingly frequent—though more variable—than mutation or deletion calls and encompassed one-third of TCGA cancer types. Nearly three-quarters (20 of the 28, or 71%) of statistically significant associations observed using our comprehensive annotation of DDR pathways (Figure S1E) were driven by silencing of genes in the MMR, HR, and DR pathways. Some of the recurrently silenced genes we identified have been previously identified, e.g., MGMT (Esteller et al., 1999), MLH1 (Cancer Genome Atlas Research Network, 2014; Kuismanen et al., 2000; Simpkins et al., 1999), MLH3 (Lhotska et al., 2015), BRCA1 (Esteller et al., 2000), and HLTF (SMARCA3) (Moinova et al., 2002). Epigenetic silencing of DR pathway genes in gastrointestinal, central nervous, and lymphoid cancers were all associated with a high mutation burden.We also identified new, epigenetically silenced genes including DDB2 and EXO5. DDB2 is a NER pathway gene necessary for UV damage repair, whereas EXO5 (DEM1) is a little-studied single-stranded exonuclease that has been linked to genetic instability and DNA damage hypersensitivity especially to DNA crosslinking agents (Sparks et al., 2012). These findings highlight the role of epigenetics in shaping the DDR deficiency landscape in cancer and may provide useful biomarkers for enhanced response to, e.g., alkylating agent therapy. Three additional significant pathway enrichments were also identified when we excluded epigenetic silencing: the DR pathway in LGG cancers, the NER pathway in BLCA (bladder urothelial carcinoma) cancers, and the BER pathway in LIHC (liver hepatocellular carcinoma) cancers (Data and Software Availability).Analyses of loss-of-function alterations within DDR pathways identified co-occurring alterations (largely consisting of mutations and epigenetic silencing) in the MMR pathway in UCEC and STAD cancers, though no pathway by cancer-type-specific, mutually exclusive combinations. We found little evidence for somatic mutation co-occurrence between MMR and NER pathways (Figure S1H), or MMR and HR pathways (Figure S1I). MMR and NER can repair DNA base pair mismatches, bulky DNA base damage, and small DNA loops. Thus, the loss of both MMR and NER might markedly sensitize cancer cells to alkylating agent therapy and provide a starting point to identify effective treatment combinations (Srivas et al., 2016). Of note, 1/5 (22%) of DDR genes participate in more than a single DDR pathway (Table S1; Figures S1A and S1B). Thus, alteration of these “pathway-promiscuous” DDR proteins may have a disproportionately large effect on genomic instability.The detailed understanding of DDR genes and pathways provides immediately plausible mechanisms by which many DDR gene alterations might increase specific mutation types, as well as overall mutational burden. Mutational signature analyses provide a second way to identify potential mutation sources and mechanisms (Alexandrov et al., 2013; Rogozin et al., 2017). For example, we found a previously undefined signature 8 was strongly associated with BRCA deficiency, especially BRCA1 (Figure S2D). EXO5 deficiency, identified here as an often epigenetically silenced DDR gene, was associated with signature 1 across multiple cancer types (Figure S2E) and has been associated with poor clinical outcomes (Gingras et al., 2016; Totoki et al., 2014). Co-occurrence plots of mutations and SCNA/LOH for the top 50 mutated DDR genes in TCGA samples (Figure 1D) and genome-wide DNA damage scores that further encompass LOH and aneuploidy (Figure 3A) suggest the potential for additional complex interactions among DDR gene alterations.We extended the insights gained from analyzing the type and distribution of DDR gene alterations in cancer by using additional approaches. For example, a combination of protein structural modeling and molecular dynamics simulations were used to predict the functional consequences of rare, recurrent nonsynonymous mutations in 22 DDR genes. We found that POLD1 mutations, despite being less common than the POLE or POLQ mutations that contribute to hereditary colorectal cancer risk (Bellido et al., 2016) and the hypermutated phenotype (Briggs and Tomlinson, 2013; Church et al., 2013), were as strongly associated with genomic instability. Molecular dynamics simulations further identified a subset of DDR gene mutations with the potential to alter protein conformational changes independent of effects on protein stability (Figure S5), raising the provocative question of whether destabilizing mutations alone contribute to genomic instability in cancer.A second extension of PanCanAtlas genomic data was to use machine learning to predict TP53 inactivation status from gene expression data. This approach identified both TP53 mutant and TP53 mutant phenocopies, as well as potential TP53 tissue-specific roles in, e.g., ESCA and CESC cancers. This approach was developed using TP53 but is general and thus could be applied to other DDR genes. These and additional approaches may have their greatest value in annotating rare mutations where there may be few or no clinical or experimental data from which to predict mutation functional consequences.In light of the central role played by DDR pathways in ensuring cell survival after DNA damage, we reasoned that DDR deficiency scores might be broadly predictive of both therapeutic response and clinical outcomes. When tested against PanCanAtlas survival data encompassing 28 cancer types (Liu et al., 2018), we identified associations among most DDR footprint scores and clinical outcomes after controlling for covariates such as age, cancer grade, and stage (Figure S7). These associations were consistent in linking a high mutation burden or genomic instability with worse clinical outcomes across almost all cancer types. We also identified HRD-high cancers including subsets of ovarian, uterine, lung squamous, esophageal, sarcoma, bladder, lung adenocarcinoma, head and neck, and gastric carcinomas. Virtually all of these subsets of cancers may have enhanced responsiveness to platinum-based compounds that are given as standard-of-care therapies. This extends the recognition of a bimodal distribution of HRD scores in breast and ovarian cancers and the enrichment of BRCA1/2 mutant or methylated cancers among HRD-high cancers that are more likely to respond to platinum-containing therapy (Telli et al., 2016). These results indicate the potential of HRD scoring to predict both platinum response and PARP inhibitor sensitivity.The high frequency of DDR gene and pathway alterations in TCGA PanCanAtlas cancer samples identifies opportunities to improve cancer therapy. For example, HR defects are common in many cancer types and may compromise successful DNA replication and genome stability (Macheret and Halazonetis, 2015; Zeman and Cimprich, 2014). Thus, combination therapies that induce or potentiate replication stress or impair replication fork protection may be particularly effective in killing HR-deficient cancers (Ray Chaudhuri et al., 2016; Rondinelli et al., 2017; Taglialatela et al., 2017).The many different cancer types that show epigenetic silencing of DR pathway genes such as MGMT and ALKBH3 may be especially vulnerable to types of DNA damage normally repaired by these proteins (Soll et al., 2017; Wang et al., 2015). DDR pathway deficiencies are also mechanistically linked to mutation burden and mutational diversity. Thus, DDR pathway deficiencies in cancer may potentiate immune-based therapies by driving neoantigen production to enhance immune recognition and targeting (Balachandran et al., 2017; Germano et al., 2017; Le et al., 2017) and thus may identify subsets of patients with a higher likelihood of responding to immune-based therapies.Our analysis of DDR pathways across the 9,125 cancers of 33 types included in PanCanAtlas data identified associations among genomic data types; confirmed and extended several previously reported findings; and provided insight into the mechanistic origins and consequences of DDR deficiency in cancer. These results collectively reinforce the importance of DDR gene function in shaping cancer risk, progression, and therapeutic response.
STAR★METHODS
Detailed methods are provided in the online version of this paper and include the following:KEY RESOURCES TABLECONTACT FOR REAGENT AND RESOURCE SHARINGEXPERIMENTAL MODEL AND SUBJECT DETAILSMETHOD DETAILSDNA Damage Repair Pathway CurationAlteration summary for DDR pathways across cancer typesFiltering and functional annotation of somatic mutations in DDR genesDDR gene epigenetic silencing eventsDetermination of deep deletions of DDR genes and SCNA-based DDR scoresComputation of homologous recombination deficiency (HRD) scoresRidge Regression AnalysisSurvival AnalysisCuration and Examination of DDR Fusion EventsGenerating Protein Structural Models at Atomic ResolutionMolecular Dynamics SimulationsIn-silico expression-based predictor of TP53 inactivationDATA AND SOFTWARE AVAILABILITY
CONTACT FOR REAGENT AND RESOURCE SHARING
Further information and requests for additional analysis details should be directed to and will be fulfilled by the Lead Contact: Chen Wang (wang.chen@mayo.edu).
EXPERIMENTAL MODEL AND SUBJECT DETAILS
We used TCGA PanCanAtlas cancer samples defined by the whitelist commonly agreed upon by TCGA Analysis Working Groups (AWGs) (Data and Software Availability) for all analyses. These 9,125 samples encompassed 33 different histopathologic cancer types representing most major classes of human adult cancer. Some analyses were performed for 32 cancer types, with a 33rd type, AML (LAML) excluded in selected analyses as indicated. While a complete analysis of this cohort’s overall demographics are beyond the scope of this paper, relevant demographic factors such as biologic gender, patient age and disease characteristics were accounted for and are noted when reporting the results of specific analyses.
METHOD DETAILS
DNA Damage Repair Pathway Curation
DNA damage repair (DDR) gene list including 276 genes was assembled from relevant gene lists including MSigDB v5.0 (see Key Resources Table) an online catalog of DDR genes from recently published resources (Pearl et al., 2015; Key Resources Table), and knowledge-based curation of information on specific DNA repair pathways or subpathways (see, e.g., (Bell and Kowalczykowski, 2016; Kowalczykowski, 2015) on HR and HR-associated subpathways). Three-quarters (n = 211, 76%) of these 276 genes encompassed nine major DDR pathways: base excision repair (BER), nucleotide excision repair (NER), mismatch repair (MMR), the Fanconi anemia (FA) pathway, homology-dependent recombination (HR), non-homologous DNA end joining (NHEJ), direct damage reversal/ repair (DR), translesion DNA synthesis (TLS), and nucleotide pool maintenance (NP)(Brown et al., 2017; Friedberg et al., 2004; Jeggo et al., 2016; Pearl et al., 2015; Tubbs and Nussenzweig, 2017). The remaining 65 genes have been linked to more than one DDR pathway, or coordinate cellular and molecular responses to DNA damage, and thus may represent an important focus for DDR pathway-associated therapeutic development (Brown et al., 2017; Pearl et al., 2015). The complete gene list is contained in Table S1.
Alteration summary for DDR pathways across cancer types
In order to provide a comprehensive overview of alterations in DDR genes, we combined three data sources of binary alteration calls in terms of loss-of-function events: 1). Deleterious mutations from exome sequencing; 2). Deep deletions from GISTIC calls; and 3). Epigenetic silencing through methylation versus expression analysis. Aggregation of the three data types and integration with whitelisted samples resulted in binary calls across 9,125 samples (see Data and Software Availability).Overall alteration scores were computed by merging three binary calls, i.e., a sample was called altered for a gene if it was mutated, deleted and/or epigenetically-silenced. A sample was called altered for a DDR pathway if at least one gene in the pathway was altered. A permutation test was performed to analyze whether the alteration percentages in the DDR pathways were enriched or depleted (higher or lower than expected) using a null model, for which the genes were randomly assigned to each of the pathways. A gene-pathway graph is used to denote gene to pathway membership relationship. A total of 1,000,000 permuted gene-to-pathway graphs were generated using Birewire (Gobbi et al., 2014) Permuted alteration percentages were then compared to the actual percentages using the standard permutation test (Knijnenburg et al., 2009, 2011). Additionally, we analyzed whether alterations of different types tended to be mutually exclusive or co-occurring. For each combination of a cancer type and DDR pathway, we have three binary vectors that indicate for the samples of that cancer type whether at least one gene in a DDR pathway was mutated, deleted or silenced. These vectors were randomly permuted 100,000 times, after which the original alteration percentage is compared with the permuted percentages.
Filtering and functional annotation of somatic mutations in DDR genes
A total of 313,497 somatic mutations in 276 curated DDR genes were selected from the MC3 MAF (v0.2.8) and underwent the following stepwise filtering process (Figure S1C). In brief, “PASS” filter mutations were selected and supplemented manually with mutations called in whole-genome amplified (wga) samples and by gap-filler assays that were not marked with the “PASS” filter tag. Mutations with low mutant allelic fractions and mutations had a low variant coverage were then removed. Thereafter, common variants reported in ExAc, the Phase-3 1000 Genomes Project or in the ESP6500 databases were removed. The intronic mutations, the mutations in 3′ or 5′ UTR regions or UTR flanking regions, silent mutations, and small, in-frame insertions and deletions were also removed. Mutations were removed if samples were duplicate-sequenced, marked as PanCanAtlas “Do not use” samples, or not included in the MC3 exome bam free best pairing samples. Approximately 89% of raw mutation calls were filtered out by this process, leaving a final list of 279,002 mutations that were used for further analysis.To estimate the probability of missense mutations being damaging, we further annotated these missense mutations using six commonly used functional prediction algorithms (Figure S1D): PolyPhen-2 (Adzhubei et al., 2013), SIFT (Kumar et al., 2009), Mutation Taster (Schwarz et al., 2014), Mutation Assessor (Reva et al., 2011), LR and LRT (Chun and Fay, 2009). The missense mutations that were called as “deleterious” or “damaging” by four or more algorithms were defined as “deleterious” mutations. TP53 hotspot mutation substitutions such as R175H and R273H that didn’t meet this threshold were manually rescued.
DDR gene epigenetic silencing events
Illumina Infinium DNA methylation bead arrays, including both HumanMethylation27 (HM27) and Human Methylation450 (HM450), were used to assay 9,106 cancer samples (i.e., 8,586 primary solid cancers, 361 metastatic cancers and 159 blood malignancies) across 33 different cancer types, together with 1,066 adjacent normal samples. We excluded 19 normal prostate samples for potential label switching as identified by the Working Group and confirmed by pathology re-review. An external HM450 brain dataset containing 58 sorted neuronal and glial cells (Guintivano et al., 2013) from post-mortem frontal cortex of normal individuals was introduced as a normal control as GBM had only two adjacent normal samples, as were 60 sorted blood samples from 6 healthy individuals examined by HM450 (Reinius et al., 2012)). Data from HM27 and HM450 were then combined and normalized using a probe-by-probe proportional re-scaling method to yield a common set of 22,601 probes with comparative methylation levels. Briefly, we modeled the difference between HM27 and HM450 by two different technical replicates (TCGA-07-0227/TCGA-AV-A03D, measured 44/198 times and 12/169 times on HM27 and HM450, respectively), and applied a proportional rescaling method to remove platform effects.Probes located within potential promoter regions (upstream and downstream 1500bp flanking regions of Transcription Start Sites (TSSs) of all annotated transcripts by UCSC) of the 276 DDR genes were examined for evidence of epigenetic silencing. We started with such probes that are consistently unmethylated (median beta value < 0.2) in each of the normal tissue types as well as sorted blood cells. Within each cancer type, for each probe/gene pair, the gene expression was first Z-score transformed using the mean and standard deviation calculated with the unmethylated cancer samples (i.e., sample with a beta value of (0, 0.1)). Samples across all cancer types were then pooled together. We chose the probes that exhibited epigenetic silencing using the following criteria: 1)at least 5 samples were observed with a beta value of 0.3 or above (defined as the methylated group); 2) mean Z score of the methylated group was lower than −1.65; 3) FDR-corrected p value according to one-side t test on Z scores was lower than 1e–2 between unmethylated and methylated group; and 4) the maximum beta value of the methylated group was higher than 0.75. Probes meeting these standards were retained to summarize epigenetic silencing events at the gene level. For genes with only one retained probe, a beta value cutoff of 0.3 was applied to call their silencing status, while genes with multiple probes left, the cutoff was relaxed to 0.2 but requiring that greater than half of the probes consistently silenced for that gene.For DDR gene RAD51C, there were no common probes located in the promoter region. However, probe cg14837411 on HM27 and probe cg27221688 on HM450 were only 100bp apart and both correlated with gene expression. We manually added these back based on a beta value cutoff of 0.2 or above and combining both probes to call RAD51C epigenetic silencing.
Determination of deep deletions of DDR genes and SCNA-based DDR scores
Binary deep deletion calls were made using output from a PanCan GISTIC2.0 run on the samples (Mermel et al., 2011). GISTIC calls of −2 (indicating a loss of more than half of baseline ploidy) were called as deep deletions. Deep deletion calls were transformed into binary matrix format and made available in the (see Data and Software Availability).The two SCNA burden scores (number of segments and fraction of genome altered) were computed using relative copy number segment data (see Data and Software Availability). For the first score, we counted the number of segments present in the copy number profile for each TCGA sample, and for the second score, we took the percentage of base pairs present in the copy number profile for each sample that belonged to segments with either greater than 0.1 or less than −0.1 log2 fold-change from baseline ploidy. All SCNA burden scores are available in Data and Software Availability.The aneuploidy score reports the total number of arm-level amplifications and deletions in each cancer and was computed using ABSOLUTE (Carter et al., 2012) and an arm-level clustering algorithm, as described in Taylor et al. (2018). The two LOH scores (total number of segments with LOH events and fraction of genome containing LOH events) were computed directly using output from ABSOLUTE (Carter et al., 2012). All aneuploidy and LOH scores are available online (see Data and Software Availability).
Computation of homologous recombination deficiency (HRD) scores
We calculated HRD scores following previous published 3 components of HRD/genome scarring scores: HRD-LOH (Abkevich et al., 2012), LST (Popova et al., 2012), NtAI (Birkbak et al., 2012) and the implementation of a sum of the three (Marquard et al., 2015). Segment LOH and SCNA were generated by TCGA Network Aneuploidy AWG using ABSOLUTE (Carter et al., 2012)
Ridge Regression Analysis
Bayesian ridge regression was performed on 276 DDR genes using alteration status and encoding tumor type as 33 additional binary variables. HRD scores were modeled for 8464 tumor samples (MSI-high and POLE mutant samples were excluded from this analysis). Maximally uninformative gamma-distributed priors with shape and rate parameters equal to 0.1 were used for the precision of coefficients weighting and regularization factor. Coefficient significances were computed by first dividing the coefficient values by the square roots of the diagonal elements of the variance-covariance matrix of the weights to obtain coefficient t-statistics, and then by performing two-tailed t tests with 8326 degrees of freedom on these values. The regression was performed in Python-3.6.3 using the linear_model module in scikit-learn-0.19.1.
Survival Analysis
We evaluated clinical outcomes associations using newly developed, well-validated clinical data for each cancer type by the PanCanAtlas Survival Working Group’s paper and following their recommendations (Liu et al., 2018). Their analysis was based on clinical outcomes data from primary sources that were analyzed with extensive quality control and validation.We considered two primary clinical outcomes: overall survival (OS), defined as “the period from date of diagnosis until death from any cause”; and progression free interval (PFI), defined as “the period from date of diagnosis until the occurrence of an event in which the patient with or without the tumor does not get worse.” Within each cancer type, we dichotomized the cohort according to the median score of each DNA damage footprint score to create two groups (high versus low) for survival comparisons. For a given cancer type, we fit univariate and multivariate Cox proportional hazard models using the DDR footprint score group as a predictor. For multivariate models, we considered age, tumor grade, and tumor American Joint Committee on Cancer pathology stage as covariates. We divided cancer grade into two categories of “low-grade” for grades 1–2, and “high-grade” for 3–4. Cancer stages were grouped as “early” for stages I and II, and “late” for stages III and IV. Additional covariates such as subtype, gender and race were considered but not used because of sparsity or convincing evidence for inclusion. For each set of DDR footprint Cox models, we applied the Benjamini-Hochberg correction to control the false discovery rate for DNA damage footprint scores.
Curation and Examination of DDR Fusion Events
We downloaded 30,001 fusion gene (FG) transcript candidates from the ChimerSeq module of ChimerDB 3.0 (December 2016) (Lee et al., 2017). By overlapping these putative fusion genes with 276 DDR genes, we identified 889 fusions involving 224 DDR genes. Among these, 464 FGs including 173 DDR genes were derived from TCGA cancer samples. To reduce false positives, we used read count information together with FusionScan, TopHat-Fusion, and PRADA. Following their filtering criteria, we used fusion transcripts having ≥ 2 seed/junction reads for FusionScan and PRADA analyses, and those with ≥ 100 spanning pairs for TopHat-Fusion analyses that identified 289 FGs. We were able to augment these with 343 additional FGs involving 141 DDR genes by downloading putative FGs from TCGA Fusion Data Portal using our 276 DDR genes as a query (September 2016) (Yoshihara et al., 2015). Combining these two FG datasets we identified 488 FGs involving 174 DDR genes in 477 cancer samples.We next checked the read alignments for each DDR fusion gene. RNA-seq data of 477 samples were downloaded using gdc-client. BAM files were processed to obtain unmapped reads only using Bowtie2. We created fusion transcript composed of 100 bp sequences before and after break points using nibFrag, one of the BLAT Suite of programs. After creating a 200 bp length index of each fusion transcript, we aligned the unmapped reads to this fusion sequence index, then manually checked the read alignments. If a FG had at least one 20nt-20nt seed spanning read at the break point with a stair-shaped mapping of all reads we considered this a high likelihood gene fusion. This analysis identified 192 high likelihood FGs involving 108 DDR genes in 209 cancer samples.
Generating Protein Structural Models at Atomic Resolution
We began our structure-based analysis from UniProt canonical isoform sequences, and searched the PDB (Berman et al., 2000) for existing experimental structures. Experimental structures exist for humanPOLQ domains including the DEAD-box, Helicase, Sec63, and polymerase. The first three domains were solved in a single crystal structure, 5A9J (Newman et al., 2015). The polymerase domain is available in 4X0P (Zahn et al., 2015). We will abbreviate these POLQ structures as DHS and Pol, respectively, that together encompass 65% of the POLQ protein. No experimental structure exists for full-length human POLE. Thus we utilized the high-resolution experimental structure from yeast (4M8O (Hogg et al., 2014); 57% sequence identity) to generate our initial model (Roy et al., 2010) of the first 1155 amino acids of POLE that encompasses the polymerase and exonuclease domains.Experimental structures for full-length humanPOLD1 and HMF1 are also lacking. Thus we utilized homology modeling using as templates 3IAY (Swan et al., 2009) (52% identical), and 4KIT (Mozaffari-Jovin et al., 2013) (31% identical), respectively. The structure of humanBLM was taken from 4CGZ (Newman et al., 2015) and TOP3A from 4CGY (Bocquet et al., 2014). ERCC2 has been solved (5IVW [He et al., 2016]), but at 94% sequence identity. Thus homology modeling was used to update the experimental structure to the target wild-type sequence.The remaining proteins for which we applied molecular modeling (Figure 4)are available in the PDB at > 95% sequence identity. We used homology modeling to revert protein to wild-type sequences and to fill in missing loops. FoldX (Schymkowitz et al., 2005; Van Durme et al., 2011) was used to perform in silico mutagenesis and side chain rotamer optimization, and to calculate ΔΔGfold for each variant. Results are summarized and visualized in Figure 4 and Figure S5.
Molecular Dynamics Simulations
In order to provide a detailed assessment for the most recurrent variants in selected DDR proteins (n = 86; observed in at least 2 samples for POLQ or 3 samples for others), we utilized Generalized Born implicit solvent molecular dynamics (MD) simulations, which were carried out using NAMD (Phillips et al., 2005) and the CHARMM27 with CMAP (Mackerell et al., 2004) force field. We utilized an interaction cutoff of 12Å with strength tapering (switching) beginning at 10Å, a simulation time step of 1fs, and conformations recorded every 2ps. Each initial conformation was used to generate 2 replicates that were each independently energy minimized for 10,000 steps, followed by heating to 300K over 0.5ns via a Langevin thermostat and a further 6ns of simulation generated at 2fs/step with the final 5ns analyzed. Implicit solvation accelerates system kinetics due to lack of explicit motion by the solvent required for solute motion. For POLE we studied in greater depth the recurrent V411A/D/I/L/M and P286R substitutions. For each, 30ns of simulation trajectory was generated after minimization and heating, for each variant in triplicate, and the final 20ns analyzed. Across the 6 variants plus the WT, 640ns of MD trajectory was generated. All proteins were modeled without substrate (as apo structures). In total, 1118ns of MD trajectory was generated and used to better understand the differences in effect of different protein variants on dynamics.All MD trajectories were first aligned to the initial wild-type conformation of each protein using Cα atoms. Root Mean Squared Deviation (RMSD) was calculated using Cα atoms. Principal Component (PC) analysis was performed in Cartesian space. Analysis was carried out using custom scripts, leveraging VMD (Humphrey et al., 1996) and the Bio3D R package(Grant et al., 2006). Protein structure visualization was performed in PyMol (The PyMOL Molecular Graphics System. Version 1.5.0.3. Schrödinger, LLC) and VMD.We used wild-type simulations as a benchmark for determining if variants altered the intrinsic dynamics of each protein (ΔPC). We first identified the region of PC space sampled by the densest 90% of each protein WT simulation (Figure S5). If the median PC coordinate for a variant was outside of this region, we considered the variant to have altered the activation of the corresponding PC motion. Differences in folding energy were classified as moderately (ΔΔGfold ≥ 1 kBT ≈ 0.6 kcal/mol) or strongly (ΔΔGfold ≥ 3 kBT ≈ 1.8 kcal/mol) destabilizing.
In-silico expression-based predictor of TP53 inactivation
We trained a classifier to use RNA-seq expression data to predict TP53 functional status. In brief, we trained a logistic regression classifier with an elastic net penalty using the Scikit-learn implementation of stochastic gradient descent (Pedregosa et al., 2010). The labels (y) for the supervised task included samples with MC3 annotated deleterious TP53 mutations (samples with silent mutations were considered TP53 wild-type) and samples with TP53deep copy number loss as predicted by the GISTIC2.0 algorithm (Mermel et al., 2011). We included cancer-types in the model that had greater than 15 samples in each class, and between 5% and 95% of samples in both classes. Other samples were removed (see Figure S6A). The features (X) consisted of the 8,000 most variably expressed genes by median absolute deviation (MAD). We dropped expression of TP53 itself from the features to prevent the model from relying on target gene data. MAD genes were z-scored and concatenated with binarized dummy variables for all cancer types and mutation burdens (total log10 mutation count) to adjust for potential confounding factors. To reduce the effect of mutation burden confounding, we also removed outlier samples with the most extreme hypermutation phenotypes (> 5 standard deviations above the mean log10 mutation count). The goal of the classification scheme was to determine the weights (w) that minimize the following objective function:
Where i indexes samples, p indexes genes, λ and l are regularization and elastic net mixing hyperparameters, respectively. We selected optimal hyperparameters by balanced 5-fold cross validation with the goal of inducing a sparse solution. We also used a balanced 10% held out set to test the performance of the classifier on data that was not used for training or hyperparameter optimization. We fit the final model on the remaining 90% of the data and report performance using receiver operating characteristic (ROC) curves and area under the ROC curve (AUROC) metrics.We manually selected an a priori set of genes known to interact with TP53 for our phenocopying experiment (L.A. Donehower, unpublished data). We tested MDM2, MDM4, and PPM1D amplifications, CDKN2A deletions, and ATM, ATR, CHEK1, CHEK2, CREBBP, and RPS6KA3 mutations. For the copy number tests, we included both deep and shallow alterations in the altered set compared to cancers with wild-type profiles only. We removed tumors with deleterious TP53 mutations or deep copy number loss (n = 4,037). From the remaining 5,629 tumors, we removed 219 hypermutated cancers leaving an analytic set of 5,410 cancer samples. We performed independent t tests and calculated Cohen’s D effect sizes comparing the assigned TP53 classifier scores for wild-type against altered cancers. We considered variants significant if they were less than a Bonferroni adjusted p value (p > 0.005). We visualized the results in a network diagram presented in Figure 5E. The underlying interaction network was downloaded from the STRING database (version 10.5) (Szklarczyk et al., 2017). The thickness of edges in the STRING network display interaction confidence and were generated by experimental data. Note that there are no direct interaction edges between RPS6KA3 and TP53 and PPM1D and TP53.We provide materials under an open source license to reproduce and expand upon this analysis at https://github.com/greenelab/pancancer. Additional details, including benchmarking analyses, are provided in Way et al. (2018).
Authors: Marco Barazas; Alessia Gasparini; Yike Huang; Asli Küçükosmanoğlu; Stefano Annunziato; Peter Bouwman; Wendy Sol; Ariena Kersbergen; Natalie Proost; Renske de Korte-Grimmerink; Marieke van de Ven; Jos Jonkers; Gerben R Borst; Sven Rottenberg Journal: Cancer Res Date: 2018-12-10 Impact factor: 12.701
Authors: Kylie L Gorringe; Ian G Campbell; Dane Cheasley; Lisa Devereux; Siobhan Hughes; Carolyn Nickson; Pietro Procopio; Grant Lee; Na Li; Vicki Pridmore; Kenneth Elder; G Bruce Mann; Tanjina Kader; Simone M Rowley; Stephen B Fox; David Byrne; Hugo Saunders; Kenji M Fujihara; Belle Lim Journal: NPJ Breast Cancer Date: 2020-08-07
Authors: Sahar Shahamatdar; Meng Xiao He; Matthew A Reyna; Alexander Gusev; Saud H AlDubayan; Eliezer M Van Allen; Sohini Ramachandran Journal: Cell Rep Date: 2020-03-03 Impact factor: 9.423
Authors: Na Li; Simone McInerny; Magnus Zethoven; Dane Cheasley; Belle W X Lim; Simone M Rowley; Lisa Devereux; Norah Grewal; Somayeh Ahmadloo; David Byrne; Jue Er Amanda Lee; Jason Li; Stephen B Fox; Thomas John; Yoland Antill; Kylie L Gorringe; Paul A James; Ian G Campbell Journal: J Natl Cancer Inst Date: 2019-12-01 Impact factor: 13.506
Authors: A G Waks; O Cohen; B Kochupurakkal; D Kim; C E Dunn; J Buendia Buendia; S Wander; K Helvie; M R Lloyd; L Marini; M E Hughes; S S Freeman; S P Ivy; J Geradts; S Isakoff; P LoRusso; V A Adalsteinsson; S M Tolaney; U Matulonis; I E Krop; A D D'Andrea; E P Winer; N U Lin; G I Shapiro; N Wagle Journal: Ann Oncol Date: 2020-02-20 Impact factor: 32.976