Matthew A Reyna, David Haan, Marta Paczkowska, Lieven P C Verbeke, Miguel Vazquez, Abdullah Kahraman, Sergio Pulido-Tamayo, Jonathan Barenboim, Lina Wadi, Priyanka Dhingra, Raunak Shrestha, Gad Getz, Michael S Lawrence, Jakob Skou Pedersen, Mark A Rubin, David A Wheeler, Søren Brunak, Jose M G Izarzugaza, Ekta Khurana, Kathleen Marchal, Christian von Mering, S Cenk Sahinalp, Alfonso Valencia, Jüri Reimand, Joshua M Stuart, Benjamin J Raphael.
Abstract
The catalog of cancer driver mutations in protein-coding genes has greatly expanded in the past decade. However, non-coding cancer driver mutations are less well-characterized, and only a handful of recurrent non-coding mutations, most notably TERT promoter mutations, have been reported. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we perform multi-faceted pathway and network analyses of non-coding mutations across 2583 whole cancer genomes from 27 tumor types compiled by the ICGC/TCGA PCAWG project. This work was motivated by the success of pathway and network analyses in prioritizing rare mutations in protein-coding genes. While few non-coding genomic elements are recurrently mutated in this cohort, we identify 93 genes harboring non-coding mutations that cluster into several modules of interacting proteins. Among these are promoter mutations associated with reduced mRNA expression in TP53, TLE4, and TCF4. We find that biological processes have variable proportions of coding and non-coding mutations, with chromatin remodeling and proliferation pathways altered primarily by coding mutations, while developmental pathways, including Wnt and Notch, are altered by both coding and non-coding mutations. RNA splicing is primarily altered by non-coding mutations in this cohort, and samples containing non-coding mutations in well-known RNA splicing factors exhibit gene expression signatures similar to those of samples with coding mutations in these genes. These analyses contribute a new repertoire of possible cancer genes and mechanisms that are altered by non-coding mutations and offer insights into additional cancer vulnerabilities that can be investigated for potential therapeutic treatments.
Year: 2020 PMID: 32024854 PMCID: PMC7002574 DOI: 10.1038/s41467-020-14367-0
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Overview of the pathway and network analysis approach.
Coding, non-coding, and combined gene scores were derived for each gene by aggregating driver p-values from the PCAWG driver predictions in individual elements, including annotated coding and non-coding elements (promoter, 5′ UTR, 3′ UTR, and enhancer). These gene scores were input to five network analysis algorithms (CanIsoNet[20], Hierarchical HotNet[21], an induced subnetwork analysis (Reyna and Raphael, in preparation), NBDI[22], and SSA-ME[23]), which utilize multiple protein–protein interaction networks, and to two pathway analysis algorithms (ActivePathways[19] and a hypergeometric analysis (Vazquez)), which utilize multiple pathway/gene-set databases. We defined a non-coding value-added (NCVA) procedure to determine genes whose non-coding scores contribute significantly to the results of the combined coding and non-coding analysis, where NCVA results for a method augment its results on non-coding data. We defined a consensus procedure to combine significant pathways and networks identified by these seven algorithms. The 87 pathway-implicated driver genes with coding variants (PID-C) are the set of genes reported by a majority (≥4/7) of methods on the coding data. The 93 pathway-implicated driver genes with non-coding variants (PID-N) are the set of genes reported by a majority of methods on non-coding data or in their NCVA results. Only five genes (CTNNB1, DDX3X, SF3B1, TGFBR2, and TP53) are both PID-C and PID-N genes.
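The majority-vote consensus described in this caption (a gene is kept when at least 4 of the 7 methods report it) can be illustrated with a minimal sketch. The function name `consensus_genes`, the `min_methods` parameter, and the gene labels `G1` to `G4` are hypothetical and the input is toy data, not results from the study; the actual consensus procedure may include steps not described in the caption.

```python
from collections import Counter


def consensus_genes(method_results, min_methods=None):
    """Return genes reported by at least `min_methods` methods.

    `method_results` maps a method name to the set of genes it reports.
    By default a strict majority is required (e.g. 4 of 7 methods).
    """
    n_methods = len(method_results)
    if min_methods is None:
        min_methods = n_methods // 2 + 1  # strict majority
    # Count each gene once per method that reports it.
    counts = Counter(
        gene for genes in method_results.values() for gene in set(genes)
    )
    return {gene for gene, count in counts.items() if count >= min_methods}


# Toy calls (assumed for illustration only; G1..G4 are placeholder genes).
toy_calls = {
    "ActivePathways": {"G1", "G2", "G3"},
    "CanIsoNet": {"G1", "G2"},
    "Hierarchical HotNet": {"G1", "G3"},
    "Hypergeometric analysis": {"G1", "G2", "G4"},
    "Induced subnetwork analysis": {"G1"},
    "NBDI": {"G2", "G3"},
    "SSA-ME": {"G4"},
}

print(sorted(consensus_genes(toy_calls)))  # ['G1', 'G2']
```

Under the caption's definition, a PID-N style set would apply the same vote to each method's non-coding results combined with its NCVA results; that union step is left out of this sketch.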
Fig. 2 Pathway and driver analysis identifies driver genes in the long tail of the driver p-values for coding and non-coding mutations.
a Pathway and network methods identify significant coding driver mutations. Driver p-values on protein-coding elements for the 250 genes with the most significant coding driver p-values; dashed and dotted lines indicate FDR = 0.1 and 0.25, respectively. Dark green bars are PID-C genes, and light green bars are non-PID-C genes. Blue squares below the x-axis indicate COSMIC Cancer Gene Census (CGC) genes. In total, 31 of 87 PID-C genes have coding driver p-values with FDR > 0.1. Several PID-C genes are labeled, including all CGC genes with coding FDR > 0.1. Overlap between PID-C and PID-N genes is indicated with asterisks. Source data are provided as a Source Data file. b Pathway and network methods identify rare non-coding driver mutations. Driver p-values on non-coding elements (promoter, 5′ UTR, and 3′ UTR of gene) for the 250 genes with the most significant non-coding driver p-values; dashed and dotted lines indicate FDR = 0.1 and 0.25, respectively. Dark yellow bars are PID-N genes, and light yellow bars are non-PID-N genes. Blue squares are as above. In total, 3 (TERT, HES1, and TOB1) of 93 PID-N genes have non-coding driver p-values with FDR ≤ 0.1, while 90 have FDR > 0.1. Several PID-N genes are labeled, including PID-N genes with significant in cis gene expression changes (see Fig. 3) and all PID-N genes with non-coding FDR > 0.25. Overlap between PID-C and PID-N genes is indicated with asterisks. Source data are provided as a Source Data file. c Statistical significance of the overlap of CGC genes with top-ranked genes according to coding driver p-values and with PID-C genes. Fisher’s exact test p-values and driver FDR thresholds of 0.1 and 0.25 are highlighted. Green squares indicate overlap between PID-C genes and CGC genes. Source data are provided as a Source Data file. d Statistical significance of the overlap of CGC genes with genes ranked by driver p-values on non-coding (promoter, 5′ UTR, 3′ UTR) elements. Driver FDR thresholds of 0.1 and 0.25 are highlighted. The yellow square indicates overlap between PID-N genes and CGC genes. Source data are provided as a Source Data file.
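Panels c and d report Fisher's exact test p-values for the overlap between a gene list and the CGC genes. As a rough sketch of that kind of overlap test, the function below computes the one-sided (enrichment) p-value from the hypergeometric tail using only the standard library. The function name and arguments are hypothetical, the caption does not state whether the test was one-sided or two-sided, and the numbers in the usage line are toy values, not data from the study.

```python
from math import comb


def overlap_pvalue(n_universe, n_set_a, n_set_b, n_overlap):
    """One-sided Fisher's exact test for the overlap of two gene sets.

    Returns P(overlap >= n_overlap) when `n_set_b` genes are drawn at random
    from a universe of `n_universe` genes, of which `n_set_a` belong to set A.
    """
    max_overlap = min(n_set_a, n_set_b)
    total = comb(n_universe, n_set_b)  # all ways to choose set B
    tail = sum(
        comb(n_set_a, k) * comb(n_universe - n_set_a, n_set_b - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy example: universe of 10 genes, two sets of 3 genes that overlap fully.
print(overlap_pvalue(10, 3, 3, 3))
```

With real data, `n_universe` would be the number of genes that were tested, which is a choice that strongly affects the p-value and is not specified in the caption.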
Fig. 3 Gene expression changes are correlated with mutations in PID-N genes.
Evolutionary conservation of genomic elements estimated with PhyloP is shown as gray features. H3 histone lysine 4 tri-methylation sites (H3K4me3) measured in GM12878 HapMap B-lymphocyte cell lines are highlighted in the green track, indicating active promoter regions near transcription start sites[49]. Boxplot center lines show the median, boxplot bounds show the first quartile Q1 and the third quartile Q3, and whiskers extend 1.5 × (Q3–Q1) below Q1 and above Q3. a TP53 promoter. TP53 coding and non-coding genomic loci with zoomed-in view of the TP53 promoter region. TP53 promoter mutations (six mutations in Biliary-AdenoCA, ColoRect-AdenoCA, Kidney-ChRCC, Lung-SCC, Ovary-AdenoCA, and Panc-AdenoCA cancer types) correlate significantly (Wilcoxon rank-sum test p = 0.001, FDR = 0.087) with reduced TP53 gene expression, where FPKM-UQ is upper quartile normalized fragments per kilobase million. Samples with copy-number gains and losses in the TP53 promoter region are annotated in red and blue, respectively. Two of the six TP53 promoter mutations overlap with transcription factor-binding sites (with one mutation matching three motifs). Source data are provided as a Source Data file. b TLE4 promoter. TLE4 coding and non-coding genomic loci with zoomed-in view of the TLE4 promoter region. TLE4 promoter mutations in Liver-HCC samples (three mutations) correlate (Wilcoxon rank-sum test p = 0.02, FDR = 0.2) with lower TLE4 gene expression. Samples with copy-number gains and losses are annotated in red and blue, respectively. One of the three TLE4 promoter mutations overlaps a transcription factor-binding site for ZNF263. Source data are provided as a Source Data file. c TCF4 promoter. TCF4 coding and non-coding genomic loci with zoomed-in view of the TCF4 promoter region. TCF4 promoter mutations in Lung-SCC samples (three mutations) correlate (Wilcoxon rank-sum test p = 0.03, FDR = 0.27) with lower TCF4 gene expression. Samples with copy-number gains and losses are annotated in red and blue, respectively. One of the three TCF4 promoter mutations overlaps a transcription factor-binding site for ZEB1. Source data are provided as a Source Data file.
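The expression comparisons in this figure use a Wilcoxon rank-sum test between mutated and non-mutated samples. The sketch below shows the mechanics of a two-sided rank-sum test with a normal approximation, using only the standard library. It is an illustration, not the implementation used in the study: the function names are hypothetical, no tie or continuity correction is applied to the variance, and with only a handful of mutated samples an exact test from a statistics package would be the safer choice. The expression values are invented toy numbers.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). A negative z means group_a tends to have lower values.
    """
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2
    std_dev = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (rank_sum_a - expected) / std_dev
    p = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p


# Toy expression values (assumed): mutated samples vs. non-mutated samples.
mutated = [1.0, 2.0, 3.0]
non_mutated = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
z, p = rank_sum_test(mutated, non_mutated)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Per-gene p-values from such a test would then need multiple-testing correction to obtain FDR values like those quoted in the caption; that step is not shown here.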
Fig. 4 Pathway and network modules containing PID-C and PID-N genes.
a Network of functional interactions between PID-C and PID-N genes. Nodes represent PID-C and PID-N genes and edges show functional interactions from the ReactomeFI network (gray), physical protein–protein interactions from the BioGRID network (blue), or interactions recorded in both networks (purple). Node color indicates PID-C genes (green), PID-N genes (yellow), or both PID-C and PID-N genes (orange); node size is proportional to the score of the gene; and the pie chart in each node represents the relative proportions of coding and non-coding mutations associated with the corresponding gene. Dotted outlines indicate clusters of genes with roles in chromatin organization and cell proliferation, which predominantly contain PID-C genes; development, which includes comparable numbers of PID-C and PID-N genes; and RNA splicing, which contains PID-N genes. A core cluster of genes with many known drivers is also indicated. b Pathway modules containing PID-C and PID-N genes. Each row in the matrix corresponds to a PID-C or PID-N gene, and each column in the matrix corresponds to a pathway module enriched in PID-C and/or PID-N genes (see the Methods section). A filled entry indicates a gene (row) that belongs to one or more pathways (column), colored according to gene membership in PID-C genes (green), PID-N genes (yellow), or both PID-C and PID-N genes (orange). A dark colored entry indicates that a PID-C or PID-N gene belongs to a pathway that is significantly enriched for PID-C or PID-N genes, respectively. A lightly colored entry indicates that a PID-C or PID-N gene belongs to a pathway that is significantly enriched for the union of PID-C and PID-N genes, but not for PID-C or PID-N genes separately. Enrichments are summarized by circles adjacent to each pathway module name and PID gene name. Boxed circles indicate that a pathway module contains a pathway that is significantly more enriched for the union of the PID-C and PID-N genes than for the PID-C and PID-N results separately. The enriched modules and PID genes are clustered into four biological processes: chromatin, development, proliferation, and RNA splicing, as indicated.
Fig. 5 RNA splicing factors are targeted primarily by non-coding mutations and alter expression of similar pathways as coding mutations in splicing factors.
a Heatmap of Gene Set Enrichment Analysis (GSEA) Normalized Enrichment Scores (NES). The columns of the matrix indicate non-coding mutations in splicing-related PID-N genes and coding mutations in splicing genes reported in ref. [37], and the rows of the matrix indicate 47 curated gene sets[37]. Red heatmap entries represent an upregulation of the pathway in the mutated samples with respect to the non-mutated samples, and blue heatmap entries represent a downregulation. The first column annotation indicates mutation cluster membership according to common pathway regulation. The second column annotation indicates whether a mutation is a non-coding mutation in a PID-N gene or a coding mutation, and the third column annotation specifies the aberration type (promoter, 5′ UTR, 3′ UTR, missense, or truncating). The fourth column annotation indicates the cancer type for coding mutations. The mutations cluster into three groups: C1, C2, and C3. The pathways cluster into two groups: P1 and P2, where P1 contains immune signature gene sets and P2 contains cell-autonomous gene sets. b t-SNE plot of mutated elements. Gene expression signatures for samples with non-coding mutations in splicing-related PID-N genes cluster with gene expression signatures for coding mutations in previously published splicing factors. The shape of each point denotes the mutation cluster assignment (C1, C2, or C3), and the color represents whether the corresponding gene is a PID-N gene with non-coding mutations or a splicing factor gene with coding mutations.
Summary of pathway database and interaction network data for each method.
| Method | Pathway databases or interaction networks |
|---|---|
| ActivePathways | Gene Ontology (GO)[ |
| CanIsoNet | STRING v10[ |
| Hierarchical HotNet | ReactomeFI 2015[ |
| Hypergeometric analysis | GO biological processes; CORUM[ |
| Induced subnetwork analysis | ReactomeFI 2015[ |
| NBDI | ReactomeFI 2015[ |
| SSA-ME | ReactomeFI 2015[ |
| Label | Synapse ID | ICGC DCC file name | Access (open/controlled) |
|---|---|---|---|
| PCAWG driver p-values | syn8494939 | final_integration_results_2017_03_16.tar.gz | Open |
| Enhancer-gene mappings | syn7201027 | map.enhancer.gene.txt.gz | Open |
| Somatic MAF file | syn7364923 | final_consensus_passonly.snv_mnv_indel.icgc.public.maf.gz | Open |
| Somatic MAF file | syn7364923 | final_consensus_passonly.snv_mnv_indel.tcga.controlled.maf.gz | Controlled |
| Hypermutated donors | syn7894281 | Hypermutated_spls_removed_ActiveDriver2_AllScores_211216.txt | Open |
| Hypermutated samples | syn7814911 | Hypermutated_spls_removed_ActiveDriver2_AllScores_291116.aliquotid.txt | Open |
| Mutations to coding and noncoding elements | syn8103141 | PCAWG_mutations_to_elements.icgc.public.txt.gz | Open |
| Mutations to coding and noncoding elements | syn8103141 | PCAWG_mutations_to_elements.tcga.controlled.txt.gz | Controlled |
| Mutation matrix | syn9684700 | PCAWG.gene_status.all.tsv.gz | Controlled |
| Primary pathway databases | syn3164548 | Gene_sets_pathways_processes_functions.zip | Open |
| Secondary pathway databases | syn11426307 | PCAWG-5.pathway.data.CNIO.tar.gz | Open |
| ReactomeFI 2015 network | syn3254781 | Functional_interaction_network_Reactome_FI_Network_2015.zip | Open |
| iRefIndex14 network | syn10903761 | irefindex14-kegg.tsv.gz | Open |
| BioGRID network | syn3164609 | Protein_Protein_interaction_network_BIOGRID_filtered.zip | Open |
| STRING v10 network | syn11712027 | string10_ppi_high_confident_edges.tsv | Open |
| PCAWG gene expression data | syn5553991 | tophat_star_fpkm_uq.v2_aliquot_gl.tsv.gz | Controlled |
| PCAWG pathway and network method results | syn21413360 | pathway_and_network_method_results.tar.gz | Open |
| PCAWG pathway and network consensus results | syn11654843 | method_results_2017_10_10.tar.gz | Open |
| Coding and noncoding elements | syn21416282 | gene-coding-and-non-coding-elements.tar.gz | Open |
| Transcript expression data (Counts) | syn7536588 | pcawg.rnaseq.transcript.expr.counts.tsv.gz | Controlled |
| Transcript expression data (FPKM) | syn7536589 | pcawg.rnaseq.transcript.expr.fpkm.tsv.gz | Controlled |
| eQTL data | syn17096221 | all_somatic_eqtl.tsv.tar.gz | Controlled |
| Gene-level copy-number data | syn8291899 | all_samples.consensus_CN.by_gene.170214.txt.gz | Open |
| CanIsoNet PCAWG Ensembl transcripts | syn7536587 | pcawg.rnaseq.transcript.expr.tpm.tsv.gz | Open |
| CanIsoNet GTEx Ensembl transcripts | syn7596599 | GTEX_v4.pcawg.transcripts.tpm.tsv.gz | Open |
| CanIsoNet filtered PCAWG samples | syn7416381 | rnaseq.extended.metadata.aliquot_id.V4.tsv | Open |
| CanIsoNet filtered GTEx samples | syn7596611 | GTEX_v4.metadata.tsv.gz | Open |
| CanIsoNet protein–protein isoforms | syn10245952 | isoNet.tsv.gz | Open |
| CanIsoNet shortest path results | syn9770515 | string_cosmic_neighbourhood_min900_shell3_20160527.tsv.xz | Open |
| CanIsoNet functional regions | syn7345646 | allCombined.zip | Open |
| CanIsoNet results (noncoding region) | syn9765614 | non_canIsoNet_mdi_results_noLymNoMel.tsv | Open |
| CanIsoNet results (coding region) | syn9765615 | cds_canIsoNet_mdi_results_noLymNoMel.tsv | Open |