Literature DB >> 31730267

Pan-cancer RNA-seq data stratifies tumours by some hallmarks of cancer.

F Graeme Frost¹, Praveen F Cherukuri^1,2,3, Samuel Milanovich^2,3,4, Cornelius F Boerkoel¹.

Abstract

Numerous genetic and epigenetic alterations cause functional changes in cell biology underlying cancer. These hallmark functional changes constitute potentially tissue-independent anticancer therapeutic targets. We hypothesized that RNA-Seq identifies gene expression changes that underly those hallmarks, and thereby defines relevant therapeutic targets. To test this hypothesis, we analysed the publicly available TCGA-TARGET-GTEx gene expression data set from the University of California Santa CruzToil recompute project using WGCNA to delineate co-correlated 'modules' from tumour gene expression profiles and functional enrichment of these modules to hierarchically cluster tumours. This stratified tumours according to T cell activation, NK-cell activation, complement cascade, ATM, Rb, angiogenic, MAPK, ECM receptor and histone modification signalling. These correspond to the cancer hallmarks of avoiding immune destruction, tumour-promoting inflammation, evading growth suppressors, inducing angiogenesis, sustained proliferative signalling, activating invasion and metastasis, and genome instability and mutation. This approach did not detect pathways corresponding to the cancer enabling replicative immortality, resisting cell death or deregulating cellular energetics hallmarks. We conclude that RNA-Seq stratifies tumours along some, but not all, hallmarks of cancer and, therefore, could be used in conjunction with other analyses collectively to inform precision therapy.

Entities: CellLine Chemical Disease Gene Species

Keywords: RNA-seq; cancer; hallmarks of cancer; pan-cancer; precision medicine; precision oncology; transcriptome; wgcna

Year: 2019 PMID： 31730267 PMCID： PMC6933344 DOI： 10.1111/jcmm.14746

Source DB: PubMed Journal: J Cell Mol Med ISSN： 1582-1838 Impact factor: 5.310

INTRODUCTION

Human cancers are classified based on anatomical, histopathological and molecular features. Hanahan and Weinberg posited cancer unifying changes in cell biology (hallmarks): resisting cell death, sustaining proliferative signalling, evading growth suppressors, activating invasion and metastasis, enabling replicative immortality and inducing angiogenesis.1 Now included are two ‘enabling’ hallmarks (genome instability and mutation, and tumour‐promoting inflammation) and two ‘emerging’ hallmarks (deregulating cellular energetics and avoiding immune detection).1, 2, 3 Aberrant signalling axes identify specific hallmarks in a given cancer.1 Treatments targeting each hallmark promise individualized therapies.1, 2, 4 Individualized cancer therapy requires identifying contributors to these hallmarks, that is targetable biochemical pathways and genetic interactions. Presently, this includes histopathological, DNA, cytogenic and proteomic analyses.5, 6, 7Gene expression analysis is currently used case‐by‐case to discover targets 8; there is no consensus framework for using such analyses across all cancer types in clinical care. Although previous studies of specific primary site cancers such as pancreatic 9 and breast cancers 10 identified transcriptomic subgroups, investigation of transcriptomic subgroups across all cancers is not well studied. Using The Cancer Genome Atlas (TGCA) 11 and FANTOM5 12 transcriptomic data sets, Kaczowski et al 13looked for primary site‐independent cancer subgroups by grouping cancers according to differential expression of individual transcripts initially in cultured cells and secondarily in tumours. This analysis found that cancers could be classified into molecular subtypes defined by expression of transcripts involved in DNA and biopolymer metabolism, tumour suppression, oxidoreductase activity and developmental or cell cycle signalling. The work of Kaczowski et al led us to hypothesize that cancer‐associated mutations and epimutations alter expression of co‐correlated groups of genes independent of cancer type and are detectable primarily in cancer tissue by RNA‐Seq. We reasoned that assessing for co‐correlated groups of genes is arguably more sensitive to changes in expression of gene networks underlying biological processes than is identifying common processes among individual transcripts. To test this, we analysed gene expression profiles from the University of California, Santa Cruz (UCSC) Toil recompute of the TCGA, Therapeutically Applicable Research to Generate Effective Treatments (TARGET)14 and Genotype‐Tissue Expression (GTEx)15 data sets available on the Xena Platform.16 We found consistent stratification of cancers by signatures of T cell activation, NK‐cell activation, complement cascade, ATM, Rb, angiogenic, MAPK, ECM receptor and histone modification signalling.

MATERIALS AND METHODS

Data sources and processing

The RNA‐Seq data and associated metadata files were downloaded from the UCSC Xena Data Browser16(2016‐09‐03 version, TcgaTargetGtex_rsem_gene_tpm (TTG) data set) (Table 1).These contained transcript‐non‐specific expression data for all coding genes as well as for long non‐coding RNA (lncRNA), pseudogenes and other non‐coding transcripts with unique Ensembl ENSG identifiers.17 The TTG data set quantifies gene expression as and were converted to for this analysis. The BioMart 18 database was used to extract genes having ENSG identifiers annotated with the protein_coding biotype. This eliminated 40,826 (67.5%) non‐coding entries leaving 19,672 protein‐coding entries (TTG‐C data set, Figure 1A, Table 1). The TTG‐C data set was then reduced to cancers that had corresponding normal samples and vice versa to create the T‐C‐PS and N‐C‐PS data sets, respectively (Table 1). Primary sites of uncertain histological equivalence between tumour and normal samples (eg blood cancers) or with sample numbers below 20 in either cancer or normal data sets were excluded.

Table 1

Characteristics of the data sets used in this study

Data set	Abbreviated name	Sample types	Number of samples	Number of primary sites	Number of genes
TcgaTargetGtex_rsem_gene_tpm	TTG	Tumour and normal	19,109	47	60,498*
TCGA‐TARGET‐GTEx_coding	TTG‐C	Tumour and normal	19,109	47	19,672
TTG_coding_common_primary	TTG‐C‐PS	Tumour and normal	12,166	15	19,672
Tumour_coding_common_primary	T‐C‐PS	Tumour	7,272	15	19,672
Normal_coding_common_primary	N‐C‐PS	Normal	4,894	15	19,672

The 60,498 ‘genes’ in the TcgaTargetGtex_gene_expected_count data set includes various species of non‐coding RNAs and pseudogenes which have unique Ensembl ENSG identifiers.

Figure 1

A flowchart depicting the analyses used in this study. Transcriptome profiles were first restricted to protein‐coding genes (A), then two different primary site‐correction approaches were taken to analyse the three data sets in parallel (B). Each data set was analysed using WGCNA to identify groups of genes (modules) that were co‐correlated, and variable across cancers (C). Genes found in modules were put through pathway enrichment analysis (WebGestalt) and used for hierarchical clustering (D)

Characteristics of the data sets used in this study The 60,498 ‘genes’ in the TcgaTargetGtex_gene_expected_count data set includes various species of non‐coding RNAs and pseudogenes which have unique Ensembl ENSG identifiers. A flowchart depicting the analyses used in this study. Transcriptome profiles were first restricted to protein‐coding genes (A), then two different primary site‐correction approaches were taken to analyse the three data sets in parallel (B). Each data set was analysed using WGCNA to identify groups of genes (modules) that were co‐correlated, and variable across cancers (C). Genes found in modules were put through pathway enrichment analysis (WebGestalt) and used for hierarchical clustering (D)

Data normalization

Because non‐cancerous primary site‐specific gene expression might obscure cancer signatures, we used two methods to subtract non‐cancer expression data. We analysed data before and after correction (Figure 1B). We used two binary primary site classification matrices: , a matrix of cancer primary sites, and , a matrix of normal tissue primary sites. q and r are the number of cancer and normal samples, respectively, and t is the number of primary sites. We used two gene expression matrices:C, a matrix of cancer gene expression from the T‐C‐PS data set (Table 1), and , a matrix of normal tissue gene expression from the N‐C‐PS data set (Table 1). and are the number of cancer and normal samples, respectively, and is the number of genes. For a given cancer expression vector of gene in matrix , and for a binary classification vector for primary site in matrix , we derived the vector of tissue‐specific cancer gene expression by multiplying these two vectors: For a normal tissue, given the expression vector for gene in matrix and the binary classification vector for primary site in matrix , we derived the vector of tissue‐specific normal tissue gene expression by multiplying these two vectors: By calculating and or all primary sites and all genes, we created a series of vectors that form the two three‐dimensional matrices and . is the TPM gene expression value for gene in cancer of primary site . is the TPM gene expression value for gene in normal tissue of primary site . The tissue normal‐corrected data set (subsequently called ‘tissue‐corrected’) was calculated by first defining the mean normal expression for gene at each primary site as:Where is the number of normal tissue samples in primary site , is calculated as: The tissue‐corrected gene expression matrix was calculated as: The grand normal‐corrected data set (subsequently called ‘grand mean‐corrected’) was calculated by partitioning matrix by both the total number of primary sites, and the number of cancers within each primary site, r: was calculated as before. Finally, the grand mean‐corrected gene expression matrix as calculated as:

Gene selection for clustering analysis

For clustering analysis, the genes profiled were restricted firstly to protein‐coding genes because mechanisms of tumorigenesis are currently better understood for the protein‐coding transcriptome (Figure 1A). Secondly, protein‐coding genes were restricted to ‘modules’ with expression values co‐correlated and variable across cancers using weighted gene co‐expression network analysis (WGCNA, Figure 1C).19 WGCNA was carried out using the WGCNA package in R (version 1.68).20 The mean TPM values of all genes in a module were used to evaluate the expression of a module in a cancer.

Characterization of modules identified by WGCNA

Modules were characterized using the over representation analysis (ORA) in the WebGestaltR package (version 0.4.1, Figure 1C).21 ORA used all protein‐coding genes as a reference set, the WikiPathway 22 database for functional annotations and the Benjamini‐Hochberg method 23 for multiple testing correction. Modules were named using default WGCNA settings, which assign each module a colour. The module names were not changed after characterization due to the complexity of the functional enrichment.

Clustering by transcript profiling

Clusters of similar cancers were defined by hierarchical clustering 24 using the cosine distance 25 between the expression profiles of the genes included in the modules and Ward's method 26 for agglomeration (Figure 1C). The number of clusters was determined with the find_k function from the dendextend R package (version 1.12.0); this function estimates k using maximal average silhouette widths.27 Dendrograms were cut into k groups to assign cancers to a cluster.

RESULTS

To test the hypotheses that cancer‐inducing gene expression changes are detectable by RNA‐Seq and traverse cancer primary sites, we analysed the TTG data set (Table 1) and restricted it to tumour and corresponding normal data (Figure 1; Table 1). These data sets were normalized, analysed and stratified.

Hallmark cancer and tissue‐specific pathways distinguish cancer clusters in uncorrected data

Analysis of the uncorrected data set showed that subtle expression differences in both hallmark cancer and cancer‐unrelated, tissue‐specific pathways differentiated clusters. WGCNA categorized 991 genes into 17 modules. Eleven of those modules were enriched for functional pathway annotations: brown, cyan, green, grey60, light yellow, midnight blue, pink, purple, red, tan and turquoise. Eight modules were enriched for tissue‐specific processes: brown, cyan, green, light yellow, pink, red, tan and turquoise (Table S1, ORA, P ≤ .049). The remaining three modules were enriched for cancer‐relevant processes. The grey60 and purple modules were, respectively, enriched for natural killer (NK) cell signalling and T cell receptor (TCR) signalling (Table S1, ORA, P ≤ 2.9*10‐5), axes characteristic of the avoiding immune destruction hallmark. The midnight blue module was enriched for histone modification signalling (Table S1, ORA, P ≤ 10‐12), a component of the genome instability and mutation hallmark.1 Hierarchical clustering of the 991 genes in WGCNA modules resulted in four cancer clusters. Each cluster was characterized by significantly different expression of the cyan, grey60, light yellow, midnight blue, pink, purple, red, tan and turquoise modules (Figure 2A, Kruskal‐Wallis Test, P ≤ 10‐16). The brown and green modules did not show differential expression (Figure 2A, Kruskal‐Wallis Test, P ≥ .35). Post hoc analysis by Dunn's Test for pairwise differences in module expression between clusters showed significantly different expression for four of six pairwise cluster comparisons for the turquoise module, five of six comparisons for the purple, red and tan modules, and six of six comparisons for the cyan, grey60, light yellow, midnight blue and pink modules (Table S2).

Figure 2

Heat maps of module expression within cancer clusters. Heat maps are shown for (A) uncorrected, (B) tissue‐corrected and (C) grand mean‐corrected RNA‐Seq data. WGCNA‐identified modules (left colour bar) are composed of protein‐coding genes with TPM values co‐correlated and variable across cancers. For panel A, module expression is the mean of TPM + 1 values for all genes within a module. For panels B and C, module expression is ln(Tumour/Normal) as defined in the Methods. Clusters of similar tumours (numbered divisions across the top) were defined by hierarchical clustering using the cosine distance between the genes included in the modules and Ward's method for agglomeration. The anatomical primary sites of tumours are graphically portrayed by the colour bar along the top To investigate whether anatomical cancer primary site corresponded with cluster assignment, we evaluated the primary site composition of each cluster. All clusters were primary site heterogeneous. Cluster 1 was primarily composed of breast (26%), prostate (20%), ovary (18%) and kidney (12%) cancers (Figure 3A). Cluster 2 was predominantly composed of lung (37%) and breast (14%) cancers (Figure 3A). Cluster 3 was primarily composed of kidney (24%), liver (21%) and brain (16%) cancers (Figure 3A). Cluster 4 was primarily composed of kidney (21%), breast (20%) and prostate (11%) cancers (Figure 3A).

Figure 3

Graphs showing the distribution of each cancer type among clusters for (A) uncorrected, (B) tissue‐corrected and (C) grand mean‐corrected RNA‐Seq data. Count represents the number of tumours. Colours within the bars represent the cancer anatomical primary sites as described in the key Post hoc analysis by Tukey's Test determined pairwise differences in module expression between primary sites (Table S3). The brown module was expressed higher in uterine cancers than other sites (Figure 4A, Tukey HSD, P ≤ 3.2*10‐9). The light yellow module was expressed higher in breast cancers than other sites (Figure 4A, Tukey HSD, P ≤ 5.9*10‐6). The pink module was expressed higher in liver cancers than other sites (Figure 4A, Tukey HSD, P ≤ 2.0*10‐11). The red, tan and turquoise modules were expressed higher in brain cancers than other sites (Figure 4A, Tukey HSD, P ≤ 2.0*10‐11).The cyan module was expressed higher in stomach and prostate cancers than other sites (Figure 4A, Tukey HSD, P ≤ 7.6*10‐6). The green module was expressed higher in skin cancers than in prostate, liver, kidney, brain or breast cancers (Figure 4A, Tukey HSD, P ≤ .037).The expression of the grey60, midnight blue and purple modules differed between pairs of primary sites but without an appreciable pattern (Figure 4A, Tukey HSD, P ≤ .05).

Figure 4

Heat maps of the module expression for each anatomical primary site (colour bar along the top).Heat maps are shown for (A) uncorrected, (B) tissue‐corrected and (C) grand mean‐corrected RNA‐Seq data. A heat map showing expression of modules (vertical colour bar) identified by WGCNA for each anatomical primary site (horizontal colour bar). WGCNA‐identified modules (left colour bar) are composed of protein‐coding genes with TPM values co‐correlated and variable across cancers. For panel A, module expression is the mean of TPM + 1 values for all genes within a module. For panels B and C, module expression is ln(Tumour/Normal) as defined in the Methods

Hallmark cancer and tissue‐specific pathways distinguish cancer clusters in tissue‐corrected data

Because of the potential for cancer‐unrelated primary site‐specific pathways to obfuscate tumorigenic signatures, we repeated the analyses (Figure 1D) after correcting for tissue‐specific gene expression. This removed some, but not all, of the primary site‐specific pathways seen in the uncorrected data and introduced new primary site‐specific pathways (Table S4 vs Table S1). WGCNA identified 3573 genes distributed into 27 modules in the tissue‐corrected data. Of those modules, 16 were enriched for functional pathway annotations: black, blue, brown, cyan, dark green, green, green yellow, grey60, light cyan, light yellow, magenta, pink, purple, red, turquoise and white (Table S4, ORA, P ≤ .05). Of these 16 modules, 7 were enriched for tissue‐specific processes: black, blue, cyan, dark green, light cyan, light yellow and turquoise (Table S4). The remaining 9 modules were enriched for cancer‐relevant processes. The grey60, white and purple modules were enriched for mRNA splicing and translation (Table S4, ORA, P ≤ .004), processes globally dysregulated in cancer,28 although not a Hanahan and Weinberg described hallmark. The brown module was enriched for EGFR signalling (Table S4, ORA, P ≤ .05) and genes corresponding to mitogenic signalling axes (BRAF, ERK, CREB1, JAK2 and SOS2(Table S4)); components of the sustained proliferative signalling hallmark. The pink module included genes involved in mitogenic signalling axes (RICTOR, SOS1, MEKK2 and REL)(Table S4) and in histone modification (ARID4B, KAT6A, KDM6A, TET2, KMT2C, ASH1L and KMT2E)(Table S4); the former represents the sustained proliferative signalling hallmark and the latter the genome instability and mutation hallmark.1 The green module was enriched for processes related to cell cycle progression (Table S4), a characteristic of the evading growth suppressors hallmark. The green yellow, magenta and red modules were enriched for NK cell, T cell or inflammatory signalling (Table S4, ORA,P ≤ .05), markers of the tumour‐promoting inflammation and evading immune destruction hallmarks. Hierarchical clustering of the 3573 genes in WGCNA modules detected 10 clusters characterized by distinct expression of 10 modules (Figure 2B, Kruskal‐Wallis Test, P ≤ 2.2*10‐16). Post hoc analysis by Dunn's Test to assess pairwise differences in module expression showed differential expression for 38 of 45 cluster comparisons for the blue and purple modules, 39 of 45 comparisons for the turquoise module, 40 of 45 comparisons for the black and dark green modules, 41 of 45 comparisons for the cyan, green yellow, grey60, light cyan and light yellow modules, 42 of 45 comparisons for the brown, green, magenta, red and white modules, and 44 of 45 comparisons for the pink module (Table S5). The high proportion of pairwise cluster comparisons with significant difference reinforces the distinctive expression patterns of each module across clusters. To investigate whether anatomical cancer primary site corresponded with cluster assignment, we evaluated the primary site composition of each cluster. Clusters 1, 2, 3 and 5 were primary site heterogeneous. Cluster 1 was primarily composed of bladder (66%), testis (24%) and uterine (9%) cancers (Figure 3B). Cluster 2 was composed of stomach (47%), colon (33%) and oesophagus (20%) cancers (Figure 3B). Cluster 3 was predominantly composed of prostate (61%), lung (14%) and breast (13%) cancers (Figure 3B). Cluster 5 was composed of liver (67%) and pancreas (33%) cancers (Figure 3B). Clusters 4, 6, 7, 8, 9 and 10 were ≥ 99.9% a single primary site: skin, kidney, ovarian, lung, brain and breast cancers, respectively (Figure 3B). The primary site homogeneity of clusters suggests either that correction for primary site signatures in this data set was incomplete or that the detected cancer signatures are primary site‐dependent. Post hoc analysis by Tukey's Test determined pairwise differences in module expression between primary sites (Table S6). The expression of the black, blue, brown, cyan, dark green, green, green yellow, grey60, light cyan, light yellow, magenta, pink, purple, red, turquoise and white modules differed between primary sites but without an appreciable pattern (Figure 4B, Tukey HSD, P ≤ .05).

Hallmark cancer pathways primarily distinguish cancer clusters in grand mean‐corrected data

Given modules enriched for normal tissue processes in the tissue‐corrected data, we corrected for non‐cancer gene expression using the grand mean expression of each gene in all non‐cancer primary sites. This stratified tumours according to some hallmarks. WGCNA identified 1084 genes in 17 modules, of which 12 were enriched for functional pathway annotations: black, blue, cyan, green, green yellow, light cyan, magenta, pink, salmon, tan, turquoise and yellow (Table S7, ORA, P ≤ .05). Of those 12 modules, 4 were enriched for tissue‐specific processes: blue, green yellow, magenta and yellow (Table S7, ORA, P ≤ .05). The remaining 8 modules were enriched for cancer‐relevant processes. The tan module was enriched for markers of angiogenesis (Table S7, ORA, P ≤ .01), that is the inducing angiogenesis hallmark. The turquoise module was enriched for cell cycle progression pathways (Table S7, ORA, P ≤ .03), a component of the evading growth suppressors hallmark. The cyan and light cyan modules were, respectively, enriched for markers of the epithelial to mesenchymal transition (EMT) and extracellular matrix (ECM) receptor and adhesion signalling (Table S7, ORA,P ≤ .02), components of the activating invasion and metastasis hallmark. The black, green, pink and salmon modules were enriched for NK cell, T cell or inflammatory signalling (Table S7, ORA,P ≤ .05), components of the tumour‐promoting inflammation and evading immune destruction hallmarks. Hierarchical clustering of the 1,084 genes in grand mean‐corrected WGCNA modules defined 10 clusters. These clusters were characterized by distinct expression of modules (Figure 2C). All modules showed differential expression across clusters (Figure 2C, Kruskal‐Wallis Test, P ≤ 2.2*10‐16). Post hoc analysis by Dunn's Test for pairwise differences in module expression showed differential expression for 31 of 45 cluster comparisons for the magenta module, 37 of 45 for the black module, 38 of 45 for the salmon module, 40 of 45 for the green yellow, pink and tan modules, 41 of 45 for the turquoise and yellow modules, 42 of 45 for the light cyan module, 43 of 45 for the blue and green modules, and 44 of 45 for the cyan module (Table S8). To investigate whether anatomical cancer primary site corresponded with cluster assignment, we evaluated the primary site composition of each cluster. Clusters 1, 3 and 5 were primary site heterogeneous. Cluster 1 was primarily composed of ovarian, (56%), testis (19%) and kidney (16.8%) cancers (Figure 3C). Cluster 3 was predominantly composed of lung (33%), bladder (28%) and breast (18%) cancers (Figure 3C). Cluster 5 was primarily composed of stomach (42%), colon (33%) and pancreas (17%) cancers (Figure 3C). Post hoc analysis by Tukey's Test determined pairwise differences in module expression between primary sites (Table S9). The green yellow module was expressed higher in kidney and liver cancers than other primary sites (Figure 4C, Tukey HSD,P ≤ 2.0*10‐11). The magenta module was expressed higher in skin cancers than in prostate, pancreas, ovary, lung, liver, kidney, brain, colon and breast cancers (Figure 4C, Tukey HSD,P ≤ .02). The expression of black, blue, cyan, green, light cyan, pink, salmon, tan, turquoise and yellow modules differed for many pairwise comparisons of primary sites but did not have an appreciable pattern (Figure 4C, Tukey HSD, P ≤ .05).

Clusters are incompletely primary site‐independent

To evaluate whether cancer clusters express hallmarks independently of primary site, we assessed the stratification of a primary site across clusters and the primary site diversity within clusters. For the former, we counted the number of clusters in which the null hypothesis of the hypergeometric test was rejected (P ≤ .05, Table S10). Primary sites with 2 or more clusters that rejected that null hypothesis were considered stratified across clusters. We observed that in the uncorrected data breast, oesophagus, kidney, ovary, prostate, stomach and uterine cancers were stratified by their gene expression profiles (Table 2), that in the tissue‐corrected data no cancer types were stratified by their gene expression profiles (Table 2), and that in the grand mean‐corrected data breast, oesophagus and lung cancers were stratified by their gene expression profiles (Table 2).

Table 2

The number of clusters with more tumours from a primary site than expected by chance as determined by the hypergeometric test

Data set	Bladder	Brain	Breast	Colon	Oesophagus	Kidney	Liver	Lung	Ovary	Pancreas	Prostate	Skin	Stomach	Testis	Uterus
UC	1	1	2	1	2	2	1	1	2	1	2	1	2	1	2
TC	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
GC	1	1	2	1	2	1	1	2	1	1	1	1	1	1	1

Abbreviations: UC, uncorrected data; TC, tissue‐corrected data; GMC, grand mean‐corrected data

The number of clusters with more tumours from a primary site than expected by chance as determined by the hypergeometric test Abbreviations: UC, uncorrected data; TC, tissue‐corrected data; GMC, grand mean‐corrected data

DISCUSSION

By WGCNA, hierarchical clustering and ORA analyses, RNA‐Seq detects gene expression changes contributing to some cancer hallmarks. Of the ten hallmarks identified by Hanahan and Weinberg, our analyses detected modules enriched for seven: evading growth suppressors, tumour‐promoting inflammation, avoiding immune destruction, inducing angiogenesis, sustained proliferative signalling, activating invasion and metastasis, and genome instability and mutation. The gene expression changes corresponding to these hallmarks stratify a subset of cancers across clusters, although not fully independent of tumour primary site.

RNA‐Seq data consistently stratifies cancers by seven therapeutically targetable hallmarks of cancer

Consistent with prior studies,29 our analyses showed enrichment for pathways representative of the sustained proliferative signalling hallmark. The common perturbation of mitogenic signalling axes such as MAPK or PI3K‐Akt‐mTOR has led to development of inhibitors of those axes, although those inhibitors have not been potent single‐agents,30, 31 therapies targeting cell cycle progression, the end‐point of proliferation cascades commonly activated in cancer,29, 32 are being actively investigated. CDK4/6 inhibitors have shown efficacy in trials for late stage breast and lung cancers.33, 34 The detection of modules enriched for cell cycle pathways represent the evading growth suppressors hallmark. Expression of genes such as CDKN2A, CCNE1 and RB1, which are all components of cell cycle signalling, has been implicated in resistance to CDK4/6 inhibitors.33This illustrates that expression assays might have utility detecting biomarkers for resistance to therapies targeting the sustained proliferative signalling hallmark. Components of the tumour‐promoting inflammation and the avoiding immune destruction hallmarks were enriched in several modules and were detected in all data sets (Tables S1, S4, S7).Because only a minority of patients within a given cancer type respond to CD8+T cell dependent cancer immunotherapy,35induction of CD8+T cell recruitment and activation 36or inclusion of innate immune processes such as NK‐cell activation are being developed.37 This expansion requires characterization of biomarkers defining an antitumoral immune response. The presence of T cell activation, NK‐cell activation and inflammatory signalling axes in our analysis suggests that gene expression assays might contribute such biomarkers. Our analyses detected modules enriched for angiogenesis hallmark related processes. Although clinical targeting of the VEGF signalling axis frequently induces resistance38 that limits it as a monotherapy, expression of VEGF and its receptors correlates with cancer stage and metastasis and might be a useful prognostic indicator.39 The variation in the angiogenic signalling detected in our study divides breast, kidney and colon cancers into high and low expression groups (Figure 4C), a division that might not only be useful as a staging marker but also for identifying tumours likely to respond to antiangiogenic therapy. Detection of markers of EMT and ECM signalling axes processes of the activating invasion and metastasis hallmark suggests that gene expression assays might contribute to the individualization of future antimetastatic drug cocktails. Despite limitations, there are current therapeutic strategies to inhibit the metastatic potential including targeting VEGF, the NF‐κB pathway and integrin signalling.40 Chromatin remodelling pathways, which we detected, are both components of the genome instability and mutation hallmark1 and therapeutic targets. Inhibitors of chromatin remodelling have been in use for over a decade, although primarily for leukemias and lymphomas.41 Because nearly half of cancers have alterations of chromatin remodelling, several current trials target aspects of chromatin remodelling in solid cancers.41 Grouping cancers by their precise mechanisms of dysregulated chromatin remodelling assists selecting appropriate therapies, and our analyses suggest gene expression assays might assist with this. The detection of multiple hallmarks by gene expression analyses highlights a potential for RNA‐Seq to identify therapeutic combinations as our analyses subgrouped some cancers according to expression of multiple modules. For example, kidney cancers subdivided into three subgroups: (a) high expression of genes enriched for TCR signalling and for angiogenic signalling, (b) low expression of both TCR signalling and angiogenic signalling genes, and (c) high expression of TCR signalling and low expression of angiogenic signalling genes (Figure 4C). Although not yet tested clinically, such subgroups could provide useful as biomarkers for multimodal treatment of kidney cancer given that immunotherapy and antimetastatic drugs would theoretically target subgroup 1, standard chemotherapies would target subgroup 2 and immunotherapies would target subgroup 3. Multimodal treatment targets multiple hallmarks concurrently because the strong selective pressure on cancer cell populations 42leads to resistance to monotherapies.43Additional multimodal therapies include dual inhibition of the mitogenic and cell cycle signalling pathways,44CDK4/6 inhibitors plus immunotherapy 45 and VEGF inhibition plus multiple classes of antimetastatic therapies,40 which, respectively, correspond to the sustained proliferative signalling, evading growth suppressors, inducing angiogenesis, and activating invasion and metastasis hallmarks of cancer. The analyses herein subdivided some tumours according to combination of these hallmarks.

RNA‐Seq data does not stratify cancers by three hallmarks of cancer

Our analyses did not detect three of Hanahan and Weinberg's hallmarks as gene expression modules stratifying tumours. These hallmarks were resisting cell death, deregulating cellular energetics and enabling replicative immortality. Strong transcriptomic signatures are not expected for the sustained enabling replicative immortality hallmark. This hallmark is predominantly characterized by the expression of TERT.1, 46 TERT alone is insufficient as a transcriptomic network to be detected by our analyses.20 The deregulating cellular energetics hallmarks have a strong transcriptomic footprint.1, 47 The processes underlying this hallmark originate from changes in gene expression, namely, glucose transport,48 glutamine transport49 and the pentose phosphate pathway,50 as well as the biosynthesis of nucleotides,51 serine,52 arginine 53 and proline.54 Studies detecting these changes in gene expression, however, either did not use tumour biopsies as the source tissue or use more targeted methods than RNA‐Seq. In contrast to cultured tumour cells or xenografts of cultured tumour cells, our analyses agnostically probed tumour biopsies, a highly complex cell population,55 for gene expression signatures varying across samples. Consequently, we postulate that tissue heterogeneity introduces too much biological variability or that our assumption of variability across cancers is invalid. Supporting the latter hypothesis, prior studies show consistent expression of metabolic genes across cancer types,56 and we find that genes with the least variable expression are enriched for metabolic pathway annotations (Table S11). Like the cellular energetics hallmark, the resisting cell death hallmark has a strong transcriptomic footprint. It is characterized by the subversion of the regulatory and functional elements of the cellular apoptosis machinery.1 Cancer cells impair apoptosis by decreasing expression of proapoptotic proteins or by increasing expression of antiapoptotic proteins.57 The specific family members up or down‐regulated are, however, relatively cancer type‐specific.57 Consequently, although gene expression networks involved with apoptosis are altered, specific cancers usually have changes in only a few genes in that network,57 and our methodology is insensitive to such limited changes.

Clusters defined by hallmark gene expression are incompletely independent of cancer primary sites

Analyses of the uncorrected data set showed that brain, oesophageal, ovarian, prostate, stomach and uterine cancers were stratified across clusters (Table 2). Due to the presence of modules enriched for non‐cancer processes (Table S1) and the lack of distinct expression of modules enriched for hallmarks across clusters (Figure 2), we are not certain that hallmark cancer signatures solely underlie that stratification. Suggesting the dependence of clustering on the anatomical primary site, analyses of the tissue‐corrected data set found that no cancer types were stratified across clusters (Table 2). The normalization for the tissue‐corrected data set used separate ‘normal’ gene expression vectors for each primary site and that process could introduce signatures by over‐correction. Supporting this, comparison to the uncorrected data shows the concurrent elimination of modules enriched for non‐cancer processes in the uncorrected data and the detection of other modules enriched for non‐cancer processes (Tables S1, S4). Analyses of the grand mean‐corrected data set showed that breast, oesophageal and lung cancers were stratified across clusters (Table 2), suggesting that clusters are incompletely independent of primary site. Although there are four modules enriched for non‐cancer processes, the expression of those modules only distinguishes three of ten clusters (Figure 2C). The modules enriched for hallmarks are the primary differentiators of clusters responsible for stratifying cancers. This stratification is unlikely to be an artefact of the normalization process since the use of a uniform ‘normal’ gene expression vector for all cancers could introduce consistent signatures that would not be considered in module detection by WGCNA. This raises the promise that clustering by expression of genes relevant to cancer hallmarks stratifies cancers to provide prognostically relevant information or therapeutically relevant information, particularly in conjunction with histopathologic, DNA or proteomic data.

Biological processes identified in this study align with previous literature

Previous investigations into transcriptomic subdivisions of cancer observed several of the pathways that we identified. Specifically, pancreatic,9 breast 10 and pan‐cancer 13 studies found that cell cycle pathways define transcriptomic subgroups and that immune signalling defines subgroups of pancreatic 9 and breast 10 cancers. In contrast to the pan‐cancer study of Kaczowski et al did,13 our analysis did not detect differential expression of genes relevant to the cellular energetics hallmark; we hypothesize, as discussed above, that this arose because our analysis of co‐correlated groups of transcripts is insensitive to small numbers of transcripts with altered expression, whereas the approach of Kaczowski et al is not so limited because it focuses on differential expression of individual transcripts.13 On the other hand, detecting expression changes in the chromatin remodelling, angiogenesis and ECM signalling axes, our analysis distinguished molecular subtypes that were not noted in studies of the pancreatic,9 breast 10 or pan‐cancer 13 data sets.

Limitations

The detection of some cancer hallmark signatures is encouraging, given the conservative approach (requiring an expression signature across all cancers) and the following limitations of our study. First, the data set did not contain isoform‐specific expression, and this prevented the incorporation of alternatively spliced transcripts into our analysis. Alternatively, spliced transcripts play important roles in cancer cell biology 58 and are relevant to all hallmarks of cancer.28 Second, although dysregulation of non‐coding RNAs is integral to cancer biology 59 and play a therapeutically relevant role,60 we restricted our analysis to the protein‐coding transcriptome to facilitate pathway enrichment analysis. Third, our analysis did not account for therapeutically relevant gene fusions detectable by RNA‐Seq.61 Fourth, we did not assess allele‐specific expression, which is relevant to cancer biology and progression.62 Fifth, to mitigate the effects of spurious expression differences, we did not consider modules of single or small numbers of genes20; this might be addressed in the future by the inclusion of spike‐in reference RNAs.63 Sixth, for several technical reasons, our analysis was indifferent to tumour microenvironments (TMEs) which are known to modify processes (proliferation, metastasis and interaction with immune cells 64) underlying cancer hallmarks. Although these limitations decreased the sensitivity of our analysis to hallmark changes, they do not alter the specificity of our analysis.

CONCLUSIONS

RNA‐Seq detects some hallmarks of cancer and those hallmarks stratified some, but not all, cancer types. We consistently identified signatures corresponding to the tumour‐promoting inflammation and avoiding immune destruction hallmarks (T cell activation, NK‐cell activation and complement cascade activation), the evading growth suppressors (ATM, Rb and G1 to S phase transition signalling), the inducing angiogenesis (angiogenesis signalling), the sustained proliferative signalling (BRAF‐ERK‐CREB1), the activating invasion and metastasis (ECM receptor signalling and EMT markers), and the genome instability and mutation (histone modification) hallmarks. Additionally, cancer clusters differentiated by the above hallmarks stratified breast, oesophageal and lung cancers, highlighting the possibility of targeting transcriptomic features independent of anatomical primary site. Future studies are required to determine the therapeutic and prognostic relevance of these findings and to assess the impact of including transcriptomic features that we did not analyse.

CONFLICTS OF INTEREST

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS

CB conceived the hypothesis tested in this study. GF and PC developed the methodology to test that hypothesis. All code for data analysis and processing was developed by GF. GF wrote the manuscript with input and revisions provided by CB, PC and SM. SM and CB aided the clinical and biological relevance of conclusions drawn from data analysis. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.

59 in total

Review 1. Genome-wide allele-specific analysis: insights into regulatory variation.

Authors: Tomi Pastinen
Journal: Nat Rev Genet Date: 2010-06-22 Impact factor: 53.242

Review 2. Telomeres and human disease: ageing, cancer and beyond.

Authors: Maria A Blasco
Journal: Nat Rev Genet Date: 2005-08 Impact factor: 53.242

3. Reprogramming of proline and glutamine metabolism contributes to the proliferative and metabolic responses regulated by oncogenic transcription factor c-MYC.

Authors: Wei Liu; Anne Le; Chad Hancock; Andrew N Lane; Chi V Dang; Teresa W-M Fan; James M Phang
Journal: Proc Natl Acad Sci U S A Date: 2012-05-21 Impact factor: 11.205

Review 4. The VEGF pathway in cancer and disease: responses, resistance, and the path forward.

Authors: Mark W Kieran; Raghu Kalluri; Yoon-Jae Cho
Journal: Cold Spring Harb Perspect Med Date: 2012-12-01 Impact factor: 6.915

5. Efficacy and Safety of Abemaciclib, an Inhibitor of CDK4 and CDK6, for Patients with Breast Cancer, Non-Small Cell Lung Cancer, and Other Solid Tumors.

Authors: Amita Patnaik; Lee S Rosen; Sara M Tolaney; Anthony W Tolcher; Jonathan W Goldman; Leena Gandhi; Kyriakos P Papadopoulos; Muralidhar Beeram; Drew W Rasco; John F Hilton; Aejaz Nasir; Richard P Beckmann; Andrew E Schade; Angie D Fulford; Tuan S Nguyen; Ricardo Martinez; Palaniappan Kulanthaivel; Lily Q Li; Martin Frenzel; Damien M Cronier; Edward M Chan; Keith T Flaherty; Patrick Y Wen; Geoffrey I Shapiro
Journal: Cancer Discov Date: 2016-05-23 Impact factor: 39.397

6. Transcriptome Analysis of Recurrently Deregulated Genes across Multiple Cancers Identifies New Pan-Cancer Biomarkers.

Authors: Bogumil Kaczkowski; Yuji Tanaka; Hideya Kawaji; Albin Sandelin; Robin Andersson; Masayoshi Itoh; Timo Lassmann; Yoshihide Hayashizaki; Piero Carninci; Alistair R R Forrest
Journal: Cancer Res Date: 2015-11-09 Impact factor: 12.701

7. Functional genomics reveal that the serine synthesis pathway is essential in breast cancer.

Authors: Richard Possemato; Kevin M Marks; Yoav D Shaul; Michael E Pacold; Dohoon Kim; Kıvanç Birsoy; Shalini Sethumadhavan; Hin-Koon Woo; Hyun G Jang; Abhishek K Jha; Walter W Chen; Francesca G Barrett; Nicolas Stransky; Zhi-Yang Tsun; Glenn S Cowley; Jordi Barretina; Nada Y Kalaany; Peggy P Hsu; Kathleen Ottina; Albert M Chan; Bingbing Yuan; Levi A Garraway; David E Root; Mari Mino-Kenudson; Elena F Brachtel; Edward M Driggers; David M Sabatini
Journal: Nature Date: 2011-08-18 Impact factor: 49.962

8. CDK4/6 Inhibition Augments Antitumor Immunity by Enhancing T-cell Activation.

Authors: Jiehui Deng; Eric S Wang; Russell W Jenkins; Shuai Li; Ruben Dries; Kathleen Yates; Sandeep Chhabra; Wei Huang; Hongye Liu; Amir R Aref; Elena Ivanova; Cloud P Paweletz; Michaela Bowden; Chensheng W Zhou; Grit S Herter-Sprie; Jessica A Sorrentino; John E Bisi; Patrick H Lizotte; Ashley A Merlino; Max M Quinn; Lauren E Bufe; Annan Yang; Yanxi Zhang; Hua Zhang; Peng Gao; Ting Chen; Megan E Cavanaugh; Amanda J Rode; Eric Haines; Patrick J Roberts; Jay C Strum; William G Richards; Jochen H Lorch; Sareh Parangi; Viswanath Gunda; Genevieve M Boland; Raphael Bueno; Sangeetha Palakurthi; Gordon J Freeman; Jerome Ritz; W Nicholas Haining; Norman E Sharpless; Haribabu Arthanari; Geoffrey I Shapiro; David A Barbie; Nathanael S Gray; Kwok-Kin Wong
Journal: Cancer Discov Date: 2017-11-03 Impact factor: 38.272

9. Gateways to the FANTOM5 promoter level mammalian expression atlas.

Authors: Marina Lizio; Jayson Harshbarger; Hisashi Shimoji; Jessica Severin; Takeya Kasukawa; Serkan Sahin; Imad Abugessaisa; Shiro Fukuda; Fumi Hori; Sachi Ishikawa-Kato; Christopher J Mungall; Erik Arner; J Kenneth Baillie; Nicolas Bertin; Hidemasa Bono; Michiel de Hoon; Alexander D Diehl; Emmanuel Dimont; Tom C Freeman; Kaori Fujieda; Winston Hide; Rajaram Kaliyaperumal; Toshiaki Katayama; Timo Lassmann; Terrence F Meehan; Koro Nishikata; Hiromasa Ono; Michael Rehli; Albin Sandelin; Erik A Schultes; Peter A C 't Hoen; Zuotian Tatum; Mark Thompson; Tetsuro Toyoda; Derek W Wright; Carsten O Daub; Masayoshi Itoh; Piero Carninci; Yoshihide Hayashizaki; Alistair R R Forrest; Hideya Kawaji
Journal: Genome Biol Date: 2015-01-05 Impact factor: 13.583

10. Mining biological pathways using WikiPathways web services.

Authors: Thomas Kelder; Alexander R Pico; Kristina Hanspers; Martijn P van Iersel; Chris Evelo; Bruce R Conklin
Journal: PLoS One Date: 2009-07-30 Impact factor: 3.240

13 in total

1. Identifying common transcriptome signatures of cancer by interpreting deep learning models.

Authors: Anupama Jha; Mathieu Quesnel-Vallières; David Wang; Andrei Thomas-Tikhonenko; Kristen W Lynch; Yoseph Barash
Journal: Genome Biol Date: 2022-05-17 Impact factor: 17.906

2. PCMT1 Is a Potential Prognostic Biomarker and Is Correlated with Immune Infiltrates in Breast Cancer.

Authors: Jufang Guo; Xuelian Du; Chaolin Li
Journal: Biomed Res Int Date: 2022-04-30 Impact factor: 3.246

3. Quantitative Analysis of Differential Expression of HOX Genes in Multiple Cancers.

Authors: Orit Adato; Yaron Orenstein; Juri Kopolovic; Tamar Juven-Gershon; Ron Unger
Journal: Cancers (Basel) Date: 2020-06-14 Impact factor: 6.639

4. Alternative splicing of leptin receptor overlapping transcript in osteosarcoma.

Authors: Emel Rothzerg; Xuan D Ho; Jiake Xu; David Wood; Aare Märtson; Katre Maasalu; Sulev Kõks
Journal: Exp Biol Med (Maywood) Date: 2020-08-12

5. Pan-cancer RNA-seq data stratifies tumours by some hallmarks of cancer.

Authors: F Graeme Frost; Praveen F Cherukuri; Samuel Milanovich; Cornelius F Boerkoel
Journal: J Cell Mol Med Date: 2019-11-15 Impact factor: 5.310

6. Establishing a consensus for the hallmarks of cancer based on gene ontology and pathway annotations.

Authors: Yi Chen; Fons J Verbeek; Katherine Wolstencroft
Journal: BMC Bioinformatics Date: 2021-04-06 Impact factor: 3.169

7. Pancancer Analysis of Neurovascular-Related NRP Family Genes as Potential Prognostic Biomarkers of Bladder Urothelial Carcinoma.

Authors: Chao Deng; Hang Guo; Dongliang Yan; Tao Liang; Xuxiao Ye; Zuowei Li
Journal: Biomed Res Int Date: 2021-04-15 Impact factor: 3.411

8. Construction and Clinical Translation of Causal Pan-Cancer Gene Score Across Cancer Types.

Authors: Shiyue Tao; Xiangyu Ye; Lulu Pan; Minghan Fu; Peng Huang; Zhihang Peng; Sheng Yang
Journal: Front Genet Date: 2021-12-23 Impact factor: 4.599

9. Mapk14 is a Prognostic Biomarker and Correlates with the Clinicopathological Features and Immune Infiltration of Colorectal Cancer.

Authors: Dan Wang; Li Peng; Li Hua; Jiaxiang Li; Yifei Liu; Yanhong Zhou
Journal: Front Cell Dev Biol Date: 2022-01-24

10. miRNA-Mediated Low Expression of EPHX3 Is Associated with Poor Prognosis and Tumor Immune Infiltration in Head and Neck Squamous Cell Carcinomas.

Authors: Shun Ding; Qichao Hong; Tingting Duan; Zhengyang Xu; Qing He; Dongqin Qiu; Lin Li; Jingren Yan; Qimeng Zhang; Zhonglin Mu
Journal: J Oncol Date: 2022-04-01 Impact factor: 4.375