Literature DB >> 32300588

Identification of Pan-Cancer Prognostic Biomarkers Through Integration of Multi-Omics Data.

Ning Zhao¹, Maozu Guo^1,2,3, Kuanquan Wang^1,4, Chunlong Zhang⁵, Xiaoyan Liu⁴.

Abstract

Prognostic biomarkers dedicating to treat cancer are very difficult to identify. Although high-throughput sequencing technology allows us to mine prognostic biomarkers much deeper by analyzing omics data, there is lack of effective methods to comprehensively utilize multi-omics data. In this work, we integrated multi-omics data [DNA methylation (DM), gene expression (GE), somatic copy number alternation, and microRNA expression (ME)] and proposed a method to rank genes by desiring a "Score." Applying the method, cancer-specific prognostic biomarkers for 13 cancers were obtained. The prognostic powers of the biomarkers were further assessed by C-indexes (ranged from 0.76 to 0.96). Moreover, by comparing the 13 survival-related gene lists, seven genes (SLK, API5, BTBD2, PTAR1, VPS37A, EIF2B1, and ZRANB1) were found to be associated with prognosis in a variety of cancers. In particular, SLK was more likely to be cancer-related due to its high missense mutation rate and associated with cell adhesion. Furthermore, after network analysis, EPRS, HNRNPA2B1, BPTF, LRRK1, and PUM1 were demonstrated to have a broad correlation with cancers. In summary, our method has a better integration of multi-omics data that can be extended to the researches of other diseases. And the prognostic biomarkers had a better prognostic power than previous methods. Our results could provide a reference for translational medicine researchers and clinicians.

Entities: Disease Gene Species

Keywords: biomarker; multi-omics; pan-cancer; prognosis; survival

Year: 2020 PMID： 32300588 PMCID： PMC7142216 DOI： 10.3389/fbioe.2020.00268

Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN： 2296-4185

Introduction

Cancer is a major public health problem worldwide (Siegel et al., 2020) and the occurrence of cancer is caused by many factors. It is not only controlled by genetics and epigenetics, but also influenced by many other regulatory factors, such as miRNAs. A variety of regulatory factors contribute to the heterogeneity of cancer (Marusyk et al., 2012; Swanton, 2012; Burrell et al., 2013), which leads to a low cure rate and poor prognosis. Survival prediction provided a crucial evidence for the process of cancer diagnosis and treatment. Prognostic biomarkers are used to predict likelihood of recurrence or progression in patients with cancer (Cagney et al., 2018). However, it is still hard to identify the prognostic biomarkers of cancer accurately. Omics data play a key role in predicting prognostic biomarkers. At present, many researchers have identified prognostic biomarkers based on differential analysis of DNA methylation (DM) or other omics data, involving gene expression (GE), somatic copy-number alteration (SCNA) and microRNA expression (ME). Dalerba et al. (2016) found that CDX2 was a prognostic biomarker in stage II and stage III colon cancer by analyzing GE data. Zhao et al. (2017) identified eight differentially methylated CpGs as new prognostic biomarkers for prostate cancer by analyzing DM data. Morikawa et al. (2018) discovered that SCNAs in 8p11.21-22, 12p13.31, 20q13.2, 3q26.1, 4q13.2, and 22q11.23 were critical for the development and survival of ovarian clear cell carcinoma. Lindahl et al. (2018) developed a prognostic 3-miRNA classifier (miR-106b-5p, miR-148a-3p, and miR-338-3p) in early-stage mycosis fungoides. The advantage of omics data for identifying cancer-related prognostic biomarkers can be clearly seen in the studies mentioned above. However, each of these studies used only one type of omics data, which did not make full use of omics data. The regulation of GE is a complex process. Generally, the DNA hypermethylation in promoter region of genes could cause transcriptional silencing (Baylin, 2005) and DNA hypomethylation was associated with the activation of GE (Berdasco and Esteller, 2010). Besides, the copy number correlated positively with expression levels for genes (Fehrmann et al., 2015). Moreover, miRNAs complementary bound to messenger RNAs (mRNAs) and formed RNA-induced silencing complex (RISC) to downregulate GE levels (Bartel, 2004). The researches of cancer focusing on one-dimensional omics data may only provide limited information for the etiology of oncogenesis and tumor progression. In the past few years, more and more researches applied multi-omics data. Xu et al. (2019) proposed a method, named high-order path elucidated similarity (HOPES), to identify cancer subtypes by simultaneous interrogation multi-omics data. They utilized their method on GE, DM, and ME data of five TCGA cancers to identify subtypes and further validated reliability and clinical role of them. Vasaikar et al. (2018) developed a powerful database, named LinkedOmics, for analysis of omics data in cancer. LinkedOmics contained multi-omics data of 32 cancer types and allowed for flexible exploration and comparison of associations between multiple types of attributes within and across tumor types. The positive results of these researches confirmed the feasibility of integrating multi-omics data. Both of these work used multi-omics data for cancer research. However, they did not focus on prognostic markers, so we cannot further compare them numerically with our method. Similarly, integrating omics data indicated the potential benefits for discovering underlying prognostic markers in cancer (Huang et al., 2017). Using multi-omics data acquiring from the same set of samples has the potential capacity to expose more accurate biomarkers for patients’ survival than examining by one single-omics data (Rappoport and Shamir, 2018). Yuan et al. (2014) used somatic copy-number alteration (SCNA), DM, GE, ME, and protein expression data to predict survival status of patients. They found that incorporating molecular data with clinical variables improved the accuracy of survival prediction for cancers. This work provided a starting point and resources for the subsequent researches. Zhang et al. (2016) utilized GE, SCNA, ME, and DM data to uncover protein–protein subnetworks associated with prognosis. This work built a multi-dimensional subnetwork atlas for cancer prognosis to investigate the potential impact of multiple genetics and epigenetics better. Chaudhary et al. (2018) presented a deep learning based model on liver hepatocellular carcinoma (LIHC) that robustly differentiates survival subpopulations of patients using GE, DM, and ME data. They validated this multi-omics model on five external datasets of various omics types and all have good performance. Zhu et al. (2017) presented a kernel machine learning method to systematically quantify the prognostic values of clinical information, GE, SCNA, DM, and ME across 14 cancer types. This study aimed to compare the advantages and disadvantages of using different omics data to evaluate patients’ survival. Based on their result, GE and ME data were demonstrated to be the best data for the prognosis of cancers. Mishra et al. (2019) used DM, GE, ME, and long non-coding RNA (lncRNA) expression data to identify potential prognostic markers of pancreatic ductal adenocarcinoma. They identified several genes, miRNA, lncRNA, and CpG sites as probable prognostic biomarkers. All methods mentioned above used multi-omics data to perform prediction of patients’ survival. However, most of them did not integrate multi-omics data comprehensively but only utilized multi-omics data to explore mechanism of cancer separately. Moreover most of them they did not provide specifically prognostic biomarkers for other clinical researches or just aimed at limited kinds of cancers. The Cancer Genome Atlas (TCGA) provides multiple omics data for different cancers (Cancer Genome Atlas Research Network, 2011, 2012, 2013; Cancer Genome Atlas Research Network et al., 2016), which allows for analyzing multi-omics data coming from the same samples. So far, there already exist a variety of methods for predicting patients’ survival status using TCGA omics data. In this work, we put forward our own method to identify prognostic biomarkers and identified prognostic gene lists for 13 types of cancers. This work provided theoretical foundation and reliably prognostic biomarkers for other researches focusing on diagnosis, prognosis, and treatment of cancers.

Materials and Methods

Data

Multi-omics data were downloaded from TCGA. The scale and platform of each cancer data are shown in Table 1. We selected the cancers which had HM450K DM data, RNA-seq data (GE), miRNA-seq data (ME), and SNP 6.0 copy number data (SCNA) simultaneously and whose sample size was greater than 200. Samples with sample type codes of “01” were retained, which represented “Primary Solid Tumor.” After being filtered, there were 13 types of cancers available. For SCNA data, a matrix was obtained after being processed by Gistic 2.0 (Mermel et al., 2011). Next, all omics data matrixes except ME were converted into gene matrixes based on the annotation information from TCGA. Genes with missing values in > 5% of the samples were removed in each matrix. Moreover, for GE and ME data, we retained the genes or miRNAs with values greater than 0 in > 50% of the samples and with values greater than 1 in > 10% of the samples, respectively. After converting if one gene had multiple signals in one sample, we calculated the average of the values as the final signal. For ME, miRNAs were specifically bound to mRNAs by complementary base pairing, therefore the corresponding relationships between miRNAs and genes were obtained through the miRNA–mRNA interactions which were downloaded from the Starbase database (Yang et al., 2011). Interactions with no less than five supporting experiments and anti-correlation in no less than one cancer type were selected. Since multiple miRNAs were bound to the same gene, the average value of the miRNAs was assigned to the gene.

TABLE 1

The sample size of 13 types of cancers.

Cancer	Clinical	DM (450K)	SCNA (nocnv)	GE (FPKM-UQ)	ME (isoform)	Total size
Bladder urothelial carcinoma [BLCA]	291	412	410	408	409	283
Breast invasive carcinoma [BRCA]	735	783	1092	1091	1078	497
Cervical squamous cell carcinoma and endocervical adenocarcinoma [CESC]	177	307	295	304	307	166
Colon adenocarcinoma [COAD]	186	296	452	456	444	164
Head and neck squamous cell carcinoma [HNSC]	416	528	522	500	523	380
Kidney renal clear cell carcinoma [KIRC]	452	319	532	530	516	251
Kidney renal papillary cell carcinoma [KIRP]	183	275	289	288	291	172
Brain lower grade glioma [LGG]	347	516	515	511	512	340
Liver hepatocellular carcinoma [LIHC]	259	377	375	371	372	250
Lung adenocarcinoma [LUAD]	292	460	520	515	515	229
Lung squamous cell carcinoma [LUSC]	299	370	503	501	478	190
Sarcoma [SARC]	204	261	260	259	259	200
Stomach adenocarcinoma [STAD]	201	395	441	375	436	157

The sample size of 13 types of cancers. Because of different scales for the omics data, the data were normalized based on the following rules. First, each omics data were organized into a matrix of the same genes and samples, separately. Second, the method z-score was used to transform a matrix into standardized one with the mean and standard deviation of 0 and 1, respectively. Finally, we uniformly kept the fourth decimal place for better integration of the standardized data.

Screening of Candidate Survival-Related Genes

Univariate Cox proportional hazards regression model (Cox, 1986) was used to identify candidate survival-related genes from each omics data through the formula: where the explanatory variable X was the omics data (DM, GE, copy number variation, or miRNA expression) of a gene, and the response variable t was the survival time (Aalen, 1989). The proportional hazards regression model was calculated through the R package “survival.” β greater than zero meant the gene was a risk factor base on the corresponding omics data. Then using voting strategy, if a gene had a p-value of likelihood ratio test less than 0.05 (Yuan et al., 2014), the gene was denoted as “1”. Otherwise, it was denoted as “0.” Finally, a gene defined as a candidate survival-related gene should be marked as “1” in no less than two of the four omics data types (Figure 1A).

FIGURE 1

The workflow of survival-related genes identification. (A) Candidate survival-related gene screening. DNA methylation, gene expression, somatic copy-number alteration, and microRNA (miRNA) expression profiles of TCGA for the same samples were extracted. miRNA expression data were corresponding to genes according to miRNA–mRNA interactions. Then, we got four types of data in the same samples and the same genes. On each omics data, univariate Cox proportional hazards model was utilized to identify survival-related genes. Only the genes associated with survival in more than two types of data were considered to be candidate genes. (B) Prognostic biomarker identifying. For the selected candidate genes, the multivariate Cox proportional hazards model was then applied to get risk scores (RS). Further, scores for ranking genes were obtained by calculating GS scores. In which, A, B, C, and D were binary variables indicating whether the gene was survival-related at the four omics data or not (“1” for related and “0” for not), respectively. The high ranked genes were identified survival-related.

Identification of Prognostic Biomarkers

As shown in Figure 1B, prognostic biomarkers were further identified in the set of candidate survival-related genes. For each gene, a matrix M = [Omics,Omics,Omics,Omics] merged by the vectors of the four omics data of the gene was obtained. Then, the multivariate Cox proportional hazards model was applied on it. Briefly, the model assumed that a patient with covariate values has a cumulative hazard rate related to an unspecified baseline hazard rate seen in the equation: where h(t, M) was the patient’s hazard of death at time t, h0(t) was the baseline hazard rate, and B = [β1,β2,β3,β4] was a regression coefficient that gives the effect of each M covariate on the hazard rate (Alamartine et al., 1991). Each β could be interpreted as a risk coefficient (Collett, 2015). If the p-values of Cox fitting in all three overall tests (likelihood, Wald, and log-rank) were less than 0.05, the model was thought to be significant (Rodriguez-Martin et al., 2020). Therefore, we only kept genes whose all three p-values were less than 0.05. For the retained genes, each gene had a vector including the value of four types of omics data in each sample V = [v1,v2,v3,v4]. The risk score (RS) for the gene in each sample was then calculated: The RS score could be used to predict the patients’ risk. Thereafter, RS scores of the genes were used to calculate each gene’s score (GS): where m was the number of samples. At last, the scores of univariate and multivariate Cox proportional hazards model were combined to calculate the survival-related score of each gene (Score): where A, B, C, and D represented whether the gene was survival-related at the GE level, copy number level, DM level, and miRNA level, respectively (“1” meant related and “0” meant not). The higher the score, the more relevant between the gene and patients’ survival. Therefore, high score genes were identified as prognostic biomarkers.

Functional Analysis

Cumulative hypergeometric inspection was applied to enrichment analysis of Gene Ontology (GO) functions (Gene Ontology, 2015) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Kanehisa and Goto, 2000): where N was the whole number of genes. M was the number of genes on a term or a pathway. n was the intersection of interested gene set and N. i was the intersection of M and n. Significant threshold of hypergeometric test was set to P < 0.05 (Liu et al., 2018). For the enrichment analysis of GO, we only investigated the biological process (BP) terms.

Different Expression

In order to identify differentially expressed genes, the corresponding normal samples of LUSC and KIRC were downloaded from TCGA. There were 49 LUSC normal samples and 72 KIRC normal samples. After organized into the same gene set, the differentially expressed genes between tumor and normal samples were identified using the R package “samr” with the threshold value q-value < 0.05 and |log2(foldchange)| > 1 (Group et al., 2020). The package was based on significance analysis of microarrays (SAM). SAM was developed based on t-test and adjusted the p-value to assess the statistically significant changes for genes (Tusher et al., 2001).

Predictive Model Validation

For each cancer type, in order to evaluate the prognostic power of the biomarker fairly and accurately, the concordance index (C-index) (Harrell et al., 1996) was applied to assess the prognostic power of the classifier. The C-index was a non-parametric measure to quantify the discriminatory power of a predictive model with the value ranging from 0.5 to 1. A C-index of 1 represented perfect prediction accuracy, while C-index of 0.5 indicated a bad prediction like a random guess. First, we randomly selected 90% of the samples. Second, the Cox regression model was used to calculate the RS score for each sample by multi-omics data of the identified biomarker genes. Based on the RS score, samples were classified into high and low risk groups. Patients in the high risk group were more likely to have poor prognosis while patients in the low risk group were more likely to have good prognosis. Finally, the predicted outcomes for patients were compared with the real status to calculate the C-indexes. The procedure above was repeated 100 times to generate 100 C-indexes. If the median value of C-index was significantly higher than 0.5, indicating that the model had substantially prognostic power.

Decision Curve Analysis

Decision curve analysis was performed through the multi-omics data and every single omics data, respectively. The method was based on the principle that the relative harms of false positives (e.g., unnecessary biopsy) and false negatives (e.g., missed cancer) could be expressed in terms of a probability threshold (Vickers et al., 2008). Therefore, this threshold probability could be used to determine both whether a patient was defined as test-positive or negative and to model the clinical consequences of true and false positives using a clinical net benefit function: where n was the total number of patients in the study and p was the threshold probability. Net benefit was weighted by the relative harm of forgoing treatment compared with the negative consequences of an unnecessary treatment. In the decision curve, the thin oblique line represented the assumption that all patients have been treated. The black line represented the assumption that no patients have been treated.

Results

Pan-Cancer Prognostic Biomarker Identification

We integrated GE, SCNA, DM, and miRNA expression data of 13 cancers from TCGA: bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), brain lower grade glioma (LGG), LIHC, lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), sarcoma (SARC), and stomach adenocarcinoma (STAD). After data preprocessing, samples with all the four omics data were kept. Whereupon, we collected the DM, GE, copy number, and miRNA expression of 3279 samples (Figure 2A). The percentage of each cancer is shown in Figure 2B. We then summarized the clinical characteristics of the 3279 samples. As shown in Figure 2C, the majority of these patients were 60–79 years old. And the number of men and women was basically equal. Hence, the sample set could be used to study cancer without gender and age bias. In addition, most of the patients were white people. The complete clinical information for each sample is provided in Supplementary Table S1.

FIGURE 2

The information of pan-cancer samples. (A) The sample set intersections of the multi-omics data. Only the intersecting samples were chosen. We selected 3279 samples in this study. (B) The proportion of each cancer. (C) The clinical features distribution of the 3279 samples. The survival-related gene list of each cancer is shown in Supplementary Table S2. We took top-10 genes as a prognostic biomarker of each cancer (Table 2) to draw Kaplan–Meier (KM) curves and calculated their log-rank p-values. As shown in Figure 3, the prognostic markers for each cancer significantly distinguished the high and low risk groups, except for SARC.

TABLE 2

The prognostic biomarkers of each cancer.

Cancers	Genes
BLCA	NCBP1, RP9, UBP1, AURKB, ARL6IP5, LEMD3, HSPBP1, TMEM214, MCMBP, FAM107B
BRCA	LIMS1, NDUFA3, NKTR, SRP68, ARPC3, TMEM138, DDIT4, OCIAD1, MAF1, DPY19L4
CESC	HNRNPA2B1, MYO9B, EIF3B, MTX2, MON1B, SUN1, SSH1, SLC35E3, MAP7D1, PGAM5
COAD	SRP72, TAF10, USP1, USP8, JKAMP, YTHDF1, BRIX1, ATG101, VPS37A, TMED4
HNSC	DAGLA, DNAJA3, PTER, NUDT5, FAM168A, GTPBP4, PTCD3, MINDY3, NCALD, STEAP2
KIRC	BPTF, IMPA1, RPS6KA4, TSFM, CEPT1, PRKD3, PPME1, SPIRE1, NUDCD1, NHLRC2
KIRP	DAPK3, EPRS, IARS, SYMPK, PCSK7, APBB3, MSANTD2, TBC1D17, DOHH, CMSS1
LGG	FDXR, VPS4B, WASHC5, CITED2, BRD8, MON2, TSPAN13, MIOS, OGFOD3, PIGO
LIHC	BCL2L1, SELENOW, HIST1H2BN, MMADHC, PNPO, ZDHHC11, ULBP2, CSRNP2, SPC24, RPL7L1
LUAD	FGFR3, LTBP3, SLC6A4, PUM1, ARHGAP44, SLC39A1, NAGPA, BTBD2, LRRK1, ZFC3H1
LUSC	HSF2, BCLAF1, UHRF1BP1L, CHORDC1, CREBZF, FBXO30, PCGF6, PLCD3, HINT3, SLC35E2B
SARC	BMP1, NCAM2, PBX1, RAD17, ARHGEF10, PSD3, MRPL17, FAM160B2, CHMP7, VPS37A
STAD	ANK3, GNAI2, MARCKS, NEDD4, PRKAA1, UGP2, TAF1C, INO80D, USP37, FAM126B

FIGURE 3

The Kaplan–Meier curves of top-10 survival-related genes for each cancer. The green lines represented the low risk groups and the red lines represented the high risk groups. “ + ” indicated the censored follow-ups. (A) BLCA. (B) BRCA. (C) CESC. (D) COAD. (E) HNSC. (F) KIRC. (G) KIRP. (H) LGG. (I) LIHC. (J) LUAD. (K) LUSC. (L) SARC. (M) STAD.

The prognostic biomarkers of each cancer. The Kaplan–Meier curves of top-10 survival-related genes for each cancer. The green lines represented the low risk groups and the red lines represented the high risk groups. “ + ” indicated the censored follow-ups. (A) BLCA. (B) BRCA. (C) CESC. (D) COAD. (E) HNSC. (F) KIRC. (G) KIRP. (H) LGG. (I) LIHC. (J) LUAD. (K) LUSC. (L) SARC. (M) STAD. For each cancer type, we calculated the C-index which was a non-parametric measure to quantify the discriminatory power of a predictive model. Figure 4 shows the C-indexes of each cancer. All of the cancers had a C-index significantly higher than 0.5. BRCA had the highest C-index (0.96) while LUSC had the lowest (0.76).

FIGURE 4

The C-index comparison of the prognostic power of our prognostic biomarkers in 13 cancers.

The C-index comparison of the prognostic power of our prognostic biomarkers in 13 cancers. In order to discover the relationship among different cancers based on function, we used the prognostic biomarker genes to perform functional enrichment analysis of GO and KEGG (Supplementary Table S3). The most significantly enriched functions and pathways of each cancer are displayed in Figure 5A. Among them, COAD, LGG, and SARC were enriched in “endocytosis.” BLCA was enriched in “RNA splicing” and CESC was enriched in “mitophagy.” The prognostic biomarker genes were enriched in closely cancer-related functions. Then, we calculated the counts of each function enriched by cancers. As shown in Figure 5B, “Mitophagy” was enriched by the most cancers. Mitophagy was a tumor suppression mechanism (Bernardini et al., 2017). Besides, we had some interesting findings. First, the most significantly enriched functions of each cancer were their specific functions, while the common functions of cancers were not highly significant generally. For example, “cytoskeleton-dependent cytokinesis” was the common enriched function of STAD, KIRC, COAD, and BLCA, and they had p-values about 0.03 which was less significant than their most significantly enriched functions (p-values < 0.003). And their most significant functions were all their specific functions. Second, even if different cancers were enriched in a same function, the enrichment of function in different cancers was caused by different gene sets. For instance, “Mitophagy” was the common function of LIHC, LGG, KIRP, COAD, and CESC, but the function was hit by different genes (BCL2L1 of LIHC, CITED2 of LGG, TBC1D17 of KIRP, USP8 of COAD, and PGAM5 of CESC). Whereafter, we sought the intersection of associated functions for the 13 cancers (Figure 6A top right corner). The result showed that the intersection of LGG and SARC was the largest, followed by BLCA and CESC.

FIGURE 5

FIGURE 6

The intersection of pan-cancer genes. (A) The intersections of the lists of survival-related genes (left bottom) and the intersection of associated functions (top right corner) of cancers. The total numbers of genes associated with survival of each cancer were on the left, and the total associated functions were on the right. The color blocks represented the number of intersecting samples of each two cancers. The darker the color, the greater the intersection was. (B) The pan-cancer survival-related genes. The red blocks indicated that the gene was survival-associated with the cancer.

Pan-cancer functional comparison of survival-related genes. (A) The representative KEGG pathways and GO functions enriched by the top-10 prognostic genes of each cancer. (B) The distribution of cancers enriched to each function. The size of the dots represented the number of enriched genes. The color of the dots represented the p-values. The intersection of pan-cancer genes. (A) The intersections of the lists of survival-related genes (left bottom) and the intersection of associated functions (top right corner) of cancers. The total numbers of genes associated with survival of each cancer were on the left, and the total associated functions were on the right. The color blocks represented the number of intersecting samples of each two cancers. The darker the color, the greater the intersection was. (B) The pan-cancer survival-related genes. The red blocks indicated that the gene was survival-associated with the cancer. In order to discover the relationship among different cancers based on survival-related genes, we first compared the intersection of the survival-related gene lists between each two cancers. We found there was always an overlap between each two gene lists (Figure 6A left bottom). The intersection of KIRP and KIRC was the largest. All of the intersections among KIRC, KIRP, and BLCA were large, which might be due to the reason that the three cancers had the largest number of genes. The intersections with other cancers were roughly proportional to the size of the gene list. Second, we compared the gene lists among the 13 types of cancers. We found that seven genes were associated with survival in three kinds of cancers (Figure 6B). Subsequently, we downloaded the list of cancer-related genes from the Candidate Cancer Gene Database (CCGD) (Abbott et al., 2015), and retained the human genes that appeared in at least one of the COSMIC and CGC (Sondka et al., 2018). A total of 9265 genes were retained. All of the seven pan-cancer survival-related genes were in the list, and have been verified cancer-related in no less than one literature (Table 3). In addition, we investigated the functions of the seven genes (Supplementary Table S4) and conducted UCSC Genome Browser (Tyner et al., 2017) analysis on the seven genes. We found that SLK had an unconservative exon region, which containing four missense variants (Figure 7A).

TABLE 3

The seven pan-cancer survival-related genes.

Genes	Functions	PubMed IDs
SLK	Cell adhesion, regulation of cell cycle	26676752, 27849608, 22057237, 27247392, 22699621
ZRANB1	Protein catabolic process	26676752, 27790711, 25559195, 27006499, 24316982, 27178121, 23685747, 22421440
BTBD2	Protein catabolic process	26676752, 22057237
PTAR1	Cellular protein modification process	26676752
VPS37A	Viral process	27849608, 24316982, 23045694
EIF2B1	Glial cell development	22057237
API5	Regulation of cell death	27790711, 27178121

FIGURE 7

The characteristics of SLK. (A)The results of UCSC Genome Browser. (B) Distribution of mutations on SLK. (C) The functions of SLK. (D) The protein–protein interactions of SLK.

The seven pan-cancer survival-related genes. The characteristics of SLK. (A)The results of UCSC Genome Browser. (B) Distribution of mutations on SLK. (C) The functions of SLK. (D) The protein–protein interactions of SLK. In order to further verify the close relationship of SLK and cancer, we checked the mutation of SLK in the COSMIC database (Tate et al., 2019), and found that missense substitution occurred in 36.93% of the samples (Figure 7B). Next, enrichment analysis in GO terms (BP) was performed by Enrichr (Kuleshov et al., 2016) and found SLK was mainly associated with cell adhesion (Figure 7C). Finally, we used STRING database (Szklarczyk et al., 2019) to check the interacted proteins of SLK. Figure 7D shows there were 10 genes interacting with SLK. Half of the interactions have been demonstrated through more than one method, and the genes interacting with SLK also had strong relationship between each other. In order to explore the correlation among prognostic biomarkers of different cancers, we checked the genomic locations of these 130 genes (Figure 8A). There were many prognosis-related genes located in chr 6, chr 7, chr 8, chr 11, chr 12, and chr 17, while few genes in chr 13, chr 14, chr 18, chr 20, and chr 21. In addition, we constructed a protein–protein interaction network of these genes based on the STRING database (Figure 8B). As shown in the network, the prognosis-related genes of different cancers were connected to each other. The degree distribution (Figure 8C) and the betweeness centrality (Figure 8D) of the network satisfied the condition of scale-free network and were conformed as the general characteristics of biological network. In the network, the degree of EPRS had the highest degree of 33. The degrees of HNRNPA2B1, BPTF, LRRK1, and PUM1 were all greater than 20. These genes mentioned above were widely mutated in many cancers.

FIGURE 8

Pan-cancer survival-related gene networks. (A) The chromosome distribution of the genes. The blue, green, red, and purple blocks represented the survival correlation of the genes in each omics data, respectively. The links in the middle represented the interaction of the genes. (B) The interaction network among the top-10 survival-related genes. Different colors represented prognostic genes of different cancers. Red nodes represented genes prognosis-related in multiple cancers. (C) The degree distribution of the prognostic-related gene network. (D) The betweeness centrality of the prognostic-related gene network. Based on the COSMIC database, we found extensive mutations occurred in EPRS, and 70% of them were missense mutation. EPRS has been shown to be associated with a wide range of cancers by 80 articles. The other four genes also had widespread mutations. In the COSMIC database, 78, 114, 113, and 83 studies confirmed the correlation between HNRNPA2B1, BPTF, LRRK1, and PUM1 with cancer, respectively.

The Predictive Performance of Our Method

In order to demonstrate the effectiveness of our method, we compared our prognostic biomarkers with previous works. The works using TCGA data were chosen to compare with our work. Chaudhary et al. (2018) used LIHC data of TCGA in their work. Their C-indexs of training and testing set were 0.70(±0.04) and 0.69(±0.08), while our median C-index of LIHC was 0.82. The prognostic power of our method was stronger than theirs. Next, both Yuan et al. (2014) and Zhang et al. (2016) used KIRC and LUSC data of TCGA in their works, so we compared our results of these two cancers with their studies. The comparisons of the C-indexes are shown in Figures 9A,B, which showed the higher prognostic power of our 10-gene biomarkers. For KIRC, the median C-index of our work was 0.91. The median C-index of the best performing data (clinical + miRNA) in Yuan’s work and the best performing subnetwork (subnetwork K1) of Zhang’s study were about 0.76 and 0.74, respectively. For LUSC, the median C-index of our work was 0.76. The median C-index of the best performing data (clinical + protein) in Yuan’s work and the best performing subnetwork (subnetwork L1) of Zhang’s study were about 0.66 and 0.62, respectively. Therefore, the biomarkers identified by our method could display the better prediction for the patients’ survival.

FIGURE 9

The prognostic biomarkers of KIRC and LUSC. (A,B) The C-index comparison of the prognostic power of our 10-gene prognostic biomarkers and other work. (C,D) The heatmap of samples hierarchical clustering by the expression of the 10-gene prognostic biomarkers. The bar on the top of the heatmap indicated the group the samples really belong to. Red represented tumor and green represented normal. To further confirm reliability of the genes, we downloaded GE data of the corresponding normal samples and used the prognostic biomarkers to cluster the samples. The results showed that the prognostic biomarkers could distinguish the tumor and normal samples (Figures 9C,D). Furthermore, we screened the differentially expressed genes between tumor and normal samples of LUSC and KIRC. After comparing them with the list of survival-associated genes, there were 12 differentially expressed genes in LUSC list (Figure 10A) and 49 differentially expressed genes in KIRC list (Figure 10D). Subsequently, we examined the copy number variation and chromosome location of both the differentially expressed genes and the top-10 biomarker genes (Figures 10B,C,E,F). It turned out that among the 22 LUSC genes, six were located in chr 6q, three were in chr 10q, three were in chr 11q, and three were in chr 15q. Of the 59 KIRC genes, 10 were located in chr 8, eight were located in chr 17, and seven were in chr 1. These locations were the peak regions of copy number alternation, suggesting a relationship between these genes and cancer. Moreover, it could be seen that the driver genes of the two cancers were located in different chromosomes, which supported the uniqueness of different cancer-related genes.

FIGURE 10

The comparison of survival-related genes and differentially expressed genes. (A) The differentially expressed genes of LUSC. Red represented high expression and green represented low expression. Differentially expressed survival-related genes were marked. (B) The copy number variation peaks of LUSC. (C) Chromosomal positions and interactions of prognostic biomarkers and differentially expressed genes of LUSC. (D) The differentially expressed genes of KIRC. Red represented high expression and green represented low expression. Differentially expressed survival-related genes were marked. (E) The copy number variation peaks of KIRC. (F) Chromosomal positions and interactions of prognostic biomarkers and differentially expressed genes of KIRC. Moreover, in the process of univariate Cox regression model, we separately calculated the survival correlation of a gene in four different omics data and then counted them. We also tried the result of considering the same gene in different omics data as different features, and merged the four omics data into one matrix then performed multivariate Cox regression model on it. Only the genes identified as survival-related features more than twice were retained. Finally, the obtained genes were all involved in the gene lists identified through our method and had an incomplete coverage compared with our gene lists. Interestingly, most of these genes were related to survival in GE or SCNA. In addition, in the process of multivariate Cox regression model, we involved all of the four types of omics data for each candidate gene to perform analysis. Actually, the genes were not survival-related at all of the four omics data in the univariate Cox regression model. To prove the validity of this process, we recalculated the Score of each gene by only using the types of omics data at which the gene was determined to be related to survival in univariate Cox regression model. The results showed that neither the Scores nor the ranks of the genes changed much after recalculation. In consequence, it could suggest the high predictive performance of our multivariable Cox regression model.

The Necessity of Multi-Omics Data Integration

In the process of univariate Cox regression analysis, we found that a gene appeared to be survival-related in one omics dataset, while it might appear to be unrelated to survival on another omics data even under the same model, selection criteria and set of samples. Although this phenomenon might be caused by the error of the data or the imprecision of the experiment, it implied the necessity of multi-omics data integration. To verify the superiority of integrating multi-omics data, we compared the results of integrating multi-omics data with the results of single omics data in LUSC and KIRC. As shown in Figure 11, the results of integrating multi-omics data were significantly higher than those of applying single omics data in decision curve analysis and C-index. The decision curve showed that compared with single omics data, the curve of multi-omics data was further apart from the two extreme curves, which had the greater application value.

FIGURE 11

The comparison of the results for multi-omics data and single omics data. (A) The decision curve of multi-omics data and each omics data in KIRC. The thin oblique line represented the assumption that all patients have been treated. The black line represented the assumption that no patients have been treated. (B) The C-indexes of multi-omics data and each omics data in KIRC. (C) The decision curve of multi-omics data and each omics data in LUSC. The thin oblique line represented the assumption that all patients have been treated. The black line represented the assumption that no patients have been treated. (D) The C-indexes of multi-omics data and each omics data in LUSC.

Discussion

The recognition of prognostic biomarkers in cancers could predict the prognostic status of each individual patient. This could help to achieve personalized medicine for cancer (Nalejska et al., 2014). Prior work has utilized omics data to predict prognostic status of cancer patients. However, multi-omics data were not used comprehensively. In this work, we proposed a method to integrate multi-omics data and predict the prognostic status of patients. And gene lists associated with survival were identified in 13 types of cancers. Based on this foundation, the prognostic biomarkers of the cancers were obtained. Compared with previous studies, this work took a more comprehensive integration of multi-omics data. To verify the reliability and reproducibility of our approach, we confirmed the relationship between our prognostic genes and cancer from multiple perspectives, and the results were stable when changing feature selection strategies. And this method was easy to implement because of its light calculation burden. We obtained candidate survival-related gene lists for 13 types of cancers, and compared the differences and similarities of the lists. The genes associated with survival in multiple cancers were found. Not only have we successfully verified that genes like EPRS were indeed related to various cancers, but also we found that genes such as SLK were related to survival of multiple cancers. SLK has been reported to associate with blood cancer, breast cancer, colorectal cancer, liver cancer, and pancreatic cancer. In our work, we found that it was participated in the BP of patients’ survival of bladder cancer, lung cancer, and renal cancer. SLK mainly associated with cell adhesion. Cell adhesion plays an important role in the maintenance of tissue structure, whose abnormality results in tumor invasion and metastasis. However, we also got some confused results in comparing different cancers. As different cancer subtypes of the same tissue, the overlap of gene lists between LUAD and LUSC was small, which was different from the expected outcome. We suspected that this might be due to their different pathogenesis. LUSC commonly occurred in older men and was strongly associated with smoking, but LUAD was more common in women and non-smokers (Kenfield et al., 2008). The differences in the pathogenesis might lead to differences in the genetic mechanisms and the list of related genes. Moreover, the prognostic markers for SARC did not significantly distinguish the high and low risk groups. This might be due to the subtypes of SARC (leiomyosarcoma, liposarcoma, myxofibrosarcoma, synovial SARC, etc.). The subtypes of SARC ought to be considered as different cancer types. In addition, this might be caused by the bias of data. TCGA patient samples were selected from multiple sources, and were characterized at multiple centers, which might introduce heterogeneity and bias. And the clinical annotations of the patients might not be sufficiently rigorous and comprehensive (Yuan et al., 2014). Even though we only selected the basic information such as age and gender, there were still some missing values. Till now, only a few molecular prognostic biomarkers based on multi-omics data have been applied to clinic (Yuan et al., 2014). The presence of publication bias and incompletion in the literatures is a major reason why the identified tumor biomarkers have not been applied in clinic (Mcshane and Hayes, 2012). Further, translational medicine researchers have no access to the results of these studies. Our work clearly provided gene lists related to the survival of various cancers, which could be easily obtained and searched, and help to transform biological data into clinical experiments. Even so, our work remains inadequate. First of all, overfitting and collinearity of biological data make it technically challenging to effective integration of multi-omics data. Our work did not solve the problem. Although LASSO can well select the most important features to overcome the overfitting problem, it will lose many equally important features at the same time when high pairwise correlations occurred (Zhang et al., 2016). And the intra-tumor heterogeneities of cancer make it almost impossible to find prognostic biomarkers 100% suitable for each patient. Future efforts are still needed to address these problems. In addition, since the data were downloaded from TCGA which was a program of the National Cancer Institute (NCI) of the United States, most of the patients were white people. The results of this study may be only appropriate for the whites. Although Chaudhary et al. (2018) have validated their model, which was built by TCGA data, on Japanese and Chinese datasets, further validation of other cancer should be done and data of black population should be included in future studies. Furthermore, some studies have suggested a non-linear relationship between miRNA expression and clinical outcomes (Fuchs et al., 2013; Lee et al., 2013). Therefore, some non-parametric algorithms can be applied to the analysis of the prognosis of miRNA in future studies.

Data Availability Statement

The datasets generated for this study can be found in the TCGA database: https://www.cancer.gov/tcga. And the core code of this study was merged in the Supplementary Data Sheet S1.

Author Contributions

NZ collected data, carried out the initial analyses, and drafted the manuscript. MG conceived of the study, participated in its design and coordination, and helped to draft the manuscript. KW and XL coordinated and supervised data collection, and critically commented on the important intellectual content of the manuscript. CZ participated in the design of the study and performed the statistical analysis. All authors read and approved the final manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

50 in total

1. Connecting high-dimensional mRNA and miRNA expression data for binary medical classification problems.

Authors: Mathias Fuchs; Tim Beißbarth; Edgar Wingender; Klaus Jung
Journal: Comput Methods Programs Biomed Date: 2013-07-10 Impact factor: 5.428

Review 2. DNA methylation and gene silencing in cancer.

Authors: Stephen B Baylin
Journal: Nat Clin Pract Oncol Date: 2005-12

Review 3. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.

Authors: F E Harrell; K L Lee; D B Mark
Journal: Stat Med Date: 1996-02-28 Impact factor: 2.373

4. A linear regression model for the analysis of life times.

Authors: O O Aalen
Journal: Stat Med Date: 1989-08 Impact factor: 2.373

5. Cancer statistics, 2020.

Authors: Rebecca L Siegel; Kimberly D Miller; Ahmedin Jemal
Journal: CA Cancer J Clin Date: 2020-01-08 Impact factor: 508.702

Review 6. Intratumor heterogeneity: evolution through space and time.

Authors: Charles Swanton
Journal: Cancer Res Date: 2012-09-20 Impact factor: 12.701

7. Integrated genomic analyses of ovarian carcinoma.

Authors:
Journal: Nature Date: 2011-06-29 Impact factor: 49.962

8. The UCSC Genome Browser database: 2017 update.

Authors: Cath Tyner; Galt P Barber; Jonathan Casper; Hiram Clawson; Mark Diekhans; Christopher Eisenhart; Clayton M Fischer; David Gibson; Jairo Navarro Gonzalez; Luvina Guruvadoo; Maximilian Haeussler; Steve Heitner; Angie S Hinrichs; Donna Karolchik; Brian T Lee; Christopher M Lee; Parisa Nejad; Brian J Raney; Kate R Rosenbloom; Matthew L Speir; Chris Villarreal; John Vivian; Ann S Zweig; David Haussler; Robert M Kuhn; W James Kent
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

9. Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers.

Authors: Andrew J Vickers; Angel M Cronin; Elena B Elkin; Mithat Gonen
Journal: BMC Med Inform Decis Mak Date: 2008-11-26 Impact factor: 2.796

10. COSMIC: the Catalogue Of Somatic Mutations In Cancer.

Authors: John G Tate; Sally Bamford; Harry C Jubb; Zbyslaw Sondka; David M Beare; Nidhi Bindal; Harry Boutselakis; Charlotte G Cole; Celestino Creatore; Elisabeth Dawson; Peter Fish; Bhavana Harsha; Charlie Hathaway; Steve C Jupe; Chai Yin Kok; Kate Noble; Laura Ponting; Christopher C Ramshaw; Claire E Rye; Helen E Speedy; Ray Stefancsik; Sam L Thompson; Shicai Wang; Sari Ward; Peter J Campbell; Simon A Forbes
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

7 in total

Review 1. The Breast Cancer Protooncogenes HER2, BRCA1 and BRCA2 and Their Regulation by the iNOS/NOS2 Axis.

Authors: Katie Lin; Stavroula Baritaki; Silvia Vivarelli; Luca Falzone; Aurora Scalisi; Massimo Libra; Benjamin Bonavida
Journal: Antioxidants (Basel) Date: 2022-06-17

Review 2. Prognostic/predictive markers in systemic therapy resistance and metastasis in breast cancer.

Authors: Evthokia A Hobbs; Natalie Chen; Alphi Kuriakose; Elizabeth Bonefas; Bora Lim
Journal: Ther Adv Med Oncol Date: 2022-07-16 Impact factor: 5.485

Review 3. Recent advances of GOLM1 in hepatocellular carcinoma.

Authors: Jiuliang Yan; Binghai Zhou; Hui Li; Lei Guo; Qinghai Ye
Journal: Hepat Oncol Date: 2020-06-29

4. A cell-to-patient machine learning transfer approach uncovers novel basal-like breast cancer prognostic markers amongst alternative splice variants.

Authors: Jean-Philippe Villemin; Claudio Lorenzi; Marie-Sarah Cabrillac; Andrew Oldfield; William Ritchie; Reini F Luco
Journal: BMC Biol Date: 2021-04-12 Impact factor: 7.431

5. Prognostic biomarkers for predicting papillary thyroid carcinoma patients at high risk using nine genes of apoptotic pathway.

Authors: Chakit Arora; Dilraj Kaur; Leimarembi Devi Naorem; Gajendra P S Raghava
Journal: PLoS One Date: 2021-11-12 Impact factor: 3.240

6. Prognostic Biomarker-Based Identification of Drugs for Managing the Treatment of Endometrial Cancer.

Authors: Dilraj Kaur; Chakit Arora; Gajendra Pal Singh Raghava
Journal: Mol Diagn Ther Date: 2021-06-22 Impact factor: 4.074

7. Integrative Analysis of Identifying Methylation-Driven Genes Signature Predicts Prognosis in Colorectal Carcinoma.

Authors: Hao Huang; Jinming Fu; Lei Zhang; Jing Xu; Dapeng Li; Justina Ucheojor Onwuka; Ding Zhang; Liyuan Zhao; Simin Sun; Lin Zhu; Ting Zheng; Chenyang Jia; Binbin Cui; Yashuang Zhao
Journal: Front Oncol Date: 2021-06-11 Impact factor: 6.244

7 in total