Literature DB >> 35831792

DriverRWH: discovering cancer driver genes by random walk on a gene mutation hypergraph.

Chenye Wang1, Junhan Shi1, Jiansheng Cai2, Yusen Zhang1, Xiaoqi Zheng3, Naiqian Zhang4.   

Abstract

BACKGROUND: Recent advances in next-generation sequencing technologies have helped investigators generate massive amounts of cancer genomic data. A critical challenge in cancer genomics is identification of a few cancer driver genes whose mutations cause tumor growth. However, the majority of existing computational approaches underuse the co-occurrence mutation information of the individuals, which are deemed to be important in tumorigenesis and tumor progression, resulting in high rate of false positive.
RESULTS: To make full use of co-mutation information, we present a random walk algorithm referred to as DriverRWH on a weighted gene mutation hypergraph model, using somatic mutation data and molecular interaction network data to prioritize candidate driver genes. Applied to tumor samples of different cancer types from The Cancer Genome Atlas, DriverRWH shows significantly better performance than state-of-art prioritization methods in terms of the area under the curve scores and the cumulative number of known driver genes recovered in top-ranked candidate genes. Besides, DriverRWH discovers several potential drivers, which are enriched in cancer-related pathways. DriverRWH recovers approximately 50% known driver genes in the top 30 ranked candidate genes for more than half of the cancer types. In addition, DriverRWH is also highly robust to perturbations in the mutation data and gene functional network data.
CONCLUSION: DriverRWH is effective among various cancer types in prioritizes cancer driver genes and provides considerable improvement over other tools with a better balance of precision and sensitivity. It can be a useful tool for detecting potential driver genes and facilitate targeted cancer therapies.
© 2022. The Author(s).

Entities:  

Keywords:  Cancer driver genes; Candidate gene prioritization; Gene network; Hypergraph model; Random walk; Somatic mutation

Mesh:

Year:  2022        PMID: 35831792      PMCID: PMC9281118          DOI: 10.1186/s12859-022-04788-7

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.307


Background

Cancer is a complex genetic disease characterized by abnormal and uncontrolled cellular growth, which is caused primarily by the accumulation of genomic alterations that together enable malignant growth [1, 2]. Recent advances in next-generation sequencing (NGS) technologies have generated massive amounts of cancer genomic data, such as The Cancer Genome Atlas (TCGA), which provides somatic mutation landscapes to better characterize the molecular signatures of cancer [3]. There is a consensus viewpoint on tumorigenesis that only a few mutational events occurring in a set of genes (called “cancer driver genes”) affect the homeostatic development of a set of key cellular functions [4-6]. Discovery of these cancer driver genes across various tumor types is a key step in understanding tumor biology and developing targeted anticancer therapies. A number of computational tools have been developed to identify cancer driver genes from multidimensional genomic data. Most of these tools can be classified into three categories based on their basic principles [7]. Frequency-based approaches define that the most commonly occurring mutation are more likely to be drivers, such as MutSigCV and MuSic [8, 9]. Unfortunately, methods based on frequency are underpowered for uncovering low recurrently driver genes [10]. Functional impact-based approaches, such as OncodriveFM, integrate multiple-domain information to predict the functional impact of single nucleotide variants (SNVs) [11, 12]. However, most of these methods use machine learning based models. Building either a gold-standard positive data set or a negative data set for such model is a difficult task, and that restricts the use of these methods [10]. The third category is network-based methods enlightened by the observation that mutations in a cancer genome tend to converge on a few biological pathways, attempt to identify groups of driver genes based on prior knowledge of pathways and proteins or genetic interactions [13-17]. A tool named DawnRank adopts PageRank algorithm to rank potential drivers based on their impact on the overall differential expression of the downstream genes [14]. HotNet2 uses a random walk with restart algorithm for identification of mutated subnetworks, in which the mutation frequency of each gene and the frequencies of its network neighbors are considered and hub genes are often yielded with highly predicted scores [15]. This kind of methods have advantages in their ability to identify driver genes with low recurrence and improve the accuracy of predicting driver genes to some extent [18]. Despite the rapid progress in computational approaches to prioritize cancer driver genes with the advent of next-generation sequencing technologies, the false positive rates of these existing methods are still too high. In addition, there are evidences showing that driver gene co-occurring may play a key role in cancer initiation and progression [19-21]. Because the activation or inactivation of one given driver gene is usually not sufficient to induce tumorigenesis, multiple mutations in different driver genes have to cooperate to gradually transform normal cells into precursor lesions and subsequently invasive and metastatic cancer [22-25]. Among majority of the published methods, the practice of putting single gene mutation frequency as input information could result in the loss of all the co-occurring alternations information of the individual tumors. In this study, we introduced a weighted hypergraph model and present a novel tool DriverRWH by integrating mutation profile and PPI network data to predict driver genes. Hypergraph is a generalization of simple graphs where its edges, called hyperedges, are allowed to connect arbitrary number of vertices, which makes it suitable for representation of high-order relations and it can be used to model biology network, data structure, computations and a variety of other systems [26-28]. Herein, we adopted hyperedges to represent the co-exist relationship among mutated genes in individuals, so the problem of information loss of co-occurring alternations can be avoided in a certain extent. We next specified the weights of mutated genes in each hyperedge according their interaction in PPI network and construct the weighted hypergraph. Thereafter, we generalized a random walk algorithm to the weighted hypergraph. Finally, we ranked all the candidate mutated genes for the given cancer type. To verify our method, we applied DriverRWH to 31 cancer types from TCGA and found that our method outperforms the state-of-the-art tools for the majority of cancer types regardless of which reference network we use. We also evaluated the robustness of our method and found that DriverRWH is highly robust to various data perturbations.

Methods

Overview

In this study, we proposed DriverRWH, which uses random walk on weighted hypergraph to prioritize the driver genes (Fig. 1). Firstly, for a given cancer type, a hypergraph was constructed basing on mutation profile, wherein tumor samples are presented as hyperedges and mutant genes are presented as vertices. Secondly, according to our hypothesis that a gene is more likely to be a driver gene if it is highly associated with other mutated genes, we differentiated genes within a hyperedge of sample in accordance with their degrees in the corresponding subnetwork of the PPI network. Then, we adopted a probabilistic weighted random walk that take advantage of the hypergraph structure, and carried out this iteratively. After some steps, the random walk would stabilize, producing a score for each mutated gene. At last, all candidate mutant genes are ranked in descending order based on their score.
Fig. 1

Overview of DriverRWH. A, B Construction of the weighted hypergraph model using somatic mutation profiles of a given cancer type and a PPI network. Each sample is indicated with colored circular area (hyperedge) which contains all the mutated genes (vertices) of individual. Since the number of mutated gene varies from samples, the hypergraph contains different number of vertices. The weights of vertices in each hyperedge are assigned according to the degree in the context of the background subnetwork. C Illustration of the random walk process on the hypergraph. For vertex , we randomly select a hyperedge which incident with and then selects a node according the weights of vertices in selected hyperedge as the destination vertex to shift

Overview of DriverRWH. A, B Construction of the weighted hypergraph model using somatic mutation profiles of a given cancer type and a PPI network. Each sample is indicated with colored circular area (hyperedge) which contains all the mutated genes (vertices) of individual. Since the number of mutated gene varies from samples, the hypergraph contains different number of vertices. The weights of vertices in each hyperedge are assigned according to the degree in the context of the background subnetwork. C Illustration of the random walk process on the hypergraph. For vertex , we randomly select a hyperedge which incident with and then selects a node according the weights of vertices in selected hyperedge as the destination vertex to shift

The DriverRWH algorithm

In the present model, mutation data of a given cancer type and a PPI dataset are used as the input information (Fig. 1A). As shown in Fig. 1B, a hypergraph consisting of the mutated genes of all samples was constructed. If a gene is mutated in a sample, it would be presented as a vertex in the hyperedge corresponding to the sample. Without loss of generalization, the hypergraph can be defined as , where is the set of vertices and is the set of hyperedges. A hyperedge is a subset of, satisfying . Hyperedge is said to be incident with vertex if ; thus, the incidence matrix can be defined as follows: After construction of the hypergraph, a specified subnetwork is generated for each sample, based on the mutated genes and their interaction in the PPI network. According to our hypothesis that a gene is more likely to be a driver gene if it is highly associated with other mutated genes, a fairly standard choice of the weight of vertices in each hyperedge are their degrees in the corresponding induced subnetwork of the PPI network. Then, we developed a random walk process on the weighted hypergraph. Similar to a random walk on a simple graph, this walk is a type of Markov process, which is seen as the transition between two vertices. Note that the transition on the hypergraph occurs only if two vertices are incident to a hyperedge, so the random walk on the hypergraph is defined to be a two-step process. In the first step, the surfer selects a hyperedge incident with the current vertex ; thereafter, it selects a target vertex within the chosen hyperedge (Fig. 1C). If one vertex is an isolated node in the subnetwork, it also has the potential to be a driver gene, so a small weight of 0.01 is set. Let be the subnetwork containing vertices in hyperedge and denote as the degree of in the subnetwork. Thereafter, the surfer selects vertex proportional to the weight of within the hyperedge. Notably, in our model, the weights of vertices may vary in accordance with the hyperedges. According to the aforementioned definition, the degree of vertex and hyperedge in hypergraph can be defined as follows: With all the elements defined, we calculated the transition probability from vertex to vertex as follows: which can also be written in matrix form:where is the diagonal vertex degree matrix, is the diagonal hyperedge degree matrix with element and is the weighted incident matrix of hypergraph . Note that the transition matrix is stochastic, where each row sums to 1. Furthermore, we implemented a random walk with restart on the hypergraph. All genes are considered to be potential driver genes and are assigned with equal probabilities; i.e., the initially normalized probability vector such that each element is assigned with equal probability . Moreover, the restart probability at every step is set to be . In this article, we set to be 0.2. Finally, the random walk formula can be expressed as follows: In the formula above, is defined such that the th element means the probability that the surfer stops at vertex at step . After a number of steps, the random walk will be stable, which can be defined as . The stabilized state implies that the distance between and by the L1 norm is smaller than the provided cutoff value. In this paper, we set the cutoff as . The elements of the stabilized vector are defined as the DriverRWH score, which can reflect the role that the mutated genes play in cancer.

Datasets and networks

Somatic mutation data for 9183 tumor samples across 31 cancer types (Additional file 2) used in this work are available from TCGA, which were downloaded by UCSC Browser (https://xenabrowser.net/datapages/) [29]. We downloaded two independently developed PPI datasets from the STRINGv10 (https://string-db.org) [30] and the HumanNet (http://www.functionalnet.org/humannet/) [31].

Performance evaluation

To evaluate the method, an unbiased comprehensive known cancer gene set is needed. Unfortunately, such a gold-standard set of cancer genes is currently unavailable. Alternatively, we used four complementary cancer gene sets derived from various sources as the reference driver gene set for all the cancer types. First, 616 cancer genes were downloaded from the Cancer Gene Census (CGC) database, which includes genes for which mutations have been causally implicated in cancer and is widely used as a gold-standard cancer gene set [32]. Second, the list of HiConf cancer gene panels consists of 99 driver genes that have previously been detected through genetic criteria and that could plausibly be detected with exome sequencing data [33]. The third set has 291 high-confidence cancer driver genes identified by a rule-based method (HCD) [34]. The fourth set contains 125 driver genes defined by the "20/20 rules", which identifies Mut-driver genes based on the characteristic mutational patterns for oncogenes and tumor suppressor genes [35]. Now that each cancer gene set is biased toward particular features or study methods, we utilized a union of these four lists as the reference driver gene set, with a total of 785 genes. This operation can reduce the bias caused by using a single reference gene list to some degree. Using aforementioned reference driver genes as a benchmark, we generated receiver operating characteristic (ROC) curves and areas under the curve (AUCs) to evaluate the true positive and false positive rate. For practical reasons, only top-ranked candidate genes might enter into follow-up experimental validation. Considering that the high performance of prioritization for all genes cannot guarantee successful prioritization for the top ranked candidates, we also assessed the number of known driver gene recovered in the top 20, 50, 100,150 and 200 candidate genes. Due to the diversity of cancer types, we are more interested in tumor-specific drivers than the general common drivers across all tumor types. We downloaded IntOGen database (https://www.intogen.org/download) [4]. This database harnesses the strengths of different driver prediction methods and provides a tumor-specific driver genes list, which is considered to be the best trade-off between sensitivity and specificity. This list contains 31 types of cancer among which Kidney Chromophobe (KICH) has 7 specific drivers (minimum) and Uterine Corpus Endometrial Carcinoma (UCEC) has 55 (maximum). All of the above lists are shown in Additional file 3. From an application point of view, we should assess the ability of our method to identify novel driver genes that may not have been discovered in IntOGen. The genes in top 200 candidate gene list predicted by DriverRWH with both HumanNet and STRINGv10 while not in the tumor-specific drivers were considered to be potential novel drivers. From the functional perspective, these genes were evaluated by the biological analysis using DAVID on-line database, CancerGeneNet and iGMDR database [36-39]. We leveraged a literature mining method named CoCiter, which calculates the co-citation significance between predicted driver genes and the keywords cancer type, ‘driver’ and ‘cancer’ to verify the top 30 significant genes [40]. The higher co-citation score implicates the stronger association between the genes and the key terms. Without loss of generality, we compared DriverRWH with 24 driver gene prediction methods across 31 cancer type, some of which identify significant drivers by P-value (the genes with FDR adjusted P-value < 0.05) and the rest of methods provide the priority scores for candidate driver genes (the top 30 genes are selected as significant drivers). It is acceptable for the reason that the median number of significant genes for other methods in all data sets is 30.

Results

Known driver genes have higher degree in the PPI network

In DriverRWH, we hypothesized that a gene is more likely to be a cancer driver if it is prone to associate with other mutated genes in cancer. This hypothesis has already been proposed in some studies [15, 41]. To further validate it, we analyzed the linkage of mutated genes in the PPI network. For a given cancer type, an induced subnetwork of the PPI network which just contains mutated genes from all samples was built. The genes that mutated at least once in a cancer type were divided into two groups according to whether they are in the reference driver gene set (the union of CGC, HiConf, MCD, Mut-driver, with a total number of 785 genes): known driver genes and the others. We calculated the degree of vertices in the induced subnetwork. Taking the three cancer types LUSC, BRCA and UCEC for illustration, we found the degrees of known driver genes were significantly larger than those of the other mutant genes (Fig. 2, P-value < 0.001). This result suggests that cancer driver genes were adjacent to more mutated genes than the others. The same analysis using HumanNet is also available (Additional file 1: Fig S1).
Fig. 2

Boxplot comparing the degrees of known driver and the other genes in induced subnetwork. Bracketed digits indicate the number of known driver genes and the other genes in the subnetwork of STRINGv10, which are induced by the mutated genes present in at least one tumor sample for a given cancer type

Boxplot comparing the degrees of known driver and the other genes in induced subnetwork. Bracketed digits indicate the number of known driver genes and the other genes in the subnetwork of STRINGv10, which are induced by the mutated genes present in at least one tumor sample for a given cancer type

Performance of DriverRWH

To evaluate the performance of our method, we compared our method from three aspects, prediction of known driver genes, functional enrichment analysis and literature mining analysis. Firstly, we implemented six prioritizing methods, MutsigCV [8], DawnRank [14], MinNetRank [16], Subdyquency [17], Gravity [41] and OncodriveFML [42] on three cancer types, namely Lung squamous cell carcinoma (LUSC), Breast invasive carcinoma (BRCA), and Uterine Corpus Endometrial Carcinoma (UCEC) (see Additional file 4). In order to eliminate the deviation brought by the background network, we operated DriverRWH and the other three network-based methods (MinNetRank, Subdyquency, and Dawnrank) basing on the same network, STRINGv10 and HumanNet respectively. Then, we compared DriverRWH with 24 other driver gene prediction tools to evaluated its performance across 31 cancer types. Lastly, we verified the robustness of our method by testing the performance in perturbed data where the mutation data and network data were extracted randomly with different size.

Results for lung squamous cell carcinoma

Lung cancer is regarded as the main leading cause of cancer deaths, which take up 18.0% of deaths [43]. In this research, we applied DriverRWH to 480 LUSC samples in TCGA database. Using reference driver genes as benchmarks, we generated receiver operating characteristic (ROC) curves. When using STRINGv10 as background network, DriverRWH outperforms the other six tools. in terms of sensitivity and specificity in identifying known driver gene (Fig. 3A). We further assessed the predictive power for the top-ranked candidate genes. As shown in Fig. 3B, we observed that DriverRWH identified more known cancer driver genes by its top 20, 50, 100, 150 and 200 genes. Furthermore, the number of know driver gene retrieved by DriverRWH with STRINGv10 network in its 20 top-ranked candidates is more than half of it. When HumanNet was used, DriverRWH is still significantly better than the others methods (Additional file 1: Fig S2).
Fig. 3

Prediction performance of DriverRWH based on the reference driver set. A ROC plots of DriverRWH and other six methods. All the network-based methods, DriverRWH, Subdyquency, MinNetRank and Dawnrank were implemented by using STRINGv10 as background network. B Cumulative number of known cancer genes recovered within the top 20, 50, 100, 150 and 200

Prediction performance of DriverRWH based on the reference driver set. A ROC plots of DriverRWH and other six methods. All the network-based methods, DriverRWH, Subdyquency, MinNetRank and Dawnrank were implemented by using STRINGv10 as background network. B Cumulative number of known cancer genes recovered within the top 20, 50, 100, 150 and 200 To assess the ability of DriverRWH of discovering potential novel cancer driver genes, we considered the genes in the 200 top ranked candidate genes predicted with both HumanNet and the STRINGv10 while not in tumor-specific drivers list, resulting in 72 genes after screening. Biological enrichment analysis using DAVID against Genetic Association Database (GAD) shows that 36 genes (48.6%) are cancer-related (P-value = 5.92 × 10–6, FDR = 5.92 × 10–4) [44]. In particular, these genes are enriched for "lung cancer" (P-value = 1 × 10−3, FDR = 0.1217). Furthermore, the KEGG pathway enrichment analysis for the potential drivers is encouraging. 8 genes (11.2%) are significantly enriched in pathway: "PI3K - Akt signaling pathway" (P-adjust < 0.05), which is significantly related to lung cancer (Additional file 1: Fig S3) [45-48]. Specifically, using the top 30 candidate genes as significant driver, we searched these genes in co-citer website by the key terms ‘Cancer’, ‘Driver’ and ‘Lung’. As Table 1 shows, some significant well-known driver genes like TP53, PTEN and PIK3CA are near the top of the list. Although they are also identified by most of other methods, their ranking fell behind ours. The well-known suppressor TP53 which disrupts the cell cycle arrest and the apoptosis pathways in human cancer ranks first in our method, but it ranks 527th in Gravity algorithm. The PTEN is proved to be related to small cell lung cancer, which is an admitted tumor suppressor gene with phosphatase activity [49]. It is co-cited with ‘Lung’ and ‘Cancer’ for 253 and 2597 times, which is regarded as driver genes in 35 publications. The PTEN ranks the 16th in our list but ranked 44th in MinNetRank and 588th in Gravity. The mutation of PIK3CA gene can lead to abnormal enhancement of the catalytic activity of PI3Ks and promote the carcinogenesis of cells in lung cancer [49]. It ranks 7th in our method but 22th in MutsigCV and OncodriveFML, and 473th in MinNetRank. On the other hand, KDR (Kinase insert domain-containing receptor), ranked 24th, was reported to play a critical role in the metastasis of cancer and is used as a molecular target in cancer therapy [50]. Co-cited with "Cancer" for 207 times and ‘Lung’ for 105 times, KDR even not deemed as a diver gene in lung cancer and can be thought as a potential driver. The similar analysis basing on HumanNet is also available (Additional file 1: Table S1).
Table 1

Cociter mining analysis of top 30 LUSC candidate driver genes identified by DriverRWH (STRINGv10)

GenesCo-appeared countIs_SpecificityRank position
LungCancerDriverMutsigCVDawnrankGravityOncodriveFMLSubdyquencyMinNetRank
TP53854594255111527313
TTN181022771395913,175NA1424
DNAH80110152NA2741373
RYR23320449240011,45621706
LRRK251810583155610,604NA223
PTEN253259735126658821444
PIK3CA94576131223625362211473
NOTCH18448623149241591201847
CSMD3031032434NA46NA940
ANK214009967557613,775NA4868
SYNE112101141814035NA8653
KMT2D11811NANA31471NA8656
DMD14193023419199811,399NA566
USH2A241061136459113,959NA215
OBSCN040020924582049252NA1487
RYR103105364178812994590
NF1121378152217932552881131106
LRP1B81520523973999317NA1471
APOB32310843882538676NA1465
RELN09201131724013,119NA3089
MYH111410122630NA6933NA2615
EPHA526101728NA76972890
MYH243104498NA502572538
KDR105207301312254928327143
HERC20141015556241488436152403
POTEE19001153403138873367229
PIK3CG361191042621602818651234
CPS12610715385213,518NA387
KMT2C32141NANA5041479NA4264
HDAC9518103711640378310,89485944
Cociter mining analysis of top 30 LUSC candidate driver genes identified by DriverRWH (STRINGv10) We adopted the GAD and KEGG pathway enrichment analysis and found these significant driver genes enrich in the small cell lung cancer, PI3K-Akt signaling pathway, etc., which are significantly related to lung cancer (Additional file 1: Fig S3). The hallmarks of cancer are defined as a set of crucial functional abilities acquired by human cells as they move from normalcy to neoplastic growth states [51]. We linked these significant drivers to hallmarks of cancer using CancerGeneNet online database which calculates the shortest paths between genes and phenotypes [38]. Half of the top 30 genes could be associated with hallmarks of cancer. KDR, one of the potential drivers we mentioned above, is linked to “Angiogenesis”, “Cell Death”, “Differentiation”, “DNA Repair”, “Glycolysis”, “Immortality”, “Inflammation”, “Metastasis” and “Proliferation” (Additional file 5). In order to assess the drug sensitivity of these significant drivers, we performed gene-drug analysis using online database iGMDR, which shows that 73.3% of significant genes are druggable (Additional file 6).

Results for breast invasive carcinoma

Breast cancer is the most commonly diagnosed cancer, with an estimate 2.3 million new cases, taking up to 11.7% of all the cancer cases in 2020 [43]. We focused on 791 BRCA samples in TCGA database to construct the hypergraph. Compared with other methods, DriverRWH shows the best performance in terms of ROC curves when STRINGv10 and HumanNet were used respectively (Fig. 4 and Additional file 1: Fig S4). Meanwhile, although DriverRWH discerned less driver gene than MutSigCV in top 20 candidates, it was found to predict more known driver genes in the top 50, 100, 150 and 200 candidates (Fig. 4B).
Fig. 4

Prediction performance of DriverRWH based on the reference driver set. A ROC plots of DriverRWH and other six methods. All the network-based methods, DriverRWH, Subdyquency, MinNetRank and Dawnrank were implemented by using STRINGv10 as background network. B Cumulative number of known cancer genes recovered within the top 20, 50, 100, 150 and 200

Prediction performance of DriverRWH based on the reference driver set. A ROC plots of DriverRWH and other six methods. All the network-based methods, DriverRWH, Subdyquency, MinNetRank and Dawnrank were implemented by using STRINGv10 as background network. B Cumulative number of known cancer genes recovered within the top 20, 50, 100, 150 and 200 We evaluated the capacity of DriverRWH in identifying the breast cancer potential driver genes. Similarly, we adopted 61 genes, which are in the 200 top ranked candidate genes predicted with both HumanNet and the STRINGv10 while not in tumor-specific drivers list to conduct the GAD and pathway enrichment analysis. Notably, 29 genes (44.6%) are enriched for "CANCER" (P-value = 1.67 × 10–4, FDR = 1.67 × 10–4) and 12 (18.5%) are enriched for "breast cancer" (P-value = 2.15 × 10−5, FDR = 0.0087). In the case of pathways, these genes are significantly enriched in "Breast cancer". The top 25 pathways are shown in additional file (Additional file 1: Fig S5). The cociter score of the top 30 candidate genes predicted by DriverRWH using STRINGv10 network is demonstrated in Table 2. Particularly, 8 of the top 10 candidate genes are exactly driver genes, including acknowledged driver gene TP53 (ranked 1st), the most recurrently mutated gene PIK3CA (ranked second), etc. With high cociter scores, KMT2C ranked 8th in DriverRWH, not even identified in MutsigCV and Dawnrank and ranked 2121 in Gravity. AKT1, which co-appears with "Cancer" for 1863 times and "Breast" for 477 times, ranked 10th in DriverRWH while it ranked merely 1226th in Gravity and 2233th in OncodriveFML. The ERBB2, which ranked 16th in DriverRWH, is confirmed to be related to breast cancer, but it ranked 35th in OncodriveFML, 126th in MutsigCV, and even 1465th in Gravity [52]. Besides, DriverRWH can identify some genes that are highly related with breast cancer but was not recognized by other six methods. For instance, EGFR is one of the first identified important targets of novel antitumor agents, which co-occur "Breast" 722 times, "Cancer" 4091 times, and "Driver" 94 times [53]. MTOR ranked 22nd, co-appearing 321 times with "Breast", 1896 times with "Cancer", and 21 times with "Driver". The similar analysis basing on HumanNet is also available (Additional file 1: Table S2).
Table 2

Cociter mining analysis of top 30 BRCA candidate driver genes identified by DriverRWH (STRINGv10)

GenesCo-appeared countIs_SpecificityRank position
BreastCancerDriverMutsigCVDawnrankGravityOncodriveFMLSubdyquencyMinNetRank
TP5311775942551118845224
PIK3CA17057613124394998851200
CDH12911143131324486452
GATA384114414111791NA8652
TTN2810NA314819716,43836097
PTEN59525973516330071042
MAP3K15912921512220845528
KMT2C32141NANA21213NA4076
DNAH801101157NA194278122
AKT14771863131NA612262233NA31
OBSCN1400NA172464215,684141760
DMD1193016171457579511827
NF119137811399816412534926
UBC1766534019729NA27043816
PRDM1011209162141044956027
ERBB23631442236112651465353351
MYH91032403035815404141NA476
NCOR116582192903762273160
FOXA1821285126263321NA8660
ANK33420NA325124154803542288
LRRK2418102183973883480278270
MTOR3211896210NA22664141947115
EGFR7224091940NA10290914,413119535
RYR22320NA1689252512,92992197
PRKDC57274407712127932651312
ANK20400NA4053532971444349
ASH1L021010956729576540852723
KDM6A52430645042214078NA2380
SYNE10210NA1801123312,287NA8654
RUNX114110617205874817283175
Cociter mining analysis of top 30 BRCA candidate driver genes identified by DriverRWH (STRINGv10) We performed GAD and pathway enrichment analysis of the top 30 candidate driver genes. The identified genes are enriched in "breast cancer" in GAD. These gene are significantly enriched in "Breast cancer", "Proteoglycans in cancer", "Endometrial cancer", etc., which have an association with breast cancer by KEGG enrichment analysis (Additional file 1: Fig S5). 66.7% of the candidate driver genes could be linked to hallmarks of cancer (Additional file 5). Besides, 86.7% of the identified genes are druggable according to the iGMDR database (Additional file 6).

Results for uterine corpus cancer

Uterine corpus cancer is the sixth most common type of cancer and the second most common gynecological malignancy in female, with more than 417,000 new cases and 97,000 deaths worldwide in 2020 [54]. We used 448 patients with 40,543 candidate genes from the TCGA database. DriverRWH outperforms the other six prioritizing methods with the same reference driver genes as benchmarks when assessed by the ROC and percentage of known driver gene in the top candidate genes (Fig. 5 and Additional file 1: Fig S6).
Fig. 5

Prediction performance of DriverRWH based on the reference driver set. A ROC plots of DriverRWH and other six methods. All the network-based methods, DriverRWH, Subdyquency, MinNetRank and Dawnrank were implemented by using STRINGv10 as background network. B Cumulative number of known cancer genes recovered within the top 20, 50, 100, 150 and 200

Prediction performance of DriverRWH based on the reference driver set. A ROC plots of DriverRWH and other six methods. All the network-based methods, DriverRWH, Subdyquency, MinNetRank and Dawnrank were implemented by using STRINGv10 as background network. B Cumulative number of known cancer genes recovered within the top 20, 50, 100, 150 and 200 For the discovery of potential drives, we selected 41 genes with the same criteria mentioned earlier, of which 22 genes (51.2%) are association with cancer (P-value = 1.37 × 10–4, FDR = 1.37 × 10–4). These genes are significantly enriched in PI3K - Akt signaling pathway and MAPK signaling pathway, both of which play an important role in cellular growth and survival, have been implicated in endometrial cancer pathogenesis (Additional file 1: Fig S7) [55]. We took top 30 candidate drivers in consideration, Table 3 shows the cociter score between these candidate genes and the terms " Endometrial", "Cancer" and "Drivers". Apoptosis-suppressing gene MTOR which co-appears with "Endometrial" 63 times, with "Cancer" 1896 times, ranked 19th in DriverRWH, but ranked 112th, 182th, and 1380th in Dawnrank, MutsigCV and OncodriveFML. Notch1 is tumor-suppressive in human endometrial cancer cells [56], which ranked 11th in DriverRWH, while 61th in MutsigCV, 94th in Subdyquency, even 2630th in OncodriveFML and 7054th in Gravity. Moreover, PRKDC is proved to be significantly associated with a high mutation load, which ranked 20th in DriverRWH [57]. Recent research suggest that high mutation load is a predictive biomarker of response to immune checkpoint inhibitors in uterine corpus cancer [58]. The similar analysis basing on HumanNet is also available (Additional file 1: Table S3).
Table 3

Cociter mining analysis of top 30 UCEC candidate driver genes identified by DriverRWH (STRINGv10)

GenesCo-appeared countIs_SpecificityRank position
EndometrialCancerDriverMutsigCVDawnrankGravityOncodriveFMLSubdyquencyMinNetRank
PTEN38025973511123316819
TP531435942551221687403443
PIK3CA395761313342673273
CTNNB111220142915510221142
KRAS5125389514931787865314218
DNAH80110374166012NA2444
LRRK201810243313NA711445110
OBSCN04001171264151992055NA1184
PRDM100120765842NA99247936
RANBP20121014816440117632176
NOTCH1154862316118705426809445
TAF109104579545227171NA8657
ARID1A1467413159944505231260
ANK3042021652145808157NA1914
ATM412225124154109749316148
ALB032107085257566739039737
EP30021452018151791754430
DMD019301806927763448NA508
MTOR631896211182112NA13089563
PRKDC3274403374012174128130
CTCF250311214292419NA164
TTN081012769631195NA4116
FGFR2232945117755487513944380
CAD0401017277174109112251
NSD101621116753346312185
ASH1L02101279469442603296492436
TRRAP026108129989630303
POTEE09004902155NA4873171146
GLI323620212022633088NA104
KMT2D0181123NANA3113NA8652
Cociter mining analysis of top 30 UCEC candidate driver genes identified by DriverRWH (STRINGv10) We performed GAD and pathway enrichment analysis of these candidate genes (Additional file 1: Fig S7). In terms of GAD enrichment analysis, these genes are enriched in "endometrial cancer", etc. In pathway enrichment analysis, they significantly enriched in Endometrial cancer. 70% of the top ranked genes have the shortest path to cancer phenotypes in CancerGeneNet database. PRKDC is linked with “Angiogenesis”, “Cell death”, “Differentiation”, “DNA repair”, “Glycolysis”, “Immortality”, Metastasis” and “Proliferation” (Additional file 5). 83.3% of these candidate genes have related drugs in iGMDR online database (Additional file 6).

The stability of the performance across 31 cancer types

Furthermore, we compared the performance of DriverRWH with 24 up-to-date driver gene prediction methods in order to assess the stability of DriverRWH across 31 cancer types. For DriverRWH and six methods mentioned above which provide ranks of the candidate driver gene, top 30 genes were selected as significant drivers [59]. For those methods that generate P-values, an adjusted P-values < 0.05 was used as the threshold to claim driver genes [60, 61]. The details of tools and the criteria for candidate driver genes are provided in the Additional file 7. Figure 6 displays the proportion of predicted driver genes presented in the reference driver set across 31 cancer types, arranged by the order of the median. DriverRWH recovered approximately 50% (median fraction is 53.3%) of known driver genes in the top 30 ranked candidate genes in more than half of 31 cancer types, which is significantly better than the results of the other methods.
Fig. 6

The performance of 25 driver gene prediction methods. Distribution of the fraction of predicted candidate driver genes presented in the reference driver set across 31 cancer types

The performance of 25 driver gene prediction methods. Distribution of the fraction of predicted candidate driver genes presented in the reference driver set across 31 cancer types

Robustness of DriverRWH

To test the robustness of DriverRWH, we applied our method to perturbed data where the mutation data and network data were shuffled randomly (Fig. 7). In detail, for the mutation data, two types of perturbations were taken: (1) randomly selecting 50% and 10% of the samples and (2) randomly selecting 50% and 10% of the original mutation information in the somatic mutation matrix. With 20 repeats, we used only 50% and 10% of samples and 50% of mutation information. There is no significant decrease in terms of the AUC scores and the cumulative number of recovered driver genes. If only 10% of mutation information was retained, there would be a slight decrease. It’s worth noting that the performance of the top 20 candidates was always at a high level. For the network data, two forms of perturbation were also taken: (1) randomly selecting 50% and 10% of the original network information and (2) using PPI data with 50% and 10% noise added. There was also only a minor decrease in the AUC scores and the cumulative number of recovered cancer genes. A similar conclusion could be obtained when performing robust analysis basing on HumanNet (Additional file 1: Fig S8). These results suggest that the perturbation of mutation data and the network did not seriously affect the result, indicating that DriverRWH is highly robust to the quality of the input data.
Fig. 7

Robustness of DriverRWH. A, B Boxplots of the effects of different data perturbations on the performance of DriverRWH. The vertical lines represent the AUC scores by DriverRWH using all of the data. C, D Effects of different data perturbations on the performance of DriverRWH measured by the average cumulative number of known cancer genes recovered in the 20, 50, 100, 150, and 200 top-ranked candidate genes

Robustness of DriverRWH. A, B Boxplots of the effects of different data perturbations on the performance of DriverRWH. The vertical lines represent the AUC scores by DriverRWH using all of the data. C, D Effects of different data perturbations on the performance of DriverRWH measured by the average cumulative number of known cancer genes recovered in the 20, 50, 100, 150, and 200 top-ranked candidate genes

Discussion

Recent years, many methods have been developed to distinguish driver genes from passengers. Limited by the design of the simple network model, most of them are incapable of expressing the many-to-many multiple association relationship. The mutation profile was always compressed into the mutation frequency of genes, resulting in the loss of co-mutation information for individual samples. In this study, we propose a network-based method DriverRWH, which has the capability of effectively integrating the mutation and PPI network data to predict cancer driver genes. The novelty of our method lies in the introduction of a weighted hypergraph model, which is constructed to simultaneously capture two class of relation among mutated genes in individual samples: 1) high-order relations were captured by storing hundreds of mutated genes in a hyperedge for each sample. 2) using the same mutated genes as above, an induced subnetwork of PPI network can be generated by preserving mutated genes and their interaction in the background network, which represents the pair-wise relations between mutated genes. Our model retains complete co-mutation relations for the mutated genes in individual tumors and these interactions in PPI network, which can adequately embody the implicit inherent peculiarity of them and avoid the loss of information. Taking advantage of hypergraph structure, we extended the typical random walk process on a simple graph to a probabilistic weighted random walk on hypergraph. Using a reference driver gene set as a benchmark, DriverRWH consistently outperformed the other six state-of-art prioritization methods in terms of the ROC analysis, rank of driver genes and the cumulative number of known driver genes recovered in the top-ranked candidate genes. Moreover, some new unknown potential driver genes which are co-cited by some cancer associated literatures also can be discovered by DriverRWH, meanwhile the high-ranking genes enrich in some significant cancer pathway. At last, taking top 30 as predicted candidate driver genes, we can compare DriverRWH with other non-ranking methods. The results shows that DriverRWH achieves a higher performance than four prioritization methods and 19 other non-ranking methods across 31 cancer types. Despite of these encouraging results, there are several limitations in the current model. First, for TCGA data, tumor heterogeneity may increase the data bias, and future work should be done to reduce false-positive discoveries by using single-cell genomics data. Second, DriverRWH relies on a broad context molecular network that is still incomplete at present, so refined gene functional networks in the near future could improve the performance of our method. A cancer-specific network might better represent the natural interactions of genes in cancer and potentially provide a more reliable network. Third, our method focuses on general driver gene detection but does not aim to offer personalized means of diagnosis, which is more useful in real applications. In the future, we plan to extend our method to discover drivers in personalized manner.

Conclusions

Recently, many computational methods and tools have been proposed to identify driver genes. However, long-tail distribution of the mutation frequency of genes in cancer genomes remains a major concern. There are many widely accepted methods based on mutation frequencies, but they fail to comprehensively consider the co-mutation information in individuals. Considering hypergraph has unique advantages of retaining complete co-occurrence information, we introduced the hypergraph theory in driver gene prediction, thus compensating for the co-mutation information loss issue by existing methods. For each hyperedge, degrees of vertex in the corresponding subnetwork of the PPI network were utilized to design the weighted hypergraph, through which we realized the integration of the mutation data and the PPI data. Subsequently, motivated by PageRank algorithm, we implemented the random walk with restart on the hypergraph, and proposed a novel approach DriverRWH to prioritize mutated genes. As demonstrated in this paper, DriverRWH not only excels existing methods in the identification of known driver genes but also is capable of discovering potential driver genes. Furthermore, the model behaves robustly under the perturbation of mutation data and network data. Our results show that DriverRWH can be a useful tool for prioritization driver genes. The source code of DriverRWH is freely available at https://github.com/ShandongUniversityZhanglab/DriverRWH. Additional file1. Figure S1: Boxplot comparing the degrees of known driver and the other genes in induced subnetwork. Figure S2: Prediction performance of DriverRWH based on the reference driver set in HumanNet of LUSC. Figure S3: The KEGG pathway enrichment analysis for the candidate driver genes of LUSC. Figure S4: Prediction performance of DriverRWH based on the reference driver set in HumanNet of BRCA. Figure S5: The KEGG pathway enrichment analysis for the candidate driver genes of BRCA. Figure S6: Prediction performance of DriverRWH based on the reference driver set in HumanNet of UCEC. Figure S7: The KEGG pathway enrichment analysis for the candidate driver genes of UCEC. Figure S8: Robustness of DriverRWH in HumanNet. Table S1: Cociter mining analysis of top 30 LUSC candidate driver genes identified by DriverRWH (HumanNet). Table S2: Cociter mining analysis of top 30 BRCA candidate driver genes identified by DriverRWH (HumanNet). Table S3: Cociter mining analysis of top 30 UCEC candidate driver genes identified by DriverRWH (HumanNet). Additional file2. The details for 31 cancer types used in this work. Additional file3. The known driver gene lists used in this work. Additional file4. The top 200 candidate driver genes predicted by DriverRWH, MutsigCV, Gravity, OncodriveFML, Dawnrank, Subdyquency, : and MinNetRank for three cancer types (LUSC, BRCA, UCEC). Additional file5. The linkages between significant drivers and hallmarks of cancer using CancerGeneNet online database Additional file6. The gene-drug analysis results of significant drivers using iGMDR online database Additional file7. The details of 24 tools and the criteria for candidate driver genes.
  56 in total

1.  clusterProfiler: an R package for comparing biological themes among gene clusters.

Authors:  Guangchuang Yu; Li-Gen Wang; Yanyan Han; Qing-Yu He
Journal:  OMICS       Date:  2012-03-28

2.  NOTCH1, NOTCH3, NOTCH4, and JAG2 protein levels in human endometrial cancer.

Authors:  Aušra Sasnauskienė; Violeta Jonušienė; Aurelija Krikštaponienė; Stasė Butkytė; Daiva Dabkevičienė; Daiva Kanopienė; Birutė Kazbarienė; Janina Didžiapetrienė
Journal:  Medicina (Kaunas)       Date:  2014-06-06       Impact factor: 2.430

3.  DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies.

Authors:  Yi Han; Juze Yang; Xinyi Qian; Wei-Chung Cheng; Shu-Hsuan Liu; Xing Hua; Liyuan Zhou; Yaning Yang; Qingbiao Wu; Pengyuan Liu; Yan Lu
Journal:  Nucleic Acids Res       Date:  2019-05-07       Impact factor: 16.971

4.  Discovering personalized driver mutation profiles of single samples in cancer by network control strategy.

Authors:  Wei-Feng Guo; Shao-Wu Zhang; Li-Li Liu; Fei Liu; Qian-Qian Shi; Lei Zhang; Ying Tang; Tao Zeng; Luonan Chen
Journal:  Bioinformatics       Date:  2018-06-01       Impact factor: 6.937

5.  Identifying Driver Genes for Individual Patients through Inductive Matrix Completion.

Authors:  Tong Zhang; Shao-Wu Zhang; Yan Li
Journal:  Bioinformatics       Date:  2021-06-27       Impact factor: 6.937

6.  Somatic mutations affect key pathways in lung adenocarcinoma.

Authors:  Li Ding; Gad Getz; David A Wheeler; Elaine R Mardis; Michael D McLellan; Kristian Cibulskis; Carrie Sougnez; Heidi Greulich; Donna M Muzny; Margaret B Morgan; Lucinda Fulton; Robert S Fulton; Qunyuan Zhang; Michael C Wendl; Michael S Lawrence; David E Larson; Ken Chen; David J Dooling; Aniko Sabo; Alicia C Hawes; Hua Shen; Shalini N Jhangiani; Lora R Lewis; Otis Hall; Yiming Zhu; Tittu Mathew; Yanru Ren; Jiqiang Yao; Steven E Scherer; Kerstin Clerc; Ginger A Metcalf; Brian Ng; Aleksandar Milosavljevic; Manuel L Gonzalez-Garay; John R Osborne; Rick Meyer; Xiaoqi Shi; Yuzhu Tang; Daniel C Koboldt; Ling Lin; Rachel Abbott; Tracie L Miner; Craig Pohl; Ginger Fewell; Carrie Haipek; Heather Schmidt; Brian H Dunford-Shore; Aldi Kraja; Seth D Crosby; Christopher S Sawyer; Tammi Vickery; Sacha Sander; Jody Robinson; Wendy Winckler; Jennifer Baldwin; Lucian R Chirieac; Amit Dutt; Tim Fennell; Megan Hanna; Bruce E Johnson; Robert C Onofrio; Roman K Thomas; Giovanni Tonon; Barbara A Weir; Xiaojun Zhao; Liuda Ziaugra; Michael C Zody; Thomas Giordano; Mark B Orringer; Jack A Roth; Margaret R Spitz; Ignacio I Wistuba; Bradley Ozenberger; Peter J Good; Andrew C Chang; David G Beer; Mark A Watson; Marc Ladanyi; Stephen Broderick; Akihiko Yoshizawa; William D Travis; William Pao; Michael A Province; George M Weinstock; Harold E Varmus; Stacey B Gabriel; Eric S Lander; Richard A Gibbs; Matthew Meyerson; Richard K Wilson
Journal:  Nature       Date:  2008-10-23       Impact factor: 49.962

Review 7.  Hallmarks of cancer: the next generation.

Authors:  Douglas Hanahan; Robert A Weinberg
Journal:  Cell       Date:  2011-03-04       Impact factor: 41.582

Review 8.  Targeting the PI3K/Akt/mTOR pathway in non-small cell lung cancer (NSCLC).

Authors:  Aaron C Tan
Journal:  Thorac Cancer       Date:  2020-01-27       Impact factor: 3.500

9.  Large-scale mutagenesis in p19(ARF)- and p53-deficient mice identifies cancer genes and their collaborative networks.

Authors:  Anthony G Uren; Jaap Kool; Konstantin Matentzoglu; Jeroen de Ridder; Jenny Mattison; Miranda van Uitert; Wendy Lagcher; Daoud Sie; Ellen Tanger; Tony Cox; Marcel Reinders; Tim J Hubbard; Jane Rogers; Jos Jonkers; Lodewyk Wessels; David J Adams; Maarten van Lohuizen; Anton Berns
Journal:  Cell       Date:  2008-05-16       Impact factor: 41.582

10.  PRKDC: new biomarker and drug target for checkpoint blockade immunotherapy.

Authors:  Kien Thiam Tan; Chun-Nan Yeh; Yu-Chan Chang; Jen-Hao Cheng; Wen-Liang Fang; Yi-Chen Yeh; Yu-Chao Wang; Dennis Shin-Shian Hsu; Chiao-En Wu; Jiun-I Lai; Peter Mu-Hsin Chang; Ming-Han Chen; Meng-Lun Lu; Shu-Jen Chen; Yee Chao; Michael Hsiao; Ming-Huang Chen
Journal:  J Immunother Cancer       Date:  2020-03       Impact factor: 13.751

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.