Literature DB >> 31888619

Identifying driver genes involving gene dysregulated expression, tissue-specific expression and gene-gene network.

Junrong Song1, Wei Peng2, Feng Wang1, Jianxin Wang3.   

Abstract

BACKGROUND: Cancer as a kind of genomic alteration disease each year deprives many people's life. The biggest challenge to overcome cancer is to identify driver genes that promote the cancer development from a huge amount of passenger mutations that have no effect on the selective growth advantage of cancer. In order to solve those problems, some researchers have started to focus on identification of driver genes by integrating networks with other biological information. However, more efforts should be needed to improve the prediction performance.
METHODS: Considering the facts that driver genes have impact on expression of their downstream genes, they likely interact with each other to form functional modules and those modules should tend to be expressed similarly in the same tissue. We proposed a novel model named by DyTidriver to identify driver genes through involving the gene dysregulated expression, tissue-specific expression and variation frequency into the human functional interaction network (e.g. human FIN).
RESULTS: This method was applied on 974 breast, 316 prostate and 230 lung cancer patients. The consequence shows our method outperformed other five existing methods in terms of Fscore, Precision and Recall values. The enrichment and cociter analysis illustrate DyTidriver can not only identifies the driver genes enriched in some significant pathways but also has the capability to figure out some unknown driver genes.
CONCLUSION: The final results imply that driver genes are those that impact more dysregulated genes and express similarly in the same tissue.

Entities:  

Keywords:  Driver genes; Dysregulated expression; Human functional interaction network; Tissue-specific expression; Variation frequency

Mesh:

Year:  2019        PMID: 31888619      PMCID: PMC6936147          DOI: 10.1186/s12920-019-0619-z

Source DB:  PubMed          Journal:  BMC Med Genomics        ISSN: 1755-8794            Impact factor:   3.063


Background

Cancer as a kind of genomic alteration disease each year deprives many people’s life [1-3]. It is acknowledged that cancer arise is due to the accumulation of mutations in a subgroup of genes which conferring growth advantage, allowing uncontrolled proliferation and avoiding apoptosis [4, 5]. With the development of next-generation sequencing technology, several large-scale cancer projects have generated a large amount of cancer genomic data, such as The Cancer Genome Atlas (TCGA) [6], International Cancer Genome Consortium (ICGC) [7], which enable the detection of thousands of mutations. However, not all mutations contribute to the cancer initiation and progression. The mutations that are important to the cancer development and provide selective growth advantage are called driver mutations, the opposite is termed as the passenger mutations [8, 9]. Some researches show that the number of passenger mutations far beyond the number of driver mutations [9]. For example, from 11 cancer types, there are only 2 to 6 mutations have been regarded as the driver mutations among 200 somatic mutations which including missense, nonsense, silent, non-coding, splice-site, non-stop mutations, frameshift insertions and deletions (indels) and inframe indels [9-12]. Besides, those important alterations are not uniformly distributed across the genome and target to some specific genes associated with important cellular functions such as cell survival, cell fate etc. [4, 13–15]. For example, the well-known tumor suppressor TP53 participate in defense mechanisms against cancer and their inactivation by alteration can increase the selective growth advantage of the cell [16]. The alterations of ERBB2 [17] and KRAS [18] can lead to the acquisition of new properties that provide some selective growth advantage or spread to remote organs. Hence, the biggest challenge to overcome cancer is how to precisely discriminate those driver genes which harboring driver mutations and have the capability to promote cancer development from those irrelevant passenger genes [11]. This act is essential to understand the tumor biology and designing precision therapies [4, 19]. Traditional methods to identify cancer driver genes are based on the assumption that driver mutations confer a selective advantage to tumor growth and they occur more frequently than expected by random chance [20]. This kind of methods such as Mutsig [21] and MuSic [22] successfully pinpoints part of recurrence genes. However, in fact, only a small number of genes are altered in a high percentage of patient. Much larger number of genes are altered infrequently [11]. Besides, due to the heterogeneity of cancer, it is so hard to properly estimate the background mutation rate that many errors may be introduced [23]. A promising angle to identify cancer driver genes is based on network since it is acknowledged that cancer genes are more closely related with each other within a group to perform a certain function [24]. HotNet [25] and HotNet2 [26] apply a propagation process that diffuse the score of mutation frequency through the whole gene-gene interaction network and extract significantly mutated subnetworks to identify driver genes. NBS [27] detects driver genes by taking the strategies similar to HotNet. However, NBS detects mutated subnetworks of each patient and uses a consensus clustering framework to merge subnetworks across all patients. Unlike previous methods that use global network information, MUFFINN [28] prioritizes the cancer driver genes by measuring the impact from all neighbors of mutated genes in the functional network. Although these network-based methods mentioned above proposed a new focus on the interacting relationship of cancer driver genes, most of them identified cancer driver genes only consider the patient-gene mutation profiles and topology of networks. Besides, they are too much rely on the known network which may create some false positive data [23]. To overcome these limitations, some researchers focus on combining the cancer gene’s functional interactive relationship and other biological properties to improve the precision of detecting cancer driver genes. For example, DriverNet [29] identifies cancer driver genes by estimating their effect on mRNA expression. Inspired by the rationale that cancer driver genes may be determined by their impact on expressions of downstream genes, DriverNet firstly identifies the downstream genes (called outlying genes) with significantly differential expressions and then constructs a bi-graph where one side is mutated genes and the other side is outlying genes. It selects the driver genes that connect to the most nodes in the outlying gene side. Shi et al. [30] further improve DriverNet method by introducing diffusion process on the bi-graph. DawnRank [31] ranks potential cancer driver genes based on both their own expression difference and their impact on the overall differential expression of the downstream genes in the molecular interaction network. LNDriver [24] is also designed on the basis of bi-graph, while it incorporates the DNA length to filter mutated gene at the first step. Above mentioned bi-graph-based methods to some degree improve the accuracy of identifying cancer driver genes by adding biology profiles to the gene itself. However, the reliability of network still needs to do further improvement since most of known networks are built based on either or mix of large scale of computational and experimental data. This may directly impact the efficiency and precision of detecting novel driver genes [23]. Hence, the fundamental problem is to establish one model that can improve the reliability of network so as to improve the power of prediction. To achieve this, some researchers consider to incorporate specific biological profiles to assign a weight for each interaction such as the impact of differential expression information [32]. However, seldom of them considered the facts that the majority of cancer genes interact with each other to form functional modules and those modules should tend to be expressed similarly in the same tissue. Ganegoda et.al [33] use the tissue-specific data to predict the new disease-gene associations by measuring the gene expression in disease related tissues and achieved higher performance. Besides, previous studies found genetic disorders tend to manifest only in a single or a few tissues for a given disease [34]. Motivated by these, we want to refine the gene functional interaction network by considering expression similarity between each pair of mutated genes in the cancer’s related one or two tissues. Moreover, from the previous research, it is known that cancer driver genes are more likely to be frequently mutated across a cohort of patients and also dysregulate downstream genes’ expression. Based on the facts mentioned above, we proposed a model called DyTidriver to predict cancer driver genes by integrating dysregulated expression profiles, tissue-specific expression profiles, modularity of mutated genes and variation frequency into the gene functional interaction network. In DyTidriver, considering the fact that cancer driver genes are likely dysregulate downstream genes’ expression, mutated genes were firstly filtered according to their impact on the expression of downstream genes. After that, mutated genes’ interactive network was weighted by considering gene-gene co-expression in specific tissues of each query disease and the relationship between mutated genes. Because the majority of cancer driver genes interact with each other to form functional modules and those modules tend to be expressed similarly in the same tissue. Finally, with respect to the facts that driver genes are more likely to be frequently mutated across a cohort of patients and interact with each other to form functional modules, the mutated genes were ranked by summing up the weighted graph and multiplying itself variation frequency. We explored our method to detect cancer driver genes of lung cancer, breast cancer and prostate cancer. The result shows that our method significantly outperforms other five existing methods [28-31] in terms of Fscore, Precision and Recall. Besides, the cociter analysis illustrates our method can not only identify some well-known cancer driver genes but also detects the unknown cancer driver genes with high co-occurrence ratio in some publications. Furthermore, the identified cancer driver genes also enrich in some significant pathways and biological functions.

Methods

Our method consists of four steps (see Fig. 1). At first, we filtered the mutated genes for each patient according to whether or not it influenced the expression of downstream genes. Only the mutated genes that dysregualte downstream genes’ expression will be included in our study. Then, the remaining mutated genes for all patients were mapped to the human functional interaction network (human FIN) to construct the Mut-Mut matrix. Thirdly, the tissue-specific pearson correlation coefficient (PCC) matrix was constructed by calculating the co-expression values of mutated genes derived from downloaded tissue expression information after searching the disease-tissue matrix. Finally, we calculated the edge clustering coefficient (ECC) values for the interactions in the network which established at the last step and assigned each mutated gene in the network a score by firstly summing up ECC values of its connected edges and then multiplying the addictive result to its corresponding variation frequency. According to the scores, the mutated genes were ranked in a descending order and those ranked at the top of the list were considered as potential cancer driver genes.
Fig. 1

The workflow of Dytidriver. We divided our whole process of cancer driver gene identification into four steps and marked with ‘a’,’ b’, ‘c’, ‘d’. In the step ‘a’, we filtered the mutated genes for each patient according to whether or not it influenced the expression of downstream genes. Only the mutated genes which connect at least one outlying genes would be included in our study. Then, the filtered mutated genes for all patients were mapped to the human functional interaction network to construct the Mut-Mut matrix. The ‘b’ step is to generate the tissue-specific PCC matrix. For each cancer, we chose the top one or two tissues with the higher association score in disease-tissue matrix as the cancer related tissues such as the tissue 1 and tissue 2 for disease D1. For each tissue, we calculated its gene-gene pearson correlation values across the whole patients and then generated the gene-gene PCC matrix by keeping the absolute PCC values more than 0.3 while left setting to 0. If there are more than one tissue related to a cancer, the final tissue-specific PCC matrix is constructed by averaging the values in the gene-gene PCC matrix of each tissue. In the ‘c’ step, we constructed the ECC mutated matrix by utilizing the ECC equation. In the final ‘d’ step, we assigned each mutated gene in the network a score by summing up all the ECC values of its connecting edges and then multiply to its corresponding variation frequency. According to the scores, the mutated genes were ranked in a descending order and those ranked at the top list the were considered as potential driver genes

The workflow of Dytidriver. We divided our whole process of cancer driver gene identification into four steps and marked with ‘a’,’ b’, ‘c’, ‘d’. In the step ‘a’, we filtered the mutated genes for each patient according to whether or not it influenced the expression of downstream genes. Only the mutated genes which connect at least one outlying genes would be included in our study. Then, the filtered mutated genes for all patients were mapped to the human functional interaction network to construct the Mut-Mut matrix. The ‘b’ step is to generate the tissue-specific PCC matrix. For each cancer, we chose the top one or two tissues with the higher association score in disease-tissue matrix as the cancer related tissues such as the tissue 1 and tissue 2 for disease D1. For each tissue, we calculated its gene-gene pearson correlation values across the whole patients and then generated the gene-gene PCC matrix by keeping the absolute PCC values more than 0.3 while left setting to 0. If there are more than one tissue related to a cancer, the final tissue-specific PCC matrix is constructed by averaging the values in the gene-gene PCC matrix of each tissue. In the ‘c’ step, we constructed the ECC mutated matrix by utilizing the ECC equation. In the final ‘d’ step, we assigned each mutated gene in the network a score by summing up all the ECC values of its connecting edges and then multiply to its corresponding variation frequency. According to the scores, the mutated genes were ranked in a descending order and those ranked at the top list the were considered as potential driver genes

Experimental data

The datasets in this study derived from three places. The first part includes the somatic mutation data and their corresponding transcriptional expression data for each patient. Both of these datasets were downloaded from the TCGA website by utilizing the TCGA2STAT R packages. For our analysis, we focused on the somatic mutation and gene transcriptional expression data for 230 lung cancer patients, 974 breast cancer patients and 331 prostate cancer patients. The downloaded TCGA datasets include both tumor and normal patients: 58 of 230 lung, 110 of 974 breast and 52 of 331 prostate are normal patients. The second part of dataset is the tissue-specific expression profiles. In order to find the most related tissues for each cancer type, we searched the tissue-disease matrix which can be downloaded from the reference [34]. Each entry in the matrix represents the covariance of a disease with a tissue through the way of counting the number of publications co-appearing the disease and tissue, relative to the number of publications mentioning the disease or tissue alone. It is acknowledged that genetic disorders tend to manifest only in a single or few tissues for a given disease [34]. Hence, we chose one or two of the most relevant tissues for each cancer type. Fortunately, the directly related tissue can be found for most of cancer type e.g. the lung tissue for lung cancer, prostate tissue for prostate cancer. However, we cannot find the breast tissue in the disease-tissue matrix. Instead, we chose the top two relevant tissues (e.g. prostate, ovary) with higher association score for breast cancer. In order to obtain the tissue-specific expression profiles, we used the Gene Expression Omnibus (GEO) database. Because GEO database is currently the largest and most famous expression data platform which stores relatively complete expression data. According to the identified most related tissues for each cancer type, we downloaded the gene expression details of each tissue sample from the GEO website by querying dataset GSE7307. The database lists the transcriptional profile of both normal and disease human tissues representing over 90 distinct tissue types by using the Affymetrix human U133 plus 2.0 array. At here, we used the R package called GEOquery to download the corresponding tissue expression information from the platform GPL570. The downloaded data is the expression profile matrix with genes and patients as the columns and rows respectively. The last part of the dataset comes from the currently release version (2016) of human functional interaction network (human FIN) in which involving 12,275 genes and 46,0434 edges [35]. This network is constructed by extending curated pathways with non-curated sources of information, including protein-protein interactions, gene co-expression, protein domain interaction, Gene Ontology (GO) annotations and text-mined protein interactions, which cover close to 50% of the human proteome. The benchmarking of driver genes was downloaded from the NCG 4.0 which included 537 known cancer genes from the Cancer Gene Census [36] and 1463 candidate cancer genes that were derived from the manual curation of 77 whole genome or whole exome cancer-resequencing screenings [37] .

Filtering mutated genes and constructing Mut-Mut matrix

The somatic mutation data were downloaded from TCGA website where records the information of mutated gene across patients. The genes that were mutated in at least one patient were kept and regarded as the mutated genes. Previous researches have pointed out that driver genes are more likely to regulate the expression of downstream genes [29-31]. Those gene whose expression were impacted significantly are called outlying genes. In order to acquire the outlying genes, we downloaded the transcriptional expression information from the TCGA website and calculated their z-scores. More specifically, for each gene and each patient, a gene was regarded as the outlying gene for the patient if its z-score > 2.0 or its z-score < − 2.0. The setting of threshold as ± 2.0 was referred to the DriverNet [29]. Then, we kept the mutated genes which have at least one connection with outlying genes in the human FIN while filtered out those having no connections with outlying genes. Finally, the remaining mutated genes were mapped to the human FIN to generated the binary Mut-Mut matrix in which the rows and columns are the remaining mutated genes and the element is 1 if there is a connection between the two mutated genes in the human FIN, 0 otherwise.

Assigning weight to Mut-Mut matrix by PCC values

Since the majority of disease genes forming a common functional module tend to be expressed similarly in the same tissue and there exist too much false positive connections in the gene networks, in this work, we use tissue-specific expression profile to assign weights for the interactions of genes in order to improve the reliability of genes interactive network. For each cancer type, at first, we chose the most related tissue according to its association score in the disease-tissue matrix [34]. If there is at least one tissue related with a cancer in the disease-tissue matrix, its corresponding tissue expression information across a cohort of patients can be downloaded from the GEO website. After that, we calculated the gene-gene PCC values of downloaded tissue expression matrix across the whole patients and then generated the PCC matrix by keeping their absolute PCC values more than 0.3 while left setting to 0. The threshold setting was according to previous research [34]. At last, the average score of PCC matrix of each tissue was regarded as the final tissue-specific PCC matrix of the cancer type. We assigned a weight to values in the Mut-Mut matrix based on the tissue-specific PCC matrix. Specifically, if a mutated gene i connects to a mutated gene j in the Mut-Mut matrix (e.g. W(i,j) = 1), the PCC value of genes i and j was assigned to the corresponding entry of the Mut-Mut matrix otherwise the value was set to 0. Consequently, a weighted mutated PCC matrix denoted by W is constructed.

Calculating the mutated gene score

Previous studies have found that cancer is the fact that genes act together in various signaling pathway and protein complexes [25]. Hence, in order to highlight the modularity of cancer driver genes, we calculated the ECC values for each pair of mutated genes in the mutated PCC matrix. The ECC value was normally used to measure the degree of closeness between two nodes in a network, which has been widely applied in detecting network modules [38-40]. We calculated the ECC values for each pair of mutated genes in the weighted mutated PCC matrix (denoted by Matrix W in Eq. 1). The higher ECC value means two genes are more likely to act together in a common module. The definition of ECC is as Eq. 1. After calculating the ECC score for each pair of mutated genes in the weighted mutated PCC matrix, we assigned each mutated gene a score (Mi) by summing up all ECC values of its connecting edges (see Eq. 2). It is known that cancer driver genes are more likely to be those frequently mutated in many patients. Hence, the final ranking score of each mutated gene was calculated by multiplying its variation frequency to its additive score (see Eq. 3). After that, all mutated genes were ranked in a descending order according to their ranking scores and the genes with the higher rank are more likely to be the cancer driver genes. Where W denotes weighted mutated PCC matrix. k denotes the common neighbors between mutated gene i and gene j in the matrix W. Wik is the weight between mutated gene i and gene k. d and d are the degrees of nodes i and j, respectively. Min (d,d) represents the maximal possible number of triangles that might include the edge(i,j). N is the set of all neighbors of mutated gene i. Vi denotes variation frequency of gene i which is measured by mutated times of gene i out of total patient counts.

Statistic evaluation metrics

In order to evaluate the performance of our method, top N of ranked genes were selected as potential cancer driver genes. The accuracy of prediction depends on how well the predicted cancer driver genes match the real ones, which was measured by three widely used statistic metrics, Precision, Recall and Fscore. where TP (true positive) is the number of predicted driver genes matched by known driver genes in benchmarking dataset. TN (true negative) is the number of not predicted driver genes that are not matched by known ones. FP (False Positive) is the number of predicted driver genes that are not matched by known driver genes. FN (false negative) is the number of known driver genes that are not matched by predicted ones.

Enrichment analysis

Another evaluation metric is pathway and GO enrichment analysis in order to evaluate whether or not the predicted cancer driver genes share common biological functions. It is widely known that cancer is a disease of pathways and the somatic mutations target the cancer genes in a group of regulatory and signaling networks [25]. Besides, those cancer-related driver mutations recurrently occur in the functional regions of protein (such as kinase domains and binding domains) to interrupt the major biological functions [41]. In this study, we leveraged the DAVID database to do the KEGG pathway enrichment analysis and GO enrichment analysis [42].

Results

In order to testify the effectiveness of our method, we applied our method and other four models: DriverNet [29], DawnRank [31] and Diffusion algorithm [30], Muffinn [28] on the breast cancer, prostate cancer and lung cancer to identify their driver genes. Among them, the DriverNet, DawnRank and Shi’s Diffusion algorithm utilize the gene dysregulated expression information to identify outlying genes and construct the bipartite graph. These methods ranked mutated genes according to their connections with the outlying genes. The Muffinn method leverages both the variation frequency of mutated genes and the impact of their neighbors to design the ranking scores. It was further classified into two models: Muf_max and Muf_sum, according to considering the impact of either the most frequently mutated neighbor or all direct neighbors [28]. Unlike the DriverNet, DawnRank and Shi’s diffusion method that use gene dysregulated expression to construct bipartite graph, our study only employs the dysregulated expression profile to filter the mutated genes. Moreover, similar to the Muffinn method, we also consider the variation frequency of mutated genes and the impact of their direct neighbors. However, compared with other methods, our method not only integrates the features of dysregulated expression information, variation frequency and human FIN but also considers the modularity of mutated genes and their co-expression in the same tissue. Running DawnRank demands expression data with normal and tumor samples. From the three cancer datasets, we can only download 110, 58, 52 tumor samples that have normal gene expression profiles for breast, lung and prostate respectively. Besides, we set the free parameter of DawnRank as three which was recommended by DawnRank authors [31].

Comparing performance

All the mutated genes were ranked in a descending order based on the scores assigned by each comparing method. After that, K of genes ranked in the top list were selected as candidate driver genes. According to the benchmark dataset, the Fscore, Recall, Precision values can be calculated to evaluate the performance of each method. With different values of K ranging from 1 to 200, the Fscore curve, Recall curve and Precision curve is drawn. The results are shown in the Fig. 2. In general, our results are superior to all of other four methods on the lung, prostate and breast cancer datasets. Compared with the other five methods, our model identifies the largest number of known drivers from NCG 4.0. For lung cancer, the Dytidriver and the other methods are tangled together when predicting small number of potential driver genes and then Dytidriver is significantly better than the other methods when the number of predicted driver genes increases from top 40 to 200. For prostate and breast cancer, our model demonstrated the best performance from beginning to the end. Similar to Muffinn, considering the variation frequency and the functional impact of direct neighbors, our method additionally takes advantage of the tissue-specific co-expression property and the modularity property which improve the precision of detecting driver genes to a higher level. Besides, the performance of Muf_max is worse than that of Muf_sum, which means it is inappropriate to judge a driver only based on the impact of single gene. DawnRank performed poorly among all comparing methods. The reason might be that only a limited number of cancer patients both have normal and tumor expression data for DawnRank.
Fig. 2

A comparison of the Precision, Recall, and Fscore for top ranking genes in the six methods. The X-axis represents the number of top-ranking genes. The Y-axis represents the score of the given metric

A comparison of the Precision, Recall, and Fscore for top ranking genes in the six methods. The X-axis represents the number of top-ranking genes. The Y-axis represents the score of the given metric We select the top 200 of cancer driver genes to do GO and pathway enrichment analysis. For lung cancer, in the biological process, the genes detected by our method enrich in the signal transduction, intracellular signaling cascade, transcription, metabolic process, regulation of cell death and apoptosis etc. In the cellular component, our results focus on the plasma membrane, organelle, cytoskeleton, lumen and cell fraction etc. In the molecular function, our results enrich in ion binding, nucleotide binding, ATP binding, transcription regulator activity etc. From the pathway aspect, our identified cancer driver genes enrich in some important cancer pathway, such as calcium signaling pathway, PI3K-Akt signaling pathway, mTOR signaling pathway. With respect to the breast cancer, in biological process, our results enrich in the intracellular signaling cascade, signal transduction, regulation of transcription, metabolic process, regulation of cell death, phosphorylation, transcription, phosphorylation and cell proliferation. In the cellular component, our results enrich in the plasma membrane, organelle, lumen and cell fraction. In the molecular function, our results mainly enrich in the nucleotide binding, ATP binding, DNA binding, transcription regulator activity and kinase activity. In pathway analysis, our results enrich in Calcium signaling pathway, MAPK signaling pathway, PI3K signaling pathway, p53 signaling pathway etc. In terms of prostate cancer, our results enrich in the regulation of transcription, signal transduction, adhesion molecules, regulation of GTPase activity etc. in biological process. For cellular component, our results enrich in nucleus, plasma membrane, cytosol, intracellular, protein complex etc. For molecular function, our results focus on protein binding, ATP binding, DNA binding, protein kinase activity and so on. From pathway aspect, our results enrich in the Calcium signaling pathway, PI3K signaling pathway, cAMP signaling pathway, mTOR signaling pathway.

Cociter analysis

Because the benchmark cancer driver genes are incomplete, to further prove the prediction capability of our method in distinguishing potentially cancer driver genes, we adopted the literature mining method to figure out the co-citation times of the predicted driver genes with the keywords ‘cancer type’(i.e. breast, prostate or lung), ‘driver’ and ‘cancer’ in the cociter website [25]. The larger the number of times the gene co-appeared with the keywords, the stronger associations between them. In this study, Tables 1, 2 and 3 show the cociter analysis of top 30 of genes identified by our method for each cancer type. In order to illustrate the capability of our method to prioritize significant well-known cancer driver genes, we also listed genes ranking position in other five methods.
Table 1

Cociter analysis of top 30 lung cancer driver genes identified by our method

GenesCancerLungDriverIs_driverDyTidriverDiffusionDriverNetDawnRankMuf_maxMuf_sum
TP53677299911011201156
ZNF536401125015NA268984979
EGFR4748284916613134726
TSHZ3411042748129524631268188
PRUNE21211055211NA26232018332
RYR2432067572055812825
SPTA1321072216151236
ATP10D100081836NA28252667873
ANKIB1210091607NA257241072080
ZNF5212011105025NA30581906302
NES1923150111483NA146130941138
PIK3CA1199183541122511243081
TLR4417591911371453672138
NF11651611114345621389139
FAT4457201531068391961970119
ASH1L4110161506NA22892549761
PRKCB41111117512NA44292
SLC12A12210181647NA303840061750
CTNNB1251734044119621NA5127
PLCB197102025222774591
APOB27420211177866442
MET10453484002221377427186
GRIN2B13320231839120397135
UBC13417202434NA1371
SASH113310251537NA132551003080
HGF39317470264784403981192
BRAF21752701261277075155392150
UBA61110285263NANA2957980
PTPRZ112110293366NA2402894289
TAF1L2110305575754710130

The second to the fourth column show the co-appeared times of top 30 identified genes with ‘driver’, ‘lung’ and ‘cancer’ (from the left to the right). Is_Driver indicates whether the given gene is a driver gene or not in the benchmark dataset. The left columns represent the ranking positions of identified genes in Dytidriver, Diffusion, DriverNet, DawnRank, Muf_max, Muf_sum respectively

Table 2

Cociter analysis of top 30 prostate cancer driver genes identified by our method

GenesCancerProstateDriveris driverDyTidriverDiffusionDriverNetDawnRankMuf maxMuf sum
TP53677229811011111384
CTNNB1251717044122221409
ASH1L401031703NANA65378
SPOP43244141721316983
ATM1377615051311123614
PTEN3047642641670094NA3937
TTN1002071724221422
FOXA118269100817533710
KMT2D25220985554NANANA
PIK3CA11993454110710NA28236
DYNC1H191201166195121972
CDH124000121511NA755349296
BRAF217533126113326633634834
AKT12152317231142023NA5233
FAT3111015192675NANA
LRP47020161440NANA1426541
GRIN2B13020177433NA22090
KMT2C232401861327NANANA
NCOR11092731195977584160
HSPA89691020108NA43867
OBSCN7000211714168408124
GRIN2A501022285928537473
PCDHA12100023145327119732465
MED12194402437616215731784
STAT318241472702516155588
PCDH182110261656936626239
CDH2350102745797NA29563
SPTA1301028171916922115
UFL1701029NANANA12381265
SP139338313089NA865

The second to the fourth column show the co-appeared times of top 30 identified genes with ‘driver’,‘prostate’ and ‘cancer’ (from the left to the right). Is_driver indicates whether the given gene is a driver or not in benchmark dataset. The left columns represent the ranking positions of identified genes in Dytidriver, Diffusion, DriverNet, DawnRank, Muf_max, Muf_sum respectively

Table 3

Co-citer analysis of top 30 breast cancer driver genes identified by our method

GenesCancerBreastDriveris driverDyTidriverDiffusionDriverNetDawnRankMuf maxMuf sum
TP5367721356110112331272
PIK3CA119933454121562123
MAP 3 K11356221312818489928
GATA31541228148513688817
CDH1141035819154241016
ERBB2533543327816726490873
UBC134302072403122221
NCOR1109453181391248668
ASH1L401091097NA19861846729
PIK3R113121711016010261345
EP3002698641116851783674
DYNC1H1922012638171017107
HUWE1294301325128459112
PTEN30476726411418598193379
MAP 3 K132011156189NA330326542045
NF1165241111614141194144
TTN10120172581657175
TPP24020181041NA267431722926
UFL1711019802NANA34933129
BRCA146524017221202511NA36127
BACH28120218101182236622981079
JAK2382921912211832NA73119
ERBB335417841237329810207
ERBB43502204124745627618410
MAP 2 K470102125127342389886
CTCF632131265520211102729
PRKCB4191127174593180151
SASH113810281011NANA37064179
TAF11031129225863335919
SPTA130103021217251018109

The second to the fourth column show the co-appeared times of top 30 identified genes with ‘driver’, ‘breast’ and ‘cancer’ (from the left to the right). is_driver indicates whether the given gene is a driver or not in the benchmark dataset. The left columns represent the ranking positions of identified genes in Dytidriver, Diffusion, DriverNet, DawnRank, Muf_max, Muf_sum respectively

Cociter analysis of top 30 lung cancer driver genes identified by our method The second to the fourth column show the co-appeared times of top 30 identified genes with ‘driver’, ‘lung’ and ‘cancer’ (from the left to the right). Is_Driver indicates whether the given gene is a driver gene or not in the benchmark dataset. The left columns represent the ranking positions of identified genes in Dytidriver, Diffusion, DriverNet, DawnRank, Muf_max, Muf_sum respectively Cociter analysis of top 30 prostate cancer driver genes identified by our method The second to the fourth column show the co-appeared times of top 30 identified genes with ‘driver’,‘prostate’ and ‘cancer’ (from the left to the right). Is_driver indicates whether the given gene is a driver or not in benchmark dataset. The left columns represent the ranking positions of identified genes in Dytidriver, Diffusion, DriverNet, DawnRank, Muf_max, Muf_sum respectively Co-citer analysis of top 30 breast cancer driver genes identified by our method The second to the fourth column show the co-appeared times of top 30 identified genes with ‘driver’, ‘breast’ and ‘cancer’ (from the left to the right). is_driver indicates whether the given gene is a driver or not in the benchmark dataset. The left columns represent the ranking positions of identified genes in Dytidriver, Diffusion, DriverNet, DawnRank, Muf_max, Muf_sum respectively For lung cancer, Table 1 shows some well-studied cancer driver genes were ranked in the top 30 by our methods, but were put in the latter positions by other methods. For example, Phosphatidylinositol 3-kinases (PI3Ks) are well known regulators of cellular growth and proliferation. It was ranked 12th by our method while ranked 112th by Dawnrank, 430th by Muf_max, 81th by Muf_sum. Toll-like receptor-4 (TLR4) in human tumors often correlates with chemoresistance and metastasis [43] which was ranked 13th by our method, ranked 71th by Diffusion algorithm while ranked 672th by Muf_max and 138th by Muf_sum. The oncogenic BRAF(V600E) mutation results in an active structural conformation characterized by greatly elevated ERK activity [44]. It was identified as the known cancer driver genes but ranked 70th, 75th, 155th, 392th and 150th by Diffusion, DriverNet and DawnRank, Muf_max and Muf_sum respectively. Our method can not only prioritize the significant cancer driver genes but also identify some potential cancer driver genes which were neglected by the NCG 4.0 such as the NES, MET and HGF. Especially for the MET, some researchers found that high MET gene copy number leads to shorter survival in patients with non-small cell lung cancer. MET co-existed with key words, ‘cancer’, ‘lung’ and ‘driver’ for 1045, 348 and 40 times. For the prostate cancer as shown in Table 2, our method also identified some high-ranking significant driver genes, including TP53, CTNNB1, PTEN, PIK3CA and so on. What we want to mention is the famous tumor suppressor PTEN which is frequently inactivated in human prostate cancer [45]. It was ranked 6th by our method but strangely put in the 700th by Diffusion algorithm, 94th by DriverNet and even neglected by DawnRank. Furthermore, the results show DawnRank missed more than one significant cancer driver genes including PTEN, PIK3CA and AKT1. BRAF which involves in prostate related RAS/RAF/ERK signaling pathway [28] was ranked 13th by our methods while 326th by Diffusion algorithm, 63th by DriverNet, 36th by DawnRank, 348th by Muf_max and 34th by Muf_sum. Besides, some high associated genes ignored by NCG 4.0 are also ranked in the top list of our method. The ATM (ataxia telangiectasia mutated) kinase plays an essential role in maintaining genome integrity by coordinating cell cycle arrest, apoptosis, and DNA damage repair [46]. It was missed by the NCG 4.0 but co-appeared with ‘cancer’ for 1377 times, with ‘prostate’ for 61 times and with ‘driver’ for 5 times. Forkhead box protein A1 (FOXA1) modulates the transactivation of steroid hormone receptors and thus may influences tumor growth and hormone responsiveness in prostate cancer [47]. It was ranked 8th by our method while neglected by NCG 4.0. In addition, the transcription factors SP1 also has been missed by NCG 4.0. For breast cancer in Table 3, our method successfully achieved a high precision in identifying the top 10 cancer driver genes with 8 out of 10 accuracy rates. The well-studied breast cancer driver genes including TP53, PIK3CA, MAP 3 K1, CDH1, ERBB2 and PTEN were also put in the top list of our method. Among those known breast cancer driver genes, the top three cancer driver genes (TP53, PIK3CA, MAP 3 K1) identified by our methods were ranked 233th, 156th and 128th respectively by Diffusion algorithm. The HER2 (official name is ERBB2) gene encodes a membrane receptor in the epidermal growth factor receptor family amplified and over expressed in adenocarcinoma [48]. It was regarded as the important cancer driver gene by many researchers and ranked 6th by our method while 72th, 64th, 90th, 73th by Diffusion algorithm, DriverNet, DawnRank and Muf_sum respectively. The breast cancer suppressor gene PTEN was ranked 14th by our method while 185th, 98th, 93th and 79th by Diffusion, DriverNet, DawnRank and Muf_sum receptively. Besides, the BRCA1 and JAK2 that co-cited with ‘cancer’ and ‘breast’ for many times were also missed by the DawnRank.

Discussion

The core step to overcome cancer is to identify the cancer driver genes which can promote cancer evolvement and development. However, it is a hard task since cancer is heterogeneous and there are too much irrelevant passenger genes. Recently, many methods try to shorten the distance to the truth. However, these methods still have some limitations. For example, they ignored many driver genes with low variation frequency and highly depend on the error-prone network. Inspired by the fact that cancer genes forming functional modules tend to be expressed similarly in the same tissue, we considered to improve the reliability of the gene functional interaction network by incorporating the expression similarity between mutated gene pairs in the cancers’ related tissues. In order to obtain the tissue-specific expression profiles, we used the GEO database. Because GEO database is currently the largest and most famous expression data platform which stores relatively complete expression data. The GEO dataset which we used in this work was consisted of a total of 677 patients, including cancer and normal patients, covered over 90 distinct tissue types and was created by the same organization using the same experimental technology. Although our model is superior to the other methods, it still has some limitations. For example, the datasets used in this work come from different projects: TCGA and GEO. Although, we just use the GEO dataset to calculate the co-expression values of mutated genes in a specific tissue. The likelihood is that there exists ambiguous since the heterogeneous within different patients. Therefore, in order to release this concern, in the future, we consider to unify the dataset as far as possible.

Conclusion

In this work, we proposed a new method to identify cancer driver genes by integrating the gene dysregulated expression, tissue-specific expression and variation frequency into the functional interaction network. Compared to other network-based methods, our method not only considered that driver genes have impact on the expression of downstream genes, but also took advantage of the modularity property of driver genes, their co-expression in specific tissues and itself variation frequency. We compared our results with other four similar methods and did cociter analysis and enrichment analysis. From the results, we can easily draw the conclusion that our method has the capability to identify the cancer driver genes with high precision and meanwhile detect some potential unknown cancer driver genes. Besides, the enrichment analysis also illustrates that the top ranking cancer driver genes in our list enrich in some significant cancer-related pathways and implement important functions [48].
  45 in total

1.  Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.

Authors:  Da Wei Huang; Brad T Sherman; Richard A Lempicki
Journal:  Nat Protoc       Date:  2009       Impact factor: 13.491

Review 2.  Evolution of the cancer genome.

Authors:  Lucy R Yates; Peter J Campbell
Journal:  Nat Rev Genet       Date:  2012-10-09       Impact factor: 53.242

3.  DawnRank: discovering personalized driver genes in cancer.

Authors:  Jack P Hou; Jian Ma
Journal:  Genome Med       Date:  2014-07-31       Impact factor: 11.117

4.  The neu oncogene: an erb-B-related gene encoding a 185,000-Mr tumour antigen.

Authors:  A L Schechter; D F Stern; L Vaidyanathan; S J Decker; J A Drebin; M I Greene; R A Weinberg
Journal:  Nature       Date:  1984 Dec 6-12       Impact factor: 49.962

Review 5.  A census of human cancer genes.

Authors:  P Andrew Futreal; Lachlan Coin; Mhairi Marshall; Thomas Down; Timothy Hubbard; Richard Wooster; Nazneen Rahman; Michael R Stratton
Journal:  Nat Rev Cancer       Date:  2004-03       Impact factor: 60.716

6.  Patterns of somatic mutation in human cancer genomes.

Authors:  Christopher Greenman; Philip Stephens; Raffaella Smith; Gillian L Dalgliesh; Christopher Hunter; Graham Bignell; Helen Davies; Jon Teague; Adam Butler; Claire Stevens; Sarah Edkins; Sarah O'Meara; Imre Vastrik; Esther E Schmidt; Tim Avis; Syd Barthorpe; Gurpreet Bhamra; Gemma Buck; Bhudipa Choudhury; Jody Clements; Jennifer Cole; Ed Dicks; Simon Forbes; Kris Gray; Kelly Halliday; Rachel Harrison; Katy Hills; Jon Hinton; Andy Jenkinson; David Jones; Andy Menzies; Tatiana Mironenko; Janet Perry; Keiran Raine; Dave Richardson; Rebecca Shepherd; Alexandra Small; Calli Tofts; Jennifer Varian; Tony Webb; Sofie West; Sara Widaa; Andy Yates; Daniel P Cahill; David N Louis; Peter Goldstraw; Andrew G Nicholson; Francis Brasseur; Leendert Looijenga; Barbara L Weber; Yoke-Eng Chiew; Anna DeFazio; Mel F Greaves; Anthony R Green; Peter Campbell; Ewan Birney; Douglas F Easton; Georgia Chenevix-Trench; Min-Han Tan; Sok Kean Khoo; Bin Tean Teh; Siu Tsan Yuen; Suet Yi Leung; Richard Wooster; P Andrew Futreal; Michael R Stratton
Journal:  Nature       Date:  2007-03-08       Impact factor: 49.962

7.  Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma.

Authors:  Ryan D Morin; Maria Mendez-Lago; Andrew J Mungall; Rodrigo Goya; Karen L Mungall; Richard D Corbett; Nathalie A Johnson; Tesa M Severson; Readman Chiu; Matthew Field; Shaun Jackman; Martin Krzywinski; David W Scott; Diane L Trinh; Jessica Tamura-Wells; Sa Li; Marlo R Firme; Sanja Rogic; Malachi Griffith; Susanna Chan; Oleksandr Yakovenko; Irmtraud M Meyer; Eric Y Zhao; Duane Smailus; Michelle Moksa; Suganthi Chittaranjan; Lisa Rimsza; Angela Brooks-Wilson; John J Spinelli; Susana Ben-Neriah; Barbara Meissner; Bruce Woolcock; Merrill Boyle; Helen McDonald; Angela Tam; Yongjun Zhao; Allen Delaney; Thomas Zeng; Kane Tse; Yaron Butterfield; Inanç Birol; Rob Holt; Jacqueline Schein; Douglas E Horsman; Richard Moore; Steven J M Jones; Joseph M Connors; Martin Hirst; Randy D Gascoyne; Marco A Marra
Journal:  Nature       Date:  2011-07-27       Impact factor: 49.962

8.  Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes.

Authors:  Mark D M Leiserson; Fabio Vandin; Hsin-Ta Wu; Jason R Dobson; Jonathan V Eldridge; Jacob L Thomas; Alexandra Papoutsaki; Younhun Kim; Beifang Niu; Michael McLellan; Michael S Lawrence; Abel Gonzalez-Perez; David Tamborero; Yuwei Cheng; Gregory A Ryslik; Nuria Lopez-Bigas; Gad Getz; Li Ding; Benjamin J Raphael
Journal:  Nat Genet       Date:  2014-12-15       Impact factor: 38.330

9.  Network-based stratification of tumor mutations.

Authors:  Matan Hofree; John P Shen; Hannah Carter; Andrew Gross; Trey Ideker
Journal:  Nat Methods       Date:  2013-09-15       Impact factor: 28.547

10.  Mutational heterogeneity in cancer and the search for new cancer-associated genes.

Authors:  Michael S Lawrence; Petar Stojanov; Paz Polak; Gregory V Kryukov; Kristian Cibulskis; Andrey Sivachenko; Scott L Carter; Chip Stewart; Craig H Mermel; Steven A Roberts; Adam Kiezun; Peter S Hammerman; Aaron McKenna; Yotam Drier; Lihua Zou; Alex H Ramos; Trevor J Pugh; Nicolas Stransky; Elena Helman; Jaegil Kim; Carrie Sougnez; Lauren Ambrogio; Elizabeth Nickerson; Erica Shefler; Maria L Cortés; Daniel Auclair; Gordon Saksena; Douglas Voet; Michael Noble; Daniel DiCara; Pei Lin; Lee Lichtenstein; David I Heiman; Timothy Fennell; Marcin Imielinski; Bryan Hernandez; Eran Hodis; Sylvan Baca; Austin M Dulak; Jens Lohr; Dan-Avi Landau; Catherine J Wu; Jorge Melendez-Zajgla; Alfredo Hidalgo-Miranda; Amnon Koren; Steven A McCarroll; Jaume Mora; Brian Crompton; Robert Onofrio; Melissa Parkin; Wendy Winckler; Kristin Ardlie; Stacey B Gabriel; Charles W M Roberts; Jaclyn A Biegel; Kimberly Stegmaier; Adam J Bass; Levi A Garraway; Matthew Meyerson; Todd R Golub; Dmitry A Gordenin; Shamil Sunyaev; Eric S Lander; Gad Getz
Journal:  Nature       Date:  2013-06-16       Impact factor: 49.962

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.