
Inferring Novel Tumor Suppressor Genes with a Protein-Protein Interaction Network and Network Diffusion Algorithms.

Lei Chen¹,², Yu-Hang Zhang¹, Zhenghua Zhang³, Tao Huang¹, Yu-Dong Cai⁴.

Abstract

Extensive studies on tumor suppressor genes (TSGs) are helpful to understand the pathogenesis of cancer and design effective treatments. However, identifying TSGs using traditional experiments is quite difficult and time consuming. Developing computational methods to identify possible TSGs is an alternative way. In this study, we proposed two computational methods that integrated two network diffusion algorithms, including Laplacian heat diffusion (LHD) and random walk with restart (RWR), to search possible genes in the whole network. These two computational methods were LHD-based and RWR-based methods. To increase the reliability of the putative genes, three strict screening tests followed to filter genes obtained by these two algorithms. After comparing the putative genes obtained by the two methods, we designated twelve genes (e.g., MAP3K10, RND1, and OTX2) as common genes, 29 genes (e.g., RFC2 and GUCY2F) as genes that were identified only by the LHD-based method, and 128 genes (e.g., SNAI2 and FGF4) as genes that were inferred only by the RWR-based method. Some obtained genes can be confirmed as novel TSGs according to recent publications, suggesting the utility of our two proposed methods. In addition, the reported genes in this study were quite different from those reported in a previous one.


Keywords: Laplacian heat diffusion; permutation test; protein-protein interaction network; random walk with restart; tumor suppressor gene

Year: 2018 PMID: 30069494 PMCID: PMC6068090 DOI: 10.1016/j.omtm.2018.06.007

Source DB: PubMed Journal: Mol Ther Methods Clin Dev ISSN: 2329-0501 Impact factor: 6.698

Introduction

A tumor, also called a neoplasm, generally refers here to malignant tissue with uncontrolled proliferative capacity, which usually but not always forms a solid mass with invasive and metastatic tendencies and behaves as a space-occupying, systemic disease. According to World Health Organization (WHO) statistics, more than 8.8 million people worldwide die from cancer annually, accounting for nearly one-sixth of all global deaths, which makes cancer one of the most significant current threats to human health. The direct economic cost of cancer prevention, diagnosis, and treatment has been estimated at up to 1.16 trillion dollars per year. After a century of research, the main pathogenesis of cancer has been partially revealed: both genetic background and environmental factors can induce genomic abnormalities that initiate and promote tumorigenesis. However, the specific genes that contribute to these malignant processes in different tumor subtypes have not been fully identified. Genes that directly contribute to tumor-associated malignant processes can be divided into two groups with opposite biological functions under normal conditions: oncogenes and tumor suppressor genes (TSGs). Generally, oncogenes have the potential to induce tumors, whereas tumor suppressors protect cells from malignant alterations [4, 5]. During tumor initiation and progression, genetic variants may activate oncogenes and silence tumor suppressors, further inducing the abnormal proliferation and invasion of malignant cells. Oncogenes and tumor suppressors may therefore be equally significant for tumorigenesis, and numerous studies of both groups have shown that they play different biological roles in it. 
The two-hit hypothesis is one of the mainstream theories explaining the genetic contribution to tumor initiation [6, 7]. Under this hypothesis, both alleles of a gene generally must be affected before the phenotype appears. The accumulation of variants in both alleles is characteristic of TSGs but not of oncogenes, because most mutant TSGs are recessive, meaning that a single mutant allele is usually not sufficient to induce tumorigenesis. Exceptions to the two-hit rule have since been identified, refining the known regulatory and pathological mechanisms of tumor suppressors. Two groups of TSGs have been confirmed as exceptions: (1) genes containing dominant-negative mutations and (2) genes exhibiting haploinsufficiency, both of which allow a mutation in a single allele to have a functional effect. TP53, which encodes the p53 protein, is a well-known exception: because a mutant TP53 allele may prevent the function of the non-mutant one, a single mutant allele of TP53 may directly contribute to tumorigenesis. Apart from genes that contain dominant-negative mutations like TP53, the other group of TSGs that escapes the two-hit rule exhibits haploinsufficiency. Usually caused by loss-of-function mutations, haploinsufficiency is a common genetic mechanism in which an abnormal phenotype arises when only one functional copy of a gene remains and the other allele is dysfunctional [11, 12]. Genes that are otherwise recessive, like most TSGs, can thus produce a phenotype from a single mutant allele when they exhibit haploinsufficiency. 
PTCH in medulloblastoma and NF1 in neurofibroma are two typical examples of TSGs that act through haploinsufficiency [13, 14]. Although TSGs that escape the two-hit rule are functional and well known, they account for only a small part of all TSGs; most obey the two-hit hypothesis. For this reason, TSGs are harder to identify than oncogenes when comparing tumor and normal samples, because normal samples may also carry dysfunctional mutations in one allele of a TSG [6, 7]. Identifying TSGs with current experimental approaches is difficult and time consuming. In recent years, computational methods built on known data have become an alternative way to investigate cancer-related problems. These methods often perform well on known data and may give useful clues for extending current knowledge. For example, several computational models have been built to identify cancer-related non-coding RNAs (ncRNAs) [15, 16, 17, 18, 19, 20], which can help recognize further candidates. For TSGs, the development of high-throughput sequencing and the accumulation of experimentally identified TSGs make it possible to identify potential TSGs with computational methods and current databases, which is faster and cheaper than experimental identification. Recently, Chen et al. proposed a computational method to identify potential TSGs based on known ones. Their method applied the shortest path (SP) algorithm to a protein network and searched for SPs connecting any two known TSGs. Genes lying on the obtained paths were extracted as candidate TSGs, which were further filtered by a permutation test. However, this method cannot fully use the network because the SP algorithm can only identify genes within a limited distance of known genes. 
This study employed two network diffusion algorithms, which can make full use of the whole network, and designed stricter screening tests to extract more reliable potential TSGs. We proposed two computational methods based on the Laplacian heat diffusion (LHD) algorithm and the random walk with restart (RWR) algorithm, respectively. By executing each algorithm on a protein-protein interaction (PPI) network using known TSGs as seed nodes, we obtained a number of potential TSGs. Three screening tests (permutation, association, and function tests) then followed to control false positives and extract important genes. As a result, the LHD-based method yielded 41 putative genes, and the RWR-based method produced 140 putative genes. Only a few genes were identified by both methods, indicating that each method has its own prediction advantages and that combining them may further improve prediction. In addition, our putative genes were quite different from those reported in a previous study, suggesting that they can be essential supplements for the study of TSGs. Finally, based on recent publications, several putative genes yielded by either the LHD-based or RWR-based method can be confirmed to be novel TSGs, supporting the reliability of the results.

Results

In this study, we built two computational methods to infer novel TSGs: one was based on the LHD algorithm (LHD-based method), and the other one was based on the RWR algorithm (RWR-based method). These two methods integrated three screening tests. The procedures of these two methods are illustrated in Figure 1. The detailed description of these two methods can be found in the Materials and Methods. This section provides the detailed results yielded by them and makes some comparisons.

Figure 1

The Procedures of the LHD-Based and RWR-Based Methods

The LHD-based method first applied the LHD algorithm on a PPI network using validated TSGs as seed nodes, producing a large number of LHD genes. Then, these genes were filtered by three screening tests. The RWR-based method followed similar procedures. The only difference was the application of the RWR algorithm on the PPI network rather than the LHD algorithm.


Putative Genes Yielded by the LHD-Based Method

The LHD-based method applied the LHD algorithm to the PPI network G using Ensembl IDs of TSGs as seed nodes, which assigned a heat value to each node in G. We set the threshold of the heat value to 10⁻⁴, obtaining 2,874 LHD genes. Next, these genes were evaluated by the permutation test with 500 randomly produced sets, yielding a p value for each of them, and the 443 genes with p values less than 0.05 were selected. These 443 genes were then measured by maximum interaction scores (MISs) in the association test. Because 900 is the cutoff for the highest confidence in STRING, it was set as the threshold of MIS, which left 85 genes. Finally, for each of the 85 remaining genes, we calculated the maximum function score (MFS) in the function test. By setting 0.9 as the threshold of MFS, we obtained 41 putative genes. All measurements mentioned above for each LHD gene are listed in Table S1, and the number of candidate genes passing each stage in the LHD-based method is listed in Table 1. The 41 putative genes can be found in Table S2.
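The three screening tests act as a sequential filter, which can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `Candidate` class and `screen` function are hypothetical names, the use of `>=` for the MIS and MFS thresholds is an assumption (the text only names the threshold values), and the three `HYPO_*` rows are invented to show one failure per test. The first three rows reuse the LHD values reported in Tables 3 and 4.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    symbol: str
    p_value: float  # permutation-test p value
    mis: int        # maximum interaction score (STRING scale)
    mfs: float      # maximum function score


def screen(candidates, p_cut=0.05, mis_cut=900, mfs_cut=0.9):
    """Apply the permutation, association, and function tests in sequence."""
    after_perm = [c for c in candidates if c.p_value < p_cut]
    # ">=" at the two thresholds below is an assumption, not stated in the text.
    after_assoc = [c for c in after_perm if c.mis >= mis_cut]
    after_func = [c for c in after_assoc if c.mfs >= mfs_cut]
    return after_perm, after_assoc, after_func


candidates = [
    Candidate("MAP3K10", 0.036, 925, 0.9954),  # Table 3
    Candidate("RFC2", 0.036, 999, 0.9146),     # Table 4
    Candidate("GUCY2F", 0.044, 904, 0.9930),   # Table 4
    Candidate("HYPO_A", 0.300, 990, 0.99),     # invented: fails permutation test
    Candidate("HYPO_B", 0.010, 700, 0.99),     # invented: fails association test
    Candidate("HYPO_C", 0.010, 950, 0.50),     # invented: fails function test
]

perm, assoc, func = screen(candidates)
print(len(perm), len(assoc), len(func))
print([c.symbol for c in func])
```

Calling `screen` with `mfs_cut=0.98` would mimic the stricter function-test threshold used in the RWR-based method.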

Table 1

Number of Candidate TSGs in Different Stages of LHD-Based and RWR-Based Methods

Method	Network Diffusion Algorithm	Permutation Test	Association Test	Function Test
LHD-based method	2,874	443	85	41
RWR-based method	5,889	1,364	980	140


Putative Genes Yielded by the RWR-Based Method

Similar to the LHD-based method, the RWR-based method first applied the RWR algorithm to the PPI network G using validated TSGs as seed nodes, so that each node (gene) received a probability. By setting 10⁻⁵ as the threshold of probability, we obtained 5,889 RWR genes. These genes were then filtered by a permutation test (with 1,000 randomly produced sets), an association test, and a function test, yielding a p value, an MIS, and an MFS for each RWR gene, which can be found in Table S3. By setting 0.05 as the threshold of p value, 900 as the threshold of MIS, and 0.98 as the threshold of MFS, the number of RWR genes gradually decreased (see Table 1 for the number of RWR genes passing each stage in the method). The 140 genes that remained were termed the putative genes of this method and are provided in Table S4.
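To make the diffusion step concrete, the following is a minimal sketch of a standard random walk with restart on a small weighted graph. It is an illustration under stated assumptions, not the authors' code: the update rule p ← r·p₀ + (1 − r)·W·p with a column-normalized weight matrix W is the common textbook form, the restart probability of 0.8 and the selection threshold of 0.01 are illustrative values chosen for a five-node toy graph, and the node names and edge weights are invented.

```python
def random_walk_with_restart(weights, seeds, restart=0.8, tol=1e-6, max_iter=1000):
    """Return the steady-state visiting probability of every node.

    weights: dict node -> dict neighbor -> edge weight (symmetric; every node is a key).
    seeds:   set of seed nodes; the walker restarts uniformly among them.
    restart: probability of jumping back to a seed at each step (assumed value).
    """
    nodes = list(weights)
    strength = {u: sum(weights[u].values()) for u in nodes}
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: restart * p0[u] for u in nodes}
        for u in nodes:
            for v, w in weights[u].items():
                # The walker at u moves to v in proportion to the edge weight.
                nxt[v] += (1.0 - restart) * p[u] * w / strength[u]
        change = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Invented toy network: two seed TSGs and three other proteins.
toy = {
    "TSG1": {"A": 900, "B": 400},
    "TSG2": {"B": 950},
    "A": {"TSG1": 900, "C": 200},
    "B": {"TSG1": 400, "TSG2": 950, "C": 300},
    "C": {"A": 200, "B": 300},
}
seeds = {"TSG1", "TSG2"}
prob = random_walk_with_restart(toy, seeds)

threshold = 0.01  # illustrative only; the study used 10^-5 on the full network
rwr_genes = sorted(n for n in toy if n not in seeds and prob[n] >= threshold)
print(rwr_genes)
```

Nodes directly linked to the seeds by heavy edges receive most of the non-seed probability, while a node reachable only through intermediates falls below the cutoff.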

Comparison of Putative Genes Yielded by the Different Methods

The LHD-based and RWR-based methods yielded 41 and 140 putative genes, respectively. For convenience, we denoted the two sets consisting of these genes as G_LHD and G_RWR, respectively. Taking the union of these two gene sets gave 169 putative genes, of which 12 were identified by both methods, 29 only by the LHD-based method, and 128 only by the RWR-based method. The distribution of the 169 putative genes is illustrated in Figure 2A. To quantify the difference between G_LHD and G_RWR, we calculated the Jaccard coefficient of the two sets, defined as the proportion of members in their intersection among those in their union, i.e., $J(G_{LHD}, G_{RWR}) = |G_{LHD} \cap G_{RWR}| / |G_{LHD} \cup G_{RWR}| = 12/169$, resulting in 7.101%. If these two methods can be proven to be effective, which is addressed in the Discussion, the putative genes obtained by one method can be essential supplements for those obtained by the other. The simultaneous usage of these two methods can provide more potential TSGs.
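The Jaccard coefficient described above is straightforward to compute from two sets. The sketch below uses placeholder gene labels (the strings are invented; only the set sizes 12, 29, and 128 come from the text) and a hypothetical helper named `jaccard`.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder labels; only the group sizes are taken from the study.
common = {f"common_{i}" for i in range(12)}
lhd_only = {f"lhd_{i}" for i in range(29)}
rwr_only = {f"rwr_{i}" for i in range(128)}

g_lhd = common | lhd_only
g_rwr = common | rwr_only

print(len(g_lhd), len(g_rwr), len(g_lhd | g_rwr))  # 41 140 169
print(f"{100 * jaccard(g_lhd, g_rwr):.3f}%")       # 7.101%
```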

Figure 2

Venn Diagrams to Illustrate Putative Gene Sets Yielded by the Different Methods

(A) A Venn diagram to illustrate the distribution of 169 putative genes that were identified by either the LHD-based method or the RWR-based method. The red and blue circles represent the gene sets consisting of the putative genes yielded by the LHD-based method and the RWR-based method, respectively. 12 genes were identified by both of the two methods. (B) A Venn diagram to illustrate three putative gene sets yielded by three methods. The purple circle represents the gene set yielded by the LHD-based method, the yellow circle represents the gene set yielded by the RWR-based method, and the green circle represents the gene set reported in a previous study (SP-based method).

To further assess the reliability of the 169 putative genes, the subnetwork consisting of the linkages between them and validated TSGs was extracted from the PPI network G and is shown in Figure 3A. The associations between putative genes and validated genes are quite strong. In addition, we extracted the most important linkages (those with the highest confidence) from this subnetwork to construct another subnetwork, shown in Figure 3B, in which several putative genes are strongly associated with several TSGs, implying that they are likely to be novel TSGs. In the Discussion, extensive analyses of several important putative genes are given.

Figure 3

The Subnetwork of PPI Network Containing the Linkages between Putative Genes and Validated TSGs

Pink nodes represent validated TSGs, while green, blue, and red nodes represent putative genes yielded by the LHD-based method, the RWR-based method, and both methods, respectively. The sizes of nodes in green, blue, and red represent their degrees. (A) The subnetwork containing all linkages between putative genes and validated TSGs. Edges in black, blue, green, and red represent PPIs with low, medium, high, and highest confidence, respectively. (B) The subnetwork only containing linkages between putative genes and validated TSGs with highest confidence.

In Chen et al.’s study, 205 putative TSGs were reported, which comprised gene set G_SP. Among these genes, one gene (EPHA7) was in both G_LHD and G_RWR, one gene was only in G_LHD, and 11 genes were only in G_RWR (see Figure 2B). As shown in Figure 2B, 156 genes identified by either the LHD-based or RWR-based method were not in G_SP, that is, they were not reported in Chen et al.’s study. If the utility of our two methods can be established, which is elaborated upon in the Discussion, the putative genes reported in this study can improve the comprehension of TSGs. To quantify the difference between putative genes in our study and Chen et al.’s study, we further calculated the Jaccard coefficients of G_SP and G_LHD, G_SP and G_RWR, and G_SP and G_LHD ∪ G_RWR; they were 0.820%, 3.604%, and 3.601%, respectively. All of these indicated that our reported genes can be essential supplements for those in Chen et al.’s study. In addition, Table 2 lists the putative genes that belong to at least two of the sets G_SP, G_LHD, and G_RWR, i.e., genes identified by at least two different methods, implying that they are likely to be novel TSGs.
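The three reported coefficients can be cross-checked from the Venn-diagram counts alone. In the sketch below, the overlap counts (1, 11, 1, 11) come from the text and Table 2, while the three single-method counts (28, 117, 192) are derived here by subtraction from the set sizes 41, 140, and 205 and are not stated in the study. The names `regions` and `jaccard_from_regions` are hypothetical.

```python
# Number of genes in each exclusive region of the three-set Venn diagram.
regions = {
    frozenset({"LHD", "RWR", "SP"}): 1,   # EPHA7
    frozenset({"LHD", "RWR"}): 11,
    frozenset({"LHD", "SP"}): 1,
    frozenset({"RWR", "SP"}): 11,
    frozenset({"LHD"}): 28,               # derived: 41 - 1 - 11 - 1
    frozenset({"RWR"}): 117,              # derived: 140 - 1 - 11 - 11
    frozenset({"SP"}): 192,               # derived: 205 - 1 - 1 - 11
}


def jaccard_from_regions(left, right):
    """Jaccard coefficient of two groups, each a union of the named sets."""
    inter = sum(n for r, n in regions.items() if r & left and r & right)
    union = sum(n for r, n in regions.items() if r & left or r & right)
    return inter / union


sp = frozenset({"SP"})
for label, other in [
    ("SP vs LHD", frozenset({"LHD"})),
    ("SP vs RWR", frozenset({"RWR"})),
    ("SP vs LHD+RWR", frozenset({"LHD", "RWR"})),
]:
    print(label, f"{100 * jaccard_from_regions(sp, other):.3f}%")

# Genes found by at least two methods, which should match the size of Table 2.
at_least_two = sum(n for r, n in regions.items() if len(r) >= 2)
print(at_least_two)
```

The same counts give 13 genes shared with Chen et al.’s study, so 169 − 13 = 156 genes are new, in agreement with the text.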

Table 2

24 Putative Genes Identified by at Least Two Methods

Ensembl ID	Gene Symbol	LHD-Based Method	RWR-Based Method	SP-Based Method^a
ENSP00000020945	SNAI2	×^b	√^c	√
ENSP00000222330	GSK3A	×	√	√
ENSP00000228682	GLI1	×	√	√
ENSP00000233948	WNT6	×	√	√
ENSP00000253055	MAP3K10	√	√	×
ENSP00000254480	SMARCC1	×	√	√
ENSP00000262158	SMAD7	×	√	√
ENSP00000287934	FZD1	×	√	√
ENSP00000293549	WNT1	×	√	√
ENSP00000308461	RND1	√	√	×
ENSP00000341032	WNT7B	×	√	√
ENSP00000343819	OTX2	√	√	×
ENSP00000347942	RET	√	√	×
ENSP00000354586	GLI2	√	√	×
ENSP00000358309	EPHA7	√	√	√
ENSP00000361892	STK4	×	√	√
ENSP00000362139	EPHA10	√	√	×
ENSP00000363115	FGR	√	√	×
ENSP00000364895	ZBTB17	√	×	√
ENSP00000365012	HCK	√	√	×
ENSP00000368686	E2F4	×	√	√
ENSP00000370912	TEC	√	√	×
ENSP00000381097	EPHB1	√	√	×
ENSP00000390500	STK3	√	√	×

^a The computational method proposed in Chen et al.’s study.

^b ×, the putative gene cannot be identified by the method.

^c √, the putative gene can be identified by the method.

Discussion

As mentioned in the previous section, one gene (EPHA7) was identified by all three computational methods, including the one in Chen et al.’s study, implying its biological function as a potential TSG. According to recent publications, EPHA7 has been confirmed to act as a tumor suppressor in multiple tumor subtypes, including follicular lymphoma, small cell lung cancer, gastric cancer, renal cell carcinoma, prostate cancer, and osteosarcoma, supporting its anti-tumorigenesis contributions. Take follicular lymphoma as an example. In 2011, a study on follicular lymphoma confirmed that the knockdown of EPHA7 may directly induce tumorigenesis in a mouse model, implying a tumor suppressor role for this gene. We identified two potential TSG sets via the LHD-based and RWR-based methods. As shown in Figure 2A, we clustered all putative genes into three groups: (1) putative genes identified by both methods, (2) those identified only by the LHD-based method, and (3) those identified only by the RWR-based method. According to recent publications, some genes in these three groups can be confirmed to contribute to tumorigenesis as functional TSGs, suggesting that the two proposed methods are useful for predicting novel TSGs, whether used alone or combined. The detailed analysis is given below.

Putative Genes Identified by Both of the Two Methods

Twelve genes were predicted by the two methods to be putative TSGs. Here, we analyzed five of them, listed in Table 3.

Table 3

Important Putative Genes Yielded by Both LHD-Based and RWR-Based Methods

Ensembl ID	Gene Symbol	LHD Heat	LHD p Value	RWR Probability	RWR p Value	MIS	MFS
ENSP00000253055	MAP3K10	1.8567E−04	0.036	2.7819E−05	0.029	925	0.9954
ENSP00000308461	RND1	1.2137E−04	0.040	5.2943E−05	<0.001	982	0.9942
ENSP00000343819	OTX2	4.1770E−04	0.028	3.7524E−05	0.024	984	0.9855
ENSP00000347942	RET	1.3647E−04	0.048	1.1216E−04	<0.001	984	0.9867
ENSP00000354586	GLI2	2.1655E−04	0.032	5.4405E−05	<0.001	999	0.9847

MIS, maximum interaction score; MFS, maximum function score.

MAP3K10 has been predicted to be a TSG. Based on recent publications, MAP3K10 has been widely reported to participate, as a homodimer, in auto-phosphorylation and subsequent activation via the JUN N-terminal pathway [28, 29]. A recent study on integrative genomic analyses in embryonic stem cell (ESC) and NT2 cell lines (normal cells) revealed that MAP3K10 induced DYRK2 phosphorylation combined with SUFU inhibition. This may directly affect the normal biological function of the stem cell-signaling network interacting with TP53, a well-known TSG involved in downstream proliferation and cell adhesion processes. Although no direct evidence confirms that this gene acts as a tumor suppressor during tumorigenesis, the evidence in ESC lines suggests that it may cooperate with TP53, implying potential tumor suppressor functions. Therefore, it is reasonable to regard MAP3K10 as a potential tumor suppressor. Apart from MAP3K10, RND1 has also been predicted to be a potential TSG. RND1 lacks intrinsic GTPase activity and controls the rearrangement of the actin cytoskeleton and Rac-dependent neuritic process formation independent of GDP binding in normal cells [31, 32]. Furthermore, Rnd subfamily members, including RND1, RND2, and RHOH, may participate in p53-mediated regulatory pathways that resist certain endogenous and exogenous stimuli, such as tumorigenesis, in the human osteosarcoma cell lines U2OS and SAOS2, implying that RND1 may contribute to the identification and elimination of tumor cells. This gene may also directly control the balance between cell survival and malignant transformation, suggesting that RND1 itself may be enough to act as a functional tumor suppressor. 
OTX2, another predicted TSG, participates in the early specification of the neuroectoderm during brain development, as shown in a mouse model. As for its potential anti-tumor functions, recent publications confirmed that OTX2 plays a role during tumorigenesis and acts as a core factor in the c-myc, CRX, and phosphorylated RB pathways. Although OTX2 is traditionally regarded as an oncogene that promotes the malignant transformation of tumor cells, a recent study on retinal detachment confirmed that OTX2 may contribute to the regulation of the p53-signaling pathway and can also act as a potential TSG in multiple cell lines and animal models, such as the childhood brain tumor medulloblastoma cell line D425MED and patient-derived mouse xenograft models. RET and GLI2 are also potential TSGs that were identified by both methods. These two genes have been identified as effective tumor suppressors in common cell lines such as the mouse NIH 3T3 cell line. RET, as a member of the cadherin superfamily, participates in the regulation of cell proliferation, migration, and differentiation [38, 39]. As for its contribution to tumor suppression, at least in colorectal cancer and medullary thyroid carcinoma, RET acts as a TSG in the respective cell lines and mouse models. GLI2 is a development-associated gene contributing to the formation of the lung, trachea, and esophagus, which has been further validated in gli3-null and gli3Delta699 mouse models.

Putative Genes Identified Only by the LHD-Based Method

29 genes were identified only by the LHD-based method. We analyzed seven of them, listed in Table 4, to validate the effectiveness of the LHD-based method.

Table 4

Important Putative Genes Yielded Only by the LHD-Based Method

Ensembl ID	Gene Symbol	LHD Heat	LHD p Value	RWR Probability	RWR p Value	MIS	MFS
ENSP00000055077	RFC2	3.1496E−04	0.036	1.8028E−05	0.316	999	0.9146
ENSP00000218006	GUCY2F	2.1541E−04	0.044	1.8699E−05	0.544	904	0.9930
ENSP00000238558	GSC	9.4478E−04	0.030	1.1429E−05	0.170	977	0.9595
ENSP00000241261	TNFSF10	4.9214E−04	0.010	3.9617E−05	0.005	999	0.9460
ENSP00000261731	LHX5	6.8836E−04	0.018	1.4245E−05	0.082	914	0.9436
ENSP00000261980	VSX2	1.0335E−03	0.016	1.6524E−05	0.055	910	0.9491
ENSP00000266058	SLIT1	3.3237E−04	0.012	4.0380E−05	<0.001	959	0.9608

MIS, maximum interaction score; MFS, maximum function score.

RFC2 was predicted to be a potential TSG in this study. As a member of the activator 1 small subunits family, this gene contributes to the assembly of PCNA and polymerase delta on the DNA template, promoting cell survival. Although various data from cell lines or mouse models confirm a relationship between RFC2 and tumorigenesis, only some reports support a potential tumor suppression function for this gene [43, 44]. A study on renal cell carcinoma confirmed that RFC2 undergoes a specific rearrangement during the initiation and progression of this tumor and partially loses its normal biological function, indicating that RFC2 may be a potential TSG, at least in cell lines. GUCY2F was also predicted to be a potential TSG. Encoding a guanylyl cyclase predominantly expressed in the retina, this gene contributes to the re-synthesis of cGMP required for recovery of the dark state after phototransduction. GUCY2F may also participate, as a negative regulator, in the development of various tumor subtypes, including colorectal cancer in Japanese patients and myeloma, as validated by mouse models and related cell lines. GSC and TNFSF10 were also predicted to be potential TSGs. GSC, a homeodomain-encoding gene, is expressed during and involved in gastrulation. In concert with another functional gene, NKX3.1, GSC contributes to resistance against malignant transformation and cell proliferation. Similarly, TNFSF10, a member of the tumor necrosis factor superfamily, mainly contributes to TNF-related apoptosis in the spleen, lung, and prostate [49, 50]. As a p53 target gene, TNFSF10 contributes to the regulation of p53-dependent cell death and, during tumorigenesis, acts as a regulator that resists abnormal cell proliferation. Therefore, TNFSF10 may be a potential TSG. 
LHX5 was also predicted as a potential TSG. Involved in the control of differentiation and development, this gene contributes to the regulation of neuronal differentiation and migration during development of the CNS. As for its contribution to tumorigenesis, this gene may indicate a better prognosis in patients with breast cancer by inhibiting processes that keep malignant tumor cells undifferentiated, implying a potential contribution as a tumor suppressor. VSX2 and SLIT1 were also predicted as potential TSGs. VSX2, a regulator of specification and morphogenesis, controls cell fate specification and differentiation in the developing retina. In retinal tissues, this gene may inhibit the malignant transformation of normal cells. SLIT1 contributes to the histogenesis of the CNS under physiological or pathological conditions.

Putative Genes Identified Only by the RWR-Based Method

A total of 128 putative genes were identified only by the RWR-based method. Five of them, which are listed in Table 5, are analyzed here.

Table 5

Important Putative Genes Yielded Only by the RWR-Based Method

Ensembl ID	Gene Symbol	RWR Probability	RWR p Value	LHD Heat	LHD p Value	MIS	MFS
ENSP00000020945	SNAI2	5.3473E−05	0.002	3.9529E−05	–	998	0.9825
ENSP00000168712	FGF4	3.8685E−05	<0.001	4.1323E−05	–	936	0.9836
ENSP00000222330	GSK3A	6.2092E−05	0.010	1.0955E−04	0.076	999	0.9921
ENSP00000222462	WNT16	3.6751E−05	<0.001	4.0873E−05	–	919	0.9854
ENSP00000222598	DLX5	2.9495E−05	0.028	4.7228E−05	–	936	0.9853

MIS, maximum interaction score; MFS, maximum function score; –, the corresponding gene received a heat lower than the threshold of heat in the LHD-based method, i.e., it was not selected as LHD genes. Thus, the p value was not available for this gene.

SNAI2 has been confirmed to be a transcriptional repressor that modulates both activator-dependent and basal transcription and is involved in neural crest cell generation and migration. This gene is silenced in prostate cancer and serves as a direct transcriptional regulator of breast tumor cell metastasis targeting BIM (BCL2-Like 11). Although direct evidence is lacking, this gene may be a potential TSG. FGF4 is a member of the fibroblast growth factor (FGF) family, participating in cell growth, morphogenesis, and invasion. Interacting with the tumor microenvironment, the protein encoded by FGF4 promotes the proliferation of tumor cells, so the potential anti-tumor function of this gene remains to be revealed. GSK3A is a functional regulator of glycogen synthesis in skeletal muscle. As for its tumor suppressor functions, GSK3A participates in the APC/beta-catenin/Tcf pathway and modulates drug resistance and chemotherapy-induced necroptosis. Considering that the APC/beta-catenin/Tcf-signaling pathway is involved in various tumor suppression processes, we regard GSK3A as a potential TSG. WNT16, another putative gene, participates in the regulation of p53 activity and the phosphoinositide 3-kinase/AKT pathway, supporting its potential contribution to tumor suppression. Similarly, DLX5 is a member of a homeobox transcription factor gene family, expressed in brain and skeleton. 
Interacting with TP63, this gene contributes to the negative regulation of tumorigenesis, supporting its potential tumor suppression function [66, 67]. As analyzed above, several putative genes inferred by either the LHD-based or the RWR-based method can be confirmed to be novel TSGs by recent publications, indicating the usefulness of these two methods. We believe that many of the remaining putative genes, which were not analyzed in this study, can also be validated, and we leave them to readers for further study.

Materials and Methods

Validated TSGs

The validated human TSGs were retrieved from a previous study, in which 716 human TSGs were collected from the TSGene database (https://bioinfo.uth.edu/TSGene/). In this study, these genes were employed to infer novel TSGs with network diffusion algorithms. Because the diffusion algorithms would be executed on a PPI network, the 716 genes were mapped to the Ensembl IDs of the proteins they encode, and IDs that did not occur in the PPI network were discarded. Finally, 631 Ensembl IDs were obtained, which are provided in Table S5.

PPI Network

Interacting proteins tend to share similar functions.21, 69, 70, 71, 72, 73, 74, 75, 76 The 631 proteins encoded by TSGs share some common functions because they can protect cells from one step on the path to cancer. Proteins that interact with them may also have these types of functions but with lower probabilities. The same reasoning extends to proteins at longer distances from those encoded by TSGs. Thus, inferring novel TSGs using PPIs based on validated ones is feasible.

In this study, we employed the PPIs reported in STRING (https://string-db.org/, version 10.0), a well-known public database collecting known and predicted PPIs from various sources, such as genomic context predictions, high-throughput lab experiments, (conserved) co-expression, automated text mining, and previous knowledge in databases. To access human PPIs, a file named “9606.protein.links.v10.txt.gz” was downloaded from the download page of the STRING website. This file contained 4,274,001 interactions, each of which consisted of two proteins, represented by Ensembl IDs, and one score ranging between 150 and 999. A high score indicated that the corresponding interaction had a high probability of occurring. For later formulation, we denoted the score of the interaction between proteins $p_1$ and $p_2$ as $S(p_1, p_2)$. In total, 19,247 Ensembl IDs were involved in the above interactions. Based on the aforementioned PPIs, a large PPI network can be constructed. The 19,247 Ensembl IDs were defined as nodes in the network, and two nodes were connected by an edge if and only if they comprised a PPI. In addition, the interaction score was assigned to the edge as its weight. As a result, we constructed a PPI network with 19,247 nodes and 4,274,001 edges. For formulation, we denoted this PPI network as G.
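The construction of the weighted network can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes that each line of the links file holds two protein IDs and one integer score separated by whitespace, with a header row starting with `protein1`, and the function names `parse_links` and `load_string_network` are hypothetical.

```python
import gzip
from collections import defaultdict


def parse_links(lines):
    """Build a symmetric weighted adjacency map from STRING-style link lines."""
    graph = defaultdict(dict)
    for line in lines:
        fields = line.split()
        # Skip blank lines and the (assumed) header row.
        if len(fields) != 3 or fields[0] == "protein1":
            continue
        p1, p2, score = fields[0], fields[1], int(fields[2])
        # Treat the interaction as undirected: the same weight in both directions.
        graph[p1][p2] = score
        graph[p2][p1] = score
    return dict(graph)


def load_string_network(path):
    """Read a gzip-compressed links file such as 9606.protein.links.v10.txt.gz."""
    with gzip.open(path, "rt") as handle:
        return parse_links(handle)


# Toy input with made-up protein IDs, used only to show the data structure.
sample = [
    "protein1 protein2 combined_score",
    "9606.ENSP0001 9606.ENSP0002 900",
    "9606.ENSP0002 9606.ENSP0003 150",
]
network = parse_links(sample)
n_nodes = len(network)
n_edges = sum(len(neighbors) for neighbors in network.values()) // 2
```

Storing the network as a dictionary of dictionaries keeps the lookup of an interaction score between two proteins cheap, which is convenient for the association test described later.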

Searching Candidate Genes Using the LHD Algorithm

Heat diffusion is a type of network diffusion algorithm that transmits heat from seed nodes to other nodes in a network following certain rules. The heat assigned to a node indicates the strength of its otherwise obscure associations with the seed nodes. One study has suggested that heat diffusion performs well in identifying disease genes. In this study, one type of heat diffusion algorithm, the LHD algorithm, was employed as a basic searching algorithm to infer novel TSGs. For the PPI network G constructed in the previous section, let A be its adjacency matrix, from which a column-wise normalized matrix $A'$ was built as follows:

$$A'(i, j) = \frac{A(i, j)}{\sum_{k} A(k, j)} \quad (1)$$

From the 631 Ensembl IDs of TSGs, an initial heat distribution, $H_0$, was formulated as a column vector containing 19,247 components. Each component indicated the heat on one node in the PPI network G. The components of the Ensembl IDs of TSGs were set to 1/631, and the others were set to 0. Then, this vector was updated over time, where $H_t$ represented the heat distribution at time t and $\lambda_i$ was the i-th eigenvalue of the diffusion matrix. When the later heat distribution vector and the former one were quite similar, the updating procedure stopped. As a result, each node was assigned a heat value. Clearly, a node with a high heat value was more important than one with a low heat value. By setting a threshold on the heat value, we extracted nodes whose heat values were larger than the threshold and mapped them to the corresponding genes. The validated TSGs described in Validated TSGs were excluded. For convenience, genes yielded by the LHD algorithm were called LHD genes. In this study, we downloaded the program of the LHD algorithm from https://cran.r-project.org/web/packages/diffusr/index.html and executed it with default parameters on the PPI network G, obtaining several LHD genes.
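To make the idea concrete, the following sketch diffuses heat on a toy three-node network. It rests on assumptions that the text does not confirm: it uses one common formulation, $H_t = \exp(-tL)H_0$ with $L = I - A'$, evaluates it with a truncated Taylor series instead of an eigendecomposition, and takes t = 0.5 as an illustrative value. It is not the diffusr implementation.

```python
def column_normalize(A):
    """Divide every entry by its column sum, as in the column-wise normalization."""
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return [
        [A[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def heat_diffusion(A, h0, t=0.5, terms=30):
    """Approximate exp(-t * L) @ h0 with L = I - A' (assumed Laplacian form)."""
    n = len(A)
    W = column_normalize(A)

    def apply_laplacian(v):
        # L v = v - W v
        return [v[i] - sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

    result = list(h0)
    term = list(h0)
    for k in range(1, terms):
        # k-th Taylor term: (-t L)^k h0 / k!
        term = [-t * x / k for x in apply_laplacian(term)]
        result = [r + x for r, x in zip(result, term)]
    return result


# Toy path network 0 - 1 - 2 with interaction scores as edge weights.
A = [
    [0, 900, 0],
    [900, 0, 400],
    [0, 400, 0],
]
h0 = [1.0, 0.0, 0.0]  # all heat starts on the single seed node 0
heat = heat_diffusion(A, h0)
```

Because every column of $A'$ sums to 1, the total heat is conserved, and nodes closer to the seed end up with more heat than distant ones.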

Searching Candidate Genes Using the RWR Algorithm

The RWR algorithm is a classic ranking algorithm. It is also a type of network diffusion algorithm that has been successfully applied to identify novel genes, proteins, or chemicals in different networks.71, 72, 80, 81 The RWR algorithm simulates a walker starting from one seed node or a set of seed nodes and walking randomly on the network. Similar to the LHD algorithm, an initial probability distribution vector, $P_0$, was constructed in the same way as the initial heat distribution in the LHD algorithm and was repeatedly updated. Let $P_{t+1}$ denote the probability vector after executing the (t + 1)-th step, which can be updated from $P_t$ as follows:

$$P_{t+1} = (1 - r)A'P_t + rP_0$$

where $A'$ was the column-wise normalized adjacency matrix defined in Equation 1, and r was set to 0.8 as suggested in some previous studies.71, 72, 80 When $P_{t+1}$ and $P_t$ were quite similar, measured by the L1 norm of their difference being less than $10^{-6}$, the updating procedure stopped, and $P_{t+1}$ was output as the result of the RWR algorithm. Accordingly, each node was assigned a probability, indicating its association with the seed nodes. Nodes with high probabilities were selected, and their corresponding genes were extracted as candidate TSGs. As in the LHD algorithm, a threshold was set on the probability. The obtained genes were termed RWR genes. Here we used the program developed by Li and Patra to implement the RWR algorithm.
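The iteration can be sketched in a few lines. This is a minimal illustration on a toy network, assuming the standard update $P_{t+1} = (1 - r)A'P_t + rP_0$ with r = 0.8 and the L1 stopping criterion of $10^{-6}$; it is not the program of Li and Patra, and the function names are hypothetical.

```python
def column_normalize(A):
    """Divide every entry by its column sum to obtain A'."""
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return [
        [A[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(A, p0, r=0.8, tol=1e-6):
    """Iterate P_{t+1} = (1 - r) A' P_t + r P_0 until the L1 change is below tol."""
    n = len(p0)
    W = column_normalize(A)
    p = list(p0)
    while True:
        nxt = [
            (1 - r) * sum(W[i][j] * p[j] for j in range(n)) + r * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt


# Toy path network 0 - 1 - 2 with interaction scores as edge weights.
A = [
    [0, 900, 0],
    [900, 0, 400],
    [0, 400, 0],
]
p0 = [1.0, 0.0, 0.0]  # the walker always restarts at the seed node 0
probs = random_walk_with_restart(A, p0)
```

With r = 0.8 the walker returns to the seeds in most steps, so the probability mass stays concentrated near them and the iteration converges in a handful of steps.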

Screening Tests

Based on validated TSGs, the LHD and RWR algorithms can separately produce candidate TSGs, namely, LHD genes and RWR genes, respectively. However, the utility of these two algorithms relied heavily on the structure of the PPI network G. Some nodes (genes) may occupy special positions in G that make them more likely to be selected by the LHD or RWR algorithm, even though they have little or even no association with the biological processes that protect cells from malignant alterations; that is, they are false-positive genes. Thus, controlling this type of candidate TSG is necessary. In addition, to increase the reliability of the obtained genes, we should select the most important candidate TSGs. Therefore, three screening tests were proposed as follows: (1) permutation test, (2) association test, and (3) function test. The first test was to control false-positive genes, and the other two tests helped us to extract important genes. Their descriptions are given below.

Permutation Test

The idea of the permutation test is to evaluate the heat or probability values that candidate TSGs receive from the LHD or RWR algorithm under several randomly produced gene sets and to compare them with the actual ones. In detail, m Ensembl ID sets of size 631 (m = 1,000 in this study) were randomly produced, denoted as $R_1, R_2, \ldots, R_m$. For each set $R_i$, the LHD or RWR algorithm was executed on the PPI network G with the members of this set as seed nodes. Then, each LHD gene or RWR gene (g) received a heat or a probability value. After all 1,000 sets had been tested, each g had one actual heat or probability and 1,000 heat values or probabilities from the random sets, based on which a measurement, the p value, can be calculated as follows:

$$p\text{-value}(g) = \frac{\Omega}{1000}$$

where Ω represented the number of randomly produced sets on which g received a higher heat value or probability than its actual one. Clearly, a candidate TSG with a high p value was less likely to be an actual TSG because it had strong associations with several randomly produced sets. Because 0.05 is the conventional cutoff for the significance level of a test, we selected LHD or RWR genes with p values less than 0.05.
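The p value computation can be sketched as follows. This is a minimal illustration under assumptions: the helper names are hypothetical, the random sets are drawn uniformly without replacement, and the scores in the example are made-up numbers.

```python
import random


def random_seed_sets(all_ids, size, m, seed=0):
    """Draw m random seed sets of the given size from the network's node IDs."""
    rng = random.Random(seed)
    return [rng.sample(all_ids, size) for _ in range(m)]


def permutation_p_value(actual, permuted):
    """Fraction of random sets on which the gene scored higher than its actual score."""
    omega = sum(1 for score in permuted if score > actual)
    return omega / len(permuted)


# Toy example: 3 random sets of 5 seeds drawn from 100 node IDs.
sets = random_seed_sets(list(range(100)), size=5, m=3)

# Made-up scores for one candidate gene: one actual value and four permuted values.
p_value = permutation_p_value(0.5, [0.1, 0.6, 0.7, 0.2])
```

In the study, each random set would contain 631 IDs, m would be 1,000, and the permuted scores would come from rerunning the LHD or RWR algorithm with each random set as seeds.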

Association Test

Among the candidate TSGs passing the permutation test, those with strong associations with validated TSGs should be selected. As mentioned above in PPI Network, interacting proteins tend to share similar functions.21, 69, 70, 71, 72, 73, 74, 75, 76 Thus, for each candidate gene g, we considered its interactions with validated TSGs and extracted the maximum interaction score (MIS) as a measurement, which was formulated as follows:

$$MIS(g) = \max\{S(g, g') : g' \text{ is a validated TSG}\}$$

where $S(g, g')$ is the score of the interaction between g and g' in the PPI network. Clearly, candidate genes assigned high MISs were more likely to be actual TSGs. By setting a threshold, we can select important candidate TSGs.
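A minimal sketch of the MIS calculation is given below. It assumes that the network is stored as a dictionary of dictionaries of interaction scores and that a missing interaction counts as a score of 0; both choices, the function name, and the gene labels are illustrative assumptions.

```python
def max_interaction_score(gene, validated, graph):
    """Return the highest interaction score between a gene and any validated TSG."""
    neighbors = graph.get(gene, {})
    # A pair without a reported interaction is treated as score 0 (assumption).
    return max((neighbors.get(tsg, 0) for tsg in validated), default=0)


# Toy network with made-up gene labels and scores.
graph = {
    "g1": {"tsgA": 700, "tsgB": 950, "x": 999},
    "g2": {"x": 400},
}
validated = ["tsgA", "tsgB"]

mis_g1 = max_interaction_score("g1", validated, graph)
mis_g2 = max_interaction_score("g2", validated, graph)
```

Note that the high-scoring interaction between g1 and x does not count, because x is not a validated TSG.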

Function Test

The validated TSGs must be highly related to some biological processes. To date, gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways are two popular types of information for describing biological processes. By examining the relationships between genes (validated or candidate TSGs) and GO terms or KEGG pathways, we can further extract important candidate TSGs whose relationships with GO terms and KEGG pathways resemble those of at least one validated TSG. To quantify the relationship between one gene and GO terms or KEGG pathways, the enrichment theory83, 84, 85, 86 was employed in this study. For one gene g and one GO term or KEGG pathway T, their relationship defined by enrichment theory was calculated by the following:

$$ES(g, T) = -\log_{10}\left(\sum_{k=m}^{n} \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}\right)$$

where N is the total number of proteins in humans, M is the number of proteins annotated to T, n is the number of proteins in H, the set consisting of g and its direct neighbors in the PPI network reported in STRING, and m is the number of proteins that are in H and annotated to T. Our in-house program using the R function phyper was developed to calculate this value. Accordingly, each gene can be encoded into a vector by collecting all outcomes of this function. For gene g, the obtained vector was formulated as $ES(g)$. The relationship between two genes g and g' on their GO terms and KEGG pathways can then be measured by the following:

$$\Gamma(g, g') = \frac{ES(g) \cdot ES(g')}{\lVert ES(g) \rVert \cdot \lVert ES(g') \rVert}$$

Similar to MIS, another measurement, the maximum function score (MFS), was calculated for each gene g, which was defined as follows:

$$MFS(g) = \max\{\Gamma(g, g') : g' \text{ is a validated TSG}\}$$

Clearly, we should select candidate TSGs with high MFSs. By setting a threshold on MFS, important candidate TSGs can be extracted.
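The function test can be sketched as follows. This is an illustration under assumptions, not the in-house R program: the enrichment score is taken as the negative base-10 logarithm of the upper-tail hypergeometric probability, the similarity of two enrichment vectors is taken as their cosine, and all function names and numbers are hypothetical.

```python
import math


def enrichment_score(N, M, n, m):
    """-log10 of P(X >= m) for a hypergeometric draw of n from N with M annotated."""
    total = math.comb(N, n)
    tail = sum(
        math.comb(M, k) * math.comb(N - M, n - k)
        for k in range(m, min(n, M) + 1)
    )
    # Taking logs of the exact integers avoids floating-point underflow.
    return math.log10(total) - math.log10(tail)


def cosine(u, v):
    """Cosine of the angle between two enrichment vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_function_score(vector, validated_vectors):
    """Highest cosine similarity between a candidate and any validated TSG."""
    return max(cosine(vector, v) for v in validated_vectors)


# Toy numbers: 10 proteins, 4 annotated to T, 3 in H, 2 of them annotated.
score = enrichment_score(N=10, M=4, n=3, m=2)

# Toy enrichment vectors over two terms.
mfs = max_function_score([1.0, 1.0], [[1.0, 0.0], [2.0, 2.0]])
```

In the toy case the tail probability is (36 + 4) / 120 = 1/3, so the score equals log10(3), and the candidate vector is parallel to the second validated vector, giving an MFS of 1.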

Author Contributions

T.H. and Y.-D.C. designed the research. L.C. and Y.-D.C. performed the experiments. Y.-H.Z., Z.Z., and T.H. analyzed the data. L.C., Y.-H.Z., and Z.Z. wrote the paper.

Conflicts of Interest

The authors declare no competing financial interests.

84 in total

1. Absence of tyrosine kinase mutations in Japanese colorectal cancer patients.

Authors: R-X Shao; N Kato; L-J Lin; R Muroyama; M Moriyama; T Ikenoue; H Watabe; M Otsuka; B Guleng; M Ohta; Y Tanaka; S Kondo; N Dharel; J-H Chang; H Yoshida; T Kawabe; M Omata
Journal: Oncogene Date: 2006-10-02 Impact factor: 9.867

2. Characterization of a 3;6 translocation associated with renal cell carcinoma.

Authors: Rebecca E Foster; Mahera Abdulrahman; Mark R Morris; Elena Prigmore; Susan Gribble; Beeling Ng; Dean Gentle; Steven Ready; Phil M T Weston; Michael S Wiesener; Takeshi Kishida; Masahiro Yao; Val Davison; Jose Luis Barbero; Carol Chu; Nigel P Carter; Farida Latif; Eamonn R Maher
Journal: Genes Chromosomes Cancer Date: 2007-04 Impact factor: 5.006

3. p38 Mitogen-activated protein kinase (MAPK) is a key mediator in glucocorticoid-induced apoptosis of lymphoid cells: correlation between p38 MAPK activation and site-specific phosphorylation of the human glucocorticoid receptor at serine 211.

Authors: Aaron L Miller; M Scott Webb; Alicja J Copik; Yongxin Wang; Betty H Johnson; Raj Kumar; E Brad Thompson
Journal: Mol Endocrinol Date: 2005-04-07