
Predicting non-small cell lung cancer-related genes by a new network-based machine learning method.

Yong Cai¹, Qiongya Wu¹, Yun Chen¹, Yu Liu¹, Jiying Wang².

Abstract

Lung cancer is the leading cause of cancer death globally, killing 1.8 million people yearly. Over 85% of lung cancer cases are non-small cell lung cancer (NSCLC). Lung cancer running in families has shown that some genes are linked to lung cancer. Genes associated with NSCLC have been found by next-generation sequencing (NGS) and genome-wide association studies (GWAS). Many papers, however, neglected the complex information about interactions between gene pairs. Along with its high cost, GWAS analysis has an obvious drawback of false-positive results. Based on the above problem, computational techniques are used to offer researchers alternative and complementary low-cost disease-gene association findings. To help find NSCLC-related genes, we proposed a new network-based machine learning method, named deepRW, to predict genes linked to NSCLC. We first constructed a gene interaction network consisting of genes that are related and irrelevant to NSCLC disease and used deep walk and graph convolutional network (GCN) method to learn gene-disease interactions. Finally, deep neural network (DNN) was utilized as the prediction module to decide which genes are related to NSCLC. To evaluate the performance of deepRW, we ran tests with 10-fold cross-validation. The experimental results showed that our method greatly exceeded the existing methods. In addition, the effectiveness of each module in deepRW was demonstrated in comparative experiments.


Keywords: computational techniques; deep neural network; deep walk; graph convolutional network; lung cancer

Year: 2022 PMID: 36203453 PMCID: PMC9530852 DOI: 10.3389/fonc.2022.981154

Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 5.738

1 Introduction

Lung cancer continues to be the primary cause of cancer deaths worldwide, causing 1.8 million fatalities annually (1). The two primary kinds of lung cancer are small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC), and nearly 85% of all lung cancer cases are NSCLC (2). A growing number of researchers have found that lung cancer is highly heritable and is associated with certain genes that increase the risk (3). Genome-wide association studies (GWAS) are a common method for mining disease-related genes. Hung et al. (4) first used GWAS and found a locus in chromosome region 15q25 that is related to lung cancer. Hu et al. (5) reported via GWAS that the 5p15 locus is related to lung cancer, and 6p21 was found by Wang et al. (6). With the development of next-generation sequencing (NGS), whole-exome sequencing (WES), whole-genome sequencing (WGS), and other technologies have been applied to find disease-related genes. Sun et al. (7) applied WES to 73 advanced NSCLC tumor samples and demonstrated that protein tyrosine phosphatase receptor type D (PTPRD) might be both a prognostic and a predictive biomarker of clinical outcomes in non-squamous (ns)-NSCLC patients. Liu et al. (8) found infrequent detrimental mutations in GWAS-nominated sites in dopamine β-hydroxylase (DBH) and coiled-coil domain containing 147 (CCDC147) via WES. With the explosive growth of relevant information and data in recent years, GWAS and other methods have become more and more time-consuming and laborious. Many studies have addressed drug–disease association and other bioinformatics tasks through machine learning and deep learning methods (9–13). Graph neural network methods that can integrate multiple types of knowledge bases are suitable for this task. A previous study (14) used a graph convolutional network (GCN) to capture structural information from a network integrating genes and diseases. GCN (15) is a type of neural network architecture that learns from the nodes and edges of graphs. 
It has been shown that GCN enhances the ability of algorithms to mine information and make decisions in the bioinformatics field, as in Deep-DRM (16). Graph embedding methods are also popular for this task. Xiong et al. (17) built a heterogeneous network that incorporates different types of datasets and obtained the network representation by random walk (RW) to predict gene–disease associations. RW is a common graph embedding approach that has been used to study microRNAs (miRNAs) (18), gene expression (19), and drug repositioning (20). Deep walk (21) is a graph-structured data-mining algorithm that combines RW and word2vec. Zhu et al. (22) integrated graph embedding representation and GCN to learn gene–disease associations. They connected the two methods in series as the encoder to learn features and predicted associations with a decoder. In this paper, we focused on the problem of mining NSCLC-causing genes. We treated it as a binary classification problem and proposed a new network-based method. We integrated two graph embedding methods, deep walk and GCN, to represent the gene interaction network and learn features, and used a deep neural network (DNN) to predict which genes are related to NSCLC.

2 Methods

We proposed a new method named deepRW, based on the gene interaction network, to predict NSCLC-related genes. The structure of our method is shown in Figure 1. First, we built a graph network that represented the interactions between genes. Then, we utilized two graph embedding methods, deep walk and GCN, to learn network information and extract features. Last, we constructed a DNN module to predict disease-related genes.

Figure 1

The structure of deepRW. GCN, graph convolutional network; DNN, deep neural network.

2.1 Construction of the gene network

The network of gene interactions is represented as a graph, expressed as G = (V, E), where V represents the selected genes, both related and irrelevant to NSCLC, and E represents the interactions between genes. Note that isolated genes, which did not interact with any other gene, were eliminated.
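The construction step can be illustrated with a minimal sketch. The function name `build_gene_graph`, the edge-tuple layout, and the toy genes and weights are all hypothetical; the paper does not describe its implementation, so this only shows one plausible way to build G = (V, E) and drop genes without interactions.

```python
from collections import defaultdict


def build_gene_graph(genes, edges):
    """Build an undirected weighted graph over `genes` and drop isolated genes.

    `edges` is an iterable of (gene_a, gene_b, weight) tuples.
    Returns {gene: {neighbor: weight}} containing only genes with >= 1 edge.
    """
    selected = set(genes)
    adjacency = defaultdict(dict)
    for a, b, weight in edges:
        # Keep only interactions between two selected, distinct genes.
        if a in selected and b in selected and a != b:
            adjacency[a][b] = weight
            adjacency[b][a] = weight
    # Genes that never appear in a kept edge are isolated and are left out.
    return {gene: dict(nbrs) for gene, nbrs in adjacency.items()}


# Hypothetical toy input: placeholder gene names and made-up weights.
toy_genes = ["G1", "G2", "G3", "G4"]
toy_edges = [("G1", "G2", 0.9), ("G2", "G3", 0.7), ("G3", "G9", 0.5)]
graph = build_gene_graph(toy_genes, toy_edges)
print(sorted(graph))  # G4 has no interaction and is dropped
```

In this toy input, the edge to `G9` is ignored because `G9` is not among the selected genes, and `G4` is removed because it has no remaining interaction.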

2.2 Network representation by deep walk and graph convolutional network

After we obtained the gene interaction network G = (V, E), we used two graph embedding methods to learn the representations of vertices.

2.2.1 Network representation by deep walk

Deep walk uses randomness to produce sequences of vertices $[v_1, v_2, \ldots, v_n]$, where $v_{i+1}$ is a vertex picked at random from the neighbors of vertex $v_i$, and the likelihood of choosing each neighbor is proportional to the weight of the corresponding edge in the adjacency matrix. In this paper, we built sequences starting at each vertex by using deep walk. Skip-gram (23) was then trained on the vertex sequences by sliding-window sampling. Deep walk is thus a combination of RW and skip-gram: RW is responsible for sampling to obtain the co-occurrence relationships between nodes in the graph, and skip-gram trains the embedding vectors of nodes from these relationships. After training, we obtain the embedding representation vectors and the probability distribution of the vertices. A representation vector optimizes the conditional probability $P(v_c \mid v_i)$, where $v_c$ is a vertex in the context window of $v_i$. The training loss is the skip-gram negative log-likelihood:

$$\mathcal{L} = -\sum_{i} \sum_{\substack{-W \le j \le W \\ j \ne 0}} \log P(v_{i+j} \mid v_i)$$

where $W$ represents the window size.
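The sampling half of deep walk can be sketched in a few lines. This is an illustrative sketch only: the helper names `random_walk` and `skipgram_pairs`, the toy graph, and the walk length are assumptions, and the embedding training itself (skip-gram) is not shown.

```python
import random


def random_walk(adj, start, length, rng):
    """Sample one walk; the next vertex is drawn with probability
    proportional to the weight of the connecting edge."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:
            break  # dead end: no neighbor to move to
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def skipgram_pairs(walk, window):
    """Collect (center, context) pairs from a sliding window over the walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# Hypothetical toy graph: {vertex: {neighbor: edge weight}}.
adj = {"a": {"b": 1.0, "c": 3.0}, "b": {"a": 1.0}, "c": {"a": 3.0}}
rng = random.Random(0)
walk = random_walk(adj, "a", length=6, rng=rng)
pairs = skipgram_pairs(walk, window=2)
print(walk, len(pairs))
```

The resulting pairs are the co-occurrence samples that a skip-gram model would consume. With edge weights 1.0 and 3.0, vertex `c` is three times as likely as `b` to follow `a`.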

2.2.2 Network representation by graph convolutional network

The other graph embedding method we used is GCN. GCN uses the graph network to learn the node and edge information of the graph. Compared with deep walk, GCN can not only learn the structure of each node and its neighborhood but also integrate the characteristics of each node. If $A$ is the adjacency matrix, the Laplacian matrix is:

$$L = D - A$$

where $D$ is the degree matrix of the network. Since the features of genes should contain not only the connections between nodes but also the information of the node itself, self-connections are added:

$$\tilde{A} = A + I$$

where $I$ is the identity matrix. Then, the inverse degree matrix $D'$ of $\tilde{A}$ can be obtained. Last, the features are computed by propagating $X$ over $\tilde{A}$ normalized by $D'$ and applying $\sigma$, where $X$ is the feature vector of each vertex and $\sigma$ is the activation function. In the study, we used the Leaky Rectified Linear Unit (Leaky ReLU) function (24) as the activation function. Compared with other activations, it may reduce the likelihood of vanishing gradients and boost feature sparsity. The formula is as follows:

$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$

where $\alpha$ is a small positive slope. Two feature vectors of each vertex were thus generated, one by deep walk and one by GCN. The two feature vectors were then fused and delivered to the prediction module.
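A single GCN layer can be sketched as follows. The exact normalization used in deepRW is not recoverable from the text, so this sketch assumes the widely used symmetric form, which applies Leaky ReLU to $D'^{-1/2}(A + I)D'^{-1/2} X \Theta$. The weight matrix `theta`, the slope `alpha = 0.01`, and the toy inputs are assumptions for illustration.

```python
import math


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return x if x > 0 else alpha * x


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, feats, theta, alpha=0.01):
    """One GCN layer with self-loops and symmetric degree normalization."""
    n = len(adj)
    # A~ = A + I adds a self-connection to every vertex.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # D'^(-1/2), stored as a vector because D' is diagonal.
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    a_hat = [[d_inv_sqrt[i] * a_tilde[i][j] * d_inv_sqrt[j] for j in range(n)]
             for i in range(n)]
    hidden = matmul(matmul(a_hat, feats), theta)
    return [[leaky_relu(v, alpha) for v in row] for row in hidden]


# Hypothetical two-vertex graph with 2-dimensional features.
adj = [[0.0, 1.0], [1.0, 0.0]]
feats = [[1.0, 0.0], [0.0, -2.0]]
theta = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for readability
out = gcn_layer(adj, feats, theta)
print(out)
```

With identity weights, each output row is the normalized average of the two vertices' features, and the negative component is scaled by `alpha` rather than zeroed.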

2.3 Network prediction by deep neural network

To increase the quality of features and determine whether a gene is related to NSCLC, we employed a DNN module after network representation by deep walk and GCN. Whether the connection between the input and the output is linear or non-linear, a DNN can learn the mathematical operation that converts the input into the output. Most current classification methods are shallow-structure algorithms, whose ability to represent complex functions is limited when samples and computational units are limited, and whose generalization on complex classification problems is therefore restricted. Deep learning can approximate complex functions by learning a deep non-linear network structure and can learn distributed representations of the input data. DNN thus has a stronger ability to abstract problems and can model more complex relationships. The feature map that advances to the next layer is determined by:

$$\mathrm{Output}^{(l)} = W^{(l)} \cdot \mathrm{Input}^{(l)} + \mathrm{Bias}^{(l)}$$

where $\mathrm{Input}^{(l)}$ is the input of the forward propagation, $\mathrm{Output}^{(l)}$ is the output, $\mathrm{Bias}^{(l)}$ is the bias of layer $l$, and $W^{(l)}$ is the weight of the neurons. The output of each layer is then passed through an activation function, which boosts positive vectors and suppresses negative vectors from the previous layer. We again used Leaky ReLU as the activation function in the prediction module. Figure 2 depicts the number of layers of the DNN module and the specific parameters of each layer. There are three layers in the DNN module. Identifying NSCLC-related genes is a binary classification task, so we applied softmax as the activation function of the output layer. We used binary cross-entropy as the loss function as follows:

$$\mathrm{Loss} = -\left[ y \log p + (1 - y) \log (1 - p) \right]$$

Figure 2

The structure of the DNN module.

In the loss function, y is the true value and p is the predicted value. We also used batch normalization (25) and early stopping in the training process, which ends training if no improvement is shown after 50 epochs.
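The prediction module can be sketched as a forward pass, a loss, and a stopping rule. Several details are assumptions, since the text does not specify them: the two embeddings are fused by concatenation, the loss is averaged over samples, "no improvement" is judged on a monitored loss, and all weights are toy values. Batch normalization and the real layer sizes are omitted.

```python
import math


def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x


def dense(inputs, weights, bias):
    """Output = W * Input + Bias; weights[j] holds the weights of neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y log p + (1 - y) log(1 - p)] over the samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def should_stop(losses, patience=50):
    """True once the best loss so far is at least `patience` epochs old."""
    best_epoch = losses.index(min(losses))
    return len(losses) - 1 - best_epoch >= patience


# Hypothetical 1-dimensional embeddings, fused by concatenation (assumed).
deepwalk_vec, gcn_vec = [1.0], [-1.0]
fused = deepwalk_vec + gcn_vec
hidden = [leaky_relu(v) for v in dense(fused, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
probs = softmax(dense(hidden, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
print(probs, loss)
```

Here `probs[1]` would play the role of the predicted probability p that a gene is NSCLC-related, and `should_stop` mirrors the 50-epoch patience described above.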

3 Materials

3.1 Dataset

DisGeNET (26) is used to obtain the gene–disease associations. DisGeNET is a database that contains information on the links between genes and diseases. It is one of the biggest collections of genes and variants linked with human diseases. The data in DisGeNET come from a variety of sources, including expert-curated archives, catalogs of GWAS, animal models, and published scientific articles. HumanNet (27), a probabilistic functional gene database, is used to generate the gene–gene associations; each gene–gene association has a score that represents the probability of the association. The gene network can be expressed as G = (N, E), where N is the set of genes and E is the set of associations between genes. The adjacency matrix of G is $W \in \mathbb{R}^{|N| \times |N|}$, where $w_{ij}$ is the weight of the association between genes $i$ and $j$ provided by HumanNet. In this paper, we found 142 genes linked to NSCLC from DisGeNET, covering stage I, II, III, IIIA, and IIIB types (Supplementary Table 1). These 142 genes were positive samples, and another 142 genes reported to be irrelevant to NSCLC were randomly chosen as negative samples. We used tissue gene expression from BioGPS (28) as the gene features.
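The dataset assembly can be sketched as two small helpers: one builds the symmetric weighted adjacency matrix from scored gene pairs, and one draws as many negative genes as there are positives. The function names, the toy gene names, and the scores are hypothetical. Drawing negatives from all non-positive candidates is a simplification: in the paper, the negatives are genes reported to be irrelevant to NSCLC.

```python
import random


def adjacency_matrix(genes, scored_edges):
    """Return the symmetric |N| x |N| matrix W with w_ij = association score."""
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    w = [[0.0] * n for _ in range(n)]
    for a, b, score in scored_edges:
        if a in index and b in index:
            i, j = index[a], index[b]
            w[i][j] = score
            w[j][i] = score
    return w


def balanced_labels(positives, candidates, rng):
    """Label positives 1 and an equally sized random set of other genes 0."""
    pool = sorted(set(candidates) - set(positives))
    negatives = rng.sample(pool, len(positives))
    labels = {gene: 1 for gene in positives}
    labels.update({gene: 0 for gene in negatives})
    return labels


# Hypothetical toy data: placeholder names and made-up scores.
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
edges = [("G1", "G2", 0.8), ("G2", "G5", 0.3), ("G4", "G6", 0.6)]
W = adjacency_matrix(genes, edges)
labels = balanced_labels(["G1", "G2"], genes, random.Random(0))
print(W[0][1], sorted(labels.items()))
```

Applied to the numbers in the text, `positives` would hold the 142 DisGeNET genes, giving 284 labeled samples in total.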

3.2 Experimental setup

To demonstrate the performance of deepRW, we utilized 10-fold cross-validation and repeated the experiments 10 times. The dataset is separated into 10 subsets; in each experiment, we randomly choose one subset as the test samples and use the others as the training samples. The area under the precision-recall curve (AUPR) and the area under the ROC curve (AUROC) are used to evaluate the effectiveness of the methods. The main hyperparameters for training were set as follows: the window size of skip-gram was set to 10, and skip-gram was trained for 10 iterations. The three-layer GCN and the DNN module were trained for 50 epochs, with early stopping and Adam with default parameters. To demonstrate the effectiveness of our method, we compared it with the following models.
RWR: Random walk with restart (29) captures relationships between two nodes and the overall structure information of the network by calculating the proximity between two nodes.
KBMF: Kernelized Bayesian matrix factorization (30), which is often used in recommender systems, can take advantage of many side information sources.
RF: Random forest (31) is a classifier that contains multiple decision trees, and its output is determined by the votes of the individual trees.
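The evaluation protocol can be sketched without any machine learning library. The sketch below assumes the standard form of k-fold cross-validation, where every subset serves as the test set exactly once, and computes AUROC as the probability that a positive sample is scored above a negative one. The helper names are hypothetical, and 284 is the number of samples implied by 142 positive and 142 negative genes.

```python
import random


def k_fold_indices(n_samples, k, rng):
    """Yield (train, test) index lists; each fold is the test set once."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


splits = list(k_fold_indices(284, 10, random.Random(0)))
print(len(splits), [len(test) for _, test in splits])
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In the small AUROC example, three of the four positive-negative pairs are ranked correctly. Averaging the per-fold scores, and then averaging over repeated runs with different seeds, would correspond to the repeated experiments described above.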

4 Results

4.1 Performance of deep walk and graph convolutional network

First, we examined the influence of the number of GCN layers. Stacking too many layers in a GCN is known to cause vanishing gradients during back-propagation and over-smoothing, in which the features of graph vertices eventually converge to the same value (32). We constructed the GCN module with two, three, and four layers. The results are shown in Table 1. GCN with three layers obtained the highest scores, and stacking four layers slightly reduced performance because of over-smoothing. In this paper, we therefore built the GCN encoder with three layers.
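Over-smoothing can be seen in a toy setting. The sketch below is a deliberate simplification, not the deepRW encoder: it repeatedly applies only the row-normalized neighborhood averaging with self-loops (no trainable weights, no activation) to a hypothetical four-vertex path graph and tracks how far apart the vertex features remain.

```python
def propagate(adj, x):
    """One averaging step with self-loops: x <- D'^-1 (A + I) x."""
    n = len(adj)
    out = []
    for i in range(n):
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        out.append(sum(r * v for r, v in zip(row, x)) / sum(row))
    return out


# Hypothetical path graph 0 - 1 - 2 - 3 with one scalar feature per vertex.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
x = [1.0, 0.0, 0.0, -1.0]
spread = []  # max feature minus min feature after each step
for _ in range(200):
    x = propagate(path, x)
    spread.append(max(x) - min(x))
print(spread[0], spread[3], spread[-1])
```

The spread shrinks with every step and approaches zero, meaning that all vertices end up with nearly the same feature value. This is the behavior that makes very deep GCNs lose discriminative power.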

Table 1

The effectiveness of deep walk and GCN in deepRW.

Number of layers	AUROC	AUPR
Two layers	0.702	0.723
Three layers	0.763	0.795
Four layers	0.741	0.769

AUROC, area under the ROC curve; AUPR, The area under the precision recall curve.

Then, we demonstrated the effectiveness of each module through comparison trials in which a specific module was removed from our method. In “Without GCN,” we only used deep walk as the network representation module. In “Without deep walk,” we only used GCN. In “DNN,” we directly used DNN as both the encoder and the decoder. Table 2 shows the results. Without GCN or deep walk, our method obtained worse scores. We can conclude that GCN and deep walk are important parts of deepRW, and that integrating deep walk and GCN improves the ability to learn the graph network.

Table 2

The AUROC and AUPR scores of different methods.

Method	AUROC	AUPR
DeepRW	0.763	0.795
KBMF	0.701	0.748
RF	0.647	0.697
RWR	0.636	0.659

DeepRW, Deep random walk; KBMF, Kernelized Bayesian matrix factorization; RF, Random forest; RWR, Random walk with restart.


4.2 Performance of different methods

To reduce random error, we repeated the experiment 10 times and took the average scores as the final results. From the results in Table 3, deepRW outperformed all other methods, with AUROC and AUPR scores of 0.763 and 0.795. RWR obtained the worst scores, with AUROC and AUPR of 0.636 and 0.659, which are lower than those of deepRW by 16.64% and 17.11%, respectively. The results demonstrate that deepRW works better than a number of machine learning methods for identifying NSCLC-related genes. GCN and deepRW can extract feature information of nodes and edges. The results show that interactions between genes help enrich the characteristic information of genes. Compared with RWR, deep walk performed better because it combines RW and word2vec, which makes the algorithm easier to converge. Although deepRW obtained the best performance in the task, it needs a long time to train and more training data to get better results.
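The reported percentage gaps can be reproduced from the table values, assuming they are relative differences taken with respect to the deepRW score. The helper name `relative_drop` is hypothetical.

```python
def relative_drop(reference, other):
    """Percentage by which `other` falls below `reference`."""
    return (reference - other) / reference * 100.0


auroc_gap = relative_drop(0.763, 0.636)  # deepRW vs. RWR, AUROC
aupr_gap = relative_drop(0.795, 0.659)   # deepRW vs. RWR, AUPR
print(f"AUROC gap: {auroc_gap:.2f}%, AUPR gap: {aupr_gap:.2f}%")
```

Under this assumption, the two values round to 16.64% and 17.11%, matching the figures quoted in the text.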

Table 3

The AUROC and AUPR scores of different methods.

Method	AUROC	AUPR
DeepRW	0.763	0.795
KBMF	0.701	0.748
RF	0.647	0.697
RWR	0.636	0.659

DeepRW, Deep random walk; KBMF, Kernelized Bayesian matrix factorization; RF, Random forest; RWR, Random walk with restart.


5 Conclusion

Lung cancer is the leading cause of cancer death globally, and NSCLC is the main pathological subtype of lung cancer, accounting for about 85% of cases. Although the cost of sequencing continues to decrease, the amount of data continues to grow, so GWAS and NGS, the main techniques for finding disease-causing genes, remain time-consuming and laborious, and machine learning methods are attracting more and more attention. In this paper, we proposed a new network-based method that integrates two different graph embedding methods to identify genes related to NSCLC. To learn the relationships between genes and diseases, we first built a gene interaction network made up of genes both related and unrelated to NSCLC. Then, we utilized deep walk and GCN to learn gene–disease interactions. Finally, a DNN was constructed as the prediction module. This method takes the topology of the gene network into account and is conducive to mining genetic characteristics. We compared our method with several other methods and demonstrated its better performance. We also did case studies on new samples to verify the effectiveness of deepRW. We found that tumor protein p63 (TP63) is related to NSCLC. Gürgen et al. (33) found that TP63 expression values were higher than the predefined cutoff of 12 in 23 NSCLC tumors with squamous cell carcinoma histology. General transcription factor IIH subunit 4 (GTF2H4) was also found, supported by Wang et al. (34), who reported that GTF2H4 is associated with lung cancer risk. Compared with traditional machine learning methods, deepRW, as a deep learning method, needs more time and more samples to train to obtain better performance. In the future, we will study the ability of deepRW to identify other pathogenic genes.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Ethics statement

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

Author contributions

YCa, YL, and JW conceived and designed the study and collected and analyzed the data. QW and YCh performed the statistical analyses. YCa, YL, and YCh drafted and edited the manuscript. All authors contributed to the article and approved the submitted version.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

21 in total

1. Kernelized Bayesian Matrix Factorization.

Authors: Mehmet Gönen; Samuel Kaski
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2014-10 Impact factor: 6.226

2. deepDR: a network-based deep learning approach to in silico drug repositioning.

Authors: Xiangxiang Zeng; Siyi Zhu; Xiangrong Liu; Yadi Zhou; Ruth Nussinov; Feixiong Cheng
Journal: Bioinformatics Date: 2019-12-15 Impact factor: 6.937

3. Genetic variant in DNA repair gene GTF2H4 is associated with lung cancer risk: a large-scale analysis of six published GWAS datasets in the TRICL consortium.

Authors: Meilin Wang; Hongliang Liu; Zhensheng Liu; Xiaohua Yi; Heike Bickeboller; Rayjean J Hung; Paul Brennan; Maria Teresa Landi; Neil Caporaso; David C Christiani; Jennifer Anne Doherty; Christopher I Amos; Qingyi Wei
Journal: Carcinogenesis Date: 2016-06-10 Impact factor: 4.944

4. Deep-DRM: a computational method for identifying disease-related metabolites based on graph deep learning approaches.

Authors: Tianyi Zhao; Yang Hu; Liang Cheng
Journal: Brief Bioinform Date: 2021-07-20 Impact factor: 11.622

5. A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25.

Authors: Rayjean J Hung; James D McKay; Valerie Gaborieau; Paolo Boffetta; Mia Hashibe; David Zaridze; Anush Mukeria; Neonilia Szeszenia-Dabrowska; Jolanta Lissowska; Peter Rudnai; Eleonora Fabianova; Dana Mates; Vladimir Bencko; Lenka Foretova; Vladimir Janout; Chu Chen; Gary Goodman; John K Field; Triantafillos Liloglou; George Xinarianos; Adrian Cassidy; John McLaughlin; Geoffrey Liu; Steven Narod; Hans E Krokan; Frank Skorpen; Maiken Bratt Elvestad; Kristian Hveem; Lars Vatten; Jakob Linseisen; Françoise Clavel-Chapelon; Paolo Vineis; H Bas Bueno-de-Mesquita; Eiliv Lund; Carmen Martinez; Sheila Bingham; Torgny Rasmuson; Pierre Hainaut; Elio Riboli; Wolfgang Ahrens; Simone Benhamou; Pagona Lagiou; Dimitrios Trichopoulos; Ivana Holcátová; Franco Merletti; Kristina Kjaerheim; Antonio Agudo; Gary Macfarlane; Renato Talamini; Lorenzo Simonato; Ray Lowry; David I Conway; Ariana Znaor; Claire Healy; Diana Zelenika; Anne Boland; Marc Delepine; Mario Foglio; Doris Lechner; Fumihiko Matsuda; Helene Blanche; Ivo Gut; Simon Heath; Mark Lathrop; Paul Brennan
Journal: Nature Date: 2008-04-03 Impact factor: 49.962

6. Focused Analysis of Exome Sequencing Data for Rare Germline Mutations in Familial and Sporadic Lung Cancer.

Authors: Yanhong Liu; Farrah Kheradmand; Caleb F Davis; Michael E Scheurer; David Wheeler; Spiridon Tsavachidis; Georgina Armstrong; Claire Simpson; Diptasri Mandal; Elena Kupert; Marshall Anderson; Ming You; Donghai Xiong; Claudio Pikielny; Ann G Schwartz; Joan Bailey-Wilson; Colette Gaba; Mariza De Andrade; Ping Yang; Susan M Pinney; Christopher I Amos; Margaret R Spitz
Journal: J Thorac Oncol Date: 2016-01 Impact factor: 15.609