An enhanced topologically significant directed random walk in cancer classification using gene expression datasets.

Choon Sen Seah, Shahreen Kasim, Mohd Farhan Md Fudzee, Jeffrey Mark Law Tze Ping, Mohd Saberi Mohamad, Rd Rohmat Saedudin, Mohd Arfian Ismail.

Abstract

Microarray technology has become one of the elementary tools for researchers to study the genomes of organisms. As the complexity and heterogeneity of cancer are increasingly appreciated through genomic analysis, cancer classification has become an important emerging trend. The significant directed random walk is proposed as a cancer classification approach with higher sensitivity of risk gene prediction and higher accuracy of cancer classification. In this paper, the methodology and materials used for the experiment are presented. A tuning parameter selection method and the use of weight as a parameter are applied in the proposed approach. Gene expression datasets are used as the input datasets, while pathway datasets are used as reference datasets to build a directed graph that completes the bias process in the random walk approach. In addition, we demonstrate that our approach can improve the sensitivity of prediction while giving higher accuracy and biologically meaningful classification results. The significant directed random walk is compared with the directed random walk to show the improvement in terms of sensitivity of prediction and accuracy of cancer classification.

Keywords:  Directed random walk algorithm; Group specific tuning parameter; Pathway

Year:  2017        PMID: 29551932      PMCID: PMC5851940          DOI: 10.1016/j.sjbs.2017.11.024

Source DB:  PubMed          Journal:  Saudi J Biol Sci        ISSN: 1319-562X            Impact factor:   4.219


Introduction

Early detection is one of the key elements in reducing the mortality rate among disease carriers. Accurate determination of the type of cancer allows adequate early treatment and helps to ensure that the treatment is efficient. For example, early malignant pleural mesothelioma is optimally treated by extrapleural pneumonectomy followed by radiochemotherapy, whereas metastatic lung cancer is treated by chemotherapy only (Kirk, 2007). Anticancer strategies are built on tumor morphology (morphogenesis). As technology has advanced, many researchers have investigated gene expression patterns and studied gene mutation (Shao et al., 2011, Fahey, 2010). The microarray has been an experimental tool for extracting functional information from the genome (Bair, 2013). In recent years, many researchers have used microarrays to profile the gene expression patterns of abnormal and normal genes in cancer (Srivastava et al., 2014, Lin et al., 2016). Such studies shed light on obtaining biomarkers for cancer classification. Cancer classification enables doctors to gain insight into gene expression patterns, such as gene function and the interactions between genes. Microarrays have been adopted to profile gene expression datasets, which are then applied in cancer classification. The success rate of cancer classification tools depends largely on data mining (Young, 2016), because only part of a gene expression dataset shows expression levels that are significant for cancer. Therefore, a classification tool that can identify cancerous genes with high accuracy is needed. There are several types of cancer prediction and cancer classification approaches (Young, 2016, Malla, 2017). In recent years, the random walk algorithm has been used by several researchers (Revathy and Amalraj, 2011, Li and Li, 2012, Petrochilos et al., 2013, Lan et al., 2016, Matteo and Random, 2017) to make cancer classification more efficient.
In 2011, Revathy studied the use of random walk to improve cancer classification accuracy (Revathy and Amalraj, 2011). Using a multi-directed graph, the Random Walk with Restart on Multigraphs (RWRM) introduced by Li is able to identify genes with a higher area under the curve (AUC) value (Li and Li, 2012). Petrochilos introduced Walktrap, a random-walk-based community detection algorithm, to identify biological modules predisposing to cancer growth in gene expression datasets (Petrochilos et al., 2013). The bi-random walk, proposed by Lan in 2016, is used to identify potential miRNA-environmental factor interactions (Lan et al., 2016). In 2017, Matteo introduced a random walk with restart probability that is able to rank cancerous genes with respect to cancer modules (Matteo and Random, 2017). When directed graphs are used to represent the random walk, the probability of a step is no longer 0.5 for both the forward and the backward direction (Suki and Frey, 2017). Instead, the walk has a bias probability, which comes from a walker that establishes a preferred direction. When the bias is small, the walk exhibits a positive asymptotic speed in the bias direction, whereas when the bias becomes large, the walk spends huge amounts of time in a constant, biased direction before eventually backtracking and continuing to march off to infinity (Yano, 2011). Hence, the results of each experiment no longer fluctuate broadly, owing to the more systematic use of a random walk with a bias probability. Codling derived a biased telegraph equation from different turning probabilities, which are applied according to the direction of movement (Codling et al., 2008). Zlatić extended a similar analogy, applying parametric equations of motion to study the features of biased random walks as a function of parameter values (Zlatic et al., 2010).
In 2013, Liu developed the directed random walk (DRW), which is based on a biased random walk, to great effect (Liu et al., 2013). This model was developed to classify cancer genes by means of a directed graph. DRW proved successful in classifying cancer genes by setting an initial node and applying a restart probability when the vector drops to a certain value. However, the algorithm does not focus on enhancing the sensitivity of prediction, and the accuracy of cancer classification can be enhanced further. Hence, the method proposed here for efficient cancer classification is the significant directed random walk (sDRW), an enhancement of the directed random walk. In this study, we considerably extend our preliminary work (Seah et al., 2017). The restart probability parameter in the directed random walk (Revathy and Amalraj, 2011) is studied, and the initial parameters of the directed random walk are expanded by taking into account the weight of each biological pathway and its relationship with genes. The sensitivity of prediction is enhanced by strengthening the bond between two genes within the gene expression data. The restart probability parameter is tuned in the range 0.1–0.9 to determine the optimum restart probability for each dataset. Datasets are divided into training and test sets by K-fold cross validation. A classifier is built and tested to assess the classification results, and its reliability is demonstrated through the accuracy of cancer classification. The lung cancer dataset is used as the benchmark dataset, and its implementation in the directed random walk is analysed (Liu et al., 2013). The results are then compared with previous work.
The contributions of this approach are as follows: (1) we test the tuning parameter selection method with more datasets; (2) we improve the directed random walk by implementing weight as a parameter; (3) we provide a detailed analysis of the proposed significant directed random walk through extended experiments conducted with six gene expression datasets; and (4) we report statistically significant results by comparison with previous work. In Section 2, we present the datasets used in the experiment and the details of the methodology of the proposed approach. In Section 3, we present the results and discussion of cancer prediction and cancer classification. Lastly, we draw conclusions in Section 4.

Materials and methods

Experimental data

The proposed algorithm, significant directed random walk, is tested with six different input datasets and a group of reference datasets. The input datasets are briefly described in Section 2.1.1, while the reference datasets are presented in Section 2.1.2.

Input datasets

The purpose of the experiment is to evaluate the effectiveness of the significant directed random walk (sDRW) approach on six publicly available gene expression datasets. These datasets were obtained from the Gene Expression Omnibus (GEO), the web-based database of the National Centre for Biotechnology Information (NCBI). The GEO database stores original submitter-supplied records as well as curated datasets. The datasets are briefly described as follows.
Lung Cancer Dataset (Landi et al., 2008): The GEO ID of the chosen lung cancer dataset is GSE10072. It consists of 107 samples, of which 58 are cancer samples and 49 are normal tissue samples. These samples were collected from 20 non-smokers, 26 former smokers and 28 current smokers. The platform used to prepare the Affymetrix microarray gene expression dataset is GPL96. The sample IDs range from GSM254625 to GSM254731.
Liver Cancer Dataset (Tsuchiya et al., 2010): The GEO ID of the chosen liver cancer dataset is GSE17856. It consists of 95 samples, but only 87 are chosen as sample datasets. Of these 87 samples, 43 are cancer samples and 44 are normal samples. These 87 samples are hepatocellular carcinoma tissue samples, while the remaining 8 samples are metastatic liver cancer samples. The cancer cells found in metastatic liver cancer are not liver cells because they have migrated from other parts of the body (Roessler et al., 2015). The platform used to prepare the Affymetrix microarray gene expression dataset is GPL6480. The IDs of the samples used in the experiment range from GSM446165 to GSM446251.
Thyroid Cancer Dataset (Yu et al., 2008): The GEO ID of the chosen thyroid cancer dataset is GSE5364. This dataset covers several cancer types, but only the thyroid cancer samples are chosen as the sample dataset. Of the 341 samples, 51 relate to thyroid cancer: 35 cancer samples and 16 normal samples. The platform used to prepare the Affymetrix microarray gene expression dataset is GPL96. The IDs of the thyroid samples range from GSM121979 to GSM122029.
Stomach Cancer Dataset (D’Errico et al., 2009): The GEO ID of the chosen stomach cancer dataset is GSE13911. This dataset mainly focuses on microsatellite instability (MSI) and microsatellite stability (MSS), which relate to DNA mismatch repair genes that do not function normally. It consists of 38 cancer samples and 31 normal samples. It was prepared with the Affymetrix Human Genome U133 Plus 2.0 Array, with platform ID GPL570. The sample IDs range from GSM350411 to GSM350479.
Kidney Cancer Dataset (Dalgliesh et al., 2010): The GEO ID of the chosen kidney cancer dataset is GSE17895. It focuses on renal cell carcinoma, a kidney cancer that originates in the lining of the proximal convoluted tubule (small tubes in the kidney that transport urine) (Gaur et al., 2017). It consists of 138 cancer samples and 22 normal samples. It was prepared with the Affymetrix GeneChip Human Genome U133 Plus 2.0 Array, with platform ID GPL9101. The sample IDs range from GSM444445 to GSM444610.
Breast Cancer Dataset (Pawitan et al., 2005): The GEO ID of the chosen breast cancer dataset is GSE1456. It was prepared on two different Affymetrix platforms, GPL96 and GPL97. The results from GPL96 are used as the input dataset, comprising 22 poor samples and 130 good samples. Breast cancer patients who died within 5 years are considered poor samples, while patients who survived more than 5 years without any additional reported events are considered good samples. The sample IDs within GPL96 range from GSM107072 to GSM107231.

Reference datasets

Reference datasets are additional datasets that support the experiments. In the experiment with the proposed significant directed random walk, directed graphs are used as the reference data. These directed graphs are built from 300 pathway datasets, comprising 150 metabolic and 150 non-metabolic pathways. Fig. 1 shows an example of a pathway dataset, Leukocyte Transendothelial Migration (KEGG PATHWAY, 2017). The pathway datasets were obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway Database.
Fig. 1

Biological pathway of Leukocyte Transendothelial Migration (KEGG PATHWAY, 2017).

KEGG pathways are converted into a directed graph using the SubpathwayMiner package in R (Li et al., 2009). This directed graph covers 4113 nodes (genes) and 40875 directed edges. Each directed edge represents an interaction between genes, and these interactions are taken from the pathway datasets. Fig. 2 shows a simple illustration of a single pathway from a complete pathway dataset, in which the influence of one gene on another is represented by the direction of the arrowheads. Gene A influences gene B, while gene B influences gene C. Gene C influences both genes E and D, which both influence gene F. Fig. 3 illustrates the influence between genes. In graph theory, eigenvector centrality is used to measure the influence of a node in a network (Meghanathan, 2015). Fig. 4 illustrates the significance of genes in terms of their weights. The weight of each gene is determined by the number of genes connected to it: the more genes are connected, the heavier the weight of the gene and the higher its significance. Thus, gene C is regarded as highly significant compared to genes A and B. Fig. 5 highlights the genes of a single pathway, and Table 1 lists their weights. The genes highlighted in red, EPAC, Rap1, ITGAL, Pyk2, Vav and RhoA, form a simple pathway within the biological pathway Leukocyte Transendothelial Migration (KEGG PATHWAY, 2017). These six genes are used to show the relationship between connection and weight. The highest weight, 11.38365, belongs to the gene Vav, which is activated by two different genes (ITGAL and Pyk2), as shown in Fig. 5. A gene is important if it is influenced by other genes (Draghici et al., 2007).
Fig. 2

Simple illustration of single pathway data.

Fig. 3

Simple illustration of the relationship between genes.

Fig. 4

Simple illustration of relationship of weight among genes.

Fig. 5

Highlighted genes to represent a single pathway.

Table 1

Weight of highlighted gene in Fig. 5.

Nodes   EPAC      Rap1     ITGAL   Pyk2      Vav       RhoA
Weight  2.338914  8.47301  6.1441  3.102989  11.38365  5.149393
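
To make the relationship between connections and weight concrete, the following minimal Python sketch counts the connections of each gene in the toy pathway described for Fig. 2. It is an illustration only: the edge list is transcribed from the description of Fig. 2, the function name `connection_counts` is hypothetical, and a plain degree count is assumed as a stand-in for the weight, since the text does not give the exact formula behind the values in Table 1.

```python
from collections import defaultdict

# Directed edges of the toy pathway described for Fig. 2:
# A -> B -> C, C -> D and C -> E, then D -> F and E -> F.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "F"), ("E", "F")]


def connection_counts(edges):
    """Count incoming, outgoing and total connections for every gene.

    Assumption: a simple degree count stands in for the gene weight.
    """
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    genes = sorted(set(in_degree) | set(out_degree))
    return {
        gene: {
            "in": in_degree[gene],
            "out": out_degree[gene],
            "total": in_degree[gene] + out_degree[gene],
        }
        for gene in genes
    }


counts = connection_counts(EDGES)
for gene, degree in counts.items():
    print(gene, degree)
```

Under this assumed count, gene C has three connections against one for gene A and two for gene B, which agrees with the ranking stated in the text. Gene F is the only gene with two incoming edges.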

Methodology

This section describes the approaches used in constructing the significant directed random walk. To improve on the existing biased random walk (the directed random walk), the directed random walk was studied and tested. In the study by Liu (Liu et al., 2013), the restart probability was set to 0.7. In sDRW, the restart probability is tuned over the range 0.1–0.9. After several experiments, more risk pathways were predicted, and the results show improvement in the sensitivity of cancer prediction and the accuracy of cancer classification. Another approach in the construction of sDRW considers the relevance of weight in determining the level of significance towards cancerous mutation, as utilized by Playdon (Playdon et al., 2013). The relationship of weight between the two nearest genes in a pathway is used as a key parameter to differentiate cancerous genes from normal genes. Hence, tuning parameter selection and weight as a parameter are implemented in sDRW to optimize algorithm performance, namely the sensitivity of cancer prediction and the accuracy of cancer classification.

Significant directed random walk (sDRW)

Significant directed random walk (sDRW) is an improved biased random walk used in cancerous gene prediction and classification. The first approach makes specific hypotheses about the predictive significance of relative gene expression by providing a range of restart probabilities for different cancer datasets. Although such an approach may not represent the accuracy of every dataset, it shows the optimum accuracy of different datasets with different restart probabilities. The second approach in sDRW implements weight as one of its parameters. The weight differs from gene to gene and is dictated by the influence of preceding genes. If a gene is influenced by many genes, as represented by the direction of the arrowheads, it has a higher weight than the rest. Fig. 6 illustrates the whole structure of sDRW.
Fig. 6

Flowchart of sDRW.

Tuning parameter selection

During preliminary work, tuning parameter selection was used in the directed random walk algorithm (Seah et al., 2017). The directed random walk algorithm might exclude some informative genes in a selected pathway because it is limited to a single, constant restart probability. With only a single, constant restart probability, the optimum results cannot be obtained, because different datasets have different patterns of pathways. For example, cancer datasets A and B involve different biological pathways, and these pathways play an important role in determining cancerous genes. Therefore, tuning parameter selection is proposed in the significant directed random walk to find the optimum restart probability for each cancer dataset. Tuning parameter selection aims to estimate a near-optimum parameter for a pathway (Misman et al., 2014). It is also used to identify an effective predictive model and cancer classification. Therefore, tuning parameter selection can lead to better performance of sDRW compared with DRW. The directed random walk algorithm uses a constant restart probability, r, also known as gamma (Liu et al., 2013). The restart probability plays an important role in determining whether the random walk process needs to restart. In the significant directed random walk, the restart probability is used as the tuning parameter (Seah et al., 2017). It is applied to estimate the probability that the walker moves to a neighbouring node or goes back to the previous node. With a variety of restart probabilities, sDRW can list all the risk pathways that are topologically important and significant to the corresponding cancerous genes, although the processing time increases ninefold because the random walk is repeated once for each of the nine restart probabilities.
In the directed random walk algorithm, the restart probability is set to 0.7. In addition to 0.7, sDRW uses eight further restart probabilities in the initial stage of the experiment: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, and 0.9. By using different restart probabilities, the significant genes within pathways can be selected and classified with better accuracy. The process consists of three main steps. Firstly, the genes in the microarray datasets are selected and grouped according to their prior pathway information from the pathway datasets. This process is repeated for each pathway in the pathway datasets, and some genes might be excluded because they cannot be matched to or found in the pathway datasets (the directed graph). Next, the P-values of the genes are calculated, and the significance level of each gene is determined according to its P-value. This is followed by the calculation of the weight, t-score, and reproducibility power of pathways. Pseudo-code of tuning parameter selection in sDRW is shown in Fig. 7. The reproducibility of a gene determines its robustness and level of significance towards cancer: the higher the reproducibility, the more robust and significant the gene is towards cancer (Jadamba and Shin, 2014). Pathways that contain genes with a higher level of significance are predicted as risk pathways and further evaluated by the restart probabilities. With nine restart probabilities, the evaluation is carried out nine times, and the final selected risk pathways vary according to the restart probability.
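
The first step, grouping expression genes by pathway and setting aside genes that are absent from the directed graph, might be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `match_genes_to_pathways`, the dictionary layout and the gene label `GENE_X` are assumptions, while the six pathway genes are those highlighted in Fig. 5.

```python
def match_genes_to_pathways(expression_genes, pathways):
    """Group expressed genes by pathway and list genes found in no pathway.

    expression_genes: iterable of gene names from a gene expression dataset.
    pathways: dict mapping a pathway name to the list of its gene names.
    """
    expressed = set(expression_genes)
    graph_genes = set()
    for members in pathways.values():
        graph_genes.update(members)
    # Genes of each pathway that are also present in the expression dataset
    grouped = {
        name: sorted(expressed.intersection(members))
        for name, members in pathways.items()
    }
    # Expressed genes that cannot be matched to any pathway are excluded
    excluded = sorted(expressed - graph_genes)
    return grouped, excluded


PATHWAYS = {
    "Leukocyte transendothelial migration": [
        "EPAC", "Rap1", "ITGAL", "Pyk2", "Vav", "RhoA",
    ],
}
# GENE_X is a hypothetical gene that does not occur in the pathway data
grouped, excluded = match_genes_to_pathways(["Vav", "Rap1", "GENE_X"], PATHWAYS)
print(grouped)
print(excluded)
```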
Fig. 7

Pseudo code of tuning parameter selection method in sDRW.

The evaluation is based on the number of risk pathways selected for the corresponding input dataset. For example, of the nine restart probabilities, 0.1 selects the largest number of pathways in the lung cancer dataset. Hence, 0.1 is set as the default restart probability for the lung cancer dataset. Note that different cancer datasets require different restart probabilities. If two restart probabilities select the same number of risk pathways, a further evaluation step is taken: the number of significant genes is compared, and the restart probability that selects the highest number of risk genes is set as the default for the corresponding cancer dataset. For example, the restart probabilities 0.2 and 0.6 both selected three risk pathways for the stomach cancer dataset. However, the restart probability of 0.2 selected 53 significant genes, while the restart probability of 0.6 selected 72 significant genes. Hence, 0.6 is set as the default restart probability for the stomach cancer dataset. The evaluation method will be further refined based on the accuracy of classification.
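
The selection rule described here, most risk pathways first and most significant genes as the tie-breaker, can be sketched in a few lines of Python. This is a hedged illustration: the function name `select_restart_probability` and the input layout are assumptions, the stomach figures are the ones quoted in the text, and the second example uses made-up counts purely to show that the pathway count takes priority.

```python
def select_restart_probability(results):
    """Pick the default restart probability for one dataset.

    results maps a restart probability to a tuple
    (number of risk pathways, number of significant genes).
    Pathway count is compared first; gene count breaks ties.
    """
    return max(results, key=lambda r: (results[r][0], results[r][1]))


# Stomach cancer example from the text: both select three risk pathways,
# but 0.6 selects more significant genes (72 against 53).
stomach = {0.2: (3, 53), 0.6: (3, 72)}
stomach_default = select_restart_probability(stomach)

# Hypothetical counts: more pathways wins even with fewer genes.
hypothetical = {0.1: (3, 10), 0.5: (2, 99)}
hypothetical_default = select_restart_probability(hypothetical)

print(stomach_default, hypothetical_default)
```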

Weight as parameter

The weight of every gene is different, depending on the number of other genes influencing it: the more genes influence it, the higher its weight. In sDRW, weight is one of the important parameters in determining the relationship between genes (Seah et al., 2017). sDRW has shown that the weight of genes can affect the attraction bond between genes, which leads to a higher vector (Montenegro, 2009). In sDRW, the vector is the cost of travelling from node to node. The cost can be measured in different units, depending on the application. A directed graph is defined as a weighted graph when the weight value of each gene is attached to the corresponding node. The relationship between genes, that is, the direction from one gene to the next, is fixed by the pathway datasets. Since the pathway datasets are converted into a directed graph, a matrix can be constructed from the directed graph. A simple illustration of a pathway is shown in Fig. 8, with the corresponding matrix in Table 2.
Fig. 8

Simple illustration of pathway dataset.

Table 2

Adjacency matrix of relationship of gene.

A  1  2  3  4  5
1  0  0  1  0  0
2  0  0  0  1  0
3  0  1  0  0  1
4  0  0  0  0  0
5  1  0  0  0  0
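
As an illustration of how such a matrix follows from a directed graph, the sketch below rebuilds Table 2 from an edge list. The edge list is read off the non-zero entries of Table 2 under the assumption that rows are source nodes and columns are target nodes (Fig. 8 itself is not reproduced here), and the function name `adjacency_matrix` is hypothetical.

```python
# Directed edges read from the non-zero entries of Table 2
# (assumed convention: row = source node, column = target node).
EDGES = [(1, 3), (2, 4), (3, 2), (3, 5), (5, 1)]


def adjacency_matrix(edges, n_nodes):
    """Build an n_nodes x n_nodes 0/1 adjacency matrix from directed edges."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for source, target in edges:
        matrix[source - 1][target - 1] = 1  # nodes are numbered from 1
    return matrix


A = adjacency_matrix(EDGES, 5)
out_degree = [sum(row) for row in A]            # edges leaving each node
in_degree = [sum(column) for column in zip(*A)]  # edges entering each node

for row in A:
    print(row)
```

Row sums give the number of genes a node influences and column sums the number of genes influencing it, which is the quantity the text ties to gene weight.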
The weight of a gene relates to both the backward and the forward steps of the random walk. With a directed graph, the walk is biased towards the selected gene (Pawitan et al., 2005). Formally, sDRW is defined as [equation missing from the source text], where W is the vector, that is, the cost of travelling towards the next gene, r is the restart probability, ranging from 0.1 to 0.9, and M is an adjacency matrix derived from the original directed graph. Weight is one of the parameters and plays an important role in determining the connectivity between genes. Hence, the weights of two connected genes, nodes N-1 and N, are averaged to obtain a stable connectivity. W is the vector of node N, which is transmitted from node N-1 (Seah et al., 2017). In Fig. 5, the relationship between the genes can be written as EPAC -> Rap1 -> ITGAL -> Pyk2 -> Vav -> RhoA. The vector of sDRW is calculated for the genes shown in Fig. 5. The initial vector, W0, of the first node is zero because it is the initial node. Table 3 shows the resulting vectors of sDRW from the first node to the sixth node. The vector fluctuates because the weight of each gene influences the value. Weight plays an important role in attracting the other nodes.
Table 3

Result of vector from first node to sixth node.

Vector, W  Significant Directed Random Walk
W0         0
W1         3.243577
W2         5.682564
W3         5.047153
W4         6.364853
W5         7.505854
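
The values in Table 3 can be checked numerically against the weights in Table 1. The sketch below is an inference from the two tables, not the authors' stated formula: it assumes that each vector is 0.6 times the mean weight of the two connected genes plus 0.4 times the previous vector, starting from W0 = 0. The coefficients 0.6 and 0.4, the parameter name `carry` and the function name `walk_vectors` are assumptions; how they relate to the restart probability r is not stated in the text.

```python
# Gene weights from Table 1, in pathway order
# EPAC -> Rap1 -> ITGAL -> Pyk2 -> Vav -> RhoA
WEIGHTS = [2.338914, 8.47301, 6.1441, 3.102989, 11.38365, 5.149393]

# Values reported in Table 3 (W0 to W5)
TABLE_3 = [0.0, 3.243577, 5.682564, 5.047153, 6.364853, 7.505854]


def walk_vectors(weights, carry=0.4):
    """Return the vector at each node along a single chain of genes.

    carry is an inferred coefficient (assumption): the share of the
    previous vector kept at each step. The remaining share is applied
    to the mean weight of the two connected genes.
    """
    vectors = [0.0]  # W0 is zero at the initial node
    for previous_weight, next_weight in zip(weights, weights[1:]):
        mean_weight = (previous_weight + next_weight) / 2
        vectors.append((1 - carry) * mean_weight + carry * vectors[-1])
    return vectors


vectors = walk_vectors(WEIGHTS)
for step, (computed, reported) in enumerate(zip(vectors, TABLE_3)):
    print(f"W{step}: computed {computed:.6f}, reported {reported:.6f}")
```

Under these assumed coefficients the computed values agree with Table 3 to within about 1e-6, which is consistent with rounding of the published weights. The drop from W2 to W3 comes from the low weight of Pyk2 pulling down the mean for that step.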
If the nodes are strongly connected to each other, the vector will be higher, and this vector contributes to the next walk. The t-test score is also used in sDRW for the initial probability, so the magnitude of the t-test score contributes to the weight adjustments. Therefore, genes with higher weights are topologically important for cancer and significantly different from normal genes. Fig. 9 illustrates the formation of a directed graph from different pathways. Each shape indicates a different biological pathway that is topologically important to a different cancer. Initially, the random walk starts from A, 1 or I. It randomly selects significant genes based on the weight of the next gene. For example, within a single pathway the walker walks from 1 towards 2. When pathways are combined to form a directed graph, the walker walks randomly in any direction defined in the directed graph. However, if B has a higher weight than 2, the walker prefers to walk towards B instead of 2. With different restart probabilities for different input datasets, the walker is prevented from walking according to only one criterion. The walker evaluates the suitable restart probability and calculates the most suitable path for the corresponding dataset.
Fig. 9

Complex illustration of pathway dataset.

In sDRW, tuning parameter selection and the weight parameter are combined to evaluate every possibility, in order to optimize the prediction and classification of the input datasets. Restart probabilities from 0.1 to 0.9 are used for all input datasets even though there is a default optimum restart probability for each dataset. This is because some restart probabilities predicted different risk pathways from those of the default restart probability, and these risk pathways also contain cancerous genes. To avoid missing any cancerous classification, all restart probabilities are used, and the default restart probability is shown in bold to highlight the differences in the number of risk pathways, the number of risk genes and the area under the curve, which measures accuracy.

Results and discussion

In this section, the performance of sDRW is evaluated in two ways. These are used to study the effectiveness and performance of sDRW: the sensitivity of cancer prediction and the accuracy of cancer classification.

Cancer prediction

The prediction method is used to predict the risk pathways and significant genes before the genes are classified. Gene expression datasets are run on the weighted directed graph. In sDRW, the walker examines the vector and P-value of each gene in a pathway. If a pathway contains genes with a P-value of less than 0.05, the pathway is used in constructing the directed graph (Štefka and Holeňa, 2013), because the P-value determines the significance towards cancer mutation. The experiment was run with six different input datasets. First, risk pathways are predicted; with these detected risk pathways, further prediction identifies the risk genes among them. The restart probability that detects the most pathways, in comparison with the rest, is set as the optimum restart probability for the corresponding dataset. The sensitivity of prediction is counted as the number of pathways detected by the optimum restart probability. Hence, the restart probability that predicts the largest number of pathways is the most effective, regardless of the number of risk genes detected. Table 4 shows the names of the risk pathways in the different datasets that were predicted to be significant towards cancerous mutation. The detected risk pathways are then used in the prediction of significant genes.
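
The P-value criterion can be illustrated with a small filter. This is a sketch under stated assumptions: the function name `risk_pathways`, the input layout and every P-value below are hypothetical and chosen only to show the 0.05 threshold at work, and the pathway and gene names are borrowed from the text for readability.

```python
def risk_pathways(pathway_p_values, alpha=0.05):
    """Keep pathways that contain at least one gene with P-value < alpha.

    pathway_p_values maps a pathway name to a dict {gene name: P-value}.
    Returns a dict {pathway name: sorted list of its significant genes}.
    """
    selected = {}
    for name, p_values in pathway_p_values.items():
        significant = sorted(g for g, p in p_values.items() if p < alpha)
        if significant:
            selected[name] = significant
    return selected


# Hypothetical P-values, for illustration only
EXAMPLE = {
    "Focal adhesion": {"ITGAL": 0.01, "RhoA": 0.20},
    "Tight junction": {"Rap1": 0.30, "EPAC": 0.08},
    "Endocytosis": {"Vav": 0.04, "Pyk2": 0.03},
}
selected = risk_pathways(EXAMPLE)
print(selected)
```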
Table 4

Names of risk pathways predicted by sDRW.

Dataset  Restart probability  Risk pathways predicted by sDRW
Lung     0.1                  Endocytosis; Tight junction; Focal adhesion
Lung     0.2                  Pancreatic secretion; Regulation of actin cytoskeleton
Lung     0.3                  Focal adhesion
Lung     0.4                  ECM-receptor interaction
Lung     0.5                  Leukocyte transendothelial migration; ECM-receptor interaction
Lung     0.6                  Focal adhesion
Lung     0.7                  Focal adhesion
Lung     0.8                  Pancreatic secretion; Focal adhesion
Lung     0.9                  ECM-receptor interaction
Stomach  0.1                  TGF-beta signaling pathway
Stomach  0.2                  Hedgehog signaling pathway; Notch signaling pathway
Stomach  0.3                  Wnt signaling pathway; Notch signaling pathway
Stomach  0.4                  Hedgehog signaling pathway; TGF-beta signaling pathway
Stomach  0.5                  Notch signaling pathway; TGF-beta signaling pathway
Stomach  0.6                  Regulation of actin cytoskeleton
Stomach  0.7                  Hedgehog signaling pathway
Stomach  0.8                  Alanine, aspartate and glutamate metabolism; Shigellosis; TGF-beta signaling pathway
Stomach  0.9                  TGF-beta signaling pathway
Liver    0.1                  Sphingolipid metabolism
Liver    0.2                  Focal adhesion; Tight junction
Liver    0.3                  Tight junction
Liver    0.4                  Sphingolipid metabolism; Glycerolipid metabolism; Lysosome
Liver    0.5                  Bacterial invasion of epithelial cells
Liver    0.6                  Glycerolipid metabolism; Bacterial invasion of epithelial cells
Liver    0.7                  Focal adhesion; Glycerolipid metabolism
Liver    0.8                  Glycerolipid metabolism
Liver    0.9                  Sphingolipid metabolism
Thyroid  0.1                  Tight junction
Thyroid  0.2                  Cell adhesion molecules (CAMs); Fatty acid metabolism
Thyroid  0.3                  Tight junction; Cell adhesion molecules (CAMs)
Thyroid  0.4                  Fatty acid metabolism; Fc gamma R-mediated phagocytosis
Thyroid  0.5                  Regulation of actin cytoskeleton; Wnt signaling pathway; Fc gamma R-mediated phagocytosis; Fatty acid metabolism
Thyroid  0.6                  Wnt signaling pathway; Cell adhesion molecules (CAMs)
Thyroid  0.7                  Fatty acid metabolism
Thyroid  0.8                  MAPK signaling pathway; Fatty acid metabolism
Thyroid  0.9                  Focal adhesion
Kidney   0.1                  Endocytosis; Regulation of actin cytoskeleton
Kidney   0.2                  Regulation of actin cytoskeleton
Kidney   0.3                  Calcium signaling pathway; Phosphatidylinositol signaling system
Kidney   0.4                  Endocytosis
Kidney   0.5                  Phosphatidylinositol signaling system; Regulation of actin cytoskeleton
Kidney   0.6                  Protein processing in endoplasmic reticulum; PPAR signaling pathway; Regulation of actin cytoskeleton
Kidney   0.7                  Endocytosis; Regulation of actin cytoskeleton
Kidney   0.8                  PPAR signaling pathway
Kidney   0.9                  Calcium signaling pathway
Breast   0.1                  Neuroactive ligand-receptor interaction
Breast   0.2                  Glycerophospholipid metabolism
Breast   0.3                  Neuroactive ligand-receptor interaction
Breast   0.4                  Adipocytokine signaling pathway; Fatty acid metabolism; Jak-STAT signaling pathway
Breast   0.5                  Cytokine-cytokine receptor interaction; Fatty acid metabolism
Breast   0.6                  Jak-STAT signaling pathway
Breast   0.7                  Neuroactive ligand-receptor interaction
Breast   0.8                  Chemokine signaling pathway
Breast   0.9                  Adipocytokine signaling pathway; Glycerophospholipid metabolism
sDRW was developed based on DRW. The two are compared to evaluate the performance and effectiveness of sDRW in increasing the sensitivity of prediction and the accuracy of binary classification on gene expression datasets. Six input datasets were applied to sDRW and DRW. Firstly, the risk pathways predicted by both algorithms are presented in terms of the names and number of detected pathways. Table 5 compares the names of the risk pathways predicted by sDRW and DRW. The experiment was run nine times for each of the six datasets, once per restart probability. Different numbers of risk pathways were predicted, and some restart probabilities identified more risk pathways than others. Table 6 compares the number of risk pathways predicted by sDRW and DRW, together with the corresponding difference in number. The table clearly shows the improvement of sDRW, which predicts more risk pathways.
Table 5

Names of risk pathways predicted by sDRW and DRW.

| Dataset | Restart probability | Significant Directed Random Walk, sDRW | Directed Random Walk, DRW |
|---|---|---|---|
| Lung | 0.1 | Endocytosis, Tight junction, Focal adhesion | Tight junction |
| Lung | 0.2 | Pancreatic secretion, Regulation of actin cytoskeleton | ECM-receptor interaction |
| Lung | 0.3 | Focal adhesion | ECM-receptor interaction |
| Lung | 0.4 | ECM-receptor interaction | ECM-receptor interaction, Focal adhesion |
| Lung | 0.5 | Leukocyte transendothelial migration, ECM-receptor interaction | ECM-receptor interaction, Focal adhesion |
| Lung | 0.6 | Focal adhesion | Leukocyte transendothelial migration |
| Lung | 0.7 | Focal adhesion | Focal adhesion |
| Lung | 0.8 | Pancreatic secretion, Focal adhesion | Focal adhesion |
| Lung | 0.9 | ECM-receptor interaction | Pancreatic secretion |
| Stomach | 0.1 | TGF-beta signaling pathway | TGF-beta signaling pathway |
| Stomach | 0.2 | Hedgehog signaling pathway, Notch signaling pathway | Hedgehog signaling pathway |
| Stomach | 0.3 | Wnt signaling pathway, Notch signaling pathway | Wnt signaling pathway |
| Stomach | 0.4 | Hedgehog signaling pathway, TGF-beta signaling pathway | Hedgehog signaling pathway, TGF-beta signaling pathway |
| Stomach | 0.5 | Notch signaling pathway, TGF-beta signaling pathway | Notch signaling pathway |
| Stomach | 0.6 | Regulation of actin cytoskeleton | Regulation of actin cytoskeleton |
| Stomach | 0.7 | Hedgehog signaling pathway | Hedgehog signaling pathway |
| Stomach | 0.8 | Alanine, aspartate and glutamate metabolism; Shigellosis; TGF-beta signaling pathway | Alanine, aspartate and glutamate metabolism; Shigellosis |
| Stomach | 0.9 | TGF-beta signaling pathway | TGF-beta signaling pathway |
| Liver | 0.1 | Sphingolipid metabolism | Sphingolipid metabolism |
| Liver | 0.2 | Focal adhesion, Tight junction | Focal adhesion, Sphingolipid metabolism |
| Liver | 0.3 | Tight junction | Sphingolipid metabolism |
| Liver | 0.4 | Sphingolipid metabolism, Glycerolipid metabolism, Lysosome | Sphingolipid metabolism, Tight junction |
| Liver | 0.5 | Bacterial invasion of epithelial cells | Bacterial invasion of epithelial cells |
| Liver | 0.6 | Glycerolipid metabolism, Bacterial invasion of epithelial cells | Glycerolipid metabolism |
| Liver | 0.7 | Focal adhesion, Glycerolipid metabolism | Focal adhesion |
| Liver | 0.8 | Glycerolipid metabolism | Sphingolipid metabolism |
| Liver | 0.9 | Sphingolipid metabolism | Glycerolipid metabolism |
| Thyroid | 0.1 | Tight junction | Cell adhesion molecules (CAMs) |
| Thyroid | 0.2 | Cell adhesion molecules (CAMs), Fatty acid metabolism | Cell adhesion molecules (CAMs) |
| Thyroid | 0.3 | Tight junction, Cell adhesion molecules (CAMs) | Tight junction, Cell adhesion molecules (CAMs) |
| Thyroid | 0.4 | Fatty acid metabolism, Fc gamma R-mediated phagocytosis | Tight junction, Fc gamma R-mediated phagocytosis |
| Thyroid | 0.5 | Regulation of actin cytoskeleton, Wnt signaling pathway, Fc gamma R-mediated phagocytosis, Fatty acid metabolism | Regulation of actin cytoskeleton, Fc gamma R-mediated phagocytosis |
| Thyroid | 0.6 | Wnt signaling pathway, Cell adhesion molecules (CAMs) | Wnt signaling pathway, Cell adhesion molecules (CAMs) |
| Thyroid | 0.7 | Fatty acid metabolism | Fatty acid metabolism |
| Thyroid | 0.8 | MAPK signaling pathway, Fatty acid metabolism | MAPK signaling pathway |
| Thyroid | 0.9 | Focal adhesion | Focal adhesion |
| Kidney | 0.1 | Endocytosis, Regulation of actin cytoskeleton | Regulation of actin cytoskeleton |
| Kidney | 0.2 | Regulation of actin cytoskeleton | Regulation of actin cytoskeleton |
| Kidney | 0.3 | Calcium signaling pathway, Phosphatidylinositol signaling system | Regulation of actin cytoskeleton, Phosphatidylinositol signaling system |
| Kidney | 0.4 | Endocytosis | Endocytosis |
| Kidney | 0.5 | Phosphatidylinositol signaling system, Regulation of actin cytoskeleton | Phosphatidylinositol signaling system, Regulation of actin cytoskeleton |
| Kidney | 0.6 | Protein processing in endoplasmic reticulum, PPAR signaling pathway, Regulation of actin cytoskeleton | Regulation of actin cytoskeleton |
| Kidney | 0.7 | Endocytosis, Regulation of actin cytoskeleton | Endocytosis, Regulation of actin cytoskeleton |
| Kidney | 0.8 | PPAR signaling pathway | PPAR signaling pathway |
| Kidney | 0.9 | Calcium signaling pathway | Endocytosis, Regulation of actin cytoskeleton |
| Breast | 0.1 | Neuroactive ligand-receptor interaction | Adipocytokine signaling pathway |
| Breast | 0.2 | Glycerophospholipid metabolism | Glycerophospholipid metabolism |
| Breast | 0.3 | Neuroactive ligand-receptor interaction | Neuroactive ligand-receptor interaction |
| Breast | 0.4 | Adipocytokine signaling pathway, Fatty acid metabolism, Jak-STAT signaling pathway | Fatty acid metabolism |
| Breast | 0.5 | Cytokine-cytokine receptor interaction, Fatty acid metabolism | Cytokine-cytokine receptor interaction |
| Breast | 0.6 | Jak-STAT signaling pathway | Jak-STAT signaling pathway |
| Breast | 0.7 | Neuroactive ligand-receptor interaction | Adipocytokine signaling pathway, Neuroactive ligand-receptor interaction |
| Breast | 0.8 | Chemokine signaling pathway | Chemokine signaling pathway |
| Breast | 0.9 | Adipocytokine signaling pathway, Glycerophospholipid metabolism | Adipocytokine signaling pathway, Fatty acid metabolism |

Table 6

Number of risk pathways detected by sDRW and DRW.

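| Dataset | Method | r = 0.1 | r = 0.2 | r = 0.3 | r = 0.4 | r = 0.5 | r = 0.6 | r = 0.7 | r = 0.8 | r = 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Lung, GSE10072 | sDRW | 3 | 2 | 1 | 1 | 2 | 1 | 1 | 2 | 1 |
| Lung, GSE10072 | DRW | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 |
| Lung, GSE10072 | Detected extra pathways | 2 | 1 | 0 | −1 | 0 | 0 | 0 | 1 | 0 |
| Stomach, GSE13911 | sDRW | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 3 | 1 |
| Stomach, GSE13911 | DRW | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 |
| Stomach, GSE13911 | Detected extra pathways | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| Liver, GSE17856 | sDRW | 1 | 2 | 1 | 3 | 1 | 2 | 2 | 1 | 1 |
| Liver, GSE17856 | DRW | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 |
| Liver, GSE17856 | Detected extra pathways | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| Thyroid, GSE5364 | sDRW | 1 | 2 | 2 | 2 | 4 | 2 | 1 | 2 | 1 |
| Thyroid, GSE5364 | DRW | 1 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 1 |
| Thyroid, GSE5364 | Detected extra pathways | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 1 | 0 |
| Kidney, GSE17895 | sDRW | 2 | 1 | 2 | 1 | 2 | 3 | 2 | 1 | 1 |
| Kidney, GSE17895 | DRW | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 |
| Kidney, GSE17895 | Detected extra pathways | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | −1 |
| Breast, GSE1456 | sDRW | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
| Breast, GSE1456 | DRW | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 |
| Breast, GSE1456 | Detected extra pathways | 0 | 0 | 0 | 2 | 1 | 0 | −1 | 0 | 0 |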

*The bold r is the optimum restart probability for sDRW.

Fig. 10 presents the number of risk pathways detected by sDRW and DRW for the six cancer datasets. For the lung cancer dataset, sDRW predicts its highest number of risk pathways, 3, at a restart probability of 0.1. For the stomach cancer dataset, sDRW predicts its highest number of risk pathways, 3, at a restart probability of 0.8. For the liver cancer dataset, the highest number is 3, at a restart probability of 0.4. For the thyroid cancer dataset, the highest number is 4, at a restart probability of 0.5. For the kidney cancer dataset, the highest number is 3, at a restart probability of 0.6. For the breast cancer dataset, the highest number is 3, at a restart probability of 0.4.
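
The selection described above can be illustrated with a minimal sketch that uses the lung (GSE10072) counts from Table 6. The function names `extra_pathways` and `optimum_restart` are hypothetical, and breaking ties by taking the first maximum is an assumption, since the text does not state how ties are resolved.

```python
# Restart probabilities tested in the experiment.
RESTART_PROBS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Number of risk pathways detected on the lung dataset (Table 6).
LUNG_SDRW = [3, 2, 1, 1, 2, 1, 1, 2, 1]
LUNG_DRW = [1, 1, 1, 2, 2, 1, 1, 1, 1]


def extra_pathways(sdrw_counts, drw_counts):
    """Per-probability difference: sDRW pathway count minus DRW pathway count."""
    return [s - d for s, d in zip(sdrw_counts, drw_counts)]


def optimum_restart(probs, counts):
    """Restart probability at which the most pathways are detected.

    Assumption: ties are broken by taking the first maximum.
    """
    return probs[counts.index(max(counts))]


print(extra_pathways(LUNG_SDRW, LUNG_DRW))        # [2, 1, 0, -1, 0, 0, 0, 1, 0]
print(optimum_restart(RESTART_PROBS, LUNG_SDRW))  # 0.1
```

The first printed list reproduces the "Detected extra pathways" row for lung, and the second value matches the optimum restart probability reported for that dataset.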
Fig. 10

Comparison of number of detected risk pathways between sDRW and DRW in six different cancer datasets.


Cancer classification

Binary classification was used to classify the genes of the input datasets as cancerous or normal (Gao et al., 2009). In this experiment, each input dataset was divided into a training set and a test set using 5-fold cross validation: four-fifths of the samples were used as the training set and the remaining one-fifth as the test set. The training set was further split into three equal-sized subsets in order to select the best pathway marker set. Two of the three subsets were used as marker evaluation subsets to build the classifier and rank the pathway markers, while the remaining subset was used as the feature selection dataset for assessing which pathway marker set produced the best classification performance. To build the classifier, t-test statistics of the pathway activities were calculated on the two marker evaluation subsets, and the pathways were ranked by P-value in increasing order. Out of 300 pathways, the 50 top-ranked pathways were selected as candidate features for a logistic regression model. Pathways were added sequentially to train the logistic regression model, and the performance of the classifier was measured by the area under the receiver operating characteristic curve (AUC) on the feature selection dataset [39]. The marker evaluation subsets were rotated, and a pathway was kept in the feature set only if adding it increased the AUC and the resulting AUC exceeded 0.9. This process was repeated for the top 50 pathway markers in order to optimize the classifier and obtain the best feature set. After optimization, the test set was used to evaluate the classifier built from the pathway markers in the selected feature set. Table 7 shows the AUC of each dataset at the different restart probabilities.
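
The partitioning and sequential selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `partition`, `auc`, `forward_select` and `score_fn` are hypothetical, the partition neither shuffles nor stratifies the samples, and training the logistic regression model is hidden behind the caller-supplied `score_fn`, which is assumed to return the AUC of a model trained on the given pathways.

```python
from typing import Callable, List, Sequence, Tuple


def partition(sample_ids: Sequence, fold: int, n_folds: int = 5,
              n_subsets: int = 3) -> Tuple[list, List[list]]:
    """One cross-validation fold: a test set (1/5 of the samples) and the
    training set (4/5) split into three roughly equal subsets."""
    test = [s for i, s in enumerate(sample_ids) if i % n_folds == fold]
    train = [s for i, s in enumerate(sample_ids) if i % n_folds != fold]
    subsets = [train[j::n_subsets] for j in range(n_subsets)]
    return test, subsets


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive sample scores higher; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def forward_select(ranked: Sequence[str],
                   score_fn: Callable[[List[str]], float],
                   min_auc: float = 0.9, top_k: int = 50) -> List[str]:
    """Walk down the P-value ranking and keep a pathway only if adding it
    raises the AUC and the new AUC exceeds min_auc."""
    selected: List[str] = []
    best = 0.0
    for pathway in ranked[:top_k]:
        candidate = selected + [pathway]
        score = score_fn(candidate)
        if score > best and score > min_auc:
            selected, best = candidate, score
    return selected


# Toy stand-in for "train logistic regression, return AUC" (made-up values).
TOY_AUC = {("A",): 0.85, ("B",): 0.92, ("B", "C"): 0.91, ("B", "D"): 0.95}


def toy_score(pathways: List[str]) -> float:
    return TOY_AUC.get(tuple(pathways), 0.0)


print(forward_select(["A", "B", "C", "D"], toy_score))  # ['B', 'D']
```

In the toy run, pathway A is rejected because its AUC stays below 0.9, and C is rejected because adding it lowers the AUC, so only B and D remain in the feature set.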
Table 7

AUC of each dataset for restart probabilities from 0.1 to 0.9 in sDRW.

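| Dataset | r = 0.1 | r = 0.2 | r = 0.3 | r = 0.4 | r = 0.5 | r = 0.6 | r = 0.7 | r = 0.8 | r = 0.9 |
|---|---|---|---|---|---|---|---|---|---|
| Lung | 0.9676 | 0.9702 | 0.9818 | 0.9764 | 0.9819 | 0.9877 | 0.9877 | 0.9582 | 0.9871 |
| Stomach | 0.9472 | 0.9749 | 0.9362 | 0.8935 | 0.9356 | 0.9642 | 0.9215 | 0.9784 | 0.95478 |
| Liver | 0.9469 | 0.9844 | 0.9427 | 0.9629 | 0.9428 | 0.9525 | 0.9635 | 0.9836 | 0.9684 |
| Thyroid | 0.9426 | 0.9579 | 0.9869 | 0.9258 | 0.9538 | 0.9125 | 0.9312 | 0.9216 | 0.9528 |
| Kidney | 0.9615 | 0.9472 | 0.9637 | 0.9578 | 0.9472 | 0.9478 | 0.9573 | 0.9268 | 0.9637 |
| Breast | 0.8493 | 0.7042 | 0.7296 | 0.9508 | 0.8941 | 0.8251 | 0.8466 | 0.9943 | 0.9467 |
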
In addition, Table 8 compares the number of cancerous genes detected by sDRW and DRW. sDRW detected at least as many significant genes as DRW for most restart probabilities, which indicates that sDRW is more sensitive in gene prediction. Table 9 compares the AUC after classification between sDRW and DRW. sDRW achieves a higher AUC than DRW for most restart probabilities, which indicates that sDRW classifies cancer more accurately.
Table 8

Number of cancerous genes detected by sDRW and DRW.

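| Dataset | Method | r = 0.1 | r = 0.2 | r = 0.3 | r = 0.4 | r = 0.5 | r = 0.6 | r = 0.7 | r = 0.8 | r = 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Lung, GSE10072 | sDRW | 268 | 160 | 118 | 49 | 112 | 118 | 118 | 118 | 49 |
| Lung, GSE10072 | DRW | 63 | 49 | 49 | 167 | 167 | 63 | 118 | 118 | 45 |
| Lung, GSE10072 | Percentage increment, % | 325.3968 | 226.5306 | 140.8163 | −70.6597 | −32.9341 | 87.3016 | 0 | 0 | 8.8889 |
| Stomach, GSE13911 | sDRW | 41 | 53 | 109 | 65 | 70 | 108 | 24 | 89 | 41 |
| Stomach, GSE13911 | DRW | 41 | 24 | 80 | 65 | 29 | 108 | 24 | 48 | 41 |
| Stomach, GSE13911 | Percentage increment, % | 0 | 120.8333 | 36.25 | 0 | 141.3793 | 0 | 0 | 85.4167 | 0 |
| Liver, GSE17856 | sDRW | 21 | 170 | 61 | 73 | 40 | 67 | 136 | 109 | 21 |
| Liver, GSE17856 | DRW | 21 | 130 | 21 | 82 | 40 | 27 | 109 | 21 | 109 |
| Liver, GSE17856 | Percentage increment, % | 0 | 30.7692 | 190.4762 | −10.9756 | 0 | 148.1481 | 24.7706 | 4.1905 | −80.7339 |
| Thyroid, GSE5364 | sDRW | 23 | 29 | 39 | 33 | 98 | 52 | 13 | 76 | 51 |
| Thyroid, GSE5364 | DRW | 16 | 16 | 39 | 43 | 9 | 52 | 13 | 63 | 51 |
| Thyroid, GSE5364 | Percentage increment, % | 43.75 | 81.25 | 0 | −23.2558 | 988.8889 | 0 | 0 | 20.6349 | 0 |
| Kidney, GSE17895 | sDRW | 73 | 39 | 175 | 34 | 53 | 94 | 73 | 19 | 161 |
| Kidney, GSE17895 | DRW | 39 | 39 | 53 | 34 | 53 | 39 | 73 | 19 | 73 |
| Kidney, GSE17895 | Percentage increment, % | 87.1795 | 0 | 230.1887 | 0 | 0 | 141.0256 | 0 | 0 | 120.5479 |
| Breast, GSE1456 | sDRW | 19 | 12 | 19 | 44 | 35 | 21 | 19 | 23 | 26 |
| Breast, GSE1456 | DRW | 14 | 12 | 19 | 9 | 26 | 21 | 33 | 23 | 23 |
| Breast, GSE1456 | Percentage increment, % | 35.7143 | 0 | 0 | 388.8889 | 34.6138 | 0 | −42.4242 | 0 | 13.0435 |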

*The bold r is the optimum restart probability for sDRW.
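
The percentage increment row appears consistent with the relative difference (sDRW − DRW) / DRW × 100. A minimal sketch that checks this against the thyroid (GSE5364) row is shown below; the function name `increment_pct` and the rounding to four decimals are assumptions made for illustration.

```python
def increment_pct(sdrw_genes, drw_genes):
    """Relative increase of the sDRW gene count over the DRW gene count, in percent."""
    return round((sdrw_genes - drw_genes) / drw_genes * 100, 4)


# Gene counts for the thyroid dataset, restart probabilities 0.1 to 0.9 (Table 8).
THYROID_SDRW = [23, 29, 39, 33, 98, 52, 13, 76, 51]
THYROID_DRW = [16, 16, 39, 43, 9, 52, 13, 63, 51]

increments = [increment_pct(s, d) for s, d in zip(THYROID_SDRW, THYROID_DRW)]
print(increments)
# [43.75, 81.25, 0.0, -23.2558, 988.8889, 0.0, 0.0, 20.6349, 0.0]
```

A negative value means that DRW detected more genes than sDRW at that restart probability.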

Table 9

Comparison of AUC between sDRW and DRW.

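| Dataset | Method | r = 0.1 | r = 0.2 | r = 0.3 | r = 0.4 | r = 0.5 | r = 0.6 | r = 0.7 | r = 0.8 | r = 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Lung, GSE10072 | sDRW | 0.9676 | 0.9702 | 0.9818 | 0.9764 | 0.9819 | 0.9877 | 0.9877 | 0.9582 | 0.9871 |
| Lung, GSE10072 | DRW | 0.9636 | 0.9761 | 0.9699 | 0.976 | 0.963 | 0.9817 | 0.9764 | 0.9764 | 0.9816 |
| Stomach, GSE13911 | sDRW | 0.9472 | 0.9749 | 0.9362 | 0.8935 | 0.9356 | 0.9642 | 0.9215 | 0.9784 | 0.95478 |
| Stomach, GSE13911 | DRW | 0.9362 | 0.9235 | 0.9424 | 0.9531 | 0.9235 | 0.9642 | 0.9148 | 0.9548 | 0.9642 |
| Liver, GSE17856 | sDRW | 0.9469 | 0.9844 | 0.9427 | 0.9629 | 0.9428 | 0.9525 | 0.9635 | 0.9836 | 0.9684 |
| Liver, GSE17856 | DRW | 0.9225 | 0.9528 | 0.9483 | 0.9468 | 0.9241 | 0.9216 | 0.9574 | 0.9748 | 0.9425 |
| Thyroid, GSE5364 | sDRW | 0.9426 | 0.9579 | 0.9869 | 0.9258 | 0.9538 | 0.9125 | 0.9312 | 0.9216 | 0.9528 |
| Thyroid, GSE5364 | DRW | 0.9461 | 0.9472 | 0.9572 | 0.9462 | 0.9136 | 0.8467 | 0.9318 | 0.9127 | 0.9424 |
| Kidney, GSE17895 | sDRW | 0.9615 | 0.9472 | 0.9637 | 0.9578 | 0.9472 | 0.9478 | 0.9573 | 0.9268 | 0.9637 |
| Kidney, GSE17895 | DRW | 0.9437 | 0.9426 | 0.9259 | 0.9471 | 0.9421 | 0.9431 | 0.9841 | 0.9144 | 0.9258 |
| Breast, GSE1456 | sDRW | 0.8493 | 0.7042 | 0.7296 | 0.9508 | 0.8941 | 0.8251 | 0.8466 | 0.9943 | 0.9467 |
| Breast, GSE1456 | DRW | 0.6379 | 0.7821 | 0.6872 | 0.9496 | 0.9135 | 0.7258 | 0.5984 | 0.9546 | 0.9268 |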

*The bold r is the optimum restart probability for sDRW.

Fig. 11 shows the number of cancerous genes detected by sDRW and DRW. For each dataset, the optimum restart probability is the one at which the highest number of risk pathways is detected. For the lung cancer dataset, the optimum restart probability is 0.1, at which sDRW detects 268 cancerous genes, its highest count. At the same restart probability, DRW detects 63 cancerous genes, 205 fewer than sDRW. For the stomach cancer dataset, the optimum restart probability is 0.8, at which sDRW detects 89 cancerous genes. At the same restart probability, DRW detects 48 cancerous genes, 41 fewer than sDRW. Although sDRW detects more genes at restart probability 0.3 than at any other restart probability, only 2 pathways are detected there, so 0.3 is not chosen as the optimum. For the liver cancer dataset, the optimum restart probability is 0.4, at which sDRW detects 73 cancerous genes, 9 fewer than the 82 detected by DRW. Restart probability 0.2 yields more genes but only two pathways; restart probability 0.4 is preferred because it detects more pathways despite the smaller number of genes. For the thyroid cancer dataset, the optimum restart probability is 0.5, at which sDRW detects 98 cancerous genes, its highest count. At the same restart probability, DRW detects 9 cancerous genes, 89 fewer than sDRW. For the kidney cancer dataset, the optimum restart probability is 0.6, at which sDRW detects 94 cancerous genes. At the same restart probability, DRW detects 39 cancerous genes, 55 fewer than sDRW. For the breast cancer dataset, the optimum restart probability is 0.4, at which sDRW detects 44 cancerous genes, its highest count. At the same restart probability, DRW detects 9 cancerous genes, 35 fewer than sDRW.
Fig. 11

Comparison of the number of detected significant genes between sDRW and DRW in six different cancer datasets.

From the experiments, we conclude that sDRW is less effective on the liver cancer dataset, where it detects 9 fewer genes than DRW at the optimum restart probability. Overall, sDRW is more effective in improving the sensitivity of risk gene prediction.

Conclusion

In this paper, we proposed a significant directed random walk (sDRW) approach, based on tuning parameter selection and weight as a parameter, for cancer classification using gene expression datasets. The approach studies the relationship between gene expression data and cancerous genes. The main objective of this paper is to evaluate the effectiveness and performance of the proposed approach against directed random walk (DRW). The two algorithms are compared in terms of the sensitivity of cancer prediction and the accuracy of cancer classification. The experimental results show that the proposed approach has higher sensitivity in predicting cancerous genes and classifies cancer more accurately. First, tuning parameter selection is used to identify the optimum restart probability for the corresponding dataset by testing all nine restart probabilities. The optimum restart probability is then chosen as the one that detects the most pathways. This is because only a complete biological pathway will generate protein, and with more biological pathways, more genes can be detected. Next, weights among genes are added to the pathway while the walker traverses the directed graph for cancer prediction. The connectivity among genes plays an important role in determining the vector that directs the walker along the pathway. Finally, five-fold cross validation is used to train the classifier and classify the significant genes detected by sDRW. The results demonstrate that the proposed approach is more effective, and feasible, for cancer classification compared with directed random walk.
  22 in total

1.  Topologically biased random walk and community finding in networks.

Authors:  Vinko Zlatić; Andrea Gabrielli; Guido Caldarelli
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2010-12-08

2.  A group-specific tuning parameter for hybrid of SVM and SCAD in identification of informative genes and pathways.

Authors:  Muhammad Faiz Misman; Mohd Saberi Mohamad; Safaai Deris; Siti Zaiton Mohd Hashim
Journal:  Int J Data Min Bioinform       Date:  2014       Impact factor: 0.667

3.  Identification of significant features in DNA microarray data.

Authors:  Eric Bair
Journal:  Wiley Interdiscip Rev Comput Stat       Date:  2013-07

4.  Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts.

Authors:  Yudi Pawitan; Judith Bjöhle; Lukas Amler; Anna-Lena Borg; Suzanne Egyhazi; Per Hall; Xia Han; Lars Holmberg; Fei Huang; Sigrid Klaar; Edison T Liu; Lance Miller; Hans Nordgren; Alexander Ploner; Kerstin Sandelin; Peter M Shaw; Johanna Smeds; Lambert Skoog; Sara Wedrén; Jonas Bergh
Journal:  Breast Cancer Res       Date:  2005-10-03       Impact factor: 6.466

5.  Integrative genomic and transcriptomic characterization of matched primary and metastatic liver and colorectal carcinoma.

Authors:  Stephanie Roessler; Guoling Lin; Marshonna Forgues; Anuradha Budhu; Shelley Hoover; R Mark Simpson; Xiaolin Wu; Ping He; Lun-Xiu Qin; Zhao-You Tang; Qing-Hai Ye; Xin Wei Wang
Journal:  Int J Biol Sci       Date:  2015-01-01       Impact factor: 6.580

6.  A time-varying biased random walk approach to human growth.

Authors:  Béla Suki; Urs Frey
Journal:  Sci Rep       Date:  2017-08-10       Impact factor: 4.379

7.  SubpathwayMiner: a software package for flexible identification of pathways.

Authors:  Chunquan Li; Xia Li; Yingbo Miao; Qianghu Wang; Wei Jiang; Chun Xu; Jing Li; Junwei Han; Fan Zhang; Binsheng Gong; Liangde Xu
Journal:  Nucleic Acids Res       Date:  2009-08-25       Impact factor: 16.971

8.  Systematic sequencing of renal carcinoma reveals inactivation of histone modifying genes.

Authors:  Gillian L Dalgliesh; Kyle Furge; Chris Greenman; Lina Chen; Graham Bignell; Adam Butler; Helen Davies; Sarah Edkins; Claire Hardy; Calli Latimer; Jon Teague; Jenny Andrews; Syd Barthorpe; Dave Beare; Gemma Buck; Peter J Campbell; Simon Forbes; Mingming Jia; David Jones; Henry Knott; Chai Yin Kok; King Wai Lau; Catherine Leroy; Meng-Lay Lin; David J McBride; Mark Maddison; Simon Maguire; Kirsten McLay; Andrew Menzies; Tatiana Mironenko; Lee Mulderrig; Laura Mudie; Sarah O'Meara; Erin Pleasance; Arjunan Rajasingham; Rebecca Shepherd; Raffaella Smith; Lucy Stebbings; Philip Stephens; Gurpreet Tang; Patrick S Tarpey; Kelly Turrell; Karl J Dykema; Sok Kean Khoo; David Petillo; Bill Wondergem; John Anema; Richard J Kahnoski; Bin Tean Teh; Michael R Stratton; P Andrew Futreal
Journal:  Nature       Date:  2010-01-06       Impact factor: 49.962

9.  Disease gene identification by random walk on multigraphs merging heterogeneous genomic and phenotype data.

Authors:  Yongjin Li; Jinyan Li
Journal:  BMC Genomics       Date:  2012-12-13       Impact factor: 3.969

10.  Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival.

Authors:  Maria Teresa Landi; Tatiana Dracheva; Melissa Rotunno; Jonine D Figueroa; Huaitian Liu; Abhijit Dasgupta; Felecia E Mann; Junya Fukuoka; Megan Hames; Andrew W Bergen; Sharon E Murphy; Ping Yang; Angela C Pesatori; Dario Consonni; Pier Alberto Bertazzi; Sholom Wacholder; Joanna H Shih; Neil E Caporaso; Jin Jen
Journal:  PLoS One       Date:  2008-02-20       Impact factor: 3.240
