Literature DB >> 28243332

Network Biomarkers Constructed from Gene Expression and Protein-Protein Interaction Data for Accurate Prediction of Leukemia.

Xuye Yuan¹, Jiajia Chen², Yuxin Lin¹, Yin Li¹, Lihua Xu³, Luonan Chen⁴, Haiying Hua⁵, Bairong Shen¹.

Abstract

Leukemia is a leading cause of cancer deaths in the developed countries. Great efforts have been undertaken in search of diagnostic biomarkers of leukemia. However, leukemia is highly complex and heterogeneous, involving interaction among multiple molecular components. Individual molecules are not necessarily sensitive diagnostic indicators. Network biomarkers are considered to outperform individual molecules in disease characterization. We applied an integrative approach that identifies active network modules as putative biomarkers for leukemia diagnosis. We first reconstructed the leukemia-specific PPI network using protein-protein interactions from the Protein Interaction Network Analysis (PINA) and protein annotations from GeneGo. The network was further integrated with gene expression profiles to identify active modules with leukemia relevance. Finally, the candidate network-based biomarker was evaluated for the diagnosing performance. A network of 97 genes and 400 interactions was identified for accurate diagnosis of leukemia. Functional enrichment analysis revealed that the network biomarkers were enriched in pathways in cancer. The network biomarkers could discriminate leukemia samples from the normal controls more effectively than the known biomarkers. The network biomarkers provide a useful tool to diagnose leukemia and also aids in further understanding the molecular basis of leukemia.

Entities: Chemical Disease Gene Mutation Species

Keywords: integrative analysis; leukemia.; network biomarker

Year: 2017 PMID： 28243332 PMCID： PMC5327377 DOI： 10.7150/jca.17302

Source DB: PubMed Journal: J Cancer ISSN： 1837-9664 Impact factor: 4.207

Introduction

Leukemia is a prevalent hematologic malignancy and one of the most common causes of cancer deaths in the developed countries1, 2. The overall incidence of leukemia is 14 per 100000 people in the United States in 2015 and is projected to continue rising. Based on the origin, leukemia can be classified into myeloid leukemia or lymphoid leukemia, which can be subdivided into acute or chronic according to the degree of cellular differentiation3, 4. Many of the symptoms of leukemia are non-specific and vague, which could not be diagnosed by conventional blood tests and bone marrow examination5, 6. Plenty of efforts have been devoted to investigate the molecular alterations in leukemogenesis. Next generation sequencing of human genomes and exomes has revealed somatic mutations, aberrantly expressed genes, microRNAs and DNA methylations with putative roles in leukemia7-9. However, most of the individual molecules suffer from low reproducibility and high false-positive rates. Few of them have been translated to the clinic for diagnostic application. It is well recognized that cancer is a complex disease caused not by the malfunction of single molecules but their collective behavior in the network 10-15. Therefore, network biomarkers are considered to better characterize leukemia than individual molecules and have recently attracted much attention. A number of protein interaction sub-networks have been proposed for early diagnosis, prognosis and efficacy prediction of cancers16-19. In this study, we proposed a framework (Figure 1) that integrates protein-protein interaction (PPI) data and microarray-based gene expression profiles to construct network biomarkers for accurate prediction of leukemia. The network biomarkers prove to be effective in distinguishing leukemia from normal samples.

Figure 1

The flowchart of network biomarkers identification for leukemia diagnosis.

Materials and Methods

Data collection

We used two different types of datasets, protein-protein interaction data and disease annotation of the protein-coding genes to reconstruct the leukemia-specific PPI network. PPI data was extracted from the Protein Interaction Network Analysis (PINA) v2.0 platform 20. PINA is a unified database of protein-protein interaction that collects 14454 genes and 108470 interactions from six manually curated public databases (listed in Table 1). The leukemia-associated genes were extracted from the commercial knowledge database MetacoreTM, which is developed by GeneGo.

Table 1

Source databases of PINA.

Original database	Version	Ref.	Link
IntAc	Oct 4,2012	[21]	http://www.ebi.ac.uk/intact/
BioGRID	3.1.93	[22]	http://thebiogrid.org/
MINT	Dec 21,2010	[23]	http://mint.bio.uniroma2.it/mint/Welcome.do
DIP	June 14,2010	[24]	http://dip.doe-mbi.ucla.edu/dip/Stat.cgi
HPRD	April 13,2010	[25]	http://www.hprd.org/download
MIPS/Mpact	Oct 1,2008	[26]	http://mips.helmholtz-muenchen.de/

The public gene expression data were downloaded from the Gene Expression Omnibus (GEO) database. All the gene expression data were obtained using Affymetrix Human Genome arrays. The samples in each GEO datasets are divided into three categories: Leukemia (including AML, CLL, T-PLL and B-CLL), others and Normal. The others samples are filtered out in this study since they are not associated with leukemia. Detailed information for GEO datasets is summarized in Table 2. The six groups of expression datasets were analyzed to get statistics values. Additional three sets of expression datasets were used for further verification (Table 3).

Table 2

Leukemia-associated gene expression datasets used for analysis.

Series	Platform	No. Samples	Leukemia				Others	Normal	Ref.
Series	Platform	No. Samples	AML	CLL	T-PLL	B-CLL	Others	Normal	Ref.
GSE9476	GPL96	64	26					38	[27]
GSE6691	GPL96	56		11			32	13	[28]
GSE5788	GPL96	14			6			8	[29]
GSE22529	GPL96	52		41				11	[30]
GSE26725	GPL570	17				12		5	[31]
GSE23293	GPL570	41		7			18	16	[32]

Table 3

Leukemia-associated gene expression datasets used for validation.

Series	Platform	No. of samples	CML	CLL	Normal	Ref.
GSE8835	GPL96	66		42	24	[33]
GSE24739	GPL570	24	16		8	[34]
GSE39411	GPL570	152		104	48	[35]

Reconstruction of leukemia-specific PPI network

Human leukemia-specific protein-protein interaction network was first downloaded from PINA and then refined with the 1495 leukemia-associated gene from GeneGo. Only the interactions formed between leukemia-associated genes were selected to form a leukemia-specific PPI network.

Integration with gene expressing profiles

The statistical analysis was invoked through the limma (Linear Models for Microarray Data) R package 36 and the affy(Methods for Affymetrix Oligonucleotide Arrays) R package in R software platform 37. Student t-test was used to identify the significant difference level (P-value) of each considerable gene in each dataset. To enhance the accuracy, we applied the empirical Bayesian statistical method to moderate the standard errors and then utilized the method proposed by Benjamini et al. to adjust the multi-testing 38, and got the adjusted P-values simultaneously. To integrate the gene expression data and leukemia-specific PPI network, the adjusted P-value of each gene was mapped onto its corresponding gene in the leukemia-specific PPI network to obtain a dataset-specific weighted PPI network, with adjusted P-value as weight factor.

Active module subtraction

In general, the network integration analysis was performed in 3 steps. At the first step, we converted the adjusted P-values to Z score through using the inverse normal cumulative distribution. Higher Z score indicates more important role in leukemogenesis. Given the Z score, we performed a greedy search to identify the modules with a locally maximal Z score. The candidate modules were seeded with a single gene and then a neighbor within a distanced=3 from the seed were iteratively added. If the neighbor added to the Z score, it was incorporated into the module. The search terminated when no addition increased the Z score over the improvement rate r. The parameter r was set as 0.05 to avoid over fitting. At last the top 10 modules with the highest Z-score identified from each run were merged and iteratively searched for 3-5 times, until the module reached the optimal size of 70-80 nodes. We used jActiveModules 39 to select active modules from the weighted PPI network since it is a fashionable method for this kind of investigation. jAM is a plug-in of Cytoscape which evaluated module activity with Z score.

Network-based biomarkers construction

At last, as 6 optimized modules include 290 genes in total, which are too large and loosely interconnected for further analysis, we carried out the overlapping analysis to find out the number of enriched genes shared by each optimized modules. We overlapped the six modules and selected the genes shared by at least two networks to construct the final network-based biomarker.

Pathway enrichment analysis

We performed pathway enrichment analysis in Ingenuity Pathway Analysis (IPA) and Kyoto Encyclopedia of Genes and Genomes (KEGG) 40to provide functional insight into the identified network marker. The statistical significance of the enrichment was calculated using hypergeometric test and adjusted by FDR method (P-value < 0.05).

Statistical significance assessment

Hypergeometric test was used to test whether the network biomarkers are significantly enriched with leukemia-related genes. Known mutation genes related to cancers were obtained from the Catalogue of Somatic Mutations in Cancer (COSMIC), which is a cancer gene census 41. 213 of the COSMIC genes are found in common with the GeneGO database. An empirical P-value was calculated to evaluate the statistical significance. P-value was obtained according to the following equation: Where, N represents the number of genes in the leukemia-specific PPI network; M is the number of known leukemia related genes in COSMIC; n denotes the number of genes in the final network biomarkers; k represents the known leukemia related genes in the final network biomarkers.

Performance evaluation

We employed the receiver-operating characteristic (ROC) analysis to evaluate the prediction performance of the network biomarkers in distinguishing leukemia samples from the normal controls. The epicalc R package (http://CRAN.R-project.org/package=epicalc) was used to produce the ROC curves. A 5-fold cross validation was performed on three gene expression dataset listed in Table 3. Normal samples were set as 0 and cancer samples were set as 1. The classification performance was represented as the area under curve (AUC). We also provided sensitivity, specificity and accuracy for the network biomarkers.

Results and Discussion

Sub-network involved in leukemogenesis

The leukemia-specific PPI network was reconstructed by integrating PPI from PINA and 1495 leukemia-associated genes from GeneGo. As a result, the leukemia-specific PPI network consists of 4136 interactions among 978 genes. As is described in Methods section, gene expression profiles of 6 independent GEO datasets were overlaid to the reconstructed leukemia-specific PPI network and 6 correspondent dataset-specific sub-networks were obtained (marked orderly as PPI_GSE9476_raw, PPI_GSE6691_raw, PPI_GSE9476_raw, PPI_GSE22529_raw, PPI_GSE23293_raw and PPI_GSE26725_raw in Figure 1). Greedy search was performed for 6 sub-networks respectively. After 3 iterations for GSE5788, GSE9476 and GSE22529, 4 iterations for GSE23293 and GSE26725, 5 iterations for GSE6691, finally we obtained 6 active modules with a locally maximal Z score by jActiveModules (marked orderly as PPI_GSE9476_TR, PPI_GSE6691_ TR, PPI_GSE9476_TR, PPI_GSE22529_TR, PPI_GSE23293_TR and PPI_GSE26725_TR in Figure 1). The number of nodes and edges in each module is summarized in Table 4.

Table 4

Detailed information of the active modules.

	PPI_GSE5788_TR	PPI_GSE6691_ TR	PPI_GSE9476_ TR	PPI_GSE22529_ TR	PPI_GSE23293_ TR	PPI_GSE23293_ TR
Nodes	77	71	75	73	71	75
Edges	205	186	166	193	126	188

After overlap analysis, a total of 97 genes along with their interactions were incorporated into the final network-based biomarkers, as illustrated in Figure 2. Genes with previous evidence in leukemia are marked yellow.

Figure 2

The final network-based biomarker for leukemia. The known cancer related genes in final network are marked yellow.

Functional analysis of candidate network biomarkers

The network biomarkers were most enriched for molecular mechanisms of cancer (IPA) and pathways in cancer (KEGG). Leukemia-specific pathways such as Chronic Myeloid Leukemia (KEGG) and Acute Myeloid Leukemia Signaling (both IPA and KEGG) were also enriched and showed high statistical significance. It indicates that genes in the biomarker network are closely associated with the development of different types of leukemia. Besides, in He's study, P13K/AKT Signaling (IPA) was also proved to be involved in chronic myeloid leukemia42. Irwin et al. found that ErbB inhibitors played important roles in Philadelphia chromosome-positive acute lymphoblastic leukemia (Ph(+)ALL) and ErbB signaling (KEGG) was a complementary molecular target in Ph(+)ALL 43. The top-ranked pathways in both IPA and KEGG displayed apparent correlation between leukemia and the network biomarkers, which implied the potential accuracy of our result. Figure 3 shows the top 10 most significantly enriched IPA and KEGG pathway respectively.

Figure 3

IPA and KEGG pathway enrichment analysis for network biomarkers. The top 10 most significantly enriched IPA and KEGG pathway are shown in panel (A) and (B) respectively.

We used the Database for Annotation, Visualization, and Integrated Discovery (DAVID) 44 for the Gene ontology (GO) annotation in three domains: molecular function, biological process, and cellular component. The top 10 most significantly enriched items for each domain are shown in Figure 4. These results indicate that genes in the network are closely associated with the biological processes in the development of different types of leukemia, such as cell death 45 and apoptosis 46. This indicated the accuracy of the predicted network biomarkers to a certain extent.

Figure 4

Gene ontology annotation for the network biomarkers. The network biomarkers identified by our method were annotated with DAVID tools at three levels of gene ontology: Molecular Function, Biological Process, and Cellular Component. The top 10 most significantly enriched items for each level are shown.

Network biomarkers are significantly associated with leukemia

We further investigated whether the genes in the network biomarkers were randomly obtained. The statistical significance was checked using hypergeometric test and a significant p-value of 0.008987933 was obtained. This indicates that the candidate network biomarkers are enriched with known leukemia-related genes and could not be obtained randomly. As illustrated in Figure 5(A), the blue circle represents the 978 genes in the leukemia-specific PPI network; the red circle includes the 522 known leukemia-related genes in COSMIC. The leukemia-specific PPI contains 195 known leukemia-related genes in COSMIC. The purple circle represents 97 genes in final network biomarkers, among which 29 genes belong to the known leukemia-related category.

Figure 5

Validation of the network biomarkers. (A) Distribution of the leukemia-associated genes in the network. (B-D) ROC curves obtained with the network biomarkers tested by three gene expression datasets. Panel (B), (C) and (D) represent respectively the results of the gene expression datasets in series of GSE8835, GSE24739 and GSE39411.

Sub-network marker with higher classification accuracy

To evaluate the performance of network biomarkers in classifying leukemia and normal gene expression profiles, we used three independent gene expression datasets listed in Table 3 as tested datasets to produce the ROC curves. We compared the network biomarker with three reported gene biomarkers: CD3847, BCL2 48 and IGFBP7 49. The reasons we chose these three markers for comparison are as follows, 1) these biomarkers are all well-studied and all of them have been validated by clinical experiments. 2) The marker CD38 is a member of our network whereas the remaining two are not. We included two others for fair evaluating the performance of our network biomarker. Figure 5 shows the ROC curves for network biomarkers and 3 known biomarkers. Network-based biomarker has higher AUC than any of the single markers which means network-based biomarker could more effectively discriminate the leukemia from the normal controls. The sensitivity, specificity and accuracy of each dataset are given in Table 5.

Table 5

Detailed information of ROC curves.

Series	Biomarker	Sensitivity	Precision	Specificity	Accuracy	AUC
GSE8835	CD38	0.913	0.700	0.212	0.658	0.629
	BCL2	0.885	0.650	0.166	0.623	0.604
	IGFBP7	0.965	0.665	0.148	0.668	0.584
	Network biomarkers	0.851	0.686	0.316	0.657	0.698
GSE24739	CD38	0.893	0.662	0.088	0.625	0.431
	BCL2	0.943	0.654	0.004	0.630	0.237
	IGFBP7	0.938	0.652	0.001	0.625	0.197
	Network biomarkers	0.874	0.986	0.976	0.908	0.966
GSE39411	CD38	0.999	0.725	0.177	0.740	0.981
	BCL2	0.915	0.931	0.853	0.895	0.966
	IGFBP7	0.886	0.797	0.513	0.768	0.880
	Network biomarkers	0.996	0.999	0.998	0.997	0.999

It is worth noting that for network biomarkers from GSE8835 has a relatively lower AUC than the other two datasets. This may be caused by the difference of platform and method between GSE8835 and the training datasets. Anyhow, the accuracy of the network biomarkers is still higher than other three single biomarkers. The result indicates that the putative network biomarkers could diagnose leukemia samples more accurately and could be used as putative biomarker to aid in early diagnosis of leukemia.

Conclusions

In conclusion, we developed a network approach for molecular investigation and diagnosis of leukemia. The constructed network biomarkers not only achieve higher accuracy rate of diagnosis compared to known single biomarkers but also provide systematic insights into the leukemogenesis process. We noticed that we only considered the combination of genes (or the nodes) in the network for the prediction of leukemia. The interactions among genes can also provide valuable biological signatures for diagnosis of diseases. We will take the edge-variation in the network into the account for the further improving of the leukemia prediction.

Availability of Data and Materials

PINA: http://cbg.garvan.unsw.edu.au/pina/interactome.stat.do. GeneGo: https://portal.genego.com/cgi/data_manager.cgi. GSE9476: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9476. GSE6691: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6691. GSE5788: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5578. GSE22529: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22529. GSE26725: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26725. GSE23293: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE23293. GSE8835: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8835. GSE24739: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24739. GSE39411: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39411. COSMIC: http://cancer.sanger.ac.uk/cosmic/download IPA: http://www.ingenuity.com/products/login DAVID: https://david.ncifcrf.gov/.

48 in total

1. DIP: The Database of Interacting Proteins: 2001 update.

Authors: I Xenarios; E Fernandez; L Salwinski; X J Duan; M J Thompson; E M Marcotte; D Eisenberg
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

3. Lack of a relationship between the common 8q24 variant rs6983267 and risk of chronic lymphocytic leukemia.

Authors: G S Sellick; P Broderick; S Fielding; D Catovsky; R S Houlston
Journal: Leukemia Date: 2007-08-23 Impact factor: 11.528

Review 4. The prognostic and functional role of microRNAs in acute myeloid leukemia.

Authors: Guido Marcucci; Krzysztof Mrózek; Michael D Radmacher; Ramiro Garzon; Clara D Bloomfield
Journal: Blood Date: 2010-11-02 Impact factor: 22.113

5. Identifying novel prostate cancer associated pathways based on integrative microarray data analysis.

Authors: Ying Wang; Jiajia Chen; Qinghui Li; Haiyun Wang; Ganqiang Liu; Qing Jing; Bairong Shen
Journal: Comput Biol Chem Date: 2011-04-27 Impact factor: 2.877

6. Reverse-engineering the genetic circuitry of a cancer cell with predicted intervention in chronic lymphocytic leukemia.

Authors: Laurent Vallat; Corey A Kemper; Nicolas Jung; Myriam Maumy-Bertrand; Frédéric Bertrand; Nicolas Meyer; Arnaud Pocheville; John W Fisher; John G Gribben; Seiamak Bahram
Journal: Proc Natl Acad Sci U S A Date: 2012-12-24 Impact factor: 11.205

7. LEF-1 is a prosurvival factor in chronic lymphocytic leukemia and is expressed in the preleukemic state of monoclonal B-cell lymphocytosis.

Authors: Albert Gutierrez; Renee C Tschumper; Xiaosheng Wu; Tait D Shanafelt; Jeanette Eckel-Passow; Paul M Huddleston; Susan L Slager; Neil E Kay; Diane F Jelinek
Journal: Blood Date: 2010-07-01 Impact factor: 22.113

8. Role of clinical bioinformatics in the development of network-based Biomarkers.

Authors: Xiangdong Wang
Journal: J Clin Bioinforma Date: 2011-10-24

9. Discovery and characterization of long intergenic non-coding RNAs (lincRNA) module biomarkers in prostate cancer: an integrative analysis of RNA-Seq data.

Authors: Weirong Cui; Yulan Qian; Xiaoke Zhou; Yuxin Lin; Junfeng Jiang; Jiajia Chen; Zhongming Zhao; Bairong Shen
Journal: BMC Genomics Date: 2015-06-11 Impact factor: 3.969

3. Potential Applications of DNA, RNA and Protein Biomarkers in Diagnosis, Therapy and Prognosis for Colorectal Cancer: A Study from Databases to AI-Assisted Verification.

Authors: Xueli Zhang; Xiao-Feng Sun; Bairong Shen; Hong Zhang
Journal: Cancers (Basel) Date: 2019-02-01 Impact factor: 6.639

4. Towards integrated oncogenic marker recognition through mutual information-based statistically significant feature extraction: an association rule mining based study on cancer expression and methylation profiles.

Authors: Saurav Mallik; Zhongming Zhao
Journal: Quant Biol Date: 2017-11-23

Review 5. Single-cell network biology for resolving cellular heterogeneity in human diseases.

Authors: Junha Cha; Insuk Lee
Journal: Exp Mol Med Date: 2020-11-26 Impact factor: 8.718

5 in total