Identifying new targets in leukemogenesis using computational approaches.

Archana Jayaraman¹, Kaiser Jamil², Haseeb A Khan³.

Abstract

There is a need to identify novel targets in Acute Lymphoblastic Leukemia (ALL), a hematopoietic cancer affecting children, both to improve our understanding of disease biology and to support the development of new therapeutics. The aim of our study was therefore to find new genes as targets using in silico methods. We retrieved the top 10% overexpressed genes from the Oncomine public microarray expression database and short-listed 530 overexpressed genes. Using the online prioritization tools ENDEAVOUR, DIR and ToppGene, we found fifty-four genes common to the three tools; these formed our candidate leukemogenic genes. As per the protocol, we selected thirty training genes from PubMed. The prioritized and training genes were then used to construct a STRING functional association network, which was further analyzed using the cytoHubba hub analysis tool to look for new genes that could serve as drug targets in leukemia. Analysis of this network identified two hub genes, SMAD2 and CDK9, which had not previously been implicated in leukemogenesis. Among the several hundred genes in the network we also found MEN1, HDAC1 and LCK, which re-emphasizes the important role of these genes in leukemogenesis. This is the first report on these five additional signature genes in leukemogenesis. We propose these as new targets for developing novel therapeutics and also as biomarkers in leukemogenesis, which could be important for prognosis and diagnosis.

Keywords: Acute Lymphoblastic Leukemia (ALL); Gene prioritization; Microarray analysis; Protein interaction network; Therapeutic targets

Year: 2015 PMID: 26288567 PMCID: PMC4537869 DOI: 10.1016/j.sjbs.2015.01.012

Source DB: PubMed Journal: Saudi J Biol Sci ISSN: 1319-562X Impact factor: 4.219

Introduction

A key focus of cancer research is the identification of driver genes in the tumorigenesis pathway as tumor specific signature genes, for use as drug targets or biomarkers, which could be possible from microarray databases (Ma et al., 2013). The recent advancement in bioinformatics techniques has made it possible to search for therapeutic targets for specific diseases in a systematic and comprehensive manner (Desany and Zhang, 2004). Acute Lymphoblastic Leukemia (ALL) is a blood cancer that targets B and T-lymphocyte cells, affecting their differentiation and leading to the loss of regulation of cell division (Khalid et al., 2010). Even with numerous advances in therapeutic efficacy, 20–40% of patients still relapse, especially children and young adults (Smith et al., 2010). Research studies have implicated alterations in several pathways that mediate crucial biological processes to play a role in disease progression and particularly in relapse (Bhojwani et al., 2006; Pui et al., 2011). These studies suggest that an interconnected network of many genes and their products are altered in carcinogenesis and may contribute to leukemia pathogenesis (Bhojwani et al., 2006; Pui et al., 2011). A study by Kang et al. (2012) reported a correlation between event free survival and expression levels of NEGR1, IRX2, EPS8 and TPD52. Lin et al. (2012) reported that point mutations in NOTCH1 led to increased expression of this gene which might contribute to pathogenesis in T-ALL. In recent years, meta-analysis studies have led to the identification of novel genetic markers that might play crucial roles in the neoplastic process and in other diseases, as demonstrated through our previous studies (Khan and Jamil, 2008; Shaik et al., 2009; Jamil and Sabeena, 2011). Understanding the evolutionary relationship of these genes could also help to investigate the mechanisms of neoplastic transformation observed in leukemic cells (Jayaraman et al., 2011; Jayaraman and Jamil, 2012). 
Further, our previous studies using bioinformatics approaches have helped in highlighting the significance of protein networks in ALL (Jayaraman and Jamil, 2013) and identified important amino acid residues that may be useful in therapeutic targeting of cell cycle proteins (Jayaraman and Jamil, 2014). In recent years, several research studies have applied a systems biology approach to understand ALL leukemogenesis. Maiorov et al. (2013) identified a set of non-differential putative biomarkers in T-ALL based on network analysis of expression data. Gao et al. (2014) analyzed differentially expressed genes, screened for prognostic genes and identified latent pathway genes. Their analysis identified HK3 and PTGS2, two key metabolic pathway genes as possible prognostic genes in pediatric ALL. Chaiboonchoe et al. (2014) used an integrated bioinformatics approach to identify glucocorticoid regulated genes in Childhood ALL. Many studies have shown that various bioinformatics and computational biology approaches, such as PseKNC (Chen et al., 2014) or Chou’s PseAAC (Chou, 2001), can be successfully used to identify modifications in the genome such as recombination spots of DNA (Chen et al., 2013), various PTM (posttranslational modification) sites (Xu et al., 2014), anticancer peptides (Hajisharifi et al., 2014), interactions between drugs and target proteins in cellular networking (Xiao et al., 2014), providing very useful information and insights for both basic research and drug development, and hence are widely welcomed by the scientific community, both experimental and theoretical. Here, we have used computational approaches to identify new targets in leukemogenesis in the hope to provide useful information for stimulating the development of new and effective drugs to treat leukemia. 
Understanding the interactions of disease genes is essential, as dynamic networking of genes could be correlated with clinical informatics, including therapeutic and imaging profiles and other parameters, and this correlation could help in a better understanding of the disease in relation to each patient (Wang, 2011). To meet such challenges, our objective was to retrieve overexpressed genes from the Oncomine expression database (Rhodes et al., 2007) and to perform gene prioritization analysis using bioinformatics software. Further, we analyzed the protein interactions of the prioritized proteins, as studies investigating protein–protein interactions have provided key insights into the biological functioning of many proteins and have also been effective in identifying novel genes that play a role in the pathogenesis of various diseases such as cancer (Huang et al., 2011; Li et al., 2012). Our hypothesis is that genes contributing to leukemogenesis may have been missed in the large amount of data generated by previous expression studies because of the varied detection methods used. Hence, our research combines expression data, gene prioritization analysis and a network-based approach to identify genes of significance in ALL. The use of well-validated datasets, a prioritization-based approach and rigorous network analysis suggests that the results from our study may be replicable in vivo as well. Combining these bioinformatics approaches strengthens the validity of our results and has led to the identification of a few novel genes in this study.

Materials and methods

An overview of the analysis workflow for the study is represented in Fig. 1.

Figure 1

Scheme showing overview of the methodology followed in the study.

Microarray expression data analysis using Oncomine database

In the current study we queried the Oncomine database to obtain only those datasets which have reported differentially expressed genes between normal and leukemic tissues. Oncomine database 3.0 (Rhodes et al., 2007) is a comprehensive cancer microarray expression database with expression data from experimental studies. We found three studies – Maia et al. (2005) (one dataset; B-ALL; 20 patients), Andersson et al. (2007) (one dataset; Childhood B- (87 patients) and T-ALL (11 patients)) and Haferlach et al. (2010) (two datasets; Childhood and Adult B- (933 patients) and T-ALL (253 patients)), in Oncomine database which reported differential expression in ALL and we selected only the top 10% overexpressed genes from all these studies. Further, we compared these studies using the default database threshold values of odds ratio above 2.0, P-value of 1E-4. The genes short-listed from this analysis, designated as candidates, were used as input for gene prioritization.
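As a rough illustration of the selection step, the sketch below ranks genes by a score and keeps the top 10%. The gene names, the scores and the helper `top_fraction` are hypothetical; Oncomine performs its own ranking, and its export format is not reproduced here.

```python
import math

def top_fraction(scores, fraction=0.10):
    """Return the genes in the top `fraction` of a {gene: score} mapping.

    Assumption of this sketch: a higher score means stronger overexpression.
    """
    n_keep = math.ceil(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]

# Hypothetical toy data: 20 genes with made-up overexpression scores.
toy_scores = {f"GENE{i}": float(i) for i in range(1, 21)}
top_genes = top_fraction(toy_scores)  # keeps the 2 best-scoring genes
```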

Gene prioritization using ENDEAVOUR, DIR and ToppGene tools

Three software tools, ENDEAVOUR (Tranchevent et al., 2008), ToppGene (Chen et al., 2009) and DIR (Chen et al., 2011), were used to evaluate whether the genes obtained through Oncomine dataset analysis could play a role in the disease process. The three tools perform prioritization based on sources such as disease information, pathway information, phenotype and regulatory modules. Each tool produces a ranked list of genes, which it validates statistically before outputting a final ranked list.

Training genes

The gene prioritization tools additionally require a set of training genes. These were obtained from published literature reporting their alteration in the disease process. PubMed was queried using the keywords “overexpression” and “acute lymphoblastic leukemia” for all studies pertaining to humans (study period from 1992 to March 2012). Analysis of the retrieved literature revealed 30 genes (Table 1) that were reported through experimental studies to be significantly overexpressed in ALL and to contribute to leukemogenesis.

Table 1

List of ALL specific genes used as training genes for prioritization.

S.No.	Gene name	Reference	S.No.	Gene name	Reference
1.	NOTCH1	Lin et al. (2012)	16.	SCGF	Bhojwani et al. (2006)
2.	CRLF2	Tasian and Loh (2011)	17.	AML1	Mikhail et al. (2002)
3.	NOTCH3	Palermo et al. (2012)	18.	CD49f	DiGiuseppe et al. (2009)
4.	LEF1	Kühnl et al. (2011)	19.	Aven	Choi et al. (2006)
5.	USP44	Zhang et al. (2011)	20.	BCL2	Coustan-Smith et al. (1996)
6.	MYC	Cardone et al. (2005)	21.	ABCB1	Baudis et al. (2006)
7.	Survivin	Esh et al. (2011)	22.	Livin	Choi et al. (2007)
8.	WT1	Shabani et al. (2008)	23.	MK	Hidaka et al. (2007)
9.	hCLP46	Wang et al. (2010)	24.	TNF-R1	Holleman et al. (2006)
10.	MDM2	Hendy et al. (2009)	25.	TRAIL-R2	Holleman et al. (2006)
11.	CDX2	Thoene et al. (2009)	26.	TRAIL-R4	Holleman et al. (2006)
12.	EPOR	Inthal et al. (2008)	27.	BCL2L13	Holleman et al. (2006)
13.	MsrB2	Cabreiro et al. (2008)	28.	Ikaros 6	Ruiz et al. (2004)
14.	ROR1	Shaheen and Ibrahim (2012)	29.	XIAP	Hundsdoerfer et al. (2010)
15.	ABL1	Chiaretti et al. (2007)	30.	HOX11	Ferrando and Look (2003)

Further, to ensure that the results obtained through the use of the ALL specific training genes were not random, we also performed a separate gene prioritization analysis using a set of housekeeping genes as training genes (Table 2). Thirty housekeeping genes, having the highest expression in bone marrow tissues, were retrieved from the study by Chang et al. (2011) (Table 2). Of the ranked list of genes obtained from each analysis, the top 100 were considered to be significant and were compared to short-list the prioritized genes common to at least two tools. These common prioritized genes along with the training genes were used to construct protein interaction network.

Table 2

List of training housekeeping genes with higher expression in bone marrow tissue (Chang et al., 2011).

S.No.	Gene name	S.No.	Gene name
1.	ACTB	16.	RPS10
2.	B2M	17.	RPS11
3.	EEF1A1	18.	RPS12
4.	HBB	19.	RPS14
5.	RPL13A	20.	RPS15
6.	RPL23A	21.	RPS17
7.	RPL27A	22.	RPS18
8.	RPL3	23.	RPS23
9.	RPL30	24.	RPS27
10.	RPL41	25.	RPS29
11.	RPL7A	26.	RPS3A
12.	RPL9	27.	RPS6
13.	RPL32	28.	TPT1
14.	RPLP0	29.	UBB
15.	RPLP1	30.	UBC
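The comparison of top-ranked lists described before Table 2 can be sketched as a set intersection. This is a minimal illustration under stated assumptions: the ranked lists and the helper `common_top_ranked` are hypothetical, the toy cutoff is 3 rather than 100, and the sketch keeps only genes shared by every tool, as reported in the Results.

```python
def common_top_ranked(rankings, top_n=100):
    """Intersect the top `top_n` entries of several ranked gene lists."""
    top_sets = [set(ranked[:top_n]) for ranked in rankings.values()]
    return set.intersection(*top_sets)

# Hypothetical ranked outputs (best first); real lists would hold all candidates.
toy_rankings = {
    "ENDEAVOUR": ["G1", "G2", "G3", "G4", "G5"],
    "ToppGene":  ["G2", "G1", "G5", "G3", "G4"],
    "DIR":       ["G3", "G2", "G1", "G4", "G5"],
}
shared = common_top_ranked(toy_rankings, top_n=3)
```

With these toy lists only `G1` and `G2` appear in the top three of all three tools.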

Construction of Protein–Protein Interaction (PPI) network

We used STRING database v9 (Search Tool for the Retrieval of Interacting Genes, available at: http://string-db.org/ (Szklarczyk et al., 2011)) with the protein names of the prioritized and training genes as seeds to construct the protein–protein interaction (PPI) network. We selected the interactions pertaining to Homo sapiens and grew the interaction network to obtain an additional 230 protein interactors, using the “add more interactors” option in the STRING database, and further refined it to include only those interactions with a confidence score greater than 0.9. To simplify the dense network and to obtain a better understanding of it, we clustered the interactors using the STRING k-Means clustering algorithm (MacQueen, 1967). We specified the number of clusters to be 12, based on the rule of thumb k = √(n/2) (Mardia et al., 1979), where n is the number of nodes (protein interactors) in the network. The resulting clusters were separated manually for better visual representation and comprehension of the interaction network. Further, to study the functional significance of these genes, we used the WebGestalt server.
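The cluster count can be checked with a short calculation. The sketch below assumes n = 313, the number of interacting proteins reported in the Results, and assumes the value is rounded down; the rounding convention is not stated in the text, and `rule_of_thumb_k` is a hypothetical helper name.

```python
import math

def rule_of_thumb_k(n_nodes):
    """Rule-of-thumb cluster count k = sqrt(n / 2), rounded down (assumed)."""
    return math.floor(math.sqrt(n_nodes / 2))

# sqrt(313 / 2) is roughly 12.5, which rounds down to 12 clusters.
k = rule_of_thumb_k(313)
```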

Gene Ontology (GO) WebGestalt server for functional enrichment

We used WebGestalt, WEB-based GEne SeT AnaLysis Toolkit (available at http://bioinfo.vanderbilt.edu/webgestalt/) (Zhang et al., 2005) for GO enrichment analysis of the common prioritized genes and the interactors in the protein network. The statistical significance of the enrichment analysis was checked by choosing Hypergeometric test and Benjamini–Hochberg false discovery rate (FDR) correction model for multiple test adjustment, which were available in the software. We also set the threshold to the default settings to include a minimum of two genes per category and a P-value cut-off of 0.05 to obtain significant enrichment. Further, we have used KOBAS (KEGG Orthology Based Annotation System, available at http://kobas.cbi.pku.edu.cn/home.do) (Xie et al., 2011) to perform KEGG database based enrichment analysis of the prioritized genes obtained using housekeeping and ALL specific training genes. The default parameter of Hypergeometric test/Fisher’s exact test was selected as the statistical method and Benjamini–Hochberg was used as the FDR correction method.
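For readers unfamiliar with the two statistical steps named above, the following standard-library sketch shows a one-sided hypergeometric enrichment test and the Benjamini–Hochberg adjustment. It is an illustration of the general techniques, not the internal code of WebGestalt or KOBAS; the function names and the toy numbers are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k category genes in a list of n,
    drawn from N background genes of which K belong to the category."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy example: both genes of a 2-gene list fall in a 2-gene category (N = 4).
p = hypergeom_pvalue(k=2, n=2, K=2, N=4)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A category would then be retained when it contains at least two genes and its adjusted value is below the 0.05 cut-off described above.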

Analysis of the topology of network interaction data and Hub proteins

The PPI network was downloaded in Proteomics Standards Initiative (PSI) format and imported into the network visualization software Cytoscape (Smoot et al., 2011). The topological parameters of the network, i.e. node degree distribution and clustering coefficient, were analyzed using the Network Analyzer plugin (Assenov et al., 2008). These parameters are a measure of the importance of the nodes in the network and their ability to form clusters (Barabási and Oltvai, 2004). Information about hub genes was obtained through the cytoHubba plugin (Lin et al., 2008) with the option “confidence value” set as the edge attribute and degree and betweenness as the node ranking methods. We set each of the ranking methods to output the top 50 hub-forming genes/proteins as a measure of significance. The genes obtained from the two ranking methods were compared and those common to both were considered to be significant.
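The two topological parameters mentioned above can be illustrated on a small undirected graph. This is a minimal sketch using adjacency sets; the toy network and function names are hypothetical, and the Network Analyzer plugin may compute or average these quantities differently.

```python
from itertools import combinations

def degree(adj):
    """Number of neighbours of each node."""
    return {node: len(neigh) for node, neigh in adj.items()}

def clustering_coefficient(adj, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neigh = adj[node]
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy network: A is linked to B, C and D; B and C are linked.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
deg = degree(toy)
cc_a = clustering_coefficient(toy, "A")  # one of three neighbour pairs is linked
```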

Results

Candidate gene selection and data analysis

We could identify and shortlist candidate genes which were overexpressed in the Oncomine microarray database. We found about 23 datasets of Acute Lymphoblastic Leukemia (ALL) in the database, of which only 3 studies reported differential analysis between cancer and normal tissues. All the datasets present in Oncomine represented statistically validated information via analysis performed by Oncomine using t-tests and validated by the database using false discovery rate test, prior to incorporation into the database. Dataset of Maia et al. (2005) comprised of about 12,624 measured genes with 627 genes among the top 10% overexpressed genes. The dataset of Andersson et al. (2007) comprised of about 10,735 measured genes with 1072 genes in B-ALL and 1071 genes in T-ALL among the top 10% overexpressed genes. Haferlach et al. (2010) dataset-1 comprised of 19,574 measured genes, and dataset-2 of about 910 genes and 1957 genes each in B-ALL and T-ALL samples were among the top 10% overexpressed genes. On comparing all the overexpressed signature genes across these datasets, we found 237 genes in B-ALL and 422 genes in T-ALL common to two out of three studies in B-ALL and present in both the studies used for T-ALL analysis. Since our aim was to determine the alterations in ALL genes as a whole, we combined both the B- and T-ALL genes and after removal of duplicates we obtained 573 genes. Of these, 530 genes, designated as candidates, mapped to ENSEMBL ids, were short-listed for prioritization.
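The cross-study comparison can be sketched with simple set operations: keep genes seen in at least a given number of studies for each subtype, then merge the subtypes so that shared genes are counted once. The gene labels and the helper `in_at_least` are hypothetical and stand in for the real Oncomine gene lists.

```python
from collections import Counter

def in_at_least(gene_sets, min_count=2):
    """Genes that occur in at least `min_count` of the given sets."""
    counts = Counter(g for s in gene_sets for g in set(s))
    return {g for g, c in counts.items() if c >= min_count}

# Hypothetical B-ALL lists from three studies (kept if in two or more).
b_all = in_at_least([{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}], min_count=2)
# Hypothetical T-ALL lists from two studies (kept if in both).
t_all = in_at_least([{"C", "F", "G"}, {"C", "F"}], min_count=2)
# Union removes duplicates shared by the two subtypes.
combined = b_all | t_all
```

In the toy data the two subtype lists hold four entries in total but only three unique genes, in the same way that the 237 B-ALL and 422 T-ALL genes reduce to 573 after duplicate removal.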

Gene prioritization and Gene Ontology (GO) functional enrichment of overexpressed candidates using training genes

The selected and shortlisted genes from Oncomine were prioritized, using the software ENDEAVOUR, ToppGene and DIR. Each of the gene prioritization algorithms used in this study ranked the candidate genes according to their significance and the results were presented as a tabulated list. On comparison of the top 100 ranked results from the three tools, we found that 54 genes (referred to as ALL prioritized genes, Table 3) were common to the prediction methods and hence may play an important role in ALL. Of these 54 genes, 30 were found to be overexpressed in T-ALL, 13 in B-ALL and 11 in both subtypes.

Table 3

Prioritized candidate genes common to the ENDEAVOUR, DIR and ToppGene tools.

T-ALL only (n = 30)		B-ALL only (n = 13)	Both B-ALL and T-ALL (n = 11)
ABI2	KHDRBS1	BCR	CDK6
ADA	LCK	BLNK	CSNK1E
AOF2	MAP4K1	CDK9	DVL2
BMI1	MEN1	CHD4	GNPTAB
CD3D	MLL	ETS2	MYB
CD3E	NPM1	INSR	NONO
CD81	PTMA	MEF2C	SET
CTCF	SMAD2	NR3C1	TCF3
DNTT	SMO	NRIP1	SPTBN1
FGFR1	TCEA2	PARP1	TP53BP1
FUBP1	TCF7	PHB	YY1
GATA3	TFDP2	PMAIP1
HDAC1	TRRAP	SOX4
HNRNPR	WHSC1
ILF3	ZAP70

Cross-verification analysis using housekeeping training genes (Chang et al., 2011) (Table 2) resulted in the short-listing of 77 genes that were common to the three tools used (referred to as housekeeping prioritized genes). Comparison with the ALL prioritized genes showed that, although some of the prioritized genes obtained through ALL specific training genes also occur in the results from housekeeping training genes, their priority ranking differed vastly between the two analyses. Functional enrichment of the prioritized genes was analyzed with WebGestalt, which showed that the 54 ALL prioritized genes were highly enriched in a diverse array of biological processes such as hemopoiesis (adjP = 1.91e−08), regulation of cell proliferation (adjP = 7.63e−08), chromatin modification (adjP = 6.72e−08), regulation of transcription (adjP = 1.17e−08) and regulation of biosynthetic processes (adjP = 3.17e−08) (Fig. 2). AdjP values are P values obtained after multiple test adjustment using the Benjamini–Hochberg false discovery rate correction.

Figure 2

Directed acyclic graph showing Gene Ontology of biological processes of the 54 prioritized genes (graph obtained from WebGestalt server).

KEGG Pathway enrichment of the 54 ALL prioritized genes, using KOBAS server, showed significant enrichment in normal and disease pathways such as primary immunodeficiency (corrected P-value = 0.000672), Transcriptional misregulation in cancer (corrected P-value = 0.000672), Adherens junction (corrected P-value = 0.044118), Pathways in cancer (corrected P-value = 0.044118), NF-kappa B signaling pathway (corrected P-value = 0.066325), and T cell receptor signaling pathway (corrected P-value = 0.085695). Further, we observed that the 77 housekeeping prioritized genes were enriched in the KEGG pathways: Ribosome (corrected P-value = 0.002727), Spliceosome (corrected P-value = 0.010555), Glycolysis/Gluconeogenesis (corrected P-value = 0.197154), Biosynthesis of amino acids (corrected P-value = 0.220077). The differences in functional enrichment between ALL prioritized and housekeeping prioritized genes suggest that the genes prioritized from the ALL specific training genes may be significant in leukemogenesis as their enrichment analysis is populated by pathways that are known to be deregulated in ALL.

Protein–Protein Interaction (PPI) network

After identification of prioritized genes, we investigated protein associations using the STRING database. The PPI network using the 54 prioritized and 30 ALL specific training genes as query (seed), formed a dense network with 313 interacting proteins and 2405 interactions (Fig. 3), after removal of disconnected nodes. On grouping the network, the members within and between each cluster were observed to be highly interconnected, reflecting a high degree of functional association and suggesting interplay between the myriad pathways that comprise the protein network (Fig. 4).

Figure 3

Protein interaction network generated by the STRING database using prioritized and training protein names as query.

Figure 4

STRING Protein–Protein Interaction network, separated into 12 k-Means clusters with clusters containing LCK, MEN1, SMAD2, HDAC1, CDK9 specifically highlighted.

Functional enrichment analysis of PPI network

Functional enrichment of the network interactors through the WebGestalt server, using the KEGG Pathway analysis filter, revealed that they participate in a wide variety of processes and pathways such as Cell cycle (adjP = 1.13e−45), apoptosis regulation (adjP = 1.12e−42), p53 signaling pathway (adjP = 8.22e−41), T-cell (adjP = 2.57e−34) and B-cell (adjP = 3.01e−20) receptor signaling pathways, MAPK signaling pathway (adjP = 8.47e−32), Wnt (adjP = 4.03e−27), Notch (adjP = 5.61e−26), TGF β (adjP = 2.38e−16) signaling pathways, and Hematopoietic cell lineage (adjP = 1.23e−14). Comparison of the pathways enriched in ALL specific training genes and the 54 prioritized genes showed that both sets of genes share many common cellular pathways. Our analysis of the PPI network topology, using the Network Analyzer plugin in Cytoscape, revealed that it is a small-world, scale-free network whose node degree distribution follows a power law (P(k) ∼ k^(−γ)) with a degree exponent γ of 0.923 and R² of 0.684, where R² measures how well the data points fit the curve. The clustering coefficient, which indicates the cluster-forming ability of nodes, was 0.434.
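One common way to estimate a degree exponent and its R² is a least-squares line through the degree distribution on log-log axes. The sketch below illustrates that approach on made-up data; it is an assumption that this resembles the plugin's fit, and `fit_power_law` and the toy degrees are hypothetical.

```python
import math
from collections import Counter

def fit_power_law(degrees):
    """Fit log P(k) = c - gamma * log k by least squares.

    Returns (gamma, r_squared). Needs at least two distinct positive degrees.
    """
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(counts[k] / total) for k in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, r_squared

# Made-up degrees whose frequencies halve each time the degree doubles,
# i.e. P(k) is proportional to 1/k.
toy_degrees = [1] * 8 + [2] * 4 + [4] * 2 + [8] * 1
gamma, r2 = fit_power_law(toy_degrees)
```

For this idealized input the points lie on a straight line, so the fit is essentially perfect; real networks, like the one analyzed here, scatter around the line and give a lower R².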

Identification of hub genes

The network protein interactors were analyzed to determine hubs i.e. proteins that have the highest connectivity within a network and hence tend to be biologically significant. Through comparison of fifty hub genes, using degree and betweenness centrality algorithms, that were output by cytoHubba, we have identified five prioritized hubs as potential biomarker genes and therapeutic targets – SMAD2, CDK9, HDAC1, LCK and MEN1. These hubs were observed to function in the regulation of cell cycle, cell differentiation and hematopoiesis processes. Their prioritization and hubness suggest that they may be likely to play a crucial role in neoplastic transformation. These five genes were also found to be part of the clusters in the PPI network containing proteins involved in leukemogenesis. Thus, our results highlight the functioning of the short-listed genes and their probable role in leukemogenesis and their use as novel therapeutic target genes.
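The hub comparison can be illustrated by ranking nodes with two centrality measures and keeping those in both top lists. The sketch below uses Brandes' algorithm for betweenness on an unweighted, undirected toy graph. It ignores edge confidence values, so it is a simplification of the cytoHubba setup described in the Methods, and all names and data are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness (Brandes' algorithm, unweighted, undirected)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = dict.fromkeys(adj, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                    # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected path was counted once from each endpoint.
    return {v: b / 2 for v, b in bc.items()}

def top_n(scores, n):
    """Names of the n highest-scoring nodes (ties broken alphabetically)."""
    return set(sorted(scores, key=lambda v: (-scores[v], v))[:n])

def common_hubs(adj, n):
    """Nodes ranked in the top n by both degree and betweenness."""
    deg = {v: len(neigh) for v, neigh in adj.items()}
    return top_n(deg, n) & top_n(betweenness(adj), n)

# Hypothetical toy tree: A - B - C, with D and E attached to C.
toy = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D", "E"}, "D": {"C"}, "E": {"C"}}
bc = betweenness(toy)
hubs = common_hubs(toy, 2)
```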

Discussion

In our study, we have profiled overexpressed genes of biological and statistical significance in B- and T-ALL. Further, analysis of the 54 prioritized genes revealed 30 T-ALL upregulated and 13 B-ALL overexpressed genes (Table 3), which could be explored further as subtype specific drug targets and also for understanding leukemic transformation specific to T- and B-ALL. Also, the eleven genes common to both subtypes may regulate pathways common, to a certain extent, in both the subtypes and hence these could emerge as important targets of the disease. The altered genes were found to function in cell growth and development processes that mediate the balance between actively dividing and quiescent hematopoietic stem cells (Arai and Suda, 2007). This suggests their possible role in the disruption of this balance, leading to increased leukemogenic transformation. Furthermore, the common pathways shared by the ALL training and 54 prioritized genes suggest that the prioritized genes may also contribute to pathogenesis, via molecular mechanisms that function in actively transforming normal hematopoietic stem cells into leukemic stem cells. Our study in ALL is based on the need to further understand leukemogenesis as the exact mechanisms through which altered genes and pathways co-operate and lead to neoplastic transformation are still under investigation (Pui et al., 2011). Differential expression profiling of genes in leukemic samples is essential for identification of genes and pathways that are deregulated and thus involved in leukemogenesis. Gene set enrichment analysis studies by Andersson et al. (2010) have suggested unique expression profiles specific to leukemic cells that are different from those of other tissues and cancers, thus emphasizing the need for in depth analysis of expression datasets to discover new therapeutic targets and biomarkers for the disease. 
In heterogeneous diseases, especially cancers, understanding functional associations could help provide better insights with respect to disease. The STRING protein interaction network generated in our study, was functionally enriched in many crucial pathways especially Notch, Wnt and T- and B-cell receptor signaling pathways which are among the most deregulated processes in ALL. This functional diversity was helpful in our study to highlight the multiple aberrant pathway modules that may act synergistically in leukemia initiation and subsequent disease process and hence may serve as drug targets. We also investigated the hub proteins in the network as they tend to play a significant role in regulation of cell processes and disease etiology and hence are prime targets for designing therapeutic ligands (Zotenko et al., 2008). Through hub analysis we have short-listed five important therapeutic targets – SMAD2, CDK9, MEN1, HDAC1, and LCK that could serve as potential biomarkers which were observed to be significantly upregulated in leukemic cells. These genes have crucial roles in regulating cell cycle proliferation and gene expression processes, therefore could serve as potential therapeutic targets and also as biomarkers for prognosis or diagnosis of leukemia. Of these, SMAD2 and CDK9 are novel findings and have not been reported earlier in leukemogenesis of ALL. Alterations in HDAC1 have been reported in association with many cancers and its overexpression in T-ALL has been suggested to play a role in lymphocyte differentiation (Moreno et al., 2010). The components of HDAC1 cluster could play a role in disease progression via deregulated expression of their target genes, leading to an increase in the levels of cell survival genes and inhibition of apoptotic genes, resulting in loss of proliferation control and thus increase in neoplastic cell number. 
Therapeutic strategies targeting LCK gene have been found effective against ALL malignant cells (Harr et al., 2010; De Keersmaecker et al., 2014). The close association of LCK cluster interactors suggests that alterations in one gene/protein may lead to a cascade event disrupting the signaling mechanisms, altering the cell fate determination process. Also, the members of LCK interaction cluster such as Zap70 and SYK (Fig. 4) have been previously observed to have altered expression in ALL and suggested to be possible prognostic markers for the disease (Ebeid et al., 2008). MEN1 has been reported to be crucial in MLL leukemogenesis (Ichikawa et al., 2003; Caslini et al., 2007; Grembecka et al., 2010). Many of the interactors of MEN1 cluster, especially NOTCH1 (Lin et al., 2012), CCND1 (Aref et al., 2006) and LEF1 (Gutierrez et al., 2010) (Fig. 4), have been reported to be altered in leukemic cells, especially T-leukemia cells. The proteins in the MEN1 cluster may thus be important for T-lymphocyte mediated leukemogenesis and may therefore constitute important targets for T-ALL specific therapy. The role of SMAD2 gene has been reported in other cancers such as Pancreatic cancer (Kleeff et al., 1999), and Colorectal cancer (Matsuzaki et al., 2009) wherein its alteration has been associated with malignant TGF-β signaling, poor prognosis and in metastasis (Oft et al., 2002), but no mutations in SMAD2 in ALL samples were observed (Wieser et al., 1998). However, in our study we report for the first time that its expression levels in the T-ALL datasets used for analysis were significantly high. It has also been identified as a prioritized hub gene in our study indicating that it could play an important role in the T-cell leukemogenesis, through altered TGFβ pathway, similar to other cancers. 
Alterations in the expression levels of the proteins in this cluster may contribute to loss of regulation of proliferation signals and apoptosis and thus lead to neoplastic transformation. As reviewed by Connolly et al. (2012), several studies have reported a marked decrease in cancer cells on administration of antagonists of the TGFβ signaling pathway and hence a similar approach may also be useful in patients with increased expression of SMAD2 in ALL. The cyclin dependent kinases (CDKs) help in proper initiation and elongation steps in the transcription process and studies have reported that their inhibition helps promote apoptosis in malignant cells and may prevent cytotoxicity due to deregulated pathways in altered cells (Shapiro, 2006). The overexpression of CDK9 in B-ALL indicates that it may contribute to neoplastic transformation of B-cells. Overexpression of CDK9 in other cancers such as certain lymphomas (Bellan et al., 2004) and neuroblastoma (De Falco et al., 2005) has been associated with differentiation and proliferation status. Thus, CDK9 overexpression in B-ALL may deregulate the cell cycle in B-lymphocytes and lead to leukemogenesis through promotion of increased cell division, and targeting this gene may be useful in controlling aberrant proliferation in B-ALL. The diversity in the functional pathways of the interactors in this cluster suggests a possible role of their interconnectivity in the transmission of oncogenic signals. The markers cited earlier, i.e. NEGR1, IRX2, EPS8, TPD52 (Kang et al., 2012) and NOTCH1 (Lin et al., 2012), were reported to play a crucial role in disease progression. Since the genes short-listed in our study also function in related pathways, these genes may influence each other and contribute to leukemogenesis.
Although LCK, HDAC1 and MEN1 have previously been reported in association with ALL, our study emphasizes the importance of these genes, their proteins and their interactions in leukemogenesis. Our study has also identified two new genes, SMAD2 and CDK9, whose role in the neoplastic transformation of lymphocyte cells in ALL has not been emphasized earlier. Therefore, this study assigns new putative roles to these genes taking part in leukemogenesis as important hubs. Further, these two genes may serve as prognostic markers, since they play a critical role in regulation of cell growth. Since these five genes were generally among the top 2–5% overexpressed genes in the ALL data sets used in our analysis, they could be used to differentiate between ALL and healthy samples. Further, analysis of the expression ranking of these genes in their respective datasets showed that, in the case of HDAC1, there is a significant difference in its expression ranking between T-ALL (top 5%) and B-ALL (∼top 10–26%); hence expression levels of this gene may also be used to differentiate between B- and T-ALL subtypes and between healthy and leukemic cells. Also, we observed that the RNA and protein expression levels of SMAD2 and CDK9 were high in MOLT4 and REH leukemic cell lines in the Human Protein Atlas database (Uhlen et al., 2010), which further supports the investigation of these genes as potential biomarkers in ALL and their further exploration as targets for therapy. Though further validation via in vitro and in vivo methods may be needed, we believe that these genes are involved in leukemogenesis as they have also been reported to contribute to carcinogenesis of other neoplasms. Further, the statistical validation performed by each of the gene prioritization and network analysis tools used in our study supports the significance of the results we obtained.
Also, since our analysis includes expression studies from childhood ALL, our results can be especially useful for understanding which pathways function in disease progression and in disease relapse in affected children, relapse being a major concern in treatment failure. Further, as reviewed by Chou and Shen (2009) and established by numerous research studies (such as Chen et al., 2013, 2014; Xiao et al., 2014; Xu et al., 2014), user-friendly and publicly accessible web-servers represent the future direction for developing more useful models and prediction methods and for demonstrating novel findings; hence, we shall endeavor in our future work to provide a web-server for the approach and findings presented in this paper.

Conclusions

In this study, we aimed to decipher the significance of the complex molecular networks of proteins encoded by overexpressed genes retrieved from the Oncomine database. Our computational analysis short-listed five genes, MEN1, SMAD2, CDK9, LCK and HDAC1, whose biological and functional relevance suggests their use as therapeutic targets and as potential biomarkers and predictors of leukemogenesis in ALL. The use of interaction networks in our study led to the identification of biological pathway modules, mediated by these genes, that may aid leukemogenesis. Further, the differential expression of the five short-listed genes suggests that they may be useful in separating ALL samples from controls. Given their functional enrichment, we believe that these genes could serve as potential biomarkers for prognosis and diagnosis, and that the new genes identified in our study, SMAD2 and CDK9, could serve as novel targets for therapy. This information may help in diagnosing ALL more accurately and in improving clinical studies. Finally, we also point to some useful data mining and bioinformatics software packages that can be used for identifying novel biomarkers in cancer research.
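As a rough illustration of the hub idea behind the network analysis summarized above, the sketch below ranks the nodes of an undirected interaction network by degree (number of distinct interaction partners). Degree is used here only as a simple stand-in for hub scoring; it is not claimed to be the scoring method applied in this study. The edge list is a made-up toy network with placeholder node names, not STRING output.

```python
from collections import defaultdict


def degree_hubs(edges, top_n=2):
    """Return the `top_n` nodes with the most distinct interaction partners.

    `edges` is an iterable of (node_a, node_b) pairs from an undirected
    network. Duplicate edges and self-loops are ignored.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # skip self-loops
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Sort by descending degree, breaking ties alphabetically for stable output
    ranked = sorted(neighbors, key=lambda n: (-len(neighbors[n]), n))
    return [(n, len(neighbors[n])) for n in ranked[:top_n]]


# Toy network with placeholder node names (assumption, not real interactions)
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "A"), ("B", "A")]

print(degree_hubs(toy_edges))
```

On a real network, the edge list would come from an exported interaction table, and other topological measures could be substituted for degree in the sort key.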

Author contributions

K.J. conceived the idea and helped in manuscript preparation. A.J. carried out the project work to obtain the results and drafted the manuscript. H.A.K. reviewed the manuscript and made suitable suggestions.

Competing financial interests

The authors declare no competing financial interests.
