
Knowledge database assisted gene marker selection for chronic lymphocytic leukemia.

Xixi Xiang¹, Yu-Ping Wang², Hongbao Cao³,⁴, Xi Zhang¹.

Abstract

Objective: To investigate whether previously curated chronic lymphocytic leukemia (CLL) risk genes could be leveraged in gene marker selection for the diagnosis and prediction of CLL.

Methods: A CLL genetic database (CLL_042017) was developed through a comprehensive CLL-gene relation data analysis, in which 753 CLL target genes were curated. Expression values for these genes were used for case-control classification of four CLL datasets, with a sparse representation-based variable selection (SRVS) approach employed for feature (gene) selection. Results were compared with outcomes obtained by using analysis of variance (ANOVA)-based gene selection.

Results: For each of the four datasets, SRVS selected a subset of genes from the 753 CLL target genes that yielded significantly higher classification accuracy than randomly selected genes (100%, 100%, 98.64%, and 89.39%, respectively). The SRVS method matched or outperformed ANOVA in terms of classification accuracy.

Conclusion: Gene markers selected from the 753 CLL genes could enable significantly greater accuracy in the prediction of CLL. SRVS provides an effective method for gene marker selection.


Keywords: Chronic lymphocytic leukemia (CLL); case-control classification; disease prediction; gene markers; genetic databases; sparse representation; variable selection

Substances: Genetic Markers

Year: 2018 PMID: 29996709 PMCID: PMC6134680 DOI: 10.1177/0300060518783072

Source DB: PubMed Journal: J Int Med Res ISSN: 0300-0605 Impact factor: 1.671

Introduction

Chronic lymphocytic leukemia (CLL) is the most frequent B-cell leukemia and affects men more frequently than women.[1] The disease often occurs in elderly patients and rarely affects children.[2] Despite the efforts of many genetic studies, the molecular abnormalities and genetic mechanisms of CLL remain largely unknown.[3] Most CLL patients are asymptomatic at diagnosis, with the exception of a high white blood cell count in a routine blood test. Consequently, early CLL could easily remain untreated.[4] Therefore, there is an urgent need for biomarker identification to facilitate early prediction of CLL.[5]

In the past, hundreds of genes/proteins have been linked to CLL. Mutations of some risk genes, including IL4 and TP53, have been frequently reported as important markers for the pathogenic development of CLL.[6,7] However, these genes may also serve as biomarkers for multiple other diseases,[7,8] which decreases their specificity as biomarkers for the prediction of CLL. Additionally, many CLL-gene relationships have been reported, but few have been replicated (e.g., PRKCD and TGFBR2[9,10]), reflecting the heterogeneity of CLL and the variance of CLL-related genetic changes among patients.[11] Moreover, a number of novel CLL risk genes are identified each year,[12] facilitating the development of an enriched genetic database for CLL.

The purpose of this study was to investigate whether previously reported CLL genes could be leveraged as a database for gene marker selection, specifically targeting early diagnosis of CLL. We hypothesized that if these CLL genes are effective for the prediction of CLL, gene markers selected from among them should enable significant accuracy in differentiating CLL cases from controls.

Methods

Development and analysis of CLL_042017

Figure 1 presents the schema of the curated database CLL_042017. The database contains 753 genes (CLL_042017→Related Genes) that were collected as CLL target genes; each of these genes has at least one reference supporting its relationship with CLL (3,078 references in total; see CLL_042017→Ref for Disease-Gene Relation). The CLL-gene relations were identified by using Pathway Studio (www.pathwaystudio.com).[13] The database also includes 235 drugs (CLL_042017→Related Drugs), 97 diseases (CLL_042017→Related Diseases), and 88 pathways (CLL_042017→Related Pathways). Information on the 2,756 supporting references for CLL-drug relations is provided in CLL_042017→Ref for Related Drugs. The reference information includes titles and the sentences in which a relationship was identified. The current CLL_042017 is available online at http://gousinfo.com/database/Data_Genetic/CLL_042017.xlsx. For a more detailed description of the database, please refer to CLL_042017→Database Note.

Figure 1.

Chronic lymphocytic leukemia (CLL) genetic database schematic.

SRVS for gene vector selection

A sparse representation-based variable selection (SRVS) algorithm (described in detail elsewhere)[14] was used to rank the 753 CLL target genes on the basis of a given experimental dataset. SRVS assigns a sparse weight to each gene. The gene vector composed of the top n genes ranked by SRVS is the genetic marker for a CLL case/control group, where n is the number of genes corresponding to the maximum classification ratio (CR) as defined in Eq. (1).

Gene expression data

In this study, we used four RNA expression datasets to evaluate classification performance with CLL target genes: GSE2466, GSE19147, GSE50006, and GSE8835. The datasets were selected by using the Illumina BaseSpace Correlation Engine (http://www.illumina.com) and are publicly available at the NCBI Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/). The data selection criteria were as follows: 1) the sample organism was Homo sapiens; 2) the data type was RNA expression; 3) the experiment design was CLL case vs. normal control. From each dataset, expression data of normal controls and CLL patients were extracted and used for case/control classification. The genes of each dataset were limited to the CLL target genes curated within the database CLL_042017. The key statistics of the four datasets are summarized in Table 1.

Table 1.

Statistics of four gene expression datasets.

NCBI GEO ID	GSE2466	GSE19147	GSE50006	GSE8835
#CLL case/control	72/11	25/8	188/32	42/24
#genes from CLL_042017	564	624	685	624
Sample source	Peripheral blood lymphocytes	Peripheral blood CD3+ T cells	Leukemia cells	Peripheral blood CD4 T cells and CD8 T cells
Sample population	Austria	Germany	USA	USA

The gene expression profiles of the four datasets are also included in CLL_042017 (CLL_042017→GSE2466, GSE19147, GSE50006, and GSE8835). Within each dataset, the SRVS-generated weights (SRVSScore) and the analysis of variance (ANOVA)-generated p-value scores (PValueScore; log-transformed p-values: -10*log(p-value)) are also presented. The p-value for a gene is generated from a one-way ANOVA of the case/control comparison on the corresponding expression data. The SRVSScore and PValueScore represent the significance of a gene in the dataset according to the SRVS and ANOVA methods, respectively.
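The PValueScore transform, -10*log(p-value), can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details from the study: the logarithm is taken as base 10 (the text does not state the base), very small p-values are clamped to a floor so the score stays finite, and the function names `p_value_score` and `rank_genes` are hypothetical.

```python
import math


def p_value_score(p_value, floor=1e-300):
    """Return -10 * log10(p), clamping p to `floor` so p == 0 stays finite.

    Base 10 is an assumption; the source only writes -10*log(p-value).
    """
    return -10.0 * math.log10(max(p_value, floor))


def rank_genes(scores):
    """Return gene names sorted by score, highest (most significant) first."""
    return sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions a smaller p-value gives a larger score, so sorting by PValueScore in descending order places the most significant genes first, which matches the ranking direction used for SRVSScore.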

CLL case/control classification

To identify the best gene vector and the corresponding classification accuracy (CR), the CLL target genes were first ranked by SRVSScore in descending order. Then, Euclidean distance-based multivariate classification[18] was performed for each dataset, followed by leave-one-out (LOO) cross-validation. In each run of LOO, the gene expression data of one subject were used for testing; the remaining data were used for training. The inputs of the classifier were the top n (n = 1, 2, …) genes, such that the CR obtained with the top n genes could be identified. A permutation test of 5,000 runs was then conducted to test the hypothesis that a randomly selected gene set of similar size can reach an equal or higher CR. For each subset of genes, the permutation p-value was calculated as $p = N_{h}/N_{t}$, where $N_{h}$ was the number of runs generating a CR higher than that of the gene subset and $N_{t}$ was the total number of runs (5,000 in this study). The gene vector that generated the highest CR was the best gene subset selected from the gene expression dataset according to the SRVS method. Following the same process, the best gene subset was identified for each dataset by the ANOVA approach. For comparison purposes, a CR baseline was also generated by using randomly selected gene sets of n (n = 1, 2, …) genes. For each point of the CR baseline, the value was the mean of 300 CRs obtained with genes randomly selected from all genes within the dataset.
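The procedure in this section (LOO cross-validation over the top-ranked genes, a permutation p-value, and a random baseline) can be sketched in Python as follows. This is a rough illustration and not the authors' implementation. In particular, the Euclidean distance-based classifier is assumed here to be a nearest-centroid rule, which the text does not specify, and all function names are hypothetical. `X` is a list of samples (each a list of expression values) and `y` the matching list of class labels.

```python
import math
import random


def nearest_centroid_predict(train_X, train_y, x):
    """Assign x to the class whose mean vector is closest in Euclidean distance.

    A nearest-centroid rule is an assumption about the classifier.
    """
    best_label, best_dist = None, math.inf
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = math.dist(centroid, x)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_cr(X, y, genes):
    """Leave-one-out CR (%) using only the gene columns listed in `genes`."""
    sub = [[row[g] for g in genes] for row in X]
    correct = 0
    for i in range(len(sub)):
        # Hold out sample i for testing; train on all remaining samples.
        train_X = sub[:i] + sub[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, sub[i]) == y[i]:
            correct += 1
    return 100.0 * correct / len(sub)


def best_top_n(X, y, ranked_genes):
    """Scan n = 1, 2, ... and return (max CR, smallest n that reaches it)."""
    best_cr, best_n = -1.0, 0
    for n in range(1, len(ranked_genes) + 1):
        cr = loo_cr(X, y, ranked_genes[:n])
        if cr > best_cr:
            best_cr, best_n = cr, n
    return best_cr, best_n


def permutation_p(X, y, observed_cr, n_genes, runs=5000, seed=0):
    """Fraction of random gene sets of size n_genes with CR above observed_cr."""
    rng = random.Random(seed)
    all_genes = range(len(X[0]))
    higher = sum(
        loo_cr(X, y, rng.sample(all_genes, n_genes)) > observed_cr
        for _ in range(runs)
    )
    return higher / runs


def baseline_cr(X, y, n_genes, runs=300, seed=0):
    """Mean CR over `runs` random gene sets of size n_genes (random baseline)."""
    rng = random.Random(seed)
    all_genes = range(len(X[0]))
    total = sum(loo_cr(X, y, rng.sample(all_genes, n_genes)) for _ in range(runs))
    return total / runs
```

In this sketch, `ranked_genes` is a list of column indices already sorted by SRVSScore or PValueScore in descending order. When several values of n reach the same maximum CR, the smallest n is kept; that tie-breaking choice is a simplification and is not stated in the text.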

Results

Figure 2 presents the classification results. Table 2 summarizes the results of LOO cross-validation of the two gene-ranking methods on four datasets, where the maximum CRs, corresponding numbers of top genes, and permutation p-values of the two methods are provided.

Figure 2.

Comparison of different metrics through leave-one-out (LOO) cross-validation. Genes were ranked in descending order according to SRVSScore or PValueScore, for sparse representation-based variable selection (SRVS) or analysis of variance (ANOVA), respectively. (a) GSE2466, (b) GSE19147, (c) GSE50006 and (d) GSE8835.

Table 2.

LOO cross-validation and permutation results

	GSE2466 (case/control:72/11)		GSE19147 (case/control:25/8)		GSE50006 (case/control:188/32)		GSE8835 (case/control:42/24)
	SRVS	ANOVA	SRVS	ANOVA	SRVS	ANOVA	SRVS	ANOVA
Max CR (%)	100.00	100.00	100.00	93.94	98.64	98.18	89.39	84.85
# Selected genes	4	3	65	3	131	20	101	345
Permutation p-value	0.001	0.0002	∼0	0.0016	0.0014	0.0012	∼0	∼0
Unique genes from all datasets (%)	25%(1/4)	66.67%(2/3)	52.31%(34/65)	33.33%(1/3)	75.57%(99/131)	40%(8/20)	97.03%(98/101)	95.94%(331/345)
Overlap genes of two methods (%)	0%(0/4)	0%(0/3)	3.08%(2/65)	66.67%(2/3)	15.27%(20/131)	100%(20/20)	65.35%(66/101)	19.13%(66/345)

SRVS, sparse representation-based variable selection; ANOVA, analysis of variance.

Figure 2 shows that, compared with the CRs generated by randomly selected gene sets, the genes selected from the CLL target genes by both SRVS and ANOVA achieved significantly higher classification accuracies. Notably, the highest CRs were obtained by using only the top genes with the highest SRVSScore/PValueScore (see Figure 2 and Table 2); adding more genes with lower scores did not necessarily improve classification accuracy. These results support the validity of both the SRVS and ANOVA methods. Moreover, SRVSScore matched or outperformed PValueScore in terms of CR on all four datasets (Table 2).

Table 2 also shows that the top genes selected by each method differed substantially across datasets (CLL_042017→Venn Diagram). For the SRVS method, the proportion of selected genes unique to a dataset ranged from 25% to 97.03%; the range was 33.33% to 95.94% for the ANOVA method (Table 2→Unique genes from all datasets (%)). These results suggest that there are factors affecting gene marker selection that are worthy of further study. It is also notable that, for a given dataset, the gene markers selected by SRVS and ANOVA could differ (Table 2→Overlap genes of two methods (%)). This suggests that SRVS performs differently from, and more effectively than, ANOVA.
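The overlap percentages in Table 2 are the number of genes shared by the two methods divided by the number of genes selected by the method in question, so the same shared count gives different percentages for SRVS and ANOVA. A minimal Python sketch of that arithmetic follows; the function name `overlap_percent` and the placeholder gene identifiers are hypothetical, and only the set sizes are taken from Table 2 (GSE8835: 101 SRVS genes, 345 ANOVA genes, 66 shared).

```python
def overlap_percent(selected, other):
    """Percentage of genes in `selected` that also appear in `other` (2 decimals)."""
    selected, other = set(selected), set(other)
    return round(100.0 * len(selected & other) / len(selected), 2)


# Placeholder identifiers sized to mimic GSE8835 in Table 2 (not real gene names).
srvs_genes = {f"g{i}" for i in range(101)}        # 101 genes selected by SRVS
anova_genes = {f"g{i}" for i in range(35, 380)}   # 345 genes selected by ANOVA

print(overlap_percent(srvs_genes, anova_genes))   # 66/101 -> 65.35
print(overlap_percent(anova_genes, srvs_genes))   # 66/345 -> 19.13
```

The asymmetry explains why a small ANOVA gene set can be fully contained in a larger SRVS set (100% for ANOVA on GSE50006) while the overlap relative to the SRVS set is only 15.27%.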

Discussion

CLL affects approximately one million people globally, but remains poorly diagnosed at early stages. In the past, many studies have been performed with the aim of developing targeted molecular therapy for CLL[6,7]; hundreds of risk genes have been identified. Most of these genes are active within CLL-related genetic pathways, and many have been used as drug targets for the treatment of CLL. However, patients may demonstrate genetic variation, even in the same disease, implying the need for personalized treatment.[16] Therefore, for a given CLL patient/patient group, feature (gene) selection is important for diagnosis and treatment. Thus far, few studies have been conducted to test the validity of curated CLL risk genes for use as genetic markers in the diagnosis and prediction of CLL.

In this study, we first conducted comprehensive literature data mining in 3,078 scientific articles, which identified 753 CLL target genes. Gene set enrichment analysis showed that the majority of these genes (594/753) were significantly enriched within multiple genetic pathways that were associated with CLL (p-value<3e-13; q=0.001 for false discovery rate (FDR)). For instance, 230 genes were significantly enriched within eight cell apoptosis pathways (p-value<5.2e-14; q=0.001 for FDR).[15] There were also 240 genes enriched within eight pathways/gene sets related to cell growth and proliferation (p-value<6.8e-15)[16] and 218 genes enriched within immune response (p-value<8.7e-29).[17] More pathways and related information can be found at CLL_042017→Related Pathways. Sub-network enrichment analysis (SNEA; http://pathwaystudio.gousinfo.com/SNEA.pdf) showed that 717 of the 753 genes significantly overlapped with risk genes linked to each of the 97 diseases (p-value<1.6e-100; q=0.001 for FDR; CLL_042017→Related Diseases).
Many of these 97 diseases are cancers of different types, and many are related to CLL, including rheumatoid arthritis,[18] breast cancer[19] and multiple myeloma.[20] Within CLL_042017, there were 235 known CLL drugs/small molecules (CLL_042017→Related Drugs) that have been evaluated in clinical trials and have demonstrated effectiveness in treating CLL. These 235 drugs demonstrated significant overlap (22 overlapped drugs; p-value=8.20e-23) with the top 100 potential drugs/small molecules (CLL_042017→Potential Drugs), whose gene subnetworks were significantly enriched within the 753 CLL genes. Additionally, many of the 753 CLL genes were target genes of known CLL drugs. For instance, rituximab induces apoptosis of CLL cells by inhibiting the expression of BCL2.[21] These results supported a possible association between CLL and the 753 target genes.

CLL case/control classification was conducted on four independent gene expression datasets, with two algorithms for gene selection within the 753 CLL gene pool: SRVS and ANOVA. The rationale for feature (gene) selection is that not all 753 genes will exhibit mutations in a given CLL patient/patient group; therefore, it is not appropriate to use all of them as target genes in diagnosis and treatment. Compared with randomly selected genes, those selected by both SRVS and ANOVA led to significantly higher prediction power (permutation p-value≤0.0014 for SRVS and permutation p-value≤0.0016 for ANOVA; CRs of SRVS vs. ANOVA: 100% vs. 100%, 100% vs. 93.94%, 98.64% vs. 98.18% and 89.39% vs. 84.85%, for the four datasets, respectively), as shown in Table 2. These results indicated that genetic markers selected from the 753 CLL target genes possess significant power for the diagnosis and prediction of CLL. Moreover, SRVS matched or outperformed ANOVA in terms of CR on all four datasets. This supports the effectiveness of the SRVS method for gene marker selection for CLL.
Gene markers selected by both the SRVS and ANOVA methods demonstrated substantial uniqueness (≥25%) across different datasets (Table 2). This indicates that, in addition to the genomic specificity of each patient group, there may be other factors that affect gene marker selection and merit further study. As shown in Table 1, the four datasets were acquired from different blood cell types and different patient populations. This may contribute to variations in the gene marker selection results (Table 2).

In conclusion, our study suggested that gene markers selected from the 753 CLL genes could provide high accuracy in the prediction of CLL, and that SRVS is an effective method for gene marker selection in CLL diagnosis and prediction.

21 in total

1. Aaron D Schimmer; Irene Munk-Pedersen; Mark D Minden; John C Reed. Bcl-2 and apoptosis in chronic lymphocytic leukemia. Curr Treat Options Oncol. 2003-06. (Review)

2. R Bahl; S Arora; N Nath; M Mathur; N K Shukla; R Ralhan. Novel polymorphism in p21(waf1/cip1) cyclin dependent kinase inhibitor gene: association with human esophageal cancer. Oncogene. 2000-01-20.

3. B Barlogie; R P Gale. Multiple myeloma and chronic lymphocytic leukemia: parallels and contrasts. Am J Med. 1992-10. (Review)

4. F R Mauro; R Foa; D Giannarelli; I Cordone; S Crescenzi; E Pescarmona; R Sala; R Cerretti; F Mandelli. Clinical characteristics and outcome of young chronic lymphocytic leukemia patients: a single institution study of 204 cases. Blood. 1999-07-15. (Review)

5. P V Voulgari; G Vartholomatos; P Kaiafas; K L Bourantas; A A Drosos. Rheumatoid arthritis and B-cell chronic lymphocytic leukemia. Clin Exp Rheumatol. 2002 Jan-Feb.

6. G Galletti; F Caligaris-Cappio; M T S Bertilaccio. B cells and macrophages pursue a common path toward the development and progression of chronic lymphocytic leukemia. Leukemia. 2016-09-28. (Review)

7. Ulf Klein; Marie Lia; Marta Crespo; Rachael Siegel; Qiong Shen; Tongwei Mo; Alberto Ambesi-Impiombato; Andrea Califano; Anna Migliazza; Govind Bhagat; Riccardo Dalla-Favera. The DLEU2/miR-15a/16-1 cluster controls B cell proliferation and its deletion leads to chronic lymphocytic leukemia. Cancer Cell. 2010-01-07.

8. Ingo Ringshausen; Folker Schneller; Christian Bogner; Susanne Hipp; Justus Duyster; Christian Peschel; Thomas Decker. Constitutively activated phosphatidylinositol-3 kinase (PI-3K) is involved in the defect of apoptosis in B-CLL: association with protein kinase Cdelta. Blood. 2002-07-12.

9. Vandana Dialani; Kalpana Mani; Nicole B Johnson. Chronic lymphocytic leukemia involving the breast parenchyma, mimicker of invasive breast cancer: differentiation on breast MRI. Case Rep Med. 2013-09-18.

10. Philip L Lorenzi; Sofie Claerhout; Gordon B Mills; John N Weinstein. A curated census of autophagy-modulating proteins and small molecules: candidate targets for cancer therapy. Autophagy. 2014-05-12.