Chenglong Li1, Biao Zhu1, Jiao Chen1, Xiaobing Huang1. 1. Department of Hematology, Sichuan Academy of Medical Sciences and Sichuan Provincial People's Hospital, Chengdu, Sichuan 610072, P.R. China.
Abstract
In the present study, gene expression profiles of acute myeloid leukemia (AML) samples were analyzed to identify feature genes with the capacity to predict the mutation status of FLT3/ITD. Two machine learning models, namely the support vector machine (SVM) and random forest (RF) methods, were used for classification. Four datasets were downloaded from the European Bioinformatics Institute, two of which (containing 371 samples, including 281 FLT3/ITD mutation-negative and 90 mutation-positive samples) were randomly defined as the training group, while the other two datasets (containing 488 samples, including 350 FLT3/ITD mutation-negative and 138 mutation-positive samples) were defined as the test group. Differentially expressed genes (DEGs) were identified by significance analysis of the microarray data using the training samples. The classification efficiency of the SVM and RF methods was evaluated using the following parameters: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and the area under the receiver operating characteristic curve. Functional enrichment analysis was performed for the feature genes with DAVID. A total of 585 DEGs were identified in the training group, of which 580 were upregulated and five were downregulated. The classification accuracy rates of the two methods for the training group, the test group and the combined group using the 585 feature genes were >90%. For the SVM and RF methods, the rates of correct determination, specificity and PPV were >90%, while the sensitivity and NPV were >80%. The SVM method produced a slightly better classification effect than the RF method. A total of 13 biological pathways were overrepresented by the feature genes, mainly involving energy metabolism, chromatin organization and translation. The feature genes identified in the present study may be used to predict the mutation status of FLT3/ITD in patients with AML.
Fms-like tyrosine kinase 3 (FLT3) is expressed in hematopoietic progenitor cells. In acute myeloid leukemia (AML), its most frequent mutation is an internal tandem duplication (FLT3/ITD), which has a prevalence of 30–35% (1). FLT3/ITD is a critical prognostic factor for patients with AML. Compared with carriers of wild-type FLT3, patients with the FLT3/ITD mutation have shorter overall survival time and disease-free survival time (2). Early diagnosis of FLT3/ITD allows for timely treatment of AML and thus benefits the clinical outcome.
Certain achievements have been made in revealing the role of the FLT3/ITD mutation in AML and several feature genes associated with the FLT3/ITD mutation have been identified. Chen et al (3) reported that signaling associated with the FLT3/ITD mutation includes the suppression of SHP-1. Furthermore, aberrant expression of CD7 in myeloblasts has been found to be highly associated with the FLT3/ITD mutation in AML (4). Okamoto et al (5) indicated that Lyn, an important component of the signal transduction pathway specific for FLT3/ITD, may be utilized as a therapeutic target for the treatment of AML in carriers of the FLT3/ITD mutation. Furthermore, PIM1, a serine/threonine kinase, has been found to be upregulated in FLT3/ITD mutation-positive AML and may be involved in FLT3-mediated leukemogenesis (6). Dalal et al (7) reported that CD56 can predict the presence of the FLT3/ITD mutation in AML.
In order to distinguish the FLT3/ITD mutation from the wild-type at the transcriptional level, the present study analyzed microarray gene expression data of AML samples. Feature genes were identified by a bioinformatics analysis and subsequent classification was performed by machine learning models, namely the support vector machine (SVM) and random forest (RF) methods. The classification efficiency of the two models was also evaluated.
Materials and methods
Microarray data and data pre-processing
Gene expression data of AML samples were downloaded from the European Bioinformatics Institute (EBI; http://www.ebi.ac.uk) (8). Four relevant data sets for patient cohorts with AML containing information on the FLT3/ITD mutation were obtained, which included a total of 859 AML samples (Table I). Two data sets (containing 371 samples, including 281 FLT3/ITD mutation-negative and 90 FLT3/ITD mutation-positive samples) were selected as the training group, while the other two data sets (containing 488 samples, including 350 FLT3/ITD mutation-negative samples and 138 FLT3/ITD mutation-positive samples) were used as the test group.
Table I
Microarray data sets used in the present study.
| Data set ID | Total samples (n) | FLT3/ITD mutation negative (n) | FLT3/ITD mutation positive (n) | Undetermined samples (n) |
|---|---|---|---|---|
| Training sets | | | | |
| E-GEOD-61804 | 325 | 243 | 50 | 32 |
| E-GEOD-34860 | 78 | 38 | 40 | 0 |
| Total | 403 | 281 | 90 | 32 |
| Testing sets | | | | |
| E-GEOD-17855 | 237 | 189 | 48 | 0 |
| E-GEOD-15434 | 251 | 161 | 90 | 0 |
| Total | 488 | 350 | 138 | 0 |
The raw data were pre-processed using the affy package (9) in R (www.r-project.org), including data format conversion, filling in missing values (using median gene expression), background correction using the MAS method and normalization with the quantiles method (10).
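The quantile normalization step can be illustrated with a minimal Python sketch. This is not the affy implementation used in the study; the function name `quantile_normalize` and the toy values are assumptions made for illustration, and ties are broken by original position as a simplification.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize equally long sample columns (same gene order).

    Each sample's values are replaced by the across-sample mean of the
    values sharing the same rank. Ties are broken by original position.
    """
    n_genes = len(columns[0])
    # Sort every sample, then average across samples at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [mean(col[r] for col in sorted_cols) for r in range(n_genes)]

    normalized = []
    for col in columns:
        # order[r] is the index of the gene holding rank r in this sample.
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical expression values for two samples and three genes.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(samples)
```

After normalization, every sample shares the same set of values, so differences between arrays in overall intensity distribution no longer influence the comparison between groups.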
Screening of differentially expressed genes (DEGs)
Microarray data from wild-type and FLT3/ITD mutation-positive AML samples were screened for DEGs using the significance analysis of microarray method in R (11). The false discovery rate (FDR) was estimated using the permutation method with P<0.05 (12,13) and |log2(fold change)|>1 set as the thresholds.
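The thresholding step can be sketched in Python as follows. This shows only the final filter, not the significance analysis of microarray procedure or its permutation-based FDR estimate. The function name `select_degs`, the use of group means on the raw intensity scale, the reading of the 0.05 cut-off as applying to the FDR value, and the toy numbers are all assumptions made for illustration.

```python
import math


def select_degs(genes, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Keep genes with |log2(fold change)| > fc_cutoff and FDR < fdr_cutoff.

    `genes` holds tuples of (name, mean in mutation-positive samples,
    mean in mutation-negative samples, FDR).
    """
    selected = []
    for name, mean_pos, mean_neg, fdr in genes:
        log2_fc = math.log2(mean_pos / mean_neg)
        if abs(log2_fc) > fc_cutoff and fdr < fdr_cutoff:
            direction = "up" if log2_fc > 0 else "down"
            selected.append((name, log2_fc, direction))
    return selected


# Hypothetical values for illustration only (not taken from the study).
toy = [
    ("geneA", 800.0, 100.0, 0.001),  # log2FC = 3, passes both cut-offs
    ("geneB", 150.0, 100.0, 0.010),  # log2FC below 1, fails fold change
    ("geneC", 50.0, 400.0, 0.020),   # log2FC = -3, passes both cut-offs
    ("geneD", 900.0, 100.0, 0.200),  # fails the FDR cut-off
]
degs = select_degs(toy)
```

A fold-change cut-off of 1 on the log2 scale corresponds to at least a two-fold difference in either direction between the two groups.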
Prediction of mutation status of AML samples
The ability of DEGs to predict the FLT3/ITD mutation status in AML samples was examined using two methods: SVM and RF.
SVM is a classification technique based on the structural risk minimization principle (14). The SVM classifier was constructed via the svm function in the e1071 package of R with the non-linear radial basis function as the kernel and the penalty parameter set at 1,000.
RF utilizes multiple classification and regression trees to classify samples (15). The function randomForest from the randomForest package in R was adopted to classify AML samples from the training group.
A leave-one-out cross-validation method was used to evaluate the classification efficiency of the two methods. The sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) (16) and area under the receiver operating characteristic (ROC) curve (17) were calculated. The classification efficiency for the training group, the test group and the combined group was evaluated individually. Whenever the constructed SVM or RF classifier produced a high reliability, the DEGs collected from the training sets were considered as feature genes for distinguishing wild-type from FLT3/ITD mutation-positive samples.
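The leave-one-out procedure can be sketched in Python as follows. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM and RF classifiers used in the study; the function names and the toy data are assumptions made for illustration.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier: return the label of the closest class centroid."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        # Squared Euclidean distance between the sample and the centroid.
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(features, labels, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical two-gene expression profiles for six samples.
toy_x = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [3.0, 3.1], [3.2, 2.9], [2.8, 3.0]]
toy_y = ["neg", "neg", "neg", "pos", "pos", "pos"]
accuracy = leave_one_out_accuracy(toy_x, toy_y)
```

Because each sample is classified by a model that never saw it during training, the resulting accuracy is a less optimistic estimate than one computed on the training samples themselves.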
Functional enrichment analysis
Functional enrichment analysis of the feature genes was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID; http://david.abcc.ncifcrf.gov/) (18,19). P<0.5 and FDR<0.1 were set as the cut-off values to screen out significantly over-represented biological pathways.
Results
Screening for DEGs
A total of 585 DEGs were identified in FLT3/ITD mutation-positive samples from the training group, comprising 580 upregulated and 5 downregulated genes compared with those in the FLT3/ITD mutation-negative AML samples.
Sample classification using SVM or RF classifier
Classification of AML samples with regard to their FLT3/ITD mutation status depending on their gene expression profiles was performed using the SVM and RF methods (Fig. 1).
Figure 1
Classification results of acute myeloid leukemia samples using the SVM and RF methods. Classification according to the SVM method for (A) the training group, (B) the test group and (C) the combined group. Classification according to the RF method for (D) the training group, (E) the test group and (F) the combined group. SVM, support vector machine; RF, random forest; Neg, negative; Pos, positive.
For the 371 AML samples from the training group, 276 and 273 mutation-negative samples, as well as 86 and 85 mutation-positive samples, were correctly classified using the SVM and RF methods, respectively. The accuracy rates were 97.57 and 96.50%, respectively.
Among the 488 AML samples from the test group, 337 and 325 mutation-negative samples, as well as 123 and 117 mutation-positive samples, were correctly classified using the SVM and RF methods, respectively, and the accuracy rates were 94.26 and 90.57%.
For the 859 AML samples from the combined group, 606 and 590 mutation-negative samples, as well as 204 and 206 mutation-positive samples, were correctly classified using the SVM and RF methods, respectively, with accuracy rates of 94.30 and 92.67%.
According to the above classification results (Fig. 2), the SVM method had a better accuracy rate than the RF method. Nevertheless, all accuracy rates were >90%, suggesting a good classification ability of these two methods based on the DEGs identified.
Figure 2
Scatter diagrams showing the classification results. Blue dots indicate FLT3/ITD mutation-negative samples and red dots indicate FLT3/ITD mutation-positive samples. Classification according to the SVM method for (A) the training group, (B) the test group and (C) the combined group. Classification according to the RF method for (D) the training group, (E) the test group and (F) the combined group. SVM, support vector machine; RF, random forest.
Classification efficiency
Six parameters were calculated to evaluate the classification efficiency: rate of correct prediction, sensitivity, specificity, PPV and NPV (Table II), and the area under the ROC curve (Fig. 3). For the SVM and RF methods, the rate of correct prediction, specificity and PPV were >90%, while the sensitivity and NPV were >80%, with the SVM method producing a slightly better classification efficiency than the RF method.
Table II
Classification effects of the SVM method and the RF method.
| Method | Group | No. of samples | Correct rate | Sensitivity | Specificity | PPV | NPV | AUROC |
|---|---|---|---|---|---|---|---|---|
| SVM | Training group | 371 | 0.9757 | 0.9556 | 0.9822 | 0.9451 | 0.9857 | 0.997 |
| SVM | Test group | 488 | 0.9426 | 0.8913 | 0.9629 | 0.9044 | 0.9574 | 0.876 |
| SVM | Combined | 859 | 0.9430 | 0.8947 | 0.9604 | 0.8908 | 0.9619 | 0.902 |
| RF | Training group | 371 | 0.9650 | 0.9444 | 0.9715 | 0.9140 | 0.9820 | 0.983 |
| RF | Test group | 488 | 0.9057 | 0.8478 | 0.9286 | 0.8239 | 0.9393 | 0.818 |
| RF | Combined | 859 | 0.9267 | 0.9035 | 0.9350 | 0.8340 | 0.9656 | 0.916 |
PPV, positive predictive value; NPV, negative predictive value; AUROC, area under the receiver operating characteristic curve; SVM, support vector machine; RF, random forest.
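The first five parameters follow directly from the confusion counts, as the following minimal Python sketch shows. It treats mutation-positive samples as the positive class and uses the counts reported for the SVM method on the training group (86 of 90 mutation-positive and 276 of 281 mutation-negative samples correctly classified); the function name `classification_metrics` is an assumption made for illustration.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute the rate-based parameters from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "correct_rate": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # positives correctly detected
        "specificity": tn / (tn + fp),  # negatives correctly detected
        "ppv": tp / (tp + fp),          # predicted positives that are true
        "npv": tn / (tn + fn),          # predicted negatives that are true
    }


# SVM, training group: 90 mutation-positive and 281 mutation-negative samples.
svm_train = classification_metrics(tp=86, fn=4, tn=276, fp=5)
rounded = {name: round(value, 4) for name, value in svm_train.items()}
```

Rounded to four decimal places, these values agree with the SVM training group entries for correct rate, sensitivity, specificity, PPV and NPV.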
Figure 3
Receiver operating characteristic curves generated using (A) the support vector machine method and (B) the random forest method.
The feature genes identified were suitable for correct prediction of the FLT3/ITD mutation status of AML samples not only in the training group, but also in the test group and the combined group, suggesting that these DEGs may be utilized for distinguishing FLT3/ITD mutation-negative AML samples from mutation-positive samples. It was indicated that the DEGs identified in the present study are feature genes of the FLT3/ITD mutation, including IDH1, SUZ12, BCORL1, RUVBL2, JMJD1C, TOP2A, DAPK3, RPS15, RPS16, RPS9, EIF2α, EIF4E, EIF3B, EIF3K, EIF3L and EIF1B.
Biological pathways of feature genes
A total of 13 biological pathways were over-represented by the feature genes (Table III). The number of genes in each biological pathway is shown in Fig. 4. Several pathways were associated with energy metabolism, including oxidative phosphorylation, mitochondrial electron transport and mitochondrial adenosine triphosphate (ATP) synthesis. Furthermore, chromatin organization, chromosome organization and translation were significantly overrepresented.
Table III
Significantly over-represented biological pathways in feature genes.
| Term | Count | P-value | FDR |
|---|---|---|---|
| GO:0006119 - Oxidative phosphorylation | 15 | 3.80×10^-6 | 0.008857 |
| GO:0006091 - Generation of precursor metabolites | 27 | 1.50×10^-5 | 0.017414 |
| GO:0022900 - Electron transport chain | 15 | 2.25×10^-5 | 0.017423 |
| GO:0045333 - Cellular respiration | 13 | 8.00×10^-5 | 0.045815 |
| GO:0016568 - Chromatin modification | 23 | 1.11×10^-4 | 0.050724 |
| GO:0006414 - Translational elongation | 13 | 1.19×10^-4 | 0.045367 |
| GO:0006325 - Chromatin organization | 28 | 1.40×10^-4 | 0.045654 |
| GO:0006412 - Translation | 25 | 2.62×10^-4 | 0.073784 |
| GO:0015980 - Energy derivation | 15 | 2.89×10^-4 | 0.072430 |
| GO:0051276 - Chromosome organization | 32 | 3.37×10^-4 | 0.075984 |
| GO:0006120 - Mitochondrial electron transport | 8 | 4.01×10^-4 | 0.081888 |
| GO:0042775 - Mitochondrial ATP synthesis | 9 | 4.64×10^-4 | 0.086660 |
| GO:0042773 - ATP synthesis-coupled electron transport | | | |
Figure 4
Biological pathways and corresponding numbers of feature genes. ATP, adenosine triphosphate.
Discussion
In the present study, a total of 585 feature genes were identified to be differentially expressed between FLT3/ITD mutation-positive and wild-type AML samples from the training group (two data sets). Two methods, SVM and RF, were adopted to classify AML samples from the training group and the test group (two further data sets). The accuracy rates were >90% using either method on either group of data sets. SVM produced a slightly more accurate classification than RF. It was indicated that the feature genes identified in the present study may be used to predict the FLT3/ITD mutation status in patients with AML. Functional enrichment analysis was also performed for the feature genes. Energy metabolism, chromatin organization and translation were significantly overrepresented.
Mitochondria are important organelles regulating the energy levels, metabolism and apoptosis in cells, which can in turn affect cell differentiation and proliferation. Therefore, mitochondria have important roles in the pathogenesis of AML (20). Inhibition of mitochondrial translation has been suggested as a potential therapeutic strategy for AML (21). Yamaguchi et al (22) reported that a mutation in IDH1, which has an important role in the citrate cycle, has an adverse effect in patients with AML.
Several genes associated with chromatin organization also participate in the development of AML. SUZ12 encodes a subunit of polycomb repressive complex 2, which was shown to drive aberrant self-renewal in a mouse model of AML (23). Tiacci et al (24) found that BCORL1 has a role in AML. Zagaria et al (25) reported that dysregulation of the BCOR gene in AML is due to chromosomal translocation. RUVBL2 is a critical mediator of oncogenesis caused by the MLL-AF9 fusion gene and is a potential therapeutic target for MLL-AF9-associated leukemia (26). In addition, Sroczynska et al (27) found that JMJD1C is required for leukemia maintenance, and that depletion of JMJD1C impaired the expansion and colony formation of human leukemic cell lines. Amplification of TOP2A was identified in myelodysplastic syndrome transforming to AML (28). DAPK3 was indicated to have a role in the induction of apoptosis, and CpG island methylation of this gene, leading to its dysregulation, is implicated in AML (29).
Translation was also significantly overrepresented in the feature genes identified by the present study. Wang et al (30) indicated that silencing of RPS14 inhibits the proliferation of AML cells via activating p53. RPS15, RPS16, RPS9 and other members of the RPS family may exert similar roles. In addition, EIF2α and EIF4E have been implicated in AML (31,32). The roles of EIF3B, EIF3K, EIF3L and EIF1B in AML may be worth investigating.
In conclusion, the present study identified a number of feature genes that may be used to distinguish FLT3/ITD mutation-positive AML samples from FLT3 wild-type samples. Several of the feature genes identified have been previously implicated in AML. The computational tools developed in the present study may aid in the clinical detection of FLT3/ITD mutation-positive AML for possible early and targeted treatment of these patients.