
Predicting Complete Remission of Acute Myeloid Leukemia: Machine Learning Applied to Gene Expression.

Ophir Gal¹, Noam Auslander²,³, Yu Fan⁴, Daoud Meerzaman⁴.

Abstract

Machine learning (ML) is a useful tool for advancing our understanding of the patterns and significance of biomedical data. Given the growing trend toward applying ML techniques in precision medicine, here we present an ML technique that predicts the likelihood of complete remission (CR) in patients diagnosed with acute myeloid leukemia (AML). In this study, we explored whether ML algorithms designed to analyze gene-expression patterns obtained through RNA sequencing (RNA-seq) can be used to accurately predict the likelihood of CR in pediatric AML patients who have received induction therapy. We employed tests of statistical significance to determine which genes were differentially expressed between samples from patients who achieved CR after 2 courses of treatment and samples from patients who did not benefit. We tuned classifier hyperparameters to optimize performance and used multiple methods to guide our feature selection as well as our assessment of algorithm performance. To identify the model that performed best within the context of this study, we plotted receiver operating characteristic (ROC) curves. Using the top 75 genes from the k-nearest neighbors algorithm (K-NN) model (K = 27) yielded the best area-under-the-curve (AUC) score that we obtained: 0.84. When we finally tested the model on the previously unseen test data set, the top 50 genes yielded the best AUC = 0.81. Pathway enrichment analysis for these 50 genes showed that the guanosine diphosphate fucose (GDP-fucose) biosynthesis pathway is the most significant, with an adjusted P value = .0092, which may suggest a vital role of N-glycosylation in AML.


Keywords: acute myeloid leukemia (AML); gene expression profiling; machine learning (ML); remission induction

Year: 2019 PMID: 30911218 PMCID: PMC6423478 DOI: 10.1177/1176935119835544

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

RNA sequencing (RNA-seq) and other high-throughput next-generation sequencing platforms have emerged as powerful approaches for discovering pathogenic pathways and potential targets for clinical intervention in patients with acute myeloid leukemia (AML).[1] Using whole-transcriptome sequencing, our previous work compared the profiles of core-binding factor acute myeloid leukemia (CBF-AML) cases with those characterized by normal karyotype (NK), illuminating similarities and differences with respect to gene-expression signatures and splicing events as well as RNA fusions that typify the inv(16) vs the t(8;21) AML subtypes.[2] In concert with the rise of large-scale omics-oriented sequencing, machine-learning (ML) algorithms have increasingly been applied to gene-expression analysis aimed at classifying tumors, predicting survival, identifying therapeutic targets, and classifying genes according to function.[3-7] Significant results have been shown for predicting outcomes of large B-cell lymphoma[8] and hepatitis B virus–positive metastatic hepatocellular carcinomas,[9] as well as for documenting diverse pathologic responses to chemotherapy in patients with breast cancer.[10] Using gene-expression profiling of data generated by microarrays in conjunction with both supervised and unsupervised learning, Bullinger et al[11] identified prognostic subclasses in adult AML; the research group also constructed an optimal 133-gene predictor of overall survival. Yeoh et al[12] performed classification, subtype discovery, and outcome prediction in patients with pediatric acute lymphoblastic leukemia (ALL). However, no previous study has specifically addressed expression differences among large cohorts of pediatric and young-adult AML patients with regard to complete remission (CR). In this study, we compare pre-treatment gene-expression profiles using 3 supervised learning algorithms to discover predictors of CR.

Materials and Methods

We obtained 473 bone marrow specimens from 473 patients, children and young adults aged between 8 days and 28 years, who had been diagnosed with de novo AML. For comparison, we acquired an additional 20 bone marrow specimens from normal, healthy individuals. All samples were obtained with written consent from the parents/guardians of minors, from the Children’s Oncology Group clinical trial AAML1031. The Institutional Review Board at Fred Hutchinson Cancer Research Center has reviewed and approved this study. It is filed under Institutional Review File #9950 (Biology of the Alterations of the Signal Transduction Pathway in Pediatric Cancer). The number of samples with clinical information regarding CR used in this study was 414. RNA sequencing was performed on all 493 samples using the Illumina HiSeq2000 platform (https://www.illumina.com). Reads were mapped to Ensembl Gene IDs (http://useast.ensembl.org/), which belong to 31 biotypes, including protein-coding sequences, non-coding sequences, and pseudogenes, among others. RPKM (reads per kilobase per million mapped reads) values were calculated for each gene. Genes with at least 1 count per million (CPM) in at least 3 samples were retained. Quantile normalization was applied across all samples. Commonly used statistical models and algorithms from the Python library scikit-learn (http://scikit-learn.org/stable/) were used directly in our scripts. Gene set enrichment analysis (GSEA) was performed using the online tool Enrichr (http://amp.pharm.mssm.edu/Enrichr/), as well as our in-house OmicPath (v 0.1) R package. Violin plots showing gene-expression distribution patterns were generated using the in-house OmicPlot (v 0.1) R package.
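The filtering and normalization steps described above can be sketched as follows. This is a minimal illustration on synthetic counts, not the authors' pipeline: the function names, the synthetic matrix, and the choice to filter on raw counts before normalizing are all assumptions made for the example.

```python
import numpy as np

def cpm_filter(counts, min_cpm=1.0, min_samples=3):
    """Boolean mask of genes with CPM >= min_cpm in at least min_samples samples.

    counts: array of shape (n_genes, n_samples) holding raw read counts.
    """
    library_sizes = counts.sum(axis=0)        # total reads per sample
    cpm = counts / library_sizes * 1e6        # counts per million
    return (cpm >= min_cpm).sum(axis=1) >= min_samples

def quantile_normalize(expr):
    """Give every sample (column) the same empirical distribution.

    Ties within a sample are broken arbitrarily by sort order.
    """
    order = np.argsort(expr, axis=0)          # per-sample sort order
    ranks = np.argsort(order, axis=0)         # rank of each gene within its sample
    mean_sorted = np.sort(expr, axis=0).mean(axis=1)  # reference distribution
    return mean_sorted[ranks]

# Synthetic stand-in data (hypothetical): 100 genes x 6 samples.
rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(100, 6))
counts[:10] = 0                               # ten unexpressed genes to be dropped
keep = cpm_filter(counts)
normalized = quantile_normalize(counts[keep].astype(float))
```

After normalization, sorting any two columns of `normalized` gives identical values, which is the defining property of quantile normalization.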

Feature selection

Principal components analysis (PCA) was performed to examine the general pattern of the data, remove outliers, and select algorithms appropriate for our data. RNA sequencing expression data of m samples by n genes were used as inputs to learn the mapping to CR status. Samples were divided into a training set (N = 331) and a test set (N = 83). Three classifiers, k-nearest neighbors algorithm (K-NN), support vector machine (SVM), and random forest (RF), were applied to select features for the training set via 5-fold cross-validation. With the features selected, the classifier was tested on the same training set (N = 331). The classifier with the best performance was then tested on the remaining test set (N = 83).
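As a rough sketch of this setup, the snippet below runs PCA and a 331/83 split with scikit-learn on synthetic data. The random matrix, the number of genes, the stratified split, and the fixed seed are assumptions for illustration; the source does not say how the split was drawn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical): 414 samples x 500 genes with random labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(414, 500))
y = rng.integers(0, 2, size=414)              # 1 = CR, 0 = NCR

# First two principal components, e.g. for a scatter plot colored by CR status.
pcs = PCA(n_components=2).fit_transform(X)

# Hold out 83 samples as the test set; stratifying by label is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=83, stratify=y, random_state=0
)
```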

K-NN classifier

We performed 100 iterations of a 5-fold cross-validation. In each fold, we first carried out a t test for initial feature selection to identify the 100 most statistically significant genes, ie, those that were the most differentially expressed between the CR (positive class) and non-complete remission (NCR) (negative class). We found that using more than 100 genes did not improve performance. For further feature selection out of the genes identified by t testing, we compared the performance of 2 algorithms: Hill Climbing[13] (sequential feature addition) and Randomized Lasso[14] (using the model’s feature weights as ranks and selecting the highest ranking feature). At each fold, an area under the curve (AUC) was computed using a selected subset of genes and the fold’s validation set. Following the 100 iterations, the features (genes) were ranked by the average of AUCs computed using those genes across different folds. Essentially, the genes that on average helped yield the best AUCs were ranked highest.
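One way to picture a single pass of this procedure is sketched below: a t test filter inside each fold, followed by greedy forward selection (Hill Climbing) scored by validation AUC. It is a simplified sketch under stated assumptions, not the authors' code: the helper names, the synthetic data, K = 7, and the small candidate and gene limits are hypothetical, and the 100 repetitions and rank aggregation are omitted.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def top_genes_by_ttest(X, y, n_top=100):
    """Indices of the n_top genes with the smallest two-sample t test P values."""
    _, pvals = ttest_ind(X[y == 1], X[y == 0], axis=0)
    return np.argsort(pvals)[:n_top]

def knn_auc(X_tr, y_tr, X_va, y_va, genes, k):
    """Validation AUC of a K-NN model restricted to the given gene indices."""
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr[:, genes], y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va[:, genes])[:, 1])

def hill_climb(X_tr, y_tr, X_va, y_va, candidates, k, max_genes=10):
    """Greedy forward selection: add the gene that most improves AUC, stop when none does."""
    selected, best = [], 0.0
    while len(selected) < max_genes:
        scores = {g: knn_auc(X_tr, y_tr, X_va, y_va, selected + [g], k)
                  for g in candidates if g not in selected}
        if not scores:
            break
        gene, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break
        selected.append(gene)
        best = score
    return selected, best

# Synthetic stand-in (hypothetical): 120 samples x 300 genes, first 5 genes informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 300))
y = rng.integers(0, 2, size=120)
X[y == 1, :5] += 1.5

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = []
for tr, va in cv.split(X, y):
    # The t test sees only the training part of the fold.
    candidates = top_genes_by_ttest(X[tr], y[tr], n_top=20)
    genes, auc = hill_climb(X[tr], y[tr], X[va], y[va], candidates, k=7, max_genes=5)
    fold_aucs.append(auc)
```

Because the greedy search is scored on the same validation fold that reports the AUC, the per-fold values are optimistic, which is one reason a separate held-out test set matters.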

SVM classifier

To overcome the issue of class imbalance, downsampling was applied,[15-17] ie, a smaller subset of 114 samples (N(CR) = 57, N(NCR) = 57; 91 for the training set and 23 for the test set) was used as input for the SVM classifier. Processes similar to those described above for K-NN were applied to SVM classifiers with 1 exception: for the second feature selection step, we used a third method, Recursive Feature Elimination, in addition to Hill Climbing and Randomized Lasso.
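The downsampling and Recursive Feature Elimination steps might look roughly like this with scikit-learn. The helper `downsample_majority`, the synthetic 200-vs-57 class sizes, the linear kernel, and the target of 10 genes are assumptions chosen for the example; only the balanced 57/57 subset comes from the text.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

def downsample_majority(X, y, seed=0):
    """Randomly drop samples so every class keeps as many samples as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]

# Synthetic imbalanced stand-in (hypothetical class sizes): 200 CR vs 57 NCR, 50 genes.
rng = np.random.default_rng(0)
y = np.array([1] * 200 + [0] * 57)
X = rng.normal(size=(y.size, 50))
X[y == 1, :5] += 1.0

X_bal, y_bal = downsample_majority(X, y)

# RFE needs per-feature weights, so a linear-kernel SVM is used here (an assumption).
selector = RFE(SVC(kernel="linear"), n_features_to_select=10, step=5).fit(X_bal, y_bal)
kept_genes = np.flatnonzero(selector.support_)
```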

RF classifier

We trained RF classifiers using scikit-learn’s ensemble.RandomForestClassifier module. To select parameters for the RF classifier, we performed a grid search for the following parameters: number of trees (estimators), maximum number of features, and maximum tree depth. The remaining parameters were set to their default values. Then, the optimal parameters in terms of AUC were selected together with the best performing feature selection approach. For feature selection, we performed a comparison between 2 approaches: (1) nested 5-fold with built-in RF feature selection—we trained the classifier on 4/5 of the training set using the built-in “feature importance” attribute to rank the features (genes). Those genes were then used a second time on the same 4/5 of the training set. We then tested the classifier on the remaining 1/5 of the training set (ie, validation set) to assess performance. (2) We carried out 100 iterations of a 5-fold cross-validation while aggregating the feature importance values computed at each fold. We then computed a Spearman correlation between each gene’s importance values and the AUC computed in each fold. We used the genes with the highest correlation scores to train on the same 4/5 training set and then tested the method on the remaining validation set.
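The grid search and the built-in importance ranking from approach (1) can be sketched as below. The parameter values in the grid and the synthetic data are hypothetical, since the source names the three tuned parameters but not the values searched; the Spearman-based approach (2) is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in (hypothetical): 150 samples x 40 genes, first 4 genes informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 40))
y = rng.integers(0, 2, size=150)
X[y == 1, :4] += 1.0

# The three tuned parameters come from the text; the candidate values are assumptions.
param_grid = {
    "n_estimators": [25, 100],
    "max_features": ["sqrt", 0.5],
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",   # select parameters by cross-validated AUC
    cv=5,
)
search.fit(X, y)

# Rank genes by the refit model's built-in feature importance, highest first.
importances = search.best_estimator_.feature_importances_
ranked_genes = np.argsort(importances)[::-1]
```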

Following feature selection

At the end of each feature selection, the same cross-validation procedure was employed to generate the AUC results when testing the validation set. The final AUC result of the (chosen) K-NN classifier was a simple, 1-episode period of training on the training set with the selected genes followed by testing on the unseen test set with the same selected genes.

Results

According to PCA based on all genes, these 414 AML samples with clinical information regarding CR did not cluster by CR or NCR status, nor by age/year of diagnosis. There were no obvious outliers, so all samples were included in this study. Area-under-the-curve results from different K values were used to estimate the optimal K for the K-NN classifier. Figure 1 shows that statistically significant genes identified by the t test can help improve the AUC results and that K = 27 yielded the best average AUC. With the optimal K = 27, receiver operating characteristic (ROC) curves were produced using 2 feature selection methods: Hill Climbing and Randomized Lasso (Figure 2). Overall, Hill Climbing gave better results, with the best AUC = 0.84.
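Choosing K by average cross-validated AUC, as summarized in Figure 1, could be sketched like this. The synthetic data, the grid of odd K values, and the fold seed are assumptions; on such data the selected K is not expected to be 27.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in (hypothetical): 200 samples x 20 genes, first 3 genes informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)
X[y == 1, :3] += 1.0

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
k_grid = range(1, 52, 2)                      # odd K values avoid voting ties

# Mean 5-fold AUC for each candidate K.
mean_auc = {
    k: cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=cv, scoring="roc_auc"
    ).mean()
    for k in k_grid
}
best_k = max(mean_auc, key=mean_auc.get)
```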

Figure 1.

Area under the curves from different Ks used to estimate an optimal K value for K-NN classifier. AUC indicates area under the curve; K-NN, k-nearest neighbors algorithm.

Figure 2.

Receiver operating characteristic curves of K-NN (with the optimal K = 27) using 2 feature selection methods: (A) Hill Climbing and (B) Randomized Lasso. K-NN indicates k-nearest neighbors algorithm; ROC, receiver operating characteristic; FS, feature selection; HC, Hill Climbing; R.LASSO, Randomized Lasso.

To compare the performance of the K-NN and SVM classifiers, the balanced data set with N(CR) = 57 and N(NCR) = 57 was split into a training set (N = 91) and a test set (N = 23). Using 5-fold cross-validation performed on the training set, ROC curves of the K-NN and SVM algorithms were calculated using 3 feature-selection methods: Hill Climbing, Recursive Feature Elimination, and Randomized Lasso. K-NN outperformed SVM, and Hill Climbing still gave better AUC results for K-NN (Supplemental Figure S1). Hyperparameter tuning for RF suggested that 100 trees gave the best performance (AUC = 0.74). The simpler feature selection approach gave better results (best training set AUC = 0.73) than the more complex approach (Supplemental Figure S2). Based on the above observations, K-NN with Hill Climbing performed the best on the training data (N = 331), yielding an AUC score of 0.84. When we tested this model on the remaining 1/5 of the data (N = 83), using the top 50 genes with the best AUC scores from the training set yielded an AUC score of 0.81 (Figure 3).

Figure 3.

Final K-NN model performance on test data (N = 83). ROC indicates receiver operating characteristic.

Using these top 50 genes, our GSEA with OmicPath showed that BATF (basic leucine zipper transcriptional factor ATF-like) and RAC2 (Ras-related C3 botulinum toxin substrate 2) are related to a decreased IgM (immunoglobulin M) level, with FDR (false discovery rate) = 0.0073. TSTA3 (GDP-L-fucose synthase) and RAC2 are related to an increased neutrophil cell number (FDR = 0.0073). Pathway enrichment analysis using Enrichr showed that TSTA3 and FPGT (fucose-1-phosphate guanylyltransferase) were mapped to the GDP-fucose biosynthesis pathway (Reactome 2016; https://reactome.org) with an adjusted P value of .0092. These 2 genes were also mapped to the pathway’s parent terms “Synthesis of substrates in N-glycan biosynthesis” and “Biosynthesis of the N-glycan precursor (dolichol lipid-linked oligosaccharide, LLO) and transfer to a nascent protein.” This suggests a vital role of N-glycosylation in AML pathology and patient prognosis. The expression of these top 50 genes compared with normal bone marrow (NBM) samples is shown in a violin plot (Figure 4).
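To give a sense of how an over-representation P value of this kind arises, the sketch below computes a one-sided hypergeometric tail probability for 2 of 50 selected genes falling in one pathway. The pathway size (6) and background size (20000) are hypothetical placeholders, not values from the study, and the result is an unadjusted P value that is not expected to reproduce the reported adjusted figure.

```python
from scipy.stats import hypergeom

def enrichment_pvalue(overlap, list_size, pathway_size, background_size):
    """P(X >= overlap) when list_size genes are drawn at random from the background."""
    # sf(k) is P(X > k), so pass overlap - 1 to include the observed overlap.
    return float(hypergeom.sf(overlap - 1, background_size, pathway_size, list_size))

# Hypothetical sizes: 2 of 50 selected genes in a 6-gene pathway, 20000-gene background.
p = enrichment_pvalue(overlap=2, list_size=50, pathway_size=6, background_size=20000)
```

In practice the raw P values for all tested pathways would then be corrected for multiple testing, which is why the text reports an adjusted P value.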

Figure 4.

Expressions of top 50 genes with the best AUC scores from K-NN with Hill Climbing. AML indicates acute myeloid leukemia; AUC, area under the curve; CR, complete remission; K-NN, k-nearest neighbors algorithm; NBM, normal bone marrow; NCR, non-complete remission; RPKM, reads per kilobase per million.

Discussion

This study explored and evaluated different ML algorithms for predicting CR in AML patients based on their pre-treatment gene-expression signatures. It revealed a significant underlying genetic difference between patients with contrasting outcomes following treatment. Gene set enrichment analysis results highlighted specific biological features that carry prognostic value for further exploration. For example, low IgM and leukocyte count >50 × 10⁹/liter have been demonstrated as 2 of the adverse predictors for the duration of complete continuous remission in childhood ALL.[18] Fucose-containing glycans play important roles in selectin-mediated leukocyte-endothelial adhesion as well as various immunity and signaling processes. Alterations in expression or structure of fucosylated oligosaccharides have also been observed in cancer pathology. Mice with conditionally impaired fucosylated glycan expression exhibited altered myeloid development, including aberrant proliferation of myeloid progenitors and an increased production of granulocytes, which leads to neutrophilia. The loss of AB blood group antigen expression along with increases in H and Lewis y expression are associated with poor prognosis. Increased expression of Lewis x/a structures, Tn/sialyl-Tn/T antigens, and β1,6 GlcNAc branching of N-linked core structures were observed in advanced cancers and associated with poor prognosis.[19-22] This information may help physicians select more suitable courses of treatment, whether the treatment be more aggressive chemotherapy or an altogether novel alternative therapy.

Supplemental material, Predict_CR_AML_Supplemental_xyz13921555f86f3 for Predicting Complete Remission of Acute Myeloid Leukemia: Machine Learning Applied to Gene Expression by Ophir Gal, Noam Auslander, Yu Fan and Daoud Meerzaman in Cancer Informatics
