Ophir Gal, Noam Auslander, Yu Fan, Daoud Meerzaman.
Abstract
Machine learning (ML) is a useful tool for advancing our understanding of the patterns and significance of biomedical data. Given the growing trend toward applying ML techniques in precision medicine, here we present an ML technique that predicts the likelihood of complete remission (CR) in patients diagnosed with acute myeloid leukemia (AML). In this study, we explored whether ML algorithms designed to analyze gene-expression patterns obtained through RNA sequencing (RNA-seq) can accurately predict the likelihood of CR in pediatric AML patients who have received induction therapy. We employed tests of statistical significance to determine which genes were differentially expressed between samples from patients who achieved CR after 2 courses of treatment and samples from patients who did not benefit. We tuned classifier hyperparameters to optimize performance and used multiple methods to guide both our feature selection and our assessment of algorithm performance. To identify the model that performed best within the context of this study, we plotted receiver operating characteristic (ROC) curves. Using the top 75 genes with the k-nearest neighbors (K-NN) model (K = 27) yielded the best area-under-the-curve (AUC) score that we obtained: 0.84. When we finally evaluated the model on the previously unseen test data set, the top 50 genes yielded the best AUC: 0.81. Pathway enrichment analysis of these 50 genes showed that the guanosine diphosphate fucose (GDP-fucose) biosynthesis pathway is the most significant, with an adjusted P value of .0092, which may suggest a vital role for N-glycosylation in AML.
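The idea of tuning K for a K-NN classifier by comparing AUC scores can be illustrated with a minimal sketch. It rests on assumptions not taken from the study: the data are synthetic Gaussian "expression" vectors, K is chosen on a single held-out validation split, and all names (`knn_scores`, `roc_auc`, `make_toy_data`) are hypothetical helpers written for this illustration, not the authors' pipeline.

```python
import math
import random


def knn_scores(train_X, train_y, test_X, k):
    """Score each test sample as the fraction of its k nearest training
    neighbors (Euclidean distance) that carry the positive label 1."""
    scores = []
    for x in test_X:
        nearest = sorted(
            (math.dist(x, t), label) for t, label in zip(train_X, train_y)
        )[:k]
        scores.append(sum(label for _, label in nearest) / k)
    return scores


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_toy_data(n, n_genes, shift, rng):
    """Synthetic stand-in for expression data (an assumption for this
    sketch): class 1 is shifted by `shift` in every feature."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(shift * label, 1.0) for _ in range(n_genes)])
        y.append(label)
    return X, y


rng = random.Random(0)
train_X, train_y = make_toy_data(120, 5, 1.0, rng)
valid_X, valid_y = make_toy_data(60, 5, 1.0, rng)

# Sweep odd values of K and keep the one with the highest validation AUC.
auc_by_k = {
    k: roc_auc(valid_y, knn_scores(train_X, train_y, valid_X, k))
    for k in range(1, 40, 2)
}
best_k = max(auc_by_k, key=auc_by_k.get)
print(best_k, round(auc_by_k[best_k], 3))
```

In a setting like the one described, the K chosen this way would then be fixed before the model is scored once on a separate, previously unseen test set, so that the reported test AUC is not biased by the tuning step.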
Keywords: acute myeloid leukemia (AML); gene expression profiling; machine learning (ML); remission induction
Year: 2019 PMID: 30911218 PMCID: PMC6423478 DOI: 10.1177/1176935119835544
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Area under the curve for different values of K, used to estimate an optimal K for the K-NN classifier. AUC indicates area under the curve; K-NN, k-nearest neighbors algorithm.
Figure 2. Receiver operating characteristic curves of K-NN (with the optimal K = 27) using 2 feature selection methods: (A) Hill Climbing and (B) Randomized Lasso. K-NN indicates k-nearest neighbors algorithm; ROC, receiver operating characteristic; FS, feature selection; HC, Hill Climbing; R.LASSO, Randomized Lasso.
Figure 3. Final K-NN model performance on test data (N = 83). ROC indicates receiver operating characteristic.
Figure 4. Expressions of top 50 genes with the best AUC scores from K-NN with Hill Climbing. AML indicates acute myeloid leukemia; AUC, area under the curve; CR, complete remission; K-NN, k-nearest neighbors algorithm; NBM, normal bone marrow; NCR, non-complete remission; RPKM, reads per kilobase per million.