Literature DB >> 35919062

Deep learning for prediction of isocitrate dehydrogenase mutation in gliomas: a critical approach, systematic review and meta-analysis of the diagnostic test performance using a Bayesian approach.

Mert Karabacak¹, Burak Berksu Ozkara¹, Seren Mordag², Sotirios Bisdas³.

Abstract

Background: Conventionally, identifying isocitrate dehydrogenase (IDH) mutation in gliomas is based on histopathological analysis of tissue specimens acquired via stereotactic biopsy or definitive resection. Accurate pre-treatment prediction of IDH mutation status using magnetic resonance imaging (MRI) can guide clinical decision-making. We aim to evaluate the diagnostic performance of deep learning (DL) to determine IDH mutation status in gliomas.
Methods: A systematic search of Cochrane Library, Web of Science, Medline, and Scopus was conducted to identify relevant publications until August 1, 2021. Articles were included if all the following criteria were met: (I) patients with histopathologically confirmed World Health Organization (WHO) grade II, III, or IV gliomas; (II) histopathological examination with the IDH mutation; (III) DL was used to predict the IDH mutation status; (IV) sufficient data for reconstruction of confusion matrices in terms of the diagnostic performance of the DL algorithms; and (V) original research articles. Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) and Checklist for Artificial Intelligence in Medical Imaging (CLAIM) was used to assess the studies' quality. Bayes theorem was utilized to calculate the posttest probability.
Results: Four studies with a total of 1,295 patients were included. In the training set, the pooled sensitivity, specificity, and area under the summary receiver operating characteristic (SROC) curve were 93.9%, 90.9% and 0.958, respectively. In the validation set, the pooled sensitivity, specificity, and area under the SROC curve were 90.8%, 85.5% and 0.939, respectively. With a known pretest probability of 80.2%, the Bayes theorem yielded a posttest probability of 97.6% and 96.0% for a positive test and 27.0% and 30.6% for a negative test for training sets and validation sets, respectively. Discussion: This is the first meta-analysis that summarizes the diagnostic performance of DL in predicting IDH mutation status in gliomas via the Bayes theorem. DL algorithms demonstrate excellent diagnostic performance in predicting IDH mutation in gliomas. Radiomic features associated with IDH mutation, and its underlying pathophysiology extracted from advanced MRI may improve prediction probability. However, more studies are required to optimize and increase its reliability. Limitations include obtaining some data via email and lack of training and test sets statistics. 2022 Quantitative Imaging in Medicine and Surgery. All rights reserved.

Entities: Chemical

Keywords: Deep learning (DL); gliomas; isocitrate dehydrogenase (IDH); magnetic resonance imaging (MRI); radiomics

Year: 2022 PMID： 35919062 PMCID： PMC9338374 DOI： 10.21037/qims-22-34

Source DB: PubMed Journal: Quant Imaging Med Surg ISSN： 2223-4306

Introduction

Gliomas with isocitrate dehydrogenase (IDH) mutation show a better prognosis, overall survival, and response to chemotherapy than IDH wild-type gliomas, independent of histological grade (1-4). On the other hand, IDH wild-type gliomas may show comparable survival outcomes to glioblastomas with similar molecular and clinical characteristics (5). At present, identification of IDH mutation is based on histopathological analysis of tissue specimens acquired via stereotactic biopsy or definitive resection. Immunohistochemistry and deoxyribonucleic acid (DNA) sequencing techniques have been used to identify IDH genotype (6,7). Nevertheless, the procedures for obtaining tumor tissue are invasive and have potential risks (8). Additionally, spatial and temporal alterations in genetic expression cause intratumoral heterogeneity of gliomas, creating a possibility for non-representative tissue samples (9). This possibility of sampling error may lead to pitfalls in determining tumor grade and tumor mutation status. Furthermore, molecular genetic testing has other impracticalities. Immunohistochemistry misses about 15% of all IDH mutations, whereas numerous alterations which have no impact on IDH enzyme activity may be detected by sequencing (10). Considering the beforementioned disadvantages of conventional methods for determining the IDH mutation status, research has been done to predict the IDH mutation status of gliomas noninvasively with imaging studies (11,12). The emerging field of radiology that works on obtaining molecular and genetic information from imaging studies is known as radiogenomics. It involves a series of qualitative and quantitative analyses to predict genetic and molecular properties (13). The treatment regimen differs based on IDH mutation status in gliomas (14). Therefore, an accurate pre-treatment prediction with radiogenomics of IDH mutation status can guide clinical decision-making. Magnetic resonance imaging (MRI) is done routinely in glioma workup. It helps to assess the entire tumor and the surrounding brain tissue noninvasively. Considering the intratumoral heterogeneity and the possibility of the inadequate sampling of tumor specimens, radiogenomic information from MRI studies holds the potential to be integrated into routine clinical practice with the superiority of obtaining information from the entire tumor. Artificial intelligence (AI) may help extracting imaging features that are incomprehensible to the human eye and associate them with outcomes. Machine learning (ML), a subgroup method of AI, is used for analyzing the correlation between radiomic features and genetic information. Predictive performance can further be enhanced, including clinical and demographic information, such as age, sex, and performance status (15). ML was reported to have a pooled sensitivity and specificity of 88% and 87%, respectively, for IDH mutation prediction in a recent meta-analysis by Zhao et al. (15). ML methods used in classical radiomic approaches utilize the extraction of predefined features, selection of these features, and application of ML techniques for outcome association and molecular and genetic status prediction (16). This pipeline is simplified and, at the same time, enhanced by using deep learning (DL) (17). In DL, radiomic features are extracted without human predefinition, unlike ML. After each round of training, the model’s internal parameters are recalibrated by the back-propagation algorithm (18). Although several reviews and meta-analyses were recently published on the diagnostic performance of ML and DL algorithms in IDH mutation prediction, none of them has quantitatively evaluated the diagnostic performance of DL methods with the Bayes theorem (12,15,19,20). With this study, we aimed to perform a systematic review and a meta-analysis of DL algorithms’ diagnostic performance in predicting the IDH mutation status of gliomas. We also utilized the Bayes theorem to calculate the posttest probability using likelihood ratios and predetermined pretest probabilities. We present the following article in accordance with the PRISMA-DTA reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-22-34/rc) (21).

Methods

The present meta-analysis is exempt from ethical approval of the Institutional Review Board since the analysis only involves de-identified data and all the included prospective studies have received local ethics approval. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Literature search

A comprehensive literature search was performed based on the following combination of Medical Subject Headings (MeSH) terms and keywords for a PubMed database search: ((“deep learning” [tw] OR “neural network*” [tw] OR “Deep Learning”[Mesh]) AND (radiogen* [tw] OR radiomic*[tw]) AND (IDH [tw] OR “isocitrate dehydrogenase” OR “Isocitrate Dehydrogenase” [Mesh]) AND (glioma [tw] OR “Glioma” [Mesh])). Corresponding keywords were used for Cochrane Library, Web of Science, Medline, and Scopus data search. The last search was conducted on August 1, 2021. The reference lists of all included studies were checked manually in order to identify other relevant papers.

Study selection

Two authors [M Karabacak and BB Ozkara (3 and 1 year of experience in performing systematic reviews and meta-analyses, respectively)] independently evaluated the eligibility of the articles, and any disagreement was resolved via discussion with a third author (S Bisdas, 18 years of experience in neuroradiology). Articles were included based on the fulfillment of all the following criteria: (I) patients with histopathologically confirmed World Health Organization (WHO) grade II, III, or IV gliomas; (II) histopathological examination with the IDH mutation; (III) DL was applied to predict the IDH mutation status; (IV) sufficient data for reconstruction of confusion matrices (2×2 tables) in terms of the diagnostic performance of the DL algorithms; (V) original research articles. Corresponding authors of the studies that did not include sufficient data for reconstruction of confusion matrices but fulfilled the rest of the inclusion criteria were contacted via email to inquire if their study originally contained sufficient data. Studies were included if their corresponding author supplied us with the data via email within two weeks of receiving our email. Articles were excluded if they fulfilled any of the following criteria: (I) reviews, letters, guidelines, editorials, or errata; (II) ML was applied to predict the IDH mutation status; (III) IDH mutation status was not predicted; (IV) insufficient data for the reconstruction of confusion matrices; (V) studies with overlapping cohorts.

Data extraction

Data were collected by the two authors (M Karabacak and BB Ozkara) for the following variables: (I) study characteristics (author, year, country, number of patients, age, sex, and distribution of tumor WHO grades); (II) MRI sequences used in DL algorithms; (III) information included in DL algorithms; (IV) number of patients whose IDH mutation status predicted with DL; and (V) number of patients who had molecular analysis results that revealed IDH mutation.

Quality assessment

The quality assessments were conducted by two authors (M Karabacak and BB Ozkara) independently according to the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) and Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) (22,23). The CLAIM is a new checklist with 42 items for assessing AI studies in medical imaging. Studies were scored 0 or 1 on a 2-point scale for each item. The CLAIM score was calculated by summing the scores of each item per study. An overall CLAIM compliance score was defined for each article by the ratio of applicable items fulfilled. The items were weighted equally. Four domains were evaluated with QUADAS-2: (I) patient selection; (II) index test; (III) reference standard; and (IV) flow and timing. The patient selection domain provides questions concerning methods of patient selection. The index test domain includes questions about the index test and how it was conducted and interpreted. The reference standard domain provides questions regarding the reference standard and how it was performed and analyzed. The flow and timing domain investigate whether any patients did not receive the index test or reference standard or were excluded from the confusion matrices. Concerns about applicability and bias risk were rated as low, high, or unclear on a 3-point scale. Any disagreements during the quality assessment were resolved through discussion or with the assistance of a third reviewer.

Statistical analysis

The primary outcome was to evaluate the diagnostic performance of DL algorithms for predicting IDH mutation status in patients with gliomas. Confusion matrices with true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN) values were separately calculated for training and validation sets within each study. Zero-point-five was added to prevent the zero cell count problem if any TP, FP, TN, or FN value was 0 (24). Separate meta-analyses were computed for training and validation sets. All statistical analyses were conducted using R version 3.4.1 (R Foundation for Statistical Computing), implementing R packages mada and meta (25). An alpha level of 0.05 was considered statistically significant. A bivariate random-effects model was used to obtain the pooled sensitivity and specificity with their 95% confidence intervals (CIs) (26). The diagnostic odds ratio (DOR), defined as the odds of having a positive IDH mutation status prediction by DL algorithms in patients with IDH mutant gliomas compared to the odds of having a positive IDH mutation status prediction by DL algorithms in patients with IDH-wild type gliomas, was also obtained. Sensitivity, specificity, and DOR are expressed by forest plots. Summary receiver operating characteristics (SROC) curves were generated for the overall diagnostic accuracy, and the area under the curve (AUC) was calculated (27). Test performance accuracy was categorized as low (AUC, 0.50–0.69), moderate (AUC, 0.70–0.89), or high (AUC, 0.90–1.00) (28). It becomes less likely that someone with a positive test will actually have the tested condition and more likely that someone with a positive test will not actually have the tested condition as the prevalence of a specific disease decreases (29). To address this problem, we utilized the Bayes’ theorem in our meta-analysis based on the concept that diagnostic accuracy of a test depends on the prevalence of the tested condition in the test population, as well as the test characteristics such as sensitivity and specificity. Probabilities were calculated as pretest odds × likelihood ratio +/− = posttest odds, where odds were substituted with probabilities as odds = probability/(1 – probability) (30). For pretest probabilities, previous genome-wide association studies were used to calculate the overall percentage of IDH mutant gliomas with an overall 80.2% (5,31). Fagan nomograms were plotted to visualize the relationship between pretest probability, the likelihood ratio, and the posttest probability. Heterogeneity across all eligible studies was estimated using Q-test with P<0.05, indicating the presence of study heterogeneity and I2 statistics. I2 values were defined as follows: heterogeneity that might not be important (0–25%), low heterogeneity (26–50%), moderate heterogeneity (51–75%), and high heterogeneity (76–100%) (32). Publication bias was not assessed in our analysis, as the small number of studies included in our meta-analysis (n=4) may lead to inconclusive funnel plots and regression tests for detecting publication bias (33).

Results

The study selection process is illustrated in . The initial literature search yielded 54 articles: 19 from Medline, 17 from Scopus, 17 from Web of Science, and one from Cochrane library, databases, respectively. After removing 29 duplicate articles, the remaining 25 were screened based on their title and abstract, and eight were excluded. Full texts of the remaining 17 articles were obtained and reviewed.

Figure 1

The study selection process. n, number; IDH, isocitrate dehydrogenase.

The study selection process. n, number; IDH, isocitrate dehydrogenase. Fifteen articles were excluded because; they used ML to predict the IDH mutation status (n=7), did not predict IDH mutation status (n=2), had an overlapping patient population with one of the other included articles (n=3), and had insufficient data for the reconstruction of confusion matrices (n=3). Three articles were obtained and reviewed when the references provided in the selected articles were also screened. One article was again excluded because it had an overlapping patient population with one of the other included articles. Finally, four original articles that included 1,295 patients with glioma were eventually included and analyzed in this study. Among the four studies, three (34-36) included data for validation and training sets, while the remaining one (37) only had data for the validation set. The sample sizes for the training and validation sets were 1,007 and 437, respectively. A quality assessment summary of the included studies using the CLAIM is shown in . Three studies had an item marked as “not applicable” (35-37). The mean CLAIM score of the four studies was 25.25 with a standard deviation of 5.56 (range, 20.00–31.00). The mean CLAIM compliance score of the four studies was 0.61 with a standard deviation of 0.13 (range, 0.49–0.74).

Table 1

CLAIM assessment

Study	Title/abstract (N=2)	Introduction (N=2)	Methods (N=28)	Results (N=5)	Discussion (N=2)	Other information (N=3)	Total (N=42)	CLAIM compliance
Chang et al.	2	2	22	2	2	1	31	0.74
Choi et al.	2	2	20 (1 N/A)	2	2	1	29 (1 N/A)	0.71
Li et al.	1	2	15 (1 N/A)	0	1	1	20 (1 N/A)	0.49
Matsui et al.	2	2	12 (1 N/A)	2	2	1	21 (1 N/A)	0.51

CLAIM, Checklist for Artificial Intelligence in Medical Imaging; N, number; N/A, not applicable.

CLAIM, Checklist for Artificial Intelligence in Medical Imaging; N, number; N/A, not applicable. A quality assessment summary of the included studies using the QUADAS-2 tool is shown in . Regarding patient selection, one study indicated an unclear risk of bias as they failed to mention the inclusion criteria of patient enrollment (37). Regarding the index test, one study was considered to have an unclear risk of bias, as they did not mention the details of the DL algorithm (36). One study presented a high risk of bias regarding reference standards (34). Chang et al. combined three different cohorts in their study (34). In one cohort, only the IDH 1 R132H mutation was detected in the reference test, while both IDH 1 and 2 mutations were detected in the other two cohorts. One study was assessed to have unclear concerns regarding the applicability of the index test since the details of the DL algorithm were not mentioned (36). Two studies were considered to have high concerns regarding the applicability of the reference standards (34,36). Matsui et al. and Chang et al. detected only IDH R132 and IDH R132H, respectively, which did not match the review question exactly (34,36).

Figure 2

Quality assessment with QUADAS-2. QUADAS-2, Quality Assessment of Diagnostic Accuracy Studies-2.

Characteristics of included studies

The patient and study characteristics are described in . In terms of the applied MRI sequences, three studies used conventional [T1-weighted imaging (T1WI), contrast-enhanced T1WI (T1CE), T2-weighted imaging (T2WI), fluid-attenuated inversion recovery (FLAIR)] MRI sequences (34,36,37), one used dynamic susceptibility contrast-MRI (DSC-MRI) (35). For DL analysis, two studies included only DL extracted radiomics information (35,37), and two studies used radiomics and clinical information (34,36). One study included only age (34), and one included age, sex, and tumor location (36).

Table 2

Study characteristics

Authors	Year	N of Pts	Mean age in years	Sex (M:F)	WHO grade	MRI sequence	DL algorithm	Information used in DL	IDH mutation ratio (%)
Li et al. (37)	2017	119	40.7	89:30	Grade II	T1CE, FLAIR	Convolutional neural network	MRI data	74.0
Chang et al. (34)	2018	496	52.0	259:237	Grade II, III, IV	T1, T1CE, T2, FLAIR	Residual neural network	MRI data, age	47.2
Choi et al. (35)	2019	463	52.2	272:191	Grade II, III, IV	DSC perfusion MRI	Recurrent neural network	MRI data	27.0
Matsui et al. (36)	2020	217	42.0	131:86	Grade II, III	T1, T2, FLAIR	Residual neural network	MRI data, age, sex, and tumor position	77.0
Total	2017–2019	1,295	49.4	751:544	Grade II, III, IV	–	–	–	47.4

N, number; Pts, patients; M, male; F, female; WHO, World Health Organization; MRI, magnetic resonance imaging; DL, deep learning; IDH, isocitrate dehydrogenase; T1CE, T1-weighted contrast-enhanced; FLAIR, fluid attenuated inversion recovery; DSC, dynamic susceptibility contrast.

Assessment of diagnostic performance

Training sets

The sensitivities and specificities in the training sets of the individual included studies ranged from 85.1% to 98% and from 88.9% to 100%, respectively. The pooled sensitivity and specificity for the diagnostic performance of DL algorithms for prediction of IDH mutation status were 93.9% (95% CI: 84.1–97.8%; ) and 90.9% (95% CI: 88.2–93.0%; ), respectively. A DOR of 206.607 (95% CI: 20.400–2,092.500; ) was recorded; this value indicates that the likelihood of the distinction of IDH-mutant gliomas from IDH-wild type gliomas was approximately 207 times higher in training sets using DL. The large CI observed was the result of the small number of included studies. The area under the SROC curve was 0.958, showing high test performance accuracy. depicts the SROC curve with the point estimate and associated 95% confidence region for pooled sensitivity/specificity pairs. When the pretest probability was set at 80.2%, the positive posttest probability was 97.6% and the negative posttest probability was 27.0% (). The Q-test demonstrated that heterogeneity was present across the studies (Q=11.71, P=0.0029), and the Higgins I2 statistic demonstrated the presence of high heterogeneity in sensitivity (I2=87.2%, 95% CI: 63.6–95.5%) and low heterogeneity in specificity (I2=36.8%, 95% CI: 0.0–79.9%).

Figure 3

Forest plots of the deep learning algorithms in training sets. (A) Pooled sensitivity. (B) Specificity. (C) Diagnostic odds ratio. CI, confidence interval; OR, odds ratio.

Figure 4

Summary receiver operating characteristic curve of the diagnostic performance of deep learning algorithms for prediction of IDH mutation status in training sets. SROC, summary receiver operating characteristic; CI, confidence interval; IDH, isocitrate dehydrogenase.

Figure 5

Fagan’s nomogram for calculation of positive and negative posttest probabilities of IDH mutation status prediction by deep learning algorithms in training sets. IDH, isocitrate dehydrogenase.

Forest plots of the deep learning algorithms in training sets. (A) Pooled sensitivity. (B) Specificity. (C) Diagnostic odds ratio. CI, confidence interval; OR, odds ratio. Summary receiver operating characteristic curve of the diagnostic performance of deep learning algorithms for prediction of IDH mutation status in training sets. SROC, summary receiver operating characteristic; CI, confidence interval; IDH, isocitrate dehydrogenase. Fagan’s nomogram for calculation of positive and negative posttest probabilities of IDH mutation status prediction by deep learning algorithms in training sets. IDH, isocitrate dehydrogenase.

Validation sets

The sensitivities and specificities in the validation sets of the individual included studies ranged from 75.6% to 94.4% and from 67.3% to 93.1%, respectively. The pooled sensitivity and specificity for the diagnostic performance of DL algorithms for prediction of IDH mutation status were 90.8% (95% CI: 86.9–93.6%; ) and 85.5% (95% CI: 76.6–91.4%; ), respectively. A DOR of 53.200 (95% CI: 21.780–129.944; ) was recorded; this value indicates that the likelihood of the distinction of IDH-mutant gliomas from IDH-wild type gliomas was approximately 53 times higher in validation sets using DL. depicts the SROC curve with the point estimate and associated 95% confidence region for pooled sensitivity/specificity pairs. When the pretest probability was set at 80.2%, the positive posttest probability was 96.0% and the negative posttest probability was 30.6% (). The Q-test demonstrated that heterogeneity was not present across the studies (Q=4.31, P=0.2297), and the Higgins I2 statistic demonstrated the presence of heterogeneity that might not be important in sensitivity (I2=0.0%, 95% CI: 0.0–84.7%) and low heterogeneity in specificity (I2=38.8%, 95% CI: 0.0–79.1%).

Figure 6

Forest plots of deep learning algorithms in validation sets. (A) Pooled sensitivity. (B) Specificity. (C) Diagnostic odds ratio. CI, confidence interval; OR, odds ratio.

Figure 7

Summary receiver operating characteristic curve of the diagnostic performance of deep learning algorithms for prediction of IDH mutation status in validation sets. SROC, summary receiver operating characteristic; CI, confidence interval; IDH, isocitrate dehydrogenase.

Figure 8

Fagan’s nomogram for calculation of positive and negative posttest probabilities of IDH mutation status prediction by deep learning algorithms in validation sets. IDH, isocitrate dehydrogenase.

Forest plots of deep learning algorithms in validation sets. (A) Pooled sensitivity. (B) Specificity. (C) Diagnostic odds ratio. CI, confidence interval; OR, odds ratio. Summary receiver operating characteristic curve of the diagnostic performance of deep learning algorithms for prediction of IDH mutation status in validation sets. SROC, summary receiver operating characteristic; CI, confidence interval; IDH, isocitrate dehydrogenase. Fagan’s nomogram for calculation of positive and negative posttest probabilities of IDH mutation status prediction by deep learning algorithms in validation sets. IDH, isocitrate dehydrogenase.

Discussion

To our knowledge, this is the first systematic review and meta-analysis that summarizes the diagnostic performance of DL in the prediction of IDH mutation status in gliomas via the Bayes theorem. Our study revealed that DL models perform well in predicting IDH mutation in gliomas, with a pooled sensitivity and specificity of 93.9% (95% CI: 84.1–97.8%) and 90.9% (95% CI: 88.2–93.0%) in the training sets, and 90.8% (95% CI: 86.9–93.6%) and 85.5% (95% CI: 76.6–91.4%) in the validation sets, respectively. The AUC value was reported in three studies for validation sets and ranged from 0.95 to 0.98. The area under the SROC curve is a diagnostic accuracy index. The diagnostic results are better when it is closer to 1. The area under the SROC curve for training and validation sets was 0.958 and 0.939, indicating high test performance accuracy. Furthermore, more cautious estimates were derived from the Bayes theorem with a pre-established pretest probability of 80.2% and still demonstrated a high positive posttest probability in the training (97.6%) and validation sets (96.0%). This means that the probability of the patient having the disease increases from 80.2% to 97.7% with a positive test result based on the results of the training sets and to 96.0% based on the results of the validation sets. Again, calculations were derived from the Bayes theorem with a pre-established pretest probability of 80.2%, revealing a negative posttest probability of 27.0% and 30.6% in the training and validation sets, respectively. This means that the probability of the patient having the disease decreases from 80.2% to 27.0% with a negative test result based on the results of the training sets and to 30.6% based on the results of the validation sets. The Bayesian method for meta-analyses has the advantage of accounting for the uncertainty around the heterogeneity variance, making it a key strength of our work (38). Among the four studies included in this meta-analysis, Choi et al. demonstrated the highest positive posttest probability (98.2%) (35). This could be partly due to the utilization of DSC-MRI in the model. Compared to wild-type tumors, hypoxia-inducible-factor 1-alpha activity is lower in IDH mutant tumors. This contributes to a distinct transcriptome signature associated with vasculogenesis and angiogenesis-related signaling pathways, which leads to increased proangiogenic molecules in IDH wild-type tumors (39). Perhaps, the model of Choi et al. achieved a great diagnostic performance by utilizing DSC perfusion-weighted MRI, which is more suitable than conventional MRI for demonstrating the tumor vasculature of gliomas. Prior multi-centered studies have also shown the diagnostic accuracy of ML-assisted DSC-MRI radiomics (40,41). However, the lack of broad clinical dissemination of advanced MRI sequences in most hospital imaging protocols and the needed expertise for the data post-processing and interpretation makes this approach difficult to be widely implemented in clinical practice. It is worth noting that, because the majority of glioblastomas are IDH wild type, datasets containing a mix of low-grade glioma (LGG) and glioblastomas may suggest a higher level of accuracy than the true accuracy in purely LGG cohorts. The reason is that when the model detects a glioblastoma, it may conclude that it is an IDH wild-type tumor even if the IDH genotype has not been predicted (36). Glioblastoma patients were included in two studies. Choi et al. included 289 glioblastoma patients (62.4%), and Chang et al. included 233 glioblastoma patients (47%) in their studies (34,35). Several noninvasive methods, both utilizing and not utilizing DL/ML techniques, have been evaluated as predictors of IDH mutation in gliomas. Goyal et al. demonstrated that the T2-FLAIR-mismatch sign was not sensitive but very specific in predicting IDH mutation in gliomas (pooled sensitivity, 32.1%; pooled specificity, 100%) (42). T2-FLAIR mismatch sign is a radiologic signature in which a hyperintense signal on T2WI and a relatively suppressed signal on FLAIR imaging is combined. It does not involve any ML or DL technique, and it can be detected by the naked eye. Goyal et al. also used the Bayes theorem with a pretest probability of 80.2% to generate more conservative estimates, a positive posttest probability of 99.2%, and a negative posttest probability of 73.5% (42). In another meta-analysis, Suh et al. showed that the pooled sensitivity and specificity of 2-hydroxyglutarate magnetic resonance spectroscopy in predicting IDH mutation, without any ML or DL techniques, was 95% and 91%, respectively (43). In another study, Suh et al. also assessed the performances of various conventional and advanced imaging modalities in the prediction of IDH mutation, which consisted of conventional MRI, diffusion-weighted imaging/perfusion-weighted imaging, magnetic resonance spectroscopy, 2-hydroxyglutarate magnetic resonance spectroscopy, and radiomics methods (11). Pooled sensitivity was 86%, and the pooled specificity was 87%. Furthermore, Zhao et al. evaluated the diagnostic accuracy of ML in predicting IDH mutations in a recent meta-analysis (15). The pooled sensitivity and specificity were 87% and 88% in the training set, and 87% and 90% in the validation set, respectively. DL is a growing field of study that focuses on creating artificial neural networks (ANN) inspired by biological neural networks in the human brain (44). ANN are used in some of the most successful DL approaches. ANNs, by learning and capturing the information contained in the data, recognize complicated nonlinear correlations between dependent and independent variables (45). Choi et al. used a recurrent neural network that can process sequence inputs and take past outputs as inputs using their internal memory (35). Li et al. used a convolutional neural network (CNN), which evaluates the 2D structure of input data using local connections and weights, followed by pooling techniques to derive spatial invariant features (37). In the other two studies, Matsui et al. and Chang et al. used residual neural networks, which employ identical shortcut connections to let information flow between layers without the attenuation that several stacked nonlinear transformations would produce (34,36). DL algorithms do not require an intermediary feature extraction or an engineering phase to learn the relationship between the input and the appropriate labels. DL combines the processes of data representation and prediction. These are possibly the most significant advantages of DL algorithms compared to conventional ML algorithms (44). However, in medicine, there are some concerns and challenges with DL algorithms. DL models are often referred to as “black boxes” because they are very recursive. According to Rudin, this is causing interpretability issues, which makes clinicians hesitant (46). However, significant progress has been made in developing algorithms that may be able to open the “black box” of DL for a range of deep neural networks (44). Another concern with DL models, when combined with radiomics, is that current models still lack repeatability and validation. There are currently no guidelines for radiomic features or the construction of clinical models using these features (47). There are no evaluation criteria or standardized data collection. Perhaps, these are some of the challenges that need to be resolved to appreciate DL’s potential and value in medicine fully. The method of identifying the region of interest (ROI) on MRI scans is known as segmentation or labeling (37). Tumor tissue is differentiated from normal surrounding tissues during glioma segmentation. The segmented ROI can then be used to extract radiomic data, which is very time-consuming and highly variable between operators when done manually (48). Automatic tumor segmentation methods have been developed to improve tumor segmentation and overcome these difficulties. In a meta-analysis, van Kempen et al. demonstrated that tumor segmentation based on ML algorithms showed an overall dice similarity coefficient score of 0.84, with high heterogeneity (80.4%) (48). In the included studies in our review, several segmentation methods were used. Li et al. used a CNN (37). They used it to compete in the Brain Tumor Segmentation Challenge (BRATS) (49), where it was placed first and second in BRATS 2013 and BRATS 2015, respectively (37). Choi et al. also used a fully automatic segmentation method based on a CNN algorithm with a manual correction that was the second-placed method in the international BRATS 2017 (35). Matsui et al. and Chang et al. manually segmented the ROIs in their studies (34,36). This meta-analysis demonstrated that DL is promising in detecting IDH mutations in gliomas. Even though an increasing number of papers being published in recent years, little of them has been translated into clinical practice, indicating the need for several improvements in study design before these methods can be implemented in clinical practice (50). Firstly, standardized data collection, evaluation criteria, and reporting guidelines are required to establish the generalizability and validation of the findings (44). More research for developing clinically oriented DL models with high interpretability is needed to alleviate clinicians’ concerns. According to some authors, current model interpretation methods such as saliency mapping and class activation are unreliable (46,51). As a result, Rudin et al. recommended developing and applying inherently interpretable models (46). Chen et al. proposed a method in which their model distinguishes several parts of an image where it thinks a part of the image mimics a prototypical part of a bird species and then identifies the species of birds based on a weighted combination of the similarity scores between parts of the image and the trained prototypes (52). As a result, their model is interpretable, with a transparent reasoning process similar to human reasoning when making classifications. We believe that inherently interpretable models, such as the one mentioned above, are required to implement DL models into clinical practice by increasing clinician trust in the model. Furthermore, to integrate radiomic models into clinical practice, clinical trials must demonstrate improvements in patient management and decision making (47). Nonetheless, our study revealed that DL models have a high potential for predicting IDH mutations in gliomas. Upon successful improvements, they may be integrated into clinical practice at a low cost compared to conventional methods for detecting IDH mutation and a reliable diagnostic tool. Our study is not without limitations. Some study data were obtained via email, which reduces their reliability. However, the main limitation is the lack of statistics for the training and test sets. As a result, the statistics of the validation sets were the primary focus of our research. Finally, three studies in the literature were excluded due to a lack of data required to reconstruct a confusion matrix. It is worth noting that, even though it is not a limitation of our study, all the included studies had the limitation of being conducted retrospectively, necessitating the need for additional large-scale prospective studies to validate their findings.

Conclusions

This study revealed that DL algorithms demonstrate excellent diagnostic performance in predicting IDH mutation in gliomas, with an overall 96.0% positive posttest probability in validation sets. Radiomic features related to IDH mutation, and its underlying pathophysiology extracted from advanced MRI, such as perfusion-weighted sequences, may further increase the prediction probability and support large-scale, prospective trials to distill the diagnostic and clinical added value of such DL models in neuro-oncology. The article’s supplementary files as

48 in total

Review 1. Detection of oncogenic IDH1 mutations using magnetic resonance spectroscopy of 2-hydroxyglutarate.

Authors: Ovidiu C Andronesi; Otto Rapalino; Elizabeth Gerstner; Andrew Chi; Tracy T Batchelor; Dan P Cahill; A Gregory Sorensen; Bruce R Rosen
Journal: J Clin Invest Date: 2013-09-03 Impact factor: 14.808

2. Chest Radiographs in Congestive Heart Failure: Visualizing Neural Network Learning.

Authors: Jarrel C Y Seah; Jennifer S N Tang; Andy Kitchen; Frank Gaillard; Andrew F Dixon
Journal: Radiology Date: 2018-11-06 Impact factor: 11.105

3. Prediction of IDH genotype in gliomas with dynamic susceptibility contrast perfusion MR imaging using an explainable recurrent neural network.

Authors: Kyu Sung Choi; Seung Hong Choi; Bumseok Jeong
Journal: Neuro Oncol Date: 2019-05-24 Impact factor: 12.300

4. Radiological features combined with IDH1 status for predicting the survival outcome of glioblastoma patients.

Authors: Kai Wang; Yinyan Wang; Xing Fan; Jiangfei Wang; Guilin Li; Jieling Ma; Jun Ma; Tao Jiang; Jianping Dai
Journal: Neuro Oncol Date: 2015-09-25 Impact factor: 12.300

5. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies.

Authors: Penny F Whiting; Anne W S Rutjes; Marie E Westwood; Susan Mallett; Jonathan J Deeks; Johannes B Reitsma; Mariska M G Leeflang; Jonathan A C Sterne; Patrick M M Bossuyt
Journal: Ann Intern Med Date: 2011-10-18 Impact factor: 25.391

6. IDH1 or IDH2 mutations predict longer survival and response to temozolomide in low-grade gliomas.

Authors: C Houillier; X Wang; G Kaloshi; K Mokhtari; R Guillevin; J Laffaire; S Paris; B Boisselier; A Idbaih; F Laigle-Donadey; K Hoang-Xuan; M Sanson; J-Y Delattre
Journal: Neurology Date: 2010-10-26 Impact factor: 11.800

7. Magnetic Resonance Imaging-guided Stereotactic Biopsy: A Review of 83 Cases with Outcomes.

Authors: Ali Akay; Mete Rüksen; Sertaç Islekel
Journal: Asian J Neurosurg Date: 2019 Jan-Mar

8. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews.

Authors: Matthew J Page; Joanne E McKenzie; Patrick M Bossuyt; Isabelle Boutron; Tammy C Hoffmann; Cynthia D Mulrow; Larissa Shamseer; Jennifer M Tetzlaff; Elie A Akl; Sue E Brennan; Roger Chou; Julie Glanville; Jeremy M Grimshaw; Asbjørn Hróbjartsson; Manoj M Lalu; Tianjing Li; Elizabeth W Loder; Evan Mayo-Wilson; Steve McDonald; Luke A McGuinness; Lesley A Stewart; James Thomas; Andrea C Tricco; Vivian A Welch; Penny Whiting; David Moher
Journal: BMJ Date: 2021-03-29

9. IDH mutation status is associated with a distinct hypoxia/angiogenesis transcriptome signature which is non-invasively predictable with rCBV imaging in human glioma.

Authors: Philipp Kickingereder; Felix Sahm; Alexander Radbruch; Wolfgang Wick; Sabine Heiland; Andreas von Deimling; Martin Bendszus; Benedikt Wiestler
Journal: Sci Rep Date: 2015-11-05 Impact factor: 4.379

Review 10. Performance of machine learning algorithms for glioma segmentation of brain MRI: a systematic literature review and meta-analysis.

Authors: Evi J van Kempen; Max Post; Manoj Mannil; Richard L Witkam; Mark Ter Laan; Ajay Patel; Frederick J A Meijer; Dylan Henssen
Journal: Eur Radiol Date: 2021-05-21 Impact factor: 5.315