Literature DB >> 31234143

A systematic review on machine learning in sellar region diseases: quality and reporting items.

Abstract

INTRODUCTION: Machine learning methods in sellar region diseases present a particular challenge because of the complexity and the necessity for reproducibility. This systematic review aims to compile the current literature on sellar region diseases that utilized machine learning methods and to propose a quality assessment tool and reporting checklist for future studies.
METHODS: PubMed and Web of Science were searched to identify relevant studies. The quality assessment included five categories: unmet needs, reproducibility, robustness, generalizability and clinical significance.
RESULTS: Seventeen studies were included with the diagnosis of general pituitary neoplasms, acromegaly, Cushing's disease, craniopharyngioma and growth hormone deficiency. 87.5% of the studies arbitrarily chose one or two machine learning models. One study chose ensemble models, and one study compared several models. 43.8% of studies did not provide the platform for model training, and roughly half did not offer parameters or hyperparameters. 62.5% of the studies provided a valid method to avoid over-fitting, but only five reported variations in the validation statistics. Only one study validated the algorithm in a different external database. Four studies reported how to interpret the predictors, and most studies (68.8%) suggested possible clinical applications of the developed algorithm. The workflow of a machine-learning study and the recommended reporting items were also provided based on the results.
CONCLUSIONS: Machine learning methods were used to predict diagnosis and posttreatment outcomes in sellar region diseases. Though most studies had substantial unmet need and proposed possible clinical application, replicability, robustness and generalizability were major limits in current studies.

Entities: Chemical

Keywords: artificial intelligence; craniopharyngioma; growth; pituitary; prediction

Year: 2019 PMID： 31234143 PMCID： PMC6612064 DOI： 10.1530/EC-19-0156

Source DB: PubMed Journal: Endocr Connect ISSN： 2049-3614 Impact factor: 3.335

Introduction

Studies using machine learning methods gained popularity in medical researches in recent years. Machine learning methods integrate computer-based algorithms into data analysis to find similar patterns among different samples. The ultimate goal aims at using multiple variables to predict a specific outcome in a particular cohort. There are two types of machine learning algorithms in general: supervised and unsupervised machine learning. In supervised machine learning, both of the predictors and outcome are known; but in unsupervised machine learning, only the predictors are fed into the algorithm. The most common type of tumors originated in the sellar region included pituitary neoplasm, craniopharyngioma, meningioma and chordoma, which overall takes more than 10–15% of tumors in the central nervous system (1). Other non-tumorous sellar region diseases include Rathke’s cyst, hypophysitis, hypopituitarism and all the complications due to the treatment of these diseases (2). Machine learning may help to build a more reliable aided diagnostic tool for neuroradiologists and neuropathologists. Better prediction of clinical outcomes in these patients may provide better clinical decision support for either neuroendocrinologists or neurosurgeons. While on the other hand, machine learning methods present a particular challenge because of the complexity in model training and testing. The reproducibility of scientific research has always been of critical importance, which also applies in machine learning studies (3). With the expansion of machine learning in medical studies, the applications in real clinical decision making are booming, which requires both robustness and generalizability (4, 5). This systematic review aims to compile the current literature on sellar region diseases that utilized different machine learning methods and analyze the reporting items regarding cohort selection, model building and model explanation. Unlike traditional statistical methods, risks of bias and confounding are not the main question of interest in machine learning studies. How to assess the quality of these studies remains unsolved, and a reporting guideline was not available for these studies to follow. This review presents a quality assessment tool and proposes a checklist of reporting items for studies built on machine learning methods.

Methods

Literature for this review was identified by searching PubMed and Web of Science from the date of the first available article to December 1, 2018. The keywords containing ‘machine learning’ or the algorithms of machine learning were queried with the combination of keywords containing sellar region diseases (Supplementary Table 1, see section on supplementary data given at the end of this article). The search was limited to studies published in English. References in published reviews were manually screened for possible inclusions. The study adheres to the PRISMA guideline, and the checklist was provided in Supplementary Table 2. Studies were included if they evaluated machine learning algorithms (logistic regression with regulation, linear discriminant analysis, k means, k nearest neighbor, cluster analysis, support vector machine, decision tree-based models and neural networks) for application in prediction of disease originated in the sellar region (both tumorous and non-tumorous diseases). Exclusion criteria were lack of full-text or animal studies. Data obtained from each study were publication characterizes (first author’s last name, publication time), cohort selection (sample size, diagnosis), predictors (variables fed into the machine learning models), outcomes (the outcomes as well as the controls, including the distributions between them), model selection (models used in the study, including platforms, packages and parameters), statistics for model performance (methods to evaluated the model, the mean and the variance) and model explanation (any explanation on how important of each predictors and proposed clinical application). Supplements in each study were also reviewed if available. Quantitative synthesis was inappropriate due to the heterogeneity in outcomes. Summary of included studies was listed in a table and using a narrative approach. The proposed quality assessment (Table 1) of each study consists of five categories: unmet need (limits in current non-machine-learning approach), reproducibility (feature engineering methods, platforms/packages, hyperparameters), robustness (valid methods to overcome over-fit, the stability of results), generalizability (external data validation) and clinical significance (predictors explanation and suggested clinical use). A quality assessment table was provided by listing ‘yes’ or ‘no’ of corresponding items in each category.

Table 1

Quality assessment of machine learning studies.

Categories	Items	Description	Reported
Unmet need	Limits in current non-machine-learning approach	Low diagnostic accuracy, low human-level prediction accuracy or prolonged diagnostic procedure	Yes/no
Reproducibility	Feature engineering methods	How features were generated before model training	Yes/no
	Platforms/packages	Both platforms and packages should be reported	Yes/no
	Hyperparameters	All hyperparameters which are needed for study replication	Yes/no
Robustness	Valid methods to overcome over-fit	Leave-one-out or k-fold cross-validation or bootstrap	Yes/no
Robustness	The stability of results	Calculated variation in the validation statistics	Yes/no
Generalizability	External data validation	Validation in settings different from the research framework	Yes/no
Clinical significance	Predictors explanation	Explanation of the importance of each predictor	Yes/no
Clinical significance	Suggested clinical use	Proposed possible applications in clinical care	Yes/no

Quality assessment of machine learning studies. To provide a clear picture of how to perform a machine learning study, the workflow of a machine learning study was summarized, and notations of terms used in these machine learning studies were provided. The recommended reporting items were also provided based on the results.

Results

After scrutinizing the titles and abstracts generated by the searching strategy, 31 articles left for full-text screening, in which 13 studies were excluded: one study not in English, two duplicated studies, four conference abstracts without full-text and six studies without outcomes in sellar region diseases. Three studies used the same image database such that only the latest published study was included. At last, this systematic review included 16 studies (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21) (Table 2), with the diagnosis of general pituitary neoplasms, acromegaly, Cushing’s disease, craniopharyngioma and growth hormone deficiency. More than half of the studies were published in the recent 2 years.

Table 2

Summary of studies on sellar region disease using machine learning methods.

	Cohort selection		Predictors	Outcomes		Models (parameters)	Performance statistics
	Sample size	Diagnosis	Predictors	Outcomes and controls	Distribution	Models (parameters)	Discrimination	CV method	Variation in validation
Learned-Miller 2006 (6)	49	Acromegaly	Parameters from 3D shape of face	Acromegaly/healthy	24:25	SVM (linear or quadratic kernel)	Acc: 85.7%	LOOCV	NA
Kitajima 2009 (7)	43	Sellar mass	Age and 9 MRI features	Pituitary adenoma/craniopharyngioma/Rathke’s cyst	20:11:12	NN (FC(7)*1)	AUC: 0.990	LOOCV	NA
Lalys 2011 (8)	500	Pituitary adenoma	Features in surgical images	Six surgical phases: nasal incision/retract/tumor removal/column replacement/suture/nose compress	NA	SVM (linear kernel), HMM	Acc: 87.6%	10-fold CV	s.d.: 2.4%
Hu 2012 (9)	68	Pituitary adenoma	9 serum proteins	NFPA healthy	34:34	Decision tree (Gini index)	Sen: 82.4%Spe: 82.4%	10-fold CV	NA
Steiner 2012 (10)	15	Pituitary adenoma	Spectrum from histology	GH+/GH−/non-tumor cells	1000:1000:1000	k means (k = 10), LDA	Acc: 85.3%	LOOCV	s.d.: 10.5%
Calligaris 2015 (11)	45	Pituitary adenoma	Protein signature in mass spectrometry from histology	ACTH pituitary tumor/GH pituitary tumor/PRL pituitary tumor/pituitary gland	6:9:9:6	SVM	Sen: 83.0%Spe: 93.0%	NA	NA
Paul 2017 (12)	233	Brain tumors	Pixels in MRI images	Meningioma/glioma/pituitary tumor	208:492:289	CNN ((Cov(64)-Max)2 + FC(800)2), NN, SVM	Acc: 94.0%	5-fold CV	s.d.: 4.5%
Kong 2018 (13)	1123	Acromegaly	Features in photos	Acromegaly/healthy	527:596	Ensemble	Acc: 95.5%	NA	NA
Zhang 2018 (14)	112	Pituitary adenoma	Features in MRI images	Null cell adenoma/other subtypes	46:66	SVM (radial kernel)	AUC:0.804Acc: 81.1%	Bootstrap	NA
Murray 2018 (15)	124	Growth hormone deficiency	Age, sex, IGF1, gene expressions	Growth hormone deficiency/healthy	98:26	RF	AUC:0.990	Out-of-bag (3-fold CV)	NA
Yang 2018 (16)	168	Craniopharyngioma	Expression levels of signature genes	Craniopharyngioma/other brain or brain tumor samples	24:144	SVM (radial kernel)	AUC:0.850	NA	NA
Hollon 2018 (17)	400	Pituitary adenoma	26 patient’s characteristics	Poor early postoperative outcome/good	124:276	Elastic net, NB, SVM, RF	Acc: 87.0%	NA	NA
Staartjes 2018 (18)	140	Pituitary adenoma	patient characteristics, MRI features	Gross-total resection/not	95:45	NN (FC(5)*NA)	AUC: 0.96Acc: 90.9%	5-fold CV without holdout	s.d.:0.08%
Kocak 2018 (19)	47	Acromegaly	Features in MRI images	Response to somatostatin analogs/resistant	24:23	k-NN (k = 5)	Acc: 85.1%AUC: 0.847	10-fold CV	s.d.: 1.5%
Ortea 2018 (20)	30	Growth hormone deficiency	Three serum proteins	Growth hormone deficiency/healthy	15:15	RF, SVM	Acc: 100%AUC: 1.000	Bootstrap	NA
Smyczynska 2018 (21)	272	Growth hormone deficiency with GH treatment	Patient characteristics, GH level, IGF-1 level, GH dose	Height change after GH treatment	0.66 ± 0.57	NN (FC(2)*1)	RMSE: 0.267	NA	NA

Acc, accuracy; ACTH, adrenocorticotropic hormone; AUC, area under curve; BoVW, bag-of-visual-word; CNN, convolutional neural network; Cov, convolutional layer; CV, cross-validation; FC, fully-connected neural network; GH, growth hormone; HMM, hidden Markov model; IGF1, insulin-like growth hormone 1; LDA, linear discriminant analysis; LOOCV, leave-one-out cross-validation; Max, max pooling layer; MRI, magnetic resonance image; NA, not available; NB, naïve Bayesian; NFPA, non-functional pituitary adenoma; NN, neural network; PRL, prolactin; RF, random forest; RMSE, root mean square error; s.d., standard deviation; Sen, sensitivity; Spe, specificity; SVM, support vector machine.

Summary of studies on sellar region disease using machine learning methods. Acc, accuracy; ACTH, adrenocorticotropic hormone; AUC, area under curve; BoVW, bag-of-visual-word; CNN, convolutional neural network; Cov, convolutional layer; CV, cross-validation; FC, fully-connected neural network; GH, growth hormone; HMM, hidden Markov model; IGF1, insulin-like growth hormone 1; LDA, linear discriminant analysis; LOOCV, leave-one-out cross-validation; Max, max pooling layer; MRI, magnetic resonance image; NA, not available; NB, naïve Bayesian; NFPA, non-functional pituitary adenoma; NN, neural network; PRL, prolactin; RF, random forest; RMSE, root mean square error; s.d., standard deviation; Sen, sensitivity; Spe, specificity; SVM, support vector machine. The scheme of a machine learning study was summarized in Fig. 1. The process can be categorized into four stages when developing a prediction model. The first step is to bring out the clinical question, which is summarized as ‘predicting Outcome using Predictors in a Cohort’. A study should choose the appropriate outcome, predictors and the data source. The data are then pre-processed, which can involve data coding, transformations, imputation and dimension reduction. The training step means how the model (algorithm) finds patterns from the features to match the outcome variable. The trained model should be validated (both internally and externally). Finally, models are explained, and possible clinical applications are provided. The notations of terms used in these machine learning studies were described in Table 3.

Figure 1

Table 3

Notations of special machine learning terms.

Terms	Explanations
Unsupervised learning	A subgroup of machine leaning models with the purpose of finding similarities among samples where no outcomes are available
Supervised learning	A subgroup of machine leaning models with both predictors and outcomes, and the purpose is to learn the mapping function from the predictors to the outcomes
Feature	Predictors in a machine learning algorithm
Categorization	Transforming a continuous variable into a categorical variable
One-hot encoding	Using a vector (all the elements of the vector are 0 except one) to re-code a categorical variable
Standardization	Rescale data to a specific range, e.g., dividing by mean or dividing by standard deviation
Normalization	Transforming unnormalized data into normalized data, e.g., logarithm transformation
Over-fit	The established model corresponds too exactly to the training dataset, and may therefore fail to predict future unseen observations
Imputation	Assigning the value of a missing data, e.g., using the mean of the existing data
Dimension reduction	Representing the original data with lesser dimensions
Training	The learning process of the data pattern by a model
LASSO	Least Absolute Shrinkage and Selection Operator: A regression analysis method that performs both variable selection and regularization
SVM	Support Vector Machine: Finding the best hyperplane to separate data in a high dimensional space
Naïve Bayes	A simple probabilistic classifier based on Bayes’ theorem
kNN	k Nearest Neighbor: Classification of a sample according to the distance to other samples in the multidimensional space
Neural network	A family of models inspired by biological neural networks
Tree	A tree-like graph model of decisions and their possible consequences
Ensemble	Combining several different models, calculating predictions from these models and then those predictions are used as weighted inputs into another regression model for the ultimate prediction
Parameters	Coefficients of a model formula that need to be learned from the data
Hyperparameters	All the configuration variables of a model which are often set manually by the practitioner
Validation	Calculating performance of a trained model in a separated dataset
Discrimination	The ability of a model to separate individuals in multiple classes
Calibration	How well a model’s predicted probabilities concur with the actual probabilities
Cross-validation	First, the data is partitioned into k (5 or 10) equally sized parts randomly with one part as the validation dataset and others as the training dataset. This process is repeated for k times with each of the subsamples used exactly once as the validation dataset
Leave-one-out	Leaving one sample out each time and training the model on the remaining samples. The process is repeated multiple times till all the samples are “leave-outed” once
Bootstrapping	Randomly sampled data from the whole original data (patients can be sampled multiple times) can be used to create new data. Training and validation are based on the new data, and the resampling process is repeated multiple times
Robust	The stability of a model in cross-validation or in sensitivity analysis
Feature importance	How much the accuracy decreases when the feature is excluded

The scheme of a machine learning study. The process can be categorized into four steps: a good clinical question; pre-processed data; training and validation of the model and significance in clinical applications. Notations of special machine learning terms. Sample size in these studies varied from tens to thousands. The majority of the studies (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 20) (76.5%) used the diagnosis of a specific disease as the outcome, only four studies (17, 18, 19, 21) tested on the treatment outcome. In the diagnostic studies, three studies (7, 12, 14) used image features to categorize magnetic resonance images (MRIs), two (6, 13) used face photos to predict acromegaly, two (15, 20) predicted growth home deficiency using serum proteins, two (10, 11) used histological spectrum to predict histology diagnosis, one (9) used serum proteins to predict pituitary adenoma and one (8) predicted surgical phase using videos. In studies on treatment outcomes, one study (17) predicted poor early postoperative outcome, one (18) predicted gross-total resection, one (19) predicted response to somatostatin analogs and one (21) predicted growth after growth hormone treatment. All the outcomes were either dichotomized or categorical outcomes except one in the continuous form (21). Most of the studies (87.5%) arbitrarily chose one or two machine learning models without arguing the reasons. One study (13) chose ensemble models by combining the decisions from multiple models to improve the overall performance. One study (17) compared several models and chose the one with the best performance. With regard to validation methods, five studies (8, 9, 12, 15, 19) used k-fold cross-validation, two studies (14, 20) used bootstrap and three studies (6, 7, 10) used leave-one-out cross-validation. One study (18) used cross-validation but without holdout and five studies (11, 13, 16, 17, 21) did not report the validation method. In studies reporting validation methods, only five (8, 10, 12, 18, 19) reported the variation of the validation statistics. In the quality assessment (Table 4), limits in current non-machine-learning approach were mentioned in most of the studies. During the model training process, only two studies (6, 17) did not provide how the data were transformed into a way that can be fed into the algorithm. But nearly half the studies (43.8%) did not offer the program or the platform for model training and roughly half (43.8%) did not provide hyperparameters which are necessary for the training process. As mentioned above, 62.5% of the studies provided a valid method to combat over-fitting, but only five reported variations in the validation statistics. Only one study (16) validated the algorithm in an external database. Though only four studies (15, 17, 18, 21) reported how to interpret the predictors, most studies (68.8%) suggested possible clinical application of the developed machine learning algorithm.

Table 4

Quality assessment of machine learning studies in sellar region disease.

	Unmet need	Reproducibility			Robustness		Generalizability	Clinical significance
	Limits in current non-machine-learning approach	Feature engineering	Platforms, packages	Hyperparameters	Valid methods for over-fitting	Stability of results	External data validation	Predictors explanation	Suggested clinical use
Learned-Miller 2006 (6)	Yes	No	Yes	No	Yes	No	No	No	Yes
Kitajima 2009 (7)	Yes	Yes	No	Yes	Yes	No	No	No	Yes
Lalys 2011 (8)	No	Yes	No	Yes	Yes	Yes	No	No	Yes
Hu 2012 (9)	No	NA	Yes	Yes	Yes	No	No	No	Yes
Steiner 2012 (10)	Yes	Yes	No	Yes	Yes	Yes	No	No	Yes
Calligaris 2015 (11)	Yes	NA	No	No	No	No	No	No	Yes
Paul 2017 (12)	Yes	Yes	No	Yes	Yes	Yes	No	No	No
Kong 2018 (13)	Yes	Yes	No	Yes	No	No	No	No	Yes
Zhang 2018 (14)	Yes	Yes	Yes	No	Yes	No	No	No	Yes
Murray 2018 (15)	Yes	Yes	Yes	No	Yes	No	No	Yes	Yes
Yang 2018 (16)	Yes	Yes	Yes	Yes	No	No	Yes	No	No
Hollon 2018 (17)	No	No	Yes	No	No	No	No	Yes	No
Staartjes 2018 (18)	Yes	Yes	Yes	No	No	Yes	No	Yes	Yes
Kocak 2018 (19)	Yes	Yes	Yes	Yes	Yes	Yes	No	No	No
Ortea 2018 (20)	Yes	NA	Yes	No	Yes	No	No	No	Yes
Smyczynska 2018 (21)	Yes	Yes	No	Yes	No	No	No	Yes	No

NA, no need.

Quality assessment of machine learning studies in sellar region disease. NA, no need. Based on the results, several recommended reporting items for a machine learning study were proposed (Table 5). Reporting of the background should include results by human intelligence and a summarized research question. In the methods part, it is recommended to report the diagnosis of the cohort; locations and period of the included patients; how the control group was determined; all the variables as predictors; the data coding, data transformation methods; missing data imputation methods; any censoring data. The methods part should also include the reason for choosing a specific model; the platform and the package for model building (recommended in Supplementary Table 3) and all the hyperparameters in the model if applicable. Reporting of the results should include the rate of binary outcome or the distribution of categorical or continuous outcome; the appropriate validation statistic based on the clinical question; the 95% confidence interval by cross-validation or bootstrap and whether an external validation was obtained. Reporting of the discussion should include the reason if arbitrarily chosen cut-off value; the clinical meaning of the discrimination or calibration statistics; explanation of the model (provide coefficients or feature importance if possible); discussion on how the model will be integrated into clinical care.

Table 5

A proposed reporting checklist of future studies using machine learning.

Reporting of background should include

Results of human intelligence or non-machine-learning approach

A summarized research question

Reporting of method should include

Diagnoses of the cohort

Locations and time span of the patients included

How the control group was determined

All the variables as predictors

Data coding and data transformation methods

Missing data imputation methods

Any censoring data

The reason for choosing a specific model

The platform and the package for model building

All the hyperparameters in the model if applicable

Reporting of results should include

The rate of binary outcome or the distribution of categorical or continuous outcome

The appropriate validation statistic based on the clinical question

95% confidence interval of validation statistic by cross-validation or bootstrapping

Whether an external validation was obtained

Reporting of the discussion should include

The reason if arbitrarily chosen cut-off value

Clinical meaning of the discrimination or calibration statistics

Explanation of the model (provide coefficients or feature importance if possible);

Discussion on how the model will be integrated in clinical care

A proposed reporting checklist of future studies using machine learning.

Discussion

The review summarized studies on sellar region disease with machine learning methods about cohort selections, predictors, outcomes, model buildings and validation methods. A quality assessment tool was proposed in these aspects: unmet needs, reproducibility, robustness, generalizability and clinical significance. A reporting checklist from the introduction to the discussion was also provided for future studies. Though machine learning methods have the potential advantage of increasing the prediction power, researchers should always focus on the clinical questions. The unmet needs in current practice either in diagnosis or in posttreatment prediction were the drivers for expanding the use of this new method. In particular cases, results of human-level intelligence (13) should be tested in scenarios where the predictions are majorly dependent on clinicians’ subjective judgments in current standard care, for example, in studies in predicting gross-total resection rate after pituitary adenoma surgery (18). In both studies, human-level intelligence results by either physicians’ judgment or conventional prediction technique were provided. In particular situations, if the diagnostic process needs too much human labor, it was also a good argument for the application of machine learning methods (22, 23). In general, machine learning studies were retrospective observational studies, and the predictors were usually all the variables which have been recorded. On the other hand, features can be generated by transforming data already collected using specific methods (standardization, normalization, centralization) (24). These methods should be reported in the method part for study replication. There were also a few feature selection methods (25), and most of them were based on maximizing the validation statistics. But we should be bear in mind that feature selection can either improve the robustness or have the potential to harm the generalizability. Unsupervised leaning models are usually not used in clinical studies, because the purpose of these approaches is to find similarities among samples where no outcomes are available, for example, genomic grouping. In selecting specific supervised machine learning algorithms, no common rules apply. Because there was no guarantee that a certain algorithm performs the best in all kind of data. In general, neural networks performs better than other models in image data, and tree-based models perform better in tabular data. Platforms, packages, parameters and hyperparameters were other critical issues for study replication, but only half of the studies provided this information. Algorithms like logistic regression with regulation, linear discriminant analysis, k means and k nearest neighbor are relatively easy to implement and do not require many hyperparameters. Support vector machine, decision tree-based models and neural networks are more complicated and need tons of hyperparameters during training. Proper reporting was necessary for study replication using these models. Leave-one-out holds one sample out each time and trains the model on the remaining samples. Similarly, k-fold cross-validation (k = 5 or 10 in general) holds 1/5th or 1/10th of the samples out each time and trains the model on the remaining samples (9, 19, 22). Bootstrap samples patients from the whole original data randomly to create new data in which a model was trained, and the resampling process is repeated multiple times (14, 20). It was not recommended to randomly split the data into two parts (training and testing) because it may have a big chance to achieve a relatively ‘good’ testing data such that biasing the model performance to the better direction. Calibration seems not so important in sellar region disease in this systematic review. When the research question is to predict the classification, it is not important whether the predicted probabilities deviate to the real probabilities because the goal is to discriminate the predicted values between the two classes. On the other hand, in situations when predicting the probability of a specific class (e.g. mortality risk) is important, the predicted probabilities should be calibrated to avoid deviating too much from the actual probabilities (26). Calibration is usually measured by Hosmer–Lemeshow goodness-of-fit test or by calibration belt plotting the distribution of real probability versus the predicted probability (27). Generalizability is another major concern in machine learning studies. The population to be generalized should have similar characteristics distribution and outcome proportion. If a model is to be truly applied in the clinical setting, it should be validated in another database. Recent food drug administration approved aided diagnostic tool for diabetic retinopathy diagnosis, atrial fibrillation detection and other diseases all require validation in the external dataset (4). Sometimes clinicians want to know which factors drive the model for the prediction in the whole population or in a particular patient, which highlights the importance of model explanation. On the population level, this can be solved by looking at coefficients of each variable in logistic regressions or calculating feature importance in tree-based models or neural networks. But sometimes individual-level explanation may be more important, which necessitate the interpretation of each variable in each sample. This procedure can be calculated by SHAP score, which means the contribution of each variable to the final prediction value (28). But both explanations only tell why the model performs like the way it functions, but not anything about how we can improve our clinical practice, which is one major limit in machine learning methods. Clinical applications include multiple aspects. Developing a smartphone application for acromegaly detection may help to increase the diagnostic rate of acromegaly (13). Using histological spectrum to differentiate different tumor types may help quicker and more accurate intra-operative diagnosis (11). Predicting somatostatin analog sensitivity can guide future clinical trials by recruiting patients more sensitive to the medications (19). Precise prediction of postoperative adverse events may help to alarm surgeons to pay more attention to those patients who have a higher likelihood of developing these events (17). Web-based online real-time prediction can also help increase physician–physician or physician–patient communication (29). Although machine learning approach provided additional prediction power comparing to conventional regression models, several concerns in applying this approach were listed as follows: (1) the superiority of prediction power were not guaranteed in every case; (2) machine learning approach is more data consuming and time consuming, thus is less efficient than conventional models; (3) different platforms, different packages and multiple hyperparameters of machine learning approach restrict its replicability among different research groups. Current gaps of knowledge still exist in how to correctly explain the machine learning models either in the global level or in the individual level.

Conclusion

Machine learning methods were used to predict diagnosis and posttreatment outcomes in sellar region diseases. Though most studies had substantial unmet needs and proposed possible clinical application, replicability robustness assessed by variations in the validation statistics and generalizability evaluated by the external database, were major limits in current studies. Population-level and individual-level predictors explanation are also directions for future improvements.

Declaration of interest

The author declares that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.

Funding

Dr Qiao is supported by 2018 Milstein Medical Asian American Partnership Foundation translational medicine fellowship. This study is supported by Shanghai Committee of Science and Technology, China (grant NO. 17JC1402100 and 17YF1426700).

27 in total

Review 1. Sellar Lesions/Pathology.

Authors: Damien Bresson; Philippe Herman; Marc Polivka; Sébastien Froelich
Journal: Otolaryngol Clin North Am Date: 2016-02 Impact factor: 3.346

2. Utility of deep neural networks in predicting gross-total resection after transsphenoidal surgery for pituitary adenoma: a pilot study.

Authors: Victor E Staartjes; Carlo Serra; Giovanni Muscas; Nicolai Maldaner; Kevin Akeret; Christiaan H B van Niftrik; Jorn Fierstra; David Holzmann; Luca Regli
Journal: Neurosurg Focus Date: 2018-11-01 Impact factor: 4.047

Review 3. Feature selection methods for big data bioinformatics: A survey from the search perspective.

Authors: Lipo Wang; Yaoli Wang; Qing Chang
Journal: Methods Date: 2016-08-31 Impact factor: 3.608

4. Non-invasive radiomics approach potentially predicts non-functioning pituitary adenomas subtypes before surgery.

Authors: Shuaitong Zhang; Guidong Song; Yali Zang; Jian Jia; Chao Wang; Chuzhong Li; Jie Tian; Di Dong; Yazhuo Zhang
Journal: Eur Radiol Date: 2018-03-23 Impact factor: 5.315

5. Development of Machine Learning Algorithms for Prediction of 30-Day Mortality After Surgery for Spinal Metastasis.

Authors: Aditya V Karhade; Quirina C B S Thio; Paul T Ogink; Akash A Shah; Christopher M Bono; Kevin S Oh; Phil J Saylor; Andrew J Schoenfeld; John H Shin; Mitchel B Harris; Joseph H Schwab
Journal: Neurosurgery Date: 2019-07-01 Impact factor: 4.654

6. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery.

Authors: Scott M Lundberg; Bala Nair; Monica S Vavilala; Mayumi Horibe; Michael J Eisses; Trevor Adams; David E Liston; Daniel King-Wai Low; Shu-Fang Newman; Jerry Kim; Su-In Lee
Journal: Nat Biomed Eng Date: 2018-10-10 Impact factor: 25.671

7. A machine learning approach to predict early outcomes after pituitary adenoma surgery.

Authors: Todd C Hollon; Adish Parikh; Balaji Pandian; Jamaal Tarpeh; Daniel A Orringer; Ariel L Barkan; Erin L McKean; Stephen E Sullivan
Journal: Neurosurg Focus Date: 2018-11-01 Impact factor: 4.047

8. Differentiation of common large sellar-suprasellar masses effect of artificial neural network on radiologists' diagnosis performance.

Authors: Mika Kitajima; Toshinori Hirai; Shigehiko Katsuragawa; Tomoko Okuda; Hirofumi Fukuoka; Akira Sasao; Masuma Akter; Kazuo Awai; Yoshiharu Nakayama; Ryuji Ikeda; Yasuyuki Yamashita; Shigetoshi Yano; Jun-ichi Kuratsu; Kunio Doi
Journal: Acad Radiol Date: 2009-03 Impact factor: 3.173

Review 9. High-performance medicine: the convergence of human and artificial intelligence.

Authors: Eric J Topol
Journal: Nat Med Date: 2019-01-07 Impact factor: 53.440

10. Using Deep Learning for the Classification of Images Generated by Multifocal Visual Evoked Potential.

Authors: Nidan Qiao
Journal: Front Neurol Date: 2018-08-03 Impact factor: 4.003

9 in total

1. Development and assessment of machine learning algorithms for predicting remission after transsphenoidal surgery among patients with acromegaly.

Authors: Yanghua Fan; Yansheng Li; Yichao Li; Shanshan Feng; Xinjie Bao; Ming Feng; Renzhi Wang
Journal: Endocrine Date: 2019-10-30 Impact factor: 3.633

Review 2. Application of artificial intelligence and radiomics in pituitary neuroendocrine and sellar tumors: a quantitative and qualitative synthesis.

Authors: Kelvin Koong; Veronica Preda; Anne Jian; Benoit Liquet-Weiland; Antonio Di Ieva
Journal: Neuroradiology Date: 2021-11-27 Impact factor: 2.804

Review 3. Predicting Major Adverse Cardiovascular Events in Acute Coronary Syndrome: A Scoping Review of Machine Learning Approaches.

Authors: Sara Chopannejad; Farahnaz Sadoughi; Rafat Bagherzadeh; Sakineh Shekarchi
Journal: Appl Clin Inform Date: 2022-05-26 Impact factor: 2.762

4. Machine learning in predicting early remission in patients after surgical treatment of acromegaly: a multicenter study.

Authors: Nidan Qiao; Ming Shen; Wenqiang He; Min He; Zhaoyun Zhang; Hongying Ye; Yiming Li; Xuefei Shou; Shiqi Li; Changzhen Jiang; Yongfei Wang; Yao Zhao
Journal: Pituitary Date: 2020-10-06 Impact factor: 4.107

Review 5. Machine learning applications in imaging analysis for patients with pituitary tumors: a review of the current literature and future directions.

Authors: Ashirbani Saha; Samantha Tso; Jessica Rabski; Alireza Sadeghian; Michael D Cusimano
Journal: Pituitary Date: 2020-06 Impact factor: 4.107

Review 6. Machine Learning Approach for Preterm Birth Prediction Using Health Records: Systematic Review.

Authors: Zahra Sharifi-Heris; Juho Laitala; Antti Airola; Amir M Rahmani; Miriam Bender
Journal: JMIR Med Inform Date: 2022-04-20

Review 7. Machine learning versus conventional clinical methods in guiding management of heart failure patients-a systematic review.

Authors: George Bazoukis; Stavros Stavrakis; Jiandong Zhou; Sandeep Chandra Bollepalli; Gary Tse; Qingpeng Zhang; Jagmeet P Singh; Antonis A Armoundas
Journal: Heart Fail Rev Date: 2021-01 Impact factor: 4.214

8. Harnessing Big Data, Smart and Digital Technologies and Artificial Intelligence for Preventing, Early Intercepting, Managing, and Treating Psoriatic Arthritis: Insights From a Systematic Review of the Literature.

Authors: Nicola Luigi Bragazzi; Charlie Bridgewood; Abdulla Watad; Giovanni Damiani; Jude Dzevela Kong; Dennis McGonagle
Journal: Front Immunol Date: 2022-03-10 Impact factor: 7.561

9. Development and Interpretation of Multiple Machine Learning Models for Predicting Postoperative Delayed Remission of Acromegaly Patients During Long-Term Follow-Up.

Authors: Congxin Dai; Yanghua Fan; Yichao Li; Xinjie Bao; Yansheng Li; Mingliang Su; Yong Yao; Kan Deng; Bing Xing; Feng Feng; Ming Feng; Renzhi Wang
Journal: Front Endocrinol (Lausanne) Date: 2020-09-16 Impact factor: 5.555

9 in total