
Can Hyperparameter Tuning Improve the Performance of a Super Learner?: A Case Study.

Jenna Wong¹, Travis Manderson², Michal Abrahamowicz¹, David L Buckeridge¹, Robyn Tamblyn¹.

Abstract

BACKGROUND: Super learning is an ensemble machine learning approach used increasingly as an alternative to classical prediction techniques. When implementing super learning, however, not tuning the hyperparameters of the algorithms in it may adversely affect the performance of the super learner.
METHODS: In this case study, we used data from a Canadian electronic prescribing system to predict when primary care physicians prescribed antidepressants for indications other than depression. The analysis included 73,576 antidepressant prescriptions and 373 candidate predictors. We derived two super learners: one using tuned hyperparameter values for each machine learning algorithm identified through an iterative grid search procedure and the other using the default values. We compared the performance of the tuned super learner to that of the super learner using default values ("untuned") and a carefully constructed logistic regression model from a previous analysis.
RESULTS: The tuned super learner had a scaled Brier score (R²) of 0.322 (95% confidence interval [CI] = 0.267, 0.362). In comparison, the untuned super learner had a scaled Brier score of 0.309 (95% CI = 0.256, 0.353), corresponding to an efficiency loss of 4% (relative efficiency 0.96; 95% CI = 0.93, 0.99). The previously derived logistic regression model had a scaled Brier score of 0.307 (95% CI = 0.245, 0.360), corresponding to an efficiency loss of 5% relative to the tuned super learner (relative efficiency 0.95; 95% CI = 0.88, 1.01).
CONCLUSIONS: In this case study, hyperparameter tuning produced a super learner that performed slightly better than an untuned super learner. Tuning the hyperparameters of individual algorithms in a super learner may help optimize performance.


Year: 2019 PMID: 30985529 PMCID: PMC6553550 DOI: 10.1097/EDE.0000000000001027

Source DB: PubMed Journal: Epidemiology ISSN: 1044-3983 Impact factor: 4.822

Predictive modeling has many important applications in public health, clinical practice, and epidemiologic research. Prediction algorithms can help identify target populations for health interventions, improve clinical decision making, and facilitate confounding control in observational studies.[1] The increasing amount of healthcare data being generated could help to improve the accuracy with which we can predict health outcomes,[2] but sorting through the masses of data to separate the signal from the noise is no trivial task. Standard epidemiologic approaches to prediction typically involve using parametric regression methods where the optimal set of covariates is identified through such procedures as performing stepwise variable selection, exploring different functional forms for continuous variables, and testing for interactions between main effects. Such model-building practices are necessary because the probability estimates from regression models may be biased if the model is incorrectly specified.[3] However, as the dimensionality of the dataset grows, researchers may find that these standard model-building procedures become cumbersome and difficult to implement properly. Because of these challenges, there has been growing interest in using more flexible prediction techniques from the machine learning literature that can automatically learn associations in high-dimensional data.[4,5] To allow researchers to simultaneously consider multiple machine learning techniques rather than just one, many studies[6-15] have implemented “super learning”,[16] an ensemble machine learning approach that determines the optimal weights for combining the predictions from a collection of machine learning algorithms to yield a final super learner prediction function that performs at least as well as any of its component algorithms.
Despite the potential benefits of using the super learning methodology, few studies have conducted a head-to-head comparison of a super learner with a carefully constructed regression model. Furthermore, to our knowledge, no applications of super learning thus far have included formalized efforts to tune the hyperparameters of the machine learning algorithms in the super learner. Hyperparameters refer to parameters whose values are typically set by the user manually before an algorithm is trained and can impact the algorithm’s behavior by affecting such properties as its structure or complexity.[17] Although the super learning methodology itself does not dictate what hyperparameter values investigators should use for their machine learning algorithms, most investigators appear to use the default values in the statistical packages used to implement super learning.[6-15] This observation is concerning given that the performance of an algorithm can be sensitive to the value of its hyperparameters.[17,18] In the machine learning literature, hyperparameters are commonly tuned using an iterative procedure called grid search whereby an algorithm’s cross-validated performance is repeatedly assessed over a grid of possible hyperparameter values to identify the best one.[17,19] If this tuning process is not carried out (e.g., default values are used), then machine learning algorithms–and thus super learners–may not perform as well. In this study, we applied super learning to a high-dimensional dataset from a previous study[20] that used multivariable logistic regression methods with classical model-building techniques to predict when antidepressants were prescribed for indications other than depression. 
This prediction task is important because the medical reasons for drug use (“treatment indications”) are not routinely documented in structured electronic health data, thus creating challenges when using these data to study antidepressant use for depression and other (e.g., off-label) indications.[21,22] This study had two main objectives: (1) to compare the performance of a “tuned” super learner (fit using tuned hyperparameter values) to that of an “untuned” super learner (fit using default values) and (2) to compare the performance of the tuned super learner to that of the final logistic regression model derived in the previous study.[20]

METHODS

Data Source

The Medical Office of the XXIst Century (MOXXI) is an indication-based electronic prescribing and drug management system used by over 185 consenting primary care physicians at community-based clinics around two major urban centers in the Canadian province of Quebec.[23] The MOXXI system requires physicians to document at least one treatment indication for every prescription using either a drop-down menu containing on-label and off-label indications (without distinction) or by typing the indication(s) into a free-text field. Treatment indications in the MOXXI system were previously validated against a blinded, post hoc, physician-facilitated chart review where they had excellent sensitivity (98.5%) and high positive predictive value (97.0%).[24] Health services data on all MOXXI patients are available through the system’s integration with Quebec’s health insurance agency (Régie de l’assurance maladie du Québec) and hospital discharge summary database (MED-ECHO). These data sources provide information on patient demographics, diagnoses, hospitalizations, and medical services received. This study included MOXXI prescriptions for all drugs approved for depression in Canada written between January 2003 and December 2012. We excluded drugs with fewer than 120 prescriptions written during the study period. The unit of analysis was the antidepressant prescription. All patients gave informed consent to have their information used for research purposes. This study was approved by the McGill institutional review board.

Study Variables

The outcome being predicted was a binary variable indicating whether an antidepressant had been prescribed for an indication other than depression. The outcome was measured using the physician-documented treatment indications in the MOXXI system. Table 1 lists all variables that were considered as potential predictors of the outcome. There were a total of 373 variables related to characteristics of the prescription, patient, or prescribing physician. Prescription-related variables (n = 4) included the molecule name, the prescribed dose, whether the drug was prescribed on a take-as-needed basis, and the number of other drugs concurrently prescribed with the antidepressant. Patient-related variables (n = 362) captured information on demographics, socioeconomic status, diagnostic codes for plausible antidepressant treatment indications and other morbidities, health services use (e.g., previous hospitalizations, outpatient visits, emergency room visits, medical services received), and drugs prescribed in the past year. Finally, physician-related variables (n = 7) included physician sex, place of medical training, level of clinical experience, size of patient workload, and scores from a survey[25] that measured physicians’ attitudes towards new information about good clinical practices. Further details on the creation of these variables are included in eAppendix 1; http://links.lww.com/EDE/B513 and were described in the earlier article.[20]

TABLE 1.

Candidate Predictors of Antidepressant Prescriptions for Indications Other than Depression (n = 373)

Of the 373 variables, 13 were continuous, two were multicategorical, and the remaining 358 were binary. Each categorical variable was expressed using dummy coding, yielding a final covariate matrix with 391 columns. Because some of the algorithms in the super learner required prior scaling of the inputs, we standardized each continuous variable by subtracting the variable’s mean and dividing by twice the variable’s SD.[18]
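The standardization step can be illustrated with a minimal sketch in Python (the study itself used R). The function name and the example ages are hypothetical, and the use of the sample SD is an assumption, because the text does not say which SD estimator was applied.

```python
from statistics import mean, stdev


def standardize_two_sd(values):
    """Center a continuous variable and divide by twice its SD.

    Assumption: the sample SD (n - 1 denominator) is used.
    """
    center = mean(values)
    spread = stdev(values)
    return [(v - center) / (2 * spread) for v in values]


# Hypothetical patient ages, for illustration only.
ages = [30, 40, 50, 60, 70]
scaled_ages = standardize_two_sd(ages)
```

After this rescaling, the variable has mean 0 and SD 0.5, which puts continuous inputs on a scale comparable to that of balanced binary indicators.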

Prediction Approaches

Classical Epidemiologic Techniques

This approach replicated our previous analysis of the same dataset[20] that used classical multivariable logistic regression methods to predict the same outcome. In the previous analysis, we started with a baseline logistic regression model containing 26 of the 373 candidate variables, all of which were binary variables indicating whether the patient had a diagnostic code for any of 13 plausible antidepressant treatment indications recorded within two separate observation windows: (1) ±3 days around the index prescription date and (2) 4–365 days before the index prescription date. We then built upon this baseline model by considering the remaining 347 variables and applying a comprehensive suite of model-building techniques commonly used with regression methods in epidemiology. First, for all candidate continuous variables, we identified the best fitting first-degree fractional polynomial (FP1) function[26] among eight candidate FP1 functions of the form $X^{p}$, where the power $p$ was drawn from the set {−2, −1, −0.5, 0, 0.5, 1, 2, 3} and $X^{0}$ denoted log(X). Next, we used a score-based forward stepwise variable selection procedure to iteratively add covariates to the baseline model, starting with the variable that produced the greatest improvement in performance and stopping when none of the remaining variables further improved performance. Finally, we added first-order interaction terms to this main-terms-only model if they offered additional improvement. More details of this model-building procedure are available in the previous article.[20]
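The eight candidate FP1 transformations can be generated as in the following sketch, written in Python for illustration rather than taken from the study's R code. The function names are hypothetical, and the requirement that X be strictly positive is the usual FP1 convention, assumed here.

```python
import math

# The eight FP1 powers listed in the text; p = 0 denotes log(X).
FP1_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp1(x, p):
    """Return the FP1 transform of x for power p (x must be > 0)."""
    if x <= 0:
        raise ValueError("FP1 transforms assume a strictly positive X")
    return math.log(x) if p == 0 else x ** p


def fp1_candidates(x):
    """Map each candidate power to the transformed value of x."""
    return {p: fp1(x, p) for p in FP1_POWERS}


candidates = fp1_candidates(4.0)
```

In a model-building loop, each transformed version of a continuous variable would be entered in turn and the one giving the best cross-validated performance retained.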

Super Learning

We used super learning to combine the prediction functions from five machine learning algorithms: (1) a least absolute shrinkage and selection operator (LASSO) model,[27] (2) a recursive partitioning and regression tree (hereafter referred to as simply “decision tree”),[28] (3) a random forest,[29] (4) a neural network,[30] and (5) a support vector machine.[31] We chose these five algorithms for their diverse approaches for solving prediction tasks and their popularity in other fields like genetics[32] and biomedicine.[33] Because of the computational time required to tune each of their respective hyperparameters, we did not consider more than five algorithms. We implemented super learning using the SuperLearner package in the R programming language.[34] Table 2 shows the R packages we used to implement each algorithm in the super learner and the corresponding hyperparameters that we tuned. For the LASSO model, we tuned the regularization parameter lambda, where higher values imply more shrinkage of the regression coefficients. For the decision tree, we tuned the hyperparameter cp (“complexity parameter”), where higher values generally yield simpler, smaller trees. For the random forest, we tuned the number of trees in the forest (nTree) and the number of predictors randomly selected for consideration at each tree node (mTry). For the neural network, we fit a network with one hidden layer (the maximum allowed by the nnet package) and tuned the number of nodes in this hidden layer (size). Finally, for the support vector machine, we tuned the regularization parameter C, where higher values allow for a more complex decision surface separating data points from different outcome classes.
When fitting support vector machines, a common practice is to increase the dimensionality of the covariate space by applying a kernel to the predictor matrix.[18] Thus, we used one of the most commonly used kernels for support vector machines–the radial basis function kernel[35,36]–and tuned its gamma parameter, where higher values generally yield more complex decision boundaries.[35,36] The eAppendix 2; http://links.lww.com/EDE/B513 contains further details of these machine learning algorithms and their corresponding hyperparameters.

Primary Performance Metric

For all models, we used the scaled Brier score[37,38] as the primary performance metric to guide our modeling decisions during the training phase and to assess the performance of the final models during the testing phase. Similar to the R² statistic in linear regression,[9] we calculated the scaled Brier score using the following formula:

$$\text{scaled Brier score} = 1 - \frac{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{p}_i - Y_i\right)^2}{\frac{1}{N}\sum_{i=1}^{N}\left(\bar{Y} - Y_i\right)^2}$$

where $N$ represents the total number of antidepressant prescriptions in the validation set (during training) or the testing set (during testing), $\hat{p}_i$ represents the predicted probability that prescription $i$ was written for an indication other than depression, $Y_i$ represents the observed outcome for prescription $i$ (1 if the prescription was not written for depression, 0 otherwise), and $\bar{Y}$ represents the overall (marginal) observed probability of Y = 1 in the validation set (during training) or the testing set (during testing). Accordingly, the scaled Brier score can be interpreted as the relative reduction in the mean squared error yielded by a given algorithm relative to a noninformative (random) algorithm that assigns all prescriptions the marginal probability of having an indication other than depression.
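The definition above translates directly into code. The following is a minimal Python sketch (not the study's R implementation); the function name and the toy outcome and prediction vectors are hypothetical.

```python
def scaled_brier(y, p):
    """Scaled Brier score: 1 - MSE(model) / MSE(marginal-probability model).

    y: observed binary outcomes (0/1); p: predicted probabilities.
    """
    n = len(y)
    y_bar = sum(y) / n  # marginal observed probability of Y = 1
    mse_model = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / n
    mse_marginal = sum((y_bar - yi) ** 2 for yi in y) / n
    return 1.0 - mse_model / mse_marginal


# Toy data, for illustration only.
y_obs = [1, 0, 1, 0]
p_hat = [0.8, 0.2, 0.6, 0.4]
score = scaled_brier(y_obs, p_hat)
```

A score of 0 corresponds to predicting the marginal probability for every prescription, and a score of 1 corresponds to perfect prediction.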

Analytic Procedure

The Figure illustrates the flow of the study analysis. Only antidepressant prescriptions with complete data for all covariates were used in the main analysis (~95% of all eligible prescriptions). All prescriptions with complete data were randomly divided into a “training set” and a “testing set” using a 3:1 split. Because prescriptions were clustered within patients, who in turn were nested within physicians, we assigned a randomly selected 75% of physicians (rather than individual prescriptions) to the training set and the remaining 25% of physicians to the testing set. Thus, all prescriptions from the same physician and patient were limited to either the training or testing set. To ensure that patients and prescriptions were also divided approximately 3:1 between the training and testing sets, we first divided physicians into four strata by the number of their patients and then randomly sampled physicians separately within each stratum. We used the training set (Figure Box A) to build, tune, and fit the final models. The testing set (Figure Box B) was used only to evaluate the performance of the final models; it was not used in any part of the training process so that the final algorithms would be tested on completely independent data.

FIGURE.

Flowchart of the study analysis. We assigned all antidepressant prescriptions in the analysis to either the training set (Box A) or testing set (Box B). Physicians and patients were mutually exclusive between the training and testing sets. We used the training set to build, tune, and fit the final logistic regression model and two super learners. We assessed the performance of these final models in the testing set, which had not been used during any part of the training process. To measure the statistical uncertainty around our performance estimates in the testing set, we bootstrapped the testing set using a two-stage cluster bootstrap[40] to account for multilevel clustering of prescriptions within patients, who in turn were clustered within physicians. For each performance estimate, the reported 95% CI corresponds to the values of the 2.5th and 97.5th percentiles of the distribution across 1000 bootstrap resamples of the testing set (Box C).
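The physician-level, stratified 3:1 split described in this section could be sketched as follows. This Python example is illustrative only: the function name, the physician identifiers, and the patient counts are hypothetical, and forming the four strata as equal-sized groups ranked by patient count is an assumption, since the text does not give the stratum cut points.

```python
import random


def split_physicians(patients_per_physician, train_frac=0.75, n_strata=4, seed=0):
    """Assign whole physicians to training or testing within strata.

    Assumption: strata are equal-sized groups of physicians ranked by
    their number of patients.
    """
    rng = random.Random(seed)
    ranked = sorted(patients_per_physician, key=patients_per_physician.get)
    stratum_size = -(-len(ranked) // n_strata)  # ceiling division
    train, test = set(), set()
    for start in range(0, len(ranked), stratum_size):
        stratum = ranked[start:start + stratum_size]
        rng.shuffle(stratum)
        n_train = round(train_frac * len(stratum))
        train.update(stratum[:n_train])
        test.update(stratum[n_train:])
    return train, test


# Hypothetical physicians with 10, 20, ..., 160 patients each.
counts = {f"md{i:02d}": 10 * (i + 1) for i in range(16)}
train_ids, test_ids = split_physicians(counts, seed=1)
```

Every prescription then inherits the assignment of its prescribing physician, so no physician contributes data to both sets.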

Cross-validation During the Training Phase

To reduce the risk of overfitting our final models to the training data, we split the training set into three mutually exclusive subsets using the same stratified randomization procedure as before. We used these three subsets to calculate a cross-validated estimate of the scaled Brier score whenever it was used to make a modeling decision. To compute the cross-validated scaled Brier score for a candidate algorithm, we fit the algorithm on two of the three training subsets (the “derivation set”) and calculated the scaled Brier score in the held-out subset (the “validation set”). We repeated this process three times using a different subset as the validation set each time and then averaged the scaled Brier score across the three validation sets. Even with cross-validation, however, repeated use of the training data for model selection can lead to some overfitting of the validation sets.[39] For this reason, we tested the final models on an independent third dataset (the testing set).
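The three-fold procedure can be sketched in Python as below. This is an illustration under stated assumptions rather than the study's code: the toy learner `fit_group_means`, the row format, and the identical toy folds are all hypothetical, and the folds here are not built by physician-level stratified sampling as they were in the study.

```python
def scaled_brier(y, p):
    """Scaled Brier score: 1 - MSE(model) / MSE(marginal-probability model)."""
    n = len(y)
    y_bar = sum(y) / n
    mse_model = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / n
    mse_marginal = sum((y_bar - yi) ** 2 for yi in y) / n
    return 1.0 - mse_model / mse_marginal


def fit_group_means(rows):
    """Toy learner: predict the outcome mean within each level of x."""
    overall = sum(r["y"] for r in rows) / len(rows)
    totals = {}
    for r in rows:
        total, count = totals.get(r["x"], (0, 0))
        totals[r["x"]] = (total + r["y"], count + 1)
    means = {x: total / count for x, (total, count) in totals.items()}
    return lambda row: means.get(row["x"], overall)


def cv_scaled_brier(folds, fit):
    """Fit on all folds but one, score on the held-out fold, then average."""
    scores = []
    for k, validation in enumerate(folds):
        derivation = [row for j, fold in enumerate(folds) if j != k for row in fold]
        predict = fit(derivation)
        y = [row["y"] for row in validation]
        p = [predict(row) for row in validation]
        scores.append(scaled_brier(y, p))
    return sum(scores) / len(scores)


# Three identical toy folds, for illustration only.
fold = ([{"x": 1, "y": 1}] * 3 + [{"x": 1, "y": 0}]
        + [{"x": 0, "y": 0}] * 3 + [{"x": 0, "y": 1}])
folds = [list(fold), list(fold), list(fold)]
cv_score = cv_scaled_brier(folds, fit_group_means)
```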

Fitting the Multivariable Logistic Regression Model

In our previous analysis,[20] we implemented the classical model-building procedures for the multivariable logistic regression model (as described previously) on the training set using the cross-validated estimate of the scaled Brier score to guide all modeling decisions. The final logistic regression model included 40 main terms, comprising three prescription-related variables (molecule name, prescribed dose, and whether the drug was prescribed on a take-as-needed basis), 36 patient-related variables (age$^{-2}$; less than university education; 26 indicator variables for whether diagnostic codes for 13 plausible antidepressant treatment indications were recorded within ±3 days and −4 to −365 days of the prescription date; three indicator variables for whether diagnostic codes were recorded within the past year for three conditions: diabetes without chronic complications, dementia, and unspecified nonpsychotic mental disorder following organic brain damage; (number of outpatient visits in the past year)$^{-0.5}$; whether the patient had a diagnostic procedure performed in the past year; and three indicator variables for whether three drugs were prescribed in the past year: trazodone, quetiapine, and furosemide), and one physician-related variable ((average number of patients seen per working day)$^{-0.5}$). The final logistic regression model also included one interaction term between the molecule name and the prescribed dose.

Fitting the Two Super Learners

Because the purported advantage of using more flexible machine learning algorithms is that they can automatically detect and model complex, nonlinear associations in the data, we submitted all 373 candidate predictors to each machine learning algorithm without applying any categorization or transformations to continuous variables (other than standardization). To tune the algorithms’ hyperparameters, we applied a grid search procedure that iteratively assessed the cross-validated performance of the algorithms in the training set over a range of plausible hyperparameter values (Table 2). For the LASSO model, rather than define our own subset of possible lambda values, we used the sequence of values automatically generated by the glmnet package. For algorithms with multiple hyperparameters, we assessed all possible combinations of their candidate hyperparameter values. For example, for the random forest, we assessed a total of 30 unique combinations of nTree and mTry. For each algorithm, the tuned value for its respective hyperparameter (or combination of hyperparameters) was defined as the value(s) yielding the best cross-validated scaled Brier score in the training set (for R code showing how we implemented the grid search procedure for the random forest, see eAppendix 3; http://links.lww.com/EDE/B514).
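An exhaustive grid search over two hyperparameters can be sketched as follows. This Python example is not the R code in eAppendix 3. The candidate values are hypothetical (chosen only so that the grid has 30 combinations) and are not the values in Table 2, and `toy_cv_score` is a made-up stand-in for the cross-validated scaled Brier score.

```python
from itertools import product


def grid_search(grid, cv_score):
    """Evaluate every combination in the grid and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score, n_evaluated = None, float("-inf"), 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(**params)
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


# Hypothetical candidate values: 3 x 10 = 30 combinations.
rf_grid = {
    "ntree": [500, 1000, 2000],
    "mtry": [5, 10, 19, 30, 40, 50, 60, 80, 100, 150],
}


def toy_cv_score(ntree, mtry):
    """Made-up score surface; a real run would refit and cross-validate."""
    return -abs(mtry - 50) / 100 - abs(ntree - 1000) / 10000


best_params, best_score, n_evaluated = grid_search(rf_grid, toy_cv_score)
```

Because every combination requires refitting the algorithm on each derivation set, the cost grows multiplicatively with the number of hyperparameters and candidate values.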

TABLE 2.

Machine Learning Algorithms in the Super Learner and Their Corresponding Hyperparameters

After completing the grid search procedure, we used the SuperLearner package to fit two super learners on the training set. For the first super learner, we specified a library of learners that included each of the five machine learning algorithms fit using their tuned hyperparameter values identified from the grid search (the “tuned” super learner). For the second super learner, we specified a library of learners that included each algorithm fit using the default values in the SuperLearner package (the “untuned” super learner) (see eAppendix 4; http://links.lww.com/EDE/B513 for details on how these tasks were implemented). The SuperLearner package then determined the optimal weighted combination of algorithms for each super learner as follows.[14] First, it obtained the cross-validated predictions for each algorithm (i.e., the predictions in the validation set of each fold when the algorithm was fit on the derivation set). Next, it performed a constrained regression of the observed outcome on a matrix of the cross-validated predictions (one column per algorithm) to determine the optimal convex combination of regression coefficients (i.e., a vector of nonnegative coefficients summing to one), corresponding to the weights for combining the predictions from each algorithm in the super learner. Finally, the SuperLearner package refit each machine learning algorithm on the entire training set. The predictions from these fitted algorithms, combined with their corresponding weights, constituted the final super learner prediction function, developed in the training set.
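The constrained regression step can be illustrated with a small Python sketch that finds nonnegative weights summing to one by minimizing the squared error between the outcome and the weighted cross-validated predictions. It uses projected gradient descent, which is a technique swapped in for illustration and is not the optimization routine used by the SuperLearner package. All function names and the toy data are hypothetical, and the fixed step size assumes predictions lie in [0, 1].

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    ordered = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def convex_weights(Z, y, n_iter=2000):
    """Minimize mean((Z w - y)^2) over the simplex by projected gradient.

    Z: rows of cross-validated predictions (one column per algorithm).
    Assumption: all predictions are in [0, 1], so 1 / (2k) is a safe step.
    """
    n, k = len(Z), len(Z[0])
    step = 1.0 / (2 * k)
    w = [1.0 / k] * k
    for _ in range(n_iter):
        residuals = [sum(wj * z for wj, z in zip(w, row)) - yi
                     for row, yi in zip(Z, y)]
        grad = [2.0 / n * sum(r * row[j] for r, row in zip(residuals, Z))
                for j in range(k)]
        w = project_to_simplex([wj - step * gj for wj, gj in zip(w, grad)])
    return w


# Toy data: algorithm 0 predicts perfectly, algorithm 1 is uninformative.
y_obs = [1, 0, 1, 0, 1, 0]
Z_cv = [[float(yi), 0.5] for yi in y_obs]
weights = convex_weights(Z_cv, y_obs)
```

In this toy case all of the weight goes to the perfectly predicting algorithm. With correlated real-world predictions, a well-performing algorithm can still receive a small weight if other algorithms already capture the same information.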

Performance Assessment in the Testing Set

We assessed the performance of the final logistic regression model and the two super learners by applying these models to prescriptions in the independent testing set (Figure Box B). For each model, we used the scaled Brier score as our primary performance metric to assess its overall performance. As our secondary performance metric, we calculated the concordance (c) statistic to assess its discriminative ability.[38] We compared the performance of these models by measuring the relative efficiency (RE), which we defined as the performance of a given model relative to that of the tuned super learner. For example, the RE of the scaled Brier score for the logistic model was calculated as $\mathrm{RE}_{\text{logistic}} = \text{scaled Brier score}_{\text{logistic}} / \text{scaled Brier score}_{\text{tuned super learner}}$, where RE > 1 indicated an efficiency gain (i.e., better performance) and RE < 1 indicated an efficiency loss (i.e., worse performance) compared to the tuned super learner. To report the level of statistical uncertainty around our performance estimates in the testing set, we calculated 95% confidence intervals (CIs) using a two-stage cluster bootstrap[40] to account for clustering of prescriptions within patients, who in turn were clustered within physicians. The reported 95% CIs correspond to the values of the 2.5th and 97.5th percentiles of the distribution of the respective estimates across 1000 bootstrap resamples of the testing set (Figure Box C). All analyses were performed in the R environment for statistical computing, version 3.4.1.[41] The following R packages were used: glmnet,[27] rpart,[42] randomForest,[43] nnet,[30] e1071,[44] SuperLearner,[34] and AUC.[45]
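The relative efficiency and the percentile-based interval can be sketched in Python as follows. The function names are hypothetical, the linear interpolation between order statistics is an assumption (the text does not say how percentiles were computed), and the two-stage resampling of physicians and patients that would generate the bootstrap estimates is not shown.

```python
def relative_efficiency(score_model, score_reference):
    """Performance of a model relative to the reference (tuned super learner)."""
    return score_model / score_reference


def percentile_interval(estimates, lower=2.5, upper=97.5):
    """Percentile interval with linear interpolation between order statistics."""
    ordered = sorted(estimates)

    def percentile(q):
        position = q / 100 * (len(ordered) - 1)
        low = int(position)
        high = min(low + 1, len(ordered) - 1)
        return ordered[low] + (position - low) * (ordered[high] - ordered[low])

    return percentile(lower), percentile(upper)


# Point estimates reported in this article (scaled Brier scores).
re_untuned = relative_efficiency(0.309, 0.322)
re_logistic = relative_efficiency(0.307, 0.322)

# Placeholder values standing in for bootstrap estimates, to show the mechanics.
ci_low, ci_high = percentile_interval(list(range(1001)))
```

Rounded to two decimals, the two ratios reproduce the relative efficiencies of 0.96 and 0.95 given in the abstract.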

RESULTS

The analytical dataset included 73,576 antidepressant prescriptions that were written by 141 physicians for 16,262 patients (Figure). Of these, 52,019 (70.7%) antidepressant prescriptions, written by 103 physicians for 11,827 patients, were randomized to the training set. The remaining prescriptions were assigned to the testing set. Overall, 32,405 (44.0%) antidepressant prescriptions were written for indications other than depression, with this prevalence being similar between the training (43.8%) and testing sets (44.5%).

Grid Search Procedure

The grid search procedure revealed that for the random forest, neural network, and support vector machine, there was a better hyperparameter value (or combinations of hyperparameter values) than the default values in the SuperLearner package (Table 3). For instance, for the random forest, although the best value for nTree was the same as the default value of 1000 trees, the best value for mTry was 50 compared to the default value of 19. For the decision tree and LASSO model, the best values for their corresponding hyperparameters coincided with the default values.

TABLE 3.

Tuned and Default Hyperparameter Values for Each Machine Learning Algorithm

Super Learner Coefficients

Table 4 shows the weights (or coefficients) for each machine learning algorithm in the two super learners. In the tuned super learner, the random forest contributed the most with a weight of 0.526, followed by the neural network and LASSO model with similar weights of 0.186 and 0.173, and finally the support vector machine with the lowest non-zero weight of 0.114. The decision tree did not contribute at all (weight of 0). In the untuned super learner, the LASSO model and random forest contributed the most with a weight of 0.424 each, followed by the neural network and decision tree with much lower weights of 0.106 and 0.045, respectively. This time, the support vector machine did not contribute at all (weight of 0).

TABLE 4.

Weights for the Individual Machine Learning Algorithms in the Super Learner Functions

Performance of the Two Super Learners in the Testing Set

In the testing set, the tuned super learner had a scaled Brier score of 0.322 (95% CI = 0.267, 0.362), corresponding to a 32% reduction of the mean squared error relative to random classification (Table 5). The tuned super learner also had good discrimination, with a c statistic of 0.822 (95% CI = 0.795, 0.847). In comparison, the untuned super learner using default hyperparameter values had a scaled Brier score of 0.309 (95% CI = 0.256, 0.353), corresponding to an efficiency loss of 4% relative to the tuned super learner (RE of 0.96; 95% CI = 0.93, 0.99). The c statistic for the untuned super learner was also slightly lower at 0.817 (95% CI = 0.791, 0.846), but the efficiency loss in model discrimination relative to the tuned super learner was not statistically significant (RE of 0.99; 95% CI = 0.99, 1.00).
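For readers less familiar with the c statistic, the following Python sketch computes it as the proportion of outcome-discordant pairs in which the prescription with Y = 1 received the higher predicted probability, counting ties as one half. The study used the AUC package in R; this pairwise implementation, its function name, and the toy data are illustrative assumptions.

```python
def c_statistic(y, p):
    """Concordance statistic over all (Y = 1, Y = 0) pairs; ties count 0.5."""
    positives = [pi for pi, yi in zip(p, y) if yi == 1]
    negatives = [pi for pi, yi in zip(p, y) if yi == 0]
    concordant = 0.0
    for p_pos in positives:
        for p_neg in negatives:
            if p_pos > p_neg:
                concordant += 1.0
            elif p_pos == p_neg:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Toy data, for illustration only.
c_toy = c_statistic([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise loop is quadratic in the number of prescriptions, so a rank-based formula would be preferable for a dataset of the size analyzed here.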

TABLE 5.

Performance of Super Learning (Using Tuned and Default Hyperparameters) and Classical Epidemiologic Methods (Using Logistic Regression) for Predicting When Antidepressants are Prescribed for Indications Other Than Depression

In terms of the individual performance of each machine learning algorithm in the super learners, the decision tree had by far the worst performance of any algorithm in both super learners, with a scaled Brier score of 0.226 (95% CI = 0.168, 0.276) and c statistic of 0.746 (95% CI = 0.717, 0.779) (Table 6). In the tuned super learner, the support vector machine had the best individual performance (highest scaled Brier score and c statistic), although the random forest and neural network had comparable performance (Table 6). For those algorithms where the tuned hyperparameter value differed from the default value, the performance of the tuned version was always better than that of the default version, especially for the neural network (Table 6).

TABLE 6.

Performance of the Individual Machine Learning Algorithms in the Two Super Learners

Performance of the Tuned Super Learner Compared to the Final Logistic Model

The final logistic regression model had a scaled Brier score of 0.307 (95% CI = 0.245, 0.360) and c statistic of 0.815 (95% CI = 0.787, 0.847) (Table 5). These point estimates were slightly lower (worse) than those for the tuned super learner, but the efficiency loss was not statistically significant for either the scaled Brier score (RE of 0.95; 95% CI = 0.88, 1.01) or the c statistic (RE of 0.99; 95% CI = 0.98, 1.00).

DISCUSSION

In this case study, we used an ensemble machine learning approach called super learning to predict when primary care physicians prescribed antidepressants for indications other than depression. We applied an iterative grid search procedure to tune the hyperparameter values of the five machine learning algorithms in the super learner and found that, compared to using the default values, the super learner using tuned hyperparameter values had slightly better overall performance. When we compared the performance of the tuned super learner to that of a carefully constructed logistic regression model derived using classical epidemiologic techniques, we found no statistically significant difference in performance. A growing number of researchers are using super learning to predict clinical outcomes[7,8,12-14] and improve confounding control when estimating causal effects.[6,10,15] However, researchers often may not tune the hyperparameter values of the machine learning algorithms in their super learners. Indeed, in many studies, hyperparameters are not mentioned at all,[7-13,15] or if they are mentioned, their reported values often correspond to the default values in the statistical packages used to implement super learning.[6,14] Our findings from this case study suggest that if investigators tune the hyperparameter values of their machine learning algorithms, their super learners may achieve slightly better performance than if default values were used. These gains in performance–even if small–may be practically meaningful to avoid “losing” any benefits of undertaking the extra effort to use the super learning methodology instead of classical prediction methods. There are several reasons why researchers may often not perform hyperparameter tuning when fitting a super learner. First, the SuperLearner package is a “black box” that allows investigators to easily run complex machine learning algorithms without requiring much knowledge about the algorithms themselves.
However, to tune an algorithm’s hyperparameters, one must first understand the algorithm’s architecture, know the main hyperparameters that influence its performance, and identify a plausible range of hyperparameter values to test. Second, users may find it daunting to modify an algorithm’s hyperparameter values within this “black box.” We found that the create.Learner function in the SuperLearner package was very helpful for doing this task as long as the hyperparameter of interest was a modifiable parameter in the algorithm’s original wrapper function in the SuperLearner package. When this requirement was not met (in our case, for the support vector machine), we had to create our own custom wrapper for the algorithm, which required extra programming and a deeper understanding of the SuperLearner code. Third, the process of manually searching over a grid of possible hyperparameter values to identify the one with the best cross-validated performance requires advanced programming skills and can be computationally expensive, especially for algorithms like neural networks and support vector machines that can have long training times. To address these barriers, we suggest taking a heuristic approach to building a super learner whereby investigators include a smaller, yet still diverse collection of algorithms and take extra care to ensure each algorithm’s hyperparameters are carefully tuned. As an alternative to hyperparameter tuning, some researchers[46] have suggested including multiple versions of an algorithm in a super learner library–each using different hyperparameter values–and then letting the super learning methodology choose the best variant or combination of variants to use. 
Given that some algorithms have multiple hyperparameters that must be tuned simultaneously (yielding a multidimensional matrix of possible hyperparameter values rather than a vector) or hyperparameters with a wide range of possible values, it may not suffice to include only a few variants of a given algorithm. However, including a large number of variants that cover the possible hyperparameter values of each algorithm more thoroughly would likely yield a super learner that is not only computationally prohibitive but also cumbersome to present and interpret. In contrast, performing hyperparameter tuning before implementing super learning, as done in this study, is advantageous because it yields a more parsimonious super learner and allows investigators to allocate their often-limited computing power to a greater variety of algorithms rather than to multiple variants of the same one.

Furthermore, as a byproduct of assessing the individual performance of each algorithm (i.e., outside the super learner) during the tuning process, investigators may be able to better interpret the super learner weights. For example, in this study, the support vector machine received very low weights of only 0.114 and 0.000 in the tuned and untuned super learners, respectively. Based on these weights alone, one might conclude that the support vector machine performed poorly. However, Table 6 shows that the support vector machine in fact had the highest (best) scaled Brier score among all algorithms in both super learners, suggesting that its low weight was instead likely due to the high correlation of its predictions with those of other algorithms in the super learner.

There were at least three advantages of using the scaled Brier score as the primary performance metric in our analysis.
First, for the multivariable logistic regression model, the scaled Brier score directly assessed the predictive value of adding a candidate variable to the model during the forward stepwise variable selection procedure, unlike P-values (commonly used in epidemiology for variable selection), which can only assess a variable’s statistical significance and are sensitive to large sample sizes, collinearity between variables, and multiple hypothesis testing.[47] Second, we could calculate the scaled Brier score for all the machine learning algorithms because we only needed the probability estimates from the algorithms and the observed outcomes (unlike other goodness-of-fit measures such as the Akaike information criterion, which require additional information and cannot be calculated for nonparametric algorithms). Finally, because we used the scaled Brier score to quantify model performance, our performance scores had a more meaningful interpretation than the mean squared error alone (i.e., the unscaled Brier score).

To our knowledge, this study is the first to derive a tuned super learner and compare its performance to that of a carefully constructed regression model. Acion et al[48] recently compared the performance of a super learner with logistic regression and found that their super learner outperformed three different configurations of logistic regression. In contrast, we did not find evidence to suggest performance gains of a tuned super learner over a well-specified logistic regression model. However, there are notable differences between our studies. First, none of their three logistic models simultaneously employed variable selection, tests for nonlinear associations, and tests for interactions, whereas our final logistic model was derived using all these model-building techniques. Second, all algorithms in their super learner used the default hyperparameter values, whereas all our algorithms used tuned hyperparameter values.
Finally, our dataset contained many more predictors (373 versus 28 in their dataset). In the previous study,[20] adding more predictors to a “baseline” logistic regression model containing only variables based on diagnostic codes for plausible antidepressant treatment indications drastically improved its performance, increasing the scaled Brier score from 0.076 (95% CI = −0.007, 0.131) for the baseline model to 0.307 (95% CI = 0.245, 0.360) for the final model. In comparison, the gains in performance of the tuned super learner from this study over the final logistic model from the previous study were far smaller and imprecisely estimated. These observations highlight that the quality of predictors often plays a far more important role in achieving good predictive performance than the type of predictive machinery used.

Our study has several considerations. First, although the grid search procedure we used is one of the most common approaches for performing hyperparameter tuning, its manual and iterative nature makes it labor-intensive, and implementing it requires advanced programming skills.[17] Researchers may therefore want to consider using newer methods that are being developed to automatically and more efficiently select hyperparameter values.[17] Second, because the performance of a super learner depends upon the collection of algorithms in it, our findings could have been different had we chosen a different set of algorithms. However, to reduce this possibility, we chose a set of algorithms that employed a diverse range of approaches to prediction and have been found to perform well in other applications.
Finally, when interpreting the findings from this case study, readers should keep in mind the properties of our analytical dataset (e.g., number of training samples, number of variables, distribution of variable types), as the relative performance of different machine learning algorithms and the effect of hyperparameter tuning could differ in a dataset with different properties.[49]

In conclusion, based on this case study, we found that a super learner fit using tuned hyperparameter values performed slightly better than a super learner fit using default values. When we compared the performance of this tuned super learner to that of a multivariable logistic regression model derived using classical model-building techniques, the difference in performance was small and imprecisely estimated. Should investigators choose to use super learning, they may want to consider first tuning the hyperparameters of their individual machine learning algorithms before applying the super learning methodology to achieve optimal predictive performance.
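
As an illustration of the performance metric discussed above, the following minimal Python sketch computes a scaled Brier score from observed binary outcomes and predicted probabilities. It assumes the standard definition, one minus the ratio of the Brier score to the Brier score of a model that always predicts the outcome prevalence. It also assumes that relative efficiency is the ratio of two scaled Brier scores, which is consistent with the figures reported in the abstract but is not spelled out in this section. The function names and the toy data are hypothetical and are not taken from the study's code.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def scaled_brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """1 - Brier / Brier_max, where Brier_max is the score of a model that
    always predicts the observed outcome prevalence (assumed definition)."""
    prevalence = sum(y_true) / len(y_true)
    brier_max = prevalence * (1 - prevalence)
    if brier_max == 0:
        raise ValueError("scaled Brier score is undefined when all outcomes are equal")
    return 1 - brier_score(y_true, y_prob) / brier_max


def relative_efficiency(score: float, reference_score: float) -> float:
    """Ratio of two scaled Brier scores (assumed definition of relative efficiency)."""
    return score / reference_score


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    outcomes = [1, 0, 1, 1, 0]
    predictions = [0.8, 0.2, 0.6, 0.9, 0.3]
    print(f"scaled Brier score: {scaled_brier_score(outcomes, predictions):.3f}")

    # Scaled Brier scores reported in the abstract.
    tuned, untuned, logistic = 0.322, 0.309, 0.307
    print(f"untuned vs tuned:  {relative_efficiency(untuned, tuned):.2f}")   # 0.96
    print(f"logistic vs tuned: {relative_efficiency(logistic, tuned):.2f}")  # 0.95
```

Because the calculation needs only predicted probabilities and observed outcomes, the same functions apply unchanged to any algorithm that outputs probabilities, which is the second advantage noted in the discussion.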

30 in total

1. Validating an instrument for selecting interventions to change physician practice patterns: a Michigan Consortium for Family Practice Research study.

Authors: Lee A Green; Daniel W Gorenflo; Leon Wyszewianski
Journal: J Fam Pract Date: 2002-11 Impact factor: 0.493

2. Enhancing pharmacosurveillance with systematic collection of treatment indication in electronic prescribing: a validation study in Canada.

Authors: Tewodros Eguale; Nancy Winslade; James A Hanley; David L Buckeridge; Robyn Tamblyn
Journal: Drug Saf Date: 2010-07-01 Impact factor: 5.606

3. Super learner.

Authors: Mark J van der Laan; Eric C Polley; Alan E Hubbard
Journal: Stat Appl Genet Mol Biol Date: 2007-09-16

4. Bootstrap-based methods for estimating standard errors in Cox's regression analyses of clustered event times.

Authors: Yongling Xiao; Michal Abrahamowicz
Journal: Stat Med Date: 2010-03-30 Impact factor: 2.373

5. Mortality risk score prediction in an elderly population using machine learning.

Authors: Sherri Rose
Journal: Am J Epidemiol Date: 2013-01-29 Impact factor: 4.897

Review 6. Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience.

Authors: Abhijit Dasgupta; Yan V Sun; Inke R König; Joan E Bailey-Wilson; James D Malley
Journal: Genet Epidemiol Date: 2011 Impact factor: 2.135

7. The development and evaluation of an integrated electronic prescribing and drug management system for primary care.

Authors: Robyn Tamblyn; Allen Huang; Yuko Kawasumi; Gillian Bartlett; Roland Grad; André Jacques; Martin Dawes; Michal Abrahamowicz; Robert Perreault; Laurel Taylor; Nancy Winslade; Lise Poissant; Alain Pinsonneault
Journal: J Am Med Inform Assoc Date: 2005-12-15 Impact factor: 4.497

8. Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors: Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal: J Stat Softw Date: 2010 Impact factor: 6.440

9. Assessing the performance of prediction models: a framework for traditional and novel measures.

Authors: Ewout W Steyerberg; Andrew J Vickers; Nancy R Cook; Thomas Gerds; Mithat Gonen; Nancy Obuchowski; Michael J Pencina; Michael W Kattan
Journal: Epidemiology Date: 2010-01 Impact factor: 4.822

10. Time-dependent prediction and evaluation of variable importance using superlearning in high-dimensional clinical data.

Authors: Alan Hubbard; Ivan Diaz Munoz; Anna Decker; John B Holcomb; Martin A Schreiber; Eileen M Bulger; Karen J Brasel; Erin E Fox; Deborah J del Junco; Charles E Wade; Mohammad H Rahbar; Bryan A Cotton; Herb A Phelan; John G Myers; Louis H Alarcon; Peter Muskat; Mitchell J Cohen
Journal: J Trauma Acute Care Surg Date: 2013-07 Impact factor: 3.313
