Anita L Lynam, John M Dennis, Katharine R Owen, Richard A Oram, Angus G Jones, Beverley M Shields, Lauric A Ferrat.
Abstract
BACKGROUND: There is much interest in the use of prognostic and diagnostic prediction models in all areas of clinical medicine. The use of machine learning to improve prognostic and diagnostic accuracy has been increasing at the expense of classic statistical models. Previous studies have compared the performance of these two approaches, but their findings are inconsistent and many have limitations. We aimed to compare the discrimination and calibration of seven models built using logistic regression and optimised machine learning algorithms in a clinical setting, where the number of potential predictors is often limited, and to externally validate the models.
Keywords: Logistic regression; Machine learning; Model selection
Year: 2020 PMID: 32607451 PMCID: PMC7318367 DOI: 10.1186/s41512-020-00075-2
Source DB: PubMed Journal: Diagn Progn Res ISSN: 2397-7523
Algorithm descriptions
| Algorithm | Description |
|---|---|
| Logistic regression | A classic statistical algorithm for binary outcomes that uses maximum likelihood estimation. It is fully parametric, and there are no model hyperparameters to set. Coefficients are adjusted to allow for dependence between the characteristics. It is useful for inference, estimation, interpretation, and prediction. |
| Random forest | An algorithm that grows a large ensemble of classification trees on bootstrapped samples (bagging), each tree using a random selection of the predictor variables; after all the trees have been grown, the predicted class is determined from the average estimated class probability calculated over the ensemble of trees. |
| Gradient boosting machine | An ensemble learning technique similar to random forest in that it averages a large number of decision trees to make a prediction. The difference between the two is the application of gradient boosting: the decision trees are trained sequentially, with the weights of each successive model adjusted to reduce the errors of the previous model. The predicted class is determined from the average estimated class probability (or majority vote of predicted classes) calculated over the ensemble of trees. |
| Multivariate adaptive regression spline | MARS and logistic regression share similarities. In logistic regression, the logarithm of the odds is fitted with a linear combination of the predictors; in MARS, it is fitted with splines to cover non-linear and interaction terms. The hinge function (sometimes called a rectifier) is used to model the splines. |
| Neural network | A method using an adaptive, non-sequential approach to learning that mimics a biological neural network. It is a non-parametric technique in which signals travel from the first layer (the input layer) to the last layer (the output layer). Each layer is made up of a set of neurons, and the output of each neuron is computed by a non-linear function of the sum of its weighted inputs from the neurons of the previous layer. The weights increase or decrease the strength of the signal at a connection. |
| K-nearest neighbours | A model-free method; it is a type of instance-based or "lazy" learning in which there is no training phase: the algorithm simply memorises the training data. Based on the principle that observations located close together in n-dimensional space will have the same outcome, classification involves searching the entire dataset for the k training points closest in Euclidean distance (the k neighbours); the predicted class probability is determined from the proportion of each actual class among these k neighbours. |
| Support vector machine | A quadratic optimisation problem that minimises penalties while maximising margin width: the two classes are separated by constructing non-linear decision boundaries (hyperplanes) via a kernel trick that maximises the margin between them. The posterior probability estimates are a rescaled version of the original classifier scores obtained through a logistic transformation. |
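Six of the seven model families above can be instantiated directly in scikit-learn; as an illustration only (not the authors' code), the following sketch fits each on simulated placeholder data and scores them with cross-validated ROC AUC. MARS is omitted because scikit-learn has no built-in implementation, and all hyperparameters shown are illustrative defaults rather than the tuned values from the study.

```python
# Illustrative sketch: fitting six of the seven compared model families
# in scikit-learn on simulated data. Hyperparameters are placeholders,
# not the optimised settings used in the paper.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Simulated stand-in for a clinical dataset with a limited predictor set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient boosting machine": GradientBoostingClassifier(random_state=0),
    "Neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                    random_state=0),
    "K-nearest neighbours": KNeighborsClassifier(n_neighbors=15),
    # probability=True enables Platt-scaled probabilities (the logistic
    # rescaling of classifier scores described in the table above).
    "Support vector machine": SVC(probability=True, random_state=0),
}

aucs = {}
for name, model in models.items():
    aucs[name] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: ROC AUC = {aucs[name]:.3f}")
```

In the study itself, hyperparameters were optimised inside a nested cross-validation loop rather than left at defaults as here.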
ROC AUC [95% CI] performance comparison of the seven models applied to the internal and external validation datasets. Internal validation was estimated with 5-fold nested cross-validation, while external validation was performed on the YDX dataset
| Model | Internal validation | External validation |
|---|---|---|
| Gradient boosting machine | 0.96 [0.94, 0.98] | 0.93 [0.90, 0.96] |
| K-nearest neighbours | 0.93 [0.90, 0.97] | 0.92 [0.89, 0.95] |
| Logistic regression | 0.96 [0.93, 0.98] | 0.95 [0.92, 0.97] |
| MARS | 0.96 [0.90, 0.99] | 0.94 [0.92, 0.97] |
| Neural network | 0.96 [0.93, 0.99] | 0.94 [0.92, 0.97] |
| Random forest | 0.95 [0.92, 0.98] | 0.94 [0.91, 0.96] |
| Support vector machine | 0.96 [0.93, 0.98] | 0.94 [0.92, 0.97] |
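Confidence intervals like those in the table are commonly obtained by bootstrapping the validation set; the paper does not specify its exact interval method, so the following is a hedged sketch of a standard percentile bootstrap on simulated placeholder predictions.

```python
# Illustrative sketch (not the authors' code): percentile-bootstrap 95% CI
# for ROC AUC on a validation set. Labels and probabilities are simulated.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Toy labels and predicted probabilities standing in for a validation set.
y_true = rng.integers(0, 2, size=300)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=300), 0.0, 1.0)

boot = []
n = len(y_true)
for _ in range(1000):
    idx = rng.integers(0, n, size=n)       # resample patients with replacement
    if len(np.unique(y_true[idx])) < 2:    # AUC needs both classes present
        continue
    boot.append(roc_auc_score(y_true[idx], y_prob[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
point = roc_auc_score(y_true, y_prob)
print(f"AUC {point:.2f} [{lo:.2f}, {hi:.2f}]")
```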
Fig. 1 Calibration plots with 95% confidence interval obtained using external validation dataset for prediction models. a Gradient boosting machine. b K-nearest neighbours. c Logistic regression. d MARS. e Neural network. f Random forest. g Support vector machine. Legend: Dashed line = reference line, solid black line = linear model
Calibration test results on external validation dataset. Calibration-in-the-large indicates whether predicted probabilities are, on average, too high (value below 0) or too low (value above 0). Conversely, the calibration slope quantifies whether predicted risks are, on average, too extreme (value below 1) or too invariant (value above 1)
| Model | Calibration slope | Calibration-in-the-large |
|---|---|---|
| Gradient boosting machine | 0.979 | − 0.005 |
| K-nearest neighbours | 1.495 | 0.046 |
| Logistic regression | 0.903 | − 0.039 |
| MARS | 0.799 | 0.081 |
| Neural network | 0.995 | − 0.031 |
| Random forest | 1.412 | 0.065 |
| Support vector machine | 0.914 | − 0.028 |
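Both calibration statistics are defined on the logit scale: the slope is the coefficient from regressing the observed outcome on the logit of the predicted risk, and calibration-in-the-large is the intercept when that logit enters as a fixed offset (slope forced to 1). The sketch below illustrates this on simulated, well-calibrated data; it is an assumed implementation, not the authors' code.

```python
# Hedged sketch of the two calibration statistics reported in the table,
# computed from a (simulated) validation set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_stats(y_true, y_prob, eps=1e-6):
    """Return (calibration slope, calibration-in-the-large)."""
    p = np.clip(y_prob, eps, 1 - eps)
    lp = np.log(p / (1 - p))  # logit of predicted risk
    # Slope: coefficient of a logistic regression of outcome on the logit.
    # Very large C makes the fit effectively unpenalised.
    fit = LogisticRegression(C=1e12).fit(lp.reshape(-1, 1), y_true)
    slope = fit.coef_[0][0]
    # Calibration-in-the-large: intercept with the logit as a fixed offset
    # (slope held at 1), solved by 1-D Newton iteration on the intercept.
    a = 0.0
    for _ in range(50):
        mu = 1 / (1 + np.exp(-(lp + a)))
        a -= (mu.sum() - y_true.sum()) / (mu * (1 - mu)).sum()
    return slope, a

rng = np.random.default_rng(1)
probs = rng.uniform(0.05, 0.95, 500)
# Outcomes drawn from the predicted risks, so calibration is good by
# construction: slope should be near 1 and the intercept near 0.
outcomes = (rng.uniform(size=500) < probs).astype(int)
slope, citl = calibration_stats(outcomes, probs)
print(f"slope={slope:.2f}, calibration-in-the-large={citl:.2f}")
```

A negative intercept indicates predictions that are on average too high, and a slope below 1 indicates predictions that are too extreme, matching the sign conventions described above the table.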
Fig. 2 Decision curve analysis obtained using external validation dataset for prediction models. The graph gives the expected net benefit per patient relative to treating all patients as having type 2 diabetes. The unit is the benefit associated with one patient with type 1 diabetes receiving the correct treatment. 'all': assume all patients have type 1 diabetes. 'none': assume no patients have type 1 diabetes
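The net benefit behind a decision curve is tp/n minus fp/n weighted by the odds of the probability threshold; at each threshold a model is compared against the "treat all" and "treat none" strategies. The following is an illustrative sketch of that calculation on simulated placeholder data, not the analysis from the paper.

```python
# Illustrative sketch of the net-benefit calculation underlying a decision
# curve. Labels and predicted risks are simulated placeholders.
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients with predicted risk >= threshold."""
    n = len(y_true)
    treat = y_prob >= threshold
    tp = np.sum(treat & (y_true == 1))  # correctly treated cases
    fp = np.sum(treat & (y_true == 0))  # unnecessarily treated non-cases
    # False positives are weighted by the odds of the threshold.
    return tp / n - fp / n * threshold / (1 - threshold)

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 400)
p = np.clip(y * 0.5 + rng.uniform(0, 0.5, 400), 0.0, 0.99)

for t in (0.1, 0.3, 0.5):
    nb_model = net_benefit(y, p, t)
    nb_all = net_benefit(y, np.ones_like(p), t)  # 'treat all' reference
    print(f"t={t}: model {nb_model:.3f}, treat-all {nb_all:.3f}, treat-none 0.000")
```

Plotting net benefit against a range of thresholds, with the treat-all and treat-none references, reproduces the shape of the curves shown in Fig. 2.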