Fons van der Sommen, Jeroen de Groof, Maarten Struyvenberg, Joost van der Putten, Tim Boers, Kiki Fockens, Erik J Schoon, Wouter Curvers, Peter de With, Yuichi Mori, Michael Byrne, Jacques J G H M Bergman.
Abstract
There has been a vast increase in GI literature focused on the use of machine learning in endoscopy. The relative novelty of this field poses a challenge for reviewers and readers of GI journals. To appreciate scientific quality and novelty of machine learning studies, understanding of the technical basis and commonly used techniques is required. Clinicians often lack this technical background, while machine learning experts may be unfamiliar with clinical relevance and implications for daily practice. Therefore, there is an increasing need for a multidisciplinary, international evaluation on how to perform high-quality machine learning research in endoscopy. This review aims to provide guidance for readers and reviewers of peer-reviewed GI journals to allow critical appraisal of the most relevant quality requirements of machine learning studies. The paper provides an overview of common trends and their potential pitfalls and proposes comprehensive quality requirements in six overarching themes: terminology, data, algorithm description, experimental setup, interpretation of results and machine learning in clinical practice. © Author(s) (or their employer(s)) 2020. Re-use permitted under CC BY. Published by BMJ.
Keywords: computerised image analysis; endoscopy; gastrointestinal endoscopy
Year: 2020 PMID: 32393540 PMCID: PMC7569393 DOI: 10.1136/gutjnl-2019-320466
Source DB: PubMed Journal: Gut ISSN: 0017-5749 Impact factor: 23.059
Overview of the most commonly used terminology in machine learning literature (addressed in this review)
| Machine learning | The use of mathematical models for capturing structure in data. After the optimisation procedure on example data—so-called training—the models can be used to make predictions about new, unseen data. |
| Features | Visual properties of the data that are quantitatively summarised in an array of numbers. In conventional machine learning, these features are clinically inspired and thus handcrafted, while in deep learning these features are automatically learnt from the data. |
| Computer-aided detection (CADe) | Machine learning algorithms applied to medical data for primary detection of pathology (eg, polyp detection). |
| Computer-aided diagnosis (CADx) | Machine learning algorithms applied to medical data for predicting diagnoses (eg, polyp classification). |
| Deep learning | A form of machine learning in which a neural network of several layers is used, exploiting hierarchical relations in the data. The major difference from conventional machine learning is that these features and relations are all learnt from the data, a property which is also referred to as |
| Pretraining | Training a deep learning algorithm with data that are different from the target data. This technique can be exploited to first train a rough model on a large set that can be fine-tuned using a smaller dataset of interest. ImageNet is by far the most commonly used dataset for pretraining. |
| Transfer learning | This is used after a deep neural network is pretrained on a large dataset that is different from the target data. Generally, a dataset is used with general imagery not specific to the final purpose of the algorithm. This pretrained model extracts basic discriminating features from the large dataset and these features and their weights are then ‘transferred’ for training and fine-tuning on new data which are specific for the target purpose of the model—often applied when sufficient target data are lacking to train the network from scratch. |
| Hyperparameters | Almost all machine learning models are regulated by so-called hyperparameters, which govern the model architecture and its training procedure. Examples of common hyperparameters in neural networks are the number of layers and the learning rate. These parameters generally cannot be optimised during the training process and are typically chosen empirically, based on a number of trials. |
| Hyperparameter optimisation | The process of finding the right hyperparameters of a model, based on the performance on the validation set. This is performed either by using a grid-search, in which a number of options are defined for each hyperparameter and all combinations are systematically evaluated, or using a random search, in which the values are randomly sampled from a predefined range. |
| Ensemble learning | Instead of training a single model on the whole dataset, one can also train multiple models that are trained slightly differently to yield a prediction about the same data point. These models are generally trained on different subsets of the data and with slightly different hyperparameters. Averaging the scores of different models generally leads to a better and more robust prediction. |
| Training dataset | A set of data (examples) on which the mathematical model is optimised (trained). In supervised learning, the examples are labelled, and the model is trained to predict the labels of the samples. |
| Validation dataset | A separate set of data samples that can be used to tune the hyperparameters of the model. A model can be trained several times with different hyperparameter values (on the training set) and the ones that achieve the best performance on the validation set are chosen. Often referred to as ‘internal validation’. |
| Test dataset | A set of data samples used neither for training the model nor for optimising the hyperparameters. The performance on the test set reflects how well the model generalises to new, unseen data. |
| Cross-validation | A validation approach that is more robust to outliers than a regular hold-out approach. In K-fold cross-validation, the data used for training and validation are split into K parts, after which the model is subsequently trained with K-1 folds of data and validated on the left-out fold. This is repeated for all folds after which the scores are pooled. |
| Overfitting | A phenomenon that occurs when the model is too tightly fitted to the training data and does not generalise to new data (ie, the model only works for the given training examples). Overfitting can be recognised by high training performance combined with low test performance. |
| Data augmentation | A way to artificially enhance the size of a dataset, by adding slightly distorted copies of the original data points to the training set. The samples are distorted in such a way that the labels do not change after applying the transformation (eg, rotation, slight skewing, minor zooming, adding noise). The use of data augmentation generally leads to more robust models. |
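The roles of the training, validation and test sets during hyperparameter optimisation can be illustrated with a minimal grid-search sketch. Everything in it is an assumption made for illustration and does not come from the review: the toy 1-D data, the k-nearest-neighbour classifier, and the helper names `knn_predict`, `accuracy` and `grid_search`. The only point is the workflow: candidate hyperparameter values are compared on the validation set, and the test set is touched once at the end.

```python
from itertools import product

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (toy 1-D data)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    """Fraction of samples in `data` classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

def grid_search(train, val, grid):
    """Evaluate every combination in `grid` on the validation set."""
    best_params, best_score = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = accuracy(train, val, **params)
        if score > best_score:  # keep the first best combination
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical data: (feature, label); the point (0.3, 1) is a noisy label.
train = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0),
         (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1), (1.0, 1)]
val = [(0.28, 0), (0.35, 0), (0.65, 1), (0.85, 1)]
test = [(0.15, 0), (0.75, 1)]

# A single hyperparameter keeps the sketch short; the loop handles several.
best_params, best_score = grid_search(train, val, {"k": [1, 3, 5]})

# The test set is used exactly once, after the hyperparameters are fixed.
test_score = accuracy(train, test, **best_params)
print(best_params, best_score, test_score)
```

A random search would replace the `product(...)` loop with values sampled from predefined ranges; the use of the validation set stays the same.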
Overview of commonly used terminology in machine learning literature, not further described in this paper
| Support vector machine (SVM) | An efficient machine learning algorithm that aims to find a line (hyperplane) separating data with maximum margin between two classes. SVMs can be linear or non-linear; the latter is more powerful, but also more prone to overfitting. |
| Random forests | An ensemble machine learning algorithm in which a large number of binary decision trees are trained, each using a different (random) subset of the data (ie, bagging) and with different split options (sampled randomly). |
| Backpropagation | A method for training neural networks, in which the network is first used to make a prediction for a given sample, after which the error is propagated back through the network for updating the network weights such that the error will be reduced. This is repeated many times for all the data points in the dataset, until the network is said to |
| Regularisation | A collection of techniques that can be used to counter overfitting. This can be done either by explicitly introducing some model constraints in the mathematical optimisation procedure (training) or implicitly, for example, by using slightly distorted copies of the data during training (also known as data augmentation). |
| Batch normalisation | A method to force the network activations of each layer into a certain range, so that the network can optimally learn from errors during backpropagation. |
| Gradient descent | A mathematical optimisation procedure in which the gradient of a function is exploited to move towards a (local) minimum of a function. In machine learning this is generally the loss function, which captures the number of errors the model makes. The gradient is used to subsequently take steps on this function along the steepest slope downwards. |
| Mini-batch | A group of data samples for which the loss is jointly computed during backpropagation in order to make an update step. Mini-batches are generally sampled randomly without replacement. Once there are no data points left, one epoch is passed (see epoch). |
| Epoch | During backpropagation, all data points pass through the network either individually or in mini-batches in order to update the model and minimise the loss/error. An epoch represents the period for which all data points have passed through the network once. |
| Learning rate | During the training of a neural network, the model gradually adjusts its weights until the prediction error on the data is minimised. The direction of these updates is determined by gradient descent optimisation, while the magnitude of these updates is governed by the learning rate. A large learning rate will lead to fast convergence towards an optimum, but if the update steps are too large, the real optimum can never be achieved. |
| Classification | Classification is a form of supervised learning, for which the input comprises numerical data (eg, images) and the goal of the algorithm is to match that input with a target class of a predefined set of potential categories at the output. An example here would be polyp classification, where a polyp can be either hyperplastic, sessile serrated or adenomatous. |
| Regression | Regression is a form of supervised learning, for which the input comprises numerical data (eg, images) and the goal of the algorithm is to match that input with a target continuous numerical value at the output. For example, estimating the oxygen saturation of the blood based on an image of the mucosa. |
| Object detection | Object detection is a form of supervised learning, for which the input comprises numerical data (eg, images) and the goal of the algorithm is to detect whether or not an object from a predefined list of objects is present in that image and indicate its location within the image, typically with a rectangular bounding box, at the output. An example is polyp detection in colonoscopy. |
| Image segmentation | Image segmentation is a form of supervised learning, for which the input comprises an image and the goal of the algorithm is to segment parts of that image that are associated with a predefined category or set of categories at the output. Typically, the output is a numerical mask indicating, for each pixel, the category to which it belongs. An example is lesion segmentation in Barrett’s oesophagus. |
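The terms gradient descent, mini-batch, epoch and learning rate defined above fit together as in the following minimal sketch. It is an illustration under stated assumptions, not a method from the review: the model is a one-variable linear fit, the loss is the mean squared error (not an error count), and the function name `train_linear`, the toy data and all default values are hypothetical.

```python
import random

def train_linear(data, learning_rate=0.1, epochs=500, batch_size=4, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # One epoch: every sample passes through the model exactly once.
        order = data[:]
        rng.shuffle(order)  # mini-batches are drawn without replacement
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the mean squared error over the mini-batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # The gradient sets the direction, the learning rate the step size.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1 on [0, 1].
data = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
w, b = train_linear(data)
print(round(w, 3), round(b, 3))
```

In a neural network the two gradient lines are replaced by backpropagation, which computes the same kind of derivative for every weight in the network.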
Figure 1. Graphical display of overfitting of training data. In this figure, the leftmost panel displays data points of two classes, in which the class is indicated by the colour. The centre panel shows the same data including the prediction of a model trained on that data as the background colour. Overfitting is clearly visible as the model isolates points of the red class, rather than capturing the class as a whole. The rightmost panel shows the prediction of a different model as background colour. Although this model makes mistakes (red points can be seen on a blue background and vice versa), this model demonstrates better generalisation, as it captures the class distributions rather than individual points.
Figure 2. Visualisation of training, validation and test set and overfitting, and their appropriate use. The training dataset is used to train the model, followed by validation. In case of unsatisfactory performance, the model is changed, retrained and again validated. In case of satisfactory performance, the model is then tested on a separate test set to evaluate model performance.
Figure 3. Graphical display of fourfold cross-validation.
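The fourfold scheme can be sketched in a few lines. This is an illustrative example only: the helper names `k_fold_splits` and `cross_validate`, the toy labels and the majority-class "model" are assumptions, and pooling is done here by averaging the per-fold scores, which is one of several possible choices. In practice the samples would usually be shuffled, and grouped per patient, before splitting.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Train on K-1 folds, validate on the left-out fold, average the scores."""
    scores = [fit_and_score(train, val) for train, val in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)

# Hypothetical labels; the "model" simply predicts the majority training class.
labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

def majority_class_score(train, val):
    train_labels = [labels[i] for i in train]
    prediction = 1 if 2 * sum(train_labels) > len(train_labels) else 0
    return sum(labels[i] == prediction for i in val) / len(val)

splits = list(k_fold_splits(len(labels), 4))
mean_score = cross_validate(len(labels), 4, majority_class_score)
print(len(splits), mean_score)
```

Each sample lands in the validation fold exactly once, so every data point contributes to the pooled estimate.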
Checklist of key elements that should be described in machine learning papers, structured by manuscript section
| Section | Topic | Key element | Page |
| Methods | Data | Minimise the risk of overfitting by using multiple, heterogeneous and independent datasets. | |
| | | Provide a complete description of the data acquisition process. | |
| | | Describe the basic technical information of the imagery and the use of any preprocessing methods. | |
| | | Define a reliable gold standard for all data used to train, validate and test the model. | |
| | | For algorithms that localise lesions on images and videos, reliable gold standard input for the model should incorporate annotations by multiple experts. | |
| | | Provide detailed information on ethics approval concerning the use of patient data. | |
| | Algorithm architecture | Provide a basic description of the algorithm architecture and clear-cut motivation for the most relevant technical choices. | |
| | | Describe extensive technical details in separate technical publications, or in supplementary materials. | |
| | Experimental set-up | Describe the experimental set-up of the algorithm and choose the appropriate performance metrics. | |
| | | Primary outcome parameters should be based on the envisioned clinical application of the model. | |
| | | Do not optimise hyperparameters on the test set. | |
| | | Ensure that training, validation and test sets are always split on a patient basis. | |
| | | Report a complete overview of all evaluated models to prevent ‘cherry-picking’ of the best performing algorithms. | |
| Results | | Results should be presented with caution and in a structured approach. | |
| Discussion | | Include a section where data selection bias, overfitting and generalisability are explicitly discussed. | |
| | | Describe the necessary steps towards clinical implementation. | |
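The checklist item on patient-based splitting can be made concrete with a short sketch. It is a hypothetical illustration, not code from the review: the record layout with a `patient_id` field, the file names, the 60/20/20 fractions and the function name `split_by_patient` are all assumptions. The idea is that patients, not images, are shuffled and assigned, so that images of one patient never end up in more than one set.

```python
import random

def split_by_patient(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole patients (not single images) to train, val and test."""
    patients = sorted({r["patient_id"] for r in records})
    random.Random(seed).shuffle(patients)
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    # Every image follows its patient into exactly one set.
    return {name: [r for r in records if r["patient_id"] in ids]
            for name, ids in groups.items()}

# Hypothetical dataset: 10 patients with 3 images each.
records = [{"patient_id": f"P{p:02d}", "image": f"P{p:02d}_img{i}.png"}
           for p in range(10) for i in range(3)]

splits = split_by_patient(records)
patient_ids = {name: {r["patient_id"] for r in rows} for name, rows in splits.items()}
print({name: len(ids) for name, ids in patient_ids.items()})
```

Splitting at image level instead would let near-identical frames of the same lesion appear in both training and test data, which inflates the reported test performance.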
Figure 4. Exemplary case of subtle Barrett’s neoplasia, delineated by three experts (yellow, blue and green). Parts of the lesion (‘the sweet spot’) are recognised by all experts (black), yet other parts are only recognised by one or two experts. Reprinted from Bergman J, de Groof AJ, Pech O, et al. An interactive web-based educational tool improves detection and delineation of Barrett's esophagus-related neoplasia. Gastroenterology 2019;156:1299-1308, with permission from Elsevier.
The expected number of false positives (FPs) per true positive (TP) for a fixed performance and varying incidence
| Sensitivity/Specificity | Incidence | #FPs per TP |
| 0.90/0.90 | 1:1 | 0.1 |
| | 1:10 | 1 |
| | 1:100 | 11 |
| | 1:1000 | 111 |
| | 1:10 000 | 1111 |
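The values in this table can be reproduced with a short calculation. The sketch below assumes that an incidence of 1:N means one positive case per N negative cases and that the table entries are rounded; both are interpretations, not statements from the source, and the function name is hypothetical. Per positive case, the expected number of TPs is the sensitivity, and the expected number of FPs is (1 - specificity) times N.

```python
def false_positives_per_true_positive(sensitivity, specificity, negatives_per_positive):
    """Expected FPs per TP when there are N negative cases per positive case."""
    true_positives = sensitivity  # expected TPs per positive case
    false_positives = (1.0 - specificity) * negatives_per_positive
    return false_positives / true_positives

# Incidence 1:N is read here as one positive per N negatives (an assumption).
ratios = {n: false_positives_per_true_positive(0.90, 0.90, n)
          for n in (1, 10, 100, 1000, 10000)}

for n, value in ratios.items():
    print(f"1:{n}  {value:.2f}")
```

The ratio grows linearly with N, which is why a model with seemingly good sensitivity and specificity can still produce far more false alarms than true detections when the condition is rare.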