Hooman H Rashidi, Nam K Tran, Elham Vali Betts, Lydia P Howell, Ralph Green.
Abstract
Increased interest in the opportunities provided by artificial intelligence and machine learning has spawned a new field of health-care research. The new tools under development are targeting many aspects of medical practice, including changes to the practice of pathology and laboratory medicine. Optimal design of these powerful tools requires cross-disciplinary literacy, including basic knowledge and understanding of critical concepts that have traditionally been unfamiliar to pathologists and laboratorians. This review provides definitions and basic knowledge of machine learning categories (supervised, unsupervised, and reinforcement learning), introduces the underlying concept of the bias-variance trade-off as an important foundation in supervised machine learning, and discusses approaches to the supervised machine learning study design along with an overview and description of common supervised machine learning algorithms (linear regression, logistic regression, Naive Bayes, k-nearest neighbor, support vector machine, random forest, convolutional neural networks).
Keywords: algorithms; artificial intelligence; convolutional neural network; deep learning; k-nearest neighbor; machine learning; random forest; supervised learning; supervised methods; support vector machine; unsupervised learning
Year: 2019 PMID: 31523704 PMCID: PMC6727099 DOI: 10.1177/2374289519873088
Source DB: PubMed Journal: Acad Pathol ISSN: 2374-2895
Common Machine Learning Terms.
| Term | Definition | Example(s) | ||
|---|---|---|---|---|
| Bias-variance trade off | This refers to finding the right balance between bias and variance in a machine learning (ML) model, with the ultimate goal of finding the most generalizable model. Notably, increased bias usually leads to an underfitted model while increased variance may lead to overfitting. Finding the happy balance between the bias and variance is the key to finding the most generalizable model | See | ||
| Boosted tree | An ensemble method that uses weak predictors (eg, decision trees) that can ultimately be boosted and lead to a better performing model (ie, the boosted tree). | Also, Gradient boosting machine (GBM) | ||
| Bootstrapping | To randomly pull samples from the original data set for creating a new data set of the same size. Note: “Bagging” is a related term that stands for bootstrap aggregating | Regularly used in certain ensemble decision tree algorithms such as random forest | ||
| Categorical data/features | These include features that have discrete values and are often binary, although features with more than 2 categories are not infrequent | Patients with diabetes and patients without diabetes | ||
| Class | Refers to the labeled target values | For a binary classification of cancer diagnosis, the classes could be cancer versus no cancer designated as the targets | ||
| Classification modeling | A supervised machine learning approach that builds models that are able to distinguish 2 or more discrete classes | Cancer versus no cancer | ||
| Classification threshold | This is usually the probability/value threshold that allows the model to separate the 2 classes. | If the default probability value of 0.5 is used in a logistic regression-based model to separate the positive cases of cancer from the negative cases of cancer, the positive cases are those with a probability ≥0.5 and the negative cases are those with a probability <0.5. | ||
| Clustering | Refers to grouping related data points. This is commonly seen in certain unsupervised machine learning methods. | k-means algorithm (see K entry terms in the glossary below) | ||
| Confusion matrix | Summarizes the model’s predictions, specifically denoting the true-positive, true-negative, false-positive, and false-negative predictions. These values then make it possible to calculate the various performance measures of the model (eg, accuracy, sensitivity, specificity, etc) | Actual cancer cases: true positive if predicted as cancer, false negative if predicted as negative. Actual negative cases: false positive if predicted as cancer, true negative if predicted as negative. | ||
| Convolutional Neural network (CNN) | These are usually deep neural networks with at least 1 convolutional layer and that are well suited for certain complex tasks such as image recognition/classification. | See | ||
| Cross Industry Standard Process for Data Mining (CRISP-DM) | A systematic approach to the supervised machine learning process that includes the following steps: data collection and processing, followed by model building and validation, and ultimately model deployment | See | ||
| Cross validation | A way to statistically estimate the model’s generalizability by withholding 1 or more internal test sets that can then be tested against the trained model(s). | k-fold cross validation (see K entry terms in the glossary below) | ||
| Data types | Data within ML platforms are best categorized into the following 4 types: numerical (exact numbers), text (which will need to be converted to numbers), categorical (represents characteristics such as normal tissue vs cancer, etc), and time series (sequence of numbers collected over time intervals) | An example of categorical data is normal tissue versus cancer cases. Note: a grouping into 7 data types has also been proposed, which includes useless, nominal, binary, ordinal, count, time, and interval data. | ||
| Decision boundary | This refers to the boundary between the classes that is learned by the machine learning model. In the example, the blue dashed lines represent the best decision boundaries that separate the 3 classes (black group, gray group, and the red group). | See | ||
| Decision tree | These use a flowchart structure that typically contains a root, internal nodes, branches, and leaves. The internal node is where the attribute in question (eg, creatinine >1 or creatinine <1) is tested while the branch is where the outcome of this tested question is then delegated. The leaves are where the final class label is assigned which, in short, represents the final decision after the results of all the attributes have been incorporated. The end result of the decision tree is a set of rules that governs the path from the root to the leaves | Simple Decision tree (not often used) | ||
| Deep Neural Network (DNN) | Refers to a neural network with multiple hidden layers and a large number of nodal connections | Common deep neural networks that are currently used within the image analysis field include AlexNet, Inception, which is also known as GoogLeNet (eg, Inception-v3), and ResNet (eg, ResNet50). These are commonly employed in various transfer learning projects | ||
| Discrete | Qualitative targets or features within a classification schema in supervised learning. | Examples of discrete targets may include cancer versus normal tissue or acute kidney injury (AKI) versus No-AKI in a classification ML model. This is in contrast to quantitative targets that can be used within supervised learning models and that can then also be used to predict a numerical outcome as in a linear regression model | ||
| Feature | Refers to the input variables that are used to map to the target in a model | For example, in an acute kidney injury (AKI) model, certain features such as creatinine, urine output, and NGAL may be used to map to its target categories (AKI vs No-AKI), which would ultimately make it possible to build a model that can predict AKI | ||
| Feature engineering | This is the process that allows for the selection of certain key features or the transformation/conversion of certain features within a data set that will ultimately lead to a better prediction model | A data set with a large number of features can be refined by certain methods (eg, PCA or K-Best, etc) to find the most relevant features (within all features) based on various statistical methods. This can then be used within the algorithm to build a new model. For example, a data set with 20 features or more can be refined through PCA or K-Best to a new data set with 10 or fewer features | ||
| Generalizability | The ability of a model to accurately predict on new previously unseen test sets (secondary or tertiary test sets) that were completely outside of the training data set. | In machine learning, the validation accuracy typically refers to the model’s accuracy on the primary test set that was set aside from the training data during the train-test split phase. In contrast, the generalization accuracy is measured on a new test set that is used to test the final model’s capability of predicting on previously unseen data. | ||
| Generalization test set | The secondary or tertiary test set that is unknown to the initial train-test split data set and used to assess the model’s generalizability. | A cancer-predicting model trained with images from one institution can be tested for its generalizability using another institution’s images as a secondary test set. | ||
| Input variables | These usually refer to the features from the training data. | In a cancer predicting model trained with cancer and no-cancer images, the image features within cancer are used to map to the cancer-labeled group while the image features of the benign group are mapped to the no-cancer group. | ||
| K-fold cross validation (CV) | A type of cross validation in which the train-test data set is split k times. | If the | ||
| K-means | An unsupervised method that utilizes discrete or continuous data as its input parameter for identifying input regularities (ie, clusters). | Clustering a medical data set to characterize subpopulations of a disease based on various known input clinical parameters. | ||
| K-nearest neighbor (KNN) | A nonparametric algorithm used for data classification and regression. Classification is based on the number of | See | ||
| Kernel trick | Allows the data to be transformed into another dimension (eg, z-plane) which ultimately enhances the dividing margin between the classes of interest. | See | ||
| Leave One Out cross validation | The extreme version of the k-fold CV approach in which | For example, n of 100 could be the individual sets of data in 100 patients. Hence, the k-fold CV in this case will split the data into 100 train-test splits, which assures full sampling | ||
| Linear regression | This algorithm allows us to find the target variable (usually a numerical value) by finding the best-fitted straight line, which is also known as the “least squares regression line” (the line with the lowest sum of squared errors), between the independent variables (the cause or features) and dependent variables (the effect or target). The ultimate goal of this technique is to fit a straight line to the data set in question. | See | ||
| Logistic regression | The term regression is somewhat of a misnomer since in general this is a classification method that uses a logistic function for predicting a dichotomous dependent variable (target). | See | ||
| Loss | It measures how poorly the model is performing when comparing its predictions to the target labels. Hence, typically the lower the loss, the better the model. | For example, measuring “log loss” in logistic regression models and “mean squared error” (MSE) in linear regression models. | ||
| Machine learning (ML) | Machine-based intelligence (in contrast to natural human intelligence). Often used interchangeably with the term artificial intelligence (AI), although ML is a subset of AI. Paraphrasing Arthur Samuel and others, ML models are built by a set of data points trained through mathematical and statistical approaches that ultimately enable prediction of new previously unseen data without being explicitly programmed. | Supervised ML: Classification and Regression models | ||
| Model | This usually refers to the end result of the machine learning algorithm’s training phase in which the variables are ultimately mapped to the desired target. | A deep neural network model that is trained to predict cancer from benign tissue or a | ||
| Naive Bayes | A classifier that uses a probabilistic approach based on the Bayes theorem. This approach assumes the naïve notion that the features being evaluated are independent of each other. | See | ||
| Natural Language Processing (NLP) | An ML process that enables computers to learn and ultimately analyze human (natural) language data. This may include various aspects of speech, syntax, semantics, and discourse (oral and written). | Apple’s | ||
| Normalization | A standardization or scaling procedure that is often used in the processing phase of the data in preparation for machine learning. Certain methods (eg, distance-based algorithms) within machine learning (eg, | For example, Standard scaler (scales the data so that the mean of the distribution is set to 0 with a standard deviation of 1.) | ||
| Overfitting | This gives rise to the model appearing as a good predictor on the training data while underperforming on future new and previously unseen data (ie, not generalizable). This is due to its low bias and high variance, in which the model adapts too strongly to the training data, which could have included noise. | See | ||
| Parametric algorithm | The set of parameters in a parametric algorithm is fixed which confines the function to a known form. In general, the assumption within parametric algorithms is that the function is linear or assumes a normal distribution while nonparametric methods do not make such assumptions | The most commonly encountered parametric algorithms include linear regression, logistic regression, and naïve Bayes, while some of the most common nonparametric algorithms include | ||
| Prediction | Refers to the model’s output based on some initial input variables | A deep neural network trained to identify cancer cases is tested against an unknown histologic image which predicts it as cancer with a certain probability. | ||
| Principal Component Analysis (PCA) | A statistical approach that can produce lower dimensional representations of a data set. This technique can highlight the contribution of various features within a data set through its principal components, which could ultimately allow a reduction in the number of features that are required to build the model. | Within a data set that contains, say, 8 features, 92% of the explained variance within the data may have been found to be contained within the first 2 principal components (PC1 and PC2), which were subsequently shown to be mainly due to 4 of the 8 features. Then, those 4 features may be selected to build an ML model if one is looking to train a model with a smaller number of features | ||
| Random Forest (RF) | Uses a collection of decision trees for ensemble learning. Using bootstrapping, this method generates random data sets that are then used to train an ensemble of decision trees. Ultimately, each decision tree will determine an outcome, and a majority “vote” approach is used to classify the data. Appropriately, this is called random forest since a large number of randomly generated decision trees are used to construct the final model. | See | ||
| Regularization | A process that helps to minimize the overfitting of an ML model by reducing the effect of noise. | This usually refers to adding some information to the process that ultimately reduces overfitting and makes the model a better predicting tool. | ||
| Reinforcement learning | These platforms may share features of both a supervised and an unsupervised process and usually function through a policy-based platform. | Google DeepMind’s AlphaGo, which was able to beat champion Go players. | ||
| Scikit-learn | A popular machine learning library that enables users to build and assess various ML models through a variety of supervised and unsupervised algorithms. Other commonly used ML libraries (besides scikit-learn) that are especially useful for image classification model building include Turi Create and TensorFlow. | Building a classification model through its support vector machine or logistic regression algorithm. | ||
| Supervised learning | These platforms employ “labeled” training data sets (labeled/supervised by subject experts) to yield a qualitative or quantitative output. The 2 major categories of supervised learning are classification and regression which lead to discrete/qualitative and continuous/quantitative targets, respectively. | Classification models; regression models | ||
| Support Vector Machine (SVM) | Classifies data by defining a hyperplane that best differentiates 2 groups. This differentiation is maximized by increasing the margin (the distance) on either side of this hyperplane. In the end, the hyperplane-bounded region with the largest possible margin is used for the analysis. One of the key highlights of the SVM method is its ability to find nonlinear relationships through the use of a kernel function (kernel trick). | See | ||
| Target | Within supervised ML, the target is also sometimes referred to as the label which is comprised of the results or classes that one seeks to find. | In a supervised ML model for cancer versus normal tissue, the “cancer” and “normal tissue” are the labels (targets). | ||
| Train-Test Split | A common approach employed in supervised machine learning in which a subset of the initial data is used to train the model and a subset that is set aside is used for its initial validation. | An 80–20 train-test split is one in which 80% of the initial data set is assigned to the training phase and 20% is kept behind and used to test the model’s performance. | ||
| Transfer learning | In this method, a preestablished convolutional neural network (eg, ResNet-50) that was originally trained on unrelated images is retrained on new images (eg, cancer vs benign histology) that are usually absent from its original training data. | This approach can be used to build accurate ML models that can distinguish histologic variants of disease by retraining a preestablished neural network such as ResNet50 or Inception-v3. | ||
| Underfitting | An underfitted model has a higher bias and lower variance. In this situation, important potential interrelationships between the data features may be ignored. | See | ||
| Unsupervised learning | These involve agnostic aggregation of unlabeled data sets yielding groups or clusters of entities with shared similarities that may be unknown to the user prior to the analysis step. | Clustering; Dimensionality Reduction (eg, PCA) | ||
| Validation testing | This usually refers to the initial validation testing phase in which the test set (eg, 20% of the initial data set) that was set aside from the train-test split is used to assess the model’s initial performance. This does not always correlate with the generalizability of the model. Hence, testing the model with secondary test sets is usually recommended. | In a model in which 20% of the initial data set was kept behind and used to test the model’s performance, the model was found to be 92% accurate (its validation accuracy). However, using a secondary test set the model was found to be 81% accurate (the model’s likely generalizability). | ||
Abbreviations: AKI, acute kidney injury; CV, cross-validation; k-NN, k-nearest neighbor; ML, machine learning; NGAL, neutrophil gelatinase-associated lipocalin; PCA, principal component analysis; RF, random forest; SVM, support vector machine.
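The classification threshold and confusion matrix entries in the glossary above can be illustrated with a minimal Python sketch. The labels and probabilities below are invented for illustration (1 = cancer, 0 = negative), and the function names `confusion_counts` and `performance` are hypothetical helpers, not part of any library.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FN, FP, TN for binary labels (1 = cancer, 0 = negative)."""
    tp = fn = fp = tn = 0
    for actual, prob in zip(y_true, y_prob):
        # A probability at or above the threshold is called positive.
        predicted = 1 if prob >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def performance(tp, fn, fp, tn):
    """Derive common performance measures from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }


# Hypothetical labels and model probabilities, invented for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.1, 0.05]

counts = confusion_counts(y_true, y_prob)  # default threshold of 0.5
metrics = performance(*counts)
print(counts, metrics)
```

Lowering the threshold in this sketch trades specificity for sensitivity, which is why the threshold is usually chosen with the clinical use case in mind.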
Figure 1. Overview diagram of machine learning algorithms. Machine learning is a subset of artificial intelligence. This figure illustrates the hierarchy of different machine learning algorithms including supervised versus unsupervised versus reinforcement learning techniques. The 2 major categories of supervised learning are classification and regression which lead to discrete/qualitative and continuous/quantitative targets, respectively.
Comparison of Most Common Supervised Learning Algorithms.
| Algorithm | General Accuracy of the Models Built | Used for Classification or Regression Tasks | Training Time | Algorithm Is Relatively Transparent | Able to Deal With Noise (to Tune out Irrelevant Features) | Need for Scaling/Normalization of Data | Highlights | Limitations |
|---|---|---|---|---|---|---|---|---|
| Linear regression | Low-intermediate | Regression | Rapid | Yes | No | Yes | Well studied and well known; simple | Less predictive on closely correlated variables; sensitive to background noise |
| Logistic Regression | Low-Intermediate | Classification | Rapid | Yes | No | Yes | Well studied and well known; simple | May be limited by high number of features |
| Naïve Bayes | Low-Intermediate | Classification | Rapid | Yes | Yes | No | Very transparent; very rapid | The assumption that the features are independent may be false |
| K-Nearest Neighbor | Intermediate | Both | Rapid | Yes | No | Yes | Able to find nonlinear relationships; no real training process is required (grouped based on data distances) | Risk of overfitting; the type of distance metric used may alter the performance |
| Decision Tree (eg, Boosted Tree) | Intermediate-High | Both | Rapid | No | No | No | Able to find nonlinear relationships | Risk of overfitting |
| Random Forest | High | Both | Intermediate | No | Yes | No | Able to find nonlinear relationships | Risk of overfitting |
| Support Vector Machine | High | Classification* | Rapid | Yes | Yes | Yes | Able to find nonlinear relationships | Risk of overfitting |
| Convolutional Neural Network | High | Both | Slower | No | Yes | Yes | Method of choice for many image analysis studies; able to find nonlinear relationships | Black box algorithm; risk of overfitting; requires higher computational power and time |
*Support Vector Regression (SVR), not discussed here, is the counterpart to SVM and used for regression studies.
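The table notes that k-nearest neighbor needs scaled data and is sensitive to the distance metric. The sketch below illustrates why, using a small invented AKI data set with 2 features on very different scales (creatinine and urine output; the values and units are assumptions made for illustration only). The helpers `fit_scaler`, `scale`, and `knn_predict` are hypothetical names, and Euclidean distance is an assumed choice of metric.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def fit_scaler(rows):
    """Return the per-feature mean and standard deviation of the training rows."""
    columns = list(zip(*rows))
    return [mean(col) for col in columns], [pstdev(col) for col in columns]


def scale(row, mus, sigmas):
    """Standardize one row so each feature has mean 0 and SD 1 on the training set."""
    return [(v - m) / s for v, m, s in zip(row, mus, sigmas)]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented data: (creatinine, urine output); not taken from any real study.
train_x = [(2.8, 1200), (3.0, 1000), (2.6, 1100),
           (0.9, 1500), (1.0, 1600), (0.8, 1400)]
train_y = ["AKI", "AKI", "AKI", "No-AKI", "No-AKI", "No-AKI"]
query = (2.7, 1450)

# Without scaling, urine output dominates the distance calculation.
raw_prediction = knn_predict(train_x, train_y, query)

# With scaling, both features contribute comparably.
mus, sigmas = fit_scaler(train_x)
scaled_train = [scale(row, mus, sigmas) for row in train_x]
scaled_prediction = knn_predict(scaled_train, train_y, scale(query, mus, sigmas))
print(raw_prediction, scaled_prediction)
```

With these invented values the unscaled vote follows urine output alone, while the scaled vote also reflects the elevated creatinine, so the 2 predictions differ.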
Figure 2. Bias-variance trade-off in machine learning. This figure illustrates the trade-off between bias and variance. Training data (green line) often do not completely represent results from the testing phase. Underfitting data are less variable but exhibit a high error rate and high bias (blue box). In contrast, overfitting data result in low bias and high variance (yellow box). The ideal zone lies between over- versus underfitting of data and may not be optimal until several attempts at testing have been made (red line).
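The trade-off described in this caption can be reproduced in a small numerical sketch. It assumes a k-nearest neighbor regressor as the model family (an illustrative choice, not the figure's own method) and an invented 1-dimensional data set in which the true trend is y = x and the training targets carry alternating noise of +1 and -1. Small k gives a flexible, overfitted model; large k gives a rigid, underfitted one.

```python
def knn_regress(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, points, k):
    """Mean squared error of the k-NN regressor on the given (x, y) points."""
    return sum((knn_regress(train, x, k) - y) ** 2 for x, y in points) / len(points)


# Invented data: true trend y = x, training targets perturbed by +1/-1 noise.
noise = [1, -1] * 5
train = [(i, i + noise[i]) for i in range(10)]
# Held-out points on the noise-free trend, placed between the training points.
test = [(i + 0.4, i + 0.4) for i in range(9)]

# For each k, record (training error, test error).
results = {k: (mse(train, train, k), mse(train, test, k)) for k in (1, 3, 10)}
for k, (train_err, test_err) in results.items():
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

In this sketch k = 1 memorizes the training noise (zero training error, higher test error), k = 10 predicts the overall mean everywhere (high error on both sets), and the intermediate k = 3 gives the lowest test error of the three.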
Figure 3. Comparison of popular supervised learning methodologies. This figure illustrates a variety of popular supervised machine learning (ML) methodologies. In the top row, linear regression, logistic regression, and Naïve Bayes Classifier (via TensorFlow) are shown. In the second row, k-nearest neighbor (k-NN), the ensemble decision tree algorithm random forest (RF), and support vector machine (SVM) are compared. Finally, the bottom row illustrates a convolutional neural network evaluating an image. Each image pixel is evaluated (input layer). The network contains several “hidden layers” (yellow circles), whose output is then processed and sent to the output layer (green circles).
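As a worked instance of the simplest method in this figure, the least squares regression line can be computed in closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept places the line through the point of means. The sketch below uses invented data points, and `least_squares_line` and `mean_squared_error` are hypothetical helper names.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Loss of the fitted line: average squared difference from the targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented feature/target pairs, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = least_squares_line(xs, ys)
loss = mean_squared_error(xs, ys, slope, intercept)
print(f"y = {slope:.2f}x + {intercept:.2f}, MSE = {loss:.4f}")
```

Any other straight line through these points would give a larger mean squared error, which is what "best-fitted" means for this algorithm.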
Figure 4. Supervised (labeled) machine learning model study design overview. Steps for the deployment of a supervised machine learning model. From left to right, the figure shows the initial team of multidisciplinary experts defining a study design to address a need. Data are then collected and processed, and the model is trained, tested, validated, and ultimately deployed.
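The training and testing steps in this workflow rest on how the data are partitioned. The following sketch shows an 80-20 train-test split and a k-fold cross-validation index generator in plain Python. The function names, the fixed random seed, and the 100-item data set are assumptions made for illustration; setting k equal to the number of samples gives leave-one-out cross validation.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items and set aside a fraction as the test set."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


patients = list(range(100))  # stand-ins for 100 patient records

train_set, test_set = train_test_split(patients)      # 80-20 split
folds = list(k_fold_indices(len(patients), 5))        # 5-fold CV
leave_one_out = list(k_fold_indices(len(patients), len(patients)))
print(len(train_set), len(test_set), len(folds), len(leave_one_out))
```

A secondary (generalization) test set would come from outside `patients` altogether, for example from another institution, and would not be touched by either function.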
Figure 5. Stepwise considerations for development and validation of the machine learning (ML) model. The figure describes a very general stepwise approach for development and validation of an ML model. Common metrics used in each step are shown on the right. Step 1 involves assessing the quality and accessibility of the data, followed by step 2 that requires method validation to identify optimal ML model(s). Once optimal ML models have been identified, step 3 involves determining their ability to work with other data sets to assess generalizability. Finally, step 4 involves evaluating the data in more “real-world” conditions to further assess performance and generalizability along with further refinement (go back to step 2) to improve the performance and desirable outcomes.