| Literature DB >> 31864346 |
Shahadat Uddin1, Arif Khan2,3, Md Ekramul Hossain2, Mohammad Ali Moni4.
Abstract
BACKGROUND: Supervised machine learning algorithms have been a dominant method in the data mining field. Disease prediction using health data has recently emerged as a potential application area for these methods. This study aims to identify the key trends among different types of supervised machine learning algorithms, and their performance and usage for disease risk prediction.
Keywords: Disease prediction; Machine learning; Medical data; Supervised machine learning algorithm
Year: 2019 PMID: 31864346 PMCID: PMC6925840 DOI: 10.1186/s12911-019-1004-8
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 An illustration of how supervised machine learning algorithms work to categorise diabetic and non-diabetic patients based on abstract data
Fig. 2 A simplified illustration of how the support vector machine works. The SVM has identified a hyperplane (a line in this two-dimensional example) which maximises the separation between the ‘star’ and ‘circle’ classes
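To make the idea in Fig. 2 concrete, the following is a minimal sketch (not taken from the paper) of fitting a linear SVM to synthetic two-class data with scikit-learn; the dataset and parameter values are illustrative assumptions.

```python
# Illustrative sketch (not from the paper): a linear SVM finds the
# maximum-margin hyperplane separating two synthetic classes, as in Fig. 2.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # two well-separated classes
svm = SVC(kernel="linear", C=1.0)  # linear kernel: the hyperplane is a line in 2-D
svm.fit(X, y)

print("hyperplane coefficients:", svm.coef_[0], "intercept:", svm.intercept_[0])
print("support vectors per class:", svm.n_support_)
```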
Fig. 3 An illustration of a decision tree. Each variable (C1, C2, and C3) is represented by a circle and the decision outcomes (Class A and Class B) are shown by rectangles. Each branch is labelled ‘True’ or ‘False’ according to the outcome of the test at its ancestor node; a sample is classified by following the branches from the root to a leaf
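A minimal sketch of the branching logic in Fig. 3, using scikit-learn's DecisionTreeClassifier on a built-in dataset (the dataset and the depth limit are assumptions made here for readability, not choices from the paper).

```python
# Illustrative sketch (not from the paper): a shallow decision tree whose
# printed rules show the True/False branching described in Fig. 3.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # small depth keeps the tree readable
tree.fit(X, y)
print(export_text(tree))  # each internal node tests one variable; each leaf assigns a class
```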
Fig. 4 An illustration of a Random forest which consists of three different decision trees. Each of those three decision trees was trained using a random subset of the training data
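The ensemble structure in Fig. 4 can be sketched as follows (again an assumption-laden illustration, not the paper's code): three trees, each fitted on a bootstrap sample of the training data, voting on the final class.

```python
# Illustrative sketch (not from the paper): a random forest of three trees,
# each trained on a bootstrap (random) sample of the training data, as in Fig. 4.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=3, bootstrap=True, random_state=0)
forest.fit(X, y)

for i, member in enumerate(forest.estimators_):        # the constituent decision trees
    print(f"tree {i}: depth = {member.get_depth()}")
print("class predicted by majority vote:", forest.predict(X[:1])[0])
```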
Fig. 5 An illustration of the Naïve Bayes algorithm. The ‘white’ circle is the new sample instance which needs to be classified into either the ‘red’ class or the ‘green’ class
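A hedged sketch of the classification step in Fig. 5: Gaussian Naïve Bayes assigns the unlabelled sample to the class with the higher posterior probability. The coordinates below are invented purely for illustration.

```python
# Illustrative sketch (not from the paper): Naive Bayes assigns the new
# 'white' sample of Fig. 5 to the class with the higher posterior probability.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 1.2], [1.1, 0.9], [0.8, 1.0],     # class 0 ('red')
              [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]])    # class 1 ('green')
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB().fit(X, y)
new_sample = np.array([[2.6, 2.7]])                   # the unlabelled 'white' circle
print("posterior P(class | sample):", nb.predict_proba(new_sample)[0])
print("assigned class:", nb.predict(new_sample)[0])
```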
Fig. 6A simplified illustration of the K-nearest neighbour algorithm. When K = 3, the sample object (‘star’) is classified as ‘black’ since it gets more ‘vote’ from the ‘black’ class. However, for K = 5 the same sample object is classified as ‘red’ since it now gets more ‘vote’ from the ‘red’ class
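The sensitivity to K described in Fig. 6 can be reproduced with a tiny hand-made dataset; the coordinates below are assumptions chosen so that K = 3 and K = 5 give different answers for the same query point.

```python
# Illustrative sketch (not from the paper): the same query point is classified
# differently for K = 3 and K = 5, mirroring Fig. 6.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 0 = 'black' class, 1 = 'red' class; the query ('star') sits at the origin
X = np.array([[0.1, 0.0], [0.3, 0.0],                 # black
              [0.0, 0.2], [0.0, 0.4], [0.5, 0.0]])    # red
y = np.array([0, 0, 1, 1, 1])
query = np.array([[0.0, 0.0]])

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"K={k}: predicted class {knn.predict(query)[0]}")  # class 0 for K=3, class 1 for K=5
```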
Fig. 7 An illustration of the artificial neural network structure with two hidden layers. The arrows connect the output of nodes from one layer to the input of nodes of another layer
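A minimal sketch of a network with two hidden layers, as in Fig. 7, using scikit-learn's MLPClassifier; the dataset, layer sizes, and iteration budget are illustrative assumptions. Inputs are standardised because, as noted later in the limitations table, ANN predictors require pre-processing.

```python
# Illustrative sketch (not from the paper): a feed-forward network with two
# hidden layers (16 and 8 nodes), as in Fig. 7.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ann = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=0))
ann.fit(X_train, y_train)
print("held-out accuracy:", ann.score(X_test, y_test))
```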
Fig. 8 Number of articles published in different years
Fig. 9 The overall data collection procedure. It also shows the number of articles considered for each disease
Fig. 10 Composition of initially selected 329 articles with respect to the seven supervised learning algorithms
Fig. 11 (a) The basic framework of the confusion matrix; and (b) a presentation of the ROC curve
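The evaluation quantities behind Fig. 11 can be computed directly; the sketch below (an assumption-based example, not the paper's pipeline) builds the confusion matrix of panel (a) and the AUC summarising the ROC curve of panel (b) for one classifier.

```python
# Illustrative sketch (not from the paper): confusion matrix and AUC for a
# logistic regression classifier on a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
y_pred = clf.predict(X_test)                 # hard labels -> confusion matrix
y_score = clf.predict_proba(X_test)[:, 1]    # class probabilities -> ROC / AUC

print(confusion_matrix(y_test, y_pred))      # rows: true class, columns: predicted class
print("AUC:", roc_auc_score(y_test, y_score))
```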
Summary of all references
| Reference | Disease predicted | Algorithms compared | Type of data | Number of subjects | Cross validation method | Prediction performance | Best one(s) |
|---|---|---|---|---|---|---|---|
| Aneja and Lal [ | Asthma | ANN, NB | Disease symptom | 1024 | – | Accuracy (ANN = 85, NB = 88) | NB |
| Ayer et al. [ | Breast cancer | ANN, LR | Clinical and demographic data | 62,219 | 10-fold cross validation | AUC (ANN = 0.965, LR = 0.963) | ANN |
| Ahmad et al. [ | Breast cancer | ANN, DT, SVM | Clinical data for cancer incidence and survival | 1189 | 10-fold cross validation | Accuracy (ANN = 0.947, DT = 0.936, SVM = 0.957) Sensitivity (ANN = 0.956, DT = 0.958, SVM = 0.971) Specificity (ANN = 0.928, DT = 0.907, SVM = 0.945) | SVM |
| Lundin et al. [ | Breast cancer | ANN, LR | Clinical and demographic data | 951 | – | AUC (ANN = 0.909, LR = 0.897) | ANN |
| Delen et al. [ | Breast cancer | ANN, DT, LR | Clinical and demographic data | 202,932 | 10-fold cross validation | Accuracy (ANN = 0.909, DT = 0.935, LR = 0.894) | DT |
| Yao et al. [ | Breast cancer | DT, RF, SVM | Image data | 569 | 10-fold cross validation | Accuracy (DT = 0.932, RF = 0.963, SVM = 0.959) | RF |
| Chen et al. [ | Cerebral infarction | DT, KNN, NB | Electronic health records, medical image and gene data | 31,919 | 10-fold cross validation | AUC (DT = 0.646, KNN = 0.454, NB = 0.495) | DT |
| Cai et al. [ | Diabetes | LR, NB, SVM | Gut microbiota | 489 | 10-fold cross validation | AUC (LR = 0.98, NB = 0.94, SVM = 0.99) | SVM |
| Malik et al. [ | Diabetes | ANN, LR, SVM | Electrochemical measurements of saliva | 175 | 3-fold cross validation | Accuracy (ANN = 80.70, LR = 75.86, SVM = 84.09) F1 score (ANN = 80.20, LR = 75.71, SVM = 84.06) | SVM |
| Farran [ | Diabetes | KNN, LR, SVM | Demographic, anthropometric, vital signs, diagnostic and clinical lab measurement data | 10,632 | 5-fold cross validation | Accuracy (KNN = 79.5, LR = 80.7, SVM = 82.6) | SVM |
| Mani et al. [ | Diabetes | KNN, LR, NB, RF, SVM | Demographic and clinical test result | 2280 | 5-fold cross validation | AUC (KNN = 0.721, LR = 0.755, NB = 0.762, RF = 0.803, SVM = 0.749) | RF |
| Tapak et al. [ | Diabetes | ANN, LR, RF, SVM | Demographic, anthropometric, diagnostic and clinical lab measurement data | 6500 | 10-fold cross validation | Accuracy (ANN = 0.931, LR = 0.935, RF = 0.930, SVM = 0.986) AUC (ANN = 0.751, LR = 0.763, RF = 0.717, SVM = 0.979) | SVM |
| Sisodia and Sisodia [ | Diabetes | DT, NB, SVM | Clinical test result | 768 | 10-fold cross validation | Accuracy (DT = 0.738, NB = 0.763, SVM = 0.651) | NB |
| Yang et al. [ | Diabetes | RF, SVM | Clinical and gene expression data | 9343 | 10-fold cross validation | Accuracy (RF = 0.742, SVM = 0.723) | RF |
| Juhola et al. [ | Heart disease | KNN, RF, SVM | Signal data | – | – | Accuracy (KNN = 84.5, RF = 87.6, SVM = 87.1) | RF |
| Long et al. [ | Heart disease | ANN, NB, SVM | Clinical, demographic and image data | 537 | – | Accuracy (ANN = 77.8, NB = 83.3, SVM = 75.9) | NB |
| Palaniappan and Awang [ | Heart disease | ANN, DT, NB | Clinical and demographic data | 909 | 2-fold cross validation | Accuracy (ANN = 85.682, DT = 78.8334, NB = 87.885) | NB |
| Jin et al. [ | Heart disease | LR, RF | Electronic health records | 20,000 | 5-fold cross validation | AUC (LR = 0.663, RF = 0.627) | LR |
| Puyalnithi and Viswanatham [ | Heart disease | DT, NB, RF, SVM | Clinical and demographic data | 746 | k-fold and leave-one-out | AUC (DT = 0.940, NB = 0.942, RF = 0.917, SVM = 0.731) | NB |
| Forssen et al. [ | Heart disease | LR, RF | Metabolomic data | 3409 | 50-fold cross validation | Accuracy (LR = 0.767, RF = 0.732) AUC (LR = 0.765, RF = 0.711) | LR |
| Tang et al. [ | Heart disease | ANN, LR | Clinical, demographic, behavioural and medical data | 2092 | – | AUC (ANN = 0.762, LR = 0.758) Accuracy (ANN = 0.714, LR = 0.698) | ANN |
| Toshniwal et al. [ | Heart disease | NB, RF, SVM | Electrocardiography data | 47 | – | Accuracy (NB = 88.44, RF = 98.49, SVM = 98.41) | RF |
| Alonso et al. [ | Heart disease | LR, SVM | Clinical data | 8321 | 5-fold cross validation | AUC (LR = 0.76 and SVM = 0.83) | SVM |
| Mustaqeem et al. [ | Heart disease | KNN, NB, RF, SVM | Electrocardiography data | 452 | 10-fold cross validation | Accuracy (KNN = 76.60, NB = 74.43, RF = 76.50, SVM = 74.47) | KNN |
| Mansoor et al. [ | Heart disease | LR, RF | Demographic and hospital admission | 9637 | 10-fold cross validation | Accuracy (LR = 0.88, RF = 0.89) | RF |
| Kim et al. [ | Heart disease | ANN, DT, LR, SVM | Demographic, behavioural and disease data | 748 | – | AUC (ANN = 0.663, DT = 0.631, LR = 0.658, SVM = 0.664) | SVM |
| Kim et al. [ | Heart disease | ANN, LR | Demographic, behavioural and disease data | 4146 | – | Accuracy (ANN = 87.04, LR = 86.11) | ANN |
| Taslimitehrani et al. [ | Heart disease | DT, LR, RF, SVM | Electronic health records | 119,749 | 2-fold cross validation | AUC (DT = 0.66, LR = 0.81, RF = 0.80, SVM = 0.59) | LR |
| Anbarasi et al. [ | Heart disease | DT, NB | Clinical and demographic data | 909 | k-fold cross validation | Accuracy (DT = 99.2%, NB = 96.5%) | DT |
| Bhatla and Jyoti [ | Heart disease | ANN, DT, NB | Clinical data | 3000 | 10-fold cross validation | Accuracy (ANN = 85.53%, DT = 89%, NB = 86.53%) | DT |
| Thenmozhi and Deepika [ | Heart disease | ANN, DT, NB | Clinical data and medical diagnostic data | – | 10-fold cross validation | Accuracy (ANN = 99.25, DT = 96.66, NB = 94.44) | ANN |
| Tamilarasi and Porkodi [ | Heart disease | ANN, KNN, NB | Clinical and demographic data | – | – | Accuracy (ANN = 99.25, KNN = 100, NB = 85.92) | KNN |
| Marikani and Shyamala [ | Heart disease | DT, KNN, NB, RF, SVM | Clinical and demographic data | 303 | – | Accuracy (DT = 0.954, KNN = 0.757, NB = 0.817, RF = 0.963, SVM = 1.0) | SVM |
| Lu et al. [ | Heart disease | ANN, NB, SVM | Clinical, demographic and diagnostic data | 1090 | – | Accuracy (ANN = 86.04, NB = 82.31, SVM = 86.62) | SVM |
| Khateeb and Usman [ | Heart disease | DT, KNN, NB | Clinical and demographic data | 303 | 10-fold cross validation | Accuracy (DT = 76.89, KNN = 79.20, NB = 66.66) | KNN |
| Patel et al. [ | Heart disease | DT, NB | Clinical and demographic data | – | – | Accuracy (DT = 99.2, NB = 96.5) | DT |
| Venkatalakshmi and Shivsankar [ | Heart disease | DT, NB | Clinical and demographic data | 294 | – | Accuracy (DT = 84.01, NB = 85.03) | DT |
| Borah et al. [ | Hemoglobin variants | DT, KNN, LR, RF, SVM | Clinical and demographic data | 1500 | – | DT and RF (Precision = 93.84, Recall = 92.78, F1 score = 93.33) Precision (KNN = 92.23, LR = 89.23, SVM = 66.67) Recall (KNN = 91.67, LR = 87.34, SVM = 64.78) F1 score (KNN = 91.95, LR = 88.27, SVM = 65.71) | DT, RF |
| Farran [ | Hypertension | KNN, LR, SVM | Demographic, anthropometric, vital signs, diagnostic and clinical lab measurement data | 10,632 | 5-fold cross validation | Accuracy (KNN = 82.4, LR = 82.1, SVM = 83) | SVM |
| Ani et al. [ | Kidney disease | ANN, DT, KNN, NB | Clinical and demographic data | 400 | 10-fold cross validation | Accuracy (ANN = 81, DT = 93, KNN = 90, NB = 78) | DT |
| Islam et al. [ | Liver disease | ANN, LR, RF, SVM | Clinical, demographic and ultrasonography test data | 994 | 10-fold cross validation | Accuracy (ANN = 0.691, LR = 0.707, RF = 0.658, SVM = 0.690) AUC (ANN = 0.733, LR = 0.763, RF = 0.708, SVM = 0.657) | LR |
| Lynch et al. [ | Lung cancer | DT, RF, SVM | Clinical and demographic data | – | 10-fold cross validation | Root mean square error (DT = 15.81, RF = 15.63, SVM = 15.82) | RF |
| Chen et al. [ | microRNA | RF, SVM | microRNA data | 96,325 | 5-fold cross validation | Accuracy (RF = 75.24, SVM = 70.02) | RF |
| Eskidere et al. [ | Parkinson’s disease | ANN, SVM | Voice recording and demographic data | 42 | 10-fold cross validation | Mean absolute error (SVM = 6.99, ANN = 8.20) | SVM |
| Chen et al. [ | Parkinson’s disease | KNN, SVM | Voice recording and demographic data | 31 | 10-fold cross validation | Accuracy (KNN = 95.78, SVM = 93.52) AUC (KNN = 95.60, SVM = 91.12) | KNN |
| Behroozi and Sami [ | Parkinson’s disease | KNN, NB, SVM | Voice recording and demographic data | 40 | Leave-one-out | Accuracy (KNN = 77.50, NB = 80.00, SVM = 87.50) | SVM |
| Hussain et al. [ | Prostate cancer | DT, NB, SVM | Magnetic resonance imaging data | 20 | 10-fold cross validation | AUC (DT = 0.955, NB = 0.989, SVM = 0.997) | SVM |
| Zupan et al. [ | Prostate cancer | DT, NB | Clinical data | 2051 | 10-fold cross validation | Accuracy (NB = 70.80, DT = 68.80) | NB |
| Hung et al. [ | Stroke | ANN, LR, SVM | Electronic medical claim and demographic data | 798,611 | – | Accuracy (ANN = 0.873, LR = 0.866, SVM = 0.839) | ANN |
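Most of the studies summarised above compare algorithms by k-fold cross-validated accuracy or AUC. The following is a minimal sketch of that comparison workflow for the seven algorithms, using scikit-learn with a built-in dataset as a stand-in for the clinical data; every dataset and parameter choice here is an assumption for illustration, not a reproduction of any study in the table.

```python
# Illustrative sketch (not from the paper): 10-fold cross-validated AUC for the
# seven supervised algorithms compared throughout the review.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "ANN": make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
    "DT":  DecisionTreeClassifier(random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "LR":  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "NB":  GaussianNB(),
    "RF":  RandomForestClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=0)),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")  # one AUC per fold
    print(f"{name}: mean AUC = {auc.mean():.3f}")
```

The ranking produced by such a sketch depends entirely on the dataset used; it only illustrates the methodology, not the relative performance reported in the table.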
Advantages and limitations of different supervised machine learning algorithms
| Supervised algorithm | Advantages | Limitations |
|---|---|---|
| Artificial neural network (ANN) | - Can detect complex nonlinear relationships between dependent and independent variables. - Requires less formal statistical training. - Availability of multiple training algorithms. - Can be applied to both classification and regression problems. | - Has ‘black box’ characteristics: the user cannot access the exact decision-making process, so the results are difficult to interpret. - Computationally expensive to train the network for a complex classification problem. - Predictor or independent variables require pre-processing. |
| Decision tree (DT) | - Resultant classification tree is easier to understand and interpret. - Data preparation is easier. - Multiple data types such as numeric, nominal, categorical are supported. - Can generate robust classifiers and can be validated using statistical tests. | - Requires classes to be mutually exclusive. - Algorithm cannot branch if any attribute or variable value for a non-leaf node is missing. - Algorithm depends on the order of the attributes or variables. - Does not perform as well as some other classifiers (e.g., Artificial Neural Network) [ |
| K-nearest neighbour (KNN) | - Simple algorithm and can classify instances quickly. - Can handle noisy instances or instances with missing attribute values. - Can be used for classification and regression. | - Computationally expensive as the number of attributes increases. - Attributes are given equal importance, which can lead to poor classification performance. - Provide no information on which attributes are most effective in making a good classification. |
| Logistic regression (LR) | - Easy to implement and straightforward. - LR-based models can be updated easily. - Does not make any assumptions regarding the distribution of the independent variable(s). - It has a nice probabilistic interpretation of model parameters. | - Does not have good accuracy when input variables have complex relationships. - Does not consider the linear relationship between variables. - Key components of LR, the logit models, are vulnerable to overconfidence. - May overstate the prediction accuracy due to sampling bias. - Unless multinomial, generic LR can only classify outcomes that have two states (i.e., dichotomous). |
| Naïve Bayes (NB) | - Simple and very useful for large datasets. - Can be used for both binary and multi-class classification problems. - It requires less amount of training data. - It can make probabilistic predictions and can handle both continuous and discrete data. | - Classes must be mutually exclusive. - Presence of dependency between attributes negatively affects the classification performance. - It assumes the normal distribution of numeric attributes. |
| Random forest (RF) | - Lower variance and less overfitting of the training data compared to DT, since RF takes the average value from the outcomes of its constituent decision trees. - Empirically, this ensemble-based classifier performs better than its individual base classifiers, i.e., DTs. - Scales well for large datasets. - It can provide estimates of which variables or attributes are important in the classification. | - More complex and computationally expensive. - Number of base classifiers needs to be defined. - It favours those variables or attributes that can take a high number of different values in estimating variable importance. - Overfitting can occur easily. |
| Support vector machine (SVM) | - More robust compared to LR. - Can handle multiple feature spaces. - Less risk of overfitting. - Performs well in classifying semi-structured or unstructured data, such as texts, images etc. | - Computationally expensive for large and complex datasets. - Does not perform well if the data have noise. - The resultant model, weights and impact of variables are often difficult to understand. - Generic SVM cannot classify more than two classes unless extended. |
Comparison of usage frequency and accuracy of different supervised machine learning algorithms
| Supervised machine learning algorithms | Number of published articles that used this algorithm | Number of times this algorithm showed superior accuracy (%) |
|---|---|---|
| Artificial neural network (ANN) | 20 | 6 (30%) |
| Decision tree (DT) | 21 | 7 (33%) |
| K-nearest neighbour (KNN) | 13 | 4 (31%) |
| Logistic regression (LR) | 20 | 5 (25%) |
| Naïve Bayes (NB) | 23 | 7 (30%) |
| Random forest (RF) | 17 | 9 (53%) |
| Support vector machine (SVM) | 29 | 13 (41%) |
Comparison of the performance of different supervised machine learning algorithms based on different criteria
| Criteria | # articles meeting this criterion (%) | Algorithm(s) most often showing ‘superior’ accuracy | Next most frequent |
|---|---|---|---|
| Disease names that were frequently modelled | | | |
| Heart disease | 23 (48%) | NB, SVM (4 times, each) | ANN, DT, KNN, LR (3 times, each) |
| Diabetes | 7 (15%) | SVM (4 times) | RF (2 times) |
| Breast cancer | 5 (10%) | ANN (2 times) | DT, RF, SVM (1 time, each) |
| Parkinson’s disease | 3 (6%) | SVM (2 times) | KNN (1 time) |
| Type of the data that were used | | | |
| Clinical and demographic | 15 (30%) | DT (6 times) | ANN, KNN, NB, RF (2 times, each) |
| Other data types | 33 (66%) | SVM (12 times) | RF (7 times) |
| Validation method followed | | | |
| 10-fold cross validation | 21 (42%) | SVM (5 times) | DT, RF (4 times, each) |
| 5-fold cross validation | 6 (12%) | SVM (3 times) | RF (2 times) |
| Other method | 7 (14%) | LR, NB, SVM (2 times, each) | DT (1 time) |
| No validation method used | 16 (32%) | ANN (4 times) | DT, RF, SVM (3 times, each) |
Fig. 12 Illustration of the superior performance of the Support vector machine using ROC graphs (based on the data from Table 4): (a) for disease names that were modelled; and (b) for validation methods that were followed