Alice M Richardson, Brett A Lidbury.
Abstract
BACKGROUND: Data mining techniques such as support vector machines (SVMs) have been successfully used to predict outcomes for complex problems, including problems in human health. Much health data is imbalanced, with many more controls than positive cases.
Keywords: Analysis of variance; Hepatitis B; Hepatitis C; Machine learning; Random forests; Synthetic minority oversampling technique
Year: 2017 PMID: 28806936 PMCID: PMC5557531 DOI: 10.1186/s12911-017-0522-5
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Description of variables used in SVM analyses
| Variable abbreviation | Description and definition | Measurement units |
|---|---|---|
| Response variables | ||
| HBsAg | Hepatitis B Surface Antigen (marker of HBV infection) | Positive (1) or Negative (0) |
| HepC | Patient antibody to HCV, indicating contact with virus | |
| Explanatory variables | ||
| Age | Patient (case) Age | Years |
| Sex | Gender 1 = F, 2 = M | M or F |
| ALT | Alanine aminotransferase; an intracellular enzyme released after liver and other tissue cell damage | U/L |
| GGT | Gamma-glutamyl transpeptidase; an intracellular enzyme also relevant to liver damage | U/L |
| Hb | Haemoglobin | g/L |
| Hct | Haematocrit; formerly known as “packed cell volume” | % |
| Mch | Mean corpuscular haemoglobin | pg/RBC |
| MCHC | Mean corpuscular haemoglobin concentration | g/L |
| MCV | Mean corpuscular volume | fL |
| Plt | Platelets; an agent in blood clotting | × 10^9/L |
| WCC | White cell count | × 10^9/L |
| RCC | Red cell count | × 10^12/L |
| Crea | Creatinine; excreted by filtration through glomerulus and tubular section | μmol/L |
| K | Potassium; predominant intracellular cation whose plasma level is regulated by renal excretion | mmol/L |
| ALKP | Alkaline phosphatase; found in liver, bone and intestine | U/L |
| ALB | Albumin; major component of plasma proteins | g/L |
| TBil | Total Bilirubin levels are reflective of the rate that the body recycles the red cells in the blood; bilirubin is a breakdown product of old, spent red blood cells. | μmol/L |
| Sodium | Sodium; predominant extracellular cation | mmol/L |
| Urea | Blood urea; often used to detect kidney related infections. | mmol/L |
| RDW | Red cell distribution width | % |
| Neut | Neutrophils; white blood cells, elevated by bacterial infection and early viral infection | × 10^9/L |
| Lymph | Lymphocytes; white blood cells, elevated by viral infection and some cancers | × 10^9/L |
| Mono | Monocytes; white blood cells, elevated by infection, inflammation, and some cancers | × 10^9/L |
| Eos | Eosinophils; white blood cells, elevated by allergy and parasite infection | × 10^9/L |
| Bas | Basophils; white blood cells, elevated in hypersensitivity reactions | × 10^9/L |
Workflow for SVM analyses
| HBV | |
| Extract 9170 individuals with HBV recorded of which 172 positive, 8998 negative | |
| Split data into training (70%) and testing (30%) with 120 positive and 6300 negative in each split | |
| Either | Downsize the training data into 52 sets of 120 positive plus 120 negative |
| Or | SMOTE the training data at 400% oversampling and 100% under sampling leading to 52 sets of 3960 individuals with 1920 positive, 2040 negative |
| Or | Multiply downsize the training data into 11 sets of 120 positive and 120 negative |
| Then either | grow a random forest and pick the top five variables, apply SVM with the top five variables from the random forest |
| Or | proceed straight to SVM |
| HCV | |
| Extract 7820 individuals with HCV recorded with 533 positive, 7287 negative | |
| Split data into training (70%) and testing (30%) with 373 positive and 5100 negative in each split | |
| Either | Downsize the training data into 13 sets of 373 positive, 373 negative |
| Or | SMOTE the training data at 400% oversampling and 100% under sampling leading to 13 sets of 4797 individuals with 1492 positive, 1865 negative |
| Or | Multiply downsize the training data into 11 sets of 373 positive and 373 negative |
| Then either | grow a random forest and pick the top five variables, apply SVM with the top five variables from the random forest |
| Or | proceed straight to SVM |
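The set counts in the workflow (52 for HBV, 13 for HCV) are consistent with splitting the training negatives into disjoint chunks the same size as the positive group and pairing each chunk with all positives. The source does not state how the sets were drawn, so the following Python sketch is an assumption, and the function name `downsize_sets` and the integer record IDs are hypothetical.

```python
import random

def downsize_sets(positives, negatives, seed=0):
    """Pair all positives with disjoint, equally sized chunks of negatives."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = len(positives)
    n_sets = len(shuffled) // size  # leftover negatives are discarded
    return [
        list(positives) + shuffled[i * size:(i + 1) * size]
        for i in range(n_sets)
    ]

# Hypothetical record IDs; group sizes are the training counts from the workflow.
hbv_sets = downsize_sets(range(120), range(1000, 1000 + 6300))
hcv_sets = downsize_sets(range(373), range(1000, 1000 + 5100))
print(len(hbv_sets), len(hcv_sets))  # prints "52 13"
```

Each resulting set is balanced (120 + 120 for HBV, 373 + 373 for HCV), so a classifier trained on any one set sees equal class frequencies.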
Summary statistics for patient demographics
| Variable | HBV positive | HBV negative | p | HCV positive | HCV negative | p |
|---|---|---|---|---|---|---|
| Sex | 34% female | 47% female | 0.0008 (a) | 36% female | 45% female | 0.0001 (a) |
| Age, mean (s.d.) | 40.5 (13.9) | 45.2 (18.7) | 0.0001 (b) | 40.6 (14.4) | 47.1 (19.2) | <0.0001 (b) |
(a) = Two-sample test of proportions. (b) = Two-sample t test. HBV = Hepatitis B virus. HCV = Hepatitis C virus
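As an illustration of the two-sample test of proportions used for the sex comparison, here is a minimal sketch of a pooled z-test. It assumes the HBV group sizes from the workflow (172 positive, 8998 negative) and uses the rounded percentages from the table, so the result only approximates the reported p value. The function name is hypothetical, and the source does not say which variant of the test the authors ran.

```python
import math

def two_proportion_z_test(p1, n1, p2, n2):
    """Return (z, two-sided p) for a pooled two-sample test of proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Proportion female: HBV positive vs HBV negative (rounded table values).
z, p = two_proportion_z_test(0.34, 172, 0.47, 8998)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The p value from this sketch falls below 0.001, in line with the reported 0.0008.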
Sensitivity, precision and F scores by virus, balancing method and feature selection
| HBV mean (95% CI) | SMOTE | SMOTE RF | Downsize | Downsize RF | MDS | MDS RF |
| Fscore | 0.056 (0.054, 0.057) | 0.052 (0.050, 0.053) | 0.056 (0.054, 0.057) | 0.052 (0.050, 0.053) | 0.065 (0.061, 0.068) | 0.059 (0.055, 0.063) |
| Precision | 0.034 (0.032, 0.036) | 0.026 (0.025, 0.027) | 0.029 (0.028, 0.030) | 0.027 (0.026, 0.028) | 0.034 (0.032, 0.036) | 0.031 (0.029, 0.032) |
| Sensitivity | 0.625 (0.605, 0.645) | 0.611 (0.587, 0.634) | 0.625 (0.605, 0.645) | 0.611 (0.587, 0.634) | 0.246 (0.231, 0.260) | 0.675 (0.654, 0.680) |
| HCV mean (95% CI) | SMOTE | SMOTE RF | Downsize | Downsize RF | MDS | MDS RF |
| Fscore | 0.187 (0.179, 0.196) | 0.200 (0.196, 0.202) | 0.174 (0.170, 0.178) | 0.208 (0.200, 0.215) | 0.192 (0.190, 0.195) | 0.225 (0.220, 0.229) |
| Precision | 0.134 (0.128, 0.140) | 0.117 (0.115, 0.119) | 0.103 (0.100, 0.105) | 0.124 (0.120, 0.129) | 0.115 (0.113, 0.117) | 0.138 (0.134, 0.141) |
| Sensitivity | 0.311 (0.296, 0.326) | 0.668 (0.654, 0.682) | 0.567 (0.545, 0.590) | 0.625 (0.600, 0.650) | 0.589 (0.579, 0.598) | 0.610 (0.596, 0.623) |
Downsize simple downsizing, Downsize RF Simple downsizing with random forest variable selection, MDS Multiple downsizing, MDS RF MDS with random forest variable selection, SMOTE Synthetic Minority Oversampling Technique, SMOTE RF SMOTE with random forest variable selection, HBV Hepatitis B virus, HCV Hepatitis C virus
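Assuming the F score here is the usual F1 measure (the harmonic mean of precision and sensitivity), the relationship between the three rows of the table can be sketched as follows. The function name `f_score` is illustrative.

```python
def f_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (the F1 score)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)

# Mean precision and sensitivity for HCV, taken from the table.
print(round(f_score(0.134, 0.311), 3))  # SMOTE  -> 0.187
print(round(f_score(0.138, 0.610), 3))  # MDS RF -> 0.225
```

Both values reproduce the tabulated HCV F scores. Because the table reports means over several training sets, the F score of the mean precision and mean sensitivity need not equal the mean F score in every column.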
Analysis of variance of F score, precision and sensitivity by balancing method and feature selection for HBV
| Precision source | SS | df | MS | F | p |
| Method | 0.0004 | 2 | 0.0002 | 13.088 | 0.000 (a) |
| Pre-processing | 0.0013 | 1 | 0.0013 | 80.504 | 0.000 (a) |
| Method.Pre-processing | 0.0004 | 2 | 0.0002 | 12.222 | 0.000 (a) |
| Sensitivity Source | SS | df | MS | F | p |
| Method | 1.6151 | 2 | 0.8075 | 159.98 | 0.000 (a) |
| Pre-processing | 1.9877 | 1 | 1.9877 | 393.78 | 0.000 (a) |
| Method.Pre-processing | 2.8062 | 2 | 1.4031 | 277.97 | 0.000 (a) |
| F score Source | SS | df | MS | F | p |
| Method | 0.0011 | 2 | 0.0006 | 10.838 | 0.000 (a) |
| Pre-processing | 0.0025 | 1 | 0.0025 | 47.154 | 0.000 (a) |
| Method.Pre-processing | 0.0003 | 2 | 0.0002 | 3.007 | 0.052 |
(a) = Significant at 0.0025 level with adjustment for multiple testing. Method = simple downsizing, multiple downsizing or SMOTE. Pre-processing = random forest variable selection or not. Method.Pre-processing = the interaction between Pre-processing and Method
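The columns of an ANOVA table are linked by MS = SS / df and F = MS / (residual MS). The residual row is not shown in the table, but it can be recovered from any row as MS / F. The sketch below applies this to the HBV sensitivity rows; the helper name `anova_row` is hypothetical.

```python
def anova_row(ss, df, f_stat):
    """Return (mean square, implied residual mean square) for one ANOVA row."""
    ms = ss / df
    return ms, ms / f_stat

# (SS, df, F) for the HBV sensitivity rows of the table above.
rows = {
    "Method": (1.6151, 2, 159.98),
    "Pre-processing": (1.9877, 1, 393.78),
    "Method.Pre-processing": (2.8062, 2, 277.97),
}
for name, (ss, df, f_stat) in rows.items():
    ms, mse = anova_row(ss, df, f_stat)
    print(f"{name}: MS = {ms:.4f}, implied residual MS = {mse:.5f}")
```

All three rows imply the same residual mean square to within rounding, which is a quick internal consistency check on a table of this kind.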
Analysis of variance of F score, precision and sensitivity by balancing method and feature selection for HCV
| Precision source | SS | df | MS | F | p |
| Method | 0.0025 | 2 | 0.0013 | 32.843 | 0.000 (a) |
| Pre-processing | 0.0011 | 1 | 0.0011 | 28.713 | 0.000 (a) |
| Method.Pre-processing | 0.0064 | 2 | 0.0032 | 84.402 | 0.000 (a) |
| Sensitivity Source | SS | df | MS | F | p |
| Method | 0.1940 | 2 | 0.0970 | 114.86 | 0.000 (a) |
| Pre-processing | 0.4375 | 1 | 0.4375 | 518.21 | 0.000 (a) |
| Method.Pre-processing | 0.4162 | 2 | 0.2081 | 246.46 | 0.000 (a) |
| F score Source | SS | df | MS | F | p |
| Method | 0.0041 | 2 | 0.0021 | 25.546 | 0.000 (a) |
| Pre-processing | 0.0114 | 1 | 0.0114 | 141.771 | 0.000 (a) |
| Method.Pre-processing | 0.0019 | 2 | 0.0010 | 11.844 | 0.000 (a) |
(a) = Significant at 0.0025 level with adjustment for multiple testing. Method = simple downsizing, multiple downsizing or SMOTE. Pre-processing = random forest variable selection or not. Method.Pre-processing = the interaction between Pre-processing and Method
Fig. 1 Receiver-operator characteristic (ROC) curves for (a) HBV and (b) HCV, summarising the six models under consideration for improving prediction for imbalanced data. The models are: Downsize = simple downsizing; Downsize RF = simple downsizing with random forest variable selection; MDS = multiple downsizing; MDS RF = multiple downsizing with random forest variable selection; SMOTE = Synthetic Minority Oversampling; SMOTE RF = Synthetic Minority Oversampling with random forest variable selection
Fig. 2 (a) Hepatitis B virus (HBV) and (b) Hepatitis C virus (HCV) SVM plots post SMOTE to overcome HBV/HCV immunoassay class imbalance, and random forest to identify the top three predictors of HBV or HCV positive/negative immunoassay class. For HBV and HCV SVM visualisation, the SVM was sliced at ALT equal to 35 IU/L
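The SMOTE step named in the workflow and figure captions creates synthetic minority cases by interpolating between a minority case and one of its nearest minority neighbours. The following is a minimal pure-Python sketch of that idea, not the implementation the authors used. At 400% oversampling, four synthetic points are generated per original case. The function name, the choice of k = 5 neighbours and the toy feature values are assumptions for illustration.

```python
import math
import random

def smote(minority, n_per_sample=4, k=5, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x by Euclidean distance, excluding x
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy minority class with two hypothetical features (for example ALT and age).
positives = [(35.0, 40.0), (60.0, 52.0), (48.0, 33.0),
             (90.0, 61.0), (41.0, 45.0), (72.0, 38.0)]
new_points = smote(positives, n_per_sample=4)
print(len(new_points))  # prints 24: four synthetic points per original case
```

Because every synthetic point lies on a line segment between two real minority cases, the new cases stay within the range of the observed minority data.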