Alice M Richardson, Brett A Lidbury.
Abstract
BACKGROUND: Advanced data mining techniques such as decision trees have been successfully used to predict a variety of outcomes in complex medical environments. Furthermore, previous research has shown that combining the results of a set of individually trained trees into an ensemble-based classifier can improve overall classification accuracy. This paper investigates the effect of data pre-processing, the use of ensembles constructed by bagging, and a simple majority vote to combine classification predictions from routine pathology laboratory data, particularly to overcome a large imbalance between negative Hepatitis B virus (HBV) and Hepatitis C virus (HCV) cases and HBV or HCV immunoassay positive cases. These methods were illustrated using a previously unanalysed data set from ACT Pathology (Canberra, Australia) relating to HBV and HCV patients.
Year: 2013 PMID: 23800244 PMCID: PMC3697984 DOI: 10.1186/1471-2105-14-206
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
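The abstract describes ensembles built by bagging and combined by a simple majority vote. The sketch below illustrates that general idea only; it is not the authors' implementation. The one-feature threshold "stump" is a hypothetical stand-in for a full decision tree, and the function names, the number of models, and the tie-breaking rule are all assumptions made for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bagging replicate)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Hypothetical stand-in for a decision tree: one threshold on one feature.

    rows are (value, label) pairs with label 1 = positive, 0 = negative.
    The threshold is the midpoint between the two class means.
    """
    pos = [v for v, y in rows if y == 1]
    neg = [v for v, y in rows if y == 0]
    if not pos or not neg:
        # The replicate contains a single class, so always predict it.
        only = 1 if pos else 0
        return lambda v: only
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    high_is_positive = mean_pos >= mean_neg
    return lambda v: int((v > threshold) == high_is_positive)


def majority_vote(votes):
    """Return 1 if positives outnumber negatives; ties go to 0 (an assumption)."""
    counts = Counter(votes)
    return 1 if counts[1] > counts[0] else 0


def bagged_predict(rows, value, n_models=25, seed=0):
    """Train n_models stumps on bootstrap replicates and combine their votes."""
    rng = random.Random(seed)
    models = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_models)]
    return majority_vote([model(value) for model in models])
```

With heavily imbalanced classes, many bootstrap replicates may contain few or no positive cases, which is one reason the class balance of each replicate matters in practice.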
Description of response and explanatory variables subjected to decision tree analyses
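| Response variable | Description | Values |
| HBSA | Hepatitis B Surface Antigen (marker of HBV infection) | Positive (1) or Negative (0) |
| HepC | Patient antibody to HCV, indicating contact with virus | Positive (1) or Negative (0) |
| Explanatory variable | Description | Units |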
| Age | Patient (case) Age | Years |
| Sex | Gender: 1 = F, 2 = M | M or F |
| ALT | Alanine aminotransferase (an intracellular enzyme released after liver and other tissue cell damage) | U/L |
| GGT | Gamma-glutamyltranspeptidase (An intracellular enzyme also relevant to liver damage) | U/L |
| Hb | Haemoglobin | g/L |
| Hct | Haematocrit (formerly known as “packed cell volume”) | % |
| Mch | Mean corpuscular haemoglobin | pg/RBC |
| MCHC | Mean corpuscular haemoglobin concentration | g/L |
| MCV | Mean corpuscular volume | fL |
| Plt | Platelets (blood clotting) | × 10^9/L |
| WCC | White cell count | × 10^9/L |
| RCC | Red cell count | × 10^12/L |
| RDW | Red cell distribution width | % |
| Neut | Neutrophil. White blood cell, elevated by bacterial infection and early viral infection | × 10^9/L |
| Lymph | Lymphocyte. White blood cell, elevated by viral infection and some cancers | × 10^9/L |
| Mono | Monocyte. White blood cell, elevated by infection, inflammation, some cancers. | × 10^9/L |
| Eos | Eosinophil. White blood cell, elevated by allergy and parasite infection | × 10^9/L |
| Bas | Basophils. White blood cell, elevated in hypersensitivity reactions. | × 10^9/L |
U/L = units per litre; g/L = grams per litre; pg = picograms; fL = femtolitres; RBC = red blood cell.
Specificity and sensitivity (%) of HBV and HCV immunoassay outcome prediction after single decision tree analysis
| HBSA specificity | 98.38 | 89.57 | 45.65 |
| HBSA sensitivity | 4.6 | 16.92 | 64.62 |
| HepC specificity | 99.17 | 83.03 | 65.88 |
| HepC sensitivity | 32.35 | 35.29 | 65.89 |
See Software Methods for sensitivity and specificity calculations, and Phase 1 Methods for descriptions of the decision tree analyses.
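The Software Methods section is not reproduced in this record, so the sketch below uses the conventional definitions: sensitivity is the percentage of truly positive cases predicted positive, and specificity is the percentage of truly negative cases predicted negative. The function name and the 1/0 label coding (taken from the variable table, Positive = 1, Negative = 0) are assumptions for illustration.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity %, specificity %) for 1/0 coded outcomes."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, specificity
```

A classifier that labels almost every case negative scores high specificity and low sensitivity, which is the pattern in the first column of the single decision tree results above.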
Figure 1. A representative decision tree, taken from the matched single analysis and featuring popular explanatory variables associated with the HBSA response variable.
Figure 2. Weighted importance for leading explanatory variables strongly linked to a positive HBSA immunoassay result. Variable importance was calculated as the number of times a variable appeared in testing-phase decision trees. The depth-in-decision-tree weighting gives predictor variables at the top of the tree the highest importance, with lower nodes contributing a lower weighting because of their lower position in the hierarchy.
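The caption for Figure 2 describes two importance measures: a raw count of appearances and a depth-weighted version. A minimal sketch of both follows. The caption does not give the weighting formula, so the `1 / (depth + 1)` weight, the function name, and the representation of each tree as a list of `(variable, depth)` pairs are hypothetical choices; the example trees in the usage note are invented, not results from the paper.

```python
from collections import defaultdict


def weighted_importance(trees, weight=lambda depth: 1.0 / (depth + 1)):
    """Count variable appearances and accumulate a depth-based weight.

    trees: list of trees, each a list of (variable, depth) pairs,
           where depth 0 is the root split.
    weight: assumed weighting function; shallower splits weigh more.
    """
    counts = defaultdict(int)
    weighted = defaultdict(float)
    for tree in trees:
        for variable, depth in tree:
            counts[variable] += 1
            weighted[variable] += weight(depth)
    return dict(counts), dict(weighted)
```

For example, with two invented trees `[("ALT", 0), ("Age", 1)]` and `[("ALT", 0), ("Age", 2)]`, both variables appear twice, but ALT accumulates the larger weighted score because it splits at the root each time.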
Specificity and sensitivity (%) of HBV and HCV immunoassay outcome prediction after decision tree ensemble analyses
| (a) Basic multiple | Raw | Scale | Log | Scale-log |
| HBSA specificity | 53.91 | 54.46 | 54.41 | 54.41 |
| HBSA sensitivity | 62.22 | 59.82 | 59.82 | 59.82 |
| HepC specificity | 57.75 | 57.65 | 57.77 | 57.66 |
| HepC sensitivity | 63.19 | 63.45 | 63.08 | 63.31 |
| (b) Majority multiple | Raw | Scale | Log | Scale-log |
| HBSA specificity | 68.57 | 68.82 | 68.80 | 68.57 |
| HBSA sensitivity | 46.83 | 46.91 | 46.83 | 46.83 |
| HepC specificity | 58.87 | 58.91 | 58.88 | 58.87 |
| HepC sensitivity | 63.40 | 63.34 | 63.34 | 63.37 |
| (c) Clear negative | Raw | Scale | Log | Scale-log |
| HBSA specificity | 54.45 | 54.59 | 45.74 | 45.74 |
| HBSA sensitivity | 61.43 | 61.43 | 70.20 | 70.20 |
| HepC specificity | 35.04 | 34.87 | 36.90 | 36.88 |
| HepC sensitivity | 80.37 | 80.84 | 76.53 | 76.53 |
Methods employed were (a) basic multiple, (b) majority multiple and (c) clear negative analyses (see Methods). Prior to accuracy analysis, explanatory variables were subject to one of four pre-processing methods: none (raw), scaling, logging and scale-logging. Scaling sets the range of each explanatory variable to a common range of 0 – 100. Logging uses natural logarithm transformation. Scale-logging uses a common range of 0 – 100 then takes the natural logarithm.
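The four pre-processing options can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the `offset` parameter is an assumption added because scaling maps the minimum of each variable to 0, where the natural logarithm is undefined. The source does not say how that case was handled.

```python
import math


def scale_0_100(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant variable")
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def log_transform(values, offset=0.0):
    """Natural logarithm of each value; offset is an assumed guard against zeros."""
    return [math.log(v + offset) for v in values]


def scale_log(values, offset=1.0):
    """Scale to 0-100, then take the natural logarithm.

    offset=1.0 is an assumption so the scaled minimum (0) stays finite.
    """
    return log_transform(scale_0_100(values), offset)
```

Because decision tree splits depend only on the ordering of values within a variable, monotone transformations such as these would be expected to change results little, which is consistent with the near-identical columns in the table above.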
Analysis of variance of mean accuracy rates for a four-factor experiment
| Source | Sum of squares | df | Mean square | F | p-value |
| Method | 28.015 | 2 | 14.008 | 0.488 | 0.620 |
| Pre-processing | 0.967 | 3 | 0.322 | 0.011 | 0.998 |
| Virus | 44.815 | 1 | 44.815 | 1.560 | 0.224 |
| Outcome | 927.169 | 1 | 927.169 | 32.279 | < 0.001 (*) |
| Method.Outcome | 2909.082 | 2 | 1454.541 | 50.640 | < 0.001 (*) |
| Method.Pre-processing | 0.863 | 6 | 0.144 | 0.005 | 1.000 |
| Method.Virus | 42.649 | 2 | 21.324 | 0.742 | 0.487 |
| Pre-processing.Outcome | 8.436 | 3 | 2.812 | 0.098 | 0.960 |
| Virus.Outcome | 922.604 | 1 | 922.604 | 32.120 | < 0.001 (*) |
| Pre-processing.Virus | 0.301 | 3 | 0.100 | 0.003 | 1.000 |
The experiment examines interactions affecting the prediction of HBSA and HepC immunoassay outcome.
(*) = Significant at 0.001 level.
Method = basic single, basic multiple, majority multiple or clear negative.
Pre-processing = none, log, scale, or scale-log.
Virus = Hepatitis B or Hepatitis C.
Outcome = positive or negative.
Method.Outcome = the interaction between method and outcome; other interactions between pairs of variables to be interpreted similarly.
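The numeric columns of the analysis of variance table follow the usual arithmetic: each mean square is the sum of squares divided by its degrees of freedom, and each F ratio is the mean square divided by the residual mean square. The residual row is not shown in this record, so the sketch below back-calculates it from the reported values as a consistency check. The function names are hypothetical; the figures are copied from four rows of the table.

```python
# (sum of squares, degrees of freedom, mean square, F) as reported in the table
rows = {
    "Method": (28.015, 2, 14.008, 0.488),
    "Virus": (44.815, 1, 44.815, 1.560),
    "Outcome": (927.169, 1, 927.169, 32.279),
    "Method.Outcome": (2909.082, 2, 1454.541, 50.640),
}


def mean_square(sum_of_squares, df):
    """Mean square = sum of squares / degrees of freedom."""
    return sum_of_squares / df


def implied_residual_ms(ms, f_ratio):
    """Since F = MS / MS_residual, the residual mean square is MS / F."""
    return ms / f_ratio


for source, (ss, df, ms, f_ratio) in rows.items():
    print(source, round(mean_square(ss, df), 3), round(implied_residual_ms(ms, f_ratio), 2))
```

All four rows imply a residual mean square of roughly 28.7, so the reported F ratios are mutually consistent.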
Figure 3. Sensitivity and specificity for the best method of data pre-processing: negative viral infection. Sensitivity and specificity rates are shown for the scale method of pre-processing associated with negative hepatitis B virus (HBV) or hepatitis C virus (HCV) infection, including the two outcomes (positive or negative prediction). BM = basic multiple approach. MM = majority multiple approach.
Figure 4. Sensitivity and specificity for the best method of data pre-processing: positive viral infection. Sensitivity and specificity rates are shown for the scale method of pre-processing associated with positive hepatitis B virus (HBV) or hepatitis C virus (HCV) infection, including the two outcomes (positive or negative prediction). BM = basic multiple approach. MM = majority multiple approach.