Mattia Cf Prosperi, Susana Marinho, Angela Simpson, Adnan Custovic, Iain E Buchan.
Abstract
BACKGROUND: There is increasing recognition that asthma and eczema are heterogeneous diseases. We investigated the predictive ability of a spectrum of machine learning methods to disambiguate clinical sub-groups of asthma, wheeze and eczema, using a large heterogeneous set of attributes in an unselected population. The aim was to identify to what extent such heterogeneous information can be combined to reveal specific clinical manifestations.
Year: 2014 PMID: 25077568 PMCID: PMC4101570 DOI: 10.1186/1755-8794-7-S1-S7
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Study data
| variable | category | median (IQR) or n (%) | #missing (%) |
|---|---|---|---|
| age (years) | | 42.6 (39.7-45.7) | 0 (0%) |
| year of birth | | 1964 (1961-1967) | 0 (0%) |
| body mass index (BMI) | | 26 (23.6-29.1) | 2 (0.4%) |
| whole body impedance | | 613.5 (550-685) | 34 (6.1%) |
| fat % | | 29.5 (23.7-36) | 34 (6.1%) |
| exhaled nitric oxide (eNO), ppb (natural log scale) | | 2.8 (2.4-3.3) | 94 (17.0%) |
| specific airway resistance (sRaw), kPa/s (natural log scale) | | -0.1 (-0.3-0.1) | 11 (2.0%) |
| peak expiratory flow (PEF) % predicted | | 113.1 (102.1-124.6) | 9 (1.6%) |
| forced vital capacity (FVC) % predicted | | 114.5 (105.6-123.1) | 11 (2.0%) |
| forced expiratory volume in 1 second (FEV1) % predicted | | 106 (98.6-115.5) | 10 (1.8%) |
| forced expiratory flow (FEF25-75) % predicted | | 80 (66-96.3) | 11 (2.0%) |
| total lung capacity (TLC) | | 108.5 (101-116.9) | 11 (2.0%) |
| residual volume (RV) | | 113 (98.6-127.8) | 11 (2.0%) |
| FEV1/FVC ratio | | 0.8 (0.8-0.8) | 11 (2.0%) |
| provocative concentration of methacholine needed to produce a 20% fall in FEV1 (PC20), of those completing the test | | 5.3 (1.2-9.0) | 43 (7.8%) |
| methacholine dose-response slope (MDRS), transformed as 100/(MDRS+10) | | 5.7 (4.2-7.5) | 43 (7.8%) |
| gender | male | 234 (42.2%) | 0 (0%) |
| smoking status | never | 341 (61.6%) | 0 (0%) |
| | ex-smoker | 144 (26%) | 0 (0%) |
| | current | 69 (12.5%) | 0 (0%) |
| cat/dog ownership | | 186 (33.6%) | 1 (0.2%) |
| allergen sensitisation by skin prick test (SPT) | dust mite (mean wheal diameter >3 mm) | 162 (29.3%) | 1 (0.2%) |
| | cat (mean wheal diameter >3 mm) | 106 (19.2%) | 1 (0.2%) |
| | dog (mean wheal diameter >3 mm) | 48 (8.6%) | 1 (0.2%) |
| | tree (mean wheal diameter >3 mm) | 76 (13.8%) | 1 (0.2%) |
| | grass (mean wheal diameter >3 mm) | 129 (23.4%) | 1 (0.2%) |
| | mould (mean wheal diameter >3 mm) | 16 (2.9%) | 1 (0.2%) |
| | peanut (mean wheal diameter >3 mm) | 9 (1.7%) | 1 (0.2%) |
| bird ownership | | 13 (2.4%) | 1 (0.2%) |
| medications in the past three months | short-acting beta agonists (SABA) | 34 (6.1%) | 1 (0.2%) |
| | inhaled corticosteroids (ICS) or ICS/long-acting beta agonists (LABA) | 37 (6.7%) | 1 (0.2%) |
| illness or problem caused by eating a particular food or foods, ever | | 97 (17.5%) | 1 (0.2%) |
| accident at home, work or elsewhere exposing to high levels of vapours, gas or dust | | 22 (4%) | 2 (0.4%) |
| carpets in the house | | 292 (52.8%) | 1 (0.2%) |
| gas stove in the house | | 432 (78.1%) | 1 (0.2%) |
| electric stove in the house | | 231 (41.8%) | 1 (0.2%) |
| job causing wheezing problems | | 33 (6%) | 3 (0.5%) |
| proportion of subjects not completing PC20 | | 423 (82.8%) | 43 (7.8%) |
| proportion of subjects with current asthma (CA) | | 93 (16.7%) | 0 (0%) |
| proportion of subjects with level-2 asthma (A2) | | 70 (12.7%) | 0 (0%) |
| proportion of subjects with level-3 asthma (A3) | | 24 (4.3%) | 0 (0%) |
| proportion of subjects with current wheeze (CW) | | 68 (12.3%) | 0 (0%) |
| proportion of subjects with self-diagnosed eczema (SDE) | | 146 (26.3%) | 0 (0%) |
| proportion of subjects with doctor's diagnosed eczema (DDE) | | 120 (21.7%) | 0 (0%) |

| | SDE | DDE | CA | A2 | CW | A3 |
|---|---|---|---|---|---|---|
| SDE | - | 95.47% | 74.82% | 74.82% | 73.37% | 74.09% |
| DDE | 95.47% | - | 77.17% | 77.90% | 76.09% | 78.26% |
| CA | 74.82% | 77.17% | - | 95.29% | 89.49% | 87.68% |
| A2 | 74.82% | 77.90% | 95.29% | - | 92.39% | 91.67% |
| CW | 73.37% | 76.09% | 89.49% | 92.39% | - | 90.94% |
| A3 | 74.09% | 78.26% | 87.68% | 91.67% | 90.94% | - |
Characteristics of the study population (n = 554) and cross-tabulation of outcomes.
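Several rows of the table above are reported on transformed scales: eNO and sRaw on a natural log scale, and MDRS as 100/(MDRS+10). The following is a minimal sketch of how such values could be mapped back to their original units. The function names are illustrative and not taken from the study, and the sketch assumes the transformations are exactly as written in the table.

```python
import math

def transform_mdrs(mdrs):
    """Apply the tabulated transformation 100 / (MDRS + 10)."""
    if mdrs <= -10:
        raise ValueError("transform is only defined and positive for MDRS > -10")
    return 100.0 / (mdrs + 10.0)

def inverse_transform_mdrs(value):
    """Recover the raw slope from a transformed value (inverse of the above)."""
    if value <= 0:
        raise ValueError("transformed value must be positive")
    return 100.0 / value - 10.0

def back_transform_log(value):
    """Map a natural-log-scale value (e.g. eNO) back to its original unit."""
    return math.exp(value)

# Tabulated medians mapped back to raw units (illustrative arithmetic only)
print(round(inverse_transform_mdrs(5.7), 2))  # median transformed MDRS
print(round(back_transform_log(2.8), 2))      # median eNO in ppb
```

Because both transformations are monotone, a median on the transformed scale corresponds to the median on the raw scale; the same does not hold for means.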
Figure 1. Comparison of machine learning methods. Performance comparison of different machine learning techniques in terms of area under the receiver operating characteristic curve in predicting current asthma (left panel), current wheeze (middle panel), and doctor's diagnosed eczema (right panel) using the whole feature set (demographic, environmental, genetic, lung function markers, and allergen sensitization). Results are out-of-bag predictions averaged over 100 bootstrap runs.
Comparison of machine learning methods.
| outcome | model | AUROC | sensitivity (at 90% specificity) | sensitivity (at 80% specificity) | accuracy |
|---|---|---|---|---|---|
| Doctor's Diagnosed Eczema | Decision Tree* | 0.57 (0.04) | 0.15 (0.07) | 0.29 (0.07) | 0.78 (0.02) |
| | Random Forest | 0.64 (0.03) | 0.2 (0.06) | 0.34 (0.07) | 0.79 (0.02) |
| | Logistic Regression | 0.59 (0.04) | 0.18 (0.06) | 0.31 (0.08) | 0.78 (0.02) |
| | One Rule* | 0.58 (0.06) | 0.2 (0.11) | 0.3 (0.15) | 0.79 (0.02) |
| | AdaBoost | 0.58 (0.04) | 0.17 (0.06) | 0.3 (0.07) | 0.78 (0.02) |
| Current Asthma | Decision Tree* | 0.72 (0.06) | 0.39 (0.12) | 0.54 (0.11) | 0.85 (0.02) |
| | Random Forest | 0.84 (0.03) | 0.55 (0.09) | 0.72 (0.08) | 0.87 (0.02) |
| | Logistic Regression | 0.79 (0.04) | 0.45 (0.08) | 0.63 (0.08) | 0.86 (0.02) |
| | One Rule* | 0.76 (0.06) | 0.44 (0.09) | 0.61 (0.11) | 0.86 (0.02) |
| | AdaBoost | 0.81 (0.04) | 0.48 (0.09) | 0.66 (0.07) | 0.86 (0.02) |
| Current Wheeze | Decision Tree* | 0.62 (0.06) | 0.27 (0.1) | 0.36 (0.11) | 0.88 (0.02) |
| | Random Forest | 0.76 (0.04) | 0.47 (0.09) | 0.6 (0.09) | 0.89 (0.02) |
| | Logistic Regression | 0.72 (0.04) | 0.34 (0.08) | 0.51 (0.08) | 0.88 (0.02) |
| | One Rule* | 0.69 (0.06) | 0.33 (0.09) | 0.49 (0.12) | 0.88 (0.02) |
| | AdaBoost | 0.73 (0.04) | 0.32 (0.09) | 0.5 (0.09) | 0.88 (0.02) |
Performance of machine learning models on different outcomes using the full set of demographic, environmental, genetic (single nucleotide polymorphisms), allergen sensitisation, and lung function variables. Results are mean (standard deviation) values estimated from out-of-bag distributions across 100 bootstrap runs.
*AUROC significantly different from that of the random forest at the 0.05 level. AUROC: area under the receiver operating characteristic curve.
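The metrics in the table above (AUROC, sensitivity at a fixed specificity, and mean and standard deviation over out-of-bag bootstrap estimates) can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the bootstrap is assumed to resample n subjects with replacement and evaluate on the subjects left out, and `fit_centroid_scorer` is a hypothetical stand-in for the real classifiers (random forest, AdaBoost, and so on).

```python
import math
import random
from statistics import mean, stdev

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_at_specificity(labels, scores, specificity=0.9):
    """Sensitivity at the lowest threshold whose specificity reaches the target."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = sorted(s for y, s in zip(labels, scores) if y == 0)
    # At least k negatives must score at or below the threshold
    k = max(1, math.ceil(specificity * len(neg) - 1e-9))
    threshold = neg[k - 1]
    return sum(p > threshold for p in pos) / len(pos)

def fit_centroid_scorer(X, y):
    """Hypothetical stand-in model: higher score = closer to the positive centroid."""
    def centroid(rows):
        return [mean(col) for col in zip(*rows)]
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    mu_pos = centroid([x for x, label in zip(X, y) if label == 1])
    mu_neg = centroid([x for x, label in zip(X, y) if label == 0])
    return lambda x: sq_dist(x, mu_neg) - sq_dist(x, mu_pos)

def oob_bootstrap_auroc(X, y, fit, n_runs=100, seed=0):
    """Mean and standard deviation of out-of-bag AUROC over bootstrap runs."""
    rng = random.Random(seed)
    n = len(y)
    aucs = []
    for _ in range(n_runs):
        bag = [rng.randrange(n) for _ in range(n)]      # sample with replacement
        in_bag = set(bag)
        oob = [i for i in range(n) if i not in in_bag]  # subjects never drawn
        y_bag = [y[i] for i in bag]
        y_oob = [y[i] for i in oob]
        if len(set(y_bag)) < 2 or len(set(y_oob)) < 2:
            continue  # AUROC is undefined without both classes
        model = fit([X[i] for i in bag], y_bag)
        aucs.append(auroc(y_oob, [model(X[i]) for i in oob]))
    return mean(aucs), stdev(aucs)
```

With outcomes as imbalanced as those in this study, accuracy sits close to the majority-class rate, which is why AUROC and sensitivity at fixed specificity are the more informative columns.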
Comparison of random forest performance using selected input domains.
| outcome | feature set | AUROC | p-value* | sensitivity (at 90% specificity) | sensitivity (at 80% specificity) | accuracy |
|---|---|---|---|---|---|---|
| Doctor's Diagnosed Eczema | allergens | 0.62 (0.03) | 0.34 | 0.22 (0.06) | 0.37 (0.06) | 0.79 (0.02) |
| | lung functions | 0.56 (0.04) | 0.08 | 0.13 (0.05) | 0.24 (0.06) | 0.78 (0.02) |
| | genetic | 0.56 (0.04) | 0.11 | 0.14 (0.05) | 0.25 (0.06) | 0.78 (0.02) |
| | demographic/environ. | 0.56 (0.04) | 0.05 | 0.12 (0.05) | 0.24 (0.07) | 0.78 (0.02) |
| | all | 0.65 (0.04) | reference | 0.2 (0.07) | 0.35 (0.08) | 0.79 (0.02) |
| Current Asthma | allergens | 0.79 (0.04) | 0.11 | 0.43 (0.08) | 0.64 (0.07) | 0.86 (0.02) |
| | lung functions | 0.76 (0.04) | 0.04 | 0.44 (0.08) | 0.6 (0.09) | 0.86 (0.02) |
| | genetic | 0.54 (0.04) | <0.0001 | 0.12 (0.05) | 0.23 (0.07) | 0.83 (0.02) |
| | demographic/environ. | 0.62 (0.04) | <0.0001 | 0.2 (0.08) | 0.38 (0.07) | 0.83 (0.02) |
| | all | 0.84 (0.03) | reference | 0.56 (0.09) | 0.73 (0.08) | 0.87 (0.02) |
| Current Wheeze | allergens | 0.75 (0.04) | 0.35 | 0.34 (0.09) | 0.54 (0.1) | 0.88 (0.02) |
| | lung functions | 0.72 (0.05) | 0.19 | 0.42 (0.09) | 0.55 (0.08) | 0.89 (0.02) |
| | genetic | 0.5 (0.05) | 0.0002 | 0.11 (0.06) | 0.21 (0.08) | 0.88 (0.02) |
| | demographic/environ. | 0.6 (0.05) | 0.006 | 0.17 (0.07) | 0.32 (0.09) | 0.88 (0.02) |
| | all | 0.77 (0.04) | reference | 0.5 (0.09) | 0.62 (0.07) | 0.89 (0.02) |
Performance of random forest on different outcomes using specific variable subsets and the full set of demographic, environmental, genetic (single nucleotide polymorphisms), allergen sensitisation, and lung function variables. Results are mean (standard deviation) values estimated from out-of-bag distributions across 100 bootstrap runs.
*p-value from a corrected paired t-test of the null hypothesis that the AUROC does not differ from that of the random forest model using all variables.
AUROC: area under the receiver operating characteristic curve.
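The footnote does not say which correction the paired t-test uses. A common choice for comparing models over overlapping resamples is the Nadeau-Bengio corrected resampled t-test, which inflates the variance by the test-to-train size ratio. The sketch below assumes that variant; `diffs`, `n_train` and `n_test` are illustrative names, and the values in the usage line are made up.

```python
import math
from statistics import mean, variance

def corrected_paired_t(diffs, n_train, n_test):
    """Corrected resampled t statistic (Nadeau-Bengio form, assumed here).

    diffs   : per-run AUROC differences between two models on the same resamples
    n_train : number of training subjects per run
    n_test  : number of held-out (e.g. out-of-bag) subjects per run
    Returns the t statistic and its degrees of freedom (k - 1).
    """
    k = len(diffs)
    if k < 2:
        raise ValueError("need at least two paired differences")
    var = variance(diffs)  # sample variance with k - 1 in the denominator
    if var == 0:
        raise ValueError("zero variance: the statistic is undefined")
    # The naive paired t-test would use var / k; the correction adds n_test / n_train
    standard_error = math.sqrt((1.0 / k + n_test / n_train) * var)
    return mean(diffs) / standard_error, k - 1

# Made-up differences from five paired runs, for illustration only
t_stat, dof = corrected_paired_t([0.1, 0.2, 0.3, 0.2, 0.2], n_train=90, n_test=10)
print(round(t_stat, 3), dof)
```

A two-sided p-value then follows from the Student t distribution with `dof` degrees of freedom, for example via `scipy.stats.t.sf`. The correction always yields a smaller absolute statistic than the uncorrected test, which counteracts the optimism caused by reusing the same subjects across runs.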
Figure 2. Comparison of random forest performance using selected input domains. Performance comparison of random forests in terms of area under the receiver operating characteristic curve in predicting current asthma (left panel), current wheeze (middle panel), and doctor's diagnosed eczema (right panel) using the whole feature set (demographic, environmental, genetic, lung function markers, and allergen sensitization) and selected feature subsets. Results are out-of-bag predictions averaged over 100 bootstrap runs.
Figure 3. Feature importance evaluation by means of random forests. Importance is calculated and shown as the rescaled mean (standard deviation) decrease in accuracy over 1000 independent runs (green colour). Boxplots represent a null feature importance distribution obtained by randomly permuting the outcome 1000 times. Variables significant at the 0.1 level (p-values in red) are shown for current asthma (upper panel), current wheeze (middle panel), and doctor's diagnosed eczema (lower panel) using the whole feature set as input.
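The permutation scheme in the Figure 3 caption, comparing each observed importance with a null distribution obtained by shuffling the outcome, can be sketched as below. Two assumptions are made for brevity: `mean_gap_importance` is a simple stand-in for the random forest mean decrease in accuracy, and the empirical p-value uses the common add-one form, which the caption does not specify.

```python
import random
from statistics import mean

def mean_gap_importance(X, y):
    """Stand-in importance (NOT random forest): absolute gap between class means."""
    importances = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        importances.append(abs(mean(pos) - mean(neg)))
    return importances

def permutation_p_values(X, y, importance_fn, n_perm=1000, seed=0):
    """Empirical p-value per feature against an outcome-permutation null."""
    rng = random.Random(seed)
    observed = importance_fn(X, y)
    exceed = [0] * len(observed)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any link between features and outcome
        null = importance_fn(X, y_perm)
        for j, (obs, nul) in enumerate(zip(observed, null)):
            if nul >= obs:
                exceed[j] += 1
    # Add-one estimator (an assumption): avoids reporting a p-value of exactly 0
    p_values = [(1 + e) / (1 + n_perm) for e in exceed]
    return observed, p_values

# Toy data: feature 0 equals the outcome, feature 1 is pure noise
data_rng = random.Random(42)
y = [0] * 20 + [1] * 20
X = [[float(label), data_rng.random()] for label in y]
observed, p_values = permutation_p_values(X, y, mean_gap_importance, n_perm=200)
significant = [j for j, p in enumerate(p_values) if p < 0.1]  # 0.1 level as in Figure 3
```

Shuffling the outcome rather than individual features keeps the correlation structure among predictors intact, so the null distribution reflects what importance values arise by chance on this exact feature set.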