Joanna F Dipnall, Julie A Pasco, Michael Berk, Lana J Williams, Seetal Dodd, Felice N Jacka, Denny Meyer.
Abstract
BACKGROUND: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study.
Year: 2016 PMID: 26848571 PMCID: PMC4744063 DOI: 10.1371/journal.pone.0148195
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Hybrid Methodology Steps.
Estimated covariate statistics.
| Covariate | Estimate | Std. Err. | CI Low | CI High | p-value |
|---|---|---|---|---|---|
| Not Depressed | 0.923 | 0.005 | 0.913 | 0.934 | |
| Depressed | 0.077 | 0.005 | 0.066 | 0.087 | |
| Male | 0.496 | 0.006 | 0.483 | 0.509 | Reference |
| Female | 0.504 | 0.006 | 0.491 | 0.517 | <0.001 |
| 18–34 years | 0.329 | 0.009 | 0.309 | 0.348 | Reference |
| 35–44 years | 0.210 | 0.008 | 0.193 | 0.226 | 0.457 |
| 45–54 years | 0.223 | 0.007 | 0.207 | 0.238 | 0.117 |
| 55–64 years | 0.176 | 0.006 | 0.162 | 0.189 | 0.272 |
| 65+ years | 0.063 | 0.004 | 0.056 | 0.071 | 0.006 |
| Mexican American/Hispanic | 0.147 | 0.031 | 0.082 | 0.213 | 0.010 |
| Non-Hispanic White | 0.678 | 0.035 | 0.602 | 0.753 | Reference |
| Non-Hispanic Black | 0.110 | 0.009 | 0.089 | 0.130 | <0.001 |
| Other | 0.065 | 0.009 | 0.046 | 0.084 | 0.542 |
| Current Smoker | 0.219 | 0.008 | 0.201 | 0.237 | <0.001 |
| Former Smoker | 0.226 | 0.014 | 0.195 | 0.257 | 0.800 |
| Never Smoked | 0.555 | 0.019 | 0.515 | 0.595 | Reference |
| Full food security | 0.782 | 0.013 | 0.755 | 0.809 | <0.001 |
| Food insecurity | 0.218 | 0.013 | 0.191 | 0.245 | Reference |
| Underweight | 0.018 | 0.003 | 0.012 | 0.024 | 0.396 |
| Normal | 0.295 | 0.014 | 0.266 | 0.324 | Reference |
| Overweight | 0.331 | 0.011 | 0.307 | 0.355 | 0.358 |
| Obese | 0.356 | 0.011 | 0.333 | 0.379 | 0.035 |
| Low to high activity | 0.451 | 0.018 | 0.411 | 0.490 | Reference |
| No low to high activity | 0.549 | 0.018 | 0.510 | 0.589 | 0.882 |
| No Diabetes | 0.910 | 0.005 | 0.900 | 0.920 | Reference |
| Diabetes | 0.090 | 0.005 | 0.080 | 0.100 | 0.001 |
| No Cardiovascular Disease | 0.924 | 0.007 | 0.910 | 0.938 | Reference |
| Cardiovascular Disease | 0.076 | 0.007 | 0.062 | 0.090 | 0.009 |
| No Arthritis | 0.928 | 0.005 | 0.918 | 0.938 | Reference |
| Arthritis | 0.072 | 0.005 | 0.062 | 0.082 | <0.001 |
| No Cancer or malignancy | 0.901 | 0.008 | 0.884 | 0.919 | Reference |
| Cancer or malignancy | 0.099 | 0.008 | 0.081 | 0.116 | 0.925 |
| No | 0.854 | 0.008 | 0.836 | 0.871 | Reference |
| Yes | 0.146 | 0.008 | 0.129 | 0.164 | <0.001 |
| No | 0.915 | 0.006 | 0.902 | 0.928 | Reference |
| Yes | 0.085 | 0.006 | 0.072 | 0.098 | <0.001 |
| | 3.013 | 0.044 | 2.918 | 3.108 | <0.001 |
| | 6.235 | 0.290 | 5.615 | 6.856 | 0.005 |
Note: Multiple-imputation survey estimation, based on 20 imputations; primary N = 5,227. The p-value indicates the significance of the association with depression.
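Estimates based on 20 imputations have to be combined into a single value and standard error. The sketch below shows the standard combination rule (Rubin's rules) for one parameter. It is a minimal illustration, not the authors' code: the function name `pool_rubin` and the input values are hypothetical, and the survey-design adjustment applied within each imputed data set is assumed to have already produced the per-imputation standard errors.

```python
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")
    q_bar = mean(estimates)                         # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # mean within-imputation variance
    between = variance(estimates)                   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between          # total variance of the pooled estimate
    return q_bar, total ** 0.5


# Hypothetical per-imputation results (illustrative values, not from the study).
est = [0.076, 0.078, 0.077, 0.075, 0.079]
se = [0.005, 0.005, 0.006, 0.005, 0.005]
pooled_est, pooled_se = pool_rubin(est, se)
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the uncertainty due to the missing data.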
Boosted regression statistics.
| Biomarker | Original data set | Imputation sets 1 to 20: Mean | Imputation sets 1 to 20: Std Dev | Imputation sets 1 to 20: Min | Imputation sets 1 to 20: Max |
|---|---|---|---|---|---|
| T.gondii antibodies (IU/ml) | 0.220 | 0.562 | 0.444 | 0.326 | 2.388 |
| Blood lead (ug/dL) | 1.482 | 1.658 | 0.066 | 1.537 | 1.753 |
| Mercury, total (ug/L) | 1.958 | 1.847 | 0.110 | 1.628 | 2.049 |
| Mercury, inorganic (ug/L) | 0.290 | 1.668 | 0.116 | 1.358 | 1.788 |
| White blood cell count (1000 cells/uL) | 1.243 | 1.126 | 0.065 | 1.000 | 1.277 |
| Lymphocyte percent (%) | 1.331 | 0.978 | 0.078 | 0.780 | 1.176 |
| Monocyte percent (%) | 1.996 | 1.595 | 0.172 | 1.371 | 1.904 |
| Segmented neutrophils percent (%) | 1.240 | 1.004 | 0.082 | 0.856 | 1.121 |
| Eosinophils percent (%) | 1.770 | 0.971 | 0.126 | 0.819 | 1.406 |
| Basophils percent (%) | 0.565 | 0.585 | 0.051 | 0.503 | 0.690 |
| Lymphocyte number (1000 cells/uL) | 0.754 | 0.912 | 0.171 | 0.691 | 1.559 |
| Monocyte number (1000 cells/uL) | 0.835 | 0.492 | 0.043 | 0.392 | 0.552 |
| Segmented neutrophils number (1000 cells/uL) | 1.138 | 1.103 | 0.075 | 0.950 | 1.282 |
| Eosinophils number (1000 cells/uL) | 0.229 | 0.129 | 0.026 | 0.052 | 0.157 |
| Basophils number (1000 cells/uL) | 0.069 | 0.084 | 0.012 | 0.056 | 0.100 |
| Red blood cell count (million cells/uL) | 0.829 | 1.083 | 0.148 | 0.898 | 1.595 |
| 1.717 | 0.152 | 2.667 | 3.210 | ||
| 0.847 | 0.148 | 2.533 | 3.103 | ||
| Mean cell volume (fL) | 0.494 | 0.943 | 0.082 | 0.853 | 1.154 |
| Mean cell hemoglobin (pg) | 0.848 | 1.307 | 0.045 | 1.251 | 1.432 |
| 0.105 | 3.195 | 3.610 | |||
| 1.918 | 0.105 | 1.685 | 2.058 | ||
| 0.112 | 2.443 | 2.842 | |||
| Mean platelet volume (fL) | 1.257 | 1.348 | 0.095 | 1.214 | 1.519 |
| 0.136 | 3.908 | 4.447 | |||
| Glycohemoglobin (%) | 1.526 | 1.830 | 0.107 | 1.604 | 1.986 |
| 1.877 | 0.107 | 2.024 | 2.405 | ||
| Direct HDL-Cholesterol (mg/dL) | 1.097 | 1.149 | 0.104 | 0.946 | 1.328 |
| RBC folate (ng/mL) | 0.536 | 0.779 | 0.097 | 0.588 | 1.021 |
| Serum folate (ng/mL) | 1.955 | 1.840 | 0.163 | 1.486 | 2.087 |
| 1.982 | 0.277 | 1.542 | 2.739 | ||
| 0.496 | 4.544 | 6.459 | |||
| Albumin (g/dL) | 0.914 | 0.543 | 0.050 | 0.459 | 0.662 |
| Alanine aminotransferase ALT (U/L) | 0.843 | 0.836 | 0.033 | 0.791 | 0.889 |
| Aspartate aminotransferase AST (U/L) | 0.727 | 0.577 | 0.053 | 0.462 | 0.694 |
| 1.966 | 0.086 | 1.825 | 2.184 | ||
| Blood urea nitrogen (mg/dL) | 0.840 | 0.875 | 0.088 | 0.718 | 1.061 |
| Total calcium (mg/dL) | 0.671 | 0.610 | 0.053 | 0.528 | 0.753 |
| Cholesterol (mg/dL) | 0.373 | 0.793 | 0.086 | 0.680 | 1.076 |
| 1.840 | 0.089 | 1.661 | 2.035 | ||
| 0.189 | 2.491 | 3.370 | |||
| Gamma glutamyl transferase (U/L) | 1.023 | 0.689 | 0.045 | 0.628 | 0.823 |
| 0.168 | 2.140 | 2.742 | |||
| Iron, refrigerated (ug/dL) | 1.752 | 1.443 | 0.088 | 1.288 | 1.559 |
| Lactate dehydrogenase (U/L) | 1.050 | 1.102 | 0.066 | 0.973 | 1.202 |
| Phosphorus (mg/dL) | 0.556 | 0.587 | 0.040 | 0.514 | 0.668 |
| 0.291 | 1.841 | 2.853 | |||
| Total protein (g/dL) | 0.639 | 0.930 | 0.064 | 0.820 | 1.055 |
| Triglycerides (mg/dL) | 0.987 | 1.285 | 0.170 | 1.126 | 1.867 |
| 0.116 | 2.232 | 2.675 | |||
| Sodium (mmol/L) | 0.957 | 0.299 | 0.034 | 0.242 | 0.366 |
| Potassium (mmol/L) | 0.654 | 0.664 | 0.059 | 0.559 | 0.778 |
| 1.629 | 0.065 | 1.460 | 1.726 | ||
| Osmolality (mmol/Kg) | 0.749 | 1.165 | 0.061 | 1.019 | 1.263 |
| Globulin (g/dL) | 0.501 | 0.679 | 0.062 | 0.529 | 0.778 |
| Total Cholesterol (mg/dL) | 0.457 | 0.609 | 0.058 | 0.488 | 0.714 |
| Albumin, urine (ug/mL) | 1.247 | 1.459 | 0.085 | 1.326 | 1.680 |
| Creatinine, urine (umol/L) | 0.963 | 0.992 | 0.108 | 0.769 | 1.193 |
| First albumin creatinine ratio (mg/g) | 0.723 | 0.787 | 0.118 | 0.636 | 1.110 |
| 0.975 | 1.184 | 1.071 | 5.309 | ||
| 1.768 | 0.525 | 1.459 | 3.329 | ||
| 1.176 | 0.825 | 1.783 | 5.326 | ||
| 1.485 | 0.092 | 1.948 | 2.336 | ||
| 0.720 | 1.743 | 4.456 | |||
| Urine osmolality (mOsm/kg) | 1.308 | 1.675 | 0.179 | 1.352 | 2.242 |
| Hepatitis A Antibody | 0.031 | 0.089 | 0.028 | 0.046 | 0.151 |
| Hepatitis B surface antibody | 0.034 | 0.051 | 0.018 | 0.023 | 0.082 |
Note: Highlighting indicates biomarkers selected for univariate logistic regression. Validation was performed on the original data set plus each imputed data set, using a random 60:40 training:validation split, λ = 0.001, 50% bagging, and a maximum of 6 boosting interactions. The original-data pseudo-R² was 0.032; the imputed-data pseudo-R² ranged from 0.044 to 0.052. Variables selected at this step accounted for more than 50% of the total relative importance: 53.85% in the original data and a mean of 53.33% across the 20 imputation sets.
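The note states only that the selected variables accounted for more than 50% of the total relative importance; the exact rule is not given. One possible reading, sketched below, is to rank variables by relative importance and keep the top-ranked ones until their combined share first exceeds the threshold. The function name, the marker names, and the scores are all hypothetical placeholders.

```python
def select_by_cumulative_importance(importance, threshold=0.5):
    """Keep top-ranked variables until their combined share exceeds threshold."""
    total = sum(importance.values())
    selected, running = [], 0.0
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    for name, value in ranked:
        if running > threshold * total:
            break                      # threshold already exceeded, stop adding
        selected.append(name)
        running += value
    return selected, running / total


# Hypothetical relative-importance scores (placeholders, not study values).
scores = {"marker_a": 5.0, "marker_b": 3.0, "marker_c": 2.5,
          "marker_d": 1.0, "marker_e": 0.5}
chosen, share = select_by_cumulative_importance(scores)
```

With these placeholder scores the first two markers are kept, and `share` reports the fraction of total importance they cover, the analogue of the 53.85% and 53.33% figures quoted in the note.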
Univariate Logistic Regression statistics.
| Biomarker | Training Odds Ratio | Training Std. Err. | Training p-value | Training CI Low | Training CI High | Validation Odds Ratio | Validation Std. Err. | Validation p-value | Validation CI Low | Validation CI High |
|---|---|---|---|---|---|---|---|---|---|---|
| | 0.85 | 0.057 | 0.042 | 0.73 | 0.99 | 0.85 | 0.059 | 0.048 | 0.72 | 1.00 |
| Hematocrit (%) | 0.95 | 0.024 | 0.084 | 0.90 | 1.01 | 0.95 | 0.025 | 0.077 | 0.89 | 1.01 |
| MCHC (g/dL) | 0.85 | 0.115 | 0.261 | 0.63 | 1.15 | 0.89 | 0.131 | 0.454 | 0.65 | 1.23 |
| | 1.20 | 0.080 | 0.024 | 1.03 | 1.40 | 1.20 | 0.079 | 0.023 | 1.03 | 1.40 |
| Platelet count (1000 cells/uL) | 1.00 | 0.002 | 0.058 | 1.00 | 1.01 | 1.00 | 0.002 | 0.083 | 1.00 | 1.01 |
| | 1.07 | 0.018 | 0.014 | 1.02 | 1.11 | 1.07 | 0.018 | 0.012 | 1.02 | 1.11 |
| C-reactive protein (mg/dL) | 1.18 | 0.106 | 0.091 | 0.97 | 1.44 | 1.15 | 0.098 | 0.141 | 0.95 | 1.38 |
| | 1.00 | 0.001 | 0.011 | 1.00 | 1.00 | 1.00 | 0.001 | 0.009 | 1.00 | 1.00 |
| Urinary Total NNAL (ng/mL) | 1.07 | 0.092 | 0.444 | 0.87 | 1.32 | 1.10 | 0.117 | 0.390 | 0.85 | 1.44 |
| Alkaline phosphatase (U/L) | 1.00 | 0.004 | 0.197 | 1.00 | 1.01 | 1.00 | 0.004 | 0.262 | 1.00 | 1.01 |
| Bicarbonate (mmol/L) | 0.93 | 0.045 | 0.156 | 0.83 | 1.03 | 0.92 | 0.044 | 0.133 | 0.83 | 1.03 |
| Creatinine (mg/dL) | 0.87 | 0.409 | 0.770 | 0.30 | 2.48 | 0.87 | 0.430 | 0.778 | 0.28 | 2.67 |
| | 1.00 | 0.002 | 0.050 | 1.00 | 1.01 | 1.01 | 0.002 | 0.039 | 1.00 | 1.01 |
| | 0.19 | 0.093 | 0.009 | 0.06 | 0.58 | 0.24 | 0.112 | 0.016 | 0.08 | 0.71 |
| Uric acid (mg/dL) | 0.97 | 0.069 | 0.655 | 0.83 | 1.13 | 0.94 | 0.062 | 0.345 | 0.81 | 1.08 |
| Chloride (mmol/L) | 0.98 | 0.039 | 0.700 | 0.90 | 1.08 | 0.98 | 0.038 | 0.624 | 0.90 | 1.07 |
| Second albumin (ug/mL) | 1.00 | 0.001 | 0.477 | 1.00 | 1.00 | 1.00 | 0.001 | 0.290 | 1.00 | 1.00 |
| Second creatinine (mg/dL) | 1.00 | 0.002 | 0.438 | 1.00 | 1.00 | 1.00 | 0.001 | 0.479 | 1.00 | 1.00 |
| Second albumin creatinine ratio (mg/g) | 1.00 | 0.000 | 0.516 | 1.00 | 1.00 | 1.00 | 0.000 | 0.315 | 1.00 | 1.00 |
| The volume of urine collection #1 | 1.00 | 0.001 | 0.813 | 1.00 | 1.00 | 1.00 | 0.001 | 0.651 | 1.00 | 1.00 |
| Urine #1 Flow Rate | 0.89 | 0.125 | 0.438 | 0.65 | 1.23 | 0.86 | 0.136 | 0.382 | 0.60 | 1.24 |
Note: Bold biomarker names indicate selection. Multiple-imputation logistic regression was used, taking account of the NHANES survey design with 15 strata and 31 primary sampling units (PSUs).
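The odds ratios, standard errors, and confidence limits in this table are linked by a simple calculation on the log-odds scale. The sketch below assumes the reported Std. Err. is on the odds-ratio scale (so that the log-scale standard error is approximately Std. Err. divided by the odds ratio) and uses a normal critical value by default. Both are assumptions for illustration; the function name `odds_ratio_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(odds_ratio, se_odds_ratio, level=0.95, critical=None):
    """Confidence interval for an odds ratio, built on the log-odds scale."""
    if critical is None:
        # Normal approximation (about 1.96 for a 95% interval); an assumption.
        critical = NormalDist().inv_cdf(0.5 + level / 2)
    log_or = math.log(odds_ratio)
    se_log = se_odds_ratio / odds_ratio   # delta method: SE(OR) ~ OR * SE(log OR)
    half_width = critical * se_log
    return math.exp(log_or - half_width), math.exp(log_or + half_width)


# Odds ratio 1.20 with Std. Err. 0.080, as in one training row of the table.
low, high = odds_ratio_ci(1.20, 0.080)
```

The interval is symmetric around the odds ratio on the log scale, not on the original scale. The reported limits for this row (1.03 to 1.40) are somewhat wider than the normal approximation gives, which would be consistent with a t reference distribution; a larger value can be passed through `critical` to explore that.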
Final Four biomarkers from boosted regression.
| Biomarker | Training Odds Ratio | Training Std. Err. | Training p-value | Training 95% CI Low | Training 95% CI High | Validation Odds Ratio | Validation Std. Err. | Validation p-value | Validation 95% CI Low | Validation 95% CI High |
|---|---|---|---|---|---|---|---|---|---|---|
| Red cell distribution width | 1.159 | 0.079 | 0.057 | 0.995 | 1.350 | 1.161 | 0.080 | 0.063 | 0.990 | 1.362 |
| Blood cadmium (nmol/L) | 1.060 | 0.017 | 0.020 | 1.015 | 1.107 | 1.060 | 0.017 | 0.017 | 1.016 | 1.106 |
| Glucose, serum (mg/dL) | 1.005 | 0.002 | 0.066 | 1.000 | 1.009 | 1.005 | 0.002 | 0.051 | 1.000 | 1.010 |
| Total bilirubin (mg/dL) | 0.241 | 0.112 | 0.016 | 0.082 | 0.703 | 0.315 | 0.143 | 0.034 | 0.111 | 0.895 |
| Constant | 0.017 | 0.014 | 0.001 | 0.002 | 0.116 | 0.012 | 0.011 | 0.001 | 0.002 | 0.094 |
Note: Multiple imputation logistic regression using subpopulation based on a random split of approximately 50:50 train:validation (n = 2,590 train: n = 2,637 validation).
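The note does not say how the random split was drawn. One simple scheme that yields approximately, rather than exactly, equal halves is to assign each participant independently to one of the two groups. The sketch below illustrates that scheme; the function name, the seed, and the use of integer identifiers are hypothetical.

```python
import random


def random_split(ids, train_fraction=0.5, seed=0):
    """Assign each unit independently to the training or validation group."""
    rng = random.Random(seed)          # fixed seed (hypothetical) for reproducibility
    train, validation = [], []
    for unit in ids:
        target = train if rng.random() < train_fraction else validation
        target.append(unit)
    return train, validation


# 5,227 placeholder identifiers, matching the primary N reported earlier.
train_ids, validation_ids = random_split(range(5227))
```

Because each assignment is independent, the two group sizes differ slightly from run to run, in the same way that the reported groups (2,590 and 2,637) are close to but not exactly half of the sample.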
Final Multivariate Logistic Regression.
| Variable | Odds Ratio | Std. Err. | p-value | 95% CI Low | 95% CI High |
|---|---|---|---|---|---|
| Red cell distribution width | 1.145 | 0.067 | 0.037 | 1.009 | 1.298 |
| Blood cadmium (nmol/L) | 1.024 | 0.018 | 0.182 | 0.987 | 1.063 |
| Glucose, serum (mg/dL) | 1.005 | 0.002 | 0.009 | 1.001 | 1.008 |
| Total bilirubin (mg/dL) | 0.116 | 0.049 | <0.001 | 0.047 | 0.284 |
| Male (Reference) | 1.000 | ||||
| Female | 1.610 | 0.312 | 0.027 | 1.064 | 2.439 |
| 18–34 (Reference) | 1.000 | ||||
| 35–44 | 1.287 | 0.346 | 0.364 | 0.724 | 2.288 |
| 45–54 | 1.993 | 0.419 | 0.005 | 1.271 | 3.124 |
| 55–64 | 1.475 | 0.398 | 0.171 | 0.828 | 2.627 |
| 65+ | 0.660 | 0.421 | 0.525 | 0.169 | 2.585 |
| Non-Hispanic White (Reference) | 1.000 | ||||
| Mexican Amer/Hispanic | 0.409 | 0.150 | 0.029 | 0.186 | 0.898 |
| Non-Hispanic Black | 0.842 | 0.391 | 0.718 | 0.312 | 2.278 |
| Other | 1.552 | 1.527 | 0.662 | 0.189 | 12.743 |
| Never smoked (Reference) | 1.000 | ||||
| Current smoker | 0.382 | 0.162 | 0.039 | 0.154 | 0.946 |
| Former smoker | 0.506 | 0.326 | 0.308 | 0.127 | 2.013 |
| Food insecurity (Reference) | 1.000 | ||||
| Full food security | 0.492 | 0.093 | 0.002 | 0.328 | 0.737 |
| | 0.787 | 0.058 | 0.006 | 0.671 | 0.923 |
| Normal (Reference) | 1.000 | ||||
| Underweight | 2.497 | 1.630 | 0.182 | 0.617 | 10.101 |
| Overweight | 0.864 | 0.177 | 0.486 | 0.557 | 1.339 |
| Obese | 1.007 | 0.194 | 0.973 | 0.666 | 1.521 |
| Active (Reference) | 1.000 | ||||
| Inactive | 1.104 | 0.135 | 0.431 | 0.850 | 1.433 |
| | 1.008 | 0.014 | 0.601 | 0.978 | 1.038 |
| No (Reference) | 1.000 | ||||
| Yes | 0.798 | 0.210 | 0.407 | 0.454 | 1.403 |
| No (Reference) | 1.000 | ||||
| Yes | 2.279 | 0.292 | <0.001 | 1.732 | 2.999 |
| No (Reference) | 1.000 | ||||
| Yes | 2.912 | 0.433 | <0.001 | 2.118 | 4.002 |
| 18–34 (Reference) | 1.000 | ||||
| 35–44 | 0.974 | 0.024 | 0.308 | 0.924 | 1.027 |
| 45–54 | 0.950 | 0.019 | 0.025 | 0.910 | 0.993 |
| 55–64 | 0.946 | 0.029 | 0.088 | 0.887 | 1.009 |
| 65+ | 0.966 | 0.062 | 0.606 | 0.842 | 1.110 |
| No (Reference) | 1.000 | ||||
| Yes | 1.115 | 0.049 | 0.025 | 1.016 | 1.225 |
| Non-Hispanic White (Reference) | 1.000 | ||||
| Mexican Amer/Hispanic | 3.946 | 1.988 | 0.016 | 1.342 | 11.603 |
| Non-Hispanic Black | 1.715 | 1.333 | 0.499 | 0.325 | 9.050 |
| Other | 0.484 | 0.628 | 0.585 | 0.030 | 7.777 |
| Never smoked (Reference) | 1.000 | ||||
| Current smoker | 9.131 | 4.193 | <0.001 | 3.418 | 24.398 |
| Former smoker | 2.676 | 2.396 | 0.290 | 0.394 | 18.187 |
| Constant | 0.040 | 0.030 | 0.001 | 0.008 | 0.200 |
Note: Multiple imputation logistic regression taking account of the complex survey design of NHANES with 15 strata, 31 PSUs. (n = 3,326).
Top 15 biomarkers selected from lasso regression.
| Biomarker | Frequency |
|---|---|
| Blood cadmium (nmol/L) | 21 |
| Blood urea nitrogen (mg/dL) | 21 |
| Blood lead (ug/dL) | 20 |
| Cotinine (ng/mL) | 20 |
| Mercury, total (ug/L) | 20 |
| Platelet count (1000 cells/uL) | 19 |
| Mercury, inorganic (ug/L) | 18 |
| Globulin (g/dL) | 18 |
| Red blood cell count (million cells/uL) | 17 |
| Albumin (g/dL) | 16 |
| Phosphorus (mg/dL) | 16 |
| Direct HDL-Cholesterol (mg/dL) | 15 |
Note: Bold represents the final 3 biomarkers selected by the proposed methodology.
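The Frequency column can be read as the number of data sets in which the lasso retained a given biomarker. A maximum of 21 would be consistent with the original data set plus the 20 imputed sets, although that is an assumption, not something the table states. A minimal sketch of such a tally, with hypothetical marker names and a hypothetical function name, follows.

```python
from collections import Counter


def selection_frequency(selected_per_dataset):
    """Count in how many data sets each variable was selected."""
    counts = Counter()
    for selected in selected_per_dataset:
        counts.update(set(selected))   # set() so a name counts once per data set
    return counts.most_common()        # sorted from most to least frequent


# Hypothetical selections from three data sets (placeholder names).
runs = [["marker_a", "marker_b"],
        ["marker_a"],
        ["marker_a", "marker_c", "marker_b"]]
ranking = selection_frequency(runs)
```

Ranking variables by this count rewards biomarkers that are selected consistently across data sets, which is less sensitive to any single imputation than a selection made on one data set alone.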