Ann-Kristin Becker, Till Ittermann, Markus Dörr, Stephan B Felix, Matthias Nauck, Alexander Teumer, Uwe Völker, Henry Völzke, Lars Kaderali, Neetika Nath.
Abstract
BACKGROUND: Approaching epidemiological data with flexible machine learning algorithms is of great value for understanding disease-specific association patterns. However, it can be difficult to correctly extract and understand those patterns due to the lack of model interpretability.
Year: 2022 PMID: 35862421 PMCID: PMC9302835 DOI: 10.1371/journal.pone.0271610
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Workflow.
Schematic representation of the workflow. After data preparation, a random forest (RF) model is trained using nested cross-validation. Relevant predictors are identified based on two feature importance measures and a mixture model approach. Lastly, feature interactions among the relevant predictors are examined in a Bayesian network analysis.
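The nested cross-validation step can be illustrated with a minimal sketch. It is not the authors' pipeline: to stay free of third-party dependencies, a tiny one-dimensional k-nearest-neighbour regressor stands in for the RF, and the data, the candidate values of `k`, and all function names are hypothetical. The point is only the structure: the inner loop selects a hyperparameter using training data alone, and the outer loop scores that choice on data the selection never saw.

```python
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, k):
    """Mean squared error on `test` of a k-NN model fitted on `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def split_folds(data, n_folds):
    """Split data into n_folds interleaved folds."""
    return [data[i::n_folds] for i in range(n_folds)]

def without(folds, held_out):
    """Concatenate all folds except the held-out one."""
    return [p for j, fold in enumerate(folds) if j != held_out for p in fold]

def nested_cv(data, candidate_ks, n_outer=5, n_inner=3):
    outer = split_folds(data, n_outer)
    results = []
    for i, test in enumerate(outer):
        train = without(outer, i)
        inner = split_folds(train, n_inner)

        def inner_error(k):
            # Average validation error over the inner folds (training data only).
            return sum(mse(without(inner, a), val, k)
                       for a, val in enumerate(inner)) / n_inner

        best_k = min(candidate_ks, key=inner_error)
        # The outer test fold is used once, after the hyperparameter is fixed.
        results.append((best_k, mse(train, test, best_k)))
    return results

# Hypothetical synthetic data: y depends linearly on x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
data = [(x, 0.5 * x + rng.gauss(0, 1)) for x in xs]

results = nested_cv(data, candidate_ks=[1, 3, 7])
for best_k, err in results:
    print(best_k, round(err, 3))
```

Averaging the outer-fold errors gives a performance estimate that is not biased by the hyperparameter search, which is the reason for nesting the two loops.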
The table reports the results of the simulation study.
A) Feature Subset Extraction: number of false positive (fp) / false negative (fn) features, averaged over 10 iterations. B) Relearned Network Structure: number of false positive (fp) / false negative (fn) arcs in the relearned network structure, averaged over 10 iterations.
| Sample size | small network (5 nodes) | medium network (10 nodes) | large network (20 nodes) |
|---|---|---|---|
| A) Feature Subset Extraction | | | |
| | 0.4 fp / 2 fn | 0.2 fp / 1 fn | 1.6 fp / 6.2 fn |
| | 0.4 fp / 1.6 fn | 0 fp / 1.2 fn | 1.4 fp / 5.8 fn |
| | 0.4 fp / 1.6 fn | 0.2 fp / 0.8 fn | 0.6 fp / 5.8 fn |
| B) Relearned Network Structure | | | |
| | 0 fp / 1.2 fn | 2.4 fp / 2.4 fn | 2.6 fp / 2.8 fn |
| | 0 fp / 1.2 fn | 1 fp / 1.2 fn | 1.8 fp / 2.2 fn |
| | 0 fp / 0.8 fn | 1 fp / 1.1 fn | 0.8 fp / 2 fn |
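As an illustration of how false positive and false negative arc counts like those in part B can be obtained, the following sketch compares a learned arc set against a known true arc set and averages over runs. The networks, the function name `arc_errors`, and the `directed` switch are hypothetical. The source does not state whether arc direction was taken into account, so both variants are shown.

```python
from statistics import mean

def arc_errors(true_arcs, learned_arcs, directed=True):
    """Return (fp, fn): arcs learned but not true, and true arcs that were missed."""
    def norm(arc):
        # An undirected comparison treats (A, B) and (B, A) as the same arc.
        return tuple(arc) if directed else frozenset(arc)
    true_set = {norm(a) for a in true_arcs}
    learned_set = {norm(a) for a in learned_arcs}
    return len(learned_set - true_set), len(true_set - learned_set)

# Hypothetical true structure and two hypothetical relearned structures.
true_arcs = [("A", "B"), ("B", "C"), ("A", "D")]
runs = [
    [("A", "B"), ("C", "B"), ("D", "E")],  # one reversed arc, one spurious arc
    [("A", "B"), ("B", "C")],              # one missing arc
]

errors = [arc_errors(true_arcs, run) for run in runs]
avg_fp = mean(fp for fp, _ in errors)
avg_fn = mean(fn for _, fn in errors)
print(avg_fp, avg_fn)
```

Under the directed comparison a reversed arc counts as both a false positive and a false negative. With `directed=False` it counts as correct.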
Descriptive statistics of thyroid examination results from SHIP.
Mean, standard deviation, median and skewness are presented for continuous features. For categorical features, the exact distribution is shown. The analysis is based on n = 3,989 probands.
| Continuous feature | Mean | SD | Median | Skewness |
|---|---|---|---|---|
| Thyroid stimulating hormone (TSH) [mU/l] | 0.89 | 2.28 | 0.66 | 25.5 |
| log-transformed TSH | -0.45 | 0.73 | -0.41 | -0.65 |
| free triiodothyronine [pmol/l] | 5.25 | 0.88 | 5.2 | 1.24 |
| free thyroxine [pmol/l] | 12.84 | 3.82 | 12.5 | 1.24 |
| total sonography volume of the thyroid | 21.54 | 12.57 | 18.8 | 3.35 |
| Iodide (urine) [μg/dl] | 14.42 | 11.64 | 12.5 | 5.1 |
| anti-TPO antibodies [IU/l] | 90.28 | 294.28 | 45.1 | 25.47 |

| Categorical feature | n (%) | n (%) |
|---|---|---|
| presence of thyroid nodule(s) | 3299 (77.2%) | 975 (22.8%) |
| hypoechoic thyroid pattern | 3958 (92.7%) | 313 (7.3%) |
| enlargement of the thyroid gland | 2660 (62.2%) | 1611 (37.8%) |
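The four summary statistics reported for continuous features can be computed as in the sketch below, which also shows why a log transformation is attractive for a strongly right-skewed variable such as TSH. The sample values are hypothetical, and the skewness estimator (the moment coefficient m3 / m2^1.5) is an assumption, since the source does not say which estimator was used. The natural logarithm is used because it is consistent with the reported medians (ln 0.66 ≈ -0.42).

```python
import math
from statistics import mean, median, stdev

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

def describe(values):
    """Mean, sample standard deviation, median and skewness of a feature."""
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "skewness": skewness(values),
    }

# Hypothetical right-skewed measurements with one large outlier.
raw = [0.3, 0.5, 0.6, 0.7, 0.8, 1.0, 1.2, 9.5]
logged = [math.log(v) for v in raw]

raw_stats = describe(raw)
log_stats = describe(logged)
print(raw_stats)
print(log_stats)
```

The log transformation preserves the ordering of the values, so the median of the transformed data is the average of the logs of the two middle values, while the influence of the outlier on mean, SD and skewness shrinks.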
Evaluation of the final RF model for the prediction of TSH.
As a baseline comparison, we trained a similar model on the same dataset after randomly shuffling the TSH values. The scores in the column "(random) baseline prediction" thus represent scores achieved by random guessing. Average results are presented together with standard deviations given in brackets.
| Evaluation criteria | prediction of TSH (± SD) | (random) baseline prediction (± SD) |
|---|---|---|
| RMSE Training | 0.63 (± 0.041) | 0.70 (± 0.004) |
| RMSE Test | 0.66 (± 0.003) | 0.72 (± 0.301) |
| R² Training | 0.23 (± 0.003) | 0.0001 (± 0.002) |
| R² Test | 0.15 (± 0.002) | 0.0004 (± 0.011) |
| MAE Training | 0.52 (± 0.002) | 0.52 (± 0.001) |
| MAE Test | 0.55 (± 0.003) | 0.62 (± 0.111) |
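The shuffled-target baseline and the three evaluation criteria can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's code: a one-variable least-squares line stands in for the RF, the data are synthetic, and a single train/test split replaces cross-validation. All names are hypothetical.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def scores(y_true, y_pred):
    """RMSE, MAE and R² of predictions against observed values."""
    n = len(y_true)
    resid = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in resid)
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in resid) / n,
        "r2": 1 - ss_res / ss_tot,
    }

def evaluate(xs, ys, n_train):
    """Fit on the first n_train points and score on the remaining ones."""
    slope, intercept = fit_line(xs[:n_train], ys[:n_train])
    preds = [slope * x + intercept for x in xs[n_train:]]
    return scores(ys[n_train:], preds)

# Hypothetical synthetic data with a genuine predictor-target relationship.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

# Baseline: shuffle the target so that any predictor-target link is destroyed.
shuffled = ys[:]
rng.shuffle(shuffled)

real = evaluate(xs, ys, n_train=150)
baseline = evaluate(xs, shuffled, n_train=150)
print(real)
print(baseline)
```

Because shuffling keeps the marginal distribution of the target but breaks its association with the predictors, the baseline R² is expected to sit near zero, as in the table above, and any gap between the two columns reflects signal the model actually learned.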
Fig 2. Inferred Bayesian network structure among the extracted relevant predictors and the TSH level.
The four hub nodes, sex, age, medication (taken during the last seven days), and hip circumference are colored in blue. Arcs originating from the hub nodes are plotted in light gray to make the network more readable. The TSH level is colored in dark red, thyroid-related examinations in red. Yellow nodes refer to metabolic factors, green nodes to hematological and hemostasis factors, and grey nodes to socioeconomic parameters. Antibody titer against toxoplasmosis is presented in orange. Further information on the features can be found in S1 Table. The completed partially directed acyclic graph is shown.
RF prediction results for different feature subgroups.
Columns refer to models built based on different feature subgroups. The first two rows show the respective RF hyperparameters. The remaining six rows contain the prediction metrics achieved by the models. Average results are stated with standard deviations given in brackets.
| Model | (random) baseline prediction (± SD) | All Features (± SD) | metabolism [yellow nodes] (± SD) | socioeconomic status [grey nodes] (± SD) | hematological factors [green nodes] (± SD) |
|---|---|---|---|---|---|
| RMSE Training | 0.703 (± 0.003) | 0.632 (± 0.003) | 0.697 (± 0.003) | 0.702 (± 0.003) | 0.702 (± 0.003) |
| RMSE Test | 0.719 (± 0.029) | 0.662 (± 0.032) | 0.703 (± 0.003) | 0.704 (± 0.032) | 0.705 (± 0.032) |
| MAE Training | 0.599 (± 0.104) | 0.515 (± 0.11) | 0.592 (± 0.105) | 0.598 (± 0.104) | 0.598 (± 0.104) |
| MAE Test | 0.618 (± 0.106) | 0.551 (± 0.11) | 0.599 (± 0.111) | 0.601 (± 0.111) | 0.601 (± 0.111) |
| R² Training | 0.045 (± 0.008) | 0.229 (± 0.0035) | 0.061 (± 0.002) | 0.046 (± 0.002) | 0.046 (± 0.002) |
| R² Test | -0.0004 (± 0.002) | 0.149 (± 0.023) | 0.042 (± 0.021) | 0.037 (± 0.015) | 0.037 (± 0.015) |
Properties of three Bayesian network structures used for the simulation study.
| network | nodes | arcs | parameters | additional noise variables |
|---|---|---|---|---|
| small | 5 | 5 | 15 | 5 |
| medium | 10 | 23 | 43 | 10 |
| large | 20 | 26 | 66 | 20 |