Norbert Krautenbacher, Nicolai Flach, Andreas Böck, Kristina Laubhahn, Michael Laimighofer, Fabian J Theis, Donna P Ankerst, Christiane Fuchs, Bianca Schaub.
Abstract
BACKGROUND: Associations between childhood asthma phenotypes and genetic, immunological, and environmental factors have been previously established. Yet, strategies to integrate high-dimensional risk factors from multiple distinct data sets, and thereby increase the statistical power of analyses, have been hampered by a preponderance of missing data and lack of methods to accommodate them.
Keywords: childhood asthma; complex study design; immunology; machine learning; risk prediction
Year: 2019 PMID: 30737985 PMCID: PMC6767756 DOI: 10.1111/all.13745
Source DB: PubMed Journal: Allergy ISSN: 0105-4538 Impact factor: 13.146
Figure 1. Structure of the given data after imputation within each modality. Blue areas depict observed data values; white areas correspond to missing data. The data consist of seven groups of variables of the same type (modalities). Only a few subjects have data for all modalities. The microarray gene expression data are the limiting component with respect to complete cases and contain the most variables (reduced in the figure for illustration).
Figure 2. Schematic illustration of the data partitions used for prediction modeling under each strategy. A, All observations per modality were included, but training and validation were done separately for each block. B, Only complete observations were used, and classifiers were trained on all modalities at once. C, All modalities and all observations were incorporated in a single prediction model, which was validated on complete observations.
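The three partitions in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' code: the function name `partition_subjects`, the modality names, and the subject IDs are hypothetical, and the sketch only shows which subjects each strategy would draw on, not how the models were fitted or cross-validated.

```python
def partition_subjects(modalities):
    """Return the subject sets used by Strategies A, B, and C.

    modalities: dict mapping a modality name to the set of subject IDs
    that have data for that modality (hypothetical input layout).
    """
    # Strategy A: each modality is modeled on its own subjects.
    strategy_a = {name: sorted(ids) for name, ids in modalities.items()}

    # Complete cases: subjects with data in every modality.
    complete = set.intersection(*modalities.values())
    # All observations: subjects with data in at least one modality.
    everyone = set.union(*modalities.values())

    # Strategy B: train and validate on complete cases only.
    strategy_b = sorted(complete)

    # Strategy C: use all observations for the model,
    # but validate on complete cases.
    strategy_c = {"train": sorted(everyone), "validate": sorted(complete)}
    return strategy_a, strategy_b, strategy_c


# Toy example with made-up modality names and subject IDs.
toy_modalities = {
    "genotype": {"s1", "s2", "s3", "s4"},
    "questionnaire": {"s1", "s2", "s3"},
    "gene_expression": {"s2", "s3"},
}
toy_a, toy_b, toy_c = partition_subjects(toy_modalities)
```

In this toy setup the smallest modality determines the complete cases, which mirrors the Figure 1 caption's remark that gene expression is the limiting component.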
Figure 3. Comparison of prediction performance across modalities, statistical methods, and strategies. A, Performance of prediction models on each modality analyzed separately (Strategy A). B, Performance for complete case model (Strategy B). C, Performance of combination strategy (Strategy C).
Figure 4. Performance of prediction models on the 33 complete cases (Strategy B). The procedure was run twice: once with the modified model, in which the gene variables contained only annotated genes (left), and once with the original model, which additionally included nonannotated genes (right). The AUCs are calculated as the average over the 5 imputations; the error bars show 95% bootstrap confidence intervals.
Figure 5. Sensitivities and specificities shown as ROC curves for the two best‐performing prediction models, LASSO and boosting, on the 33 complete cases (Strategy B), when all variables were used but nonannotated genes were excluded. ROC curves were calculated separately (aggregated over all 5 imputations) for (A) healthy controls (HC) vs all others, (B) allergic asthmatics (AA) vs all others, and (C) nonallergic asthmatics (NA) vs all others. The overall AUC of 0.77 for both prediction models is a weighted average over the three single AUC comparisons. The weights correspond to the proportions of HC (0.36), AA (0.39), and NA (0.24), respectively.
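The weighted average described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, toy labels, and toy scores are hypothetical. Each one-vs-rest AUC is computed with the standard pairwise (Mann-Whitney) formulation, and the weights are taken as the class proportions, as the caption states.

```python
from collections import Counter


def binary_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: list of 0/1 values; scores: list of predicted scores.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def weighted_ovr_auc(y_true, class_scores):
    """Average the one-vs-rest AUCs, weighted by class proportion.

    y_true: list of class labels, one per subject.
    class_scores: dict mapping each class to its per-subject scores.
    """
    counts = Counter(y_true)
    n = len(y_true)
    total = 0.0
    for cls, scores in class_scores.items():
        one_vs_rest = [1 if y == cls else 0 for y in y_true]
        total += counts[cls] / n * binary_auc(one_vs_rest, scores)
    return total


# Toy data (made up; not taken from the study).
toy_labels = ["HC", "HC", "AA", "AA", "NA"]
toy_scores = {
    "HC": [0.8, 0.2, 0.3, 0.1, 0.4],
    "AA": [0.1, 0.3, 0.6, 0.7, 0.2],
    "NA": [0.1, 0.1, 0.1, 0.2, 0.6],
}
toy_overall_auc = weighted_ovr_auc(toy_labels, toy_scores)
```

Because the weights are class proportions, the largest group contributes most to the overall figure; with the caption's weights, AA (0.39) and HC (0.36) together dominate the reported average.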
Figure 6. Variable importance for the best models on complete observations. Genes are denoted by their names with the type of stimulation in parentheses. A, Boosting variable importance: variables ranked among the top 50 by boosting in the complete case model, averaged over all five imputations. B, LASSO‐selected variables: variables selected by LASSO in the complete case model over all five imputations. C, Venn diagram/pie charts for the set of variables ranked highest by boosting (50 variables) and the set of variables selected by LASSO (19 variables). Three variables (genes) were selected in both prediction models.