Jason D Cooper, Sung Yeon Sarah Han, Jakub Tomasik, Sureyya Ozcan, Nitin Rustogi, Nico J M van Beveren, F Markus Leweke, Sabine Bahn.
Abstract
In the present study, to improve the predictive performance of a model and its reproducibility when applied to an independent data set, we investigated the use of multimodel inference to predict the probability of having a complex psychiatric disorder. We formed training and test sets using proteomic data (147 peptides from 77 proteins) from two independent collections of first-onset drug-naive schizophrenia patients and controls. A set of prediction models was produced by applying lasso regression with repeated tenfold cross-validation to the training set. We used feature extraction and model averaging across the set of models to form two prediction models. The resulting models clearly demonstrated the utility of a multimodel-based approach to make good (training set AUC > 0.80) and reproducible predictions (test set AUC > 0.80) for the probability of having schizophrenia. Moreover, we identified four proteins (five peptides) whose effect on the probability of having schizophrenia was modified by sex, one of which was a novel potential biomarker of schizophrenia, foetal haemoglobin. The evidence of effect modification suggests that future schizophrenia studies should be conducted in males and females separately. Future biomarker studies should consider adopting a multimodel approach and going beyond the main effects of features.
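The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It assumes the repeated cross-validated lasso fits have already been run and that each fit is summarized by the set of features it kept; the function names and the toy feature labels are hypothetical and not taken from the study.

```python
from collections import Counter


def inclusion_fractions(selected_sets):
    """Fraction of fitted models in which each feature was selected."""
    n_models = len(selected_sets)
    if n_models == 0:
        raise ValueError("need at least one fitted model")
    # Count each feature at most once per model.
    counts = Counter(f for s in selected_sets for f in set(s))
    return {f: c / n_models for f, c in counts.items()}


def extract_features(fractions, threshold=0.80):
    """Keep features whose inclusion fraction strictly exceeds the threshold."""
    return sorted(f for f, v in fractions.items() if v > threshold)


# Hypothetical toy input: features kept by five lasso fits (placeholder names).
toy_models = [
    {"pep_A", "pep_B", "pep_C"},
    {"pep_A", "pep_B"},
    {"pep_A", "pep_B", "pep_D"},
    {"pep_A", "pep_B"},
    {"pep_A", "pep_C"},
]
fractions = inclusion_fractions(toy_models)
selected = extract_features(fractions, threshold=0.80)
```

The strict inequality mirrors the ">0.80 inclusion fraction" wording used in the tables; whether the authors treated the boundary this way is an assumption.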
Year: 2019 PMID: 30745560 PMCID: PMC6370882 DOI: 10.1038/s41398-019-0419-4
Source DB: PubMed Journal: Transl Psychiatry ISSN: 2158-3188 Impact factor: 6.222
A summary of predictive performance in the training and independent test sets of the prediction models with averaged coefficients
| | Training set (Cologne) | Independent test set (Rotterdam) |
|---|---|---|
| Schizophrenia patients | 60 | 9 |
| Controls | 77 | 12 |

| Model | Number of features | Training set AUC | Independent test set AUC |
|---|---|---|---|
| (a) | | | |
| Model with >0.80 inclusion fraction | 6 | 0.807 | Males only 0.880 |
| Model with relative feature importance >0.90 | 8 | 0.821 | Males only 0.917 |
| (b) | | | |
| Model with >0.80 inclusion fraction | 17 with 5 first-order interactions with sex | 0.858 | Males only 0.889 |
| Model with relative feature importance >0.90 | 13 with 3 first-order interactions with sex | 0.854 | Males only 0.815 |
Model selection (a) did not consider first-order interactions with sex; model selection (b) allowed for first-order interactions with sex.
Fig. 1 A summary of the 100 models selected using lasso regression with repeated cross-validation.
(a) A bar chart summarizing the AUCs for each model. AUC: Fail 0.5–0.6; Poor 0.6–0.7; Fair 0.7–0.8; Good 0.8–0.9; Excellent 0.9–1.0. (b) A bar chart summarizing the number of features selected in each model. (c) An inclusion fraction plot summarizing the proportion of times each feature was selected in a model. One hundred and seventeen features out of 149 (147 peptides, sex, and age) were not selected in the 100 models. (d) A plot of inclusion frequencies and probabilities for the 32 selected features.
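The AUC bands quoted in the caption can be applied programmatically. The sketch below computes an AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control (ties counted as one half) and maps it to the caption's labels. The boundary handling (lower bound inclusive, values below 0.6 labelled "Fail") is an assumption, as are the function names and the toy scores.

```python
def auc(scores_cases, scores_controls):
    """Pairwise-comparison AUC; ties between a case and a control count 0.5."""
    if not scores_cases or not scores_controls:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for c in scores_cases:
        for k in scores_controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))


def auc_band(value):
    """Map an AUC to the caption's labels (boundary handling is assumed)."""
    if value >= 0.9:
        return "Excellent"
    if value >= 0.8:
        return "Good"
    if value >= 0.7:
        return "Fair"
    if value >= 0.6:
        return "Poor"
    return "Fail"  # assumption: anything below 0.6 is treated as "Fail"


# Hypothetical predicted probabilities, not data from the study.
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
toy_label = auc_band(toy_auc)
```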
A summary of the model averaged coefficients for the two models, the first consisting of six features with an inclusion fraction >0.8 and the second consisting of eight features with an inclusion probability (relative feature importance) >0.9
| Protein | Peptide | Inclusion fraction | Inclusion probability | Mean coefficient | Weighted mean coefficient | Model |
|---|---|---|---|---|---|---|
| (Intercept) | | – | – | 0.792 | 1.027 | 1, 2 |
| APOA4 | IDQNVEELK | 1.00 | 1.000 | −0.238 | −0.320 | 1, 2 |
| APOC3 | GWVTDGFSSLK | 1.00 | 1.000 | −0.287 | −0.334 | 1, 2 |
| HPT | VTSIQDWVQK | 1.00 | 0.998 | 0.266 | 0.287 | 1, 2 |
| IC1 | TNLESILSYPK | 0.93 | 0.996 | 0.141 | 0.196 | 1, 2 |
| APOA2 | SPELQAEAK | 0.89 | 1.000 | −0.248 | −0.390 | 1, 2 |
| ITIH4 | GPDVLTATVSGK | 0.83 | 0.992 | 0.0932 | 0.153 | 1, 2 |
| ANT3 | LPGIVAEGR | 0.67 | 0.965 | 0.0382 | 0.0775 | 2 |
| APOH | EHSSLAFWK | 0.59 | 0.949 | 0.0320 | 0.0622 | 2 |
The mean coefficient is the mean of the coefficients for the feature of interest based on all of the models. The weights used for the weighted mean coefficient are the model probabilities (Akaike weights)
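To make the four summary columns concrete, here is a minimal sketch of model averaging with Akaike weights. It assumes each model has an AIC value, that the weight of model i is exp(−Δ_i/2) normalized over all models (Δ_i being its AIC minus the smallest AIC), and that a feature absent from a model contributes a coefficient of zero. How the authors obtained the AIC for each lasso fit is not stated here, and the function names and toy numbers are hypothetical.

```python
import math


def akaike_weights(aic_values):
    """Model probabilities from AIC values: exp(-delta/2), normalized to sum to 1."""
    best = min(aic_values)
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]


def model_average(coefs_per_model, weights):
    """Summarize each feature across models; a missing feature counts as 0."""
    features = sorted({f for m in coefs_per_model for f in m})
    n = len(coefs_per_model)
    summary = {}
    for f in features:
        values = [m.get(f, 0.0) for m in coefs_per_model]
        summary[f] = {
            # share of models that selected the feature
            "inclusion_fraction": sum(v != 0.0 for v in values) / n,
            # summed weight of the models that selected the feature
            "inclusion_probability": sum(
                w for v, w in zip(values, weights) if v != 0.0
            ),
            "mean_coefficient": sum(values) / n,
            "weighted_mean_coefficient": sum(
                v * w for v, w in zip(values, weights)
            ),
        }
    return summary


# Hypothetical toy models (coefficients and AICs are made up for illustration).
toy_aic = [100.0, 102.0, 104.0]
toy_coefs = [{"x": -0.3, "y": 0.1}, {"x": -0.2}, {"x": -0.1, "y": 0.3}]
toy_weights = akaike_weights(toy_aic)
toy_summary = model_average(toy_coefs, toy_weights)
```

Under these assumptions a feature can have a modest inclusion fraction yet a high inclusion probability when the few models that contain it carry most of the weight, which is the pattern seen for ANT3 and APOH in the table.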
Fig. 2 A summary of the 100 models selected using glinternet with repeated cross-validation.
(a) A bar chart summarizing the AUCs for each model. AUC: Fail 0.5–0.6; Poor 0.6–0.7; Fair 0.7–0.8; Good 0.8–0.9; Excellent 0.9–1.0. (b) A bar chart summarizing the number of features selected in each model. (c) A bar chart summarizing the number of first-order interactions with sex selected in each model. (d) An inclusion fraction plot summarizing the proportion of times each feature was selected in a model. (e) An inclusion fraction plot summarizing the proportion of times each first-order interaction with sex was selected in a model. One hundred and twenty-six features out of 149 (147 peptides, sex, and age) were not selected in the 100 models. (f) A plot of inclusion frequencies and probabilities for the selected features and interactions.
A summary of the model averaged coefficients for the two models, the first consisting of 17 features and five interactions with an inclusion fraction of >0.8 and the second consisting of 13 features and three interactions with an inclusion probability (relative feature importance) >0.9
| UniProt accession number | Protein | Peptide | Inclusion fraction | Inclusion probability | Mean coefficient | Weighted mean coefficient | Model |
|---|---|---|---|---|---|---|---|
| Main effects | | | | | | | |
| (Intercept) | | | – | – | 1.5400 | 1.1300 | 1, 2 |
| Female | | | 1.00 | 1.000 | 0.0356 | 0.01180 | 1, 2 |
| | | | 1.00 | 1.000 | 0.2800 | 0.28100 | 1, 2 |
| | | | 1.00 | 1.000 | 0.1720 | 0.15700 | 1, 2 |
| | | | 1.00 | 1.000 | 0.2390 | 0.17600 | 1, 2 |
| P04278 | SHBG | IALGGLLFPASNLR | 1.00 | 1.000 | 0.1260 | 0.08490 | 1, 2 |
| | | | 1.00 | 1.000 | 0.0540 | 0.04700 | 1, 2 |
| | | | 1.00 | 1.000 | −0.4380 | −0.35800 | 1, 2 |
| | | | 1.00 | 1.000 | −0.4580 | −0.38300 | 1, 2 |
| | | | 1.00 | 1.000 | −0.3240 | −0.33900 | 1, 2 |
| P02656 | APOC3 | DALSSVQESQVAQQAR | 1.00 | 1.000 | −0.0998 | −0.04730 | 1, 2 |
| P02649 | APOE | LEEQAQQIR | 1.00 | 1.000 | −0.1920 | −0.12900 | 1, 2 |
| | | | 1.00 | 1.000 | 0.1990 | 0.12900 | 1, 2 |
| P08697 | A2AP | DFLQSLK | 0.98 | 0.999 | −0.0269 | −0.03220 | 1, 2 |
| O75636 | FCN3 | YGIDWASGR | 0.98 | 0.248 | 0.0552 | 0.01090 | 1 |
| P02765 | FETUA | HTLNQIDEVK | 0.98 | 0.248 | −0.1120 | −0.02180 | 1 |
| P69905 | HBA | MFLSFPTTK | 0.98 | 0.248 | −0.0120 | −0.00228 | 1 |
| P69891 | HBG1 | MVTAVASALSSR | 0.98 | 0.248 | −0.0101 | −0.00200 | 1 |
| Interactions with sex | | | | | | | |
| P04278 | SHBG | IALGGLLFPASNLR | 1.00 | 1.000 | −0.3180 | −0.21100 | 1, 2 |
| P02649 | APOE | LEEQAQQIR | 1.00 | 1.000 | 0.5510 | 0.37700 | 1, 2 |
| P08697 | A2AP | DFLQSLK | 0.98 | 0.999 | 0.0805 | 0.09690 | 1, 2 |
| P69905 | HBA | MFLSFPTTK | 0.98 | 0.248 | 0.0514 | 0.00930 | 1 |
| P69891 | HBG1 | MVTAVASALSSR | 0.98 | 0.248 | 0.0425 | 0.00817 | 1 |
The mean coefficient is the mean of the coefficients for the feature of interest based on all of the models. The weights used for the weighted mean coefficient are the model probabilities (Akaike weights). The eight features selected in the earlier analysis (Table 1) are shown in bold. HBA and HBG1 are haemoglobin subunits alpha and gamma-1, respectively.
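One way to read effect modification by sex from such a table is to combine each main-effect coefficient with its interaction coefficient. The sketch below assumes a dummy-coded sex variable (male as the reference level, Female = 1) and product-term interactions, so the female-specific coefficient is the main effect plus the interaction. This coding is an assumption for illustration; glinternet may parameterize categorical interactions differently, and the function name is hypothetical. The numbers are the weighted mean coefficients for SHBG, APOE and A2AP from the table above.

```python
def sex_specific_effects(main_effects, interactions_with_sex):
    """Per-feature log-odds coefficients by sex under dummy coding (Female = 1).

    Assumption: male is the reference level, so the male coefficient is the
    main effect and the female coefficient adds the interaction term.
    """
    effects = {}
    for feature, beta in main_effects.items():
        delta = interactions_with_sex.get(feature, 0.0)
        effects[feature] = {"male": beta, "female": beta + delta}
    return effects


# Weighted mean coefficients copied from the table above.
main_effects = {"SHBG": 0.0849, "APOE": -0.129, "A2AP": -0.0322}
interactions = {"SHBG": -0.211, "APOE": 0.377, "A2AP": 0.0969}
by_sex = sex_specific_effects(main_effects, interactions)
```

Under this reading the sign of the coefficient differs between males and females for all three peptides, which is the kind of pattern that motivates analysing the sexes separately.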