Yuta Takahashi, Masao Ueki, Gen Tamiya, Soichi Ogishima, Kengo Kinoshita, Atsushi Hozawa, Naoko Minegishi, Fuji Nagami, Kentaro Fukumoto, Kotaro Otsuka, Kozo Tanno, Kiyomi Sakata, Atsushi Shimizu, Makoto Sasaki, Kenji Sobue, Shigeo Kure, Masayuki Yamamoto, Hiroaki Tomita.
Abstract
The accuracy of previous genetic studies in predicting polygenic psychiatric phenotypes has been limited, mainly because such studies lack the power to distinguish truly susceptible variants from null variants, which results in overfitting. A novel prediction algorithm, Smooth-Threshold Multivariate Genetic Prediction (STMGP), was applied to improve the genome-based prediction of psychiatric phenotypes; it decreases overfitting by selecting variants and building a penalized regression model. Prediction models were trained using a cohort of 3685 subjects in Miyagi prefecture and validated with an independently recruited cohort of 3048 subjects in Iwate prefecture, Japan. Genotyping was performed using HumanOmniExpressExome BeadChip Arrays. The target phenotypes were depressive symptoms and simulated phenotypes with varying complexity and various effect-size distributions of risk alleles. The prediction accuracy and the degree of overfitting of STMGP were compared with those of state-of-the-art models (polygenic risk scores, genomic best linear-unbiased prediction, summary-data-based best linear-unbiased prediction, BayesR, and ridge regression). In the prediction of depressive symptoms, STMGP showed the highest prediction accuracy and the lowest degree of overfitting among the models, although the differences in prediction accuracy were not statistically significant. Simulation studies suggested that STMGP has better prediction accuracy for moderately polygenic phenotypes. Our investigations suggest the potential usefulness of STMGP for predicting polygenic psychiatric conditions while avoiding overfitting.
Year: 2020 PMID: 32826857 PMCID: PMC7442807 DOI: 10.1038/s41398-020-00957-5
Source DB: PubMed Journal: Transl Psychiatry ISSN: 2158-3188 Impact factor: 6.222
Fig. 1 The concept of genetic architecture and predictive models for polygenic diseases.
a The distribution of P values in GWAS for polygenic disease models in training and test datasets. To depict the concept of genetic architecture and predictive models for polygenic disease, the figures show the simulated distribution of variants analyzed in GWAS for a certain target phenotype. The Y axis indicates the negative logarithm (−log) of P values, and the X axis indicates the logarithm (log) of the number of variants. While the P values of variants with true susceptibility to the disease of interest (depicted in orange and yellow) tend to be small, some of them can be large due to insufficient power. Likewise, while the majority of the P values of null variants (variants with no effect on the susceptibility to the disease, depicted in blue) tend to be large, some of them can be small by chance because of the large number of statistical tests. The variants with true susceptibility to the disease can be divided into a set of variants that are independent of each other (depicted in orange) and a set of remaining variants that depend on the former through linkage disequilibrium (depicted in yellow). True susceptibility variants increase prediction accuracy, whereas null variants included in the prediction model decrease it, because associations between the null variants and the target phenotype are not replicated in the validation cohort; this is referred to as overfitting. Distinguishing true susceptibility variants from null variants in a single GWAS is difficult at currently available sample sizes.
b Concepts of PRS. PRS intends to select variants with true susceptibility and avoid the influence of null variants by setting a cutoff on the GWAS P values; however, prediction accuracy suffers because the model (i) still includes a large number of null variants and overestimates their effects, and (ii) applies clumping, which excludes correlated true susceptibility variants that could contribute to prediction accuracy.
c Concepts of GBLUP. GBLUP utilizes true susceptibility variants correlated with each other for better prediction accuracy; however, the model includes a large number of null variants, which decreases prediction accuracy through overfitting.
d Concepts of STMGP. STMGP decreases overfitting by weighting selected variants to reduce the overestimation of null variants, utilizes correlated true susceptibility variants effectively by building a generalized ridge regression, and sets an optimal cutoff for the P value at low computational cost by avoiding CV.
GWAS genome-wide association study, PRS polygenic risk score, CV cross-validation, GBLUP genomic best linear-unbiased prediction, STMGP Smooth-Threshold Multivariate Genetic Prediction.
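The overfitting mechanism described in the caption can be illustrated with a small simulation. The sketch below is not the STMGP or PRS implementation used in the study; it is a toy P-value-threshold score on simulated, independent variants, and the cohort size, variant count, effect size, allele frequency, and cutoff are all illustrative assumptions. It shows how null variants that pass the cutoff by chance inflate the training correlation without carrying over to an independent cohort.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def simulate_cohort(n, m, effects, rng, maf=0.3):
    """Genotypes coded 0/1/2; phenotype = effects of a few true variants + noise."""
    geno = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(m)]
            for _ in range(n)]
    pheno = [sum(b * row[j] for j, b in effects.items()) + rng.gauss(0, 1)
             for row in geno]
    return geno, pheno


def fit_threshold_score(geno, pheno, cutoff):
    """Keep variants whose marginal P value is below the cutoff."""
    n, m = len(geno), len(geno[0])
    sd_y = statistics.pstdev(pheno)
    weights = {}
    for j in range(m):
        col = [row[j] for row in geno]
        r = pearson(col, pheno)
        # Two-sided P value from a normal approximation (an assumption).
        p = math.erfc(abs(r) * math.sqrt(n) / math.sqrt(2))
        if p < cutoff:
            weights[j] = r * sd_y / statistics.pstdev(col)  # marginal slope
    return weights


def score(geno, weights):
    """Weighted allele count over the selected variants."""
    return [sum(w * row[j] for j, w in weights.items()) for row in geno]


rng = random.Random(1)
true_effects = {j: 0.16 for j in range(5)}  # 5 true variants out of 400
train_g, train_y = simulate_cohort(300, 400, true_effects, rng)
test_g, test_y = simulate_cohort(300, 400, true_effects, rng)

weights = fit_threshold_score(train_g, train_y, cutoff=0.05)
train_r = pearson(score(train_g, weights), train_y)
test_r = pearson(score(test_g, weights), test_y)
print(f"selected={len(weights)} train_r={train_r:.3f} test_r={test_r:.3f}")
```

Most of the selected variants in such a run are null variants, so the gap between the training and test correlations is a direct measure of the overfitting that the caption describes.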
Demographics of the members of the discovery and validation datasets.
| Characteristic | Discovery dataset | Validation dataset | P value^a |
|---|---|---|---|
| Subjects | 3685 | 3048 | |
| Percent of females | 70.1% | 65.3% | 3.31 × 10^−5 |
| CES-D, mean (SD) | 13.6 (7.2) | 13.4 (6.9) | 0.226 |
| Age, mean (SD) | 58.5 (12.1) | 62.0 (10.1) | 1.35 × 10^−38 |
| Educational background | | | 6.54 × 10^−37 |
| Elementary/junior high school | 640 (17.4%) | 946 (31.0%) | |
| High school | 1852 (50.3%) | 1260 (41.3%) | |
| Junior college | 903 (24.5%) | 649 (21.3%) | |
| College | 279 (7.6%) | 187 (6.1%) | |
| Graduate school | 11 (0.3%) | 6 (0.2%) | |
| House damage from the 2011 Great East Japan Earthquake and Tsunami | | | 1.09 × 10^−278 |
| Total collapse | 561 (15.2%) | 218 (7.2%) | |
| Large-scale damage | 248 (6.7%) | 61 (2.0%) | |
| Half-scale damage | 302 (8.2%) | 75 (2.5%) | |
| Small-scale damage | 1534 (41.6%) | 522 (17.1%) | |
| No damage | 1040 (28.2%) | 2172 (71.3%) | |
| Previous psychiatric history | | | |
| Depression | 104 (2.8%) | 81 (2.7%) | 0.708 |
| Bipolar disorder | 9 (0.2%) | 6 (0.2%) | 0.798 |
| Family history^b | | | |
| Depression | 203 (5.5%) | 167 (5.5%) | 1.00 |
| Bipolar disorder | 27 (0.7%) | 26 (0.9%) | 0.583 |
| The gap time between the 2011 Great East Japan Earthquake and measurement of CES-D (months), mean (SD) | 28.5 (2.0) | 30.8 (1.3) | 9.88 × 10^−324 |
| Prefectures | Miyagi, Japan | Iwate, Japan | |
CES-D Center for Epidemiologic Studies-Depression Scale, SD standard deviation, GEJE Great East Japan Earthquake.
^a P values were calculated using Student’s t tests for CES-D, age, and the time gap between the 2011 Great East Japan Earthquake and measurement of CES-D, and Fisher’s exact tests for the percentage of females, educational background, house damage from the 2011 Great East Japan Earthquake and Tsunami, previous psychiatric history, and family history.
^b Family history refers to the previous psychiatric history of first-degree relatives (i.e., parents, siblings, or children).
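The two tests named in the footnote can be sketched from the rounded summary values in the table. This is an illustration only: the authors' software is not stated here, the t test below uses a pooled variance with a normal approximation to the t distribution (reasonable only because the degrees of freedom are in the thousands), and the inputs are rounded table entries, so the results should approximate rather than reproduce the published P values.

```python
import math


def t_test_from_summary(m1, s1, n1, m2, s2, n2):
    """Pooled-variance two-sample t statistic with a two-sided P value.

    The P value uses a normal approximation to the t distribution (assumption).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (m1 - m2) / se
    return t, math.erfc(abs(t) / math.sqrt(2))


def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def log_prob(k):
        # Hypergeometric probability of k events in row 1, margins fixed.
        return (log_choose(row1, k) + log_choose(row2, col1 - k)
                - log_choose(row1 + row2, col1))

    observed = log_prob(a)
    total = 0.0
    for k in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_prob(k)
        if lp <= observed + 1e-9:  # tables as or less probable than observed
            total += math.exp(lp)
    return min(1.0, total)


# CES-D, mean (SD): 13.6 (7.2) in 3685 subjects vs 13.4 (6.9) in 3048 subjects.
t_cesd, p_cesd = t_test_from_summary(13.6, 7.2, 3685, 13.4, 6.9, 3048)

# Previous depression: 104 of 3685 vs 81 of 3048.
p_depression = fisher_exact_two_sided(104, 3685 - 104, 81, 3048 - 81)

print(f"CES-D: t={t_cesd:.2f}, P={p_cesd:.3f}")
print(f"Previous depression: P={p_depression:.3f}")
```

Both comparisons should come out non-significant, in line with the table, with small deviations expected from the rounding of the inputs.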
Prediction accuracy for depressive states.
| Prediction model | Partial correlation in the independent validation dataset (SE) | P value | Partial correlation in the training dataset (SE) | Number of variants included in the prediction model |
|---|---|---|---|---|
| STMGP | 0.0530 (0.0180) | 3.424 × 10^−3 | 0.3230 (0.0151) | 102 |
| PRS | 0.0247 (0.0178) | 0.1724 | 0.9025 (0.0076) | 13,421 |
| GBLUP | 0.0211 (0.0178) | 0.2431 | 0.9623 (0.0017) | 601,239 |
| SBLUP | 0.0134 (0.0178) | 0.3663 | 0.9554 (0.0019) | 599,149 |
| BayesR | 0.0190 (0.0185) | 0.2871 | 0.9633 (0.0015) | 615,386 |
| Ridge | 0.0160 (0.0178) | 0.4321 | 0.9998 (0.0000) | 30,333 |
PCC predictive correlation coefficient, SE standard error, STMGP Smooth-Threshold Multivariate Genetic Prediction, PRS polygenic risk scores, GBLUP genomic best linear-unbiased prediction, SBLUP summary-data-based best linear-unbiased prediction, SNP single-nucleotide polymorphism, PC principal component.
Partial correlations were adjusted for covariates such as sex, age, and PC1 to PC26.
Because ridge regression on the raw SNP data was difficult to run in our environment owing to the substantial computational cost, the genome data were clumped into approximately 30,000 SNPs for these analyses, in a manner similar to a previous study [51].
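One way to read the table is to compare the training and validation correlations for each model. The sketch below uses only the values reported above; the "overfitting gap" (training minus validation correlation) and the z-based P value are illustrative summaries introduced here, not necessarily the measures or tests used by the authors, and the P value relies on a normal approximation.

```python
import math

# (validation r, validation SE, training r), copied from the table above.
results = {
    "STMGP":  (0.0530, 0.0180, 0.3230),
    "PRS":    (0.0247, 0.0178, 0.9025),
    "GBLUP":  (0.0211, 0.0178, 0.9623),
    "SBLUP":  (0.0134, 0.0178, 0.9554),
    "BayesR": (0.0190, 0.0185, 0.9633),
    "Ridge":  (0.0160, 0.0178, 0.9998),
}


def summarize(valid_r, valid_se, train_r):
    """Return (overfitting gap, z score, approximate two-sided P value)."""
    gap = train_r - valid_r
    z = valid_r / valid_se
    p = math.erfc(abs(z) / math.sqrt(2))  # normal approximation (assumption)
    return gap, z, p


summary = {model: summarize(*values) for model, values in results.items()}
for model, (gap, z, p) in summary.items():
    print(f"{model:7s} gap={gap:.4f} z={z:.2f} approx P={p:.4f}")

smallest_gap_model = min(summary, key=lambda name: summary[name][0])
```

Under this reading, STMGP has by far the smallest gap (about 0.27, versus roughly 0.88 to 0.98 for the other models) and is the only model whose validation correlation is more than two standard errors from zero.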
Prediction accuracy in simulation studies in which the phenotype is associated with SNPs only (heritability = 0.05).
Columns are grouped by the number of true susceptibility variants (100, 200, 500, 2000, or 5000).

| Distribution of the true SNP effects | Prediction model | 100 variants: mean (SE) PCC | 100 variants: power^a | 200 variants: mean (SE) PCC | 200 variants: power^a | 500 variants: mean (SE) PCC | 500 variants: power^a | 2000 variants: mean (SE) PCC | 2000 variants: power^a | 5000 variants: mean (SE) PCC | 5000 variants: power^a |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Laplace distribution | STMGP | 0.0594 (0.0243) | 0.85 | 0.0440 (0.0311) | 0.65 | −0.0044 (0.0215) | 0.15 | 0.0143 (0.1939) | 0.15 | −0.0028 (0.0135) | 0.00 |
| | PRS | 0.0089 (0.0240) | 0.15 | 0.0094 (0.0187) | 0.10 | −0.0070 (0.0208) | 0.15 | 0.0059 (0.0173) | 0.05 | −0.0017 (0.0133) | 0.05 |
| | GBLUP | 0.0118 (0.0159) | 0.05 | 0.0080 (0.0155) | 0.00 | 0.0067 (0.0210) | 0.15 | 0.0149 (0.0013) | 0.05 | 0.0160 (0.0142) | 0.10 |
| | SBLUP | 0.0048 (0.0140) | 0.00 | 0.0100 (0.0137) | 0.00 | 0.0083 (0.0198) | 0.10 | 0.0142 (0.0129) | 0.05 | 0.0111 (0.0193) | 0.10 |
| | BayesR | 0.0391 (0.0494) | 0.65 | 0.0264 (0.0273) | 0.30 | 0.0073 (0.0234) | 0.15 | 0.0144 (0.0142) | 0.10 | 0.0109 (0.0176) | 0.10 |
| | Ridge | 0.0052 (0.0146) | 0.05 | 0.0049 (0.0155) | 0.00 | 0.0104 (0.0216) | 0.15 | 0.0085 (0.0132) | 0.05 | 0.0072 (0.0172) | 0.00 |
| Normal distribution | STMGP | 0.0475 (0.0238) | 0.70 | 0.0140 (0.0170) | 0.10 | 0.0082 (0.0197) | 0.15 | 0.0112 (0.0071) | 0.10 | 0.0040 (0.0175) | 0.05 |
| | PRS | 0.0028 (0.0207) | 0.05 | 0.0054 (0.0191) | 0.15 | 0.0017 (0.0185) | 0.05 | −0.0011 (0.0189) | 0.05 | 0.0031 (0.0146) | 0.10 |
| | GBLUP | 0.0120 (0.0135) | 0.05 | 0.0103 (0.0171) | 0.05 | 0.0133 (0.0147) | 0.10 | 0.0127 (0.0199) | 0.10 | 0.0130 (0.0154) | 0.10 |
| | SBLUP | 0.0117 (0.0177) | 0.15 | 0.0109 (0.0167) | 0.05 | 0.0057 (0.0145) | 0.10 | 0.0068 (0.0155) | 0.00 | 0.0116 (0.0127) | 0.05 |
| | BayesR | 0.0239 (0.0271) | 0.35 | 0.0147 (0.0185) | 0.10 | 0.0073 (0.0168) | 0.10 | 0.0108 (0.0194) | 0.05 | 0.0092 (0.0133) | 0.05 |
| | Ridge | 0.0144 (0.0162) | 0.20 | 0.0135 (0.0170) | 0.05 | 0.0083 (0.0185) | 0.00 | 0.0100 (0.0197) | 0.05 | 0.0075 (0.0187) | 0.00 |
PCC predictive correlation coefficient, SE standard error, STMGP Smooth-Threshold Multivariate Genetic Prediction, PRS polygenic risk scores, GBLUP genomic best linear-unbiased prediction, SBLUP summary-data-based best linear-unbiased prediction, NEG normal–exponential–gamma.
^a Power is the proportion of replicates achieving a significant prediction at P value < 0.05.
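A phenotype of the kind summarized in this table (associated with SNPs only, with a fixed heritability and a chosen effect-size distribution) can be generated roughly as follows. This is a minimal sketch under stated assumptions: the genotypes are simulated independently with a single allele frequency, whereas the study's simulation design is not described in this excerpt, and the function names, cohort size, and variant counts are hypothetical.

```python
import math
import random
import statistics


def draw_effects(k, distribution, rng):
    """Draw k true SNP effects from a Laplace or normal distribution."""
    if distribution == "laplace":
        # A Laplace draw is an exponential magnitude with a random sign.
        return [rng.choice((-1, 1)) * rng.expovariate(1.0) for _ in range(k)]
    if distribution == "normal":
        return [rng.gauss(0, 1) for _ in range(k)]
    raise ValueError(f"unknown distribution: {distribution}")


def simulate_phenotype(geno, causal, effects, h2, rng):
    """Genetic value plus noise, scaled so that var(genetic) / var(total) = h2."""
    genetic = [sum(b * row[j] for j, b in zip(causal, effects)) for row in geno]
    var_g = statistics.pvariance(genetic)
    sd_noise = math.sqrt(var_g * (1 - h2) / h2)
    pheno = [g + rng.gauss(0, sd_noise) for g in genetic]
    return genetic, pheno


rng = random.Random(7)
n_subjects, n_variants, n_causal, h2 = 1000, 300, 100, 0.05  # illustrative sizes

# Independent genotypes coded 0/1/2 with allele frequency 0.3 (assumption).
geno = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_variants)]
        for _ in range(n_subjects)]
causal = rng.sample(range(n_variants), n_causal)
effects = draw_effects(n_causal, "laplace", rng)

genetic, pheno = simulate_phenotype(geno, causal, effects, h2, rng)
realized_h2 = statistics.pvariance(genetic) / statistics.pvariance(pheno)
print(f"target h2={h2}, realized h2={realized_h2:.3f}")
```

Because the correlation between a perfect genetic predictor and the phenotype equals the square root of the heritability, the best achievable PCC at a heritability of 0.05 is about 0.22, which gives a scale for the values in the table.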
Prediction accuracy in simulation studies in which the phenotype is associated with SNPs only (heritability = 0.10).
Columns are grouped by the number of true susceptibility variants (100, 200, 500, 2000, or 5000).

| Distribution of the true SNP effects | Prediction model | 100 variants: mean (SE) PCC | 100 variants: power^a | 200 variants: mean (SE) PCC | 200 variants: power^a | 500 variants: mean (SE) PCC | 500 variants: power^a | 2000 variants: mean (SE) PCC | 2000 variants: power^a | 5000 variants: mean (SE) PCC | 5000 variants: power^a |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Laplace distribution | STMGP | 0.1520 (0.0293) | 1.00 | 0.1029 (0.0408) | 1.00 | 0.0521 (0.0252) | 0.80 | 0.0241 (0.0193) | 0.35 | 0.0217 (0.0171) | 0.25 |
| | PRS | 0.0454 (0.0434) | 0.75 | 0.0421 (0.0247) | 0.85 | −0.0018 (0.0283) | 0.10 | 0.0128 (0.0203) | 0.15 | 0.0004 (0.0203) | 0.10 |
| | GBLUP | 0.0137 (0.0134) | 0.05 | 0.0201 (0.0143) | 0.15 | 0.0163 (0.0190) | 0.20 | 0.0198 (0.0133) | 0.25 | 0.0199 (0.0201) | 0.15 |
| | SBLUP | 0.0140 (0.0148) | 0.05 | 0.0186 (0.0143) | 0.10 | 0.0150 (0.0200) | 0.20 | 0.0189 (0.0159) | 0.25 | 0.0186 (0.0189) | 0.15 |
| | BayesR | 0.1217 (0.0680) | 0.90 | 0.0782 (0.0475) | 0.85 | 0.0345 (0.0337) | 0.35 | 0.0202 (0.0195) | 0.25 | 0.0172 (0.0222) | 0.15 |
| | Ridge | 0.0183 (0.0158) | 0.20 | 0.0188 (0.0138) | 0.20 | 0.0215 (0.0212) | 0.30 | 0.0184 (0.0111) | 0.10 | 0.0171 (0.0192) | 0.15 |
| Normal distribution | STMGP | 0.1045 (0.0281) | 1.00 | 0.0638 (0.0205) | 0.95 | 0.0236 (0.0122) | 0.30 | 0.0208 (0.0156) | 0.25 | 0.0195 (0.0186) | 0.15 |
| | PRS | 0.0258 (0.0305) | 0.50 | 0.0177 (0.0220) | 0.30 | 0.0079 (0.0224) | 0.15 | 0.0053 (0.0216) | 0.10 | 0.0015 (0.0233) | 0.00 |
| | GBLUP | 0.0220 (0.0168) | 0.30 | 0.0202 (0.0172) | 0.15 | 0.0161 (0.0147) | 0.15 | 0.0172 (0.0191) | 0.15 | 0.0204 (0.0132) | 0.15 |
| | SBLUP | 0.0215 (0.0173) | 0.30 | 0.0195 (0.0174) | 0.15 | 0.0173 (0.0150) | 0.15 | 0.0185 (0.0198) | 0.20 | 0.0206 (0.0129) | 0.15 |
| | BayesR | 0.0943 (0.0489) | 0.90 | 0.0444 (0.0224) | 0.70 | 0.0210 (0.0171) | 0.15 | 0.0189 (0.0135) | 0.20 | 0.0130 (0.0127) | 0.05 |
| | Ridge | 0.0251 (0.0156) | 0.40 | 0.0269 (0.0180) | 0.40 | 0.0187 (0.0184) | 0.15 | 0.0170 (0.0162) | 0.15 | 0.0154 (0.0179) | 0.10 |
PCC predictive correlation coefficient, SE standard error, STMGP Smooth-Threshold Multivariate Genetic Prediction, PRS polygenic risk scores, GBLUP genomic best linear-unbiased prediction, SBLUP summary-data-based best linear-unbiased prediction, NEG normal–exponential–gamma.
^a Power is the proportion of replicates achieving a significant prediction at P value < 0.05.
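The power definition in the footnote can be written out directly. In the sketch below, the significance test for each replicate (a two-sided test of the correlation with a normal approximation), the sample size of 3048 (borrowed from the validation cohort), and the replicate PCC values are all assumptions for illustration; the excerpt does not state which test or how many replicates were used, although every reported power is a multiple of 0.05, which would be consistent with 20 replicates per setting.

```python
import math


def correlation_p_value(r, n):
    """Approximate two-sided P value for a Pearson correlation r with n subjects."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))  # normal approximation (assumption)


def power(replicate_pccs, n, alpha=0.05):
    """Proportion of replicates whose prediction is significant at alpha."""
    hits = sum(correlation_p_value(r, n) < alpha for r in replicate_pccs)
    return hits / len(replicate_pccs)


# Hypothetical PCCs from four replicates of one simulation setting.
example_pccs = [0.10, 0.01, 0.08, 0.00]
example_power = power(example_pccs, n=3048)
print(example_power)
```

With these hypothetical inputs, two of the four replicates are significant, giving a power of 0.5; the same function applied to 20 replicates would yield values on the 0.05 grid seen in the tables.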