Jie Zheng1, Tom G Richardson1, Louise A C Millard1,2, Gibran Hemani1, Benjamin L Elsworth1, Christopher A Raistrick1, Bjarni Vilhjalmsson3, Benjamin M Neale4,5, Philip C Haycock1, George Davey Smith1, Tom R Gaunt1.
Abstract
Background: Identifying phenotypic correlations between complex traits and diseases can provide useful etiological insights. Restricted access to much individual-level phenotype data makes it difficult to estimate large-scale phenotypic correlation across the human phenome. Two state-of-the-art methods, metaCCA and LD score regression, provide an alternative approach to estimate phenotypic correlation using only genome-wide association study (GWAS) summary results.
Year: 2018 PMID: 30165448 PMCID: PMC6109640 DOI: 10.1093/gigascience/giy090
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Flowchart of PhenoSpD.
Figure 2: Demonstration of the simulation. For two samples A and B, we simulated the genotype data and phenotype data of two correlated human traits, phenotype 1 and phenotype 2. The sample overlap between sample A and sample B ranged from 10% to 90% in this simulation.
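The sample layout described for Figure 2 can be illustrated with a minimal sketch. It is not the authors' simulation code: it skips the genotype step entirely and simply draws two standardized traits with a chosen phenotypic correlation, then builds two equally sized samples that share a given fraction of individuals. The function name `simulate_overlapping_samples` and all of its parameters are hypothetical.

```python
import numpy as np


def simulate_overlapping_samples(n_per_sample, overlap_fraction, r_p, seed=0):
    """Draw two correlated traits and index two partially overlapping samples.

    Assumption: both samples have the same size and the overlap is expressed
    as a fraction of that size. Genotypes are not simulated here.
    """
    rng = np.random.default_rng(seed)
    n_overlap = int(round(n_per_sample * overlap_fraction))
    n_unique = n_per_sample - n_overlap
    n_pool = n_overlap + 2 * n_unique

    # Two standardized traits with phenotypic correlation r_p.
    cov = [[1.0, r_p], [r_p, 1.0]]
    traits = rng.multivariate_normal([0.0, 0.0], cov, size=n_pool)

    # Shared individuals come first, then those unique to A, then unique to B.
    shared = np.arange(n_overlap)
    only_a = np.arange(n_overlap, n_overlap + n_unique)
    only_b = np.arange(n_overlap + n_unique, n_pool)
    idx_a = np.concatenate([shared, only_a])
    idx_b = np.concatenate([shared, only_b])
    return traits, idx_a, idx_b


traits, idx_a, idx_b = simulate_overlapping_samples(5000, 0.5, 0.5)
# Observed phenotypic correlation across the pooled individuals.
obs_rp = np.corrcoef(traits[:, 0], traits[:, 1])[0, 1]
```

Sweeping `overlap_fraction` from 0.1 to 0.9 reproduces the range of overlaps named in the caption.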
The influence of genetic and environmental components on phenotypic correlation estimation using metaCCA
| Model | N_ind_A | N_ind_B | N_overlap | Overlap_% | N_SNPs | SNP_region | N_EnvF | Genetic% | N_simu | Obs_rp | rG | rE | Est_rp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genetic_Env_components 1 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 0 | 100 | 0.497 | −0.007 | 0.502 | 0.035 |
| Genetic_Env_components 2 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 10 | 100 | 0.498 | −0.050 | 0.550 | −0.002 |
| Genetic_Env_components 3 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 20 | 100 | 0.496 | −0.103 | 0.601 | −0.044 |
| Genetic_Env_components 4 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 30 | 100 | 0.505 | −0.150 | 0.649 | −0.079 |
| Genetic_Env_components 5 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 40 | 100 | 0.493 | −0.202 | 0.700 | −0.128 |
| Genetic_Env_components 6 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 50 | 100 | 0.502 | −0.250 | 0.752 | −0.166 |
| Genetic_Env_components 7 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 60 | 100 | 0.497 | −0.302 | 0.800 | −0.211 |
| Genetic_Env_components 8 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 70 | 100 | 0.508 | −0.347 | 0.850 | −0.246 |
| Genetic_Env_components 9 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 80 | 100 | 0.504 | −0.401 | 0.900 | −0.288 |
| Genetic_Env_components 10 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 90 | 100 | 0.498 | −0.449 | 0.950 | −0.330 |
| Genetic_Env_components 11 | 5000 | 5000 | 5000 | 100 | 1000 | SNPs across the genome | 1000 | 100 | 100 | 0.509 | −0.496 | 1.000 | −0.373 |
In this simulation, we compared the agreement between the observed phenotypic correlation (calculated from phenotypes) and the estimated phenotypic correlation (estimated using metaCCA) of two human traits in two samples, A and B. We explored the influence of the genetic and environmental components on phenotypic correlation. More details of the simulation can be found in the Methods section. Abbreviations: N_ind_A and N_ind_B: number of individuals in samples A and B; N_overlap: number of overlapping individuals in samples A and B; Overlap_%: percentage of overlapping individuals in A and B; N_SNPs: number of SNPs in the GWAS of samples A and B; SNP_region: simulated SNPs are from either one or a few LD blocks or from the whole genome; Genetic%: percentage of genetic influence on the phenotypic correlation; N_EnvF: number of environmental factors included in the model; N_simu: number of simulations; Obs_rp: observed phenotypic correlation between two traits in the mixed samples; Est_rp: mean value of the estimated phenotypic correlations in 100 simulations using metaCCA; rG and rE: the simulated genetic and environmental correlations in each case.
The influence of genetic and environmental components, number of SNPs, sample sizes of two GWASs, and sample overlap between two GWASs on phenotypic correlation estimation using LD score regression
| Model | N_ind_A | N_ind_B | N_overlap | Overlap_% | N_SNPs | SNP_region | N_EnvF | Genetic% | N_simu | Obs_rp | Est_rp | Deviation, % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genetic_Env_components 1 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 0 | 100 | 0.49 | 0.32 | 35.70 |
| Genetic_Env_components 2 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 10 | 100 | 0.50 | 0.34 | 31.30 |
| Genetic_Env_components 3 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 20 | 100 | 0.50 | 0.37 | 24.90 |
| Genetic_Env_components 4 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 30 | 100 | 0.50 | 0.39 | 21.30 |
| Genetic_Env_components 5 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 40 | 100 | 0.51 | 0.41 | 18.70 |
| Genetic_Env_components 6 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.42 | 15.30 |
| Genetic_Env_components 7 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 60 | 100 | 0.49 | 0.43 | 13.10 |
| Genetic_Env_components 8 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 70 | 100 | 0.49 | 0.44 | 11.60 |
| Genetic_Env_components 9 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 80 | 100 | 0.50 | 0.45 | 9.70 |
| Genetic_Env_components 10 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 90 | 100 | 0.50 | 0.46 | 7.90 |
| Genetic_Env_components 11 | 5000 | 5000 | 5000 | 100 | 200K | SNPs across the genome | 1000 | 100 | 100 | 0.50 | 0.47 | 5.90 |
| sample size 1 | 1000 | 1000 | 500 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.22 | 55.10 |
| sample size 2 | 3000 | 3000 | 1500 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.30 | 41.70 |
| sample size 3 | 5000 | 5000 | 2500 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.33 | 33.30 |
| sample size 4 | 10000 | 10000 | 5000 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.35 | 30.60 |
| sample size 5 | 50000 | 50000 | 25000 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.36 | 28.20 |
| sample size 6 | 100000 | 100000 | 50000 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.39 | 23.90 |
| sample overlap 1 | 5000 | 5000 | 500 | 10 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.08 | 83.30 |
| sample overlap 2 | 5000 | 5000 | 1000 | 20 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.16 | 68.10 |
| sample overlap 3 | 5000 | 5000 | 1500 | 30 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.23 | 55.40 |
| sample overlap 4 | 5000 | 5000 | 2000 | 40 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.29 | 42.90 |
| sample overlap 5 | 5000 | 5000 | 2500 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.34 | 34.30 |
| sample overlap 6 | 5000 | 5000 | 3000 | 60 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.38 | 24.90 |
| sample overlap 7 | 5000 | 5000 | 3500 | 70 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.41 | 18.50 |
| sample overlap 8 | 5000 | 5000 | 4000 | 80 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.45 | 10.70 |
| sample overlap 9 | 5000 | 5000 | 4500 | 90 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.48 | 6.10 |
| unbalance sample 1 | 5000 | 5000 | 9000 | 90 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.47 | 5.90 |
| unbalance sample 2 | 5000 | 6000 | 9000 | 82 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.45 | 10.50 |
| unbalance sample 3 | 5000 | 8000 | 9000 | 69 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.41 | 18.20 |
| unbalance sample 4 | 5000 | 10000 | 9000 | 60 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.38 | 23.40 |
| unbalance sample 5 | 5000 | 13000 | 9000 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.34 | 31.40 |
| number of SNPs 1 | 5000 | 5000 | 2500 | 50 | 7.5K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.04 | 92.30 |
| number of SNPs 2 | 5000 | 5000 | 2500 | 50 | 12.5K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.11 | 78.20 |
| number of SNPs 3 | 5000 | 5000 | 2500 | 50 | 25K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.14 | 72.10 |
| number of SNPs 4 | 5000 | 5000 | 2500 | 50 | 50K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.22 | 56.70 |
| number of SNPs 5 | 5000 | 5000 | 2500 | 50 | 100K | SNPs across the genome | 1000 | 50 | 100 | 0.50 | 0.30 | 40.90 |
| number of SNPs 6 | 5000 | 5000 | 2500 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.34 | 33.70 |
| Linkage disequilibrium 1 | 5000 | 5000 | 2500 | 50 | 10K | SNPs from one LD block | 1000 | 50 | 100 | 0.51 | 0.09 | 82.30 |
| Linkage disequilibrium 2 | 5000 | 5000 | 2500 | 50 | 20K | SNPs from two LD blocks | 1000 | 50 | 100 | 0.50 | 0.12 | 75.40 |
| Linkage disequilibrium 3 | 5000 | 5000 | 2500 | 50 | 30K | SNPs from three LD blocks | 1000 | 50 | 100 | 0.50 | 0.16 | 68.80 |
| Linkage disequilibrium 4 | 5000 | 5000 | 2500 | 50 | 40K | SNPs from four LD blocks | 1000 | 50 | 100 | 0.51 | 0.20 | 60.30 |
| Linkage disequilibrium 5 | 5000 | 5000 | 2500 | 50 | 50K | SNPs from five LD blocks | 1000 | 50 | 100 | 0.50 | 0.22 | 55.70 |
| Linkage disequilibrium 6 | 5000 | 5000 | 2500 | 50 | 200K | SNPs across the genome | 1000 | 50 | 100 | 0.51 | 0.34 | 33.90 |
In this simulation, we compared the agreement between the observed phenotypic correlation (calculated from phenotypes) and the estimated phenotypic correlation (estimated using LD score regression) of two human traits in two samples, A and B. We explored the influence of the following properties: genetic and environmental components; sample size; sample overlap; unbalanced sample size in samples A and B; number of SNPs; and linkage disequilibrium. More details of the simulation can be found in the Methods section. Abbreviations: N_ind_A and N_ind_B: number of individuals in samples A and B; N_overlap: number of overlapping individuals in samples A and B; Overlap_%: percentage of overlapping individuals in A and B; N_SNPs: number of SNPs in the GWAS of samples A and B; SNP_region: simulated SNPs are from either one or a few LD blocks or from the whole genome; Genetic%: percentage of genetic influence on the phenotypic correlation; N_EnvF: number of environmental factors included in the model; N_simu: number of simulations; Obs_rp: observed phenotypic correlation between two traits in the mixed samples; Est_rp: mean value of the estimated phenotypic correlations in 100 simulations; Deviation (%): deviation between the observed and estimated phenotypic correlations in each simulation model.
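The Deviation column can be read as the relative gap between Obs_rp and Est_rp. The sketch below assumes the formula (Obs_rp − Est_rp) / Obs_rp × 100, which the table notes do not state explicitly. Because the table prints correlations rounded to two decimals, values recomputed this way only approximate the printed deviations. The function name `deviation_percent` is hypothetical.

```python
def deviation_percent(obs_rp: float, est_rp: float) -> float:
    """Relative deviation (in %) of the estimated from the observed correlation.

    Assumption: deviation = (observed - estimated) / observed * 100.
    """
    if obs_rp == 0:
        raise ValueError("observed correlation must be non-zero")
    return (obs_rp - est_rp) / obs_rp * 100.0


# Rounded values from the last "Genetic_Env_components" row (Obs_rp 0.50, Est_rp 0.47).
example = deviation_percent(0.50, 0.47)
```

For that row the recomputed value is close to, but not identical with, the printed 5.90, which is consistent with the inputs having been rounded.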
Figure 3: Comparison between the observed and estimated phenotypic correlations using LD score regression among 487 traits from UK Biobank. Each point is one trait. The red line is X = Y. Some estimated phenotypic correlations fell out of bounds (correlation greater than one). This can occur because of noise in the error covariance matrix (built from the error terms of the genetic association tests) of a pair of traits.
Figure 4: Validation of the influence of the number of causal variants on phenotypic correlation estimation. Four pairs of metabolites (leucine against N1-methyl-3-pyridone-4-carboxamide, tryptophan, phenylalanine, and valine) from Shin et al. [13] were selected based on their observed phenotypic correlations (0.2, 0.4, 0.6, and 0.85, respectively). Eight sets of SNPs were selected to estimate the phenotypic correlations using LD score regression. The eight sets were: all GWAS SNPs; SNPs with chi-square statistics (squares of Z scores) smaller than 40; SNPs with χ² < 30; SNPs with χ² < 20; SNPs with χ² < 10; SNPs with χ² < 3.84; SNPs with χ² < 2.69; and SNPs with χ² < 1. Notes: The four columns on the x-axis are the four selected pairs of metabolites. The y-axis is the value of the phenotypic correlation. Dark blue points are the observed phenotypic correlations (noted as rP-OBS). The lighter blue points are the eight groups of SNPs included in the phenotypic correlation estimation using LD score regression (noted as LDSC_X2).
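The eight SNP sets in Figure 4 are defined by upper bounds on the chi-square statistic, taken as the square of the Z score. A minimal sketch of that filtering step follows. It assumes a single vector of Z scores per SNP; whether the original analysis applied the threshold to one trait or to both traits of a pair is not stated here. The names `THRESHOLDS` and `snp_sets_by_chi2` are hypothetical.

```python
import numpy as np

# None stands for "all GWAS SNPs"; the rest are the upper bounds listed in the caption.
THRESHOLDS = [None, 40, 30, 20, 10, 3.84, 2.69, 1]


def snp_sets_by_chi2(z_scores):
    """Return one boolean mask per SNP set, keyed by a readable label."""
    chi2 = np.asarray(z_scores, dtype=float) ** 2  # chi-square = Z squared
    sets = {}
    for bound in THRESHOLDS:
        if bound is None:
            sets["all"] = np.ones(chi2.shape, dtype=bool)
        else:
            sets[f"chi2<{bound}"] = chi2 < bound
    return sets


masks = snp_sets_by_chi2([0.5, 1.5, 2.0, 7.0])
```

Each mask can then be used to subset the summary statistics before they are passed to the correlation estimator.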
Summary of number of independent traits for the complex human trait networks
| First author | Category | N_traits | N_SNPs | N_indep |
|---|---|---|---|---|
| Kettunen et al. | Metabolites | 107 | 9 | 33.5 |
| UK Biobank | All traits | 487 | 10 | 399.6 |
Abbreviations: N_traits: number of traits in each molecular network; N_SNPs: number of SNPs in each network; and N_indep: number of independent tests in each network.
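One common way to obtain a number of independent traits from a phenotypic correlation matrix is spectral decomposition of that matrix. The sketch below uses Nyholt's (2004) eigenvalue-variance formula, M_eff = 1 + (M − 1)(1 − Var(λ)/M), as an illustration. It is an assumption that this matches the exact variant used to produce the N_indep values in the table, and the function name `effective_number_of_traits` is hypothetical.

```python
import numpy as np


def effective_number_of_traits(corr):
    """Nyholt-style effective number of independent traits.

    corr: symmetric M x M correlation matrix.
    Uses the sample variance (ddof=1) of the eigenvalues, so that M fully
    correlated traits give 1 and M uncorrelated traits give M.
    """
    corr = np.asarray(corr, dtype=float)
    m = corr.shape[0]
    if corr.ndim != 2 or corr.shape != (m, m):
        raise ValueError("corr must be a square matrix")
    if m == 1:
        return 1.0
    eigenvalues = np.linalg.eigvalsh(corr)  # eigenvalues of a symmetric matrix
    variance = np.var(eigenvalues, ddof=1)
    return 1.0 + (m - 1.0) * (1.0 - variance / m)


# Two strongly correlated traits plus one uncorrelated trait.
example_corr = [[1.0, 0.9, 0.0],
                [0.9, 1.0, 0.0],
                [0.0, 0.0, 1.0]]
n_indep = effective_number_of_traits(example_corr)
```

A multiple-testing threshold can then be set from the result, for example 0.05 divided by the effective number of traits.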