Yan Xu1, Li Xing2, Jessica Su3, Xuekui Zhang4, Weiliang Qiu3.
Abstract
Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human diseases by identifying disease-associated single-nucleotide polymorphisms (SNPs). The traditional SNP-wise approach with multiple-testing adjustment is over-conservative and lacks power in many GWASs. In this article, we propose a model-based clustering method that transforms the challenging high-dimension-small-sample-size problem into a low-dimension-large-sample-size problem and borrows information across SNPs by grouping SNPs into three clusters. We pre-specify the cluster patterns in terms of the minor allele frequencies (MAFs) of SNPs in cases versus controls, and enforce the patterns with prior distributions. In simulation studies, our proposed model outperforms the traditional SNP-wise approach, showing better control of the false discovery rate (FDR) and higher sensitivity. We re-analyzed two real studies to identify SNPs associated with severe bortezomib-induced peripheral neuropathy (BiPN) in patients with multiple myeloma (MM). The original analysis in the literature failed to identify SNPs after FDR adjustment. Our proposed method not only detected the reported SNPs after FDR adjustment but also discovered a novel BiPN-associated SNP, rs4351714, that has been reported to be related to MM in another study.
Year: 2019 PMID: 31548641 PMCID: PMC6757104 DOI: 10.1038/s41598-019-50229-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Simulation results for 500,000 SNPs with 200 effective SNPs, using four different MAF distributions for data generation and two different sample sizes. The upper and lower four rows contain results of simulated data with 200 samples and 1,000 samples respectively. The left panel shows truncated MAF distributions for data generation (solid line) and prior distributions for analysis (approximated beta distributions obtained by moment matching: dashed lines are Beta-approximations of truncated Beta(2, 5) and dotted lines are Beta-approximations of empirical distributions estimated from data). The middle panel shows boxplots for FDR, and the vertical dashed line represents the nominal level (0.05). The right panel shows boxplots for the paired difference in sensitivity between our method (using truncated Beta(2, 5) or the empirical distribution for analysis) and the SNP-wise approach, and the vertical dashed line represents 0 (i.e., same performance between our method and the SNP-wise approach). White boxes represent the SNP-wise approach, light grey boxes represent our method using truncated Beta(2, 5) for analysis, and dark grey boxes represent our method using empirical distributions for analysis.
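The moment-matching step mentioned in the caption can be illustrated with a short sketch: compute the mean and variance of a truncated Beta(2, 5) numerically, then solve for the Beta(α, β) with the same two moments. The truncation bounds `LO` and `HI` below are hypothetical (the caption does not give them), and the function names are illustrative, not taken from the paper.

```python
def truncated_beta_moments(a, b, lo, hi, n=20000):
    """Mean and variance of Beta(a, b) restricted to [lo, hi] (midpoint rule)."""
    h = (hi - lo) / n
    mass = m1 = m2 = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        w = x ** (a - 1) * (1 - x) ** (b - 1)  # unnormalised Beta density
        mass += w
        m1 += w * x
        m2 += w * x * x
    mean = m1 / mass
    var = m2 / mass - mean ** 2
    return mean, var


def beta_from_moments(mean, var):
    """Parameters (alpha, beta) of the Beta distribution with this mean and variance."""
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


# Hypothetical truncation bounds; the source does not state the actual ones.
LO, HI = 0.05, 0.5
mean, var = truncated_beta_moments(2, 5, LO, HI)
alpha, beta = beta_from_moments(mean, var)
print(f"Beta-approximation: alpha={alpha:.3f}, beta={beta:.3f}")
```

As a sanity check, setting the bounds to 0 and 1 (no truncation) should recover parameters close to (2, 5).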
Table 1. Significant SNPs detected by our method based on the discovery dataset (GSE65777).
| Pseudo count | FDR | Detected SNPs |
|---|---|---|
| (3,3,3) | <0.1 | rs10862339 rs1344016 rs2839629* |
| | <0.05 | rs10862339 rs1344016 |
| (20,20,20) | <0.1 | rs10862339 rs1344016 rs2414277 rs2839629* rs4351714** rs4776196 |
| | <0.05 | rs10862339 rs1344016 |
The SNP labeled with ‘*’ is the only SNP reported by Magrangeas et al.[29] as a validated SNP; the SNP labeled with ‘**’ is a novel SNP detected by our method; all other SNPs in the table were reported by Magrangeas et al.[29] in the discovery data but were not validated in the replication data.
Validation results for SNPs listed in Table 1. One-sided logistic regression with permutations followed by FDR adjustment for the validation set (GSE66903).
| SNP_ID | Odds ratio | P.raw | P.permutation | P.FDR |
|---|---|---|---|---|
| rs10862339 | 0.98 (0.57–1.69) | 0.4662 | 0.4783 | 0.4783 |
| rs1344016 | 1.05 (0.59–1.86) | 0.4329 | 0.425 | 0.4783 |
| rs2839629 | 2.02 (1.12–3.65) | 0.0096 | 0.0108 | 0.0324 |
P.raw: raw P-values from one-sided logistic regression; P.permutation: P-values determined by permutation; P.FDR: permuted P-values after FDR adjustment.
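The relationship between the P.permutation and P.FDR columns can be checked with a small sketch. It assumes the FDR adjustment is the Benjamini-Hochberg procedure; the source does not name the procedure, so this is an assumption, and `bh_adjust` is an illustrative helper rather than the authors' code.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum.
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, i in enumerate(order):
        rank = m - pos  # rank of this p-value in ascending order (1 = smallest)
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Permutation p-values copied from the validation table above.
p_perm = {"rs10862339": 0.4783, "rs1344016": 0.425, "rs2839629": 0.0108}
p_fdr = dict(zip(p_perm, bh_adjust(list(p_perm.values()))))
for snp, p in p_fdr.items():
    print(snp, round(p, 4))
```

Rounded to four decimals, this gives 0.4783, 0.4783, and 0.0324, which agrees with the P.FDR column for the three SNPs.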
Figure 2. Directed acyclic graph representation of our model-based clustering method. Observed data (SNP genotypes), cluster memberships, MAFs, and mixture proportions are denoted by S, z, θ, and π respectively. Plain solid rectangles represent observations. Diamonds represent latent variables of unknown cluster membership. Plain solid circles indicate model parameters to be estimated, while dashed circles represent nuisance parameters to be integrated out from the model likelihood by marginalization. Gray-filled circles represent pre-specified hyper-parameters.
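To make the dependency structure π → z → θ → S concrete, the following forward simulation draws one SNP from a three-cluster mixture. It is only a sketch under several assumptions not stated in this record: the cluster labels "+", "0", "-" (case MAF higher than, equal to, or lower than control MAF), a Beta(2, 5) MAF prior truncated at 0.5, and genotypes drawn as Binomial(2, θ), i.e. Hardy-Weinberg equilibrium. It does not implement the fitting procedure, in which π is estimated and θ is integrated out.

```python
import random


def draw_maf(rng, upper=0.5):
    """Assumed prior: Beta(2, 5) truncated to [0, upper] by rejection sampling."""
    while True:
        x = rng.betavariate(2, 5)
        if x <= upper:
            return x


def simulate_snp(pi, n_cases, n_controls, rng):
    """Forward-simulate one SNP: pi -> z -> theta -> S."""
    # z: latent cluster membership drawn from the mixture proportions pi.
    z = rng.choices(["+", "0", "-"], weights=pi)[0]
    # theta: case/control MAFs ordered according to the cluster pattern.
    a, b = draw_maf(rng), draw_maf(rng)
    if z == "0":
        theta_case = theta_ctrl = a
    elif z == "+":
        theta_case, theta_ctrl = max(a, b), min(a, b)
    else:
        theta_case, theta_ctrl = min(a, b), max(a, b)

    # S: genotype = number of minor alleles, Binomial(2, theta) (assumed HWE).
    def genotype(theta):
        return (rng.random() < theta) + (rng.random() < theta)

    cases = [genotype(theta_case) for _ in range(n_cases)]
    controls = [genotype(theta_ctrl) for _ in range(n_controls)]
    return z, (theta_case, theta_ctrl), cases, controls


rng = random.Random(0)
# Hypothetical mixture proportions: most SNPs are not disease-associated.
z, theta, cases, controls = simulate_snp((0.05, 0.90, 0.05), 10, 10, rng)
print(z, theta, cases, controls)
```

Because most SNPs fall in the "0" cluster, data simulated this way mimic the sparse-signal setting in which pooling information across SNPs is expected to help.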