Pekka Marttinen, Matti Pirinen, Antti-Pekka Sarin, Jussi Gillberg, Johannes Kettunen, Ida Surakka, Antti J Kangas, Pasi Soininen, Paul O'Reilly, Marika Kaakinen, Mika Kähönen, Terho Lehtimäki, Mika Ala-Korpela, Olli T Raitakari, Veikko Salomaa, Marjo-Riitta Järvelin, Samuli Ripatti, Samuel Kaski.
Abstract
MOTIVATION: A typical genome-wide association study searches for associations between single nucleotide polymorphisms (SNPs) and a univariate phenotype. However, there is growing interest in investigating associations between genomics data and multivariate phenotypes, for example, in gene expression or metabolomics studies. A common approach is to perform a univariate test between each genotype-phenotype pair, and then to apply a stringent significance cutoff to account for the large number of tests performed. However, this approach has limited ability to uncover dependencies involving multiple variables. Another trend in current genetics is the investigation of the impact of rare variants on the phenotype, where the standard methods often fail owing to lack of power when the minor allele is present in only a limited number of individuals.
Year: 2014 PMID: 24665129 PMCID: PMC4080737 DOI: 10.1093/bioinformatics/btu140
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Graphical illustration of the model. (A) The variables and dependencies between them. The phenotypes Y are assumed to be affected by known factors, such as age or sex, unknown factors, such as batch effects caused by varying experimental conditions, and the SNPs. The influence of the SNPs is mediated by unknown combinations of the original SNPs, represented by black squares. (B) The same model using matrix notation. Matrices containing the observed variables, Y (the phenotypes), X (the SNPs) and Z (known factors), are blue. The regression coefficient matrices are red. Note that the coefficient matrix for the SNP effects is written as a product of two matrices, and Γ, corresponding to a low-rank approximation to an unconstrained coefficient matrix. The brown matrices comprise unobserved variables, H (unknown factors) and E (noise terms).
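The matrix form described in panel (B) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names `Psi`, `A` and `Lam`, all dimensions, and the Gaussian draws are hypothetical choices made for illustration (the caption names only Y, X, Z, Γ, H and E), and the code generates data from such a model rather than performing the authors' inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper)
n, p, q = 200, 30, 8   # individuals, SNPs, phenotypes
c, k, r = 2, 3, 2      # known covariates, unknown factors, rank of the SNP effect

X = rng.binomial(2, 0.3, size=(n, p)).astype(float)  # SNPs as minor-allele counts
Z = rng.normal(size=(n, c))                          # known factors (e.g. age, sex)
H = rng.normal(size=(n, k))                          # unknown factors (e.g. batch effects)
E = rng.normal(size=(n, q))                          # noise terms

Psi = rng.normal(scale=0.1, size=(p, r))  # hypothetical name: SNPs -> r combinations
Gamma = rng.normal(size=(r, q))           # combinations -> phenotypes
A = rng.normal(size=(c, q))               # hypothetical name: effects of known factors
Lam = rng.normal(size=(k, q))             # hypothetical name: loadings of unknown factors

# The SNP coefficient matrix is the product of two thin matrices, so its rank is at most r
B = Psi @ Gamma
Y = X @ B + Z @ A + H @ Lam + E

print("SNP coefficient matrix shape:", B.shape)
print("rank:", np.linalg.matrix_rank(B))
print("free parameters, low-rank vs unconstrained:", p * r + r * q, "vs", p * q)
```

The point of the factorization is visible in the parameter count: the two thin matrices need far fewer coefficients than an unconstrained SNP-by-phenotype matrix of the same shape.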
Fig. 2. Prior and posterior distributions for the proportion of total variation explained (PTVE) by the model. The panel on the left shows the prior distribution imposed on the proportion of total variation of the phenotypes explained by the SNPs under consideration (here, the SNPs from the LIPC gene). The median of the prior distribution is located at ∼4e-6. The characteristic features of the prior distribution include the peak at values close to zero, effectively removing noise unless there is strong evidence about a possible association, and the long tail allowing a small percentage of genes to explain larger proportions of the phenotype variation. The rightmost bin on the x-axis contains the total probability of values exceeding the maximum value on the axis. The panel on the right shows the posterior distribution of the PTVE for the same SNPs. Two posterior densities are shown, one showing the distribution for the original data, the other showing the distribution for data in which the rows of the phenotype matrix have been permuted. Notice the differing scales on the x-axes of the two panels.
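The contrast between original and row-permuted phenotypes can be sketched numerically. The example below rests on assumptions: it takes PTVE to be the summed variance of the SNP term divided by the summed variance of the phenotypes, and it replaces the Bayesian model with an ordinary least-squares fit, so it produces point estimates rather than the posterior distributions shown in the figure. The data are simulated and all sizes are illustrative.

```python
import numpy as np

def fit_ols(X, Y):
    """Least-squares SNP coefficients on centered data (a stand-in for the Bayesian fit)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    return B

def ptve(X, B, Y):
    """Assumed definition: variance of the SNP term over total phenotype variance."""
    explained = np.var(X @ B, axis=0).sum()
    total = np.var(Y, axis=0).sum()
    return float(explained / total)

rng = np.random.default_rng(1)
n, p, q = 500, 5, 4                                   # illustrative sizes
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # simulated genotypes
Y = X @ np.full((p, q), 0.5) + rng.normal(size=(n, q))  # phenotypes with a real SNP effect

ptve_orig = ptve(X, fit_ols(X, Y), Y)

# Permuting the rows of Y breaks the link between individuals' genotypes and phenotypes
Y_perm = Y[rng.permutation(n)]
ptve_perm = ptve(X, fit_ols(X, Y_perm), Y_perm)

print(f"original: {ptve_orig:.3f}, permuted: {ptve_perm:.3f}")
```

With an unregularized fit the permuted value stays slightly above zero because of overfitting; a prior concentrated near zero, as described in the caption, is one way to suppress that residual.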
Summary of results from the genome-wide analysis of the real data
| Chr | Locus | PTVE (SD) | Rare | P-value | Gene rank: PTVE | Gene rank: Pairwise | Gene rank: S-CCA | Gene rank: CCA-single |
|---|---|---|---|---|---|---|---|---|
| (a) | | | | | | | | |
| 15 | | 0.01 (5e-04) | 0.015 | 5e-19 | 1 | 1 | 1 | 132 |
| 19 | | 0.0046 (3e-04) | 0.017 | 1.4e-26 | 2 | 8 | 18 | 275 |
| 19 | | 0.0045 (1e-04) | 0.0015 | 7e-35 | 3 | 9 | 14 | 276 |
| 2 | | 0.0044 (3e-04) | 0.017 | 1.1e-17 | 4 | 45 | 41 | 3244 |
| 11 | | 0.0043 (2e-04) | 0.021 | 5e-10 | 5 | 26 | 33 | 2433 |
| (b) | | | | | | | | |
| 16 | | 0.0015 (9e-05) | 0.89 | 0.00091 | 5 (rare) | 8344 | 5490 | 4973 |
| 5 | | 0.0024 (2e-04) | 0.55 | 0.0016 | 6 (rare) | 2706 | 5163 | 2155 |
| (c) | | | | | | | | |
| 2 | | 0.0015 (2e-04) | 0.15 | 2.6e-04 | 138 | 4715 | 338 | 7652 |
| 4 | | 0.0015 (2e-04) | 0.019 | 7e-06 | 163 | 102 | 1444 | 1552 |
Note: (a) Reference results for the genes with the five highest PTVE scores. (b) Replicated genes from the PTVE-rare score (of six genes tested for replication). (c) Replicated genes from the PTVE score (of 167 genes). Five other replicated genes (PPBP, CXCL5, CXCL2, PF4 and CXCL3) are not shown as they were located within 1 Mb of MTHFD2L, which had the strongest effect. Column PTVE (SD) shows the proportion of total variation explained and its SD, Rare specifies the proportion of the variation explained by the gene that is attributed to the rare variants, P-value specifies the P-value pooled over the YFS and FINRISK replication datasets (unless stated otherwise) and the last four columns specify the ranking of the gene among all genes with the different methods. Footnote markers: a denotes genes that replicated significantly in only one of the two replication datasets; b and c denote the replication P-value in FINRISK and YFS, respectively.
Fig. 3. Results for genes with significant replication in both test sets: XRCC4 and MTHFD2L; for reference, the well-known LIPC lipid locus is also shown. Each panel shows the identified phenotype combination plotted against the genotype combination. The left column shows results in the NFBC1966 dataset, in which the associations were detected. The center and right columns show results with the YFS and FINRISK datasets, where coefficient matrices learned with the NFBC1966 data were used to form the variable combinations. The green and red background colorings mark the individuals with the highest/lowest genotype combination values, and the phenotype values in these extreme groups are investigated in more detail in Supplementary Figure S3.
Power comparison of the different methods
| Method | FDR = 0 | FDR = 0.1 | FDR = 0.2 | FDR = 0.4 | Novel |
|---|---|---|---|---|---|
| PTVE | 36 | 55 | 103 | 305 | 167 |
| PTVE-rare | 3 | 3 | 3 | 6 | 6 |
| Pairwise | 103 | 176 | 243 | 651 | 300 |
| CCA, single SNP | 7 | 7 | 7 | 11 | 11 |
| Sparse CCA | 37 | 50 | 66 | 117 | 51 |
Note: The table shows the number of gene-metabolome associations that had a false discovery rate below the specified threshold. The last column shows the number of putative novel associations among genes with FDR = 0.4 after removing the known associations as described in the main text.
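Counts of this kind can be produced from a ranked list of gene scores once an FDR estimate is available. The text here does not say how the FDR was estimated, so the following is a hedged sketch of one common option, an empirical FDR against a permutation null, and not necessarily the authors' procedure. The function names and the toy scores are hypothetical.

```python
import numpy as np

def empirical_fdr(observed, null, threshold):
    """Estimated FDR among observed scores >= threshold, using null (e.g. permuted) scores."""
    n_obs = np.sum(observed >= threshold)
    if n_obs == 0:
        return 0.0
    # Scale the null count in case the null set differs in size from the observed set
    n_null = np.sum(null >= threshold) * (len(observed) / len(null))
    return min(1.0, float(n_null / n_obs))

def count_discoveries(observed, null, fdr_level):
    """Largest number of top-ranked scores whose estimated FDR is at most fdr_level."""
    best = 0
    for t in np.sort(observed)[::-1]:
        if empirical_fdr(observed, null, t) <= fdr_level:
            best = int(np.sum(observed >= t))
    return best

# Toy scores (made up for illustration): higher means stronger association
observed = np.array([10.0, 9.0, 8.0, 3.0, 2.0, 1.0])
null = np.array([4.0, 2.5, 1.5, 1.0, 0.5, 0.2])

for level in (0.0, 0.3, 0.5):
    print(f"FDR <= {level}: {count_discoveries(observed, null, level)} associations")
```

Under this scheme an FDR of 0 simply means that no null score reaches the threshold, which is one way a column such as "FDR = 0" can contain a nonzero count.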