Lawrence M Chen, Nelson Yao, Elika Garg, Yuecai Zhu, Thao T T Nguyen, Irina Pokhvisneva, Shantala A Hari Dass, Eva Unternaehrer, Hélène Gaudreau, Marie Forest, Lisa M McEwen, Julia L MacIsaac, Michael S Kobor, Celia M T Greenwood, Patricia P Silveira, Michael J Meaney, Kieran J O'Donnell.
Abstract
BACKGROUND: Polygenic risk scores (PRS) describe the genomic contribution to complex phenotypes and consistently account for a larger proportion of variance in outcome than single nucleotide polymorphisms (SNPs) alone. However, there is little consensus on the optimal data input for generating PRS, and existing approaches largely preclude the use of imputed posterior probabilities and strand-ambiguous SNPs, i.e., A/T or C/G polymorphisms. Our ability to predict complex traits that arise from the additive effects of a large number of SNPs would likely benefit from a more inclusive approach.
Keywords: Bioinformatics; Genetic profile score; Multi-core processing; Major depressive disorder; PRS-on-spark; PRSoS; Polygenic risk score
Year: 2018 PMID: 30089455 PMCID: PMC6083617 DOI: 10.1186/s12859-018-2289-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Allele matching for polygenic risk scores (PRS) between discovery and target data. The effect alleles and their reverse complements are indicated in red. Matching the effect alleles from the discovery data with the reported alleles in the target data is straightforward when SNPs are not strand-ambiguous (top and middle panels). The allele in the target data can be misassigned for strand-ambiguous SNPs (bottom panel)
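The matching problem described in Fig. 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not PRSoS source code: the function names `is_strand_ambiguous` and `match_effect_allele` are hypothetical, and the sketch checks only the effect allele (a fuller implementation would presumably also verify the non-effect allele).

```python
# Illustrative sketch of the allele matching in Fig. 1.
# Function names are hypothetical and not taken from PRSoS.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1, a2):
    """Return True for A/T or C/G SNPs, whose two alleles are complements."""
    return COMPLEMENT[a1] == a2


def match_effect_allele(effect, a1, a2):
    """Return 'A1' or 'A2' for the target column that carries the effect
    allele, or None when the alleles alone cannot settle the match."""
    if is_strand_ambiguous(a1, a2):
        # The effect allele and its reverse complement are both present,
        # so the strand cannot be inferred from the alleles themselves.
        return None
    # Try the effect allele as reported, then its reverse complement.
    for candidate in (effect, COMPLEMENT[effect]):
        if candidate == a1:
            return "A1"
        if candidate == a2:
            return "A2"
    return None
```

For a non-ambiguous SNP the effect allele or its complement appears in exactly one target column, so the match is unique. For an A/T or C/G SNP both appear, which is why the sketch returns `None` rather than guessing.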
Fig. 2 PRSoS allele matching solution for strand-ambiguous SNPs. The effect alleles and their reverse complements are indicated in red. The discovery effect allele and the target allele 1 are the same if their allele frequencies are both less than 0.4 or both more than 0.6 (top). The target allele 1 is not the effect allele if one has low allele frequency and the other has high allele frequency (middle). Strand-ambiguous SNPs with an allele frequency between 0.4 and 0.6 are excluded to increase the certainty of matching alleles
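The frequency rule in Fig. 2 can be expressed as a small decision function. The sketch below is a hedged illustration, not the PRSoS implementation: the name `resolve_ambiguous` and its parameters are hypothetical, and treating frequencies of exactly 0.4 or 0.6 as excluded is an assumption, since the caption does not specify the boundary behaviour.

```python
# Illustrative sketch of the frequency-based rule in Fig. 2.
# The function name and the boundary handling are assumptions.
def resolve_ambiguous(effect_freq, target_a1_freq, low=0.4, high=0.6):
    """Decide which target allele to score for a strand-ambiguous SNP.

    effect_freq    -- frequency of the effect allele in the discovery data
    target_a1_freq -- frequency of allele 1 in the target data
    Returns 'A1', 'A2', or None (SNP is discarded).
    """
    def side(freq):
        if freq < low:
            return "low"
        if freq > high:
            return "high"
        return None  # within [low, high]: too close to 0.5 to trust

    discovery_side = side(effect_freq)
    target_side = side(target_a1_freq)
    if discovery_side is None or target_side is None:
        return None
    # Same side: allele 1 is the effect allele. Opposite sides: it is not.
    return "A1" if discovery_side == target_side else "A2"
```

For example, `resolve_ambiguous(0.2, 0.3)` selects allele 1, `resolve_ambiguous(0.2, 0.8)` selects allele 2, and `resolve_ambiguous(0.5, 0.2)` discards the SNP.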
PRSoS optional data output
| PRS_0.001 | PRS_0.001_flag | PRS_0.5 | PRS_0.5_flag | Discard |
|---|---|---|---|---|
| rs1115507 | A1 | rs1115507 | A1 | rs2503243 |
| rs17692694 | A2 | rs11661323 | A2 | rs519113 |
| rs4544201 | A2 | rs12296077 | A1 | |
| rs6683133 | A1 | rs12611811 | A1 | |
| rs7609940 | A2 | rs17024456 | A1 | |
| rs7620685 | A1 | rs17692694 | A2 | |
| | | rs4544201 | A2 | |
| | | rs6683133 | A1 | |
| | | rs7609940 | A2 | |
| | | rs7620685 | A1 | |
Example of the SNP log included in the PRSoS output. The SNP log records the SNPs that are used in the PRS at each p-value threshold and whether the first allele column (“A1”) or the second allele column (“A2”) in the target data was scored. SNPs are recorded in the Discard column if the SNPs are discarded due to non-matching alleles between the discovery and the target data
Maternal Adversity, Vulnerability and Neurodevelopment (MAVAN) cohort demographics. Symptoms of depression were assessed using the Center for Epidemiological Studies – Depression (CES-D) scale
| Cohort Demographics | |
|---|---|
| Sample size | |
| Genotyping data only (used in software performance test) | N = 264 |
| Genotyping data with symptoms of depressive score (CES-D) | |
| Mean age at time of assessment in years (SD) | 34.65 (4.89) |
| Mean symptoms of depressive score (SD) | 10.07 (8.81) |
| Reported ethnicity among sample with genotyping data and CES-D data | |
| Caucasian | |
| Others | |
| Not reported | N = 1 |
Genotyping file information
| Software | Dataset | Genotyping file format | File size (GB) | SNP count |
|---|---|---|---|---|
| PRSice v1.25 | Array Data | .bim/.bed/.fam | 0.03 | 316,480 |
| PRSice v1.25 | Imputed HC | .bim/.bed/.fam | 1.66 | 17,434,284 |
| PRSice v1.25 | Imputed PP | .gen/.fam | 29.02 | 17,434,284 |
| PRSoS | Array Data | .gen/.sample | 0.51 | 316,480 |
| PRSoS | Imputed HC | .gen/.sample | 28.09 | 17,434,284 |
| PRSoS | Imputed PP | .gen/.sample | 29.02 | 17,434,284 |
The file size and SNP count provide an idea of how much data processing needs to be done by each software in our analysis. The file formats that we used in PRSice and PRSoS are different due to differences in file compatibility. All files have the same sample size (N = 264)
Fig. 3 PRSice v1.25 and PRSoS performance across datasets. Bar plot shows the results of the performance test comparing PRSice v1.25 and PRSoS across the datasets. Error bars indicate standard deviations. Numbers in boxed inserts indicate the size of the genotype data input. †Note that the file sizes used for the Imputed PP dataset are the same for PRSice v1.25 and PRSoS, thus illustrating the difference in processing speed for the same file size input. Imputed PP = imputed posterior probabilities, Imputed HC = imputed posterior probabilities converted to “hard calls”, Array Data = observed genotypes. Significance values derived from paired t-tests
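The distinction between imputed posterior probabilities and "hard calls" can be made concrete with a short sketch. It assumes the three genotype probabilities per sample follow the .gen convention, P(A1A1), P(A1A2), P(A2A2), and that the second allele is the one being counted. The function names are hypothetical, and the most-likely-genotype rule used for the hard call is an assumption, since the conversion rule is not stated here. The score is the usual additive weighted sum of allele counts.

```python
# Illustrative sketch: scoring from posterior probabilities vs. hard calls.
# Function names and the argmax hard-call rule are assumptions.
def expected_dosage(p_a1a1, p_a1a2, p_a2a2):
    """Expected count (0 to 2) of the second allele given genotype probabilities."""
    return p_a1a2 + 2.0 * p_a2a2


def hard_call(p_a1a1, p_a1a2, p_a2a2):
    """Most likely genotype, as an integer count (0, 1, or 2) of the second allele."""
    probs = (p_a1a1, p_a1a2, p_a2a2)
    return probs.index(max(probs))


def polygenic_score(weights, dosages):
    """Additive score: sum of per-SNP effect weights times allele counts."""
    return sum(w * d for w, d in zip(weights, dosages))
```

With probabilities (0.1, 0.3, 0.6), the expected dosage is 1.5 while the hard call is 2. Scoring on posterior probabilities retains the imputation uncertainty that the hard call rounds away.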
Fig. 4 PRSice v1.25 and PRSoS performance across increasing number of p-value thresholds. Line plot shows the results of the performance test comparing PRSice v1.25 and PRSoS across increasing number of p-value thresholds to construct in a single run using a dataset based on imputed posterior probabilities converted to “hard calls” (Imputed HC)
Fig. 5 A PRS for major depressive disorder (MDD) predicts symptoms of depression. Bar plots show the proportion of variance explained by PRS for MDD in the prediction of symptoms of depression. PRS were calculated across three datasets including or excluding strand-ambiguous SNPs at a range of p-value thresholds (PT = 0.1, 0.2, 0.3, 0.4, and 0.5). *p < 0.05, **p < 0.01, ***p < 0.001. Imputed PP = imputed posterior probabilities, Imputed HC = imputed posterior probabilities converted to “hard calls”, Array Data = observed genotypes
Fig. 6 Best-fit PRS model selection. Bar plots show the proportion of variance in depressive symptoms explained by PRS for major depressive disorder (MDD) as a function of dataset with and without strand-ambiguous SNPs. Only the best-fit models are shown (PT: Imputed PP = 0.1, Imputed HC = 0.1, Array Data = 0.2). Numbers in boxed inserts refer to the number of SNPs included in each PRS. Imputed PP = imputed posterior probabilities, Imputed HC = imputed posterior probabilities converted to “hard calls”, Array Data = observed genotypes