Jennifer A Collister, Xiaonan Liu, Lei Clifton.
Abstract
A polygenic risk score estimates an individual's genetic risk for a disease or trait, calculated by aggregating the effects of many common variants associated with the condition. With the increasing availability of genetic data in large cohort studies such as the UK Biobank, inclusion of this genetic risk as a covariate in statistical analyses is becoming more widespread. Previously this required specialist knowledge, but as tooling and data availability have improved it has become more feasible for statisticians and epidemiologists to calculate existing scores themselves for use in analyses. While tutorial resources exist for conducting genome-wide association studies and generating new polygenic risk scores, fewer guides exist for the simple calculation and application of existing genetic scores. This guide outlines the key steps of this process: selection of suitable polygenic risk scores from the literature, extraction of relevant genetic variants and verification of their quality, calculation of the risk score, and key considerations for its inclusion in statistical models, using the UK Biobank imputed data as a model data set. Many of the techniques in this guide will generalize to other datasets; however, we also focus on some of the specific techniques required for using data in the formats UK Biobank have selected. This includes some of the challenges faced when working with large numbers of variants, where the computation time required by some tools is impractical. We have focused on only a couple of tools, which may not be the best ones for every aspect of the process. However, one barrier to working with genetic data is the sheer volume of tools available and the difficulty for a novice of assessing their viability. By discussing in depth a couple of tools that are adequate for the calculation even at large scale, we hope to make polygenic risk scores more accessible to a wider range of researchers.
Keywords: UK biobank; genetic risk score; polygenic risk score; polygenic score; worked example
Year: 2022 PMID: 35251129 PMCID: PMC8894758 DOI: 10.3389/fgene.2022.818574
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Distribution of a polygenic risk score for breast cancer among individuals with and without registry-verified breast cancer events in UK Biobank (4,789 cases). Score used was the 313-SNP PRS from (Mavaddat et al., 2019).
Glossary.
| Term | Meaning |
|---|---|
| Allele | An alternative form of a genetic variant |
| Alternate id | In the UK Biobank multi-allelic SNPs are represented as multiple SNPs with different alleles but the same rsID and same position on the chromosome. In order to have a unique identifier for each SNP, an “alternate_id” was created that is typically the rsID, chr:pos or Affymetrix identifier followed by the reference and alternate alleles |
| Base data | Typically GWAS summary statistics containing SNP identifiers, risk alleles and effect sizes |
| Genome Build | The genome build is a common “reference genome” developed by combining the sequences most commonly observed across available individual genomes to create a representative genome against which individual genomes can be compared |
| Genotype data | Genotyping is the identification of the genetic variants in the DNA of an individual. This is typically done using arrays or chips, which contain probes that target specific locations in the DNA. These locations contain known variants of interest—so genotyping is good at identifying which known variants a person has, but not at finding new variants |
| Genotype Imputation | Genotype imputation uses statistical inference against a reference panel to estimate genotypes at locations that were not directly genotyped |
| Heritability | Heritability is the amount of observable (phenotypic) variation among individuals of a population that is due to genetic variation between the individuals |
| Linkage Disequilibrium (LD) | Linkage disequilibrium (LD) is a measure of the correlation between neighbouring genetic variants that are more likely to be inherited together because of their physical proximity, leading to association within a population |
| Locus | Physical location of a gene or DNA polymorphism on a chromosome (plural “loci”) |
| Multi-allelic SNPs | When there is more than one possible variant nucleotide (in addition to the reference) at a location, then we say this location is “multi-allelic” |
| Next generation sequencing | Sequencing enables the exact sequence of bases in a length of DNA to be determined. This technique can be used on targeted areas such as the exome, although it is becoming increasingly cost effective to do whole genome sequencing |
| Phenotype | The phenotype of an organism is its observable characteristics, for example its physical appearance |
| rsID | The rsID for a SNP is the unique RefSNP ID number identifying the “reference SNP cluster” containing this SNP in dbSNP. This cluster contains all SNPs that map to the same location on the genome. Since genome assemblies are still a work in progress, occasionally there will be changes that alter our understanding of where a refSNP is located, so that it may co-locate with another existing refSNP. In these cases, the higher refSNP number is retired and all SNPs are reassigned to the refSNP with the lower number |
| Single Nucleotide Polymorphism (SNP) | A single nucleotide polymorphism (or single nucleotide variant) is a location on the genome where a single DNA nucleotide that differs from that in the reference genome has been identified |
| Target data | The data in which the PRS is developed, using effect sizes from the base data. Multiple PRS may be calculated, using different thresholds for association, and the one with best performance is selected |
| Validation data | The data in which the PRS is calculated and used in analyses. These analyses may validate the association between the PRS and the trait of interest |
Comparison between genetic software for various usages.
| Usage | bgenix | QCTOOL | PLINK |
|---|---|---|---|
| Extract SNPs | Yes, very quickly, although can only specify up to 9,980 SNPs by chromosome and position identifier | Yes, and has useful wildcard feature to extract from all chromosome files in one step, but slow | Yes, have to extract per chromosome, slow for BGEN data as it has to auto-convert the entire file not just the required SNPs |
| Conduct QC | No | Yes, it computes summary statistics but filtering has to be done in a separate step, and with additional tools (such as awk or R) | Yes, fast, it can compute summary statistics and apply filtering. Not all commands are suitable for use on imputed data |
| Compute PRS | No | Yes but poorly documented | Yes, with many options |
PLINK 2 commands for summary statistics and filtering.
| Function | Summary statistics | Exclusion option | Meaning |
|---|---|---|---|
| Allele frequency | | --maf [threshold] | Include SNPs with MAF above [threshold] (default = 0.01) |
| SNP call rate | | --geno [threshold] | Exclude SNPs with missing call rates exceeding [threshold] (default = 0.1) |
| Filter SNPs | | --exclude [file] | Exclude SNPs listed in [file] |
| Filter samples | | --keep [file] | Retain only the samples listed in [file]; all others are excluded |
| HWE | | --hwe [threshold] | Exclude SNPs with HWE test p-value below [threshold] |
| Linkage Disequilibrium (LD) | | | Prune using a [window] that slides across the genome [step] at a time, filtering out any SNPs with LD r² higher than [threshold] |
* Command in PLINK 1.9.
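The exclusion criteria in this table can be read as simple per-SNP rules. The sketch below is not PLINK itself: it is a plain Python illustration, assuming per-SNP summary statistics have already been computed. The `SnpStats` class, the SNP identifiers and the numbers are hypothetical. The MAF and call-rate defaults mirror those quoted in the table, while the HWE cut-off of 1e-6 is only an assumed placeholder.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical container for per-SNP summary statistics."""
    snp_id: str
    maf: float           # minor allele frequency
    missing_rate: float  # fraction of samples with no call at this SNP
    hwe_p: float         # p-value from a Hardy-Weinberg equilibrium test


def passes_qc(snp, maf_min=0.01, geno_max=0.1, hwe_min=1e-6):
    """Return True if the SNP survives all three exclusion criteria.

    maf_min and geno_max follow the defaults quoted in the table;
    hwe_min is an assumed value chosen only for illustration.
    """
    return (
        snp.maf >= maf_min                 # analogous to --maf
        and snp.missing_rate <= geno_max   # analogous to --geno
        and snp.hwe_p >= hwe_min           # analogous to --hwe
    )


# Made-up summary statistics, for illustration only.
snps = [
    SnpStats("snp_a", maf=0.20, missing_rate=0.01, hwe_p=0.5),
    SnpStats("snp_b", maf=0.004, missing_rate=0.00, hwe_p=0.9),   # too rare
    SnpStats("snp_c", maf=0.30, missing_rate=0.25, hwe_p=0.4),    # low call rate
    SnpStats("snp_d", maf=0.15, missing_rate=0.02, hwe_p=1e-9),   # fails HWE
]

kept = [s.snp_id for s in snps if passes_qc(s)]
print(kept)
```

In practice the thresholds would be chosen for the data at hand; imputed data in particular may call for different criteria from directly genotyped data.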
Five examples of possible disagreements between PRS and validation data, when data harmonisation may be required. We illustrate five different situations in the table: Perfect agreement, labelling disagreement, strand flip, strand flip and labelling disagreement, palindromic (ambiguous) SNP.
| | Scenario | PRS effect allele | PRS non-effect allele | Validation effect allele | Validation non-effect allele |
|---|---|---|---|---|---|
| 1 | Expected scenario - perfect agreement | A | C | A | C |
| 2 | PRS and validation data disagree on labelling of effect allele | A | C | C | A |
| 3 | “Strand flip” | A | C | T | G |
| 4 | Strand flip and labelling disagreement | A | C | G | T |
| 5 | Palindromic | A | T | T | A |
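The five scenarios in this table can be distinguished mechanically from the four allele labels. The following is a minimal sketch, not the authors' pipeline: the function name `classify_alleles` and the returned labels are hypothetical, and it assumes single-nucleotide alleles written as A, C, G or T.

```python
# Complementary bases on the opposite DNA strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(prs_effect, prs_other, val_effect, val_other):
    """Classify how a SNP's alleles in the validation data relate to the PRS file."""
    prs = (prs_effect, prs_other)
    val = (val_effect, val_other)
    # The validation alleles as they would read on the opposite strand.
    val_flipped = (COMPLEMENT[val_effect], COMPLEMENT[val_other])

    # A/T and C/G SNPs look identical on both strands, so a strand flip
    # cannot be told apart from a labelling swap using alleles alone.
    if COMPLEMENT[prs_effect] == prs_other:
        return "palindromic"
    if val == prs:
        return "agreement"
    if val == prs[::-1]:
        return "labelling_swap"
    if val_flipped == prs:
        return "strand_flip"
    if val_flipped == prs[::-1]:
        return "strand_flip_and_labelling_swap"
    return "mismatch"


# The five rows of the table above.
rows = [
    ("A", "C", "A", "C"),
    ("A", "C", "C", "A"),
    ("A", "C", "T", "G"),
    ("A", "C", "G", "T"),
    ("A", "T", "T", "A"),
]
for row in rows:
    print(row, classify_alleles(*row))
```

Where the labelling is swapped, a common remedy is to reverse the sign of the effect size (or equivalently count the other allele). Palindromic SNPs need extra information, such as allele frequencies, or are excluded.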
FIGURE 2. Summary of quality control and alignment steps.
Hard-call vs. allelic dosages: genotype probability trios and allelic and hard-called dosages for 2 SNPs of a theoretical individual.
| | P(AA) | P(AB) | P(BB) | Allelic dosage (B) | Hard-call dosage (B) |
|---|---|---|---|---|---|
| SNP1 | 0.22 | 0.50 | 0.28 | 1.06 | 1 |
| SNP2 | 0.02 | 0.90 | 0.08 | 1.06 | 1 |
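The dosages in this table follow directly from the genotype probabilities. The sketch below reproduces them, assuming that the allelic dosage is the expected count of allele B and that a hard call simply takes the most probable genotype. Some tools instead apply a certainty threshold and set uncertain calls to missing. The `weighted_score` helper shows the usual weighted-sum form of a score; its weights are hypothetical and not taken from any published PRS.

```python
def allelic_dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles: 0*P(AA) + 1*P(AB) + 2*P(BB)."""
    return p_ab + 2 * p_bb


def hard_call_dosage(p_aa, p_ab, p_bb):
    """Number of B alleles in the single most probable genotype."""
    probs = (p_aa, p_ab, p_bb)  # index = number of B alleles
    return max(range(3), key=lambda n_b: probs[n_b])


def weighted_score(dosages, weights):
    """Sum of dosage * effect size across SNPs."""
    return sum(d * w for d, w in zip(dosages, weights))


# Genotype probability trios from the table above.
snp1 = (0.22, 0.50, 0.28)
snp2 = (0.02, 0.90, 0.08)

for name, trio in (("SNP1", snp1), ("SNP2", snp2)):
    print(name, f"{allelic_dosage(*trio):.2f}", hard_call_dosage(*trio))

# Hypothetical effect sizes, for illustration only.
weights = [0.10, -0.20]
score_allelic = weighted_score([allelic_dosage(*snp1), allelic_dosage(*snp2)], weights)
score_hard = weighted_score([hard_call_dosage(*snp1), hard_call_dosage(*snp2)], weights)
print(f"{score_allelic:.3f} {score_hard:.3f}")
```

Both SNPs give the same allelic dosage and the same hard call, although the genotype is far less certain for SNP1 than for SNP2. Neither summary on its own conveys that difference in certainty.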
FIGURE 3. Flowchart showing quality control exclusions in worked example of LDL-C PRS in UK Biobank data.
FIGURE 4. Imputation information against beta of each SNP in LDL-C PRS. Navy dashed line is our imputation information threshold of 0.4, and SNPs are coloured by our MAF threshold of 0.005.
FIGURE 5. Histogram of LDL-C PRS with overlaid density plot.
FIGURE 6. Association between LDL-C PRS and measured LDL-C at baseline among genetically White British UK Biobank participants.
FIGURE 7. Comparison of PRS calculated using allelic and hard-call dosages (Pearson’s correlation coefficient = 0.99). PRS used was the 223-SNP score from (Klarin et al., 2018), with hard-called dosage approach from (Trinder et al., 2020b).
Comparison of times taken. Absolute times will vary with the computing power of the system used; our interest is in the relative performance of the tools.
| Step | bgenix | QCTOOL v2 | PLINK 2 |
|---|---|---|---|
| 223 variants | |||
| SNP extraction | 53 s | 2,696 s | 18,403 s |
| QC | — | 795 s | 7 s |
| PRS calculation | — | — | 1 s |
| 100 k variants | |||
| SNP extraction | 2,681 s | >108 k s (exceeded 30 h limit) | 20,821 s |
| QC | — | 7,942 s | 76 s |
| PRS calculation | — | — | 256 s |