Andries T Marees1,2,3,4,5, Hilde de Kluiver6, Sven Stringer7, Florence Vorspan1,2,3,4,8,9, Emmanuel Curis3,10,11, Cynthia Marie-Claire2,3,4, Eske M Derks1,5.
Abstract
OBJECTIVES: Genome-wide association studies (GWAS) have become increasingly popular for identifying associations between single nucleotide polymorphisms (SNPs) and phenotypic traits. The GWAS method is commonly applied within the social sciences. However, the statistical analyses need to be conducted carefully, and dedicated genetics software is required. This tutorial aims to provide a guideline for conducting genetic analyses.
Keywords: GitHub; PLINK; genome-wide association study (GWAS); polygenic risk score (PRS); tutorial
Year: 2018 PMID: 29484742 PMCID: PMC6001694 DOI: 10.1002/mpr.1608
Source DB: PubMed Journal: Int J Methods Psychiatr Res ISSN: 1049-8931 Impact factor: 4.035
Figure 1. Overview of various commonly used PLINK files. SNP = single nucleotide polymorphism.
Figure 2. Structure of the PLINK command line. *Not all shells will show this. **Provide the path to the directory where PLINK is installed if this is not in the current directory (e.g., /usr/local/bin/plink). Note that this example command was generated using PuTTY, a free SSH and Telnet client. When using other resources, there might be small graphical variations; however, the basic structure of a PLINK command will be identical.
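The command structure described in the Figure 2 caption (program path, input data, options, output) can be illustrated with a minimal Python sketch that only assembles the argument list and does not run PLINK. The helper `build_plink_command` and the file prefixes `mydata` and `mydata_geno` are hypothetical names introduced for illustration; the sketch assumes PLINK 1.9-style binary input via `--bfile` and output via `--make-bed --out`.

```python
import shlex


def build_plink_command(bfile, options, out, plink="plink"):
    """Assemble a PLINK argument list: program, input, options, output.

    `plink` may be a full path (e.g. /usr/local/bin/plink) when the
    executable is not in the current directory.
    """
    cmd = [plink, "--bfile", bfile]
    for flag, value in options:
        cmd.append(flag)
        if value is not None:  # some flags take no value
            cmd.append(str(value))
    cmd += ["--make-bed", "--out", out]
    return cmd


# Hypothetical file prefixes; --geno 0.2 is the relaxed SNP missingness filter.
cmd = build_plink_command("mydata", [("--geno", 0.2)], "mydata_geno")
print(shlex.join(cmd))  # plink --bfile mydata --geno 0.2 --make-bed --out mydata_geno
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where PLINK is installed.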
Overview of seven QC steps that should be conducted prior to genetic association analysis
| Step | Command | Function | Thresholds and explanation |
|---|---|---|---|
| 1: Missingness of SNPs and individuals | --geno | Excludes SNPs that are missing in a large proportion of the subjects. In this step, SNPs with low genotype call rates are removed. | We recommend first filtering SNPs and individuals with a relaxed threshold (0.2; >20%), as this removes SNPs and individuals with very high levels of missingness. A more stringent threshold (0.02) can then be applied. Note that SNP filtering should be performed before individual filtering. |
| | --mind | Excludes individuals who have high rates of genotype missingness. In this step, individuals with low genotype call rates are removed. | |
| 2: Sex discrepancy | --check-sex | Checks for discrepancies between the sex of the individuals recorded in the dataset and their sex based on X chromosome heterozygosity/homozygosity rates. | Can indicate sample mix‐ups. If many subjects have this discrepancy, the data should be checked carefully. Males should have an X chromosome homozygosity estimate >0.8 and females should have a value <0.2. |
| 3: Minor allele frequency (MAF) | --maf | Includes only SNPs above the set MAF threshold. | SNPs with a low MAF are rare; power is therefore lacking for detecting SNP‐phenotype associations. These SNPs are also more prone to genotyping errors. The MAF threshold should depend on your sample size: larger samples can use lower MAF thresholds. Respectively, for large ( |
| 4: Hardy–Weinberg equilibrium (HWE) | --hwe | Excludes markers which deviate from Hardy–Weinberg equilibrium. | Common indicator of genotyping error; may also indicate evolutionary selection. For For |
| 5: Heterozygosity | For an example script see | Excludes individuals with high or low heterozygosity rates. | Deviations can indicate sample contamination or inbreeding. We suggest removing individuals who deviate ±3 SD from the sample's mean heterozygosity rate. |
| 6: Relatedness | --genome | Calculates identity by descent (IBD) of all sample pairs. | Use independent SNPs ( |
| | --min | Sets a threshold and creates a list of individuals with relatedness above the chosen threshold. This means that subjects who are related at, for example, pi‐hat >0.2 (i.e., second degree relatives) can be detected. | Cryptic relatedness can interfere with the association analysis. If you have a family‐based sample (e.g., parent‐offspring), you do not need to remove related pairs, but the statistical analysis should take family relatedness into account. However, for a population‐based sample we suggest using a pi‐hat threshold of 0.2, which is in line with the literature (Anderson et al., |
| 7: Population stratification | --genome | Calculates identity by descent (IBD) of all sample pairs. | Use independent SNPs ( |
| | --cluster --mds-plot | Produces a | |
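As an illustration of steps 1 and 5 in the table, the following is a minimal Python sketch of the underlying logic, not a replacement for PLINK. It assumes genotypes are coded as minor-allele counts (0, 1, 2) with `None` for a missing call, and it computes heterozygosity directly from those calls; the function names and the toy data are hypothetical.

```python
from statistics import mean, stdev


def missing_rate(calls):
    """Fraction of missing calls (None); 0.0 for an empty list."""
    if not calls:
        return 0.0
    return sum(g is None for g in calls) / len(calls)


def filter_missingness(genotypes, threshold):
    """Drop SNPs first, then individuals, whose missing rate exceeds threshold.

    genotypes: dict mapping sample ID -> list of calls (0/1/2 or None).
    Returns the retained samples (with retained SNPs only) and the indices
    of the retained SNPs. The order mirrors --geno before --mind.
    """
    samples = list(genotypes)
    n_snps = len(genotypes[samples[0]])
    keep_snps = [
        j for j in range(n_snps)
        if missing_rate([genotypes[s][j] for s in samples]) <= threshold
    ]
    reduced = {s: [genotypes[s][j] for j in keep_snps] for s in samples}
    kept = {s: c for s, c in reduced.items() if missing_rate(c) <= threshold}
    return kept, keep_snps


def heterozygosity_rate(calls):
    """Share of heterozygous calls (1) among non-missing calls.

    Assumes at least one non-missing call.
    """
    observed = [g for g in calls if g is not None]
    return sum(g == 1 for g in observed) / len(observed)


def heterozygosity_outliers(rates, n_sd=3.0):
    """Sample IDs whose rate deviates more than n_sd SDs from the mean."""
    values = list(rates.values())
    mu, sd = mean(values), stdev(values)
    return sorted(s for s, r in rates.items() if abs(r - mu) > n_sd * sd)


# Toy data (hypothetical): SNP index 2 is almost entirely missing.
genotypes = {
    "s1": [0, 1, None, 2],
    "s2": [1, 1, None, 0],
    "s3": [0, 2, None, 1],
    "s4": [2, 0, 1, 1],
    "s5": [1, 0, None, 2],
    "s6": [None, 0, None, None],
}
kept, keep_snps = filter_missingness(genotypes, threshold=0.2)
print(keep_snps)     # [0, 1, 3]
print(sorted(kept))  # ['s1', 's2', 's3', 's4', 's5']

# Twenty typical samples and one with a much higher heterozygosity rate.
rates = {f"id{i}": 0.30 for i in range(20)}
rates["odd"] = 0.60
print(heterozygosity_outliers(rates))  # ['odd']
```

In the toy data, filtering individuals first would have removed s1, s2, s3, and s5 as well, because the single poorly genotyped SNP gives each of them a missing rate of 0.25. This is one way to see why SNP filtering is recommended before individual filtering.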
Figure 3. Multidimensional scaling (MDS) plot of 1KG against the CEU of the HapMap data (which could be seen as your “own” data in this example, as it is used in the online tutorial at https://github.com/MareesAT/GWA_tutorial/). The black crosses (+ = “OWN”) in the upper left part represent the first two MDS components of the individuals in the HapMap sample; the colored symbols represent the 1KG data (European, African, Ad Mixed American, and Asian). The MDS components representing the European samples are located in the upper left, the African samples in the upper right, the Ad Mixed American samples near the intersection point of the dashed lines, and the Asian samples in the lower left part.
Figure 4. Working example of three single nucleotide polymorphisms (SNPs) aggregated into a single individual polygenic risk score (PRS). *The weight is either the beta or the log of the odds‐ratio, depending on whether a continuous or binary trait is analysed.
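The aggregation described in the Figure 4 caption can be sketched in a few lines of Python: each SNP contributes its risk-allele count multiplied by its weight, and the contributions are summed per individual. The SNP identifiers, odds ratios, and allele counts below are hypothetical values chosen for illustration and are not taken from the figure; a binary trait is assumed, so the weights are log odds ratios.

```python
import math


def polygenic_risk_score(allele_counts, weights):
    """Sum over SNPs of (risk-allele count x weight) for one individual."""
    return sum(allele_counts[snp] * w for snp, w in weights.items())


# Hypothetical odds ratios from a binary-trait GWAS; weight = log(OR).
odds_ratios = {"rs_a": 1.2, "rs_b": 0.9, "rs_c": 1.5}
weights = {snp: math.log(odds) for snp, odds in odds_ratios.items()}

# Hypothetical risk-allele counts (0, 1, or 2) for one individual.
counts = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

score = polygenic_risk_score(counts, weights)
print(round(score, 4))
```

For a continuous trait, the same function would be called with the GWAS betas as weights instead of log odds ratios.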