Eric Reed, Sara Nunez, David Kulp, Jing Qian, Muredach P Reilly, Andrea S Foulkes.
Abstract
This tutorial is a learning resource that outlines the basic process and provides specific software tools for implementing a complete genome-wide association analysis. Approaches to post-analytic visualization and interrogation of potentially novel findings are also presented. Applications are illustrated using the free and open-source R statistical computing and graphics software environment, Bioconductor software for bioinformatics, and the UCSC Genome Browser. Complete genome-wide association data on 1401 individuals across 861,473 typed single nucleotide polymorphisms from the PennCATH study of coronary artery disease are used for illustration. All data and code, as well as additional instructional resources, are publicly available through the Open Resources in Statistical Genomics project: http://www.stat-gen.org.
Keywords: Bioconductor; Hardy-Weinberg equilibrium (HWE); IBD; Manhattan plot; Q-Q plot; R code; SNP filtering; UCSC Genome Browser; ancestry; call rate; genome-wide association (GWA) study; heatmap; heterozygosity; imputation; lambda statistic; minor allele frequency (MAF); parallel processing; principal component analysis (PCA); regional association plot; relatedness; sample filtering; statistical genomics; substructure; tutorial
Year: 2015 PMID: 26343929 PMCID: PMC5019244 DOI: 10.1002/sim.6605
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
Figure 1. Genome‐wide association (GWA) analysis workflow. GWA analysis is composed of 10 essential steps that fall into four broadly defined categories, as illustrated in this figure. Additional detail on the structure of the data files, particularly the relationship of the .ped and .map files with the .bim, .bed, and .fam files, is provided in Figure 2. This workflow is based on a single GWA analysis and may be modified in the context of a large collaborative meta‐analysis involving the combination of multiple GWA studies that require harmonization. Additional detail on typical modifications in this context is provided in Section 6. *Substructure, also referred to as population admixture and population stratification, refers to the presence of genetic diversity (e.g., different allele frequencies) within an apparently homogeneous population that is due to population genetic history (e.g., migration, selection, and/or ethnic integration).
Figure 2. Genome‐wide association data files. GWA data files are typically organized into either .ped and .map files or .bim, .bed, and .fam files. PLINK converts .ped and .map files into .bim, .bed, and .fam files. The latter set is substantially smaller because the .bed file contains a binary version of the genotype data. R can read in either set of files, although the latter is preferable.
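To make the size advantage of the binary .bed file concrete, the following is a minimal sketch of how genotypes are packed in the PLINK 1 SNP-major .bed layout: a three-byte header followed by two bits per genotype, four samples per byte, with the first sample in the lowest-order bits. The function name `decode_bed` and the choice to report genotypes as counts of the second .bim allele are illustrative assumptions, not part of the tutorial's own R code, which reads these files through R packages instead.

```python
# Illustrative sketch only: decode an in-memory PLINK 1 .bed payload.
# `decode_bed` is a hypothetical helper, not a function from any library.

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # header bytes of a SNP-major .bed file

# Two-bit genotype codes in the PLINK 1 binary format:
#   00 = homozygous for allele 1, 01 = missing,
#   10 = heterozygous,            11 = homozygous for allele 2.
# Assumption for this sketch: report the count of allele 2, None if missing.
CODE_TO_A2_COUNT = {0b00: 0, 0b01: None, 0b10: 1, 0b11: 2}


def decode_bed(data, n_samples, n_variants):
    """Return a list (one entry per variant) of per-sample allele-2 counts."""
    if data[:3] != BED_MAGIC:
        raise ValueError("not a SNP-major PLINK .bed payload")
    # Each variant occupies a whole number of bytes: 4 samples per byte.
    bytes_per_variant = (n_samples + 3) // 4
    if len(data) != 3 + bytes_per_variant * n_variants:
        raise ValueError("payload size does not match sample/variant counts")

    genotypes = []
    for v in range(n_variants):
        start = 3 + v * bytes_per_variant
        block = data[start:start + bytes_per_variant]
        row = []
        for s in range(n_samples):
            # Sample s sits in byte s // 4, at bit offset 2 * (s % 4).
            code = (block[s // 4] >> (2 * (s % 4))) & 0b11
            row.append(CODE_TO_A2_COUNT[code])
        genotypes.append(row)
    return genotypes


# Five samples and one variant fit in two genotype bytes.
example = bytes([0x6C, 0x1B, 0x01, 0x78, 0x03])
decoded = decode_bed(example, n_samples=5, n_variants=1)
```

In this layout, the sample and variant counts are not stored in the .bed file itself; they come from the number of lines in the accompanying .fam and .bim files.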
Figure 5. Heatmap and regional association plots. Heatmap (top) illustrating linkage disequilibrium (LD) between typed (black) and imputed (red) single nucleotide polymorphisms (SNPs) in the cholesteryl ester transfer protein (CETP) region. A total of two typed SNPs and 16 imputed SNPs are significant at the less stringent 5 × 10⁻⁶ threshold; however, the heatmap only illustrates imputed SNPs with a posterior probability of 1 for the associated genotype. We observe the presence of two distinct LD blocks within the CETP gene region, with high levels of LD between SNPs within each block and lower LD between SNPs across the two blocks. A related regional association plot (bottom) illustrates association levels and LD for a larger window surrounding CETP.
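Pairwise LD values of the kind displayed in such a heatmap are often summarized as r². As a minimal sketch, assuming genotypes coded as 0/1/2 allele counts with `None` for missing calls, r² can be approximated by the squared Pearson correlation between the allele counts of two SNPs. The function name `ld_r2` is hypothetical, and this genotype-correlation approximation is not necessarily the estimator used to produce the figure.

```python
# Illustrative sketch only: approximate LD r^2 from unphased genotype counts.
# `ld_r2` is a hypothetical helper name.

def ld_r2(g1, g2):
    """Squared Pearson correlation of allele counts at two SNPs.

    g1, g2: equal-length sequences of 0/1/2 allele counts (None = missing).
    Samples missing at either SNP are dropped before computing r^2.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two samples with both genotypes called")

    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return (cov * cov) / (var_a * var_b)


# Two SNPs that always vary together are in complete LD under this measure.
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [2, 1, 0, 2, 1, 0]
r2_linked = ld_r2(snp_a, snp_b)

# Two SNPs whose allele counts are uncorrelated in the sample.
r2_unlinked = ld_r2([0, 0, 2, 2], [0, 2, 0, 2])
```

Computing this value for every pair of SNPs in a region yields the symmetric matrix that a heatmap visualizes; blocks of high r² correspond to the LD blocks described in the caption.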
Figure 6. UCSC Genome Browser with specified tracks open.
Figure 3. Manhattan plot of genome‐wide association analysis results. This figure illustrates the level of statistical significance (y‐axis), as measured by the negative log of the corresponding p‐value, for each single nucleotide polymorphism (SNP). Each typed SNP is indicated by a grey or black dot. SNPs are arranged by chromosomal location (x‐axis). Imputation was performed on chromosome 16 only using 1000 Genomes data, and imputed SNPs are indicated by blue dots. None of the SNPs reached the Bonferroni level of significance (p < 5 × 10⁻⁸; solid horizontal line); however, two typed SNPs and 22 imputed SNPs (on chromosome 16) were suggestive of association (p < 5 × 10⁻⁶; dashed horizontal line).
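The two horizontal lines in a Manhattan plot can be expressed as a simple calculation. The sketch below assumes the y‐axis uses a base-10 logarithm (the usual convention, though the caption says only "negative log") and uses the two thresholds quoted in the caption. The names `manhattan_height`, `classify`, and `bonferroni_threshold` are hypothetical.

```python
import math

# Thresholds quoted in the Figure 3 caption.
GENOME_WIDE = 5e-8   # solid line
SUGGESTIVE = 5e-6    # dashed line


def manhattan_height(p_value):
    """Height of a SNP on the y-axis, assuming a base-10 logarithm."""
    return -math.log10(p_value)


def classify(p_value):
    """Label a p-value relative to the two caption thresholds."""
    if p_value < GENOME_WIDE:
        return "genome-wide significant"
    if p_value < SUGGESTIVE:
        return "suggestive"
    return "not significant"


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that controls the family-wise error rate at alpha."""
    return alpha / n_tests


# A strict Bonferroni correction at alpha = 0.05 over the 861,473 typed SNPs
# mentioned in the abstract gives roughly 5.8e-8, close to the conventional
# 5e-8 line drawn in the figure.
typed_snp_threshold = bonferroni_threshold(0.05, 861473)
labels = [classify(p) for p in (1e-9, 1e-6, 0.01)]
```

On this scale the solid line sits at about 7.3 and the dashed line at about 5.3, so a SNP counts as suggestive when its point falls between the two lines.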
Figure 4. Quantile–quantile plots for quality control check and visualizing crude association. Quantile–quantile plots illustrate the relationship between observed (y‐axis) and expected (x‐axis) test statistics and are used as a tool for visualizing appropriate control of population substructure and the presence of association. The left panel (a) is based on an unadjusted model, where the deviation is below expected, while the right panel (b) is based on a model adjusted for potential confounders, which brings the tail closer to the y = x line. The extreme observed statistics are suggestive of association. Data generally falling on the y = x line suggest no clear systematic bias. Unstandardized λ's are reported. PCs, principal components.
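The λ statistic reported alongside a Q-Q plot is commonly defined as the ratio of the median observed test statistic to the median expected under the null. The sketch below uses that common definition for 1-degree-of-freedom chi-square statistics recovered from p-values; it is an assumption that this matches the computation behind the figure, and the names `p_to_chi2` and `genomic_inflation` are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the square of the 75th percentile of the standard normal.
EXPECTED_MEDIAN_CHI2 = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p_value):
    """Convert a two-sided p-value (0 < p < 1) to a 1-df chi-square statistic."""
    return _STD_NORMAL.inv_cdf(1.0 - p_value / 2.0) ** 2


def genomic_inflation(p_values):
    """Median observed chi-square divided by its expected median under the null."""
    return median(p_to_chi2(p) for p in p_values) / EXPECTED_MEDIAN_CHI2


# Evenly spaced p-values mimic the null distribution, so lambda is about 1.
null_p = [i / 1000 for i in range(1, 1000)]
lambda_null = genomic_inflation(null_p)

# Systematically smaller p-values inflate the statistic above 1.
lambda_inflated = genomic_inflation([p * p for p in null_p])
```

A λ near 1 corresponds to points lying along the y = x line, while values well above 1 suggest residual confounding such as uncorrected population substructure.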
Table. Example data types and select resources for post‐analytic interrogation. *Listed resources are intended to provide primary examples and are not comprehensive. [a] National Center for Biotechnology Information (NCBI) dbSNP; [b] ENSEMBL Genome Browser; [c] NCBI RefSeq; [d] NCBI GenBank; [e] The Encyclopedia of DNA Elements (ENCODE) Project; [f] NIH Roadmap Epigenomics Project; [g] GTEx Portal; [h] NCBI Sequence Read Archive (SRA); [i] The Universal Protein Resource Knowledgebase (UniProtKB); [j] The Human Metabolome Database (HMDB).
| Example data types | Select data sources* | UCSC Genome Browser navigation |
|---|---|---|
| | | |
| | | |
| (1) SNPs | NCBI dbSNP[a], ENSEMBL[b] | Variation: Common SNPs(141) |
| (2) Insertions and deletions (INDELs) | | |
| (3) Copy number variants (CNVs) | | |
| | | |
| (1) Protein‐coding genes | NCBI RefSeq[c], NCBI GenBank[d], ENSEMBL[b] | Genes and Gene Predictions: UCSC Genes |
| (2) Non‐protein‐coding genes | NCBI RefSeq[c], NCBI GenBank[d], ENSEMBL[b] | Genes and Gene Predictions: UCSC Genes |
| | | |
| | | |
| (1) DNase hypersensitivity (DNase‐seq) | ENCODE[e], ENSEMBL[b] | Regulation: ENCODE Regulation |
| (2) FAIRE sequencing | ENCODE[e], ENSEMBL[b] | Regulation: ENC DNase/FAIRE |
| | | |
| (1) Methylation promoter marks | ENCODE[e], NIH Roadmap Epigenomics[f] | Regulation: ENCODE Regulation |
| (2) Methylation enhancer marks | ENCODE[e], NIH Roadmap Epigenomics[f] | Regulation: ENCODE Regulation |
| (3) Acetylation marks (e.g. H3K27Ac histone mark) | ENCODE[e], NIH Roadmap Epigenomics[f] | Regulation: ENCODE Regulation |
| | | |
| (1) ChIP‐seq data | ENCODE[e], ENSEMBL[b], custom | Regulation: ENCODE Regulation |
| | | |
| | | |
| (1) Historic mRNA | NCBI GenBank[d] | mRNA and EST: Human mRNAs |
| (2) Genome‐wide cell‐specific RNA data (e.g. RNA‐seq) | ENCODE[e], GTEx Portal[g], NCBI SRA[h] | Expression: ENC RNA‐seq |
| | | |
| (1) Expression quantitative trait loci (eQTL) | GTEx Portal[g], custom | N/A |
| (2) Allelic imbalance (AI); allele‐specific expression (ASE) | GTEx Portal[g], custom | N/A |
| | | |
| | | |
| (1) Proteomic (e.g. pQTLs) | UniProtKB[i] | N/A |
| (2) Metabolomic | HMDB[j] | N/A |