Abstract
BACKGROUND: Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires the same allele definition (nomenclature) and genome build across the individual studies. Imputation, commonly used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in the use of different genome builds and allele definitions. Incorrectly assuming identical allele definitions among combined GWAS leads to a large portion of discarded genotypes or to incorrect association findings. There is no published tool that predicts and converts among all major allele definitions.
Year: 2014 PMID: 25038819 PMCID: PMC4223508 DOI: 10.1186/1471-2164-15-610
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Study design and GACT functionality. The left side of the figure indicates that microarray data can be used to call SNPs in any of the four listed SNP definitions. When genotypes are obtained from public repositories (e.g. dbGaP), the allele definitions are often not immediately known to investigators. GACT predicts the allele definition and genome build, and converts to any new definition or build. Since the SNP definition in NGS data is determined during alignment to the human reference genome (Plus is a commonly used definition), the SNP alleles from genotyping microarrays can be converted and matched to those from NGS. After GACT’s conversion, imputation, meta-analysis and/or other analyses may be carried out using commonly used tools such as GWAMA, METAL, PLINK, and IMPUTE2.
Figure 2. GACT pipeline. The flow diagram shows the major procedures in the GACT design. The bottom left panel shows the prediction model of allele definitions, which is based on the distribution of each definition (Figure 3). The bottom right panel shows the allele conversion pathway among the four allele definitions. The input file to be uploaded is a PLINK-format map file. The pipeline is implemented as both a command-line tool and a web interface.
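As an illustration of the input described above, the following is a minimal sketch of reading a PLINK-format map file, which is whitespace-delimited with four columns per line (chromosome, variant identifier, genetic distance, base-pair position). The names `MapRecord` and `parse_map` and the SNP identifiers in the usage example are hypothetical; this is not GACT's own parser.

```python
from typing import Iterable, List, NamedTuple


class MapRecord(NamedTuple):
    """One line of a four-column PLINK .map file."""
    chrom: str
    snp_id: str
    genetic_distance: float
    bp_position: int


def parse_map(lines: Iterable[str]) -> List[MapRecord]:
    """Parse whitespace-delimited .map lines, skipping blank lines."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError(f"line {line_number}: expected 4 fields, got {len(fields)}")
        chrom, snp_id, distance, position = fields
        records.append(MapRecord(chrom, snp_id, float(distance), int(position)))
    return records


# Hypothetical example content; real files would be opened with open(path).
example = ["1 rs0000001 0 752566", "1\trs0000002\t0\t768448", ""]
records = parse_map(example)
print(records[0].snp_id, records[1].bp_position)
```

The base-pair positions in such a file are what make the genome build matter: the same SNP identifier can map to different coordinates in different builds.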
Genotype mismatches between the GWAS and 1000 Genomes datasets
| GWAS | 1000 Genome | Type | Fwd-Plus | Top-Plus | Plus |
|---|---|---|---|---|---|
| T/C | C/T | FLIP | 0 | 0 | 0 |
| T/C | A/G | CSF | 5,048 | 9,875 | 301 |
| T/C | G/A | FLIP & CSF | 8,556 | 27,648 | 1,840 |
| T/A | */* | AMBIG | 432 | 432 | 432 |
| */* | −/− | NAR | 3,344 | 3,344 | 3,344 |
| Matches (%) | | | 62,793 (78.3) (81.7)† | 38,875 (48.5) | 74,256 (92.6) (96.7)† |
FLIP: switch both alleles with one another (from A1 to A2 and vice versa).
CSF: complementary strand flip.
AMBIG: ambiguous SNPs in study GWAS.
NAR: not available in the reference.
*/*: any genotype.
−/−: missing genotype.
Fwd: Forward/Reverse.
Top: TOP/BOT.
Plus: Plus (+)/Minus (−).
†, percentages of matched genotypes after excluding the NAR genotype counts.
Both the “GWAS” (the 3,096 Ashkenazi Jewish samples) and “1000 Genome” columns show example alleles in A1/A2 order. The “Type” column indicates the change required to match the study SNP to the reference. The last three columns give the numbers of genotype mismatches on chromosome 1 (80,173 SNPs in total). We first generated two versions of the same GWAS data, “Fwd” and “Top”. The “Fwd-Plus” and “Top-Plus” columns show the numbers of genotype mismatches between the “Fwd” and “Top” versions of our GWAS data, respectively, and the “Plus” definition of the 1000 Genomes data. The “Plus” column shows the numbers after we converted the GWAS data to “Plus” using GACT. The last row shows the numbers (percentages) of correct genotype matches (e.g., “T/C” and “T/C”) between the GWAS and 1000 Genomes data, where (%) and (%)† represent the percentages measured by including and excluding the SNPs unique to our GWAS data (NAR), respectively. Similar ratios were observed in other chromosomes.
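The mismatch types in the table can be sketched in code. The following is a minimal illustration, not GACT's implementation: `classify` and `match_percentages` are hypothetical names, the `"MATCH"` and `"OTHER"` labels are additions for completeness, and checking NAR before AMBIG is an assumed ordering.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify(study, reference):
    """Classify a study SNP (A1, A2) against the reference (A1, A2).

    `reference` is None when the SNP is absent from the reference (NAR).
    """
    if reference is None:
        return "NAR"
    a1, a2 = study
    if COMPLEMENT[a1] == a2:
        # A/T or C/G: the strand cannot be told from the alleles alone.
        return "AMBIG"
    if study == reference:
        return "MATCH"
    if (a2, a1) == reference:
        return "FLIP"
    comp = (COMPLEMENT[a1], COMPLEMENT[a2])
    if comp == reference:
        return "CSF"
    if (comp[1], comp[0]) == reference:
        return "FLIP & CSF"
    return "OTHER"  # hypothetical label for pairs the table does not cover


def match_percentages(matches, total, nar):
    """Percent matched including and excluding the NAR SNPs."""
    return (round(100 * matches / total, 1),
            round(100 * matches / (total - nar), 1))


print(classify(("T", "C"), ("G", "A")))        # FLIP & CSF
print(match_percentages(62_793, 80_173, 3_344))  # (78.3, 81.7)
print(match_percentages(74_256, 80_173, 3_344))  # (92.6, 96.7)
```

The two calls to `match_percentages` reproduce the percentages reported in the last row for the “Fwd-Plus” and “Plus” columns from the counts given in the table.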
Figure 3. Frequencies and distributions of all possible genotypes of biallelic SNPs. The data were generated for the Plus/Minus and Forward/Reverse definitions based on the 1000 Genomes and dbSNP datasets, respectively, and for the A/B and TOP/BOT definitions based on our GWAS dataset. The prediction model of allele definitions was trained using these distributions.
Figure 4. Comparison of SNP density plots before (“Top” allele definition; black line) and after (“Plus” allele definition; red line) GACT conversion. The SNP density was measured per 500,000 bp window. The SNP count (or density) increases after GACT converts all the mismatched loci, e.g., from a median of 61.05 to a median of 117 SNPs per window. Moreover, the increase is not biased with regard to physical location, which indicates that the allele definition mismatches are uniformly distributed across the chromosome. The dotted horizontal lines represent the median of each line, matched by color. The median was used instead of the mean because it is less sensitive to outliers (e.g. zero counts in the centromere region). The “Forward/Reverse” allele definition showed a similar distribution of mismatches with the 1000 Genomes data; only the “TOP” definition is shown because of its higher level of mismatches (51.5% in “TOP” versus 21.7% in “Forward”). Other chromosomes showed similar patterns, and thus only the results for chromosome 1 are shown.
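The windowed density described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions: positions are 1-based base-pair coordinates, windows are fixed and non-overlapping, and the function name `snp_density` and the toy positions are hypothetical rather than taken from the study.

```python
from statistics import median


def snp_density(positions, chrom_length, window=500_000):
    """Count SNPs in consecutive fixed-size windows along one chromosome."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if not 1 <= pos <= chrom_length:
            raise ValueError(f"position {pos} is outside the chromosome")
        counts[(pos - 1) // window] += 1
    return counts


# Toy positions, not study data.
counts = snp_density([1, 500_000, 500_001, 1_200_000], chrom_length=1_500_000)
print(counts, median(counts))
```

Windows with no SNPs stay in the list as zeros, which is why a median is a more robust summary than a mean when regions such as the centromere contribute empty windows.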
Quality scores of the imputed (I) and study (S) SNPs for each MAF category
| .520 (.222) | .854 (.249) | .727 (.222) | .902 (.184) | .853 (.173) | .945 (.131) | .939 (.118) | .971 (.089) | .965 (.086) | .981 (.060) | .975 (.071) | .981 (.063) | ||
| .584 (.289) | .854 (.239) | .738 (.227) | .906 (.181) | .855 (.174) | .945 (.132) | .939 (.118) | .970 (.092) | .966 (.086) | .982 (.060) | .975 (.071) | .981 (.064) | ||
| .571 (.275) | .859 (.245) | .730 (.222) | .901 (.186) | .854 (.172) | .945 (.131) | .939 (.118) | .971 (.089) | .965 (.086) | .981 (.060) | .975 (.071) | .981 (.063) | ||
| .571 (.275) | .858 (.245) | .730 (.222) | .903 (.184) | .854 (.172) | .945 (.131) | .939 (.118) | .971 (.089) | .965 (.086) | .981 (.060) | .975 (.071) | .981 (.063) | ||
| .572 (.274) | .855 (.245) | .731 (.222) | .900 (.185) | .854 (.172) | .944 (.131) | .940 (.117) | .971 (.091) | .966 (.085) | .981 (.060) | .975 (.071) | .981 (.064) | ||
| .570 (.274) | .859 (.245) | .730 (.222) | .901 (.187) | .853 (.173) | .944 (.131) | .939 (.118) | .970 (.091) | .965 (.086) | .981 (.061) | .974 (.073) | .981 (.064) | ||
| .568 (.274) | .851 (.251) | .726 (.223) | .899 (.186) | .851 (.174) | .942 (.134) | .937 (.120) | .969 (.094) | .964 (.088) | .980 (.064) | .973 (.074) | .979 (.067) | ||
| .563 (.273) | .841 (.258) | .722 (.223) | .897 (.190) | .848 (.175) | .938 (.140) | .934 (.121) | .966 (.099) | .962 (.090) | .978 (.067) | .971 (.076) | .977 (.067) | ||
| .557 (.272) | .830 (.263) | .715 (.224) | .884 (.197) | .843 (.177) | .933 (.144) | .930 (.126) | .962 (.104) | .958 (.092) | .975 (.070) | .968 (.079) | .974 (.073) | ||
| .542 (.269) | .810 (.270) | .700 (.225) | .872 (.207) | .830 (.180) | .922 (.152) | .921 (.129) | .954 (.110) | .949 (.100) | .967 (.080) | .960 (.087) | .966 (.080) | ||
| .507 (.258) | .756 (.293) | .662 (.222) | .824 (.231) | .793 (.189) | .891 (.169) | .893 (.138) | .930 (.130) | .923 (.114) | .941 (.102) | .934 (.100) | .943 (.095) | ||
MAF: minor allele frequency.
NoSin: no singletons.
NoAm: no ambiguous.
NoSM: no singletons or monomorphs.
0.05-3per: after removing SNPs with genotype missing rate higher than 0.05-3%.
The quality (information) scores were generated using IMPUTE2. The mean and standard deviation are shown outside and inside the parentheses, respectively. We observed a high correlation between the imputed and study (true) genotypes, which increased from low to high MAF ranges.
Figure 5. Comparison of imputation quality of imputed SNPs. The quality score columns list three SNP minor allele frequency (MAF) categories: very rare (0.001 < MAF < 0.05), rare (0.05 < MAF < 0.1), and common (0.1 < MAF < 0.5). The results under the missing thresholds of 0.03 and 0.01 showed patterns similar to those under the threshold of 0.05, and thus are not shown. Bold indicates P < 0.05 in the Welch two-sample t-test between the missing rate of 0.05 (black line) and the other thresholds.
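For readers unfamiliar with the Welch two-sample t-test named in this caption, the following is a minimal sketch of its test statistic and the Welch-Satterthwaite degrees of freedom, using only the standard library. The function name `welch_t` and the toy samples are hypothetical, and the P value itself is not computed here; in practice a statistics package would supply it (for example, SciPy's `ttest_ind` with `equal_var=False`).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and approximate degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n).
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation; no equal-variance assumption.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Toy samples, not quality scores from the study.
t, df = welch_t([1, 2, 3, 4], [2, 4, 6, 8])
print(round(t, 3), round(df, 3))
```

Unlike Student's t-test, this form does not pool the two variances, which suits comparisons where the groups may differ in spread.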
Figure 6. Distribution of SNP missing genotypes. The green histograms represent the numbers of remaining SNPs after removing the SNPs with missing rate > 0.05% while the plain histograms represent the total numbers of SNPs (on chromosome 1). The red circles represent the fractions of SNPs that passed the threshold. It is clear that the range of the fractions is narrow (i.e. 0.3-0.5).
Comparisons of tools for genome build and allele definition conversions
| Feature | GenGen | | | PLINK | GACT |
|---|---|---|---|---|---|
| Allele definition prediction | No | No | No | No | Yes |
| Uninformed strand/allele flip¹ | No | Yes | Yes | Yes | No |
| Informed allele conversion² | Yes³ | No | No | No | Yes |
| Automatic allele conversion | Yes³ | No | No | No⁴ | Yes |
| Genome build prediction | No | No | No | No | Yes |
| Genome build conversion | No | No | No | Yes⁴ | Yes |
| Command line | Yes | Yes | Yes | Yes | Yes |
| Interactive web interface | No | No | No | No | Yes |
1. “Uninformed” refers to flipping without knowledge of the SNP allele annotation.
2. “Informed” refers to use of the original SNP definition and microarray-specific annotation information.
3. GenGen converts between the Top, Forward, A/B and 1/2 allele definitions; by comparison, GACT converts between the Top, Forward, A/B and Plus definitions. The Plus definition is used by the 1000 Genomes Project and most next-generation sequencing studies.
4. PLINK can strand- or allele-flip, but it cannot directly convert from one allele definition to another unless the user manually extracts information from the microarray annotation file; by comparison, GACT automatically converts between genome builds and allele definitions.