Ting-Yuan Liu1, Chih-Fan Lin2, Hsing-Tsung Wu2, Ya-Lun Wu2, Yu-Chia Chen1, Chi-Chou Liao1, Yu-Pao Chou1, Dysan Chao1, Ya-Sian Chang1,3, Hsing-Fang Lu4, Jan-Gowth Chang1,3, Kai-Cheng Hsu2,5,6, Fuu-Jen Tsai7,8,9,10.
Abstract
A genome-wide association study (GWAS) can be conducted to systematically analyze the contributions of genetic factors to a wide variety of complex diseases. However, the results of existing GWASs are highly ethnicity-specific. Accordingly, to provide data specific to Taiwan, we established a large-scale genetic database in a single medical institution, the China Medical University Hospital. Because of current technological limitations, microarray analysis can detect only a limited number of single-nucleotide polymorphisms (SNPs) with a minor allele frequency of >1%. Imputation offers a useful means of expanding these data. In this study, we compared four imputation algorithms in terms of calculation speed, storage space, specificity, accuracy, and number of high-quality variants. Among the compared algorithms, Beagle5.2 achieved the fastest calculation speed, smallest storage space, highest specificity, and highest number of high-quality variants. Using Beagle5.2, we obtained 15,277,414 high-quality variants in 175,871 people. In our internal verification process, Beagle5.2 exhibited an accuracy rate of up to 98.75%. In external verification, our imputed variants had a 79.91% mapping rate and 90.41% accuracy. These results will be combined with clinical data in future research. We have made the results available for researchers to use in formulating imputation algorithms, in addition to establishing a complete SNP database for GWAS and PRS researchers. We believe that these data can help improve overall medical capabilities, particularly precision medicine, in Taiwan. © the Author(s).
Keywords: CMUH genetic biobank; Imputation; SNP array; Whole genome sequencing
Year: 2021 PMID: 35223420 PMCID: PMC8823485 DOI: 10.37796/2211-8039.1302
Source DB: PubMed Journal: Biomedicine (Taipei) ISSN: 2211-8020
Fig. 1 Overview of study pipeline. WGS data of TWB and EAS were used for model construction. For the TWB data, 100 WGS data items were in the validation cohorts and 1363 WGS data items were in the reference cohorts. For the 1000 Genomes Project data, 504 EAS WGS data items were obtained.
Comparison of imputation algorithms. Asterisks indicate the best value in each row.
| Metric | IMPUTE2 (WE) | IMPUTE2 (W) | IMPUTE4 | IMPUTE5 | Beagle5.2 |
|---|---|---|---|---|---|
| Imputation Time (min) | 133 | 8.5 | 1.68 | 1.22 | 0.68* |
| Storage (Gb) | 26 | 23 | 14.5 | 1.5 | 1* |
| Total Imputed Variants | 16,298,564* | 14,757,187 | 14,763,606 | 15,548,597 | 15,471,490 |
| Intersection with WGS | 13,218,326 | 13,208,509 | 13,212,007 | 15,471,490 | 15,471,490* |
| Extra | 3,080,238 | 1,548,678 | 1,551,599 | 77,107 | NA* |
| Specificity | 0.8110 | 0.8951 | 0.8949 | 0.9950 | 1.0000* |
| Accuracy | 0.9973 | 0.9971 | 0.9976* | 0.9873 | 0.9875 |
| High Quality Variants | 13,182,597 | 13,169,683 | 13,180,755 | 15,275,732 | 15,277,414* |
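
The record does not define how the "Extra" and "Specificity" rows were computed. The minimal sketch below assumes that "Extra" is the number of imputed variants absent from the WGS data (total minus intersection) and that "Specificity" is the intersection divided by the total. Under that assumption the tabulated values are reproduced to four decimal places. The function name and the `TABLE` dictionary are illustrative only and are not part of the study pipeline.

```python
# Counts copied from the table above:
# (total imputed variants, variants intersecting with WGS).
TABLE = {
    "IMPUTE2 (WE)": (16_298_564, 13_218_326),
    "IMPUTE2 (W)": (14_757_187, 13_208_509),
    "IMPUTE4": (14_763_606, 13_212_007),
    "IMPUTE5": (15_548_597, 15_471_490),
    "Beagle5.2": (15_471_490, 15_471_490),
}


def extra_and_specificity(total, intersection):
    """Return (extra, specificity) under the ASSUMED definitions:
    extra = total - intersection, specificity = intersection / total."""
    extra = total - intersection
    specificity = intersection / total
    return extra, round(specificity, 4)


if __name__ == "__main__":
    for name, (total, intersection) in TABLE.items():
        extra, spec = extra_and_specificity(total, intersection)
        print(f"{name:13s} extra={extra:>9,d}  specificity={spec:.4f}")
```

For Beagle5.2 the sketch yields zero extra variants, which corresponds to the "NA" entry in the table.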
Fig. 2 Imputation accuracy rates and number of imputed variants. (A) Accuracy breakdown of whole-genome imputation per chromosome for each imputation algorithm; (B) intersection of imputed genotype with WGS ground truth.
Fig. 3 Variant distributions of MAF. (A) MAF of TPMv1 variants; (B) MAF of WGS reference panel variants; (C) MAF of imputed variants.
Fig. 4 R² and concordance by MAF. (A) R² of imputed SNP array data; (B) concordance of imputed SNP array data. The horizontal axis represents MAF. The vertical axis represents R² and concordance.
Fig. 5 External WGS data for verifying the imputation results. The horizontal axis represents MAF. The left vertical axis represents mapping rate and accuracy for the line chart. The right vertical axis represents allele count for the graph.