Shilin Zhao, Limin Jiang, Hui Yu, Yan Guo.
Abstract
Genotyping arrays are the most economical approach for conducting large-scale genome-wide genetic association studies. Thorough quality control is key to generating high-integrity genotyping data and robust results. Quality control of genotyping array data is generally a complicated process, as it requires intensive manual labor to implement the established protocols and curate a comprehensive quality report. There is an urgent need to reduce manual intervention through an automated quality control process. Based on previously established protocols and strategies, we developed an R package, GTQC (GenoTyping Quality Control), to automate a majority of the quality control steps for general array genotyping data. GTQC covers a comprehensive spectrum of genotype data quality metrics and produces a detailed HTML report comprising tables and figures. Here, we describe the concepts underpinning GTQC and demonstrate its effectiveness using a real genotyping dataset. By streamlining a majority of the quality control steps and reporting on a wide range of quality control metrics, GTQC enables a swift and rigorous data quality inspection prior to downstream GWAS and related analyses. By significantly reducing the time spent on genotyping quality control procedures, GTQC helps make full use of available resources and minimizes wasted or inefficiently allocated manual effort. GTQC can be accessed at https://github.com/slzhao/GTQC. © The author(s).
Keywords: genotyping; microarray; quality control
Year: 2022 PMID: 35300047 PMCID: PMC8922302 DOI: 10.7150/jgen.69860
Source DB: PubMed Journal: J Genomics
Figure 1. Trends of genotype-centric studies and data in the GWAS and post-GWAS era (1998-2020). Despite a decline after the peak year of 2015, genotyping array-based publications still far outnumber exome sequencing-based publications. The growth of dbSNP is in line with the growth of genotyping and exome sequencing publications. The left y-axis denotes the number of publications (red), the right y-axis denotes the number of SNPs in dbSNP (blue), and the x-axis denotes the year.
Figure 2. Quality control results generated by GTQC on TCGA's LIHC genotyping dataset. A. Summary of Sample Call Rate and SNP Call Rate at varied levels. B. Histogram of MAF. C. Scatter plot of MAF between TCGA's LIHC cohort and the IGSR project, with only Asian subjects included. Each dot denotes the paired MAF values of a common SNP in the two cohorts. A high correlation indicates better data quality. D. Histogram of the chromosome X inbreeding estimate, produced by GTQC's gender check of the example dataset. The gender check results help identify mislabeled gender information and recover gender where it is missing. Reported female cases with an inbreeding coefficient >0.2 and reported male cases with an inbreeding coefficient <0.98 are flagged as having problematic gender (yellow). E. Scatter plot of PC1 and PC2 from principal component (PC) analysis. Race clusters can be clearly identified within the plot. Samples located between clusters are suspected hybrids. F. Histogram of adjusted p-values from the HWE check. The red dotted line denotes the threshold for statistical significance (0.05). G. Box and violin plots of Heterozygosity Ratio stratified by race. Due to the limited sample size, American Indians were not plotted. Subjects with unknown race information were also excluded. A disparity in Heterozygosity Ratio across groups can be seen: Asian subjects show the lowest ratios, while Black subjects show the highest.
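As a rough illustration of three of the metrics shown in Figure 2 (call rate, MAF, and the gender-check flag), the following minimal Python sketch may help. It is not GTQC's code and does not use GTQC's API: the function names, the 0/1/2 genotype coding with `None` for a missing call, and the `"female"`/`"male"` labels are all assumptions made for illustration. Only the 0.2 and 0.98 inbreeding-coefficient thresholds are taken from the caption of panel D.

```python
from typing import Optional, Sequence

# Assumed coding (not from GTQC): 0, 1, or 2 copies of the alternate allele,
# with None marking a missing (no-call) genotype.
Genotype = Optional[int]


def call_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of genotypes that were successfully called."""
    if not genotypes:
        raise ValueError("empty genotype vector")
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def minor_allele_frequency(genotypes: Sequence[Genotype]) -> float:
    """Frequency of the less common allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    # Each called sample contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def is_problematic_gender(reported: str, x_inbreeding: float) -> bool:
    """Apply the flagging rule described for Figure 2D.

    Reported females with a chromosome X inbreeding coefficient > 0.2 and
    reported males with a coefficient < 0.98 are flagged.
    """
    if reported == "female":
        return x_inbreeding > 0.2
    if reported == "male":
        return x_inbreeding < 0.98
    raise ValueError(f"unsupported gender label: {reported!r}")


if __name__ == "__main__":
    # Hypothetical genotypes for one SNP across eight samples.
    snp = [0, 1, 2, None, 0, 0, 1, 0]
    print(call_rate(snp))                 # 7 of 8 genotypes called
    print(minor_allele_frequency(snp))    # 4 alternate alleles out of 14
    print(is_problematic_gender("female", 0.5))
    print(is_problematic_gender("male", 0.99))
```

The same `call_rate` function applies to either axis of a genotype matrix: passing one SNP's genotypes across samples gives the SNP call rate, and passing one sample's genotypes across SNPs gives the sample call rate.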