Daniel P Wickland1,2, Yingxue Ren1, Jason P Sinnwell3, Joseph S Reddy1, Cyril Pottier4, Vivekananda Sarangi3, Minerva M Carrasquillo4, Owen A Ross4,5, Steven G Younkin4, Nilüfer Ertekin-Taner4,6, Rosa Rademakers4, Matthew E Hudson2,7, Liudmila Sergeevna Mainzer2,7, Joanna M Biernacka3, Yan W Asmann1.
Abstract
Genetic studies have shifted to sequencing-based rare variants discovery after decades of success in identifying common disease variants by Genome-Wide Association Studies using Single Nucleotide Polymorphism chips. Sequencing-based studies require large sample sizes for statistical power and therefore often inadvertently introduce batch effects because samples are typically collected, processed, and sequenced at multiple centers. Conventionally, batch effects are first detected and visualized using Principal Components Analysis and then controlled by including batch covariates in the disease association models. For sequencing-based genetic studies, because all variants included in the association analyses have passed sequencing-related quality control measures, this conventional approach treats every variant as equal and ignores the substantial differences still remaining in variant qualities and characteristics such as genotype quality scores, alternative allele fractions (fraction of reads supporting alternative allele at a variant position) and sequencing depths. In the Alzheimer's Disease Sequencing Project (ADSP) exome dataset of 9,904 cases and controls, we discovered hidden variant-level differences between sample batches of three sequencing centers and two exome capture kits. Although sequencing centers were included as a covariate in our association models, we observed differences at the variant level in genotype quality and alternative allele fraction between samples processed by different exome capture kits that significantly impacted both the confidence of variant detection and the identification of disease-associated variants. Furthermore, we found that a subset of top disease-risk variants came exclusively from samples processed by one exome capture kit that was more effective at capturing the alternative alleles compared to the other kit. 
Our findings highlight the importance of additional variant-level quality control for large sequencing-based genetic studies. More importantly, we demonstrate that automatically filtering out variants with batch differences may lead to false negatives if the batch discordances come largely from quality differences and if the batch-specific variants have better quality.
Year: 2021 PMID: 33861770 PMCID: PMC8051815 DOI: 10.1371/journal.pone.0249305
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Principal Component (PC) eigenvector plots using genotypes of a pruned set of 16,187 high-quality common variants for 9,904 ADSP individuals.
Each data point represents a single individual. Clustering of samples for a particular variable signifies genotypic similarity between individuals for the trait represented by that color. (A) PCs of the genotypes. (B) PCs color coded based on sub-population. (C) PCs color coded based on center. (D) PCs color coded based on gender. (E) PCs color coded based on AD phenotype. As expected, clustering is apparent only by sub-population.
Seven SNPs in APOE and TOMM40 (marked with * after the SNP ID) and 29 novel SNPs reaching exome-wide significance (p < 3.0 × 10−7, Bonferroni-corrected cutoff of p < 0.05 / # tests): Population minor allele frequency (MAF) in cases and controls, MAF in controls processed by Illumina or NimbleGen exome capture kit, and MAF in the Non-Finnish European (NFE) cohort of the ExAC database (http://exac.broadinstitute.org/).
| Chr | Position | Ref | Alt | p-value (Model 1) | p-value (Model 2) | Gene | SNP ID | ADSP Cases MAF | ADSP Controls MAF | ADSP Controls MAF, Illumina | ADSP Controls MAF, NimbleGen | ExAC AAF, NFE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 45411941 | T | C | 2.4E-185 | N/A | APOE | rs429358* | 0.2293 | 0.0701 | 0.074 | 0.067 | 0.1504 |
| 19 | 45396144 | C | T | 4.7E-103 | 0.7097 | TOMM40 | rs11556505* | 0.2023 | 0.0879 | 0.085 | 0.090 | 0.0875 |
| 19 | 45395714 | T | C | 2.4E-75 | 0.1439 | TOMM40 | rs157581* | 0.2766 | 0.1629 | 0.165 | 0.162 | 0.2326 |
| 19 | 45412079 | C | T | 1.8E-48 | N/A | APOE | rs7412* | 0.042 | 0.0988 | 0.116 | 0.087 | 0.0813 |
| 15 | 75913319 | T | G | 4.3E-45 | 2.7E-35 | SNUPN | rs1004285543 | 0.0825 | 0.0255 | 0.063 | 0.001 | NA |
| 17 | 25973604 | A | C | 2.6E-45 | 1.4E-34 | LGALS9 | rs761436847 | 0.0823 | 0.0256 | 0.056 | 0.006 | 0.0906 |
| 6 | 36979483 | T | G | 1.7E-39 | 1.6E-29 | FGD2 | rs769719224 | 0.0998 | 0.0404 | 0.072 | 0.019 | 0.0294 |
| 14 | 99976645 | A | C | 4.1E-36 | 7.5E-28 | CCNK | rs745936510 | 0.0899 | 0.0348 | 0.068 | 0.013 | 0.0713 |
| 13 | 114188430 | C | T | 4.7E-38 | 2.2E-27 | TMCO3 | rs77834374 | 0.1068 | 0.0459 | 0.080 | 0.024 | 0.1336 |
| 14 | 99976639 | G | C | 9.4E-36 | 3.7E-26 | CCNK | rs778243462 | 0.0932 | 0.0372 | 0.071 | 0.015 | 0.0808 |
| 17 | 25973598 | A | C | 4.3E-25 | 1.1E-21 | LGALS9 | rs760143837 | 0.0472 | 0.014 | 0.033 | 0.002 | 0.0243 |
| 14 | 77706020 | A | C | 9.7E-28 | 6.3E-21 | TMEM63C | rs774212969 | 0.0577 | 0.019 | 0.036 | 0.008 | 0.009 |
| 19 | 45397229 | G | A | 9.6E-21 | 0.3137 | TOMM40 | rs1160983* | 0.0173 | 0.0416 | 0.045 | 0.039 | 0.0718 |
| 11 | 117280516 | A | C | 7.9E-29 | 2.5E-20 | CEP164 | rs756182128 | 0.0748 | 0.0296 | 0.050 | 0.016 | 0.081 |
| 3 | 42739737 | T | G | 5.7E-25 | 9.7E-20 | HHATL | rs763168412 | 0.0539 | 0.0182 | 0.034 | 0.008 | 0.0919 |
| 3 | 48451952 | A | C | 3.2E-24 | 2.7E-19 | PLXNB1 | rs770786389 | 0.0562 | 0.02 | 0.037 | 0.009 | 0.0255 |
| 19 | 45397307 | C | T | 1.5E-18 | 0.928 | TOMM40 | rs112849259* | 0.0308 | 0.0107 | 0.005 | 0.014 | 0.0011 |
| 12 | 56622883 | A | C | 4.3E-24 | 1.2E-17 | NABP2 | rs757798976 | 0.0714 | 0.0301 | 0.054 | 0.014 | 0.0476 |
| 2 | 85662149 | A | C | 4.0E-21 | 4.3E-16 | SH2D6 | rs748669078 | 0.068 | 0.0309 | 0.044 | 0.022 | 0.0026 |
| 19 | 10946797 | G | C | 6.4E-21 | 2.5E-15 | TMED1 | rs767166604 | 0.0421 | 0.0128 | 0.029 | 0.002 | 0.0007 |
| 14 | 105932775 | G | C | 5.5E-21 | 3.1E-15 | MTA1 | rs782227993 | 0.0627 | 0.0259 | 0.047 | 0.012 | 0.0208 |
| 6 | 29429950 | A | C | 1.1E-19 | 3.6E-15 | OR2H1 | rs746691570 | 0.0402 | 0.0132 | 0.022 | 0.007 | 0.0207 |
| 11 | 117280522 | A | C | 6.5E-21 | 3.3E-14 | CEP164 | rs758240656 | 0.0529 | 0.0198 | 0.037 | 0.009 | 0.0768 |
| 2 | 85662154 | A | C | 4.9E-18 | 4.0E-14 | SH2D6 | rs760146451 | 0.0617 | 0.0288 | 0.040 | 0.021 | 0.0018 |
| 13 | 88330245 | A | C | 3.1E-17 | 1.1E-13 | SLITRK5 | rs773717935 | 0.0277 | 0.0065 | 0.014 | 0.002 | 3.1E-05 |
| 19 | 45409167 | C | G | 9.7E-13 | 0.3854 | APOE | rs440446* | 0.3332 | 0.3817 | 0.361 | 0.395 | 0.4346 |
| 19 | 10946802 | T | C | 1.2E-16 | 3.1E-12 | TMED1 | rs776909029 | 0.0366 | 0.0117 | 0.028 | 0.001 | 0.0009 |
| 9 | 34564740 | A | C | 6.5E-16 | 3.6E-12 | CNTFR | rs774039930 | 0.0516 | 0.0222 | 0.039 | 0.011 | 0.0008 |
| 3 | 108474687 | T | G | 1.4E-15 | 5.8E-12 | RETNLB | rs199707443 | 0.0328 | 0.0107 | 0.025 | 0.001 | 0.0493 |
| 19 | 43025485 | T | G | 2.6E-15 | 7.7E-12 | CEACAM1 | rs763190977 | 0.0921 | 0.0523 | 0.107 | 0.016 | 0.0026 |
| 3 | 31659462 | A | T | 4.8E-17 | 9.2E-12 | STT3B | rs74346226 | 0.0891 | 0.0514 | 0.076 | 0.035 | 0.131 |
| 12 | 109719316 | T | G | 9.1E-12 | 2.1E-10 | FOXN4 | rs760573591 | 0.0309 | 0.0115 | 0.025 | 0.003 | 1.5E-05 |
| 8 | 145112936 | T | C | 6.8E-14 | 5.0E-10 | OPLAH | rs781948612 | 0.0331 | 0.0114 | 0.026 | 0.002 | 0.0364 |
| 19 | 42799299 | T | C | 5.7E-12 | 1.4E-09 | CIC | rs745695673 | 0.019 | 0.0043 | 0.011 | 0.000 | 0 |
| 13 | 111164389 | A | C | 7.2E-12 | 1.6E-08 | COL4A2 | rs199702442 | 0.0517 | 0.0274 | 0.041 | 0.018 | 0.0285 |
| 22 | 30951295 | T | G | 2.0E-11 | 1.8E-08 | GAL3ST1 | rs762634521 | 0.0204 | 0.0056 | 0.013 | 0.001 | 0.028 |
Model 1 adjusted for sequencing center and the first four PCs underlying population substructure. Model 2 adjusted for sequencing center, the first four PCs, sex and APOE genotype.
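As a rough check of the exome-wide cutoff quoted in the table caption, the Bonferroni arithmetic can be reproduced in a few lines. This is a sketch that assumes the number of tests equals the 166,947 variants that the Fig 4 legend reports as used in the association analyses; the function name `bonferroni_cutoff` is illustrative and does not come from the paper.

```python
def bonferroni_cutoff(alpha: float, n_tests: int) -> float:
    """Return the per-test significance cutoff alpha / n_tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# Assumption: one test per variant, using the variant count from the Fig 4 legend.
N_VARIANTS = 166_947
cutoff = bonferroni_cutoff(0.05, N_VARIANTS)

# Weakest Model 1 p-value among the novel SNPs in the table (GAL3ST1).
weakest_model1_p = 2.0e-11

print(f"cutoff = {cutoff:.1e}")
print("passes cutoff:", weakest_model1_p < cutoff)
```

Under that assumption the result rounds to the 3.0 × 10−7 cutoff stated in the caption, and every Model 1 p-value listed in the table falls below it.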
Fig 2. Sequencing center specific association p-values of SNPs that reached exome-wide significance (denoted by the dashed horizontal lines) in the full-dataset analysis.
(A) Seven SNPs in TOMM40 and APOE. (B) Twenty-nine novel SNPs.
Fig 3. PC eigenvector plots of genotypes at 29 exome-wide significant SNPs.
Each data point represents a single individual. Clustering of samples for a particular variable indicates genotypic similarity between individuals for the trait represented by that color. (A) PCs color coded based on sub-population. (B) PCs color coded based on gender. (C) PCs color coded based on capture kit. The NimbleGen-captured samples cluster tightly together, indicating their genotypic similarity that is distinct from the Illumina-captured samples.
Fig 4. Density plots of variant quality parameters between two exome capture kits.
Mean values were computed across all samples for each variant. The solid lines show the distributions of all 166,947 variants used in the association analyses, and the scattered dots represent the 29 novel SNPs. (A) Density plot for mean genotype quality (GQ). (B) Density plot for mean read depth (DP). (C) Density plot for mean alternative allele fraction (AAF).
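The per-variant averaging described in this legend can be sketched as follows. This is a minimal illustration on made-up numbers, not the authors' pipeline: it assumes a samples-by-variants matrix of one quality metric (here GQ) with `NaN` for missing calls, plus one capture-kit label per sample. The function name `per_kit_variant_means` and the toy values are hypothetical.

```python
import numpy as np


def per_kit_variant_means(metric, kit_labels):
    """Average a quality metric over samples, separately for each capture kit.

    metric     : array of shape (n_samples, n_variants); NaN marks a missing call
    kit_labels : sequence of length n_samples naming each sample's capture kit
    Returns a dict mapping kit name to an array of per-variant means.
    """
    metric = np.asarray(metric, dtype=float)
    kit_labels = np.asarray(kit_labels)
    return {
        str(kit): np.nanmean(metric[kit_labels == kit], axis=0)
        for kit in np.unique(kit_labels)
    }


# Toy data (hypothetical): 4 samples x 2 variants of genotype quality.
gq = [
    [99.0, 40.0],
    [95.0, 50.0],
    [60.0, 45.0],
    [70.0, np.nan],  # missing call is ignored by nanmean
]
kits = ["Illumina", "Illumina", "NimbleGen", "NimbleGen"]

means = per_kit_variant_means(gq, kits)
for kit, values in means.items():
    print(kit, values)
```

The same function could be applied to read depth (DP) or alternative allele fraction (AAF) matrices; the resulting per-kit vectors are what a density plot like the one in this figure would be drawn from.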
Fig 5. Minor Allele Frequency (MAF) of 29 exome-wide significant SNPs in AD control exomes processed by two capture kits and in the ExAC Non-Finnish European (NFE) population.
Fig 6. PC eigenvector plots of genotypes at variants lying in different sections of quality-metric ratio distributions.
Each data point represents a single individual, color coded according to capture kit. (A) PCs of variants in either 5% tail. (B) PCs of variants in right 5% tail. (C) PCs of variants in left 5% tail. (D) PCs of variants in middle 90% of distributions. Variants in the tails, in particular the left 5% tail (better quality in NimbleGen kit), show clear separation by capture kit in both cases and controls.
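One way to partition variants into the tails and middle of a quality-metric ratio distribution is sketched below. The legend does not define the ratio, so this example assumes it is the per-variant mean under the Illumina kit divided by the mean under the NimbleGen kit, which would put variants with better NimbleGen quality in the left tail. The function name `split_by_ratio_tails` and the simulated means are hypothetical.

```python
import numpy as np


def split_by_ratio_tails(mean_a, mean_b, tail=0.05):
    """Split variants by where their a/b quality ratio falls in the distribution.

    mean_a, mean_b : per-variant mean quality under kit A and kit B (positive values)
    tail           : fraction of variants in each tail (0.05 gives 5% tails)
    Returns (ratio, left_mask, right_mask, middle_mask).
    """
    ratio = np.asarray(mean_a, dtype=float) / np.asarray(mean_b, dtype=float)
    low, high = np.quantile(ratio, [tail, 1.0 - tail])
    left = ratio < low        # kit B relatively better (assumed orientation)
    right = ratio > high      # kit A relatively better
    middle = ~(left | right)  # remaining variants
    return ratio, left, right, middle


# Simulated per-variant mean GQ for two kits (illustrative values only).
rng = np.random.default_rng(0)
n_variants = 1000
mean_illumina = rng.uniform(20.0, 99.0, size=n_variants)
mean_nimblegen = rng.uniform(20.0, 99.0, size=n_variants)

ratio, left, right, middle = split_by_ratio_tails(mean_illumina, mean_nimblegen)
print(left.sum(), middle.sum(), right.sum())
```

Each mask could then be used to subset the genotype matrix before running PCA, so that separation by capture kit can be compared between tail and middle variants.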