A Fast and Robust Strategy to Remove Variant-Level Artifacts in Alzheimer Disease Sequencing Project Data.

Michael E Belloy¹, Yann Le Guen¹, Sarah J Eger¹, Valerio Napolioni¹, Michael D Greicius¹, Zihuai He¹.

Abstract

Background and Objectives: Exome sequencing (ES) and genome sequencing (GS) are expected to be critical to further elucidate the missing genetic heritability of Alzheimer disease (AD) risk by identifying rare coding and/or noncoding variants that contribute to AD pathogenesis. In the United States, the Alzheimer Disease Sequencing Project (ADSP) has taken a leading role in sequencing AD-related samples at scale, with the resultant data being made publicly available to researchers to generate new insights into the genetic etiology of AD. To achieve sufficient power, the ADSP has adapted a study design where subsets of larger AD cohorts are collected and sequenced across multiple centers, using a variety of sequencing platforms. This approach may lead to variable variant quality across sequencing centers and/or platforms. In this study, we sought to implement and evaluate filters that can be applied fast to robustly remove variant-level artifacts in the ADSP data.
Methods: We implemented a robust quality control procedure to handle ADSP data. We evaluated this procedure while performing exome-wide and genome-wide association analyses on AD risk using the latest ADSP whole ES (WES) and whole GS (WGS) data releases (NG00067.v5).
Results: We observed that many variants displayed large variation in allele frequencies across sequencing centers/platforms and contributed to spurious association signals with AD risk. We also observed that sequencing platform/center adjustment in association models could not fully account for these spurious signals. To address this issue, we designed and implemented variant filters that could capture and remove these center-specific/platform-specific artifactual variants.
Discussion: We derived a fast and robust approach to filter variants that represent sequencing center-related or platform-related artifacts underlying spurious associations with AD risk in ADSP WES and WGS data. This approach will be important to support future robust genetic association studies on ADSP data, as well as other studies with similar designs.

Year: 2022 PMID: 35966919 PMCID: PMC9372872 DOI: 10.1212/NXG.0000000000200012

Source DB: PubMed Journal: Neurol Genet ISSN: 2376-7839

Late-onset Alzheimer disease (AD) is marked by a strong genetic component, with heritability estimates ranging from 59% to 79%.[1,2] Largely supported by single-nucleotide polymorphism (SNP) genotyping arrays and variant imputation, large-scale meta-analyses of genome-wide association studies have so far implicated more than 50 loci relevant to AD in individuals of European ancestry.[2-6] Despite these important advances, most risk variants identified so far have common allele frequencies, and it is estimated that only approximately half of the genetic heritability of AD has been captured, such that much of the genetic component of AD remains to be identified.[2] In response to this observation, there has been a shift to start using exome sequencing (ES) or genome sequencing (GS) to help capture rare and/or coding variants that contribute to AD risk, which has led to several recent initial successes.[7-15] In the United States, the Alzheimer Disease Sequencing Project (ADSP) has taken a leading role in the sequencing of AD-related samples at scale, with resultant data being made publicly available to researchers to generate new insights into the genetic etiology of AD. The ADSP has pursued both “whole” ES (WES) and “whole” GS (WGS) approaches (although it should be noted that these for now do not actually provide whole coverage due to technical limitations), where most recently, the focus is increasingly on GS. To achieve sufficient power to support analyses of sequencing data and rare variants, the ADSP has adapted a study design where subsets of larger AD cohorts are collected and sequenced across multiple centers, using a variety of sequencing platforms.[16-18] This in turn can lead to “center” or “platform” effects that traditionally are accounted for by using center/platform covariate adjustment. 
However, an earlier study using a previous version of the ADSP WES discovery phase data observed that center/platform covariate adjustment could not account for variable variant qualities across centers and platforms, which in turn may lead to spurious associations or affect the identification of AD-associated risk variants.[19] Since then, the ADSP has further expanded its efforts and, as of 2021, provides WES and WGS data sets on 20.5k and 16.9k individuals, respectively, across diverse ancestries.[18] In our exploratory analyses of these data, we observed many variants that displayed large variation in allele frequencies across centers/platforms and contributed to spurious association signals with AD risk, that is, associations that passed at least the common suggestive significance threshold for genome-wide association studies (p < 1 × 10−5) but were of a (likely) artifactual nature. Similar to the prior study,[19] we also observed that platform/center adjustment could not fully account for these signals. Beyond center/platform adjustment, several strategies have been proposed to handle such artifacts in ES and GS data.[20-22] Notably, preprocessing of UK Biobank SNP array data has previously shown that filters that capture variants displaying large genotype variations across batches/arrays (assessed by the Fisher exact tests) can substantially help remove variants that represent batch or array effects.[23] Because the latter approach is reasonably fast to implement and robust, in this study, we designed and implemented similar filters that aimed to capture and remove center-specific/platform-specific artifactual variants in ADSP WES and WGS data. 
We additionally tested filters containing putatively artifactual variants identified in the Genome Aggregation Database (gnomAD) reference database.[24] All filters were designed such that they can be implemented post hoc to association analyses, leaving flexibility to researchers to either run full-sample analyses with robust variant quality control (QC) or to identify variants that require targeted analyses. This study summarized the effect of these filters on genome-wide and exome-wide AD association findings in ADSP and proposed they can be used as a fast approach to robustly remove artifactual variants, thereby supporting initial explorations of the ADSP data.

Methods

Ascertainment of Genotype and Phenotype Data

Genotype data for individuals with AD-related clinical outcome measures were available from the ADSP WES and WGS data. Notably, the ADSP performed targeted sequencing of samples in case-control (majority), family-based, population-based, and longitudinal cohorts, performing sequencing across multiple sequencing centers and using various sequencing platforms (eTable 1 and 2, links.lww.com/NXG/A536). Ascertainment of genotype/phenotype data for these samples is described in detail elsewhere.[18,25] In addition to the ADSP samples, we also had access to several publicly available SNP microarray and WGS data sets (eTable 1), largely comprising data from the Alzheimer Disease Genetics Consortium. The latter have a large degree of sample overlap with ADSP. To ensure the most up-to-date and parsimonious phenotypes, we performed a cross-sample genotype/phenotype harmonization, which is summarized in eMethods.

Standard Protocol Approvals, Registrations, and Patient Consents

Participants or their caregivers provided written informed consents in the original studies. This study protocol was granted an exemption by the Stanford Institutional Review Board because the analyses were conducted on “de-identified, off-the-shelf” data.

Genetic Data QC and Processing

The ADSP WES and WGS data (NG00067.v5) were joint called by the ADSP following the SNP/Indel Variant Calling Pipeline and data management tool used for analysis of GS and ES for the ADSP.[25] The WES data were currently released only for biallelic variants, which the ADSP has quality controlled. The WGS data were released for biallelic and multiallelic variants separately, which the ADSP had not yet quality controlled. The current analyses of ADSP WGS were restricted to biallelic variants, to which we applied the Variant Quality Score Recalibration QC filter (PASS variants; GATK v4.1).[26] The WES/WGS data were available in genome build hg38, which we annotated using dbSNP153 variant identifiers. Genetic data underwent standard QC. Detailed descriptions of all processing procedures and sequential sample filtering steps are listed in eMethods and eTables 3 and 4 (links.lww.com/NXG/A536). For the purpose of the presented genetic association analyses, only non-Hispanic individuals of European ancestry were considered to focus on the largest ancestry population (SNPweights v2.1; eFigure 1).[27] Principal component (PC) analysis of genotyped SNPs provided PCs capturing population substructure (PC-AiR, eFigure 2).[28] In both the WES and WGS data, variants with a genotyping rate less than 95%, deviating from the Hardy-Weinberg equilibrium in the full sample or in controls (p < 10−6), and a minor allele count less than 10 were excluded. After this standard QC, the total number of remaining variants was 224,270 for ADSP WES and 14,772,936 for ADSP WGS.
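The standard variant-level QC thresholds stated above (genotyping rate of at least 95%, Hardy-Weinberg p of at least 10−6 in the full sample and in controls, and a minor allele count of at least 10) can be illustrated with a minimal sketch. The `VariantQC` record and its field names are hypothetical and only serve the example; the study itself derived these metrics with its own pipeline, which is described in eMethods.

```python
from dataclasses import dataclass


@dataclass
class VariantQC:
    """Hypothetical per-variant summary; field names are assumptions."""
    variant_id: str
    call_rate: float          # fraction of samples with a non-missing genotype
    hwe_p_all: float          # Hardy-Weinberg p value in the full sample
    hwe_p_controls: float     # Hardy-Weinberg p value in controls only
    minor_allele_count: int


def passes_standard_qc(v, min_call_rate=0.95, hwe_threshold=1e-6, min_mac=10):
    """Return True if the variant survives the three exclusion criteria."""
    return (
        v.call_rate >= min_call_rate
        and v.hwe_p_all >= hwe_threshold
        and v.hwe_p_controls >= hwe_threshold
        and v.minor_allele_count >= min_mac
    )


variants = [
    VariantQC("var_ok", 0.99, 0.40, 0.55, 25),
    VariantQC("var_low_call_rate", 0.93, 0.40, 0.55, 25),
    VariantQC("var_hwe_controls", 0.99, 0.40, 1e-9, 25),
    VariantQC("var_rare", 0.99, 0.40, 0.55, 4),
]
kept = [v.variant_id for v in variants if passes_standard_qc(v)]
```

In this toy input only `var_ok` is retained; each of the other three records fails exactly one criterion.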

Primary Filters to Remove Sequencing Center-Related/Platform-Related Variant-Level Artifacts

We designed filters to assess whether there were significant deviations in genotype distributions for any given variant across sequencing centers and platforms. To avoid bias from frequency differences across cases and controls, we assessed only genotypes in control individuals. The primary filters made use of the fast Fisher exact test as implemented by Plink (v.1.9; command: fisher).[29] However, this test can currently be implemented by comparing only 2 groups at a time (e.g., 2 genotyping centers), while we observed variant issues across multiple groups. We therefore compared every individual sequencing center/platform with all others and combined the p values from the multiple tests through the Cauchy combination test[29] (code available at: github.com/yaowuliu/ACAT). Variants with a combined p value lower than the heuristic threshold of 1 × 10−5 were flagged to be filtered. We note that in this design, there is no need to adjust the p value threshold regarding the number of centers/platforms because the Cauchy combination test inherently accounts for this. We additionally tested 2 other types of sequencing center-based/platform-based variant filters. On one hand, we performed the χ2 tests (R v.3.6.0; command: chisq.test) that considered all sequencing centers or platforms at once. Variants with a p value lower than the heuristic threshold of 1 × 10−5 were flagged to be filtered. On the other hand, we performed the Fisher tests with Monte Carlo (MC) simulation of p values (R v.3.6.0; command: fisher.test(simulate.p.value = T)) that considered all sequencing centers or platforms. The MC approach was chosen to allow feasible run times. Variants with a p value lower than the heuristic threshold of 1 × 10−3 were flagged to be filtered (this threshold reflects that the p values from MC simulation are less small than those obtained for the other tests). 
The 3 filters were compared for speed by calculating the time needed to derive the respective variant filters on a 1 MB genetic region of chromosome 1 in ADSP WGS. Computing time was evaluated on a single central processing unit from an 80-core Xeon Gold 6138T processor @ 2.00 GHz.
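The primary filter logic (one-versus-rest Fisher exact tests per center/platform, combined with the Cauchy combination test and flagged at 1 × 10−5) can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the study ran the Fisher tests in Plink and combined p values with the ACAT R code, whereas this sketch uses a pure-Python allelic 2×2 test, equal weights, and a clipping of p values to the range [1e-300, 0.999]. The clipping, the allelic table layout, and all function names are choices made for this example.

```python
import math


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table sharing the observed margins.
    probs = {x: math.comb(row1, x) * math.comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    observed = probs[a]
    # Sum the tables that are at most as likely as the observed one.
    p = sum(q for q in probs.values() if q <= observed * (1 + 1e-7))
    return min(1.0, p)


def cauchy_combination(p_values, max_p=0.999):
    """Equal-weight Cauchy combination of p values (clipping is an assumption)."""
    stats = []
    for p in p_values:
        p = min(max(p, 1e-300), max_p)
        if p < 1e-15:
            # tan((0.5 - p) * pi) is about 1 / (p * pi) for very small p.
            stats.append(1.0 / (p * math.pi))
        else:
            stats.append(math.tan((0.5 - p) * math.pi))
    t = sum(stats) / len(stats)
    if t > 1e15:
        return 1.0 / (t * math.pi)
    return 0.5 - math.atan(t) / math.pi


def center_artifact_p(allele_counts):
    """allele_counts: {center: (minor_alleles, major_alleles)} in controls."""
    total_minor = sum(minor for minor, _ in allele_counts.values())
    total_major = sum(major for _, major in allele_counts.values())
    p_values = [
        fisher_exact_2x2(minor, major, total_minor - minor, total_major - major)
        for minor, major in allele_counts.values()
    ]
    return cauchy_combination(p_values)


suspicious = {"center_A": (60, 1940), "center_B": (1, 1999), "center_C": (2, 1998)}
balanced = {"center_A": (20, 1980), "center_B": (21, 1979), "center_C": (19, 1981)}
flag_suspicious = center_artifact_p(suspicious) < 1e-5
flag_balanced = center_artifact_p(balanced) < 1e-5
```

With these made-up counts, the variant enriched in one center is flagged and the balanced variant is not. Because the combination is dominated by the smallest p values, a single strongly deviating center is enough to trigger the flag.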

Filters From the gnomAD

In addition to the filters proposed earlier, we used the gnomAD database (v3.1.1) as a reference to identify potential variant artifacts.[24] Specifically, we created filters for variants that have the following: (1) a “non-PASS” flag in gnomAD, corresponding to variants that did not pass gnomAD sample QC filters and may thus be more prone to sequencing issues; (2) an “LCR” flag in gnomAD, corresponding to variants that are located in a low complexity region and may thus be more prone to low coverage, read misalignment, and subsequent genotype issues; (3) a differential frequency of more than 10% between our current samples and non-Finnish European participants in gnomAD, which may indicate an issue with those variants in our samples. The 3 gnomAD filters were evaluated with the goal of supporting the primary ADSP WES/WGS center-based/platform-based variant filters.
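The 3 gnomAD-based criteria can be expressed as a small flagging function. This is a sketch only: the argument names are hypothetical, and the "differential frequency of more than 10%" is interpreted here as an absolute allele frequency difference, which is an assumption because the text does not say whether the difference is absolute or relative.

```python
def gnomad_flags(filter_field, in_lcr, study_af, gnomad_nfe_af, max_af_diff=0.10):
    """Return the set of gnomAD-based flags raised for one variant.

    filter_field  : gnomAD FILTER value, e.g. "PASS" (assumed input format)
    in_lcr        : True if gnomAD marks the variant as low complexity region
    study_af      : allele frequency in the study samples
    gnomad_nfe_af : gnomAD non-Finnish European allele frequency, or None
    """
    flags = set()
    if filter_field != "PASS":
        flags.add("non_pass")
    if in_lcr:
        flags.add("lcr")
    # Assumption: "more than 10%" is read as an absolute frequency difference.
    if gnomad_nfe_af is not None and abs(study_af - gnomad_nfe_af) > max_af_diff:
        flags.add("af_diff")
    return flags


clean = gnomad_flags("PASS", False, 0.21, 0.20)
artifact_like = gnomad_flags("AS_VQSR", True, 0.05, 0.00003)
frequency_shift = gnomad_flags("PASS", False, 0.35, 0.20)
```

Keeping the 3 flags separate, rather than returning a single boolean, makes it possible to evaluate each filter on its own, as was done for the overlap comparisons in the Results.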

Filters for Discordant Variants Across Duplicate Samples

A final set of filters was designed to flag variants that had a discordant genotype across any duplicate sample. Notably, the ADSP WES and WGS data contain a few hundred duplicate samples, generally covering multiple sequencing centers and/or platforms. Discordant variants across such duplicates therefore provide a reference of artifactual variants that should be removed and are largely reflecting center-related/platform-related genotyping issues. We evaluated these filters with the primary goal of comparing them with the primary ADSP WES/WGS center-based/platform-based variant filters, as well as the gnomAD-based variant filters. In a secondary goal, we also assessed to what extent these duplicate discordant variant filters themselves could handle center-related/platform-related variant issues that drove observations of spurious association signals.
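A duplicate discordance filter of this kind can be sketched as below. The data layout (a nested dictionary of genotype dosages per sample) and the decision to ignore missing calls rather than count them as discordant are assumptions of this example; the study does not specify either.

```python
def discordant_variants(genotypes, duplicate_pairs):
    """Flag variants whose genotype differs within any duplicate pair.

    genotypes       : {sample_id: {variant_id: dosage (0, 1, 2) or None}}
    duplicate_pairs : iterable of (sample_id_1, sample_id_2)
    """
    flagged = set()
    for first, second in duplicate_pairs:
        calls_1, calls_2 = genotypes[first], genotypes[second]
        for variant_id in calls_1.keys() & calls_2.keys():
            g1, g2 = calls_1[variant_id], calls_2[variant_id]
            # Assumption: a missing call is not treated as a discordance.
            if g1 is not None and g2 is not None and g1 != g2:
                flagged.add(variant_id)
    return flagged


genotypes = {
    "s1_centerA": {"v1": 0, "v2": 1, "v3": None},
    "s1_centerB": {"v1": 0, "v2": 0, "v3": 2},
    "s2_centerA": {"v1": 1, "v2": 0, "v3": 0},
    "s2_centerC": {"v1": 1, "v2": 0, "v3": 0},
}
pairs = [("s1_centerA", "s1_centerB"), ("s2_centerA", "s2_centerC")]
flagged = discordant_variants(genotypes, pairs)
```

Here only `v2` is flagged, because it is called heterozygous in one copy of sample `s1` and homozygous reference in the other.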

Statistical Analyses, Variant Annotation, and Visualization

Exome-wide and genome-wide association studies on AD case-control status were conducted on ADSP WES and WGS, respectively, using BOLT-LMM (v.2.3.5). BOLT-LMM uses a Bayesian mixture model that allows the inclusion of related individuals by adjusting for the genetic relationship matrix,[30] thereby maximizing sample size and power. Given the current minor allele count thresholds, the approximate 50-50 ratio of cases to controls and sample sizes exceeding 5,000 participants for both ADSP WES and WGS, the resultant test statistics are expected to be well-calibrated.[30] After analyses, association statistics were transformed back to a logistic scale taking into account the case fraction.[30] Per convention, variants were considered at suggestive (p ≤ 1 × 10−5) or genome-wide (p ≤ 5 × 10−8) significance. Case-control association analyses considered 2 models. Model 1 included covariates for sex, APOE*4 dosage, APOE*2 dosage, and the first 5 genetic PCs. We did not adjust for age because we previously showed that this can lead to significant power loss when the age of cases is younger than that for controls,[15] which is true for ADSP, given their initial design to prioritize old controls and young cases (Table 1 and eTables 5 and 6, links.lww.com/NXG/A536). Model 2 was the same as model 1 but additionally included covariates for sequencing center and platform. Variant filters were then applied to summary statistics using data.table functions in R v.3.6.0.
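The back-transformation of linear mixed model effect sizes to a logistic scale can be illustrated with the approximation commonly documented for this software, log OR ≈ β / (μ(1 − μ)), where μ is the case fraction. Whether the study used exactly this formula is an assumption; the text only states that the case fraction was taken into account. The function name is hypothetical.

```python
import math


def linear_to_log_odds(beta, se, case_fraction):
    """Rescale a linear-scale effect and its standard error to log odds.

    Uses the approximation log OR = beta / (mu * (1 - mu)), with mu the
    fraction of cases in the analyzed sample.
    """
    scale = case_fraction * (1.0 - case_fraction)
    return beta / scale, se / scale


# Illustrative numbers only: a balanced sample (mu = 0.5) scales effects by 4.
log_or, log_or_se = linear_to_log_odds(beta=0.02, se=0.005, case_fraction=0.5)
odds_ratio = math.exp(log_or)
```

Because the effect and its standard error are divided by the same factor, the z score and p value of each variant are unchanged by this rescaling.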

Table 1

Sample Demographics

The APOE locus (1 Mb region centered on APOE) was removed from all summary statistics. Independent loci were determined by sliding window when no variants with p ≤ 1 × 10−5 were observed within 200 kb from one another. The Manhattan plots provide RefSeq curated gene annotations for the gene closest (<500 kb) to the top significant variant per locus. Only variants with p ≤ 1 × 10−6 were annotated to improve visualization. Suggestive significance levels were indicated by gray dotted lines and green dots for variants. Genome-wide significance levels were indicated by black solid lines and red dots for variants. Variant densities were indicated at the bottom of the Manhattan plots (dark green = low density, yellow = medium density, and red = high density). Plots were generated using the R package CMplot.[31]
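The locus definition described above (suggestive variants belong to the same locus as long as consecutive hits lie within 200 kb of one another) can be sketched as a single pass over position-sorted hits. The tuple layout and function names are hypothetical, and the exact windowing used in the study may differ in detail.

```python
def define_loci(variants, p_threshold=1e-5, max_gap=200_000):
    """Group suggestive variants into loci.

    variants : iterable of (chromosome, position, p_value) tuples
    Returns a list of loci; each locus is a position-sorted list of hits.
    """
    hits = sorted((v for v in variants if v[2] <= p_threshold),
                  key=lambda v: (v[0], v[1]))
    loci = []
    for hit in hits:
        previous = loci[-1][-1] if loci else None
        same_locus = (previous is not None
                      and previous[0] == hit[0]
                      and hit[1] - previous[1] <= max_gap)
        if same_locus:
            loci[-1].append(hit)
        else:
            loci.append([hit])
    return loci


def lead_variant(locus):
    """Return the hit with the smallest p value in a locus."""
    return min(locus, key=lambda v: v[2])


summary_stats = [
    (2, 127_000_000, 3e-9),
    (2, 127_150_000, 4e-6),
    (2, 127_500_000, 8e-6),   # more than 200 kb from the previous hit
    (3, 2_900_000, 2e-8),
    (3, 2_950_000, 0.2),      # not suggestive, ignored
]
loci = define_loci(summary_stats)
leads = [lead_variant(locus) for locus in loci]
```

The toy input yields 3 loci, and the lead variant of each locus is the one that would receive a gene annotation in the Manhattan plots.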

Data Availability

The specific data repository and identifier for each cohort is indicated in eTable 1 (links.lww.com/NXG/A536) of the supplement. Code for the Cauchy combination test is available at: github.com/yaowuliu/ACAT. Summary statistics and variant filters are available on application at: niagads.org/. All data used in the discovery analyses are available on application to the following: dbGaP (ncbi.nlm.nih.gov/gap/) NIAGADS (niagads.org/) LONI (ida.loni.usc.edu/) Synapse (synapse.org/) Rush (radc.rush.edu/) NACC (naccdata.org/).

Results

Sample demographics are summarized in Table 1, with per center/platform demographics in eTables 5 and 6 (links.lww.com/NXG/A536). In initial exome-wide and genome-wide analyses using model 1, we observed many spurious associations (p ≤ 1e−5). We identified that variants underlying these spurious signals displayed increased variation in allele frequency across sequencing centers/platforms for the full frequency range (Figure 1, A and B). We also observed that such variants could not consistently be accounted for by adjustment for sequencing center/platform in model 2; a specific example of such a variant is provided in Figure 1C.

Figure 1

Variant Artifacts Across Different Sequencing Centers/Platforms Drive Spurious Associations in ADSP WES and WGS data

In initial exome-wide and genome-wide association studies of ADSP WES and WGS, we observed many spurious associations (p ≤ 1e−5) using model 1 (i.e., not adjusting for sequencing center/platform; cf. Figures 2A and 3A). On inspection of these signals, it was notable that these variants displayed large variation in genotype counts across sequencing centers/platforms. The MAF variation in controls for all analyzed variants is visualized in (A.a-b) for ADSP WES and in (B.a-b) for ADSP WGS. (C.a-b) A specific example of a variant showing spurious association is provided. This variant, rs199707443, has an MAF of 0.003% in non-Finnish Europeans in Genome Aggregation Database v3.1.1, contrasting the 411 heterozygote counts in the Broad sequencing center. Notably, this particular variant still showed genome-wide significant association with Alzheimer disease risk even after sequencing center/platform adjustment (cf. Figure 2B). ADSP, Alzheimer Disease Sequencing Project; CN, cognitively normal; HET, heterozygote; HOM, homozygote; MAF, minor allele frequency; WT, wild type; WES, whole-exome sequencing; WGS, whole-genome sequencing.

Figure 2

The Proposed Center-Based/Platform-Based Variant Filters Remove Spurious Associations in Alzheimer Disease Sequencing Project Whole-Exome Sequencing

Figure shows the Manhattan (left) and quantile-quantile (right) plots. (A) Model 1 indicates many spurious hits. (B) Model 2 shows that adjustment for center/platform can reduce many, but not all, spurious hits. The variant described in Figure 1C is highlighted by the blue arrow. (C) Filters remove most spurious hits. (D) Further adjustment for center/platform removes few additional spurious hits.

Based on these observations, 3 versions of filters were designed and evaluated for their capacity to capture putative center-related/platform-related variant artifacts. In assessing computing time, the filter using the Fisher exact test implemented in Plink followed by the Cauchy combination of p values implemented in R proved to be the fastest, taking 32 seconds to be constructed using a single central processing unit for a 1 Mb region in ADSP WGS (5,402 variants). Comparatively, constructing the χ2 test filter implemented in R took 93 seconds, while the Fisher test with MC filter implemented in R took 128 seconds. Given the faster speed, as well as the expected higher robustness provided by an exact test, we present the filter using the Fisher exact test implemented in Plink as the primary filter, while the other 2 represent supporting analyses. Throughout the remainder of the article, we will use the term “filtered” to describe variants that were removed by filters and the term “non-filtered” to describe variants that were not removed by filters. The Fisher exact center-based/platform-based variant filters showed they strongly reduced the number of spurious associations observed with model 1 in ADSP WES (Figure 2, A and C) and WGS (Figure 3, A and C). When further adjusting for sequencing center/platform in model 2, spurious associations appeared essentially absent in ADSP WES (Figure 2D) and WGS (Figure 3D). Notably, the spurious associations were not detected by the genomic inflation, as for instance, the genomic control factor (λ) was consistent prior to and after applying variant filters in ADSP WGS for the respective models (Figure 3). The slightly larger λ for ADSP WES in model 1 prior to applying the variant filters (Figure 2A) indicated that the large number of spurious variants regarding the relatively small total set of variants was likely driving some modest inflation. 
Consistent observations were made for the other 2 center-based/platform-based variant filters (eFigures 3-6, links.lww.com/NXG/A536). When intersecting variants identified across these 3 sets of filters, the filter derived from the Fisher exact test implemented in Plink overlapped strongly (>96%) with the other 2 filters that in turn showed less overlap (eFigure 7). This was consistent with the Fisher exact test being the most conservative and robust.
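For readers who want to reproduce the genomic control factor (λ) mentioned above, a minimal sketch follows. It uses the standard definition, the median of the 1-degree-of-freedom χ2 statistics divided by the median of the χ2 distribution under the null (about 0.4549), and recovers the statistics from two-sided p values. The function name is hypothetical, and the study's own λ values were presumably obtained with its plotting or association software.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Genomic control factor from two-sided association p values in (0, 1]."""
    normal = NormalDist()
    # A two-sided p value corresponds to z = inv_cdf(p / 2); chi-square = z ** 2.
    chi2 = [normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spread p values mimic a well-calibrated null distribution.
null_like = [(i + 0.5) / 1001 for i in range(1001)]
lambda_null = genomic_inflation(null_like)

# Adding a handful of extreme p values barely moves the median.
with_spurious_hits = null_like + [1e-12] * 5
lambda_spurious = genomic_inflation(with_spurious_hits)
```

Because λ depends on the median test statistic, a limited number of strongly significant artifactual variants leaves it close to 1, which is in line with the observation that the spurious associations were not detected by genomic inflation.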

Figure 3

Proposed Center-Based/Platform-Based Variant Filters Remove Spurious Associations in Alzheimer Disease Sequencing Project Whole-Genome Sequencing

Figure shows the Manhattan (left) and quantile-quantile (right) plots. (A) Model 1 indicates many spurious hits. (B) Model 2 shows that adjustment for center/platform can reduce many, but not all, spurious hits. (C) Filters remove most spurious hits. (D) Further adjustment for center/platform removes few additional spurious hits.

A closer inspection of the center-based/platform-based variant filters showed that nonfiltered variants displayed fairly concordant p values across models 1 and 2, whereas filtered variants showed many discrepancies (Figure 4, A and C). This was consistent with the filtered variants driving spurious associations. In addition, it was apparent that filters removed variants across the full frequency range (Figure 4, B and D) consistent with the increased minor allele frequency variation across all frequency ranges for variants underlying spurious association signals (Figure 1, A and B).

Figure 4

Metrics of Variants Removed by the Proposed Center-Based/Platform-Based Variant Filters

(A.a, A.b, and B) ADSP WES. (C.a, C.b, and D) ADSP WGS. (A.a and C.a) Variants that passed filters showed largely consistent p values across model 1 and model 2 case-control association analyses, with only few variants remaining that reach suggestive significance in model 1 but lose suggestive significance on center/platform adjustment in model 2 (lower right quadrant). (A.b and C.b) Variants that were removed by filters showed many inconsistent p values across models 1 and 2, consistent with center-related/platform-related variant artifacts that could not fully be accounted for by model 2. (B and D) Frequency density plots, comparing variants that were filtered/removed with those that were not filtered. Note that variants were consistently filtered across the full frequency range, with increased density at frequencies <1% or >10% in ADSP WES. ADSP, Alzheimer Disease Sequencing Project; WES, whole-exome sequencing; WGS, whole-genome sequencing.

We then assessed to what extent the gnomAD-based filters could remove the observed spurious associations. A visual assessment of the Manhattan plots showed that the gnomAD-based filters could only account for a part of the spurious associations (eFigures 8 and 9, links.lww.com/NXG/A536). Similarly, a closer inspection of the gnomAD-based filters showed that they mainly removed variants with frequencies <1% (eFigure 10). The p values across models 1 and 2 further showed many discrepancies for both nonfiltered and filtered variants (although fewer for nonfiltered variants). In sum, the gnomAD-based filters could remove some spurious signals but were less effective than the center-based/platform-based variant filters. We further assessed to what extent the duplicate discordant variant filters could remove the observed spurious associations. 
The Manhattan plots showed that the duplicate discordant variant filters could account for many of the spurious associations, but several remained when using model 1, while when using model 2, the Manhattan plots looked similar to those using the center-based/platform-based variant filters (eFigures 11 and 12, links.lww.com/NXG/A536). A closer inspection of the duplicate discordant variant filters similarly showed they mainly removed variants with frequencies >10% and did not remove a set of variants that lose suggestive significance when going from model 1 to model 2 (eFigure 13). An illustrative example of such a variant is listed in eTable 7, confirming these variants represent genotyping issues that more ideally should be removed from the data. In sum, the duplicate discordant filters could remove many spurious signals but were less effective than the center-based/platform-based variant filters, yet more effective than the gnomAD-based variant filters. We also sought to understand the overlap between the different proposed filters. The 3 gnomAD-based variant filters appeared to show little overlap with one another (eFigure 14, links.lww.com/NXG/A536) and overlapped with less than 20% of the variants in the center-based/platform-based variant filters (eFigure 15). Furthermore, in ADSP WES and WGS, 32% and 14% of duplicate discordant variants overlapped center-based/platform-based variant filters, respectively, while vice versa 31% and 15% of center-based/platform-based filtered variants overlapped duplicate discordant variants (eFigure 16). In the same comparison, 28% and 49% of duplicate discordant variants overlapped gnomAD-based variant filters, respectively, while vice versa 53% and 17% of gnomAD-based filtered variants overlapped duplicate discordant variants (eFigure 17). In sum, this confirmed that all 3 types of filters captured overlapping as well as unique variants. 
Notably, the center-based/platform-based and gnomAD-based variant filters could capture a subset of reference artifactual variants present in the duplicate discordant variant filters but identified many additional signals that represented likely artifactual variants and contributed to spurious association signals. An overview of the number of variants and spurious signals removed for all respective filters and models is summarized in eTable 8. Then, we sought to assess whether the use of these different types of variant filters could omit the need for adjusting for sequencing center/platform as implemented in model 2, which may be desirable for certain studies or research questions. We thus inspected all variants that passed suggestive significance in either model 1 or model 2 in ADSP WES (Table 2) and WGS (eTable 9, links.lww.com/NXG/A536) after applying the center-based/platform-based filters (which we showed removed the most spurious signals). We observed that many variants that lose suggestive significance after center/platform adjustment in model 2 have fairly small (above threshold) p values in the center-based/platform-based Fisher exact tests and/or are covered in the gnomAD-based and duplicate discordant variant filters. Similarly, assessing the Manhattan plots and variant metrics suggested that the gnomAD-based and/or duplicate discordant variant filters removed few additional variants underlying spurious signals (eFigures 18-23). Notably, we also observed in ADSP WGS that center/platform adjustment for some variants led to somewhat more significant p values, thereby increasing the number of suggestive hits (eTables 8 and 9). This could reflect improved model fits after center/platform adjustment by accounting for case/control imbalances or other factors. Overall, these observations suggest there may be added value in using model 2 and/or applying the gnomAD-based filters to reduce spurious signals. 
Obviously, adding the duplicate discordant variant filters will inherently remove artifactual signals and help reduce spurious signals.

Table 2

Alzheimer Disease Sequencing Project Whole-Exome Sequencing Variants Passing Suggestive Significance After Applying Center-Based/Platform-Based Filters

Last, as a robustness check, we compared association statistics from the current ADSP WES analyses with variants that we identified in a prior study using a prior version of the ADSP WES data and observed highly concordant findings (eTable 10, links.lww.com/NXG/A536).[15]

Discussion

We present a fast and robust approach to filter variants that represent sequencing center-related or platform-related artifacts underlying spurious associations with AD risk in ADSP WES and WGS data, which cannot fully be accounted for by center/platform covariate adjustment. In addition, we showed that filters comprising variants that may be prone to artifacts, as identified by gnomAD, were less efficient in removing spurious signals but may still have added value on top of the center-based/platform-based filters. Similarly, filters containing variants that were discordant across duplicate samples could remove many, but not all, spurious signals and added onto the center-based/platform-based filters. In sum, the presented filters are important to support future robust studies on ADSP data. In addition, these filters allow flexibility, given that they can be applied in post hoc QC. Researchers may thus inspect filtered variants in targeted analyses in subsets of the ADSP data where no artifactual genotype enrichment is observed (e.g., excluding a single sequencing center/platform that showed an artifactual increase in genotype counts compared with the others, cf. Wickland et al.[19]). Certain study designs or research questions may benefit from not adjusting by sequencing center/platform (i.e., cohort adjustment). For example, a study that considers specific strata and/or low-frequency variants may observe some colinearity between variant genotype observations and sequencing centers/platforms. However, this does not necessarily indicate artifactual variants and may be driven by chance or variable cohort study designs across samples sequenced by different centers. We observed that the presented center-based/platform-based variant filters could handle nearly all spurious associations when not adjusting for sequencing center/platform in model 1. 
Inspecting the remaining signals passing suggestive significance, it was apparent that the gnomAD-based and duplicate discordant variant filters could remove a few additional spurious signals. Similarly, the p values from the Fisher exact tests across sequencing centers/platforms were fairly small for several variants that passed suggestive significance in their association with AD risk in model 1 but lost suggestive significance on center/platform adjustment in model 2. In sum, we suggest that model 2 with application of center-based/platform-based, gnomAD-based, and duplicate discordant variant filters is the most conservative approach, but model 1 using only center-based/platform-based and duplicate discordant variant filters may reasonably be implemented, contingent on post hoc assessment of the association signals' robustness. The center-based/platform-based filtering approach will further be valuable beyond the currently presented exome-wide and genome-wide univariate AD risk association analyses in European ancestry samples. Notably, the removal of artifactual variants may lead to improved association statistics in gene-based testing, which is particularly relevant for ES/GS data.[7] The filter approach can also be applied to non-European samples available in ADSP WES/WGS. Last, the approach to check for variant artifacts by comparing genotype distributions across sequencing centers/platforms may also be used in other studies with a similar design as the current ADSP data. Notably, our approach is similar to the one previously applied to the preprocessing of UK Biobank SNP array data to remove variants that may represent batch or array effects.[23] In turn, the approach described here and applied to ES/GS data could also be applied to the large number of SNP array data sets used in large-scale genetic studies of AD.[3] This study reports exome-wide and genome-wide AD risk association findings for the newly released ADSP 20.5k WES and 16.9k WGS data. 
After QC and filter implementation, we observed few signals passing the genome-wide significance threshold. In the ADSP WES data, TREM2 and ABCA7, both well-established AD risk genes,[2,6] harbored variants at genome-wide and suggestive significance, respectively, consistent with observations for similar models in prior studies on the earlier ADSP WES discovery phase data.[7,15] Despite observing only 4 variants in ADSP WES that passed suggestive significance in model 2, our findings were overall highly consistent with prior work.[15] We also observed that certain variants identified previously were not present in our current summary statistics (eTable 10, links.lww.com/NXG/A536), reflecting differences in joint calling and QC, as well as the fact that only biallelic data were currently available for the new ADSP WES data. Notably, the common protective variant on ABCA7 identified here has not been previously reported (and we confirm that it appears not to have been successfully joint called in the prior ADSP WES data; dbGaP accession ID: phs000572). In the ADSP WGS data, in addition to several suggestive hits, BIN1 (a well-established AD risk gene[2,6]) and CNTN4 were identified with variants at genome-wide significance. The common protective variant on CNTN4 appears novel and may be of relevance to AD pathogenesis given that Contactin 4 (CNTN4) is a binding partner of amyloid precursor protein (APP) and the CNTN4/APP interaction may play a role in promoting target-specific axon arborization.[32,33] Overall, these initial findings appear promising but suggest that the current ADSP WES/WGS data may still face power limitations that hinder the discovery of novel risk variants.
As such, gene-based testing, analyses on available non-European ancestry samples, and novel methodological approaches to gain additional power[12,15] will all be crucial to support future advances in disentangling the missing heritability of AD using ADSP samples and other complementary large-scale sequencing data.

One limitation of the proposed center-based/platform-based and gnomAD-based filters is that, while they robustly remove many artifactual variants, they may remove nonartifactual variants (i.e., false negatives) and thus reduce power, or still miss other artifactual variants (i.e., false positives). Theoretically, filtered and nonfiltered variants could be verified for their association with AD in the summary statistics from other large-scale genome-wide association studies using imputed SNP data, but this inherently comes with concerns regarding imputation/genotype quality in those cohorts, as well as challenges in resolving signals below the suggestive significance threshold in ADSP (given its relatively limited sample sizes). As such, a clear assessment of sensitivity and specificity is not directly feasible at the current time. Some false positives may be expected in ADSP WES owing to the fact that the ADSP used a variety of exome capture kits, which were not considered here because those metadata were not readily available at the current time. Additional false positives may also still be expected for any remaining variants with allele imbalance, which was not assessed in this study.[34] Furthermore, other factors such as imbalance of ancestry, case/control ratios, or age across centers may affect the variant filters and lead to false negatives.
However, in data not shown in this study, consistent spurious associations were observed and removed by filters when considering a more homogeneous population of Northwestern European or African ancestry individuals, suggesting that ancestry imbalance did not specifically bias the center/platform effects. Similarly, by designing the center/platform filters on controls, there was little concern regarding case/control ratio and age imbalance. However, cohort study design differences may cause control individuals on certain centers/platforms to be enriched in protective variants (e.g., if a given study specifically recruited protected old age APOE*4 carriers), which could potentially contribute to false negatives. In addition, age in general may represent a confounding factor because clonal hematopoiesis of indeterminate potential (CHIP) contributes to an increased rate of somatic mutations with aging that can confound analyses (particularly in CHIP-associated genes).[35] This may be specifically relevant when the genetic association model does not account for age, as was the case in this study. Last, the gnomAD filters flag variants that were artifactual in gnomAD and are thus prone to technical issues, but not all of these variants are necessarily artifactual in the current ADSP data. Future studies may also consider adapting the gnomAD 10% differential frequency filter to instead make use of a Fisher test, similar to the primary center-based/platform-based filters.

In sum, while the current filters are clearly useful to increase the robustness of association findings in ADSP data, future studies may further implement and evaluate other approaches to handle artifactual variants while validating sensitivity and specificity. Future studies may also consider inspecting target variants or genes without applying the filters proposed here but instead using them as a reference or adapting them, as appropriate.
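To illustrate the kind of center-based/platform-based check discussed above, the following minimal Python sketch compares a variant's allele counts in controls between each sequencing center and the pooled remaining centers using a two-sided Fisher exact test. It is not the implementation used in this study. The function names, the use of allele counts rather than genotype counts, the one-versus-rest 2×2 simplification, and the p value threshold of 1e-5 are all assumptions made for illustration only, and no multiple-testing correction is applied.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def flag_center_artifact(control_counts, p_threshold=1e-5):
    """Flag a variant whose control allele counts differ strongly by center.

    control_counts: {center: (alt_allele_count, ref_allele_count)}, controls only.
    p_threshold: hypothetical cutoff, chosen here purely for illustration.
    Returns (flagged, per-center p values).
    """
    total_alt = sum(alt for alt, _ in control_counts.values())
    total_ref = sum(ref for _, ref in control_counts.values())
    p_values = {
        center: fisher_exact_2x2(alt, ref, total_alt - alt, total_ref - ref)
        for center, (alt, ref) in control_counts.items()
    }
    return min(p_values.values()) < p_threshold, p_values


# Toy data (invented): allele counts in controls at three hypothetical centers.
consistent = {"A": (50, 950), "B": (48, 952), "C": (52, 948)}
skewed = {"A": (50, 950), "B": (48, 952), "C": (200, 800)}

print(flag_center_artifact(consistent)[0])
print(flag_center_artifact(skewed)[0])
```

In this toy example, the second variant shows a roughly fourfold higher alternate allele frequency at one hypothetical center and would be flagged, whereas the first would not. Restricting the counts to controls follows the rationale given above that true case/control differences should not drive the filter.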
We present a fast and robust approach to filter variants that represent sequencing center-related or platform-related artifacts underlying spurious associations with AD risk in ADSP WES and WGS data. This approach will be important to support future robust studies on ADSP data, as well as other studies with similar designs.

33 in total

1. The Alzheimer's Disease Sequencing Project: Study design and sample selection.

Authors: Gary W Beecham; J C Bis; E R Martin; S-H Choi; A L DeStefano; C M van Duijn; M Fornage; S B Gabriel; D C Koboldt; D E Larson; A C Naj; B M Psaty; W Salerno; W S Bush; T M Foroud; E Wijsman; L A Farrer; A Goate; J L Haines; Margaret A Pericak-Vance; E Boerwinkle; R Mayeux; S Seshadri; G Schellenberg
Journal: Neurol Genet Date: 2017-10-13

2. Glial fibrillary acidic protein-apolipoprotein E (apoE) transgenic mice: astrocyte-specific expression and differing biological effects of astrocyte-secreted apoE3 and apoE4 lipoproteins.

Authors: Y Sun; S Wu; G Bu; M K Onifade; S N Patel; M J LaDu; A M Fagan; D M Holtzman
Journal: J Neurosci Date: 1998-05-01 Impact factor: 6.167

3. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures.

Authors: Yaowu Liu; Jun Xie
Journal: J Am Stat Assoc Date: 2019-04-25 Impact factor: 5.033

4. Analysis of Whole-Exome Sequencing Data for Alzheimer Disease Stratified by APOE Genotype.

Authors: Yiyi Ma; Gyungah R Jun; Xiaoling Zhang; Jaeyoon Chung; Adam C Naj; Yuning Chen; Celine Bellenguez; Kara Hamilton-Nelson; Eden R Martin; Brian W Kunkle; Joshua C Bis; Stéphanie Debette; Anita L DeStefano; Myriam Fornage; Gaël Nicolas; Cornelia van Duijn; David A Bennett; Philip L De Jager; Richard Mayeux; Jonathan L Haines; Margaret A Pericak-Vance; Sudha Seshadri; Jean-Charles Lambert; Gerard D Schellenberg; Kathryn L Lunetta; Lindsay A Farrer
Journal: JAMA Neurol Date: 2019-09-01 Impact factor: 18.302

5. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk.

Authors: Iris E Jansen; Jeanne E Savage; Stephan Ripke; Ole A Andreassen; Danielle Posthuma; Kyoko Watanabe; Julien Bryois; Dylan M Williams; Stacy Steinberg; Julia Sealock; Ida K Karlsson; Sara Hägg; Lavinia Athanasiu; Nicola Voyle; Petroula Proitsi; Aree Witoelar; Sven Stringer; Dag Aarsland; Ina S Almdahl; Fred Andersen; Sverre Bergh; Francesco Bettella; Sigurbjorn Bjornsson; Anne Brækhus; Geir Bråthen; Christiaan de Leeuw; Rahul S Desikan; Srdjan Djurovic; Logan Dumitrescu; Tormod Fladby; Timothy J Hohman; Palmi V Jonsson; Steven J Kiddle; Arvid Rongve; Ingvild Saltvedt; Sigrid B Sando; Geir Selbæk; Maryam Shoai; Nathan G Skene; Jon Snaedal; Eystein Stordal; Ingun D Ulstein; Yunpeng Wang; Linda R White; John Hardy; Jens Hjerling-Leffler; Patrick F Sullivan; Wiesje M van der Flier; Richard Dobson; Lea K Davis; Hreinn Stefansson; Kari Stefansson; Nancy L Pedersen
Journal: Nat Genet Date: 2019-01-07 Impact factor: 38.330

6. Identifying and mitigating batch effects in whole genome sequencing data.

Authors: Jennifer A Tom; Jens Reeder; William F Forrest; Robert R Graham; Julie Hunkapiller; Timothy W Behrens; Tushar R Bhangale
Journal: BMC Bioinformatics Date: 2017-07-24 Impact factor: 3.169

7. Exome-wide age-of-onset analysis reveals exonic variants in ERN1 and SPPL2C associated with Alzheimer's disease.

Authors: Liang He; Yury Loika; Yongjin Park; David A Bennett; Manolis Kellis; Alexander M Kulminski
Journal: Transl Psychiatry Date: 2021-02-26 Impact factor: 6.222

8. Impact of variant-level batch effects on identification of genetic risk factors in large sequencing studies.

Authors: Daniel P Wickland; Yingxue Ren; Jason P Sinnwell; Joseph S Reddy; Cyril Pottier; Vivekananda Sarangi; Minerva M Carrasquillo; Owen A Ross; Steven G Younkin; Nilüfer Ertekin-Taner; Rosa Rademakers; Matthew E Hudson; Liudmila Sergeevna Mainzer; Joanna M Biernacka; Yan W Asmann
Journal: PLoS One Date: 2021-04-16 Impact factor: 3.752

9. VCPA: genomic variant calling pipeline and data management tool for Alzheimer's Disease Sequencing Project.

Authors: Yuk Yee Leung; Otto Valladares; Yi-Fan Chou; Han-Jen Lin; Amanda B Kuzma; Laura Cantwell; Liming Qu; Prabhakaran Gangadharan; William J Salerno; Gerard D Schellenberg; Li-San Wang
Journal: Bioinformatics Date: 2019-05-15 Impact factor: 6.937

10. Whole-genome sequencing reveals new Alzheimer's disease-associated rare variants in loci related to synaptic function and neuronal development.

Authors: Dmitry Prokopenko; Sarah L Morgan; Kristina Mullin; Oliver Hofmann; Brad Chapman; Rory Kirchner; Sandeep Amberkar; Inken Wohlers; Christoph Lange; Winston Hide; Lars Bertram; Rudolph E Tanzi
Journal: Alzheimers Dement Date: 2021-04-02 Impact factor: 21.566