Literature DB >> 29590295

Functional annotation of genomic variants in studies of late-onset Alzheimer's disease.

Mariusz Butkiewicz¹, Elizabeth E Blue², Yuk Yee Leung³, Xueqiu Jian⁴, Edoardo Marcora⁵, Alan E Renton⁵, Amanda Kuzma³, Li-San Wang³, Daniel C Koboldt⁶, Jonathan L Haines¹, William S Bush¹.

Abstract

Motivation: Annotation of genomic variants is an increasingly important and complex part of the analysis of sequence-based genomic analyses. Computational predictions of variant function are routinely incorporated into gene-based analyses of rare-variants, though to date most studies use limited information for assessing variant function that is often agnostic of the disease being studied.
Results: In this work, we outline an annotation process motivated by the Alzheimer's Disease Sequencing Project, illustrate the impact of including tissue-specific transcript sets and sources of gene regulatory information and assess the potential impact of changing genomic builds on the annotation process. While these factors only impact a small proportion of total variant annotations (∼5%), they influence the potential analysis of a large fraction of genes (∼25%). Availability and implementation: Individual variant annotations are available via the NIAGADS GenomicsDB, at https://www.niagads.org/genomics/ tools-and-software/databases/genomics-database. Annotations are also available for bulk download at https://www.niagads.org/datasets. Annotation processing software is available at http://www.icompbio.net/resources/software-and-downloads/. Supplementary information: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2018 PMID： 29590295 PMCID： PMC6084586 DOI： 10.1093/bioinformatics/bty177

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Multiple large-scale whole-exome (WES) and whole-genome sequencing (WGS) campaigns are currently underway to identify genetic variants that associate to a variety of traits, including studies of schizophrenia (Genovese ), type 2 diabetes (Fuchsberger ), height (Marouli ), myocardial infarction (Do ) and others. While the designs and scope of sequencing for these studies vary, previous similar studies have routinely identified millions of genetic variants that were novel (at the time of publication). The targeted sequencing of 82 pharmacogenes using the PGRN-seq platform identified 48.5% novel variants (Bush ), and a prior sequencing study of known pharmacogenes reported 90% novel variants (Nelson ). The Exome Aggregation Consortium (ExAC), which compiled WES data from multiple studies reported 72% novel variants (Lek ), and the UK10K study used WES and reported 57% novel variants (UK10K Consortium ). As such, any sequencing-based study will likely report several million variants that are not previously described in the scientific literature. With no published information about these variants available, the genomics community has increasingly relied on computational approaches for predicting variant function (Cingolani ; De Baets ; Wang ; Yourshaw ). A major goal of sequencing studies is to statistically examine the impact of low-frequency genetic variation on a trait of interest. Methods for the assessment of this rare-variant hypothesis now increasingly rely on biological information to group or bin variants together, improving the statistical power to detect an effect. Gene-based burden, collapsing and kernel-based tests are the most commonly applied (Bansal ; Lee ; Li and Leal, 2008; Liu and Leal, 2010; Madsen and Browning, 2009; Moutsianas ; Neale ; Price ; Sun ; Wu ). In addition to grouping variants by gene, multiple modifications to this strategy have been proposed that incorporate the predicted impact or consequence of genetic variants on the molecular function of the gene being tested (Ng , 2010; Wu ). Because of these technological and analytical advancements, bioinformatic annotation of variant function plays an increasingly critical role in the analysis of whole-genome and whole-exome sequence data. Prior large-scale sequencing efforts have employed a variety of annotation strategies. The UK10K project (UK10K Consortium ) from the Wellcome Trust Sanger Institute is annotated with RSIDs from dbSNP138 (Smigielski, 2000) and functional annotations through the Ensembl Variant Effect Predictor (VEP; version 75) (Yourshaw ). The ExAC similarly used VEP (version 81), along with the Loss-of-Function Transcript Effect Estimator (LOFTEE) plugin to evaluate protein-truncating variants. They further annotated variant impact using Polymorphism Phenotyping version 2 (PolyPhen2), Scale-Invariant Feature Transform (SIFT) and Combined Annotation Dependent Depletion (CADD) scores (Adzhubei ; Kircher ; Kumar ). Notably, these studies employed exome sequencing in their design, shifting the annotation focus onto coding variation. In this work, we describe variant annotation efforts motivated by the Alzheimer’s Disease Sequencing Project (ADSP), a collaborative effort between the National Institutes on Aging and the National Human Genome Research Institute, along with members of the Alzheimer’s disease genetics research community, designed to study the genetics of late-onset Alzheimer’s disease (LOAD). The ADSP discovery-phase datasets consist of 578 individuals from 111 densely affected LOAD families selected for WGS as well as 5778 LOAD cases and 5136 controls selected for WES from the Alzheimer’s Disease Genetics Consortium (ADGC) and the neurology phenotype working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium (Beecham et al., 2017). Analytical efforts within the ADSP are divided into multiple working groups focused on data flow, quality control (Malamon ), structural variant calling, family-based analyses (Ahmad ; Beecham ; Jaworski ), case/control analyses and protective variant analyses. The annotation working group is tasked with providing broad, consistent variant annotation resources for all workgroups, across all datasets and genomic builds, and with specific emphasis on functional effects of variants within AD-related tissues. To date, most sequencing studies have generated ad hoc, minimal annotation sets, which provide insight into only basic biological function and largely ignores the biological complexity of tissue-specific effects variants may have. Furthermore, on a project-level, there has been no described attempt to generate and maintain a consistent set of comprehensive annotations as a resource for all analyses. In this work, we describe an in silico annotation approach, offering guidance and resources to other ADSP investigators, and providing a common base for testing functional hypotheses in their analyses. We provide a summary of annotated variant consequences, illustrate the impact of using a tissue-specific transcript reference set, examine the annotation impact of including regulatory elements and provide an overview of issues arising from an anticipated transition from build 37 to build 38.

2 Materials and methods

2.1 The Alzheimer’s Disease Sequencing Project and the annotation pipeline

Details of the sample selection criteria and AD phenotyping are described in detail elsewhere (Beecham , 2017). In brief, ∼1400 LOAD families were reviewed for potential inclusion in the project, excluding families with known Mendelian mutations for AD. Families were prioritized based on the number of affected individuals, the number of generations affected, age of onset and absence of apolipoprotein E (APOE) ε4 risk alleles. Cases met the National Institute of Neurological and Communicative Disorders and Stroke—the Alzheimer’s NINCDS-ADRDA (Alzheimer's Disease and Related Disorders Association) criteria for possible, probable or definite AD (McKhann ). Case/control samples were selected from a pool of over 30 000 samples under a balanced risk/protection design; cases were selected with low a priori risk (younger and lacking APOE ε4 alleles), and control samples selected based on low rates of expected conversion to AD. Using these two sample sets, the ADSP pursued two complementary sequencing efforts—WGS on multiplex families, and WES on a large dataset of unrelated cases and controls. Details of the sequencing, capture design and quality control procedures for the ADSP data are described elsewhere (Malamon ) and here described briefly. Extracted genomic DNA was sent to one of three Large-Scale Sequencing and Analysis Centers for sequencing using Illumina WGS technology, the Broad Institute Genomics Service (Broad), the Baylor College of Medicine Human Genome Sequencing Center (Baylor) or the McDonnell Genome Institute at the Washington University in St. Louis (WashU). Resulting sequence reads were aligned to GRCh37 1000 Genomes reference (human_g1k_v37.fasta.gz within GATK resources) and genotype calling of both bi-allelic single nucleotide variants (SNVs) and insertion–deletion variants (indels) was performed centrally on the entire sample set using GATK-HaplotypeCaller and Atlas V2. The ADSP QC working group then applied pipeline-specific variant-level and sample-level quality control criteria, and a consensus call set was generated by including high-certainty genotype calls that were identical across the two calling pipelines. Multiple algorithms are also being applied to call structural variants from WGS data. As the ADSP progresses to the replication phase, sequencing centers are shifting their sequencing and alignment protocols to use GRCh38. A visual overview of the ADSP annotation pipeline is shown in Figure 1. Variants (SNVs) are first annotated to all Ensembl transcripts, producing multiple possible functional consequence predictions per variant. Variant consequences are then ranked by a custom ranking process based on Ensembl’s severity ranking, but that up-weights consequences of protein-coding transcripts (to produce an overall most-damaging consequence per gene for each variant) relative to non-coding or nonsense mediated decay transcripts. We also produce rankings using tissue-specific transcripts based on RNA-sequencing isoform data (to produce a tissue-specific most-damaging consequence per gene for each variant). These variant annotations are released as flat-files allowing one variant to have multiple annotated consequences. Attributes that are specific to a single variant agnostic of transcript information, such as external allele frequency data, variant scores etc. are generated and stored as a single flat annotation (with one entry per variant). While the primary unit of analysis for ADSPs is currently SNVs, structural variants of various sizes are also being called for analysis and are processed by our annotation pipeline with some limitations [SPIDEX™ and Contextual Analysis of TF Occupancy (CATO) scores are not available for structural variants].

Fig. 1.

Overview of ADSP annotation pipeline. The process begins with VCF input (left). Solid square items represent pipeline workflow processes, cylinder items are external data sources, and open items are intermediate files or outputs (right)

2.2 Comparisons to other variant sets

For the purposes of both validating variant calling, and to annotate known variants with additional frequency information, we accessed multiple sources of allele frequency information. We report minor allele frequencies (MAF) for variants called within two major genotyping projects, the 1000 Genomes Project (Abecasis ) [including AFR (African), AMR (American), ASN (Asian) and EUR (European) populations] and the NHLBI-ESP [AA (African-American) and EA (European-American) populations]. Together, these two sources reflect exome variation and low coverage sequencing of the remainder of the genome. We also included variant information from the ExAC representing exome variants called from ∼60 000 samples. We accessed additional variant allele frequencies from the Known Variants (Kaviar) database (Glusman ), comprised of SNVs, small insertions and deletions, and complex variants discovered in humans. It contains 162 million SNV sites, of which 25 million were not reported in dbSNP, and data from 35 projects spanning 77 781 individuals (13.2 K whole genome, 64.6 K exome). The database also includes 50 million short indels and substitutions from a subset of the data sources. Kaviar excludes cancer genomes but includes some data from cell lines and individuals affected by disease. Given the importance of age on AD phenotypes, we also accessed WGS data from the Wellderly study (Erikson ). The Wellderly study investigates a cohort of elderly individuals with an age range of 80–105 years and who have no diagnosed chronic medical diseases. Using genomic sequencing, the whole genomes of 600 participants were analyzed and correlated with genetic data collected by Inova Translational Medicine Institute (ITMI) from 1507 adults, representing the general population.

2.3 Variant consequence prediction and tissue-specific consequences

The starting point of our annotation strategy, VEP (Yourshaw ), overlays variant positions with extensive resources from the Ensembl database, and algorithmically assigns a predicted consequence to a variant based on where it falls with genes, transcripts and protein sequence, and further assigns a Sequence Ontology (SO) term (Mungall ) for every variant consequence. Variant consequences are reported with respect to each transcript and protein within the Ensembl database, meaning that a single variant can have multiple consequences, relative to the multiple distinct transcripts it affects. Additional identifiers are reported from various data sources, including from dbSNP (Smigielski, 2000), consensus coding sequence (CCDS) (Pruitt ), Uniprot (Magrane and Consortium, 2011) and HGVS (den Dunnen ). Because multiple variant consequences are reported for each transcript/protein affected by a variant, Ensembl VEP provides an option (‘–per_gene’) to generate a ‘most damaging consequence’ for each gene–variant relationship. Using this option, when a variant overlaps multiple possible transcripts for a gene, the consequences are ranked according to multiple criteria within the VEP logic, including the canonical status of the transcript, an Ensembl estimate of transcript support level, the biological type of transcript (preferring protein-coding transcripts) and translated transcript length. For the specific purposes of ADSP analyses, we constructed a custom ranking table (Supplementary Table S1) that down-weights transcripts undergoing nonsense mediated decay and RNA-based transcripts. Using this approach, we generate both full variant annotation and a ‘most damaging consequence’ annotation. To provide a tissue-specific set of variant annotations, we accessed RNA-seq based transcript expression data from a study of 276 samples of temporal cortex and 275 samples of cerebellum (Allen ), generated as part of the Accelerating Medicines Partnership for Alzheimer’s Disease (AMP-AD). Normalized transcripts were accessed from the AMP-AD Portal (AMPAD Knowledge Portal, 2016; Hodes and Buckholtz, 2016). Based on prior assessments of RNA-seq data (Hebenstreit ), transcripts were considered ‘highly expressed’ within a tissue if the average normalized transcript count >1 (corresponding to log2 > 0).

2.4 Additional annotations for coding variants

VEP also provides SIFT and PolyPhen predictions for each variant. These scores assess the impact of amino acid substitution based on sequence homology and the physical properties of amino acids (Kumar ) and impact on structure and function of a human protein (Adzhubei ). We also employ external scoring metrics including CADD (Kircher ) and SPIDEX (Xiong ). These metrics gauge how damaging a variant might be, and thus whether the variant would be a potential candidate for subsequent investigation. The CADD score (Kircher ) evaluates the deleteriousness of SNVs as well as indels in the human genome. CADD is based on allelic diversity, pathogenicity of both coding and non-coding variants, experimentally measured regulatory effects, and quantitatively prioritizes variants across a wide range of functional categories, effect sizes and genetic architectures. While CADD was designed to provide uniform variant pathogenicity scoring across both coding and non-coding regions, some comparisons indicate that CADD may be best suited for scoring coding variants (Shihab ). CADD also relies on evolutionary conservation as a predictive feature, and it is unclear the extent to which risk alleles for a late-onset condition like AD will be conserved. The CADD score is presented on a PHRED-based scale, with scores from 0 to 20 generally considered as non-deleterious and scores >20 generally considered as deleterious, with an accuracy of above 99.0% (the variant is amongst the top 1% of deleterious variants). SPIDEX (Xiong ) assesses whether a SNV causes a dysregulation of a splicing event. The score covers all synonymous, missense and nonsense exonic SNVs, as well as intronic SNVs that are in proximity of splice junctions. It evaluates exons on RefSeq transcripts for predefined features and uses a prediction model to infer whether the exon was spliced into the transcript given the reference genome and the given variant with respect to 16 human tissues. The SPIDEX scores exhibit substantial predictive performance, especially for deleterious mutations, such as intronic mutations that are far from splice sites. SPIDEX scores are available as transformed z-scores, with values >3 indicating a high likelihood of splicing.

2.5 Additional annotations for non-coding variants

Unlike coding variants, VEP provides only minimal annotation of non-coding variants, indicating only intron-exon boundaries or overlap with a known regulatory feature, often a sequence motif or epigenetic state. We expanded this set of annotations to include expressed enhancers and predictions of transcription factor occupancy genome-wide. Using capped analysis of gene expression (CAGE) data, we identified expressed enhancer elements from the FANTOM5 project (Andersson ). Elements were used from all available cellular contexts, and were matched (by the most relevant tissue) to expression quantitative trait loci (eQTL) associations from the genotype tissue expression (GTEx) project (Mele ) (analysis by tissue v6) to identify enhancer–gene relationships. Because eQTL data were only used to establish a linkage between known enhancer elements and the genes they regulate, we used a nominal significance level (P < 0.05). After compiling data, we identified WGS-based variants that were in or within +/−500 bases from 20 220 enhancer elements and tied these elements to the genes they putatively impact. We also included CATO scores (Maurano ) to provide a recalibrated probability of affecting the binding of any transcription factor. CATO also provides a quantitatively ranked list of transcription factor families where binding might be altered. CATO scores can only be applied to variants in regulatory regions and therefore are only applicable to a specific subset of the ADSP data. CATO scores above 0.1 are generally considered indicative of variants that significantly alter transcription factor binding.

2.6 LiftOver-based annotation

Because of the multi-stage nature of the ADSP, the project data will migrate to build 38 between the discovery and replication phases. This migration will raise many issues, most imminently, how to compare statistical results from the discovery phase (build 37) to those of the replication phase (build 38). Variants mapped within build 37 were transitioned to build 38 using LiftOver, part of the UCSC Genome Browser toolset (Karolchik ). For known variants (those with RSIDs), the resulting LiftOver build 38 coordinates were compared to the build 38 coordinates from the Ensembl Variant database (version 80), which were mapped to build 38 using the Ensembl pipeline. Variants called by the ADSP QC working group were assigned unique variant identifiers consisting of chromosome, position, reference allele, alternate allele and genomic build. Additional variant identifiers (such as reference sequence RS numbers) are assigned as part of our annotation process from external data sources, including Ensembl (version 80) and dbSNP (build 144). To facilitate the comparison of statistical results and to support annotation-based replication analyses, we have also constructed a single unique variant index to identify variants in a build-agnostic manner.

3 Results

3.1 Allelic discovery in the ADSP

At the completion of the Discovery Phase of the ADSP, a total of 578 individuals from 111 families were whole-genome sequenced, and 10 913 unrelated cases and controls were whole-exome sequenced. An overview of variant annotations and frequencies is shown in Table 1. Following QC, WGS generated 27 896 774 called variants, with just over 5 million novel variants (18.16%) not previously reported by dbSNP (b144) or ExAC (v0.3). The majority of variants (84%) are annotated as intronic or intergenic. Based on Ensembl VEP, variants were annotated to an average of 3.5 features, including transcripts and regulatory motifs. Crude allele frequency estimates (not accounting for family structure) show the majority of identified variants are rare (MAF < 0.01) with ∼25% observed on only one or two chromosomes. Compared to other published studies of exome-sequence variants, we observe a larger number (22.4%) of common variants (MAF > 0.05).

Table 1.

Overview of ADSP variant annotations

	Whole exome (case/control)		Whole genome (family-based)
Variants called	1 586 703	—	27 896 774	—
Variants annotated	1 586 703	—	27 674 996	—
Variants unannotated	0	—	221 778	—
Variants in ExAC v0.3	933 318	58.82%	361 205	1.29%
Variants in dbSNP	936 417	59.02%	22 837 563	81.86%
Variants in ClinVar	17 860	1.13%	10 960	0.04%
Variants in Wellderly	163 733	10.31%	10 304 395	36.93%
Novel variants	608 092	38.32%	5 065 664	18.16%
Average transcripts per variant	7.75	—	3.494	—
AF>0.05	35 377	2.23%	6 247 716	22.40%
0.01<AF< 0.05	20 566	1.30%	4 795 467	17.19%
Two observations<AF<0.01	1 000 280	63.04%	9 187 863	32.94%
Two observations	152 770	9.63%	3 554 970	12.74%
One observation	377 555	23.79%	4 110 758	14.74%

Overview of ADSP variant annotations In contrast, WES revealed 1 586 703 called variants, of which 38.32% were previously unreported. Over all variants, an average of 7.75 transcripts were impacted per variant, illustrating the enrichment of coding variants. The categories of predicted variant consequence are shown relative to the allele frequency spectrum of WES variants in Figure 2. The frequency spectrum of captured variants is heavily shifted toward low-frequency alleles (MAF < 0.01), with only 3.5% of variants having a larger frequency. We observe an expected enrichment for missense and synonymous variants relative to other consequence types. CADD scores are highest on average for stop-gained variants, followed by splicing-associated variants. There is also a notable relative lack of overlap with variants identified in the Wellderly cohort in the WES versus WGS, suggesting a relative lack of coding alleles in this successfully aged cohort.

Fig. 2.

Allele frequency spectrum by variant annotation (whole-exome sequencing). Total variant counts from the ADSP WES case/control dataset are shown by VEP predicted consequence and dataset minor allele frequency (inset legend). CADD score averages (center point) ±1 SD are shown as embedded lines

3.2 Non-coding variant annotation

∼3% of WGS-identified variants are annotated to a regulatory region. Regulatory annotations are shown by variant frequency in Figure 3. VEP reported regulatory consequences for many variants, including 438 461 variants within a CTCF binding site, 295 603 variants within an open chromatin region, 707 302 within predicted enhancer regions, 437 642 within predicted promoter regions, 1 719 607 within predicted promoter flanking regions and 247 701 within transcription factor binding sites. Of these, variants within predicted promoter regions have the highest CADD scores (avg. 7.7, SD 4.98). In contrast, CATO score predictions of transcription factor occupancy are highest in transcription factor binding sites and CTCF binding sites, as expected (avg. 0.05, SD 0.06).

Fig. 3.

Allele frequency spectrum by regulatory annotation (whole-genome sequencing). Total variant counts from the ADSP WGS family-based dataset are shown by either VEP predicted regulatory consequence or FANTOM5 enhancer annotation, and crude minor allele frequency estimates (inset legend) from the dataset. CADD score averages (center point) ±1 SD are shown as embedded lines We also expanded our annotation set by including regulatory enhancers identified from FANTOM5 and matched to nearby genes using eQTL from the GTEx) project (Mele ). Mapping distant regulatory elements to the genes they regulate provides a biology-based means to include additional variants into a gene unit-based association test. Including FANTOM5-based enhancers, we annotate an additional 86 789 variants from the WGS dataset to 31 351 genes. Variants are an average of 380 KB (SD 298 KB) from the transcription start or end site when they fall outside gene bounds, and add an average of 35 variants to each gene-unit. Variants also have an average CADD score of 6.2 (SD 4.5), and an average CATO score of 0.042 (SD 0.048), showing similar properties to the VEP predicted promoter regions. Of these variants, 1576 have CADD scores >20 (indicating a predicted damaging impact), which likely influence 10 528 genes. In total, 1111 regulatory variants potentially impacting 8375 genes have CATO scores >0.2, indicating a high probability of transcription factor occupancy.

3.3 Tissue-specific transcript reference changes annotations

Given the relevance of brain tissues for AD risk, we examined the impact of selecting brain-specific transcript references on variant annotation. Using both the full transcript reference and a cerebellum-specific transcript reference, we generated a ‘most damaging consequence per gene’ for each variant. With the full Ensembl transcript set, 1 586 703 variants had 2 812 045 predicted consequences ultimately affecting 30 121 genes. In this context, the Ensembl definition of ‘gene’ includes immunoglobulin, RNA-based and pseudo genes. Restricting annotations to only genes highly expressed in the cerebellum based on RNA-sequencing, we annotate 1 230 598 variants with 1 574 165 predicted consequences affecting 14 312 genes. The RNA-seq dataset notably reduced annotations to mostly (90.8%) protein-coding transcripts. Of variants annotated using both transcript references, 95% of variant annotations were identical, with 63 010 annotations (4%) having a different predicted consequence relative to cerebellum-specific transcripts. Figure 4 illustrates the change in variant consequence relative to total variant annotations. These differences were largely due to alternative splicing, with 27% annotated as intron variants. Most other changes are due to alternate transcription start sites, annotated as upstream gene variants (10.8%) and 5′-UTR variants (4%), and alternate transcription end sites, annotated as downstream gene variants (4.46%) and 3′-UTR variants (8.58%).

Fig. 4.

Transitions between WES annotation consequences when using full versus cerebellum-expressed transcript references. The total number of variant consequences from the ADSP WES case/control dataset is shown in the inner ring. Transitions between variant consequences when shifting from the full transcript set to cerebellum-expressed transcripts are shown via internal lines, with proportions shown in the outer ring. For example, the most common transition was from missense to intron due to differential splicing in the cerebellum Because a likely application of variant annotations is to weight or prioritize variants for gene-based burden tests, we also examined the impact that transcript references have to the collection of consequences by gene. While only 4% of variant annotations are altered when using a cerebellum-based transcript reference, those annotations impact 4490 genes. Of the genes affected, 3973 (88.4%) contained at least one annotation with a different predicted impact on the gene function, indicating that for gene-based tests that use variant impact (such as loss-of-function analyses) may be significantly altered by selecting a tissue-specific transcript reference.

3.4 LiftOver accuracy assessments

We examined the impact of using LiftOver by first quantifying variants that are unmapped by the LiftOver process (only 148 from WES and 3831 from WGS), and by comparing the resulting GRCh38 coordinates for known variants (identified by RSIDs) to those reported in the Ensembl Build 38 database. From the WES data, 15 751 variants (1% of all mapped variants) showed discrepancy between the LiftOver coordinates and the Ensembl coordinates, with an average of 55 808 bases away from the Ensembl position. From the WGS data, this proportion slightly increased with 360 459 variants (1.29%) showing discrepancy with an average of 831 512 bases away from the Ensembl position. As expected from WES data, the majority of these coordinate mismatches affect missense (53.3%) and synonymous (33.58%) variants. Within the WGS data, most variants are intronic (46.63%) and intergenic (29.39%), with a small number of downstream (6.625) and upstream variants (8.72%).

4 Discussion

A growing list of annotation resources are now available for the prioritization and filtering genomic variants. Due to the density of available data, a key hurdle of sequencing projects is to precisely define a priori what annotation resources will be used to test various functional hypotheses within the data. In this paper, we describe a variant annotation pipeline developed for the ADSP, and the resulting annotations for over 30 million distinct variants, over 5 million of which are novel. As expected, the vast majority of variants identified are low-frequency events, with WES of AD cases and controls identifying mostly missense and synonymous variants, and WGS of AD-affected family members identifying intergenic, intronic and regulatory variants. While the general practice in genomic analysis is to assume a single, most-damaging biological consequence per variant, it is important to note that variants have a variety of effects across biological contexts—94% of all annotated WES variants impact two or more transcripts. Due to tissue-specific splicing and expression, this implies that no variant has a singular effect. As others have noted (Frankish ), the choice of transcript reference can have a significant impact on predicted variant consequences, however this work explores changes to the global transcript reference set (Ensembl versus RefSeq). Given the specific relevance of the central nervous system for AD pathogenesis, we explored how tissue-specific splicing changes variant annotation. Restricting annotations to transcripts within tissues of interest (cerebellum and temporal cortex) showed only a modest change to individual variant annotations; however, these changes can have a large impact on the downstream gene-based analyses that use them. With nearly one-fourth of protein-coding genes harboring variants that change their consequence and impact, statistical analyses using tissue-specific annotation could result in dramatically different association test results. Generating a definitive conclusion on what is ‘expressed’ in any given tissue however is difficult, and introduces new algorithm thresholds that may influence the resulting annotation. Applying regulatory annotations from the FANTOM5 project in conjunction with eQTL data from the GTEx project provides a tissue-specific set of long-range enhancers tied to the genes they potentially regulate. Including these annotations for variants identified in WGS of families extends the traditional concept of gene-based unit tests to include variants within their regulatory elements, adding on average 35 variants to each gene-unit. Average CADD scores for FANTOM5 enhancer elements are increased relative to other regulatory annotations, and incorporating these along with promoter elements should greatly improve the interpretation of regulatory rare-variant hypothesis tests. Ideally, all annotation resources would be regenerated for GRCh38, however this would require extensive effort to reconstitute the enormous training data resources for prediction algorithms such as CADD. Anticipating the need to use LiftOver for GRCh37-mapped annotation resources, we explored how well LiftOver mapped our own variants into GRCh38 by comparing the coordinates of known variants to the GRCh38 version of Ensembl. While only a fraction of a percentage of variants were unmapped by the LiftOver process, roughly 1% of known variants had discrepant positions between LiftOver and Ensembl. The majority of these were localized to chromosome 6, likely owing to the complexity of the major histocompatibility complex region, but also to chromosomes 15 and 17. These mapping errors were not insignificant, with LiftOver coordinates an average of 50 KB from the reported Ensembl position. This result indicates that while LiftOver is an important stop-gap approach for mapping annotation resources into GRCh38, post-hoc checks for annotation accuracy will be needed to ensure that the biological implications of GRCh38 variants are properly portrayed, and whenever possible, resources constructed based on GRCh38 should be used instead. All together, we show that annotation is more than ever a critical component of genetic data analysis. Regardless of the annotation strategy employed, it is critical to be aware that variant annotation injects biological assumptions into the statistical analysis of genomic data. Providing a uniform set of annotation resources for all ADSPs will ease comparisons between analyses and informs the interpretation of results.

Funding

A detailed acknowledgment of all funding sources supporting this work is available in the Supplementary Information. Conflict of Interest: none declared. Click here for additional data file.

46 in total

1. Pooled association tests for rare variants in exon-resequencing studies.

Authors: Alkes L Price; Gregory V Kryukov; Paul I W de Bakker; Shaun M Purcell; Jeff Staples; Lee-Jen Wei; Shamil R Sunyaev
Journal: Am J Hum Genet Date: 2010-05-13 Impact factor: 11.025

2. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Authors: Kim D Pruitt; Jennifer Harrow; Rachel A Harte; Craig Wallin; Mark Diekhans; Donna R Maglott; Steve Searle; Catherine M Farrell; Jane E Loveland; Barbara J Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L Cherry; Val Curwen; Michael Dicuccio; Manolis Kellis; Jennifer Lee; Michael F Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart; Bonnie L Maidak; Jonathan Mudge; Michael R Murphy; Terence Murphy; Jeena Rajan; Bhanu Rajput; Lillian D Riddick; Catherine Snow; Charles Steward; David Webb; Janet A Weber; Laurens Wilming; Wenyu Wu; Ewan Birney; David Haussler; Tim Hubbard; James Ostell; Richard Durbin; David Lipman
Journal: Genome Res Date: 2009-06-04 Impact factor: 9.043

3. Accelerating Medicines Partnership: Alzheimer's Disease (AMP-AD) Knowledge Portal Aids Alzheimer's Drug Discovery through Open Data Sharing.

Authors: Richard J Hodes; Neil Buckholtz
Journal: Expert Opin Ther Targets Date: 2016-02-07 Impact factor: 6.902

Review 4. An integrated computational pipeline and database to support whole-genome sequence annotation.

Authors: C J Mungall; S Misra; B P Berman; J Carlson; E Frise; N Harris; B Marshall; S Shu; J S Kaminker; S E Prochnik; C D Smith; E Smith; J L Tupy; C Wiel; G M Rubin; S E Lewis
Journal: Genome Biol Date: 2002-12-23 Impact factor: 13.583

5. Human genomics. The human transcriptome across tissues and individuals.

Authors: Marta Melé; Pedro G Ferreira; Ferran Reverter; David S DeLuca; Jean Monlong; Michael Sammeth; Taylor R Young; Jakob M Goldmann; Dmitri D Pervouchine; Timothy J Sullivan; Rory Johnson; Ayellet V Segrè; Sarah Djebali; Anastasia Niarchou; Fred A Wright; Tuuli Lappalainen; Miquel Calvo; Gad Getz; Emmanouil T Dermitzakis; Kristin G Ardlie; Roderic Guigó
Journal: Science Date: 2015-05-08 Impact factor: 47.728

6. UniProt Knowledgebase: a hub of integrated protein data.

Authors: Michele Magrane
Journal: Database (Oxford) Date: 2011-03-29 Impact factor: 3.451

7. Human whole genome genotype and transcriptome data for Alzheimer's and other neurodegenerative diseases.

Authors: Mariet Allen; Minerva M Carrasquillo; Cory Funk; Benjamin D Heavner; Fanggeng Zou; Curtis S Younkin; Jeremy D Burgess; High-Seng Chai; Julia Crook; James A Eddy; Hongdong Li; Ben Logsdon; Mette A Peters; Kristen K Dang; Xue Wang; Daniel Serie; Chen Wang; Thuy Nguyen; Sarah Lincoln; Kimberly Malphrus; Gina Bisceglio; Ma Li; Todd E Golde; Lara M Mangravite; Yan Asmann; Nathan D Price; Ronald C Petersen; Neill R Graff-Radford; Dennis W Dickson; Steven G Younkin; Nilüfer Ertekin-Taner
Journal: Sci Data Date: 2016-10-11 Impact factor: 6.444

8. A groupwise association test for rare mutations using a weighted sum statistic.

Authors: Bo Eskerod Madsen; Sharon R Browning
Journal: PLoS Genet Date: 2009-02-13 Impact factor: 5.917

9. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

10. The UK10K project identifies rare variants in health and disease.

Authors: Klaudia Walter; Josine L Min; Jie Huang; Lucy Crooks; Yasin Memari; Shane McCarthy; John R B Perry; ChangJiang Xu; Marta Futema; Daniel Lawson; Valentina Iotchkova; Stephan Schiffels; Audrey E Hendricks; Petr Danecek; Rui Li; James Floyd; Louise V Wain; Inês Barroso; Steve E Humphries; Matthew E Hurles; Eleftheria Zeggini; Jeffrey C Barrett; Vincent Plagnol; J Brent Richards; Celia M T Greenwood; Nicholas J Timpson; Richard Durbin; Nicole Soranzo
Journal: Nature Date: 2015-09-14 Impact factor: 49.962

12 in total

1. Identification of risk genes for Alzheimer's disease by gene embedding.

Authors: Yashwanth Lagisetty; Thomas Bourquard; Ismael Al-Ramahi; Carl Grant Mangleburg; Samantha Mota; Shirin Soleimani; Joshua M Shulman; Juan Botas; Kwanghyuk Lee; Olivier Lichtarge
Journal: Cell Genom Date: 2022-07-26

2. Genetic Variation in Genes Underlying Diverse Dementias May Explain a Small Proportion of Cases in the Alzheimer's Disease Sequencing Project.

Authors: Elizabeth E Blue; Joshua C Bis; Michael O Dorschner; Debby W Tsuang; Sandra M Barral; Gary Beecham; Jennifer E Below; William S Bush; Mariusz Butkiewicz; Carlos Cruchaga; Anita DeStefano; Lindsay A Farrer; Alison Goate; Jonathan Haines; Jim Jaworski; Gyungah Jun; Brian Kunkle; Amanda Kuzma; Jenny J Lee; Kathryn L Lunetta; Yiyi Ma; Eden Martin; Adam Naj; Alejandro Q Nato; Patrick Navas; Hiep Nguyen; Christiane Reitz; Dolly Reyes; William Salerno; Gerard D Schellenberg; Sudha Seshadri; Harkirat Sohi; Timothy A Thornton; Otto Valadares; Cornelia van Duijn; Badri N Vardarajan; Li-San Wang; Eric Boerwinkle; Josée Dupuis; Margaret A Pericak-Vance; Richard Mayeux; Ellen M Wijsman
Journal: Dement Geriatr Cogn Disord Date: 2018-02-27 Impact factor: 2.959

Review 3. Harnessing endophenotypes and network medicine for Alzheimer's drug repurposing.

Authors: Jiansong Fang; Andrew A Pieper; Ruth Nussinov; Garam Lee; Lynn Bekris; James B Leverenz; Jeffrey Cummings; Feixiong Cheng
Journal: Med Res Rev Date: 2020-07-13 Impact factor: 12.944

4. Rare genetic variation implicated in non-Hispanic white families with Alzheimer disease.

Authors: Gary W Beecham; Badri Vardarajan; Elizabeth Blue; William Bush; James Jaworski; Sandra Barral; Anita DeStefano; Kara Hamilton-Nelson; Brian Kunkle; Eden R Martin; Adam Naj; Farid Rajabli; Christiane Reitz; Timothy Thornton; Cornelia van Duijn; Allison Goate; Sudha Seshadri; Lindsay A Farrer; Eric Boerwinkle; Gerard Schellenberg; Jonathan L Haines; Ellen Wijsman; Richard Mayeux; Margaret A Pericak-Vance
Journal: Neurol Genet Date: 2018-11-21

5. GWAS and Beyond: Using Omics Approaches to Interpret SNP Associations.

Authors: Hung-Hsin Chen; Lauren E Petty; William Bush; Adam C Naj; Jennifer E Below
Journal: Curr Genet Med Rep Date: 2019-02-14

Review 6. Bringing radiomics into a multi-omics framework for a comprehensive genotype-phenotype characterization of oncological diseases.

Authors: Mario Zanfardino; Monica Franzese; Katia Pane; Carlo Cavaliere; Serena Monti; Giuseppina Esposito; Marco Salvatore; Marco Aiello
Journal: J Transl Med Date: 2019-10-07 Impact factor: 5.531

7. M-DATA: A statistical approach to jointly analyzing de novo mutations for multiple traits.

Authors: Yuhan Xie; Mo Li; Weilai Dong; Wei Jiang; Hongyu Zhao
Journal: PLoS Genet Date: 2021-11-04 Impact factor: 5.917

8. PSCAN: Spatial scan tests guided by protein structures improve complex disease gene discovery and signal variant detection.

Authors: Zheng-Zheng Tang; Gregory R Sliwoski; Guanhua Chen; Bowen Jin; William S Bush; Bingshan Li; John A Capra
Journal: Genome Biol Date: 2020-08-26 Impact factor: 13.583

9. Hadoop and PySpark for reproducibility and scalability of genomic sequencing studies.

Authors: Nicholas R Wheeler; Penelope Benchek; Brian W Kunkle; Kara L Hamilton-Nelson; Mike Warfe; Jeremy R Fondran; Jonathan L Haines; William S Bush
Journal: Pac Symp Biocomput Date: 2020

10. Author Correction: Genome-wide human brain eQTLs: In-depth analysis and insights using the UKBEC dataset.

Authors: Letitia M F Sng; Peter C Thomson; Daniah Trabzuni
Journal: Sci Rep Date: 2020-10-01 Impact factor: 4.379