The ExAC browser: displaying reference data information from over 60 000 exomes.

Konrad J Karczewski^1,2, Ben Weisburd^3,2, Brett Thomas^3,2, Matthew Solomonson^3,2, Douglas M Ruderfer⁴, David Kavanagh⁴, Tymor Hamamsy⁴, Monkol Lek^3,2, Kaitlin E Samocha^3,2, Beryl B Cummings^3,2, Daniel Birnbaum^3,2, Mark J Daly^3,2, Daniel G MacArthur^3,2.

Abstract

Worldwide, hundreds of thousands of humans have had their genomes or exomes sequenced, and access to the resulting data sets can provide valuable information for variant interpretation and understanding gene function. Here, we present a lightweight, flexible browser framework to display large population datasets of genetic variation. We demonstrate its use for exome sequence data from 60 706 individuals in the Exome Aggregation Consortium (ExAC). The ExAC browser provides gene- and transcript-centric displays of variation, a critical view for clinical applications. Additionally, we provide a variant display, which includes population frequency and functional annotation data as well as short read support for the called variant. This browser is open-source, freely available at http://exac.broadinstitute.org, and has already been used extensively by clinical laboratories worldwide.

Year: 2016 PMID: 27899611 PMCID: PMC5210650 DOI: 10.1093/nar/gkw971

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Recently, large reference datasets, such as those from the 1000 Genomes Project Consortium (1), Exome Sequencing Project (ESP) (2) and Exome Aggregation Consortium (ExAC) (3), have become publicly available for the benefit of the biomedical community. These datasets are beneficial for many applications, including clinical as well as basic research. In particular, as a variant's frequency is among the best predictors of its deleteriousness, clinical geneticists use reference datasets to discern pathogenic mutations from benign polymorphisms. Additionally, genetics researchers rely on variant and allele frequency data to infer gene and variant function (e.g. whether a gene is essential) as well as for population genetics analyses. These large-scale projects release raw data in the form of variant call format (VCF) files, but these files require bioinformatics expertise to parse and synthesize.

Genome browsers, such as the UCSC genome browser (4), have become a popular method for non-technical audiences to visualize large genome-scale datasets. Additionally, browsers of variation data, including the Exome Variant Server (EVS) from ESP and the 1000 Genomes Browser, have been developed to present population data, but these are limited in the data they display. For instance, deviations in coverage, which affect one's confidence in the absence of variation, are not natively shown: EVS contains a link to a coverage track on UCSC, but coverage is not visualized on the page itself.

There are a number of practical considerations for the optimal display of reference data. Specifically, as one primary use case for genome browsers involves gene-level analyses, the display of gene summary information is a central view for a genome browser, including integration of summary statistics as well as data for individual single nucleotide variants (SNVs), insertions and deletions (indels) and copy number variants (CNVs).
Of course, detailed information on each variant, including annotations and quality metrics, is of paramount importance. However, an equally important display is that of the absence of variation: whether a missing variant implies a lack of observed variation, low or no coverage in the genomic region, or variation filtered due to poor quality. The Exome Aggregation Consortium (ExAC) has collected, harmonized, and released exome sequence data from 60 706 individuals (3). Already, these data have proven useful in filtering variants for identifying causal variants for rare disease (5,6,7). Here, we present a visual browser of the ExAC dataset. The browser is intended for use by clinical geneticists researching variants of interest for patients as well as biologists exploring variation in specific genes.
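
To illustrate why raw VCF files call for some scripting before they can be interpreted, the following minimal Python sketch parses a single VCF data line and computes a population-specific allele frequency as allele count divided by allele number. The INFO keys (AC_AFR, AN_AFR) and the example record are assumptions made for illustration and are not values taken from the ExAC release.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AC=3;AN=100;DB' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True  # flag entries carry no value
    return info


def allele_frequency(vcf_line, population):
    """Return AC/AN for one population from a single-allele VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the eighth VCF column
    allele_count = int(info[f"AC_{population}"])
    allele_number = int(info[f"AN_{population}"])
    # An AN of zero means no called genotypes, so the frequency is undefined.
    return allele_count / allele_number if allele_number else None


# Hypothetical record, for illustration only.
example = "1\t55505647\t.\tG\tT\t1000\tPASS\tAC_AFR=2;AN_AFR=8000;DB"
print(allele_frequency(example, "AFR"))
```

A real release contains multi-allelic sites, where AC holds one value per alternate allele; this sketch deliberately ignores that case.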

ExAC BROWSER

We designed the ExAC browser as an intuitive interface to enable clinical geneticists and biologists to explore variants and genes of interest. We built a scalable browser framework to display qualitative and quantitative information for genes and variants in the ExAC dataset (see Methods), including both quality control information as well as summary statistics. The front page of the browser includes a search bar, which is seeded with autocomplete suggestions based on gene symbols and aliases, as well as sample queries. From here, there are two central units of the ExAC browser: the gene (or transcript) page and the variant page.

Gene/transcript page

The ExAC browser gene page is an overview page for gene-level information, including summary statistics, coverage, and variants. The page begins with gene metadata and external references, along with constraint information, which summarizes the gene's intolerance to variation for multiple functional classes (3,8) (Figure 1A). Next, we present single base-resolution coverage information for each exon for a number of metrics including mean, median, and proportion of individuals covered at a number of depth cutoffs (Figure 1B). Immediately below, an exon summary plot displays the position and frequency of each SNV and indel, as well as CNV count information broken down by population (Figure 1C). All individual CNV calls are provided in the form of UCSC tracks, which are linked at the top of the page and above the CNV display. Finally, the browser provides a comprehensive table for variant information, which includes the worst functional annotation across transcripts for each variant, as well as frequency information. The table is sortable and can be exported to a CSV format (Figure 1D).
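
The per-base coverage metrics described above (mean, median, and proportion of individuals covered at several depth cutoffs) can be sketched as follows. This is a minimal illustration, not the browser's own code: the function name and the particular cutoffs are assumptions.

```python
from statistics import mean, median


def summarize_coverage(depths, cutoffs=(1, 10, 20, 30)):
    """Summarize read depth at one base across individuals.

    `depths` holds one depth value per individual; `cutoffs` are
    illustrative thresholds, not necessarily those used by the browser.
    """
    summary = {"mean": mean(depths), "median": median(depths)}
    for cutoff in cutoffs:
        covered = sum(1 for depth in depths if depth >= cutoff)
        # Fraction of individuals with at least `cutoff` reads at this base.
        summary[f"over_{cutoff}"] = covered / len(depths)
    return summary


depths_at_base = [0, 8, 12, 25, 40]  # hypothetical depths for five individuals
print(summarize_coverage(depths_at_base))
```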

Figure 1.

Gene page. (A) Gene information is summarized, including links to various external resources, as well as constraint information as described in (3). For all exons in the canonical transcript, we display (B) base-level coverage for a number of metrics (mean coverage by default), as well as (C) position and frequency information for all variants, including CNVs. (D) A table of all variants is provided with additional annotation information and links to variant pages. By default, the gene summary page presents a table of all variants in the gene, annotated with the worst consequence across all transcripts, as well as coverage information for the canonical transcript: we also present a transcript page that includes annotation and coverage information specific to that transcript.
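
Annotating each variant with the worst consequence across all transcripts amounts to choosing the most severe term under some ranking. The sketch below assumes an abbreviated, illustrative severity order (most severe first) and hypothetical transcript identifiers; the ranking actually used by the browser is not reproduced here.

```python
# Abbreviated severity ranking, most severe first (illustrative assumption).
SEVERITY_ORDER = [
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]
RANK = {name: index for index, name in enumerate(SEVERITY_ORDER)}


def worst_consequence(per_transcript):
    """Return the most severe consequence among a variant's transcripts.

    `per_transcript` maps a transcript ID to its consequence term.
    Terms missing from the ranking are treated as least severe.
    """
    return min(per_transcript.values(), key=lambda term: RANK.get(term, len(RANK)))


annotations = {
    "ENST_A": "intron_variant",      # hypothetical transcript IDs
    "ENST_B": "missense_variant",
    "ENST_C": "synonymous_variant",
}
print(worst_consequence(annotations))
```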

Variant page

The variant page includes a diverse set of annotations for the given variant. First, a site overview (Figure 2A) and site- and genotype-level quality metrics are provided (Figure 2B). The user is notified whether any individual has another variant in the same codon (suggesting a multi-nucleotide variant, or MNV), whether the site is multi-allelic, or whether only a low number of individuals is covered at this locus. Functional annotations of the variant against each transcript, including PolyPhen2 (9), SIFT (10), and LOFTEE annotations (Figure 2C), as well as a sortable table of population frequencies (Figure 2D), are provided. Finally, for users who wish to evaluate the validity of specific variants, raw short-read data from a subset of individuals are available for each variant. We provide an IGVweb visualization of the read pileup of a 125 bp window around the variant (Figure 2E) for a random sample of individuals with each variant, as well as a sampling of homozygous individuals, if available. For the first time among genome browsers, we provide users with a mechanism to efficiently visualize the raw read support for a variant and assess quality issues that may not have been detected by variant calling algorithms.

Figure 2.

Variant page. (A) Variant metadata is displayed, including links to dbSNP, UCSC and ClinVar. (B) Users can browse quality metrics based on genotypes (genotype quality and depth) as well as site-level quality metrics from GATK. (C) Annotations for each transcript are provided; if a variant overlaps multiple transcripts with the same functional annotation, a dropdown box provides additional details for the annotations. (D) Allele frequency information is displayed for each continental group. (E) Short read data is provided for more technical users to assess validity of the variant call.

Non-variant information

One important consideration in displaying genetic data is the treatment of non-variant sites. In particular, if a variant or region is queried, we display metadata about the locus, whether or not variants are present in the dataset. When a user searches for a variant or region that is not covered in the ExAC dataset, the user is shown a page with coverage information for the region surrounding the query.
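
The distinction between the possible reasons for a missing variant (no observed variation, low or no coverage, or a call filtered for poor quality) can be expressed as a small decision function. The following is a hedged sketch: the record layout, the function name, and the depth threshold of 10 are assumptions for illustration, not the browser's actual logic.

```python
def describe_site(variant, median_depth, min_depth=10):
    """Explain why a queried site does or does not show a variant.

    `variant` is None when no record exists, otherwise a dict with a
    "filter" key; `median_depth` is None when there is no coverage data.
    The `min_depth` threshold is an illustrative assumption.
    """
    if variant is not None:
        if variant["filter"] == "PASS":
            return "variant observed"
        return "variant filtered due to poor quality"
    if median_depth is None or median_depth < min_depth:
        return "low or no coverage in this region"
    return "covered, but no variation observed"


print(describe_site(None, median_depth=35))
print(describe_site(None, median_depth=2))
print(describe_site({"filter": "VQSRTrancheSNP99.60to99.80"}, median_depth=35))
```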

Additional considerations

As current web browsers and connections benefit from smaller data transfers and footprints, we have developed a number of optimizations to the browser, including compression and caching of data for large genes (Methods). Finally, the browser is optimized for mobile browsing, where extraneous information is hidden when browsing from a mobile device.

DISCUSSION

Here, we have described a browser for reference variation data, whose use has become widespread in clinical genetics laboratories across the world. As of this writing (8/1/2016), the browser has had over 5 million pageviews and 250 000 users spanning over 188 countries. The top 10 genes and top three variants visited by users are shown in Table 1.

Table 1.

Top genes and variants viewed in the ExAC browser

Gene/Variant	Associated syndromes	Pageviews
PCSK9	(Linked on front page of browser) Hypercholesterolemia	13 540
BRCA1	Breast cancer susceptibility	8251
BRCA2	Breast cancer susceptibility	7408
CFTR	Cystic Fibrosis	5179
FBN1	Marfan Syndrome	4886
TP53	Cancer susceptibility	3712
TTN	Gene that encodes for the largest protein, cardiomyopathy	3528
MYH7	Cardiomyopathy	3497
MYBPC3	Cardiomyopathy	3398
SCN5A	Brugada syndrome, long QT, cardiomyopathy	3175
rs113993960	Cystic fibrosis (CFTR deltaF508)	203
rs1799966	Breast cancer (BRCA1 missense variant)	157
rs11571833	Breast cancer (BRCA2 stop-gained variant)	120

There are a number of limitations to the ExAC browser. First, the browser only displays data from the exome, or the coding regions of the genome. While these are typically of highest relevance for clinical genetics, there are many non-coding regions that are known to be important for human disease and studied by researchers. An updated version of the browser to allow whole-genome data will be useful for these applications, but additional scalability considerations will be required to display these expanded datasets. Accordingly, queries are limited to 100 kb to ensure a timely return of results. The addition of an optimized API would provide additional flexibility for the browser, as well as serve the needs of researchers needing to do bulk queries for larger-scale analyses. Finally, the quality of the database relies on the quality of the variant calls and annotations contained within ExAC: as new callsets become available, the browser will be updated with new versions that adopt new computational methods and sequencing technologies. The code is open-source and available at http://github.com/konradjk/exac_browser. The browser framework established can be privately cloned and used for internal sequencing projects, as well as extended to a number of applications, such as a browser for results from genome-wide association studies (GWAS).

METHODS

Data sources

As of this writing (8/1/2016), version 0.3.1 of the ExAC dataset, as described in (3), was used for the ExAC Browser. Variants were annotated using the Variant Effect Predictor (VEP) version 81 (11,12) against the Gencode v19 transcript set. RSIDs were obtained from dbSNP version 142 and gene names and aliases were extracted from dbNSFP (13,14). Histograms for various genotype-specific quality metrics, such as per-sample genotype quality and depth, are pre-computed using a custom Python script (https://github.com/macarthur-lab/exac_2015/blob/master/src/prepare_exac_sites_vcf.py). MNVs and constraint metrics are pre-calculated as described in (3). Reassembled read data was generated for each of the 9.8 million variants in ExAC v0.3.1 by running GATK HaplotypeCaller 3.1 (full version: v3.1-1-ga70dc6e) with the -bamout flag on each sample containing the particular variant (up to a limit of five homozygous and five heterozygous samples). Only samples with a read depth (DP) ≥10 and genotype quality (GQ) ≥20 were included. When a variant was present in more than five such samples, the five samples with the highest GQ were selected. Overall, HaplotypeCaller was run 22.3 million times to produce over 5 Tb of small BAM files, with each BAM file storing reassembled reads for a window of several hundred base pairs around the variant. Batches of several thousand of these small BAM files were then combined into larger BAM files to improve compression ratios, while using read groups to keep track of the original source of each read. The final dataset comprised ∼23 000 BAMs and spanned 540 Gb. These BAM files were made directly available over the web and visualized in the ExAC browser using IGV.js.
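
The sample-selection rule for read visualization described above (DP ≥10, GQ ≥20, and at most five heterozygous and five homozygous samples, preferring the highest GQ) can be sketched as follows. The Genotype record and the function name are hypothetical; the actual pipeline is the one linked in this section.

```python
from typing import NamedTuple


class Genotype(NamedTuple):
    sample_id: str
    zygosity: str  # "het" or "hom"
    dp: int        # read depth
    gq: int        # genotype quality


def select_samples(genotypes, min_dp=10, min_gq=20, max_per_class=5):
    """Pick up to `max_per_class` het and hom carriers, highest GQ first."""
    selected = {"het": [], "hom": []}
    eligible = [g for g in genotypes if g.dp >= min_dp and g.gq >= min_gq]
    for zygosity in selected:
        candidates = [g for g in eligible if g.zygosity == zygosity]
        candidates.sort(key=lambda g: g.gq, reverse=True)
        selected[zygosity] = candidates[:max_per_class]
    return selected


calls = [
    Genotype("s1", "het", dp=30, gq=99),
    Genotype("s2", "het", dp=8, gq=99),   # fails the depth filter
    Genotype("s3", "hom", dp=25, gq=15),  # fails the GQ filter
    Genotype("s4", "hom", dp=40, gq=60),
]
chosen = select_samples(calls)
print([g.sample_id for g in chosen["het"]], [g.sample_id for g in chosen["hom"]])
```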
Besides the -bamout flag, the following additional flags were passed to HaplotypeCaller so that its gVCF genotype calls matched the original ExAC gVCF genotypes; they are listed here to facilitate reproducibility: -ERC GVCF --paddingAroundSNPs 300 --paddingAroundIndels 300 --max_alternate_alleles 3 -A DepthPerSampleHC -A StrandBiasBySample --maxNumHaplotypesInPopulation 200 -stand_call_conf 30.0 -stand_emit_conf 30.0 --disable_auto_index_creation_and_locking_when_reading_rods --minPruning 3 --variant_index_type LINEAR --variant_index_parameter 128000. This data processing was managed by a Python-based pipeline available here: https://github.com/macarthur-lab/exac_readviz_scripts.

CNVs were generated using XHMM (15) and based on GENCODE v19 coding regions: all details of CNV calling and quality control have been published previously (16). Gene summary CNV counts and related constraint scores are presented based on likelihoods of the CNV occurring within the genomic range of the gene, as described (16). Exon CNV counts and CNVs presented in the UCSC browser are based on all confidently called CNVs (XHMM SQ > 60) across the genome. All overlapping CNVs, regardless of amount of overlap, are included in Exon CNV counts.
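
The exon CNV counting rule (confident calls with SQ > 60, counted whenever they overlap the exon by any amount) can be illustrated with a short interval check. This is a sketch under assumptions: the tuple layout and the use of closed, 1-based coordinates are choices made here for illustration.

```python
def count_overlapping_cnvs(exon, cnvs, min_sq=60):
    """Count confident CNVs that overlap an exon by at least one base.

    `exon` is (start, end); each CNV is (start, end, sq). Coordinates are
    treated as closed intervals, which is an assumption of this sketch.
    """
    exon_start, exon_end = exon
    count = 0
    for start, end, sq in cnvs:
        confident = sq > min_sq
        overlaps = start <= exon_end and end >= exon_start
        if confident and overlaps:
            count += 1
    return count


exon = (1000, 1200)
cnvs = [
    (900, 1000, 80),    # touches the first exon base: counted
    (1150, 5000, 61),   # partial overlap: counted
    (1100, 1300, 60),   # SQ not above 60: skipped
    (1201, 1300, 90),   # starts after the exon: skipped
]
print(count_overlapping_cnvs(exon, cnvs))
```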

Website design

The ExAC browser was built primarily on open-source tools. On the server, a lightweight Flask framework serves content built on Python scripts available at http://github.com/konradjk/exac_browser. All variants and metadata are loaded into MongoDB (version 2.4.14). The major components loaded include the variant data (directly from the VCF format), coverage data (generated by a modified version of samtools, as described in (3)), MNV and constraint information, as well as gene models from Gencode and RSID information from dbSNP. The HTML backbone was created based on Bootstrap version 3.1.1 (https://github.com/twbs/bootstrap) and jQuery version 1.11.1 (http://jquery.org). Plotting was performed using d3 version 3 (17). Read visualization is powered by IGVweb version 0.9.3 (https://github.com/igvteam/igv.js/releases/tag/0.9.3). The entire system runs on a Linux virtual machine with eight cores, 32 GB RAM, and 2 TB of disk space using Apache 2.4.12. Page tracking is provided by Google Analytics (http://www.google.com/analytics/).

Optimizations

Bootstrap is a mobile-first web framework, which enables the ExAC browser's optimizations for mobile browsing: specifically, much extraneous information (such as the coverage information or additional variant annotations) is hidden when the browser is used on a smaller screen. Additionally, the pages for large genes are pre-computed, allowing for faster load times for these genes. Finally, user search is optimized using typeahead version 0.10.2, with most search terms, including gene names and all aliases, populating the search bar. The single search bar is used to search for variants (formatted as RSIDs or in a chromosome and position format), genes and transcripts (symbols, aliases or Ensembl identifiers) and regions.
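
Because a single search bar accepts RSIDs, variants, genes, transcripts and regions, the query must be classified before it is routed. The sketch below shows one way this could be done with regular expressions; the patterns, the chromosome-position-ref-alt variant format, and the category names are assumptions for illustration rather than the browser's implementation. The 100 kb region limit is the one stated in the Discussion.

```python
import re

CHROM = r"(?:chr)?(?:[1-9]|1\d|2[0-2]|X|Y|MT?)"
PATTERNS = [
    ("rsid", re.compile(r"^rs\d+$", re.IGNORECASE)),
    ("variant", re.compile(rf"^{CHROM}[-:]\d+[-:][ACGT]+[-:][ACGT]+$", re.IGNORECASE)),
    ("region", re.compile(rf"^{CHROM}[-:](\d+)-(\d+)$", re.IGNORECASE)),
    ("gene_id", re.compile(r"^ENSG\d+$", re.IGNORECASE)),
    ("transcript_id", re.compile(r"^ENST\d+$", re.IGNORECASE)),
]
MAX_REGION_BP = 100_000  # region queries are limited to 100 kb


def classify_query(query):
    """Guess what kind of entity a free-text search query refers to."""
    query = query.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(query)
        if not match:
            continue
        if kind == "region":
            start, end = int(match.group(1)), int(match.group(2))
            if end < start or end - start > MAX_REGION_BP:
                return "invalid_region"
        return kind
    # Anything else is assumed to be a gene symbol or alias.
    return "gene_symbol"


for text in ["rs113993960", "1-55505647-G-T", "22-46615715-46615880", "PCSK9"]:
    print(text, "->", classify_query(text))
```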

17 in total

1. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth.

Authors: Menachem Fromer; Jennifer L Moran; Kimberly Chambert; Eric Banks; Sarah E Bergen; Douglas M Ruderfer; Robert E Handsaker; Steven A McCarroll; Michael C O'Donovan; Michael J Owen; George Kirov; Patrick F Sullivan; Christina M Hultman; Pamela Sklar; Shaun M Purcell
Journal: Am J Hum Genet Date: 2012-10-05 Impact factor: 11.025

2. A method and server for predicting damaging missense mutations.

Authors: Ivan A Adzhubei; Steffen Schmidt; Leonid Peshkin; Vasily E Ramensky; Anna Gerasimova; Peer Bork; Alexey S Kondrashov; Shamil R Sunyaev
Journal: Nat Methods Date: 2010-04 Impact factor: 28.547

3. dbNSFP v2.0: a database of human non-synonymous SNVs and their functional predictions and annotations.

Authors: Xiaoming Liu; Xueqiu Jian; Eric Boerwinkle
Journal: Hum Mutat Date: 2013-07-10 Impact factor: 4.878

4. Exploring the landscape of pathogenic genetic variation in the ExAC population database: insights of relevance to variant classification.

Authors: Wei Song; Sabrina A Gardner; Hayk Hovhannisyan; Amanda Natalizio; Katelyn S Weymouth; Wenjie Chen; Ildiko Thibodeau; Ekaterina Bogdanova; Stanley Letovsky; Alecia Willis; Narasimhan Nagan
Journal: Genet Med Date: 2015-12-17 Impact factor: 8.822

5. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions.

Authors: Xiaoming Liu; Xueqiu Jian; Eric Boerwinkle
Journal: Hum Mutat Date: 2011-08 Impact factor: 4.878

6. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes.

Authors: Douglas M Ruderfer; Tymor Hamamsy; Monkol Lek; Konrad J Karczewski; David Kavanagh; Kaitlin E Samocha; Mark J Daly; Daniel G MacArthur; Menachem Fromer; Shaun M Purcell
Journal: Nat Genet Date: 2016-08-17 Impact factor: 38.330

7. Analysis of protein-coding genetic variation in 60,706 humans.

Authors: Monkol Lek; Konrad J Karczewski; Eric V Minikel; Kaitlin E Samocha; Eric Banks; Timothy Fennell; Anne H O'Donnell-Luria; James S Ware; Andrew J Hill; Beryl B Cummings; Taru Tukiainen; Daniel P Birnbaum; Jack A Kosmicki; Laramie E Duncan; Karol Estrada; Fengmei Zhao; James Zou; Emma Pierce-Hoffman; Joanne Berghout; David N Cooper; Nicole Deflaux; Mark DePristo; Ron Do; Jason Flannick; Menachem Fromer; Laura Gauthier; Jackie Goldstein; Namrata Gupta; Daniel Howrigan; Adam Kiezun; Mitja I Kurki; Ami Levy Moonshine; Pradeep Natarajan; Lorena Orozco; Gina M Peloso; Ryan Poplin; Manuel A Rivas; Valentin Ruano-Rubio; Samuel A Rose; Douglas M Ruderfer; Khalid Shakir; Peter D Stenson; Christine Stevens; Brett P Thomas; Grace Tiao; Maria T Tusie-Luna; Ben Weisburd; Hong-Hee Won; Dongmei Yu; David M Altshuler; Diego Ardissino; Michael Boehnke; John Danesh; Stacey Donnelly; Roberto Elosua; Jose C Florez; Stacey B Gabriel; Gad Getz; Stephen J Glatt; Christina M Hultman; Sekar Kathiresan; Markku Laakso; Steven McCarroll; Mark I McCarthy; Dermot McGovern; Ruth McPherson; Benjamin M Neale; Aarno Palotie; Shaun M Purcell; Danish Saleheen; Jeremiah M Scharf; Pamela Sklar; Patrick F Sullivan; Jaakko Tuomilehto; Ming T Tsuang; Hugh C Watkins; James G Wilson; Mark J Daly; Daniel G MacArthur
Journal: Nature Date: 2016-08-18 Impact factor: 49.962

8. A framework for the interpretation of de novo mutation in human disease.

Authors: Kaitlin E Samocha; Elise B Robinson; Stephan J Sanders; Christine Stevens; Aniko Sabo; Lauren M McGrath; Jack A Kosmicki; Karola Rehnström; Swapan Mallick; Andrew Kirby; Dennis P Wall; Daniel G MacArthur; Stacey B Gabriel; Mark DePristo; Shaun M Purcell; Aarno Palotie; Eric Boerwinkle; Joseph D Buxbaum; Edwin H Cook; Richard A Gibbs; Gerard D Schellenberg; James S Sutcliffe; Bernie Devlin; Kathryn Roeder; Benjamin M Neale; Mark J Daly
Journal: Nat Genet Date: 2014-08-03 Impact factor: 38.330

9. The Ensembl Variant Effect Predictor.

Authors: William McLaren; Laurent Gil; Sarah E Hunt; Harpreet Singh Riat; Graham R S Ritchie; Anja Thormann; Paul Flicek; Fiona Cunningham
Journal: Genome Biol Date: 2016-06-06 Impact factor: 13.583

10. Genetic risk for autism spectrum disorders and neuropsychiatric variation in the general population.

Authors: Elise B Robinson; Beate St Pourcain; Verneri Anttila; Jack A Kosmicki; Brendan Bulik-Sullivan; Jakob Grove; Julian Maller; Kaitlin E Samocha; Stephan J Sanders; Stephan Ripke; Joanna Martin; Mads V Hollegaard; Thomas Werge; David M Hougaard; Benjamin M Neale; David M Evans; David Skuse; Preben Bo Mortensen; Anders D Børglum; Angelica Ronald; George Davey Smith; Mark J Daly
Journal: Nat Genet Date: 2016-03-21 Impact factor: 38.330
