
DIVAS: a centralized genetic variant repository representing 150,000 individuals from multiple disease cohorts.

Wei-Yi Cheng¹, Jörg Hakenberg¹, Shuyu Dan Li¹, Rong Chen¹.

Abstract

MOTIVATION: A plethora of sequenced and genotyped disease cohorts is available to the biomedical research community, spread across many portals and represented in various formats.
RESULTS: We have gathered several large studies, including GERA and GRU, and computed population- and disease-specific genetic variant frequencies. In total, our portal provides fast access to genetic variants observed in 84,928 individuals from 39 disease populations. We also include 66,335 controls from sources such as the 1000 Genomes Project and Scripps Wellderly.
CONCLUSION: Combining multiple studies helps validate disease-associated variants in each underlying data set, detect potential false positives using frequencies of control populations, and identify novel candidate disease-causing alterations in known or suspected genes.
AVAILABILITY AND IMPLEMENTATION: https://rvs.u.hpc.mssm.edu/divas
CONTACT: rong.chen@mssm.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2015 PMID: 26363178 PMCID: PMC4681987 DOI: 10.1093/bioinformatics/btv511

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

DNA sequencing and genotyping data from populations of various demographic backgrounds are becoming available to the biomedical research community at an ever-increasing pace. Individually, targeted studies have provided insights into the genetic underpinnings of diseases or confirmed previously identified causal alterations, for example for congenital heart disease (Turki ), coloboma (Rainger ), schizophrenia (Rees ) and numerous others. The power to discover novel disease-causing or protective alleles can grow further when data from different studies are combined: pooling increases the number of controls and cases for related phenotypes and helps adjust for the mutational spectrum of individuals from diverse ethnic backgrounds. Projects such as the 1000 Genomes Project (1000 Genomes Project Consortium, 2012) or the NHLBI Exome Sequencing Project (www.evs.gs.washington.edu/EVS/) included large cohorts from various ethnicities and have frequently been used as control populations for filtering out potentially benign variants observed in a disease cohort. However, few efforts compile the information from multiple large cohorts into a centralized portal that shows the distribution of variants observed across multiple studies. The first two large portals providing joint access to several sequencing studies went live earlier this year: the Exome Aggregation Consortium (ExAC, 2015) (www.exac.broadinstitute.org) and the European Variation Archive (EVA, 2015) (www.ebi.ac.uk/eva/). ExAC compiled summarized variant information from more than 63 000 exomes by processing data from individual studies through an identical pipeline. EVA provides a query interface for obtaining summarized variant information from several large studies.
Although both portals provide variant frequencies observed in numerous control cohorts of multiple ethnic groups, they do not include disease-specific variant frequencies from pure disease cohorts; an elevated frequency in such a cohort can indicate that a variant is a candidate for causing the disease. We have compiled a centralized genetic variant repository (Disease Variant Store, DIVAS) that includes genotype data from eight large-scale studies, including 1000 Genomes, ExAC, dbGaP GRU, GERA (Hoffmann ) and UK10K ALSPAC/TWINS (see Supplementary Information), together comprising about 150 000 individuals from seven ethnic groups. Of this population, 84 928 individuals were annotated with 39 disease phenotypes, and 66 335 serve as control cohorts. We have computed disease-specific, ethnicity-specific and control variant frequencies, as well as genotype frequencies, for all observed variants, and visualized this summarized information through a public web interface. The broad spectrum of disease phenotypes and ethnic groups in DIVAS makes it a simple and comprehensive tool to validate known pathogenic variants or to facilitate the discovery of novel disease-causing variants.
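The per-cohort allele and genotype frequencies mentioned above can be illustrated with a minimal Python sketch. It is not the DIVAS pipeline (which, per the Implementation section, uses Apache Pig); the function names, the cohort labels and the encoding of genotypes as alternate-allele counts (0, 1, 2, or None for a missing call) are assumptions made for illustration, and diploid autosomal sites are assumed.

```python
from collections import Counter

def variant_frequencies(genotypes):
    """Allele and genotype frequencies for one variant in one cohort.

    genotypes: alt-allele counts per individual (0, 1 or 2); None = missing call.
    Assumes diploid sites, so each called individual contributes two alleles.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return None  # variant not profiled in this cohort
    counts = Counter(called)
    return {
        "n": n,
        "allele_freq": sum(called) / (2 * n),
        "genotype_freq": {g: counts.get(g, 0) / n for g in (0, 1, 2)},
    }

def frequencies_by_cohort(records):
    """records: iterable of (cohort_label, genotype) pairs for a single variant."""
    by_cohort = {}
    for cohort, genotype in records:
        by_cohort.setdefault(cohort, []).append(genotype)
    return {c: variant_frequencies(gs) for c, gs in by_cohort.items()}

# Hypothetical toy input, not DIVAS data.
example = frequencies_by_cohort([
    ("disease_A", 0), ("disease_A", 1), ("disease_A", 2), ("disease_A", None),
    ("control", 0), ("control", 0),
])
```

Missing calls are excluded from the denominator here; a real pipeline might instead impute them or report the call rate alongside the frequency.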

2 Usage

The DIVAS web interface provides several ways to query for variants, including by gene and by genomic coordinates. Results are presented in a table that includes variant effects from snpEff, variant frequencies observed in selected DIVAS cohorts, predicted functional impact (from tools such as SIFT and MutationAssessor) and known disease associations from ClinVar, OMIM, SwissVar and HGMD (the latter with restricted access). Once the results are shown in tabular format, users can select one of these four annotation categories from the dropdown in the upper left. The frequencies of each variant are rendered dynamically as bar charts. The bar charts were implemented in D3.js and allow the user to filter the frequencies by population, or to sort them by various criteria (conditioned on disease/control and population). One immediate use of DIVAS is to validate known disease-variant associations in public databases such as ClinVar. For instance, the variant GATA4:p.Ala346Val (rs115372595) was reported in Rajagopal to be observed only in a proband with endocardial cushion defect (ECD); it is annotated as pathogenic in ClinVar. In DIVAS, we observe that this variant has a frequency more than 2-fold higher in a congenital heart defect population than in any other disease or control cohort (Fig. 1). This observation is consistent with the original study's suggestion that this variant may contribute to the risk of ECD. We provide other examples in the Supplementary Material.
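The kind of comparison made for GATA4:p.Ala346Val, a frequency in one cohort relative to the highest frequency in any other cohort, can be sketched in Python as below. The function name, cohort labels and frequency values are hypothetical placeholders and are not values from DIVAS.

```python
def fold_enrichment(freqs, target):
    """Fold difference between the target cohort's frequency and the highest
    frequency seen in any other cohort.

    freqs: dict mapping cohort label -> variant frequency.
    Returns (fold, comparator_cohort).
    """
    others = {c: f for c, f in freqs.items() if c != target}
    if not others:
        raise ValueError("need at least one comparison cohort")
    comparator, highest = max(others.items(), key=lambda kv: kv[1])
    if highest == 0:
        # Variant absent elsewhere: enrichment is unbounded (or undefined).
        fold = float("inf") if freqs[target] > 0 else None
    else:
        fold = freqs[target] / highest
    return fold, comparator

# Hypothetical frequencies for illustration only.
toy = {"congenital_heart_defect": 0.004, "controls": 0.001, "other_disease": 0.0015}
fold, comparator = fold_enrichment(toy, "congenital_heart_defect")
```

Comparing against the single highest other cohort mirrors the "higher than in any other cohort" phrasing; it ignores sample size, which the DIVAS charts convey separately through bar opacity.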

Fig. 1.

Frequency bar chart showing the variant frequencies across all DIVAS disease and control cohorts for GATA4:p.Ala346Val. Opacity of bars indicates the sample size of that cohort

We also provide RESTful API access to query DIVAS for allele frequencies, diseases, effects and predicted functional impacts by specifying genes, dbSNP IDs or variant keys. See Supplementary Methods and the DIVAS website for details.

3 Discussion

Combining multiple studies helps to validate disease-associated variants in each underlying dataset, detect potential false positives using frequencies of other populations and identify novel candidate disease-causing alterations in known or suspected genes. Note, however, that several datasets represented in DIVAS do not provide detailed phenotypes. Disease annotations across different studies are not necessarily compatible; we plan to address this issue by structuring phenotypes using UMLS or HPO (Köhler ; Thorn ). Moreover, participants in some studies are assumed to be healthy (1000 Genomes), and in others phenotypes are not published at the individual level (ExAC). Individuals in the former group may still develop a disease with a genetic component at a later time. Since DIVAS variants are derived from a variety of sequencing and genotyping platforms, some variants are not profiled in every study, and the missing data require imputation. Different platforms also impose different quality criteria; in DIVAS, we currently use the quality metrics and filter criteria provided by each individual study. For that reason, DIVAS provides allele frequencies both before and after filtering out non-passing sites. A long-term goal of the service is to integrate the DIVAS infrastructure with several other large data access portals (for instance, dbGaP and UK10K). This can help prospective data applicants see whether their genes and variants of interest have a frequency distribution in any relevant study that supports their hypothesis.

4 Implementation

Integrating variants from heterogeneous sources requires left-aligned normalization of all genetic variants, which in our case was performed using an algorithm proposed by Tan . For variants containing multiple alternate alleles, we separated each alternate allele into its own variant and normalized each independently. We have devised a unique, reversible, compressed variant key to represent all possible single nucleotide variations and deletions using a ≤15-byte string representation. This representation can also handle insertions of up to 2958 bases or multi-nucleotide variations using at most 1000 bytes (see Supplementary Methods). By indexing variants and disease-variant association databases in the same way, we are able to quickly retrieve detailed annotations from ClinVar (Landrum ), SwissVar, HGMD (Stenson ), OMIM and dbSNP. We compute gene- and protein-level annotations using snpEff (Cingolani ) and include pre-computed functional predictions from dbNSFP (Liu ). For each dataset, we identified subsets of individuals by annotated disease and ethnicity where possible, and then calculated variant frequencies in each subset. We also calculated frequencies using only variants that passed the quality control employed by the original data publisher. Frequency calculation was implemented using Apache Pig, and results were exported to a MySQL database backing the web query interface. Data visualization is based on D3.js, enabling the web interface to dynamically render charts of variant frequencies based on the filter/sort criteria chosen by the user.

13 in total

1. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.

Authors: Pablo Cingolani; Adrian Platts; Le Lily Wang; Melissa Coon; Tung Nguyen; Luan Wang; Susan J Land; Xiangyi Lu; Douglas M Ruden
Journal: Fly (Austin) Date: 2012 Apr-Jun Impact factor: 2.160

2. The UMLS Knowledge Source Server: an experience in Web 2.0 technologies.

Authors: Karen E Thorn; Anantha K Bangalore; Allen C Browne
Journal: AMIA Annu Symp Proc Date: 2007-10-11

3. Unified representation of genetic variants.

Authors: Adrian Tan; Gonçalo R Abecasis; Hyun Min Kang
Journal: Bioinformatics Date: 2015-02-19 Impact factor: 6.937

4. Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array.

Authors: Thomas J Hoffmann; Mark N Kvale; Stephanie E Hesselson; Yiping Zhan; Christine Aquino; Yang Cao; Simon Cawley; Elaine Chung; Sheryl Connell; Jasmin Eshragh; Marcia Ewing; Jeremy Gollub; Mary Henderson; Earl Hubbell; Carlos Iribarren; Jay Kaufman; Richard Z Lao; Yontao Lu; Dana Ludwig; Gurpreet K Mathauda; William McGuire; Gangwu Mei; Sunita Miles; Matthew M Purdy; Charles Quesenberry; Dilrini Ranatunga; Sarah Rowell; Marianne Sadler; Michael H Shapero; Ling Shen; Tanushree R Shenoy; David Smethurst; Stephen K Van den Eeden; Larry Walter; Eunice Wan; Reid Wearley; Teresa Webster; Christopher C Wen; Li Weng; Rachel A Whitmer; Alan Williams; Simon C Wong; Chia Zau; Andrea Finn; Catherine Schaefer; Pui-Yan Kwok; Neil Risch
Journal: Genomics Date: 2011-04-30 Impact factor: 5.736

5. Spectrum of heart disease associated with murine and human GATA4 mutation.

Authors: Satish K Rajagopal; Qing Ma; Dita Obler; Jie Shen; Ani Manichaikul; Aoy Tomita-Mitchell; Kari Boardman; Christine Briggs; Vidu Garg; Deepak Srivastava; Elizabeth Goldmuntz; Karl W Broman; D Woodrow Benson; Leslie B Smoot; William T Pu
Journal: J Mol Cell Cardiol Date: 2007-06-21 Impact factor: 5.000

6. dbNSFP v2.0: a database of human non-synonymous SNVs and their functional predictions and annotations.

Authors: Xiaoming Liu; Xueqiu Jian; Eric Boerwinkle
Journal: Hum Mutat Date: 2013-07-10 Impact factor: 4.878

7. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine.

Authors: Peter D Stenson; Matthew Mort; Edward V Ball; Katy Shaw; Andrew Phillips; David N Cooper
Journal: Hum Genet Date: 2014-01 Impact factor: 4.132

8. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

9. ClinVar: public archive of relationships among sequence variation and human phenotype.

Authors: Melissa J Landrum; Jennifer M Lee; George R Riley; Wonhee Jang; Wendy S Rubinstein; Deanna M Church; Donna R Maglott
Journal: Nucleic Acids Res Date: 2013-11-14 Impact factor: 16.971

10. CNV analysis in a large schizophrenia sample implicates deletions at 16p12.1 and SLC1A1 and duplications at 1p36.33 and CGNL1.

Authors: Elliott Rees; James T R Walters; Kimberly D Chambert; Colm O'Dushlaine; Jin Szatkiewicz; Alexander L Richards; Lyudmila Georgieva; Gerwyn Mahoney-Davies; Sophie E Legge; Jennifer L Moran; Giulio Genovese; Douglas Levinson; Derek W Morris; Paul Cormican; Kenneth S Kendler; Francis A O'Neill; Brien Riley; Michael Gill; Aiden Corvin; Pamela Sklar; Christina Hultman; Carlos Pato; Michele Pato; Patrick F Sullivan; Pablo V Gejman; Steven A McCarroll; Michael C O'Donovan; Michael J Owen; George Kirov
Journal: Hum Mol Genet Date: 2013-10-26 Impact factor: 6.150
