
DbVar and DGVa: public archives for genomic structural variation.

Ilkka Lappalainen¹, John Lopez, Lisa Skipper, Timothy Hefferon, J Dylan Spalding, John Garner, Chao Chen, Michael Maguire, Matt Corbett, George Zhou, Justin Paschall, Victor Ananiev, Paul Flicek, Deanna M Church.

Abstract

Much has changed in the last two years at DGVa (http://www.ebi.ac.uk/dgva) and dbVar (http://www.ncbi.nlm.nih.gov/dbvar). We are now processing direct submissions rather than only curating data from the literature and our joint study catalog includes data from over 100 studies in 11 organisms. Studies from human dominate with data from control and case populations, tumor samples as well as three large curated studies derived from multiple sources. During the processing of these data, we have made improvements to our data model, submission process and data representation. Additionally, we have made significant improvements in providing access to these data via web and FTP interfaces.

Year: 2012 PMID: 23193291 PMCID: PMC3531204 DOI: 10.1093/nar/gks1213

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Genomic structural variation (GSV) comprises rearrangement events ranging in size from tens to millions of base pairs. It includes insertions, deletions, inversions, translocations and locus copy number changes, and is seen in a diverse range of taxa (2–4). The discovery and characterization of GSV is challenging for a number of reasons (5). A major difficulty in representing these variants is obtaining breakpoint resolution. Studies based on microarray technology provide information about the sequences involved in variation events, but only a rough estimate of the location of the breakpoints. Current sequencing technology can occasionally provide breakpoint resolution, but often there is a degree of uncertainty about the precise breakpoint location. The variability in the size and type of events that can be detected using a given technology and analysis method underscores the importance of robustly capturing as much experimental information as possible when recording GSV (6). The European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) maintain permanent public repositories, DGVa (http://www.ebi.ac.uk/dgva) and dbVar (http://www.ncbi.nlm.nih.gov/dbvar), respectively. Both resources provide archival, data accessioning and distribution services for all types of GSV in all species. Together, these archives represent the most comprehensive source of GSV in the world and include data originating from the 1000 Genomes project (estd59 and estd199) (7), the Wellcome Trust Sanger Institute Mouse Genomes (estd118) (8), the COSMIC project (estd192) (9) and numerous clinical genetics studies (e.g. nstd37 and nstd54) (10,11) (Figure 1). Data are submitted to these archives using a standard format that captures the methodology used for calling and validating GSV in individual samples, for aggregating data and for representing breakpoint ambiguity. The archives also use Sequence Ontology terms (12) to describe GSV types and associated phenotypic information. The archives exchange data with one another regularly and release them to the scientific community in standard data formats on a monthly basis.

Figure 1.

Data growth since the DGVa and dbVar services were launched. The graph shows the accumulation of variant calls, stratified by organism. The archives include several large datasets, such as the 1000 Genomes project pilot (estd59) and phase I (estd199), structural variation data from 17 in-bred mouse strains (estd118), the first releases of somatic structural variation from the COSMIC database (estd192), case-control and case-only studies on developmental delay (nstd54) and the International Standard Cytogenetic Array (ISCA) consortium data (nstd37). In addition to human and mouse data, the archives include data from dog, pig, fruit fly, macaque, cow, horse, zebrafish, sorghum and chimp.

SHARED DATA REPRESENTATION

The DGVa and dbVar share a data model that is designed to capture and describe the complexity of GSV discovery, validation and genotyping experiments and provides accession numbers for three types of object: the study, the variant region and the variant call. This model allows the representation of a variant region based on the evidence of variation observed in one or more individual samples (the variant calls). The association between calls and regions is made by an assertion method that describes the basis for defining the GSV region. For example, a region might be defined by the set of variant calls overlapping one another by 90% (Figure 2). Variant call and region types are described using Sequence Ontology terms (Table 1).
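The 90% overlap example above can be made concrete with a small sketch. It assumes that "overlapping one another by 90%" means reciprocal overlap (the shared span covers at least 90% of each call) and uses a simple greedy grouping; the `Call` class, `reciprocal_overlap` and `group_calls` are hypothetical names for illustration, not part of either archive. In the archives, the submitter defines and describes the assertion method.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Call:
    """A variant call placement (hypothetical minimal structure)."""
    call_id: str
    chrom: str
    start: int  # 1-based, inclusive
    stop: int   # 1-based, inclusive


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Return the smaller of the two fractions of each call covered by the other."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.stop, b.stop) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    len_a = a.stop - a.start + 1
    len_b = b.stop - b.start + 1
    return min(shared / len_a, shared / len_b)


def group_calls(calls: List[Call], threshold: float = 0.9) -> List[List[Call]]:
    """Greedily place each call in the first group whose members all overlap it
    reciprocally by at least `threshold`; otherwise start a new group."""
    groups: List[List[Call]] = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.stop)):
        for group in groups:
            if all(reciprocal_overlap(call, member) >= threshold for member in group):
                group.append(call)
                break
        else:
            groups.append([call])
    return groups


# Invented example coordinates: the first two calls form one region, the third stands alone.
example = [
    Call("call_a", "1", 1000, 2000),
    Call("call_b", "1", 1010, 2005),
    Call("call_c", "1", 1500, 3000),
]
regions = group_calls(example)
```

Greedy grouping depends on the order in which calls are visited; a study could equally use clustering or a different threshold, which is why the model records the assertion method alongside the region.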

Figure 2.

Table 1.

Variant call types and variant region types

Variant call type	Associated variant region type
Copy number gain	CNV
Copy number loss	CNV
Deletion	CNV
Duplication	CNV
Insertion	Insertion
Mobile element insertion	Mobile element insertion
Novel sequence insertion	Novel sequence insertion
Tandem duplication	Tandem duplication
Translocation	Translocation
Interchromosomal breakpoint	Interchromosomal breakpoint
Intrachromosomal breakpoint	Intrachromosomal breakpoint
Complex	Complex
Unknown	Unknown

The complex region type can be used for any region where calls of different type (other than CNV) have been called and aggregated into a region by the user. CNV = Copy Number Variation.
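As an illustration, the pairing in Table 1 can be written as a lookup. The sketch below assumes one reading of the footnote: a region takes the single region type shared by its calls, and becomes "Complex" when its calls map to more than one region type. The function name `region_type` is hypothetical, and the archives' actual validation rules may differ.

```python
# Call type -> associated region type, transcribed from Table 1 (keys lower-cased).
CALL_TO_REGION = {
    "copy number gain": "CNV",
    "copy number loss": "CNV",
    "deletion": "CNV",
    "duplication": "CNV",
    "insertion": "Insertion",
    "mobile element insertion": "Mobile element insertion",
    "novel sequence insertion": "Novel sequence insertion",
    "tandem duplication": "Tandem duplication",
    "translocation": "Translocation",
    "interchromosomal breakpoint": "Interchromosomal breakpoint",
    "intrachromosomal breakpoint": "Intrachromosomal breakpoint",
    "complex": "Complex",
    "unknown": "Unknown",
}


def region_type(call_types):
    """Suggest a region type for a set of call types (assumed rule, see lead-in)."""
    mapped = {CALL_TO_REGION[t.strip().lower()] for t in call_types}
    if not mapped:
        raise ValueError("at least one call type is required")
    if len(mapped) == 1:
        return next(iter(mapped))
    # Calls of different kinds aggregated into one region.
    return "Complex"
```

For example, `region_type(["Copy number gain", "Deletion"])` gives `"CNV"`, while `region_type(["Insertion", "Translocation"])` gives `"Complex"`.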

Graphical representation of the archive data model (Figure 2). The three accessioned objects (studies, calls and regions) are prefixed by an ‘n’ if submitted to dbVar and an ‘e’ if submitted to DGVa. Variation in individual sample genomes is aggregated to a variant region, with respect to a reference genome. Genomic positions (indicated by green arrows) do not necessarily overlap completely. Study authors describe the aggregation process in the Assertion method attribute. Discovery and validation methods for each call are stored in the Experiment attribute. This facilitates cross-study analysis of GSV identified using different techniques. Studies point to any external resources that provide access to the raw data used in the experiment or to the publication describing the data.

Variant calls have a number of associated attributes, including details of the sample(s) or sample set(s) in which the variation was observed and the experimental procedure involved in discovery and/or validation. Combinations of variant call, sample and experiment are unique. Thus, a GSV identified by two different methods, for example, might result in the creation of two separate variant call objects. The data model accommodates the breakpoint ambiguity associated with a range of experimental and analysis protocols. Three sets of coordinate identifiers are available: start-stop, inner start-stop and outer start-stop. Traditional start and stop coordinates can be used alone to describe variants for which base pair resolution has been achieved. When used in conjunction with the inner and outer coordinates, the same start and stop coordinates allow users to represent an estimated start and stop along with a confidence interval, thus matching the common output of many techniques based on next-generation sequencing (NGS). Finally, inner and/or outer coordinates may be used alone in cases where no start is estimated, as is often the case with array-based techniques; the inner start and stop define the region known to be contained within the GSV, and the outer start and stop define the region likely to contain the breakpoints. All coordinates must be associated with a genome assembly that has been submitted to an International Nucleotide Sequence Database Collaboration (INSDC) database (13). In cases where novel sequence has been identified and genomic coordinates cannot be determined, the novel sequence should be submitted to an INSDC database, where it will receive an accession; this identifier can then be referenced by the variant call. Phenotype information can be associated with samples or sample sets using any of a number of controlled vocabularies, including the Human Phenotype Ontology (14). Our data model also supports assertions of clinical significance for variant calls, providing explicit links between causative alleles and phenotypes.
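The three coordinate sets can be pictured with a short sketch. The `Placement` class and its field names are hypothetical and only mirror the description above; the ordering check (outer start ≤ start ≤ inner start ≤ inner stop ≤ stop ≤ outer stop) is an assumption about what a consistent record looks like, not a documented archive rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Placement:
    """Coordinates of a call or region on a named assembly (illustrative only)."""
    assembly: str
    chrom: str
    start: Optional[int] = None
    stop: Optional[int] = None
    inner_start: Optional[int] = None
    inner_stop: Optional[int] = None
    outer_start: Optional[int] = None
    outer_stop: Optional[int] = None

    def kind(self) -> str:
        """Classify which of the three usage patterns this placement follows."""
        has_point = self.start is not None and self.stop is not None
        has_range = any(
            v is not None
            for v in (self.inner_start, self.inner_stop,
                      self.outer_start, self.outer_stop)
        )
        if has_point and not has_range:
            return "exact"
        if has_point and has_range:
            return "estimated with confidence interval"
        if has_range:
            return "range only"
        raise ValueError("no coordinates given")

    def is_ordered(self) -> bool:
        """Check the assumed left-to-right ordering of whichever values are present."""
        values = [self.outer_start, self.start, self.inner_start,
                  self.inner_stop, self.stop, self.outer_stop]
        present = [v for v in values if v is not None]
        return all(a <= b for a, b in zip(present, present[1:]))


# Invented example: an array-style call with no estimated start or stop.
array_call = Placement("GRCh37", "1", outer_start=90, inner_start=110,
                       inner_stop=190, outer_stop=210)
```

Here `array_call.kind()` returns `"range only"`: the inner pair bounds the sequence known to lie within the variant, and the outer pair bounds where the breakpoints are likely to fall.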

DATA SUBMISSION AND RELEASE FROM THE ARCHIVES

Both archives use a common set of well-defined tab-delimited files, which can be created using Excel, to facilitate submission. The submission template collates all the information required to represent the submitter-asserted GSV within the study. The DGVa and dbVar do not store raw data from array-based assays or sequencing experiments; however, submitters are encouraged to pre-submit raw data to a dedicated EBI or NCBI database. Accession numbers from these deposits should be included with the DGVa/dbVar submission. More information about the submission template, including up-to-date guidelines and instructions for accessing the dedicated help-desks, is available on the DGVa and dbVar websites. Submitted data are processed by the archive that received the initial submission. Processing protocols are shared by both archives and enforce validation rules that aim to ensure data quality and integrity. Once data pass quality control, the processing archive issues stable identifiers for the study and for all variant calls and regions; these data are then exchanged between the archives. Synchronized and timely public release from both databases is the goal, and public release can be adjusted to fit manuscript publication timelines. The archives support both pre-publication data release, in accordance with the Toronto agreement (15), and data release delayed until publication when requested by the submitters.

Data are made available to the public in Genome Variation Format (GVF) (16) from both archives. A GVF file can be downloaded for each taxonomic name and assembly in a given study; in addition, separate files are available for germline and somatic mutations, and for cases where dbVar has remapped submitted data to a more recent version of the assembly. dbVar also provides data as tab-delimited files and in XML format.
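For readers who want to work with the GVF downloads, a minimal parsing sketch follows. It relies only on the general layout GVF inherits from GFF3 (nine tab-separated columns, with `key=value` attributes separated by semicolons in the last column); the function name and the sample line are invented for illustration and do not reproduce a real archive record.

```python
from urllib.parse import unquote

GVF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")


def parse_gvf_line(line):
    """Parse one GVF feature line into a dict; return None for blank,
    comment and pragma lines (those starting with '#')."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GVF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GVF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multiple values are comma-separated; values may be percent-encoded.
        attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Invented sample line (not a real archive record).
sample = "1\tdbVar\tcopy_number_loss\t1000\t2000\t.\t.\t.\tID=example1;Name=example_call"
parsed = parse_gvf_line(sample)
```

Each attribute is kept as a list, so a single-valued attribute such as `ID` is read as `parsed["attributes"]["ID"][0]`.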

ACCESS TO THE DATA THROUGH dbVar WEBSITE

Users can navigate to particular studies using our Study Browser (http://www.ncbi.nlm.nih.gov/dbvar/studies), or they can perform text-based searches using the standard NCBI Entrez search interface (17). Searching for gene symbols or phenotype terms will provide information on studies and variant regions associated with the search query. Users who search by location, either by providing a cytogenetic coordinate or a chromosome location (in the form chr1: start–stop), will be redirected to the dbVar Genome Browser (see below). Study records provide global information about the study type, variant calls and regions, the samples used, the experimental details and any validation experiments performed as part of the study. Publication information for the study is shown, as are links to external resources such as OMIM®, dbGaP and submitter resources. Every submitted variant region is given a dedicated page providing a detailed view of the region. An overview of the variant region is shown at the top, while detailed information is provided below. The detailed information is segregated into labeled tabs. The ‘Genome View’ tab provides a graphical representation of the region in the context of other genome features such as genes. Breakpoint ambiguity, denoted by endpoint triangles or by translucent color (Figure 3a), and variant call and region type information, distinguished by shape and color (Figure 3b), are available in this view. Summary data about overlapping variant regions are available in this tab, with a link to the genome browser that allows users to browse data from additional studies. Detailed placement information for both the variant calls and regions is shown in the ‘Variant Region Details and Evidence’ tab. Variant calls are also explicitly associated with samples and experimental data in this tab. If there are additional variant calls from a sample, a link is provided so that it is easy to see all calls from a given sample for this study. Additionally, NCBI maps features from submitted assemblies to the current reference assemblies when possible and provides access to all genomic contexts in this tab. Validation information for any calls in this region is available in the ‘Validations’ tab. Detailed information concerning any clinical assertions is in the ‘Clinical Assertions’ tab. While we have a tab reserved for Genotype Information, this is not yet populated. We anticipate adding these data this year, starting with genotype data from the 1000 Genomes project.
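The chromosome location form mentioned above (chr1: start–stop) can be illustrated with a small client-side parser. This is a sketch of how such a query string might be normalized before searching; `parse_location` is a hypothetical helper, and the tolerance for commas, spaces and either a hyphen or an en dash is an assumption, not a description of how the dbVar search itself interprets input.

```python
import re

# Optional 'chr' prefix, chromosome name, then start and stop separated
# by a hyphen or an en dash; digits may contain thousands separators.
_LOCATION = re.compile(
    r"^\s*(?:chr)?(?P<chrom>[0-9A-Za-z]+)\s*:\s*"
    r"(?P<start>[\d,]+)\s*[-\u2013]\s*(?P<stop>[\d,]+)\s*$"
)


def parse_location(text):
    """Return (chromosome, start, stop) from a 'chr1: start-stop' style string."""
    match = _LOCATION.match(text)
    if match is None:
        raise ValueError(f"not a chromosome location: {text!r}")
    start = int(match.group("start").replace(",", ""))
    stop = int(match.group("stop").replace(",", ""))
    if start > stop:
        raise ValueError("start must not exceed stop")
    return match.group("chrom"), start, stop
```

For instance, `parse_location("chr1: 1,000-2,000")` yields `("1", 1000, 2000)`.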

Figure 3.

Rendering of breakpoint ambiguity (A) is shown. Variants with breakpoint resolution are shown with fully saturated color. Breakpoints defined by a range (using inner/outer starts and stops) are shown as fully saturated for the high-confidence intervals (the regions defined by the inner start-stop), while the region of breakpoint ambiguity is shown as transparent. In many cases, an undefined breakpoint is submitted, but no likelihood range is provided; in these cases, triangles are shown pointing towards each other (when only outer coordinates are provided) or pointing out (when inner coordinates are provided). Call and region type (B) are usually designated by color. SV corresponds to variant regions and SSV corresponds to variant calls.

We recently introduced a genome browser to facilitate the graphical view of multiple studies side by side. This viewer also provides access to other genome information such as assembly information, NCBI gene annotation and SNP data, including access to clinically relevant SNPs (in the ‘Clinical Channel’ track) and SNPs that are associated with publications (in the ‘Cited Variants’ track). The top of the page contains information on chromosome location and provides functions for navigating around the genome. A graphical sequence viewer showing annotated features dominates the page. The left-hand column provides a genome overview and navigation widget, a menu for selecting available assemblies, a search function (users can perform term searches or location searches) and information on studies that have data available in the given region. Users can click on the ‘(+)’ or ‘(−)’ to add or remove particular study tracks in the graphical view.
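The accession conventions described in this article (an ‘n’ prefix for dbVar and ‘e’ for DGVa, study identifiers such as estd59 and nstd37, and the SV/SSV labels for regions and calls) suggest a simple way to interpret an identifier. The sketch below assumes that every accession has the form prefix + `std`/`sv`/`ssv` + digits; `parse_accession` is a hypothetical helper, not an archive API.

```python
import re

ARCHIVES = {"n": "dbVar", "e": "DGVa"}
OBJECTS = {"std": "study", "sv": "variant region", "ssv": "variant call"}

# Assumed pattern: archive prefix, object code, numeric part.
_ACCESSION = re.compile(r"(?P<archive>[en])(?P<kind>std|ssv|sv)(?P<number>\d+)")


def parse_accession(accession):
    """Return (archive, object type, number) for an identifier such as 'estd59'."""
    match = _ACCESSION.fullmatch(accession.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized accession: {accession!r}")
    return (
        ARCHIVES[match.group("archive")],
        OBJECTS[match.group("kind")],
        int(match.group("number")),
    )
```

With the study identifiers cited in the text, `parse_accession("estd59")` gives `("DGVa", "study", 59)` and `parse_accession("nstd37")` gives `("dbVar", "study", 37)`.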

INTEGRATION OF DGVa DATA TO OTHER PUBLIC RESOURCES

The DGVa provides human data to the Database of Genomic Variants (DGV), available from the University of Toronto (18). Utilizing the range of supplied variant properties, DGV merges data of differing qualities, derived using different methodologies to form a high-quality curated reference set of ‘normal’ GSV in humans. The DGV also shows human data from DGVa where samples carry a disease phenotype as separate tracks in the DGV genome browser. All DGVa archived data are provided to Ensembl, which has developed new ways to visualize GSV data in the genome browser (19). Ensembl uses the same Sequence Ontology terms for the variant classes as DGVa and breakpoint ambiguity is shown using a similar methodology to that applied by dbVar. The GSV can be viewed not only alongside the reference sequence but also against a wealth of other information that includes SNPs and somatic variation, genes and transcripts, mRNA and protein alignments, ncRNAs and regulatory features. The integration of GSV data into such a rich set of genomic annotation provides an extremely powerful tool for elucidating the biological consequences of GSV. All GSV data are integrated as part of the Variant Effect Predictor to provide the variant consequence types for each transcript (20). Ensembl also provides programmatic access to DGVa accessioned variants allowing data from multiple studies to be compared, integrated and analyzed together in novel ways. DGVa data are also made available through Ensembl BioMart to facilitate data mining and integration across all studies and species for researchers without programmatic access.

FUTURE DIRECTIONS

The wealth of GSV information continues to expand, both in sheer volume and in the nature of the associated attributes that are captured. Increasingly, these data are accompanied by genotype, phenotype or clinical information, which provides a foundation for understanding phenomena such as segregation and variation diversity within populations, and for understanding the biological significance of GSV. The data model used by DGVa and dbVar allows an effective representation of the richness and complexity of GSV information, which will be crucial as a basis for future integration and analyses.

FUNDING

The Intramural Research Program of the National Institutes of Health, National Library of Medicine for the work on dbVar; the Wellcome Trust (grant number WT084107MA) and the European Molecular Biology Laboratory for the DGVa. Funding for open access charge: NCBI.

Conflict of interest statement. None declared.

20 in total

Review 1. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome.

Authors: J Zhang; L Feuk; G E Duggan; R Khaja; S W Scherer
Journal: Cytogenet Genome Res Date: 2006 Impact factor: 1.636

Review 2. The human phenotype ontology.

Authors: P N Robinson; S Mundlos
Journal: Clin Genet Date: 2010-02-11 Impact factor: 4.438

3. Prepublication data sharing.

Authors: Ewan Birney; Thomas J Hudson; Eric D Green; Chris Gunter; Sean Eddy; Jane Rogers; Jennifer R Harris; S Dusko Ehrlich; Rolf Apweiler; Christopher P Austin; Lisa Berglund; Martin Bobrow; Chas Bountra; Anthony J Brookes; Anne Cambon-Thomsen; Nigel P Carter; Rex L Chisholm; Jorge L Contreras; Robert M Cooke; William L Crosby; Ken Dewar; Richard Durbin; Stephanie O M Dyke; Joseph R Ecker; Khaled El Emam; Lars Feuk; Stacey B Gabriel; John Gallacher; William M Gelbart; Antoni Granell; Francisco Guarner; Tim Hubbard; Scott A Jackson; Jennifer L Jennings; Yann Joly; Steven M Jones; Jane Kaye; Karen L Kennedy; Bartha Maria Knoppers; Nikos C Kyrpides; William W Lowrance; Jingchu Luo; John J MacKay; Luis Martín-Rivera; W Richard McCombie; John D McPherson; Linda Miller; Webb Miller; Don Moerman; Vincent Mooser; Cynthia C Morton; James M Ostell; B F Francis Ouellette; Julian Parkhill; Parminder S Raina; Christopher Rawlings; Steven E Scherer; Stephen W Scherer; Paul N Schofield; Christoph W Sensen; Victoria C Stodden; Michael R Sussman; Toshihiro Tanaka; Janet Thornton; Tatsuhiko Tsunoda; David Valle; Eero I Vuorio; Neil M Walker; Susan Wallace; George Weinstock; William B Whitman; Kim C Worley; Cathy Wu; Jiayan Wu; Jun Yu
Journal: Nature Date: 2009-09-10 Impact factor: 49.962

Review 4. Genome structural variation discovery and genotyping.

Authors: Can Alkan; Bradley P Coe; Evan E Eichler
Journal: Nat Rev Genet Date: 2011-03-01 Impact factor: 53.242

5. An evidence-based approach to establish the functional and clinical significance of copy number variants in intellectual and developmental disabilities.

Authors: Erin B Kaminsky; Vineith Kaul; Justin Paschall; Deanna M Church; Brian Bunke; Dawn Kunig; Daniel Moreno-De-Luca; Andres Moreno-De-Luca; Jennifer G Mulle; Stephen T Warren; Gabriele Richard; John G Compton; Amy E Fuller; Troy J Gliem; Shuwen Huang; Morag N Collinson; Sarah J Beal; Todd Ackley; Diane L Pickering; Denae M Golden; Emily Aston; Heidi Whitby; Shashirekha Shetty; Michael R Rossi; M Katharine Rudd; Sarah T South; Arthur R Brothman; Warren G Sanger; Ramaswamy K Iyer; John A Crolla; Erik C Thorland; Swaroop Aradhya; David H Ledbetter; Christa L Martin
Journal: Genet Med Date: 2011-09 Impact factor: 8.822

6. The fine-scale architecture of structural variants in 17 mouse genomes.

Authors: Binnaz Yalcin; Kim Wong; Amarjit Bhomra; Martin Goodson; Thomas M Keane; David J Adams; Jonathan Flint
Journal: Genome Biol Date: 2012 Impact factor: 13.583

7. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer.

Authors: Simon A Forbes; Nidhi Bindal; Sally Bamford; Charlotte Cole; Chai Yin Kok; David Beare; Mingming Jia; Rebecca Shepherd; Kenric Leung; Andrew Menzies; Jon W Teague; Peter J Campbell; Michael R Stratton; P Andrew Futreal
Journal: Nucleic Acids Res Date: 2010-10-15 Impact factor: 16.971

8. A copy number variation morbidity map of developmental delay.

Authors: Gregory M Cooper; Bradley P Coe; Santhosh Girirajan; Jill A Rosenfeld; Tiffany H Vu; Carl Baker; Charles Williams; Heather Stalker; Rizwan Hamid; Vickie Hannig; Hoda Abdel-Hamid; Patricia Bader; Elizabeth McCracken; Dmitriy Niyazov; Kathleen Leppig; Heidi Thiese; Marybeth Hummel; Nora Alexander; Jerome Gorski; Jennifer Kussmann; Vandana Shashi; Krys Johnson; Catherine Rehder; Blake C Ballif; Lisa G Shaffer; Evan E Eichler
Journal: Nat Genet Date: 2011-08-14 Impact factor: 38.330

9. The Sequence Ontology: a tool for the unification of genome annotations.

Authors: Karen Eilbeck; Suzanna E Lewis; Christopher J Mungall; Mark Yandell; Lincoln Stein; Richard Durbin; Michael Ashburner
Journal: Genome Biol Date: 2005-04-29 Impact factor: 13.583

10. Mouse segmental duplication and copy number variation.

Authors: Xinwei She; Ze Cheng; Sebastian Zöllner; Deanna M Church; Evan E Eichler
Journal: Nat Genet Date: 2008-05-22 Impact factor: 38.330
