
SVA: software for annotating and visualizing sequenced human genomes.

Dongliang Ge, Elizabeth K Ruzzo, Kevin V Shianna, Min He, Kimberly Pelak, Erin L Heinzen, Anna C Need, Elizabeth T Cirulli, Jessica M Maia, Samuel P Dickson, Mingfu Zhu, Abanish Singh, Andrew S Allen, David B Goldstein.

Abstract

SUMMARY: Here we present Sequence Variant Analyzer (SVA), a software tool that assigns a predicted biological function to variants identified in next-generation sequencing studies and provides a browser to visualize the variants in their genomic contexts. SVA also interacts flexibly with software implementing variant association tests, allowing users to consider both the bioinformatic annotation of identified variants and the strength of their associations with studied traits. We illustrate the annotation features of SVA using two simple examples of sequenced genomes that harbor Mendelian mutations.
AVAILABILITY AND IMPLEMENTATION: Freely available on the web at http://www.svaproject.org.

Year: 2011 PMID: 21624899 PMCID: PMC3129530 DOI: 10.1093/bioinformatics/btr317

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Recent advances in next-generation sequencing (NGS) technologies have made it possible to search through entire genomes for variants that influence traits of interest (Choi; Lupski; Ng; Roach). However, analyzing NGS data still requires addressing a number of considerable computational challenges. A variety of methods have been developed, first to align the sequence reads (Langmead; Li and Durbin, 2009) and second to identify the genetic variants (Bentley; Chen; Hormozdiari; Li). In this work, we focus on the downstream annotation necessary for the analysis of variants and provide a simple means of visualizing variants in their genomic context using a built-in genome browser (Fig. 1). One fundamental difference from the previous approach of using genome-wide association studies (GWAS) (McCarthy) is that many of the variants identified in NGS will be newly discovered. A second fundamental difference is that, unlike GWAS data, sequence data provide, at least in principle, complete information about the variation present in a genome. For both of these reasons, it is essential to develop a computational environment that can assess the likely functions of the observed variants and whether they are present in existing variant databases. In most contexts, such functional assignments will need to be considered along with statistical association evidence in order to prioritize variants by their likelihood of influencing traits of interest. We also emphasize that in the early days of interpreting sequence data for complex traits, the sequence data alone will often not be sufficient to provide statistical confidence of association. Instead, bioinformatic prioritization helps to point towards variants of interest that can be evaluated in much larger sample sets by direct genotyping. We have designed Sequence Variant Analyzer (SVA) to meet these challenges, and in this report, we use two examples to illustrate some of the annotation features.

Fig. 1.

A screenshot of SVA, highlighting a frameshift indel that is located in exon 14 of the Factor VIII gene and is the cause of hemophilia A.

2 METHODS

SVA was developed in the Java programming language using the NetBeans IDE (Oracle, Redwood Shores, CA, USA). The graphical user interface was developed using Java Swing. A standard edition of SVA is available on DVD-ROM or as a downloadable package (http://www.svaproject.org/). Additionally, we provide a much smaller evaluation edition. SVA uses a customized mechanism for indexing, storing and optimizing its data, so running SVA does not require installing or optimizing a database system. SVA is also multi-core aware, although input/output (I/O) consumes the majority of the computational time. SVA is an elaboration of WGAViewer (Ge), which was designed for the interpretation of GWAS datasets.

SVA uses a number of publicly available genomic and biological databases for the annotation of genomic variants. We used NCBI RefSeq (Pruitt) as the reference genome in SVA. Both human reference genome builds 36 and 37 are currently supported. We used the Ensembl core and variation databases (Flicek) as the main source of annotation features for genes and variations. We also provide options to include user-specified annotation tracks, for example, gene annotation datasets or transcription factor binding sites downloadable from the UCSC Genome Browser. We used the HUGO Gene Nomenclature Committee (HGNC) database as the primary source for naming canonical genes. Gene functions were annotated by integrating information from the Ensembl core, the Gene Ontology (GO) (Ashburner), and the KEGG pathway databases (Kanehisa). We used RefSNP, HapMap (Frazer), the 1000 Genomes Project (Durbin), and the DGV databases (Iafrate) to check the novelty of identified genetic variants. We will release compiled annotation databases when new reference genome builds become available. Within each reference genome build, users may update their gene annotation databases by directly downloading the annotation data files following our online instructions.
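As a rough illustration of how a user-specified annotation track could be matched against variant positions, the Python sketch below assumes the track is supplied as BED-style lines (chromosome, 0-based start, exclusive end, optional name). The source does not state which track format SVA actually reads, and the function names `load_track` and `features_at` are hypothetical, so this is a sketch of the general idea rather than SVA's implementation.

```python
from collections import defaultdict


def load_track(lines):
    """Parse BED-style lines into per-chromosome interval lists (assumed format)."""
    track = defaultdict(list)
    for line in lines:
        # Skip blank lines, comments and BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, *rest = line.split()
        name = rest[0] if rest else "."  # unnamed features get a placeholder
        track[chrom].append((int(start), int(end), name))
    for intervals in track.values():
        intervals.sort()
    return track


def features_at(track, chrom, pos):
    """Return names of track features covering a 1-based variant position."""
    zero_based = pos - 1  # BED starts are 0-based and ends are exclusive
    return [
        name
        for start, end, name in track.get(chrom, [])
        if start <= zero_based < end
    ]


# Made-up demonstration track; coordinates are illustrative only.
demo_track = load_track([
    "track name=demo",
    "chr1\t99\t200\tsiteA",
    "chr1\t150\t300\tsiteB",
])
overlapping = features_at(demo_track, "chr1", 180)
```

A linear scan per lookup keeps the sketch short; for genome-scale tracks an interval index would be the more realistic choice.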

3 RESULTS

The SVA tool utilizes a knowledgebase of 8.9 GB, which is compiled and compressed onto a DVD-ROM or can be downloaded from the SVA website. The main annotation functionality is performed locally and does not require an active internet connection. The tool has two distinct modules for analyzing the genomes included in a given project. The annotation module determines the genomic context and predicts the potential biological function of the identified variants (for example, synonymous or nonsynonymous, essential splice, etc.) in each genome. The visualization module then displays the results in SVA's built-in genome browser. The annotated variants can also be exported to a separate statistical tool, for example ATAV (http://www.duke.edu/~minhe/atav/), to assess their statistical associations with traits. Importantly, SVA is designed to connect its bioinformatic annotations conveniently with statistical association tests, so that users can consider both together and even suggestive association evidence may catch their attention. There are also a number of integrated bioinformatic listing functions in SVA designed to help prioritize genetic variants. In broad overview, therefore, SVA takes as input a set of identified variants for each genome, annotates them for possible functionality, and gives the user a variety of ways to visualize and filter the resulting data in relation to phenotypic information. The basic principles of these core analyses are fairly simple, but their strength is that they can be run seamlessly across multiple genomes to prioritize all variants and genes at once.

To illustrate the simple annotation features of SVA, we analyzed the sequence data for subjects with two different Mendelian diseases.

(1) Hemophilia A is an X-linked recessive disorder characterized by excessive bleeding due to improper clotting. Mutations in the Factor VIII gene are known to cause hemophilia A; therefore, this disease can be used as a positive control. We annotated 10 case and 10 control genomes and found that the gene with the most cases harboring a rare mutation was Factor VIII (F8) (Pelak). An SVA genome browser view of one of the identified indels is shown in Figure 1.

(2) Metachondromatosis (MC) is an autosomal dominant condition affecting bone growth. Our center performed whole-genome sequencing of one MC patient, following a linkage analysis that implicated six candidate regions spanning a total of 42 Mb. We took a three-step strategy: we first applied SVA's genomic region filter; we next applied SVA's novelty filters to identify variants that were absent from dbSNP and from all eight control genomes; and finally, we applied SVA's functional filters to identify protein-truncating variants. These filtering steps quickly identified an 11-bp frameshift deletion in the PTPN11 gene as the cause, which was validated in a separate family (Sobreira).

We note that for these simple cases no association tests are required (beyond rarity or absence in controls), whereas for complex traits SVA would normally be used alongside a program like ATAV to facilitate the consideration of variants that are both annotated as functional and show some degree of statistical association with trait values.
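The three-step strategy used for the metachondromatosis example (region filter, then novelty filter, then functional filter) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SVA's code: the `Variant` record, the effect labels in `TRUNCATING`, the function names and all coordinates below are hypothetical and invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int      # 1-based position
    ref: str
    alt: str
    gene: str
    effect: str   # hypothetical annotation label, e.g. "frameshift"

    @property
    def key(self):
        """Identity used to compare against databases and control genomes."""
        return (self.chrom, self.pos, self.ref, self.alt)


# Assumed labels for protein-truncating effects (not SVA's own vocabulary).
TRUNCATING = {"frameshift", "stop_gained", "essential_splice"}


def in_regions(variant, regions):
    """True if the variant lies in any (chrom, start, end) region, inclusive."""
    return any(
        chrom == variant.chrom and start <= variant.pos <= end
        for chrom, start, end in regions
    )


def three_step_filter(variants, regions, known_keys, control_keys):
    # Step 1: keep variants inside the candidate (e.g. linkage) regions.
    in_region = [v for v in variants if in_regions(v, regions)]
    # Step 2: keep variants absent from known databases and from controls.
    novel = [
        v for v in in_region
        if v.key not in known_keys and v.key not in control_keys
    ]
    # Step 3: keep variants annotated as protein-truncating.
    return [v for v in novel if v.effect in TRUNCATING]


# Made-up toy data; none of these coordinates come from the study.
candidates = three_step_filter(
    variants=[
        Variant("chr1", 150, "ACGT", "A", "GENE_A", "frameshift"),
        Variant("chr1", 160, "C", "T", "GENE_A", "nonsynonymous"),
        Variant("chr1", 170, "G", "A", "GENE_B", "stop_gained"),
        Variant("chr2", 150, "G", "GA", "GENE_C", "frameshift"),
    ],
    regions=[("chr1", 100, 200)],
    known_keys={("chr1", 170, "G", "A")},
    control_keys=set(),
)
```

Ordering the steps from cheapest and most restrictive (region) to most annotation-dependent (function) mirrors the order described in the text; since each step is a pure filter, the final set would be the same in any order.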

4 CONCLUSIONS AND DISCUSSION

The overriding philosophy of SVA is that the interpretation of whole-genome sequence data benefits from simultaneous consideration of multiple lines of evidence, in particular bioinformatic annotation of variant function and statistical genetic association with trait values. We note that this philosophy contrasts with what emerged as best practice for GWAS, in which all variants were implicitly considered equally likely a priori to show association with traits (McCarthy). While there are several genome browser and annotation programs available today that are suitable for different needs (Fiume; Robinson; Wang; SeattleSeq: http://gvs.gs.washington.edu/SeattleSeqAnnotation/), we are unaware of any that offers the integrated features of SVA, including variant annotation and filtering by function and/or calling QC, visualization in a built-in browser, and convenient export of user-selected variants for statistical association testing. These features allow users to interact directly with the full set of data relevant to judging which variants show the strongest combined evidence of influencing the trait of interest. We also note that the SVA framework is fully generalizable, and over time we expect a number of new annotation features to be incorporated that will take advantage of knowledge of the functional regions of the human genome emerging from the ENCODE project (Birney) and related activities. Additionally, in principle, SVA could be modified to perform annotations in other species. Future development plans include versions of SVA for widely used model organisms.

Funding: The Bill & Melinda Gates Foundation (grant 157412); National Institute of Allergy and Infectious Diseases Center for HIV/AIDS Vaccine Immunology (grant AI067854); National Institute of Neurological Disorders and Stroke (grant RC2NS070344); National Institute of Mental Health (grant RC2MH089915).

Conflict of Interest: none declared.

References (25 in total; 10 listed)

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. Detection of large-scale variation in the human genome.

Authors: A John Iafrate; Lars Feuk; Miguel N Rivera; Marc L Listewnik; Patricia K Donahoe; Ying Qi; Stephen W Scherer; Charles Lee
Journal: Nat Genet Date: 2004-08-01 Impact factor: 38.330

3. Savant: genome browser for high-throughput sequencing data.

Authors: Marc Fiume; Vanessa Williams; Andrew Brook; Michael Brudno
Journal: Bioinformatics Date: 2010-06-20 Impact factor: 6.937

4. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy.

Authors: James R Lupski; Jeffrey G Reid; Claudia Gonzaga-Jauregui; David Rio Deiros; David C Y Chen; Lynne Nazareth; Matthew Bainbridge; Huyen Dinh; Chyn Jing; David A Wheeler; Amy L McGuire; Feng Zhang; Pawel Stankiewicz; John J Halperin; Chengyong Yang; Curtis Gehman; Danwei Guo; Rola K Irikat; Warren Tom; Nick J Fantin; Donna M Muzny; Richard A Gibbs
Journal: N Engl J Med Date: 2010-03-10 Impact factor: 91.245

5. Analysis of genetic inheritance in a family quartet by whole-genome sequencing.

Authors: Jared C Roach; Gustavo Glusman; Arian F A Smit; Chad D Huff; Robert Hubley; Paul T Shannon; Lee Rowen; Krishna P Pant; Nathan Goodman; Michael Bamshad; Jay Shendure; Radoje Drmanac; Lynn B Jorde; Leroy Hood; David J Galas
Journal: Science Date: 2010-03-10 Impact factor: 47.728

6. A map of human genome variation from population-scale sequencing.

Authors: Gonçalo R Abecasis; David Altshuler; Adam Auton; Lisa D Brooks; Richard M Durbin; Richard A Gibbs; Matt E Hurles; Gil A McVean
Journal: Nature Date: 2010-10-28 Impact factor: 49.962

7. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.

Authors: Kai Wang; Mingyao Li; Hakon Hakonarson
Journal: Nucleic Acids Res Date: 2010-07-03 Impact factor: 16.971

8. Integrative genomics viewer.

Authors: James T Robinson; Helga Thorvaldsdóttir; Wendy Winckler; Mitchell Guttman; Eric S Lander; Gad Getz; Jill P Mesirov
Journal: Nat Biotechnol Date: 2011-01 Impact factor: 54.908

9. The characterization of twenty sequenced human genomes.

Authors: Kimberly Pelak; Kevin V Shianna; Dongliang Ge; Jessica M Maia; Mingfu Zhu; Jason P Smith; Elizabeth T Cirulli; Jacques Fellay; Samuel P Dickson; Curtis E Gumbs; Erin L Heinzen; Anna C Need; Elizabeth K Ruzzo; Abanish Singh; C Ryan Campbell; Linda K Hong; Katharina A Lornsen; Alexander M McKenzie; Nara L M Sobreira; Julie E Hoover-Fong; Joshua D Milner; Ruth Ottman; Barton F Haynes; James J Goedert; David B Goldstein
Journal: PLoS Genet Date: 2010-09-09 Impact factor: 5.917

10. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971
