
VARIANT: Command Line, Web service and Web interface for fast and accurate functional characterization of variants found by Next-Generation Sequencing.

Ignacio Medina, Alejandro De Maria, Marta Bleda, Francisco Salavert, Roberto Alonso, Cristina Y Gonzalez, Joaquin Dopazo.

Abstract

The massive use of Next-Generation Sequencing (NGS) technologies is uncovering an unexpected amount of variability. The functional characterization of such variability, particularly in the most common form of variation found, the Single Nucleotide Variants (SNVs), has become a priority that needs to be addressed in a systematic way. VARIANT (VARIant ANalysis Tool) reports information on the variants found, including consequence type and annotations taken from different databases and repositories (SNPs and variants from dbSNP and 1000 genomes, and disease-related variants from the Genome-Wide Association Study (GWAS) catalog, Online Mendelian Inheritance in Man (OMIM), Catalog of Somatic Mutations in Cancer (COSMIC) mutations, etc.). VARIANT also produces a rich variety of annotations that include information on the regulatory (transcription factor or miRNA-binding sites, etc.) or structural roles, or on the selective pressures on the sites affected by the variation. This information extends conventional reports beyond the coding regions and expands the knowledge on the contribution of non-coding or synonymous variants to the phenotype studied. In contrast to other tools, VARIANT uses a remote database and operates through efficient RESTful Web Services that optimize search and transaction operations. In this way, local problems of installation, update or disk size limitations are overcome without sacrificing speed (thousands of variants are processed per minute). VARIANT is available at: http://variant.bioinfo.cipf.es.

Year:  2012        PMID: 22693211      PMCID: PMC3394276          DOI: 10.1093/nar/gks572

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Exome or genome sequencing constitutes a promising instrument for finding novel mutations underlying human disorders (1,2). However, massive sequencing experiments reveal an enormous amount of genomic variation (3), mostly of unknown functional consequence. Moreover, it has been noted that most sequence variants found in patients are likely to be neutral and do not cause any severe disorders (2). Within this scenario, the identification of causal mutations still represents a big challenge. Typically, several filters are applied to reduce the number of candidate variants, which include a crucial step of functional assessment. Software analysis pipelines currently used in the analysis of Next-Generation Sequencing (NGS) data are highly modular, heterogeneous and rapidly evolving. Apart from several commercial packages, there are also open source packages available such as SAMTOOLS (4), GATK (5), GAMES (6) and others. Some of them provide information on the consequence type of the variants found. However, the limited nature of this information has fostered the recent development of tools specifically designed for the annotation and functional assessment of Single Nucleotide Variants (SNVs) such as ANNOVAR (7), snpEff (http://snpeff.sourceforge.net/) or Variant Effect Predictor (8). VARIANT can be considered a new-generation tool for the functional assessment of variants because: (i) it extends its information content from coding to non-coding regions, and (ii) it implements technological novelties such as RESTful Web Services that optimize search and transaction operations and allow a remote database to be used in an extremely efficient way. VARIANT has three clients: a Command Line Interface (CLI), a Web Application and a Google Chrome Extension.

SUMMARY OF THE FEATURES OF VARIANT

VARIANT can easily be incorporated into an NGS resequencing pipeline, either as a CLI or invoked as a Web service. In addition, it can be invoked through a web application as a conventional web server. VARIANT takes its input directly in Variant Call Format (VCF) (9), which is the output of the most widely used programs for variant calling. VARIANT can report the functional properties of any variant in all the human, mouse, rat, zebrafish or fruit fly genes (and new model organisms will be added soon). VARIANT not only reports the obvious functional effects in the coding regions but also analyses SNVs in non-coding regions, both within the gene and in its neighborhood, that could affect different regulatory motifs, splicing signals and other structural elements or evolutionarily highly conserved elements. In addition, known phenotypic or disease-related variants from the GWAS catalog, OMIM, COSMIC mutations, etc., are reported.

Biological features

VARIANT reports the conventional consequence type of the variant, which can be: Non-synonymous, Synonymous, Intronic, 5′ UTR/3′ UTR, Upstream/Downstream, Essential Splice Site, Splice site, Stop gained, Stop lost, Intergenic, Non-coding and Nonsense-mediated decay transcripts. As reference consequence types we use the Sequence Ontology (10) from Open Biological and Biomedical Ontologies (OBO) (11), together with Ensembl consequence types and National Center for Biotechnology Information (NCBI) terms. In addition, VARIANT reports information on variants that affect different regulatory or structural sites, such as CCCTC-binding factor (CTCF) transcriptional repressor sites, polymerase and histone sites or open chromatin regions, taken from Ensembl (12). Variants disrupting transcription factor-binding sites (TFBSs) from the Jaspar database (13) or miRNA targets (14) are also reported. Highly conserved regions between the human and mouse genomes, taken from Ensembl (12), are also reported because of their putative functional relevance. Another useful piece of information for deciding on the possible functional effect of an already described variant is the HapMap (15) allele frequency, which is also reported by the program. VARIANT also provides calculations of selective pressure values, related to the functional impact that a change can have at the variant site (16). VARIANT also reports the information available for already described variants such as Single Nucleotide Polymorphisms (SNPs). This information is collected from different databases: dbSNP (17), Ensembl (12) and 1000 genomes (18). Annotated SNPs are taken from Ensembl’s Application Programming Interface (API), which integrates distinct databases such as HGMD-PUBLIC (19), NHGRI GWAS catalog (20), OMIM (http://omim.org/) and Open Access GWAS Database (21). We also include pathologic mutations collected from COSMIC (http://www.sanger.ac.uk/genetics/CGP/cosmic/) and UniProt (22).

The dilemma on where to place the heavy data

In the process of annotating variants, two data sets are heavy: the VCF data itself and the already large and fast-growing databases used for the annotation. The use of Web interfaces or Web services has been almost discarded in favour of local runs, given the difficulty of transferring variation files over the internet. Unfortunately, a local run requires the local installation of the database, which grows at a remarkable rate. Following a philosophy similar to Google's, we chose an innovative solution: heavy data are not moved through the web. The VCF is processed locally by a CLI client that, via RESTful Web Services, sends batches of queries in parallel to the remote Web service (alternatively, the Web service can be used by another local script through the available API). This client has been implemented to send only the information needed to obtain the consequence type of the variants, which minimizes data transfer. An optimized Java server program then queries the database on the server side and returns the response to the client. This process mimics a local process while minimizing data transfer. The resulting query process is efficient and very fast, returning the annotation of >10 000 variants per minute. Figure 1 shows a schema of the client-server architecture. A major side benefit of this design is that no installation or update of databases is needed. The database with the required information is always up to date on the remote server. Our group has maintained updated databases of the functional effects of variants for different SNP-related projects (23–25).
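The client-side strategy described above (strip each VCF record down to the few fields needed, then send batches of queries in parallel) can be illustrated with a minimal Python sketch. This is not the VARIANT client, which is written in Java. The function names, the batch size, the number of workers and the `send_batch` callable (a stand-in for the HTTP request to the remote service) are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List


def vcf_to_queries(lines: Iterable[str]) -> Iterator[str]:
    """Reduce VCF records to 'chrom:pos:alt' strings, dropping all other columns."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt_field = fields[0], fields[1], fields[4]
        for alt in alt_field.split(","):  # one query per alternate allele
            if alt != ".":
                yield f"{chrom}:{pos}:{alt}"


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group an iterable into lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def annotate(
    lines: Iterable[str],
    send_batch: Callable[[List[str]], List[str]],  # hypothetical remote call
    batch_size: int = 200,  # assumed value, not taken from the paper
    workers: int = 4,       # assumed value, not taken from the paper
) -> List[str]:
    """Send batches in parallel and return the responses in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(send_batch, batches(vcf_to_queries(lines), batch_size))
        return [row for rows in results for row in rows]
```

Only the short query strings leave the local machine, while the full VCF (genotypes, quality fields, INFO annotations) stays local, which is the point of the design. `ThreadPoolExecutor.map` returns results in submission order, so the annotations line up with the input variants.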
Figure 1.

Schema of the client-server architecture of VARIANT. Biological information is stored in a remote MySQL cluster, which is accessed through a Java RESTful Web Services API. The Hibernate library is used to connect to the database. Data can be retrieved by clients in both text and JSON formats.

Other technical features

VARIANT is a Java application that can run on any platform. The database is queried in an optimized way by RESTful Web Services, either directly or through a CLI program. VARIANT can also be used as a Google Chrome Extension, which gives information on any variant with a mouse click. A RESTful Web Service API to calculate consequence types has been implemented in Java to make the information contained in the database accessible through simple calls. For example, the following call returns the consequence type of a substitution of the reference base by a T at position 32906982 of chromosome 13: http://ws.bioinfo.cipf.es/cellbase/rest/latest/hsa/genomic/variant/13:32906982:T/consequence_type VARIANT comes with a VCF Genome browser developed in HTML5 with Scalable Vector Graphics (SVG) that represents the variants found in their genomic context and offers the possibility of visualizing information on the genes, their properties and any other genomic feature around them. Figure 2 shows several views that can be displayed by the Genome Browser. In the upper part, a pie chart summarizes the type of variants found and two bar charts represent the distribution of variants across chromosomes and in terms of quality. The central and lower parts show two filters used to represent the variants in the genome browser. Using the different filters provided by the tool, different representations of the variants in their genomic context, along with the associated genomic and functional information, can be obtained.
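The example call quoted above follows a simple pattern, with the variant written as chromosome:position:allele. The following minimal Python sketch only assembles such a URL. The function name and parameters are hypothetical, and treating the `hsa` path segment as an interchangeable species code is an assumption based on the example URL alone. The sketch does not contact the server, whose availability is not guaranteed.

```python
from urllib.parse import quote

# Base address taken from the example URL in the text.
BASE_URL = "http://ws.bioinfo.cipf.es/cellbase/rest/latest"


def consequence_type_url(chromosome: str, position: int, alt: str,
                         species: str = "hsa") -> str:
    """Build the consequence-type URL for one substitution.

    The 'chromosome:position:allele' layout mirrors the example in the text;
    the `species` parameter is an assumed generalization of the 'hsa' segment.
    """
    variant = f"{chromosome}:{position}:{alt}"
    # Keep ':' unescaped so the path matches the documented form.
    return (f"{BASE_URL}/{species}/genomic/variant/"
            f"{quote(variant, safe=':')}/consequence_type")


if __name__ == "__main__":
    print(consequence_type_url("13", 32906982, "T"))
```

Called with chromosome 13, position 32906982 and allele T, the function reproduces the URL given in the text.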
Figure 2.

Different representative views that can be displayed by the Genome Browser. In the upper part, a pie chart summarizes the type of variants found and two bar charts represent the distribution of variants across chromosomes and in terms of quality. The central part displays the gene filter and the lower part shows the variant-type filter. Using the different filters, different representations of the variants in their genomic context can be obtained.

Another interesting feature, first implemented in our program, is that registered users can consult their launched jobs at any time and place, given that sessions and data are stored in our servers.

Other tools

Prediction of the putative functional effect of a mutation is a classic problem already addressed in the context of association studies using SNPs (26), and several popular tools have been used for this purpose, such as PolyPhen (27), SIFT (28) or PupaSuite (24). However, the use of exome sequencing has changed the nature of the problem in two respects: (i) in contrast to the case of SNPs, the variants found in NGS studies are, in many cases, unknown and (ii) the number of variants to be annotated is much higher than in a conventional SNP association study. In the last two years, a few tools have been specifically designed to cope with this new challenge in the functional annotation of variants. The first and probably the most popular tools are ANNOVAR (7) and snpEff (http://snpeff.sourceforge.net/). Both are local tools that require the installation of a database, are reasonably fast and provide a succinct, but useful, annotation of the variants (essentially consequence type and some additional information). Recently, other applications more oriented to human genomes have been published, such as SVA (29), which also offers a convenient Java-based interface, or TREAT (30). Other applications covering more genomes have also recently become available (31). According to the information given in the respective publications, runtimes are quite similar (approximately one exome in half an hour, assuming 30 000–40 000 variants per exome), except for snpEff, which claims to make over a million predictions per minute (according to http://snpeff.sourceforge.net/). VARIANT lies between these runtimes, at approximately one exome in less than five minutes (or several hundred predictions per second), taking into account that the program also searches regulatory and variation information. The growing size of the databases will be a limiting factor for future local usage. For example, the TREAT (30) database requires 175 GB, and this size will increase as more biological information is added and more genomes are sequenced. The extensive use of scripting languages such as Perl or Python in programs like ANNOVAR, Variant Effect Predictor, NGS-SNP (31) and TREAT can make programming easier, but at the cost of large increases in runtime and of difficulties in scaling future releases.
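The runtimes quoted above can be put on a common scale (variants per second) with a few lines of arithmetic. The sketch below assumes an exome of 35 000 variants, the midpoint of the range given in the text. Every other figure is taken from the text as quoted, and the results are rough averages, not benchmarks.

```python
def variants_per_second(n_variants: int, minutes: float) -> float:
    """Average throughput implied by annotating n_variants in `minutes`."""
    return n_variants / (minutes * 60.0)


EXOME_VARIANTS = 35_000  # assumed midpoint of the quoted 30 000-40 000 range

# Most local tools: about one exome in half an hour (about 19 per second).
local_tools = variants_per_second(EXOME_VARIANTS, 30)

# VARIANT: one exome in less than five minutes, so at least about 117 per second.
variant_lower_bound = variants_per_second(EXOME_VARIANTS, 5)

# VARIANT, from the quoted 10 000 variants per minute (about 167 per second).
variant_quoted = variants_per_second(10_000, 1)

# snpEff's claim of one million predictions per minute (about 16 667 per second).
snpeff_claim = variants_per_second(1_000_000, 1)

if __name__ == "__main__":
    for name, rate in [
        ("local tools", local_tools),
        ("VARIANT (lower bound)", variant_lower_bound),
        ("VARIANT (quoted rate)", variant_quoted),
        ("snpEff (claimed)", snpeff_claim),
    ]:
        print(f"{name}: {rate:.1f} variants/s")
```

Under these assumptions VARIANT is roughly six to nine times faster than the half-hour tools, while snpEff's claimed rate is about two orders of magnitude higher still.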

DISCUSSION

Current SNV annotation tools have different limitations. Most of them only report information on SNPs already present in dbSNP or 1000 genomes, or a few functional features, such as the consequence type or information on diseases or phenotypes. Generally speaking, any variant labelled as non-coding or synonymous is filtered out. VARIANT extends the scope of the information beyond the coding regions by including all the available information on regulation, DNA structure, conservation, evolutionary pressures, etc. Regulatory variants constitute a recognized, but still unexplored, cause of pathologies (32). The determination of variants with potential regulatory effect can explain many phenotypes or susceptibilities. As an example of the importance of regulatory variants, the potential role of CDKN2B in the development of sporadic medullary thyroid carcinoma was confirmed by our group using a functional assay, which showed that a variant (the SNP rs7044859) in the promoter region of the gene altered the binding of the transcription factor HNF1 (33). Another innovative aspect of VARIANT is the client-server architecture, which physically separates the remote database from the local execution. This original solution minimizes data traffic over the internet and frees the user from disk space constraints and from cumbersome database updating processes. This client-server process almost mimics a local process while minimizing data transfer. The resulting query is extremely fast and returns the annotation of ∼10 000 variants per minute. All these features make VARIANT a comprehensive, fast and innovative tool for the annotation of variants found in exome or genome sequencing experiments.

FUNDING

Funding for open access charge: The Spanish Ministry of Science and Innovation (MICINN) [BIO2011-27069]; the Conselleria de Educacio of the Valencian Community [PROMETEO/2010/001]; National Institute of Bioinformatics (www.inab.org) and the CIBER de Enfermedades Raras (CIBERER), both initiatives of the ISCIII, MICINN; Red Tematica de Investigacion Cooperativa en Cancer (RTICC), ISCIII, MICINN [RD06/0020/1019]; ‘Programa Nacional de Proyectos de investigación Aplicada’ [I+D+i 2008]; ‘Subprograma de actuaciones Científicas y Tecnológicas en Parques Científicos y Tecnológicos’ [ACTEPARQ 2009]; FEDER. Conflict of interest statement. None declared.
  33 in total

1.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors:  Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal:  Genome Res       Date:  2010-07-19       Impact factor: 9.043

2.  Potential etiologic and functional implications of genome-wide association loci for human diseases and traits.

Authors:  Lucia A Hindorff; Praveen Sethupathy; Heather A Junkins; Erin M Ramos; Jayashri P Mehta; Francis S Collins; Teri A Manolio
Journal:  Proc Natl Acad Sci U S A       Date:  2009-05-27       Impact factor: 11.205

3.  Next generation tools for the annotation of human SNPs.

Authors:  Rachel Karchin
Journal:  Brief Bioinform       Date:  2009-01       Impact factor: 11.622

4.  Integrating common and rare genetic variation in diverse human populations.

Authors:  David M Altshuler; Richard A Gibbs; Leena Peltonen; David M Altshuler; Richard A Gibbs; Leena Peltonen; Emmanouil Dermitzakis; Stephen F Schaffner; Fuli Yu; Leena Peltonen; Emmanouil Dermitzakis; Penelope E Bonnen; David M Altshuler; Richard A Gibbs; Paul I W de Bakker; Panos Deloukas; Stacey B Gabriel; Rhian Gwilliam; Sarah Hunt; Michael Inouye; Xiaoming Jia; Aarno Palotie; Melissa Parkin; Pamela Whittaker; Fuli Yu; Kyle Chang; Alicia Hawes; Lora R Lewis; Yanru Ren; David Wheeler; Richard A Gibbs; Donna Marie Muzny; Chris Barnes; Katayoon Darvishi; Matthew Hurles; Joshua M Korn; Kati Kristiansson; Charles Lee; Steven A McCarrol; James Nemesh; Emmanouil Dermitzakis; Alon Keinan; Stephen B Montgomery; Samuela Pollack; Alkes L Price; Nicole Soranzo; Penelope E Bonnen; Richard A Gibbs; Claudia Gonzaga-Jauregui; Alon Keinan; Alkes L Price; Fuli Yu; Verneri Anttila; Wendy Brodeur; Mark J Daly; Stephen Leslie; Gil McVean; Loukas Moutsianas; Huy Nguyen; Stephen F Schaffner; Qingrun Zhang; Mohammed J R Ghori; Ralph McGinnis; William McLaren; Samuela Pollack; Alkes L Price; Stephen F Schaffner; Fumihiko Takeuchi; Sharon R Grossman; Ilya Shlyakhter; Elizabeth B Hostetter; Pardis C Sabeti; Clement A Adebamowo; Morris W Foster; Deborah R Gordon; Julio Licinio; Maria Cristina Manca; Patricia A Marshall; Ichiro Matsuda; Duncan Ngare; Vivian Ota Wang; Deepa Reddy; Charles N Rotimi; Charmaine D Royal; Richard R Sharp; Changqing Zeng; Lisa D Brooks; Jean E McEwen
Journal:  Nature       Date:  2010-09-02       Impact factor: 49.962

5.  A map of human genome variation from population-scale sequencing.

Authors:  Gonçalo R Abecasis; David Altshuler; Adam Auton; Lisa D Brooks; Richard M Durbin; Richard A Gibbs; Matt E Hurles; Gil A McVean
Journal:  Nature       Date:  2010-10-28       Impact factor: 49.962

6.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

7.  ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.

Authors:  Kai Wang; Mingyao Li; Hakon Hakonarson
Journal:  Nucleic Acids Res       Date:  2010-07-03       Impact factor: 16.971

8.  Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor.

Authors:  William McLaren; Bethan Pritchard; Daniel Rios; Yuan Chen; Paul Flicek; Fiona Cunningham
Journal:  Bioinformatics       Date:  2010-06-18       Impact factor: 6.937

9.  In-depth annotation of SNPs arising from resequencing projects using NGS-SNP.

Authors:  Jason R Grant; Adriano S Arantes; Xiaoping Liao; Paul Stothard
Journal:  Bioinformatics       Date:  2011-06-22       Impact factor: 6.937

10.  SVA: software for annotating and visualizing sequenced human genomes.

Authors:  Dongliang Ge; Elizabeth K Ruzzo; Kevin V Shianna; Min He; Kimberly Pelak; Erin L Heinzen; Anna C Need; Elizabeth T Cirulli; Jessica M Maia; Samuel P Dickson; Mingfu Zhu; Abanish Singh; Andrew S Allen; David B Goldstein
Journal:  Bioinformatics       Date:  2011-05-29       Impact factor: 6.937
