VCF/Plotein: visualization and prioritization of genomic variants from human exome sequencing projects.

Raul Ossio¹, O Isaac Garcia-Salinas¹, Diego Said Anaya-Mancilla¹, Jair S Garcia-Sotelo¹, Luis A Aguilar¹, David J Adams², Carla Daniela Robles-Espinoza¹,².

Abstract

MOTIVATION: Identifying disease-causing variants from exome sequencing projects remains a challenging task that often requires bioinformatics expertise. Here we describe a user-friendly graphical application that allows medical professionals and bench biologists to prioritize and visualize genetic variants from human exome sequencing data.
RESULTS: We have implemented VCF/Plotein, a graphical, fully interactive web application able to display exome sequencing data in VCF format. Gene and variant information is extracted from Ensembl. Cross-referencing with external databases and application-based gene and variant filtering have also been implemented. All data processing is done locally by the user's CPU to ensure the security of patient data.
AVAILABILITY AND IMPLEMENTATION: Freely available on the web at https://vcfplotein.liigh.unam.mx. Website implemented in JavaScript using the Vue.js framework, with all major browsers supported. Source code freely available for download at https://github.com/raulossio/VCF-plotein.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 31161195 PMCID: PMC6853650 DOI: 10.1093/bioinformatics/btz458

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Exome sequencing (ES) has been highly successful at identifying genetic variation contributing to a large number of human phenotypes and diseases (Do et al.; Gilissen et al., 2011). However, identifying disease-causing variants and mutations remains a challenging task that often requires at least some bioinformatics knowledge. This is mainly due to the sheer number of variants routinely identified in ES projects, the diversity of biological mechanisms by which variants may act, and the need to integrate large amounts of information from pathogenicity scoring algorithms and from clinical and population databases. In this context, several software tools have been developed to filter, display and contextualize exome sequencing data in order to accelerate the discovery of disease-causing variants. However, these platforms either require a good understanding of the command line (Paila et al.), have an interactive web interface but do not leverage external gene annotations that enrich biological interpretation (Hart et al.; Salatino and Ramraj, 2017), or do not support variant visualization at the protein level (Alemán et al.; Salatino and Ramraj, 2017). Here, we introduce VCF/Plotein, a user-friendly graphical web application that both visualizes and prioritizes variants from exome sequencing studies and requires minimal bioinformatics knowledge. As such, this application can be used equally by bioinformaticians, by biologists whose projects involve exome sequencing, or by medical professionals studying a particular disease or gene.

2 Materials and methods

VCF/Plotein has been implemented entirely as a single-page application hosted on a server with a 2-core Intel Xeon E5-4627 v4 2.60 GHz processor running a VMware 6.5.0 virtual machine over a CentOS 7.5 Linux operating system. The server also has 4 GB of RAM and a solid-state drive with 1 TB of storage space. The application has been written mainly in JavaScript and uses the Vue.js-based Nuxt.js framework to control the storage, flow and presentation of information in the browser. A purpose-made API has been developed to obtain information from locally installed external databases [gnomAD (version: 2.1, size: 59.23 GB) (Lek et al., 2016), dbSNP (build: 151, size: 14.6 GB) (Sherry et al., 2001), COSMIC (version: 86, size: 421.8 MB) (Forbes et al.), ClinVar (version: 86, size: 170.7 MB) (Landrum et al.), phenotype relationships from the Human Phenotype Ontology database (version: February 2019, size: 5.9 MB) (Köhler et al., 2019) and GO term information (version: September 2018, size: 7 MB) (Ashburner et al., 2000) for each annotated gene]. VCF/Plotein works with files in the variant call format (VCF) (Danecek et al.). Upon loading, a VCF file is validated and, after the assembly version has been identified from the appropriate line, genes with variants are quickly found by using an interval tree algorithm to match variant positions against internal coordinate indexes containing each gene's genomic positions. This generates a list of all the genes represented in the VCF, which can be filtered in different ways. Once a gene is selected, information about protein-coding transcripts and functional domains is extracted from Ensembl via the REST API (Zerbino et al.). Consequences of all variants falling within the selected gene, as well as their pathogenicity scores from SIFT (Ng and Henikoff, 2003) and PolyPhen (Adzhubei et al., 2013), are obtained via the Ensembl Variant Effect Predictor (McLaren et al.). Cross-referencing with supported external databases is then performed by querying our internal database using the Elasticsearch search engine (Supplementary Fig. S1).
All collected information is stored as a collection of objects in JSON format, returned to the web browser and drawn on a customizable plot of the primary structure of the canonical transcript made using the D3.js library (Supplementary Fig. S2). All operations, except for the search of naked genomic positions in supported external databases, are performed locally by the user's CPU.
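The gene lookup step described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the application's JavaScript code. As a stated simplification, a sorted array with a running maximum of end coordinates stands in for the interval tree; it answers the same point-in-interval query. The class name `GeneIndex`, the function `genes_with_variants` and the toy gene coordinates are hypothetical and chosen for illustration only.

```python
from bisect import bisect_right
from collections import defaultdict


class GeneIndex:
    """Per-chromosome index of gene intervals (1-based, inclusive coordinates)."""

    def __init__(self, genes):
        # genes: iterable of (chrom, start, end, name) tuples
        by_chrom = defaultdict(list)
        for chrom, start, end, name in genes:
            by_chrom[chrom].append((start, end, name))
        self._index = {}
        for chrom, intervals in by_chrom.items():
            intervals.sort()
            starts = [interval[0] for interval in intervals]
            # max_end[i] holds the largest end among intervals[0..i]
            max_end, running = [], 0
            for _, end, _ in intervals:
                running = max(running, end)
                max_end.append(running)
            self._index[chrom] = (starts, max_end, intervals)

    def genes_at(self, chrom, pos):
        """Return the names of all genes whose interval contains pos."""
        if chrom not in self._index:
            return []
        starts, max_end, intervals = self._index[chrom]
        hits = []
        # Last interval that starts at or before pos
        i = bisect_right(starts, pos) - 1
        # Walk left while an interval at or before i can still reach pos
        while i >= 0 and max_end[i] >= pos:
            _, end, name = intervals[i]
            if end >= pos:
                hits.append(name)
            i -= 1
        return sorted(hits)


def genes_with_variants(index, variants):
    """Collect every gene hit by at least one (chrom, pos) variant."""
    found = set()
    for chrom, pos in variants:
        found.update(index.genes_at(chrom, pos))
    return sorted(found)


# Toy coordinates, not real gene positions
TOY_GENES = [
    ("1", 100, 500, "GENE_A"),
    ("1", 400, 900, "GENE_B"),  # overlaps GENE_A
    ("1", 2000, 2500, "GENE_C"),
    ("2", 100, 300, "GENE_D"),
]
index = GeneIndex(TOY_GENES)
hits = genes_with_variants(index, [("1", 450), ("1", 1500), ("2", 300)])
print(hits)
```

Building the index once per assembly and querying it per variant keeps each lookup cheap, which is the property the text attributes to the interval tree approach.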

3 Results

3.1 Overview

The only requirements to run VCF/Plotein are a computer with an internet connection and a VCF file. Once the user loads the VCF file, the genome assembly is identified, genes with variants are found, and a list of criteria is displayed to aid with gene prioritization (Supplementary Fig. S3). Once a gene is selected, a new page shows the primary protein structure of its canonical transcript, with its domains and other features, along with all its recorded variants. Variants are shown with an indication of their frequency among samples in the VCF file, their transcript consequences, and their presence or absence in the gnomAD, dbSNP, ClinVar and COSMIC databases (Supplementary Fig. S2). The user can click on any variant to access further information about it, such as its genomic coordinates, a prediction of its pathogenicity according to SIFT and PolyPhen, and a list of carrier samples. The left-hand menu allows the user to load a new VCF file, to select a different gene, to select a different transcript, to select which protein domains and features to show, to filter variants, to analyze sample IDs, and finally to bookmark the selected features. Using the top menu, variant information can also be displayed and downloaded in table format, which includes zygosity information for each carrier sample, and the plot can be exported in the SVG vector image format or the PNG raster graphics format.
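The per-variant carrier list, zygosity and frequency among samples can be derived from the genotype (GT) fields of a VCF record. The following is a minimal Python sketch of that idea, not the application's own code. The function names are hypothetical, partially missing genotypes are treated as missing, and the frequency is assumed here to be the number of carriers divided by the number of samples with a called genotype; the source does not state which denominator VCF/Plotein uses.

```python
import re


def zygosity(gt, alt_index=1):
    """Classify one GT string (e.g. '0/1', '1|1', './.') for a given ALT allele."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    copies = sum(1 for allele in alleles if int(allele) == alt_index)
    if copies == 0:
        return "non-carrier"
    if len(alleles) == 1:
        return "hemizygous"  # haploid call, e.g. on chrX in males
    return "homozygous" if copies == len(alleles) else "heterozygous"


def carrier_summary(header_line, record_line, alt_index=1):
    """Return ({sample: zygosity} for carriers, carrier frequency)."""
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")
    gt_slot = fields[8].split(":").index("GT")  # position of GT in FORMAT
    carriers, called = {}, 0
    for sample, column in zip(samples, fields[9:]):
        call = zygosity(column.split(":")[gt_slot], alt_index)
        if call == "missing":
            continue
        called += 1
        if call != "non-carrier":
            carriers[sample] = call
    frequency = len(carriers) / called if called else 0.0
    return carriers, frequency


# Toy record with four samples; sample names and values are made up
HEADER = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "S1", "S2", "S3", "S4"])
RECORD = "\t".join(["3", "12345", ".", "G", "A", "50", "PASS", ".",
                    "GT:DP", "0/1:30", "1/1:25", "0/0:40", "./.:0"])
carriers, frequency = carrier_summary(HEADER, RECORD)
print(carriers, round(frequency, 3))
```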

3.2 Data security

The API and the internal databases have been installed behind a Fortinet firewall, and run over an HTTPS port with an SSL certificate for secure data transfer. No sensitive sample information is uploaded to the server. Sensitive data comprise the name or ID of the samples, sample genotype information, any annotation previously added to the VCF file by the user, or information in the VCF headers. The only information sent to the servers is naked genomic positions (chromosome, position and base change), in order to retrieve any relevant information present in public databases. Therefore, the server does not hold or save any sample information, an important feature given the data security policy that many patient-focused sequencing projects are bound by. All data processing, including construction of the JSON object and graphing of primary protein structures, is done locally on the user’s computer.
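To make the distinction between sensitive data and naked genomic positions concrete, the sketch below reduces VCF text to chromosome, position and base change only. It is an illustration in Python under stated assumptions, not the application's code: the function name `naked_positions` and the JSON payload layout are hypothetical, and the actual request format used by the VCF/Plotein API is not described in the text.

```python
import json


def naked_positions(vcf_text):
    """Keep only chromosome, position and base change from VCF text.

    Header lines, variant IDs, INFO annotations and all sample columns
    are discarded, so nothing sample-specific remains in the result.
    """
    positions = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # one entry per alternate allele
            if alt in (".", "*"):
                continue  # no alternate allele to look up
            positions.append({"chrom": chrom, "pos": pos, "ref": ref, "alt": alt})
    return positions


# Toy VCF; sample names, ID and INFO values are made up
TOY_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "PATIENT_01", "PATIENT_02"]),
    "\t".join(["3", "1000", "rs_toy", "G", "A,T", "50", "PASS", "NOTE=private",
               "GT", "0/1", "1/2"]),
    "\t".join(["7", "2000", ".", "C", ".", "50", "PASS", ".", "GT", "0/0", "0/0"]),
])
positions = naked_positions(TOY_VCF)
payload = json.dumps(positions)  # hypothetical request body
print(payload)
```

Because the payload is built from four fields only, sample IDs, genotypes, header content and user annotations cannot leak into it by construction.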

3.3 Variant filtering and visualization

Variants falling in any selected protein-coding transcript from any gene can be filtered and plotted. Users can filter variants by protein consequence, by clinical prediction, by pathogenicity score or by their allelic frequency in the gnomAD database, or can select a custom subset to display. Users can also select which protein domains and features to plot. The customized protein plot can then be exported as an SVG or PNG file.
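The filters listed above can be thought of as independent predicates that a variant must all satisfy. The Python sketch below shows one way to combine them and is not the application's implementation. The dictionary keys, the function name `passes` and the threshold values are illustrative assumptions. It relies only on the direction of the scores: lower SIFT scores and higher PolyPhen scores indicate a more damaging prediction. Variants absent from gnomAD are treated as rare, and variants without a score fail a score filter; both are simplifications.

```python
def passes(variant, consequences=None, max_gnomad_af=None,
           max_sift=None, min_polyphen=None):
    """Return True if the variant satisfies every filter that is set."""
    if consequences is not None and variant["consequence"] not in consequences:
        return False
    if max_gnomad_af is not None:
        af = variant.get("gnomad_af")
        # A variant not found in gnomAD (af is None) is treated as rare
        if af is not None and af > max_gnomad_af:
            return False
    if max_sift is not None:
        sift = variant.get("sift")
        if sift is None or sift > max_sift:
            return False
    if min_polyphen is not None:
        polyphen = variant.get("polyphen")
        if polyphen is None or polyphen < min_polyphen:
            return False
    return True


# Toy variants; identifiers and values are made up for illustration
VARIANTS = [
    {"id": "v1", "consequence": "missense_variant", "sift": 0.01,
     "polyphen": 0.98, "gnomad_af": None},
    {"id": "v2", "consequence": "missense_variant", "sift": 0.40,
     "polyphen": 0.10, "gnomad_af": 0.0001},
    {"id": "v3", "consequence": "missense_variant", "sift": 0.02,
     "polyphen": 0.95, "gnomad_af": 0.12},
    {"id": "v4", "consequence": "synonymous_variant", "gnomad_af": 0.0005},
]

kept = [
    v["id"] for v in VARIANTS
    if passes(v,
              consequences={"missense_variant", "stop_gained"},
              max_gnomad_af=0.01,   # illustrative cut-off
              max_sift=0.05,        # illustrative cut-off
              min_polyphen=0.85)    # illustrative cut-off
]
print(kept)
```

Leaving a filter argument as `None` disables it, which mirrors the way a user can apply any subset of the available filters.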

3.4 Performance

VCF/Plotein processes VCF files from exome sequencing studies quickly. A key factor in performance is the opening and loading of the VCF file, which requires as much RAM as the size of the file. There is therefore no hard limit on file size at this step: computers with more RAM will perform better at this task and will be able to open bigger files. A similar relationship exists between processor type and processing time: processors with faster clock speeds will read the VCF file information faster. Since the application runs in the browser, the operating system does not play an important role in its performance. Other time-consuming steps are those that require sending and receiving data over the internet, which are affected by connection data transfer speeds and by the number of variants sent to the servers for querying the databases. To illustrate the performance of VCF/Plotein under different system architectures, processor types and memory characteristics, we have tested our application with different file sizes on a number of different machines (Supplementary Table S1). Although VCF/Plotein should run without issues on the majority of web browsers, it has been tested on the Chrome browser on the macOS and Linux operating systems, and on the Edge browser on the Windows 10 operating system.

3.5 Bookmarks

Bookmarks allow users to easily save any selected features from any number of gene transcripts in a text file (in JSON format) which can subsequently be loaded into VCF/Plotein.
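As a rough illustration of how such a bookmark file could be written and read back, the Python sketch below round-trips a list of selections through JSON. The schema (keys `gene`, `transcript`, `variants`, `domains`), the identifiers and the function names are hypothetical; the actual structure of VCF/Plotein bookmark files is not described in the text.

```python
import json
import os
import tempfile


def save_bookmarks(bookmarks, path):
    """Write the bookmark list to a JSON text file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(bookmarks, handle, indent=2, sort_keys=True)


def load_bookmarks(path):
    """Read a bookmark file and check that it holds a list of entries."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("bookmark file must contain a JSON list")
    return data


# Hypothetical bookmark entry with placeholder identifiers
bookmarks = [
    {
        "gene": "GENE_A",
        "transcript": "TRANSCRIPT_1",
        "variants": ["1:450:G>A"],
        "domains": ["DOMAIN_X"],
    }
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "bookmarks.json")
    save_bookmarks(bookmarks, path)
    restored = load_bookmarks(path)

print(restored == bookmarks)
```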

3.6 Comparison with other similar tools

Other available software tools perform some of the functions of VCF/Plotein, but either require at least some bioinformatics expertise, do not leverage information from external databases, do not allow users to visualize their own exome data, or are not freely available (Supplementary Table S2).

4 Use case: finding pathogenic variants in the BAP1 gene

To illustrate how to use VCF/Plotein, we have provided a use case based on a real VCF file from O’Shea et al. (2017), who performed functional studies to identify those variants in the BAP1 gene likely to confer a higher risk of melanoma. We have supplemented this VCF file with simulated mutation data to add information from more genes. In the accompanying Supplementary Text, available in the Online Materials, we go through the typical filtering steps a researcher may follow to prioritize variants within this gene, which yields four variants, three of which were found to be functional in the original publication.

5 Discussion

We anticipate that VCF/Plotein will allow researchers, especially in small labs, to focus on biologically relevant questions instead of having to learn to install software dependencies, learn to use variant-annotation and cross-referencing tools, and become familiar with the UNIX and/or MySQL command line. The main advantages of this tool over similar software are its ease of use, its ability to display information from a custom VCF file, its free availability, and its ability to process files locally. We have illustrated with a use case that, by applying a number of filters, a researcher can identify a small subset of variants within a gene that contains those found to be deleterious to gene function. By combining variant filtering and annotation in a single graphical and interactive tool, we have shown that variant prioritization and visualization become easier, faster and more intuitive.

References

1. dbSNP: the NCBI database of genetic variation.

Authors: S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

3. SIFT: Predicting amino acid changes that affect protein function.

Authors: Pauline C Ng; Steven Henikoff
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

4. Unlocking Mendelian disease using exome sequencing.

Authors: Christian Gilissen; Alexander Hoischen; Han G Brunner; Joris A Veltman
Journal: Genome Biol Date: 2011-09-14 Impact factor: 13.583

5. A population-based analysis of germline BAP1 mutations in melanoma.

Authors: Sally J O'Shea; Carla Daniela Robles-Espinoza; Lauren McLellan; Jeanine Harrigan; Xavier Jacq; James Hewinson; Vivek Iyer; Will Merchant; Faye Elliott; Mark Harland; D Timothy Bishop; Julia A Newton-Bishop; David J Adams
Journal: Hum Mol Genet Date: 2017-02-15 Impact factor: 6.150

6. Analysis of protein-coding genetic variation in 60,706 humans.

Authors: Monkol Lek; Konrad J Karczewski; Eric V Minikel; Kaitlin E Samocha; Eric Banks; Timothy Fennell; Anne H O'Donnell-Luria; James S Ware; Andrew J Hill; Beryl B Cummings; Taru Tukiainen; Daniel P Birnbaum; Jack A Kosmicki; Laramie E Duncan; Karol Estrada; Fengmei Zhao; James Zou; Emma Pierce-Hoffman; Joanne Berghout; David N Cooper; Nicole Deflaux; Mark DePristo; Ron Do; Jason Flannick; Menachem Fromer; Laura Gauthier; Jackie Goldstein; Namrata Gupta; Daniel Howrigan; Adam Kiezun; Mitja I Kurki; Ami Levy Moonshine; Pradeep Natarajan; Lorena Orozco; Gina M Peloso; Ryan Poplin; Manuel A Rivas; Valentin Ruano-Rubio; Samuel A Rose; Douglas M Ruderfer; Khalid Shakir; Peter D Stenson; Christine Stevens; Brett P Thomas; Grace Tiao; Maria T Tusie-Luna; Ben Weisburd; Hong-Hee Won; Dongmei Yu; David M Altshuler; Diego Ardissino; Michael Boehnke; John Danesh; Stacey Donnelly; Roberto Elosua; Jose C Florez; Stacey B Gabriel; Gad Getz; Stephen J Glatt; Christina M Hultman; Sekar Kathiresan; Markku Laakso; Steven McCarroll; Mark I McCarthy; Dermot McGovern; Ruth McPherson; Benjamin M Neale; Aarno Palotie; Shaun M Purcell; Danish Saleheen; Jeremiah M Scharf; Pamela Sklar; Patrick F Sullivan; Jaakko Tuomilehto; Ming T Tsuang; Hugh C Watkins; James G Wilson; Mark J Daly; Daniel G MacArthur
Journal: Nature Date: 2016-08-18 Impact factor: 49.962

7. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources.

Authors: Sebastian Köhler; Leigh Carmody; Nicole Vasilevsky; Julius O B Jacobsen; Daniel Danis; Jean-Philippe Gourdine; Michael Gargano; Nomi L Harris; Nicolas Matentzoglu; Julie A McMurry; David Osumi-Sutherland; Valentina Cipriani; James P Balhoff; Tom Conlin; Hannah Blau; Gareth Baynam; Richard Palmer; Dylan Gratian; Hugh Dawkins; Michael Segal; Anna C Jansen; Ahmed Muaz; Willie H Chang; Jenna Bergerson; Stanley J F Laulederkind; Zafer Yüksel; Sergi Beltran; Alexandra F Freeman; Panagiotis I Sergouniotis; Daniel Durkin; Andrea L Storm; Marc Hanauer; Michael Brudno; Susan M Bello; Murat Sincan; Kayli Rageth; Matthew T Wheeler; Renske Oegema; Halima Lourghi; Maria G Della Rocca; Rachel Thompson; Francisco Castellanos; James Priest; Charlotte Cunningham-Rundles; Ayushi Hegde; Ruth C Lovering; Catherine Hajek; Annie Olry; Luigi Notarangelo; Morgan Similuk; Xingmin A Zhang; David Gómez-Andrés; Hanns Lochmüller; Hélène Dollfus; Sergio Rosenzweig; Shruti Marwaha; Ana Rath; Kathleen Sullivan; Cynthia Smith; Joshua D Milner; Dorothée Leroux; Cornelius F Boerkoel; Amy Klion; Melody C Carter; Tudor Groza; Damian Smedley; Melissa A Haendel; Chris Mungall; Peter N Robinson
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971