VIVID: A Web Application for Variant Interpretation and Visualization in Multi-dimensional Analyses.

Swapnil Tichkule^1,2, Yoochan Myung^3,4, Myo T Naung^1,2, Brendan R E Ansell^1, Andrew J Guy^5, Namrata Srivastava^6, Somya Mehra^7, Simone M Cacciò^8, Ivo Mueller^1, Alyssa E Barry^7,9, Cock van Oosterhout^10, Bernard Pope^11,12,13,14, David B Ascher^3,4, Aaron R Jex^1,15.

Abstract

Large-scale comparative genomic and population genetic studies generate enormous amounts of polymorphism data in the form of DNA variants. Ultimately, the goal of many of these studies is to associate genetic variants with phenotypes or fitness. We introduce VIVID, an interactive, user-friendly web application that integrates a wide range of approaches for encoding genotypic to phenotypic information in any organism or disease, from an individual or population, in three-dimensional (3D) space. It allows mutation mapping and annotation, calculation of interactions and conservation scores, prediction of harmful effects, analysis of diversity and selection, and 3D visualization of genotypic information encoded in Variant Call Format (VCF) on AlphaFold2 protein models. VIVID enables the rapid assessment of genes of interest in the study of adaptive evolution and genetic load, and it helps prioritize targets for experimental validation. We demonstrate the utility of VIVID by exploring the evolutionary genetics of the parasitic protist Plasmodium falciparum, revealing geographic variation in the signature of balancing selection in potential targets of functional antibodies.

Keywords: data visualization; evolution; multi-dimensional analysis; population genetics; protein structure; variant interpretation

Year: 2022; PMID: 36103257; PMCID: PMC9514033; DOI: 10.1093/molbev/msac196

Source DB: PubMed; Journal: Mol Biol Evol; ISSN: 0737-4038; Impact factor: 8.800

Introduction

The modern explosion of genomics, population genetics, and genome-wide association studies (GWAS) is producing enormous amounts of polymorphism data linking nucleotide variation to phenotypic outcomes (Luo et al. 2011; Uffelmann et al. 2021) or disease (Duncavage and Tandon 2015; Wu et al. 2016; Giannopoulou et al. 2019). Countless informatic and statistical methods have been developed to compile and quantify patterns in genotypic data that can be distilled into biologically meaningful results (Luo et al. 2011; Uffelmann et al. 2021). However, visualization is one of the most powerful methods for pattern recognition because it is closely aligned to our brain's processing of complex information (Bülthoff and Edelman 1992). Many tools can display (Variation Viewer; ncbi.nlm.nih.gov/variation/view) and predict the impact of individual single nucleotide polymorphisms (SNPs) on protein sequence and structure for large-scale population comparative studies and GWAS (Glusman et al. 2017). However, few can display the sum of these SNP locations in protein structures in an automated way. Moreover, current 3D mapping and visualization tools are either manually driven (e.g., Chimera; Pettersen et al. 2004) or restricted to specialist applications. cBioPortal (Cerami et al. 2012), COSMIC-3D (cancer.sanger.ac.uk/cosmic3d/), CRAVAT (Douville et al. 2013), MoKCa database (Richardson et al. 2009), and cancer3D (Porta-Pardo et al. 2015) are limited to cancer-related variants. LS-SNP/PDB (Ryan et al. 2009), MuPIT (Niknafs et al. 2013), PopViz (Zhang et al. 2018), and VarMap (Stephenson et al. 2019) are limited to humans. SNP2Structure (Wang et al. 2015) exclusively accesses genetic variation from dbSNP (Sherry et al. 2001). Finally, to predict the mutational effects of noncancer and nonhuman genetic variants on protein structure, users necessarily depend on third-party software and databases. 
In turn, this requires multiple webservers and input file formats, which is challenging and time-consuming, hampering comparison and accessibility. Furthermore, although existing tools allow annotation of individual SNPs, they do not scale to support the rendering of complex population-level variant data from any given organism or disease. We present VIVID, a novel interactive and user-friendly platform that automates mapping of genotypic information and population genetic analysis from Variant Call Format files in 2D and 3D protein-structural space. This platform provides an easy user experience and an integrated analysis environment to yield both individual- and population-level insights, while generalizing to any organism or disease.

New Approach

VIVID is a unique ensemble user interface that enables users to explore and interpret the impact of genotypic variation on the phenotypes of secondary and tertiary protein structures. Using data from a standard VCF, VIVID integrates published algorithms, programs, and databases to map, annotate, analyze, and visualize the effects of mutations on primary protein sequence, 2D protein residue interactions, and a variety of 3D protein-structural renderings at individual and population scales. Figure 1 provides a schematic workflow for VIVID, which consists of three key components: input, data preprocessing, and visualization. Overall, VIVID allows the integration of multiple analyses within one interactive visualization interface, adding new dimensions to variant annotation and functional effect prediction for any organism in any system.

Fig. 1. Schematic workflow of VIVID. The back-end of the VIVID webserver consists of a series of tools for population genetics and mutational analysis. Allele frequencies in the population and the consensus sequence of each sample are extracted from the user-provided VCF file using BCFtools v1.8 and are then used by BioStructMap v0.4.1 to map population genetic indices onto the 3D protein structure. The 3D structures of proteins carrying nonsynonymous mutations are modeled with MODELLER 10.1, and the effects of these mutations on protein stability and interatomic interactions are evaluated with Dynamut2 and Arpeggio, respectively. All analysis outcomes (variant frequency, stabilizing/destabilizing effects, interatomic interactions, conservation score, diversity, selection, mutations, etc.) are visualized on the 3D protein structure using NGL Viewer. The front-end of the VIVID webserver was designed with Materialize v1.0.0, and the back-end is written in Python 2.7 using the Flask framework v1.0.2 on a Linux server running Apache.

Input Source

VIVID requires four main inputs on the submission page of the webserver: (1) the complete nucleotide coding sequence of a gene in FASTA format; (2) a general feature format (GFF) file, used to extract the genomic coordinates of the queried coding sequence; (3) a VCF file containing the SNPs to be mapped; and (4) a protein structure (PDB file) onto which the SNPs are mapped. The structure can either be retrieved on request through VIVID from the AlphaFold2 predicted protein structure database (Jumper et al. 2021) or provided by users from any source, whether predicted in silico (Roy et al. 2010; Kelley et al. 2015) or accessed via the RCSB Protein Data Bank API (https://www.rcsb.org) (Berman et al. 2000). VIVID also allows users to set the codon translation table by selecting the appropriate “genetic code” from the drop-down menu (default: standard code).

Data Preprocessing: Mapping Mutations in 1D–3D

SNPs can be identified and explored within the VIVID interface based on their location in a 1D linear display, 2D residue contact maps, and 3D protein-structural renderings. First, the 1D display is generated by mapping SNPs from the VCF file to the primary protein sequence, based on the input GFF file, and highlighting them in the protein sequence viewer. Next, VIVID uses atomic coordinates from the PDB file to calculate the Euclidean distance between each pair of residues and displays each mutated amino acid in the context of long-range residue contacts. Finally, VIVID renders each mutated amino acid in the 3D protein structure using NGL Viewer (Rose et al. 2018).
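The 1D mapping step can be illustrated with a minimal sketch. It assumes the CDS intervals have already been read from the GFF file as 1-based, inclusive (start, end) pairs, and that the SNP position comes from the POS column of the VCF file. The function names `genomic_to_cds_offset` and `residue_index` are hypothetical and are not part of VIVID.

```python
from typing import List, Optional, Tuple


def genomic_to_cds_offset(pos: int,
                          cds_intervals: List[Tuple[int, int]],
                          strand: str = "+") -> Optional[int]:
    """Return the 0-based offset of a 1-based genomic position within the
    spliced coding sequence, or None if the position is not coding."""
    # Walk the CDS segments in transcription order (reversed on the minus strand).
    ordered = sorted(cds_intervals, reverse=(strand == "-"))
    offset = 0
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within
        offset += end - start + 1  # length of a fully skipped segment
    return None


def residue_index(cds_offset: int) -> Tuple[int, int]:
    """Return (1-based residue number, 0-based position within the codon)."""
    return cds_offset // 3 + 1, cds_offset % 3


# Illustrative two-exon gene; the coordinates are made up.
exons = [(101, 109), (201, 206)]
print(genomic_to_cds_offset(201, exons))   # first base of the second exon
print(residue_index(9))                    # residue and codon position
```

The residue number returned here indexes the translated coding sequence; matching it to residue numbers in a PDB file may need an additional offset when the structure covers only part of the protein.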

Data Preprocessing: Annotation

Each mutated amino acid is classified as nonsynonymous or synonymous using an appropriate genetic code selected by the user. Protein sequences and structural renderings can also link to functional and domain-based annotations using data retrieved from UniProt (The UniProt Consortium 2021) and Pfam (Mistry et al. 2021), if available.
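As an illustration of this classification, the sketch below translates the reference and alternate codons with the standard genetic code and compares the resulting amino acids. It is a simplified stand-in, not VIVID's implementation: it handles one single-base substitution at a time, and `classify_snp` is a hypothetical name. A different genetic code could be supplied through the `code` argument.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G; first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(codon): aa
                 for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(cds, cds_offset, alt_base, code=STANDARD_CODE):
    """Classify a single-base substitution at a 0-based CDS offset.

    Returns (label, change), where change is written like 'E2K'.
    """
    pos_in_codon = cds_offset % 3
    codon_start = cds_offset - pos_in_codon
    ref_codon = cds[codon_start:codon_start + 3].upper()
    alt_codon = (ref_codon[:pos_in_codon] + alt_base.upper()
                 + ref_codon[pos_in_codon + 1:])
    ref_aa, alt_aa = code[ref_codon], code[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return label, f"{ref_aa}{codon_start // 3 + 1}{alt_aa}"


cds = "ATGGAAGGT"  # toy coding sequence: Met-Glu-Gly
print(classify_snp(cds, 3, "A"))  # GAA -> AAA
print(classify_snp(cds, 5, "G"))  # GAA -> GAG
```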

Data Preprocessing: Mutational Analyses

VIVID predicts the likely effects of substituted amino acids on protein structure using previously published algorithms, tools, and APIs. First, MODELLER is used to model nonsynonymous mutations (Fiser and Šali 2003). Dynamut2 (Rodrigues et al. 2020) is then used to predict the effects of missense mutations on protein structure stability and flexibility. Arpeggio (Jubb et al. 2017) is further used to calculate interatomic interactions of mutated residues in 3D space. The amino acid conservation score of each mutated residue is computed using the Position-Specific Scoring Matrix (PSSM) of PSI-BLAST (Altschul et al. 1997). BCFtools (http://samtools.github.io/bcftools/howtos/install.html) is used to calculate the frequency of each alternate allele in the VCF file. It also generates a consensus sequence for each sample in the VCF file, using the user-supplied coding sequence as a reference; these sequences are the basis for mapping population genetic statistics onto protein structures. Finally, a 3D sliding window-based application implemented in BioStructMap (Guy et al. 2018) is used to map nucleotide diversity (π) and Tajima's D onto the protein models. This enables users to identify possible links between the effects of natural selection and the structural or biochemical changes in the protein.
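To make the two population genetic statistics concrete, the sketch below computes nucleotide diversity (π) and Tajima's D from a set of aligned, equal-length consensus sequences using the standard formulas. It is a plain whole-alignment calculation under simplifying assumptions (no missing data, no gaps); it is not the BioStructMap implementation, which applies such statistics within 3D windows of the structure. The function name `diversity_and_tajimas_d` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def diversity_and_tajimas_d(seqs: List[str]) -> Dict[str, Optional[float]]:
    """Return S, per-site pi, and Tajima's D for aligned sequences."""
    n, length = len(seqs), len(seqs[0])
    if n < 2 or any(len(s) != length for s in seqs):
        raise ValueError("need at least two sequences of equal length")

    n_pairs = math.comb(n, 2)
    seg_sites = 0          # S: number of polymorphic alignment columns
    pairwise_diffs = 0     # differences summed over all sequence pairs
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Differing pairs = all pairs minus pairs carrying the same allele.
            pairwise_diffs += n_pairs - sum(math.comb(c, 2) for c in counts.values())

    k = pairwise_diffs / n_pairs  # mean pairwise differences per sequence
    result = {"S": seg_sites, "pi": k / length, "tajimas_d": None}

    # Constants for the variance of (k - S/a1); D is left undefined for n < 4
    # or when no site is polymorphic.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if n >= 4 and seg_sites > 0 and variance > 0:
        result["tajimas_d"] = (k - seg_sites / a1) / math.sqrt(variance)
    return result


toy = ["AAAAA", "AAAAT", "AAATT", "AATTT"]  # made-up alignment
print(diversity_and_tajimas_d(toy))
```

A positive D indicates an excess of intermediate-frequency variants relative to neutral expectation, which is the signal interpreted as balancing selection in the case study.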

Data Preprocessing: Visualization

NGL Viewer is used to visualize protein structures in an interactive mode. It also allows the user to map information onto rendered protein structures by selecting options including nonsynonymous and synonymous mutations, PSSM conservation scores, nucleotide diversity, Tajima's D, and SNP frequency. In addition, residues that the user selects in the primary sequence viewer, such as those associated with functional and structural domains (retrieved from the UniProt and Pfam databases), can also be rendered on the protein structure.

Results

The output of the VIVID workflow (fig. 1) is presented in seven panels on the results page of the webserver (fig. 2).

Fig. 2. Seven panels showing the VIVID output on the results page. These panels show the results of integrative analyses of SNPs (data derived from the MalariaGEN Pf3k version 5 dataset) on the EBA175 RII protein.

External Information

The first panel provides external links to UniProt and Pfam, allowing users to render structural and functional domain information (e.g., conserved structural domains and active sites) on the protein sequence and structure to identify mutational hotspots.

Protein Sequence Viewer

The second panel displays the translated protein sequence of the residues present in the protein structure where synonymous (purple) and nonsynonymous (yellow) mutations are highlighted by default. Users can also click and select protein residues to view them in 3D space in the visualization panel.

Contact Map

The third panel displays an interactive protein contact map where all pairs of residues mapping within a user-defined Euclidean distance threshold (default: 10 Ångstrom) in 3D-space and found more than six (default) amino acids apart in the primary sequence are displayed. This highlights long-range contacts that are important for stabilizing protein structure (Vendruscolo et al. 1997; Toth-Petroczy et al. 2016), with pair-wise contacts involving (and potentially disrupted by) mutated residues highlighted in pink.
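The contact criteria described above can be sketched as follows, using the stated defaults (10 Å distance threshold, more than six residues apart in sequence). The sketch assumes one representative coordinate per residue, for example the Cα atom; the text does not say which atoms VIVID uses, so that choice is an assumption, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float, float]


def long_range_contacts(coords: Dict[int, Coord],
                        max_dist: float = 10.0,
                        min_seq_sep: int = 6) -> List[Tuple[int, int]]:
    """Return residue pairs (i, j), i < j, that lie within max_dist in 3D
    and more than min_seq_sep residues apart in the primary sequence."""
    contacts = []
    for i, j in combinations(sorted(coords), 2):
        if j - i > min_seq_sep and math.dist(coords[i], coords[j]) <= max_dist:
            contacts.append((i, j))
    return contacts


def contacts_with_mutations(contacts: List[Tuple[int, int]],
                            mutated: Iterable[int]) -> List[Tuple[int, int]]:
    """Keep only the contacts that involve at least one mutated residue."""
    mutated = set(mutated)
    return [(i, j) for i, j in contacts if i in mutated or j in mutated]


# Made-up coordinates keyed by residue number.
ca = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 9: (5.0, 0.0, 0.0), 20: (50.0, 0.0, 0.0)}
pairs = long_range_contacts(ca)
print(pairs)
print(contacts_with_mutations(pairs, mutated=[2]))
```

The all-pairs loop is quadratic in the number of residues, which is adequate for a single protein; a spatial index would be the usual choice for much larger inputs.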

3D Visualization

In the fourth panel, users can visualize the protein structure with different representations and coloring options. Users can map and visualize mutations, annotations, and analyses on the 3D model to detect hotspots of genetic variation by selecting various coloring options. Color coding options include hydrophobicity, nonsynonymous/synonymous mutations, PSSM conservation score, changes in folding free energy (ΔΔG), nucleotide diversity, and Tajima's D.

Mutational Analysis

The fifth panel displays results from mutational analyses performed using Dynamut2 (Rodrigues et al. 2020). It provides a bar chart of the predicted change in folding free energy (ΔΔG) for each substituted amino acid, indicating its effect on protein stability and flexibility. The ΔΔG values are also reported in tabular format.

Arpeggio Results

The sixth panel shows interatomic interactions between the substituted residue and nearby residues in protein structures. It reports differences in 20 interatomic interactions between wild-type and mutated residues in tabular format, showing which interactions are lost or gained after a substitution.
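The comparison behind such a table can be sketched as a per-type difference in interaction counts between the wild-type and mutant structures. The input dictionaries below are hypothetical: the interaction names and counts are made up for illustration and do not reproduce Arpeggio's output format.

```python
from typing import Dict


def interaction_changes(wild_type: Dict[str, int],
                        mutant: Dict[str, int]) -> Dict[str, int]:
    """Return mutant-minus-wild-type counts for every interaction type that
    changes; positive values are gained interactions, negative are lost."""
    changes = {}
    for kind in sorted(set(wild_type) | set(mutant)):
        delta = mutant.get(kind, 0) - wild_type.get(kind, 0)
        if delta != 0:
            changes[kind] = delta
    return changes


# Illustrative counts for one substituted residue.
wt = {"hydrogen_bond": 3, "hydrophobic": 5, "ionic": 1}
mt = {"hydrogen_bond": 1, "hydrophobic": 5, "aromatic": 2}
print(interaction_changes(wt, mt))
```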

Download

The seventh panel allows users to download Dynamut2 and Arpeggio results.

Case Study

Here, we demonstrate the utility of VIVID and its features by exploring the evolution of vaccine candidates within the parasitic protist Plasmodium falciparum, one of the world's primary causes of malaria (Prugnolle et al. 2011). Strain-specific immunity hinders vaccine development, and vaccine escape is likely to occur at sites under balancing selection. Drawing on an ensemble of published tools and algorithms, VIVID provides an intuitive and user-friendly interface to analyze functional divergence in more depth. We used SNP data from the published genomes (MalariaGEN Pf3k v5.1) of naturally occurring P. falciparum infections from Guinea (n = 100) and Thailand (n = 148). We filtered these data for SNPs mapping to region II (AA residues: N152–V745) of the erythrocyte binding antigen protein (EBA175 RII) using the P. falciparum 3D7 genome (PlasmoDB release v43). EBA175 mediates binding to the human receptor glycophorin A during merozoite invasion of the red blood cell (Tolia et al. 2005) (fig. 3). Region II (RII) of EBA175 is the functional binding domain, consisting of two cysteine-rich Duffy binding-like domains called F1 and F2 (Sim et al. 1994). These F1 and F2 domains mediate dimerization of two EBA175 proteins through disulphide bond formation (Tolia et al. 2005), allowing P. falciparum merozoites to bind to the erythrocyte surface (fig. 3) during blood-stage invasion (Tham et al. 2012). Antibodies that block dimerization of EBA175 may inhibit glycophorin A binding, which is associated with protection from clinical malaria (Chen et al. 2013; Irani et al. 2015) (see residues highlighted in magenta in fig. 3). Most (>90%) of the SNPs mapping to this region in the Guinean (Africa) and Thailand (Asia) populations were nonsynonymous (fig. 3).
Spatially derived Tajima's D analysis in VIVID identified a large region of the F1 domain (AA residues E226–L294, I312–K324, and W377–I400) of EBA175 RII as being under balancing selection (i.e., positive Tajima's D values) in both populations (fig. 3). Some of the F1 domain mutations, such as E274K and L482V (also under balancing selection), were predicted to stabilize the parasite protein (i.e., positive ΔΔG) (fig. 3). Such polymorphisms can be maintained at intermediate frequencies to help the parasite escape host immune responses while maintaining a functional protein. A second region, within the F2 domain (AA residues C551–C591 and Y731–F743), was also identified as being under balancing selection, but this was specific to the Thailand population (circles in fig. 3). Some of these residues within the F2 domain are targets of functional antibodies (Ambroggio et al. 2013; Chen et al. 2013) (fig. 3). In contrast, the F1 domain has high nucleotide diversity, lower amino acid conservation (low PSSM scores), and is under balancing selection in both populations (fig. 3). These results therefore suggest that an EBA175 RII vaccine formulated from a single reference strain could vary in effectiveness between geographic regions.

Fig. 3. Analyses of EBA175 RII polymorphisms from two different populations, Thailand (Asia) and Guinea (Africa), using VIVID. A part of the F2 domain (circle) shows distinct signatures of balancing selection in the Thailand and Guinea populations. Unless otherwise stated, only the results from the Thailand population are shown. (A) Schematic diagram of the dimerized EBA175 RII ligand binding to the human receptor glycophorin A. (B) The highlighted residues (circle) indicate the targets of known inhibitory antibodies from previous studies (Ambroggio et al. 2013; Chen et al. 2013). (C) Map of nonsynonymous and synonymous mutations on EBA175 RII. (D) Map of ΔΔG values on EBA175 RII. (E) Map of spatial Tajima's D on EBA175 RII. (F) Map of PSSM scores on EBA175 RII. (G) Map of nucleotide diversity on EBA175 RII.

Conclusion

We developed a web-based application, VIVID, to support improved visualization, analysis, and understanding of how mutations impact protein structure and function in any organism, using a small set of standard input file formats. It allows users to examine mutations from individuals or populations in multiple dimensions for solved or in silico predicted protein structures, which can either be imported directly from RCSB or AlphaFold2, or uploaded by the user. Using individual-based data, VIVID can assess SNPs for their impact on protein structure, inferred function, and surface biochemistry (e.g., hydrophobicity). For population studies, in addition to these features, VIVID can use the SNP data to color any protein model based on localized differences in population genetic metrics. This enables users to visualize protein evolution in 3D and identify regions under selection. The architecture of this browser-based tool is designed to benefit the broader scientific community, provide an easy user experience, and offer an integrated analysis environment. Our web portal integrates features from multiple programs, algorithms, and databases, providing users with a single platform to explore and interpret variants in multiple dimensions. In addition, the interactive panels allow users to download multiple result outputs for downstream use in other programs. In this paper, we have demonstrated the utility of VIVID through a case study of the evolution of vaccine candidates against malaria. We expect VIVID to be a valuable resource for researchers working in structural and functional proteomics and genomics.

Future Developments

We are in the process of expanding the usability of this application by integrating additional features from the available wealth of bioinformatics resources. These include widely used variant repositories for model organisms, annotation databases in addition to UniProt, and additional protein functional properties to map onto 3D protein structures. Moreover, we plan to integrate other algorithms and tools developed on our Biosig server (http://biosig.unimelb.edu.au/biosig). Furthermore, we are expanding VIVID's features for population and evolutionary genomics analyses and 3D visualization. Finally, we will make a standalone package available for free download. In addition to supporting the current analyses, this will allow users with sufficient computational resources to scale VIVID to proteome-wide structural studies.

References

1. dbSNP: the NCBI database of genetic variation.

Authors: S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal: Nucleic Acids Res Date: 2001-01-01 Impact factor: 16.971

2. Modeller: generation and refinement of homology-based protein structure models.

Authors: András Fiser; Andrej Sali
Journal: Methods Enzymol Date: 2003 Impact factor: 1.600

3. UCSF Chimera--a visualization system for exploratory research and analysis.

Authors: Eric F Pettersen; Thomas D Goddard; Conrad C Huang; Gregory S Couch; Daniel M Greenblatt; Elaine C Meng; Thomas E Ferrin
Journal: J Comput Chem Date: 2004-10 Impact factor: 3.376

4. LS-SNP/PDB: annotated non-synonymous SNPs mapped to Protein Data Bank structures.

Authors: Michael Ryan; Mark Diekhans; Stephanie Lien; Yun Liu; Rachel Karchin
Journal: Bioinformatics Date: 2009-04-15 Impact factor: 6.937

5. Recovery of protein structure from contact maps.

Authors: M Vendruscolo; E Kussell; E Domany
Journal: Fold Des Date: 1997

6. Erythrocyte and reticulocyte binding-like proteins of Plasmodium falciparum.

Authors: Wai-Hong Tham; Julie Healer; Alan F Cowman
Journal: Trends Parasitol Date: 2011-12-16

7. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data.

Authors: Ethan Cerami; Jianjiong Gao; Ugur Dogrusoz; Benjamin E Gross; Selcuk Onur Sumer; Bülent Arman Aksoy; Anders Jacobsen; Caitlin J Byrne; Michael L Heuer; Erik Larsson; Yevgeniy Antipin; Boris Reva; Arthur P Goldberg; Chris Sander; Nikolaus Schultz
Journal: Cancer Discov Date: 2012-05 Impact factor: 39.397

8. Mapping genetic variations to three-dimensional protein structures to enhance variant interpretation: a proposed framework.

Authors: Gustavo Glusman; Peter W Rose; Andreas Prlić; Jennifer Dougherty; José M Duarte; Andrew S Hoffman; Geoffrey J Barton; Emøke Bendixen; Timothy Bergquist; Christian Bock; Elizabeth Brunk; Marija Buljan; Stephen K Burley; Binghuang Cai; Hannah Carter; JianJiong Gao; Adam Godzik; Michael Heuer; Michael Hicks; Thomas Hrabe; Rachel Karchin; Julia Koehler Leman; Lydie Lane; David L Masica; Sean D Mooney; John Moult; Gilbert S Omenn; Frances Pearl; Vikas Pejaver; Sheila M Reynolds; Ariel Rokem; Torsten Schwede; Sicheng Song; Hagen Tilgner; Yana Valasatava; Yang Zhang; Eric W Deutsch
Journal: Genome Med Date: 2017-12-18 Impact factor: 11.117

9. Structural and functional basis for inhibition of erythrocyte invasion by antibodies that target Plasmodium falciparum EBA-175.

Authors: Edwin Chen; May M Paing; Nichole Salinas; B Kim Lee Sim; Niraj H Tolia
Journal: PLoS Pathog Date: 2013-05-23 Impact factor: 6.823

10. Pfam: The protein families database in 2021.

Authors: Jaina Mistry; Sara Chuguransky; Lowri Williams; Matloob Qureshi; Gustavo A Salazar; Erik L L Sonnhammer; Silvio C E Tosatto; Lisanna Paladin; Shriya Raj; Lorna J Richardson; Robert D Finn; Alex Bateman
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971