Literature DB >> 24906884

VarMod: modelling the functional effects of non-synonymous variants.

Morena Pappalardo1, Mark N Wass2.   

Abstract

Unravelling the genotype-phenotype relationship in humans remains a challenging task in genomics studies. Recent advances in sequencing technologies mean there are now thousands of sequenced human genomes, revealing millions of single nucleotide variants (SNVs). For non-synonymous SNVs present in proteins the difficulties of the problem lie in first identifying those nsSNVs that result in a functional change in the protein among the many non-functional variants and in turn linking this functional change to phenotype. Here we present VarMod (Variant Modeller) a method that utilises both protein sequence and structural features to predict nsSNVs that alter protein function. VarMod develops recent observations that functional nsSNVs are enriched at protein-protein interfaces and protein-ligand binding sites and uses these characteristics to make predictions. In benchmarking on a set of nearly 3000 nsSNVs VarMod performance is comparable to an existing state of the art method. The VarMod web server provides extensive resources to investigate the sequence and structural features associated with the predictions including visualisation of protein models and complexes via an interactive JSmol molecular viewer. VarMod is available for use at http://www.wasslab.org/varmod.
© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

Entities:  

Mesh:

Substances:

Year:  2014        PMID: 24906884      PMCID: PMC4086131          DOI: 10.1093/nar/gku483

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The ability to sequence genomes has resulted in the identification of millions of genetic variants, particularly single nucleotide variants (SNVs), within the human population as highlighted by the 1000 genomes project (1,2). Additionally, other studies have demonstrated that individuals have many rare SNVs (3,4). The data generated by such studies provide a unique resource for investigating the genotype to phenotype relationship. However, this is a complex problem as demonstrated by Genome Wide Association Studies (GWAS), which have identified many variants associated with disease risk but have only explained a limited amount of heritability (5). Additionally, in these studies, it is difficult to identify causal variants from a selection of candidate SNVs in the regions of the genome associated with the particular disease. There is therefore a need to develop methods to identify SNVs, in our case non-synonymous SNVs (nsSNVs), that are likely to affect the function of the protein in which they are present and are more likely to be associated with a change in phenotype. A number of methods have been developed previously (reviewed in (6)), with the Sorting Intolerant From Tolerant algorithm (SIFT, 7) and PolyPhen (8) being among the most well known. SIFT uses residue conservation in multiple sequence alignments to identify function altering nsSNVs, while PolyPhen uses machine learning to combine features from both sequence and structure. Here we have developed VarMod a new method for identifying functional nsSNVs. VarMod develops our recent research in which we demonstrated that disease associated nsSNVs are enriched at protein–protein interfaces (9). Additionally, in GWAS, we have previously used structural modelling of ligand binding sites to identify likely candidates for association with disease (10–12). For example, in a kidney disease genome wide association study (10), we demonstrated that the variant rs13538 results in a phenylalanine to serine change located in the acetyl Co-enzymeA binding site of the protein NAT8 and proposed that the variant may have an effect on the activity of the enzyme (10). VarMod builds upon these observations and uses structural modelling of ligand binding and protein–protein interface sites to generate features that are combined with other features such as residue to conservation to identify functional nsSNVs. The VarMod web server provides an overall prediction made using a machine learning approach (a support vector machine) to combine the data from the different individual analyses. Additionally the server provides users with extensive resources to investigate the results from the separate analyses.

METHODS

The VarMod algorithm

VarMod obtains features from multiple analyses, which are combined using a support vector machine (SVM) (13) to make an overall prediction. The data sources used are described below. Sequence conservation is calculated using Jensen–Shannon divergence (14). Homologues of the query sequence are identified by PSI-BLAST (15) using an approach shown to optimise results (16), where the query sequence is initially searched against UniRef50 to generate a sequence profile that is used to search against the full UniProt sequence database (17). The query sequence and homologues are aligned using MUSCLE (18) and the resulting multiple sequence alignment used to calculate the Jensen–Shannon divergence. To perform the structural analysis, a structural model of the query protein is generated. To do this, template structures in the protein databank (PDB) (19) are identified using hhblits (20) by searching a PDB sequence database representative at 70% sequence identity. Templates are selected with an hhblits probability (probability that the template and query sequence are homologous) score >80% and such that as much of the sequence is covered without redundantly modelling the same region of the protein multiple times. Initial structural models are generated using an approach based on the one used by Phyre2 (21,22). Side chains are added and optimised using pulchra (23). Small molecule binding sites are modelled using 3DLigandSite (with default parameters) (24) with the structural model used as the input. Protein–protein interface sites are modelled using an approach based on Interactome3D (25). The Interactome3D high confidence set of protein–protein interactions with template complexes in the PDB was used to generate models of the complexes. For each sequence–template pair the sequence is modelled using the template by applying the structural modelling approach described above. The features used in the SVM fall into two areas of sequence and structural features (a full list is available in Supplementary Table S1). The sequence features include residue conservation (the Jensen–Shannon convergence) and three features that represent the change of amino acid properties of size/mass, charge and functional group. The size/mass change of the amino acid is represented by the ratio of the mass of the two amino acids. To consider the change in charge between the two amino acids, the 20 amino acids are grouped according to charge (Supplementary Table S2). The feature representing the change in the charge of the amino acid considers changes between these charge groups, with values set in Supplementary Table S3. A further feature represents the change of chemical functional group present in the amino acid side chain. The amino acids are grouped as described by Innis et al. (26) (Supplementary Table S4) and the feature captures changes between these functional groups. The structural features use the ligand binding site, interface site and general structural features of the model. Where ligand-binding sites have been identified the distance of the variant to the binding site is calculated and used as a feature. When a variant is in a binding site, two further features capture results from the 3DLigandSite analysis. Where interface sites have been predicted, a further feature represents the distance of the variant to an interface site. Two features represent the type of secondary structure that the variant is located in. The first uses the secondary structure types classified by DSSP (27,28), while a second feature reduces these to the three main categories of helix, sheet and coil. A final feature represents the solvent accessibility (calculated using DSSP). The features generated are input into each of the five optimised SVM models generated during cross-validation (details below) to predict whether each variant is functional or non-functional. The outputs from each of the SVM models are converted to probabilities as described in Platt (29). An ensemble approach is taken with the probability from each SVM model weighted according to its accuracy in cross validation. The weighted probabilities are summed and normalised to generate a final probability for the VarMod prediction.

Generating a test set

Dataset 5 from VariBench (30) was used to train and test VarMod. This dataset contains human pathogenic and neutral variants, excludes cancer mutations and is clustered so that protein sequences share no >30% sequence identity. This set was initially split with 1401 pathogenic and 1527 neutral variants retained for final testing. The remaining 11 336 pathogenic and 12 737 neutral variants were split into five groups by protein sequence to perform 5-fold cross-validation to ensure that variants from each individual sequence appear in only 1-fold.

SVM training

The SVMs were generated by SVMlight (31) using a linear kernel. For each of the 5-folds, three were used for training, a further fold was used for validation and the SVM tested on the remaining fold. The SVMs were optimised for the trade off between training error and margin and also the cost factor to identify how training errors on positive examples should outweigh those on negative examples.

Comparison with PolyPhen-2

To compare VarMod performance with PolyPhen-2, the final test set of nsSNVs was run on the PolyPhen-2 web server (on 1 March 2014). Predictions were made using the two different classifiers available (HumDiv and HumVar) with default settings. The ROC and Precision–Recall analyses of PolyPhen-2 were performed by varying the ‘pph2_prob’ score.

EVALUATING VARMOD PERFORMANCE

The performance of VarMod was assessed using the set of sequences from VariBench that were not used in cross-validation. The performance of VarMod on the test set of sequences was assessed using the measures of specificity, sensitivity (recall), precision and a Receiver Operator Characteristic (ROC) analysis. The ROC curve and Precision–Recall graph in Figure 1 show the performance of VarMod and the comparison with PolyPhen-2. It shows that VarMod performance is comparable to PolyPhen-2. Interestingly, in the ROC analysis, neither of the PolyPhen-2 classifiers reaches the point 0,0 which is due to a small number of high confidence false positive predictions (i.e. neutral variants predicted to be pathogenic). This may reflect that PolyPhen-2 has been trained using different sets of pathogenic and neutral variants. It has also been previously observed that there is limited overlap between the predictions of different methods (32).
Figure 1.

Benchmarking VarMod. Analysis of the VarMod and PolyPhen-2 predictions on the non-cross validation test set. (A) ROC analysis, (B) precision–recall graph.

Benchmarking VarMod. Analysis of the VarMod and PolyPhen-2 predictions on the non-cross validation test set. (A) ROC analysis, (B) precision–recall graph.

The VarMod WEB SERVER

The VarMod web server is available at http://www.wasslab.org/varmod. Users are required to submit a protein sequence (raw sequence or FASTA formatted) or a UniProt accession, and a list of variant positions (e.g. A45C, where the single letter code is used to define the amino acids). A UniProt accession is required to perform the protein–protein interface analysis (optional). Processing time for each submission varies from 5 min to a few hours. Structural data has been pre-computed for all of the UniProt human principal protein isoforms, so submissions using these sequences are processed in a few minutes. Where other sequences are submitted, the structural models and binding sites need to be modelled thereby increasing the running time to a few hours.

Results output

The display of VarMod results is split into multiple sections (Figures 2 and 3). The first section provides a summary table of the analyses performed and the overall prediction made for each of the submitted nsSNVs. This table is colour coded to highlight the results from the individual analyses/features to indicate if they suggest the variant could affect protein function. For example, the binding site column is coloured red if the variant is in the binding site and the colour changes to blue the more distant the variant is from a known ligand-binding site. The summary table enables the user to see the overall result and to identify analyses that may be of interest for further inspection.
Figure 2.

Display of VarMod results. The results for variants in Phosphorylase b kinase gamma catalytic chain (UniProt accession P15735). The variants shown are known to have a role in Glycogen storage disease 9C. (A) The prediction summary table, showing the overall VarMod prediction and summarising the output from the different analyses. Results are colour coded to indicate the likely relevance of the changes, with features that suggest the variant is likely to be functional coloured red with the colour scale ranging to blue for features that are least likely to lead to functional changes. (B) The VarMod sequence display, residues are coloured to indicate conservation and the presence of ligand binding and interface sites. (C) The VarMod structural view.

Figure 3.

The VarMod interactions view for investigating variants located at protein–protein interfaces.

Display of VarMod results. The results for variants in Phosphorylase b kinase gamma catalytic chain (UniProt accession P15735). The variants shown are known to have a role in Glycogen storage disease 9C. (A) The prediction summary table, showing the overall VarMod prediction and summarising the output from the different analyses. Results are colour coded to indicate the likely relevance of the changes, with features that suggest the variant is likely to be functional coloured red with the colour scale ranging to blue for features that are least likely to lead to functional changes. (B) The VarMod sequence display, residues are coloured to indicate conservation and the presence of ligand binding and interface sites. (C) The VarMod structural view. The VarMod interactions view for investigating variants located at protein–protein interfaces. The sequence and structure sections display the main analyses. The sequence section displays the protein sequence, colour coded to highlight multiple features including residue conservation, ligand binding sites and protein–protein interfaces. The summary results and sequence view can be downloaded as a PDF file. The structural section first displays the details of the structural templates and models of the protein that have been generated (one for each region/domain for which a template was identified). A JSmol (www.jmol.org) molecular viewer forms the main part of the structural section and initially displays the model with the highest confidence (probability from hhblits alignment). The JSmol viewer enables visualisation of the modelled protein and by default is coloured to highlight the functional regions of the protein (ligand-binding and protein–protein interface sites) and the nsSNVs (red). A control panel to the right of the display enables the user to investigate the nsSNVs by displaying a different model, or modifying the display style (cartoon/spacefill or sticks representations) and colour of the whole protein, nsSNVs or functional sites. The user is able to generate high quality images of the displayed model by clicking on the ‘generate image’ button, enabling the analysis to be used for reports or publications. The location of the nsSNVs in relation to the protein–protein interface sites can be explored further via the modelled complexes. The complex models are listed in a table, which also indicates the nsSNVs that are present in the model and if they occur within an interface. The complexes can be viewed in a separate JSmol viewer accessed from a link for each of the entries in the list.

CONCLUDING REMARKS

VarMod was developed to use recent observations that disease associated nsSNVs are frequently located at ligand-binding and protein–protein interface sites and to automate manual approaches that we have previously used to analyse GWAS candidate nsSNVs. We have demonstrated that VarMod performance on a large and established benchmark set is comparable to an existing state of the art method (PolyPhen-2). The VarMod server provides a resource for users to identify functional nvSNVs and to investigate the individual features associated with these variants. Plans for future improvements to the server include increasing the number of interface and binding site features such as considering how variants may alter binding energies and options to submit variants in alternative formats such as Variant Call Files (VCF), which will facilitate high throughput analysis of nsSNVs identified from sequencing studies.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.
  30 in total

1.  Protein-protein interaction sites are hot spots for disease-associated nonsynonymous SNPs.

Authors:  Alessia David; Rozami Razali; Mark N Wass; Michael J E Sternberg
Journal:  Hum Mutat       Date:  2011-12-27       Impact factor: 4.878

2.  Exploring the extremes of sequence/structure space with ensemble fold recognition in the program Phyre.

Authors:  Riccardo M Bennett-Lovsey; Alex D Herbert; Michael J E Sternberg; Lawrence A Kelley
Journal:  Proteins       Date:  2008-02-15

3.  Protein structure prediction on the Web: a case study using the Phyre server.

Authors:  Lawrence A Kelley; Michael J E Sternberg
Journal:  Nat Protoc       Date:  2009       Impact factor: 13.491

4.  Identification of deleterious mutations within three human genomes.

Authors:  Sung Chun; Justin C Fay
Journal:  Genome Res       Date:  2009-07-14       Impact factor: 9.043

5.  VariBench: a benchmark database for variations.

Authors:  Preethy Sasidharan Nair; Mauno Vihinen
Journal:  Hum Mutat       Date:  2012-10-11       Impact factor: 4.878

6.  Evolution and functional impact of rare coding variation from deep sequencing of human exomes.

Authors:  Jacob A Tennessen; Abigail W Bigham; Timothy D O'Connor; Wenqing Fu; Eimear E Kenny; Simon Gravel; Sean McGee; Ron Do; Xiaoming Liu; Goo Jun; Hyun Min Kang; Daniel Jordan; Suzanne M Leal; Stacey Gabriel; Mark J Rieder; Goncalo Abecasis; David Altshuler; Deborah A Nickerson; Eric Boerwinkle; Shamil Sunyaev; Carlos D Bustamante; Michael J Bamshad; Joshua M Akey
Journal:  Science       Date:  2012-05-17       Impact factor: 47.728

7.  Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.

Authors:  W Kabsch; C Sander
Journal:  Biopolymers       Date:  1983-12       Impact factor: 2.505

8.  3DLigandSite: predicting ligand-binding sites using similar structures.

Authors:  Mark N Wass; Lawrence A Kelley; Michael J E Sternberg
Journal:  Nucleic Acids Res       Date:  2010-05-31       Impact factor: 16.971

9.  Genome-wide association study identifies loci influencing concentrations of liver enzymes in plasma.

Authors:  John C Chambers; Weihua Zhang; Joban Sehmi; Xinzhong Li; Mark N Wass; Pim Van der Harst; Hilma Holm; Serena Sanna; Maryam Kavousi; Sebastian E Baumeister; Lachlan J Coin; Guohong Deng; Christian Gieger; Nancy L Heard-Costa; Jouke-Jan Hottenga; Brigitte Kühnel; Vinod Kumar; Vasiliki Lagou; Liming Liang; Jian'an Luan; Pedro Marques Vidal; Irene Mateo Leach; Paul F O'Reilly; John F Peden; Nilufer Rahmioglu; Pasi Soininen; Elizabeth K Speliotes; Xin Yuan; Gudmar Thorleifsson; Behrooz Z Alizadeh; Larry D Atwood; Ingrid B Borecki; Morris J Brown; Pimphen Charoen; Francesco Cucca; Debashish Das; Eco J C de Geus; Anna L Dixon; Angela Döring; Georg Ehret; Gudmundur I Eyjolfsson; Martin Farrall; Nita G Forouhi; Nele Friedrich; Wolfram Goessling; Daniel F Gudbjartsson; Tamara B Harris; Anna-Liisa Hartikainen; Simon Heath; Gideon M Hirschfield; Albert Hofman; Georg Homuth; Elina Hyppönen; Harry L A Janssen; Toby Johnson; Antti J Kangas; Ido P Kema; Jens P Kühn; Sandra Lai; Mark Lathrop; Markus M Lerch; Yun Li; T Jake Liang; Jing-Ping Lin; Ruth J F Loos; Nicholas G Martin; Miriam F Moffatt; Grant W Montgomery; Patricia B Munroe; Kiran Musunuru; Yusuke Nakamura; Christopher J O'Donnell; Isleifur Olafsson; Brenda W Penninx; Anneli Pouta; Bram P Prins; Inga Prokopenko; Ralf Puls; Aimo Ruokonen; Markku J Savolainen; David Schlessinger; Jeoffrey N L Schouten; Udo Seedorf; Srijita Sen-Chowdhry; Katherine A Siminovitch; Johannes H Smit; Timothy D Spector; Wenting Tan; Tanya M Teslovich; Taru Tukiainen; Andre G Uitterlinden; Melanie M Van der Klauw; Ramachandran S Vasan; Chris Wallace; Henri Wallaschofski; H-Erich Wichmann; Gonneke Willemsen; Peter Würtz; Chun Xu; Laura M Yerges-Armstrong; Goncalo R Abecasis; Kourosh R Ahmadi; Dorret I Boomsma; Mark Caulfield; William O Cookson; Cornelia M van Duijn; Philippe Froguel; Koichi Matsuda; Mark I McCarthy; Christa Meisinger; Vincent Mooser; Kirsi H Pietiläinen; Gunter Schumann; Harold Snieder; Michael J E Sternberg; Ronald P Stolk; Howard C Thomas; Unnur Thorsteinsdottir; Manuela Uda; Gérard Waeber; Nicholas J Wareham; Dawn M Waterworth; Hugh Watkins; John B Whitfield; Jacqueline C M Witteman; Bruce H R Wolffenbuttel; Caroline S Fox; Mika Ala-Korpela; Kari Stefansson; Peter Vollenweider; Henry Völzke; Eric E Schadt; James Scott; Marjo-Riitta Järvelin; Paul Elliott; Jaspal S Kooner
Journal:  Nat Genet       Date:  2011-10-16       Impact factor: 38.330

10.  An integrated map of genetic variation from 1,092 human genomes.

Authors:  Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal:  Nature       Date:  2012-11-01       Impact factor: 49.962

View more
  10 in total

1.  VIPdb, a genetic Variant Impact Predictor Database.

Authors:  Zhiqiang Hu; Changhua Yu; Mabel Furutsuki; Gaia Andreoletti; Melissa Ly; Roger Hoskins; Aashish N Adhikari; Steven E Brenner
Journal:  Hum Mutat       Date:  2019-08-17       Impact factor: 4.878

2.  PTENpred: A Designer Protein Impact Predictor for PTEN-related Disorders.

Authors:  Sean B Johnston; Ronald T Raines
Journal:  J Comput Biol       Date:  2016-06-16       Impact factor: 1.479

3.  3DLigandSite: structure-based prediction of protein-ligand binding sites.

Authors:  Jake E McGreig; Hannah Uri; Magdalena Antczak; Michael J E Sternberg; Martin Michaelis; Mark N Wass
Journal:  Nucleic Acids Res       Date:  2022-04-12       Impact factor: 19.160

Review 4.  Single nucleotide variations: biological impact and theoretical interpretation.

Authors:  Panagiotis Katsonis; Amanda Koire; Stephen Joseph Wilson; Teng-Kuei Hsu; Rhonald C Lua; Angela Dawn Wilkins; Olivier Lichtarge
Journal:  Protein Sci       Date:  2014-10-20       Impact factor: 6.725

5.  PinSnps: structural and functional analysis of SNPs in the context of protein interaction networks.

Authors:  Hui-Chun Lu; Julián Herrera Braga; Franca Fraternali
Journal:  Bioinformatics       Date:  2016-03-24       Impact factor: 6.937

6.  Prediction on the risk population of idiosyncratic adverse reactions based on molecular docking with mutant proteins.

Authors:  Hongbo Xie; Diheng Zeng; Xiujie Chen; Diwei Huo; Lei Liu; Denan Zhang; Qing Jin; Kehui Ke; Ming Hu
Journal:  Oncotarget       Date:  2017-10-05

7.  Associating mutations causing cystinuria with disease severity with the aim of providing precision medicine.

Authors:  Henry J Martell; Kathie A Wong; Juan F Martin; Ziyan Kassam; Kay Thomas; Mark N Wass
Journal:  BMC Genomics       Date:  2017-08-11       Impact factor: 3.969

8.  In silico analysis of the V66M variant of human BDNF in psychiatric disorders: An approach to precision medicine.

Authors:  Clara Carolina Silva De Oliveira; Gabriel Rodrigues Coutinho Pereira; Jamile Yvis Santos De Alcantara; Deborah Antunes; Ernesto Raul Caffarena; Joelma Freire De Mesquita
Journal:  PLoS One       Date:  2019-04-18       Impact factor: 3.240

Review 9.  Analysis of genetic variation and potential applications in genome-scale metabolic modeling.

Authors:  João G R Cardoso; Mikael Rørdam Andersen; Markus J Herrgård; Nikolaus Sonnenschein
Journal:  Front Bioeng Biotechnol       Date:  2015-02-16

10.  In silico analysis of the tryptophan hydroxylase 2 (TPH2) protein variants related to psychiatric disorders.

Authors:  Gabriel Rodrigues Coutinho Pereira; Gustavo Duarte Bocayuva Tavares; Marta Costa de Freitas; Joelma Freire De Mesquita
Journal:  PLoS One       Date:  2020-03-02       Impact factor: 3.240

  10 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.