A web application and service for imputing and visualizing missense variant effect maps.

Yingzhou Wu^1,2,3,4, Jochen Weile^1,2,3,4, Atina G Cote^1,4, Song Sun^1,4, Jennifer Knapp^1,4, Marta Verby^1,4, Frederick P Roth^1,2,3,4,5,6.

Abstract

SUMMARY: The promise of personalized genomic medicine depends on our ability to assess the functional impact of rare sequence variation. Multiplexed assays can experimentally measure the functional impact of missense variants on a massive scale. However, even after such assays, many missense variants remain poorly measured. Here we describe a software pipeline and application to impute missing information in experimentally determined variant effect maps.
AVAILABILITY AND IMPLEMENTATION: Web application: http://impute.varianteffect.org; source code: https://github.com/joewuca/imputation.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30649215 PMCID: PMC6735881 DOI: 10.1093/bioinformatics/btz012

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Interpreting genome sequences for personalized diagnostics and therapy is becoming increasingly common (Starita et al.). However, our limited ability to determine which genetic variants are functional has hindered progress. Indeed, among variants in ClinVar that have been subjected to clinical interpretation, the majority have been deemed ‘variants of uncertain significance’ (Cooper, 2015). Many purely computational methods exist for identifying functional variants, e.g. PolyPhen-2 (Adzhubei et al.), SIFT (Ng and Henikoff, 2001) and PROVEAN (Choi et al.); however, computational methods can detect far fewer disease-associated variants with high confidence than experimental functional assays (Sun et al.).
Experimental assays have typically been ‘reactive’, i.e. carried out only after a variant has been observed in a patient. More recently, it has become possible to measure the functional impact of many variants in a single protein using multiplexed assays of variant effect (MAVEs), in which next-generation sequencing is used to measure the effects of functional selection on a mutagenized pool of clones via changes in allele frequency during the selection (Fowler and Fields, 2014; Starita et al., 2017; Weile et al.; Weile and Roth, 2018). However, some missense variants in MAVE experiments are so poorly represented in the mutagenized library that their functional impact cannot be confidently assessed (Supplementary Fig. S1).
Previously, we described methods to fill in the missing information in the resulting variant effect (VE) maps, and to refine entries that were poorly measured (Weile et al.). Strong agreement has been found between imputed function scores and individual experimental assays (Weile et al.). Others have used MAVE data to train models for predicting functional impact (Gray et al.), but these models were not optimized for the imputation problem. Here we modify previous computational methods and make them more accessible via a web application.
Specifically, we provide: (i) a front-end web application that allows users to upload their own MAVE data and visualize or download a complete VE map (Fig. 1A); and (ii) a back-end data processing service that performs imputation and refinement (Fig. 1B). We note that human protein VE maps (imputed or otherwise) are research tools and should be appropriately validated before clinical use.

Fig. 1.

(A) The front-end web application. (B) The back-end application for data processing and machine learning workflow

2 The front-end web interface

The imputation pipeline front end was developed using Google Web Toolkit, a web application development tool. The user interface is composed of three sections. (i) Upload and Impute, in which users upload their MAVE data in the appropriate format (see User Guide Sections 5.2 and 5.3), provide the UniProt ID (Bateman et al.) that corresponds to their target protein, and select analysis parameters (e.g. quality score threshold). After the back-end application has finished uploading, processing and imputing missing MAVE data, the front end visualizes the complete VE map. (ii) View Landscapes, in which users access previously imputed VE maps and revisit user-imputed maps associated with a known session ID. Users may view the landscape of experimentally measured function scores, of imputed and refined function scores, or of scores from computational methods such as PolyPhen-2 and PROVEAN. (iii) Downloads, in which users can download input data format templates and the user guide. Where no VE map is yet available, the application also allows entry of a UniProt ID to retrieve contextual information (e.g. secondary structure annotations) and computational VE predictions from PolyPhen-2, SIFT and PROVEAN. The application is currently limited to the ∼3000 disease-implicated proteins with at least one pathogenic missense variant in ClinVar (Landrum et al.).

3 The back-end application

The back-end application was developed using Python and the associated ‘scikit-learn’ machine learning package. Upon receiving user-uploaded MAVE data, the back-end application executes a series of jobs (Fig. 1B) and returns the complete imputed and refined VE map to the front-end interface, so that users can visualize and download the results.

3.1 Rescaling function scores

Unless the user indicates that the uploaded function scores were pre-normalized, the pipeline rescales the score of each variant such that the median of stop codon variants is defined to have function score 0 and the median of synonymous variants is defined to have function score 1.
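As a minimal sketch of this rescaling step, assuming the map is a simple linear (affine) transformation, which the text implies but does not spell out: the function name `rescale_scores` and the example values are hypothetical and are not taken from the pipeline's source code.

```python
from statistics import median

def rescale_scores(scores, stop_scores, syn_scores):
    """Linearly rescale raw scores so that the median stop-codon score maps
    to 0 and the median synonymous score maps to 1 (assumed affine map)."""
    stop_med = median(stop_scores)
    syn_med = median(syn_scores)
    span = syn_med - stop_med
    if span == 0:
        raise ValueError("stop and synonymous medians coincide; cannot rescale")
    return [(s - stop_med) / span for s in scores]

# Hypothetical raw scores, for illustration only.
raw_stop = [0.10, 0.20, 0.30]      # stop codon (nonsense) variants
raw_syn = [0.90, 1.00, 1.10]       # synonymous variants
raw_missense = [0.20, 0.60, 1.00]  # missense variants to rescale

rescaled = rescale_scores(raw_missense, raw_stop, raw_syn)
```

Under these assumptions, a missense variant scoring like a typical stop codon variant lands near 0 and one scoring like a typical synonymous variant lands near 1.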

3.2 Correcting apparently-adaptive variants

Some variants may appear beneficial, i.e. have greater-than-wild-type function. However, variants that exhibit higher-than-wild-type function in yeast complementation assays are likely to be deleterious in humans (Weile et al.). Therefore, function scores that exceed the wild-type score of 1 are transformed as in Weile et al.

3.3 Modeling MAVE

The application generates a predictive statistical model for the MAVE data. Input features for this model include pre-computed PolyPhen-2 and PROVEAN scores, chemical and physical properties of the wild-type and substituted amino acids, protein structure-related information and the average function score at each position (Supplementary Table S1). (As the automated retrieval of these features may be more generally useful, it is enabled even where MAVE data are not available.) The models are trained using the Gradient Boosted Tree (GBT) method, which outperformed other methods (e.g. random forest, SVM and linear regression) for all available VE maps (Supplementary Fig. S2) in 10-fold cross-validation, as measured by root-mean-squared error. Like the previously described random forest approach (Weile et al.), a GBT model must be retrained for every new protein entry. Relative to that random forest method, the GBT implementation used in our web application is faster and can also handle missing features in the data. Feature importance for each tree was the number of times a feature was used for splitting, weighted by the squared improvement to root-mean-squared error owing to that feature. The average feature importance over all trees was reported (Supplementary Fig. S3). For previous imputation models, the most important feature has been the average function score at each position, with PolyPhen-2, SIFT, PROVEAN and BLOSUM (Henikoff and Henikoff, 1992) scores also being helpful.
To include only high-quality measurements in model training, users can either provide a quality cutoff parameter or let the analysis pipeline select the cutoff that optimizes prediction performance on a test dataset consisting of the top 20% of variants ranked by quality score (Supplementary Fig. S4). The trained GBT model is then applied to each unmeasured missense variant to impute its function score.
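The train-then-impute idea can be illustrated with a deliberately small, dependency-free sketch of gradient boosting for squared error, using one-split trees (stumps). This is not the pipeline's implementation: the paper names scikit-learn but not the exact class or hyperparameters, and in practice a library implementation such as scikit-learn's `GradientBoostingRegressor` would be the natural choice. The two features, the synthetic data and all function names below are assumptions made for illustration.

```python
import random
from math import sqrt

def fit_stump(X, r):
    """Return (feature, threshold, left_mean, right_mean) for the single
    split that minimises squared error when predicting residuals r."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for t in values[:-1]:  # split: x <= t goes left, x > t goes right
            left = [ri for row, ri in zip(X, r) if row[j] <= t]
            right = [ri for row, ri in zip(X, r) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def fit_gbt(X, y, n_trees=50, learning_rate=0.1):
    """Boosting loop: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        pred = [p + learning_rate * (lm if row[j] <= t else rm)
                for row, p in zip(X, pred)]
    return base, learning_rate, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + sum(lr * (lm if row[j] <= t else rm)
                      for j, t, lm, rm in stumps)

def rmse(y_true, y_pred):
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Synthetic stand-ins for two features (e.g. a position-average score and a
# computational predictor score); real inputs are in Supplementary Table S1.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(60)]
y = [0.7 * a + 0.3 * b + rng.gauss(0, 0.05) for a, b in X]

X_train, y_train = X[:45], y[:45]  # "well-measured" variants
X_missing = X[45:]                 # "unmeasured" variants to impute

model = fit_gbt(X_train, y_train)
train_rmse = rmse(y_train, [predict(model, row) for row in X_train])
baseline_rmse = rmse(y_train, [sum(y_train) / len(y_train)] * len(y_train))
imputed = [predict(model, row) for row in X_missing]
```

The sketch omits the 10-fold cross-validation, the quality-cutoff selection and the handling of missing features described above. It only shows how a boosted model fitted to measured variants yields scores for unmeasured ones.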

3.4 Estimating error in function scores

To interpret the function score estimated for a given variant, it is important to understand the uncertainty in that estimate. Given a sufficient number of replicates, the standard error for each variant can be accurately calculated from the set of replicates. When fewer replicates than this are available, a regularized estimate of the standard error is calculated as in Weile et al., updating the measured value with a prior estimate that is based on an overall regression of standard error values against function scores.
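One way such a regularization could look is sketched below. The exact formula is given in Weile et al. and is not reproduced in this text, so the pseudocount-weighted pooling of variances used here, the `prior_weight` parameter and the function name are all assumptions. The prior standard deviation is taken as an input and would come from the regression described above.

```python
from math import sqrt
from statistics import variance

def regularized_se(replicates, prior_sd, prior_weight=2.0):
    """Shrink the replicate variance toward a prior variance, then convert
    the pooled variance to a standard error of the mean.

    ASSUMPTION: the weighting scheme (prior_weight pseudo-observations versus
    n - 1 degrees of freedom) is illustrative, not the published formula.
    """
    n = len(replicates)
    if n < 2:
        raise ValueError("need at least two replicates for a sample variance")
    sample_var = variance(replicates)  # unbiased sample variance
    dof = n - 1
    pooled_var = (prior_weight * prior_sd ** 2 + dof * sample_var) / (prior_weight + dof)
    return sqrt(pooled_var / n)

# Hypothetical example: two replicates that happen to agree closely, and a
# prior standard deviation suggesting more noise is typical at this score.
replicates = [0.8, 1.0]
prior_sd = 0.3

naive_se = sqrt(variance(replicates) / len(replicates))
shrunk_se = regularized_se(replicates, prior_sd)
```

With few replicates the estimate is pulled toward the prior. As the number of replicates grows, the measured variance dominates.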

3.5 Refining measured function scores

The model for imputing missing scores can also help refine experimental scores that were imperfectly measured. Refined scores are calculated as a weighted average of imputed and measured scores, with each input score weighted by the inverse square of its estimated standard error.
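The weighted average described here can be written out as a short sketch. The function name and the example numbers are hypothetical, and how the pipeline obtains the standard error of an imputed score is not covered.

```python
def refine_score(measured, measured_se, imputed, imputed_se):
    """Combine a measured and an imputed score, weighting each by the
    inverse square of its estimated standard error."""
    w_measured = 1.0 / measured_se ** 2
    w_imputed = 1.0 / imputed_se ** 2
    return (w_measured * measured + w_imputed * imputed) / (w_measured + w_imputed)

# Hypothetical variant: a noisy measurement (SE 0.2) and a more confident
# imputed value (SE 0.1); the refined score is pulled toward the latter.
refined = refine_score(measured=0.4, measured_se=0.2, imputed=0.8, imputed_se=0.1)
```

With these numbers the weights are 25 and 100, so the refined score is (25 × 0.4 + 100 × 0.8) / 125 = 0.72. A well-measured variant, by contrast, would stay close to its experimental score.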

References

1. Predicting deleterious amino acid substitutions.

Authors: P C Ng; S Henikoff
Journal: Genome Res Date: 2001-05 Impact factor: 9.043

2. Amino acid substitution matrices from protein blocks.

Authors: S Henikoff; J G Henikoff
Journal: Proc Natl Acad Sci U S A Date: 1992-11-15 Impact factor: 11.205

3. Massively Parallel Functional Analysis of BRCA1 RING Domain Variants.

Authors: Lea M Starita; David L Young; Muhtadi Islam; Jacob O Kitzman; Justin Gullingsrud; Ronald J Hause; Douglas M Fowler; Jeffrey D Parvin; Jay Shendure; Stanley Fields
Journal: Genetics Date: 2015-03-30 Impact factor: 4.562

4. Variant Interpretation: Functional Assays to the Rescue.

Authors: Lea M Starita; Nadav Ahituv; Maitreya J Dunham; Jacob O Kitzman; Frederick P Roth; Georg Seelig; Jay Shendure; Douglas M Fowler
Journal: Am J Hum Genet Date: 2017-09-07 Impact factor: 11.025

5. Deep mutational scanning: a new style of protein science.

Authors: Douglas M Fowler; Stanley Fields
Journal: Nat Methods Date: 2014-08 Impact factor: 28.547

6. Predicting the functional effect of amino acid substitutions and indels.

Authors: Yongwook Choi; Gregory E Sims; Sean Murphy; Jason R Miller; Agnes P Chan
Journal: PLoS One Date: 2012-10-08 Impact factor: 3.240

7. Parlez-vous VUS?

Authors: Gregory M Cooper
Journal: Genome Res Date: 2015-10 Impact factor: 9.043

8. An extended set of yeast-based functional assays accurately identifies human disease mutations.

Authors: Song Sun; Fan Yang; Guihong Tan; Michael Costanzo; Rose Oughtred; Jodi Hirschman; Chandra L Theesfeld; Pritpal Bansal; Nidhi Sahni; Song Yi; Analyn Yu; Tanya Tyagi; Cathy Tie; David E Hill; Marc Vidal; Brenda J Andrews; Charles Boone; Kara Dolinski; Frederick P Roth
Journal: Genome Res Date: 2016-03-14 Impact factor: 9.043

9. ClinVar: improving access to variant interpretations and supporting evidence.

Authors: Melissa J Landrum; Jennifer M Lee; Mark Benson; Garth R Brown; Chen Chao; Shanmuga Chitipiralla; Baoshan Gu; Jennifer Hart; Douglas Hoffman; Wonhee Jang; Karen Karapetyan; Kenneth Katz; Chunlei Liu; Zenith Maddipatla; Adriana Malheiro; Kurt McDaniel; Michael Ovetsky; George Riley; George Zhou; J Bradley Holmes; Brandi L Kattman; Donna R Maglott
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

10. UniProt: the universal protein knowledgebase.

Authors:
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971
