Jochen Weile, Song Sun, Atina G Cote, Jennifer Knapp, Marta Verby, Joseph C Mellor, Yingzhou Wu, Carles Pons, Cassandra Wong, Natascha van Lieshout, Fan Yang, Murat Tasan, Guihong Tan, Shan Yang, Douglas M Fowler, Robert Nussbaum, Jesse D Bloom, Marc Vidal, David E Hill, Patrick Aloy, Frederick P Roth.
Abstract
Although we now routinely sequence human genomes, we can confidently identify only a fraction of the sequence variants that have a functional impact. Here, we developed a deep mutational scanning (DMS) framework that produces exhaustive maps for human missense variants by combining random codon mutagenesis and multiplexed functional variation assays with computational imputation and refinement. We applied this framework to four proteins corresponding to six human genes: UBE2I (encoding SUMO E2 conjugase), SUMO1 (small ubiquitin-like modifier), TPK1 (thiamin pyrophosphokinase), and CALM1/2/3 (three genes encoding the protein calmodulin). The resulting maps recapitulate known protein features and confidently identify pathogenic variation. Assays potentially amenable to deep mutational scanning are already available for 57% of human disease genes, suggesting that DMS could ultimately map functional variation for all human disease genes.
Keywords: complementation; deep mutational scanning; genotype–phenotype; variants of uncertain significance
Year: 2017 PMID: 29269382 PMCID: PMC5740498 DOI: 10.15252/msb.20177908
Source DB: PubMed Journal: Mol Syst Biol ISSN: 1744-4292 Impact factor: 11.429
Figure 1. UBE2I screening and validation
(A) Modular structure of the screening framework.
(B) Raw DMS‐BarSeq fitness scores in technical replicates (separately plated assays of the same pool) and biological replicates (separate sub‐strains in the pool carrying the same variants).
(C) Manual spotting assay validation of a representative set of variants. Each row represents a consecutive fivefold dilution. Marked in red: maximal dilution visible in empty vector control. Marked in green: maximal dilution with visible human wt control. Marked in yellow: dilution steps exceeding visible human wt control. Bar heights represent summary screen scores. Error bars show Bayesian regularized standard error based on three technical replicates and a prior based on pre‐selection counts and final score (see Materials and Methods for details).
(D) Variants grouped by evolutionary conservation (AMAS score) of their respective sites (top) and grouped by structural context within the protein core, within protein–protein interaction interfaces or on remaining protein surface (bottom). Boxes range across the second and third quartiles with the middle bar representing the median. Whiskers show the most extreme values within 1.5×IQR. As normality cannot be assumed for the distributions of fitness scores, one‐sided two‐sample Wilcoxon–Mann–Whitney tests were used. Low conservation (n = 60 clones) vs. medium conservation (n = 105 clones) W = 3789, *P = 0.015; medium conservation (n = 105 clones) vs. high conservation (n = 404 clones) W = 28043, *P = 1.8 × 10⁻⁷; core (n = 208 clones) vs. surface (n = 42 clones) W = 1649, *P = 1.01 × 10⁻¹⁰; interface (n = 215 clones) vs. surface (n = 42 clones) W = 2461, *P = 1.58 × 10⁻⁶.
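The one‐sided two‐sample Wilcoxon–Mann–Whitney comparison named in this legend can be sketched in plain Python. This is only an illustration: the function names and the toy scores are assumptions, not data or code from the study, and the p‐value uses a normal approximation with continuity correction and no tie correction. The statistic counts the pairs in which a value from the first group exceeds a value from the second, with ties counting as 0.5.

```python
from math import erfc, sqrt

def mann_whitney_u(x, y):
    """Count (x_i, y_j) pairs with x_i > y_j; ties count as 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def one_sided_p_greater(x, y):
    """Approximate p-value for the alternative 'x tends to be larger than y'.

    Normal approximation with continuity correction; no tie correction.
    """
    n1, n2 = len(x), len(y)
    u = mann_whitney_u(x, y)
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean - 0.5) / sd
    return 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal

# Toy fitness scores, made up for illustration only
surface = [0.9, 1.0, 0.8, 1.1, 0.95, 0.85]
core = [0.2, 0.4, 0.1, 0.7, 0.3, 0.5]

u = mann_whitney_u(surface, core)
p = one_sided_p_greater(surface, core)
print(f"U = {u}, one-sided P = {p:.4f}")
```

For small samples or heavily tied data, an exact or tie‐corrected test from a statistics package would be the safer choice.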
Figure 2. Validation of machine‐learning imputation for UBE2I
(A) Cross‐validation evaluation: Joint scores from DMS‐BarSeq and DMS‐TileSeq compared to machine‐learning prediction in 10× cross‐validation. The agreement is comparable to that between biological replicates in the screen itself (compare to Fig 1B).
(B) Error map, showing cross‐validation results for each data point sorted by amino acid position and mutant residue.
(C) Comparison of imputation predictions with individual spotting assays. Each row represents a consecutive fivefold dilution. Marked in red: maximal dilution visible in empty vector control. Marked in green: maximal dilution with visible human wt control. Marked in yellow: dilution steps exceeding visible human wt control.
(D) Most informative features in the Random Forest imputation, as measured in % increase in mean squared deviation upon randomization of a given feature.
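The feature‐importance measure in this legend (the % increase in mean squared deviation when a feature is randomized) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: a fixed toy predictor stands in for the Random Forest, the data are synthetic, and the function names are hypothetical.

```python
import random

def mse(model, rows, targets):
    """Mean squared deviation between model predictions and targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature, rng):
    """Percent increase in MSE after shuffling the values of one feature."""
    base = mse(model, rows, targets)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return 100.0 * (mse(model, shuffled, targets) - base) / base

# Synthetic data: the target depends on feature 0 only (plus noise)
rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(50)]
targets = [0.8 * r[0] + rng.gauss(0, 0.05) for r in rows]

# Toy stand-in for a trained model; it ignores feature 1 entirely
model = lambda r: 0.8 * r[0]

imp_informative = permutation_importance(model, rows, targets, 0, rng)
imp_ignored = permutation_importance(model, rows, targets, 1, rng)
```

Shuffling a feature the model relies on degrades its predictions, whereas shuffling an ignored feature leaves the error unchanged, so `imp_ignored` is exactly zero here.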
Figure 3. A complete functional map of UBE2I
(A) A complete functional map of UBE2I resulting from the combination of the complementation screen and machine‐learning imputation and refinement. An impact score of 0 (blue) corresponds to a fitness equivalent to the empty vector control. A score of 1 (white) corresponds to a fitness equivalent to the wild‐type control. A score > 1 (red) corresponds to fitness above wild‐type levels. Shown above for comparison are sequence conservation, secondary structure, solvent accessibility, and burial of the respective amino acid in protein–protein interaction interfaces with covalently and non‐covalently bound SUMO, the E1 UBA2, the sumoylation target RanGAP1, the E3 RanBP2 and UBE2I itself. Hydrogen bonds or salt bridges between residues and the respective interaction partner are marked with red asterisks. Residues buried in both the covalent SUMO and client interfaces are framed with dotted lines, marking the core members of the active site.
(B) UBE2I crystal structure with residues colored according to the median mutant fitness. Colors as in (A). The interacting substrate's ΨKxE motif is shown as a green stick model; covalently bound SUMO is shown as a red cartoon model; and non‐covalently bound SUMO is shown as a brown cartoon model. The structures shown were obtained by alignment of PDB entries 3UIP and 2PE6.
(C) UBE2I crystal structure as in (B), with residues colored according to maximum mutant fitness.
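The score scale described for the map (0 at the empty vector control, 1 at the wild‐type control, above 1 for fitness exceeding wild type) is consistent with a linear rescaling between the two controls. The sketch below assumes such a linear rescaling purely for illustration; the actual normalization is described in the Materials and Methods, and the function name and input values are hypothetical.

```python
def impact_score(fitness, null_fitness, wt_fitness):
    """Linearly rescale fitness so the null control maps to 0 and wild type to 1."""
    return (fitness - null_fitness) / (wt_fitness - null_fitness)

# Hypothetical raw fitness values for the two controls
NULL, WT = 1.0, 3.0

scores = {
    "null-like": impact_score(1.0, NULL, WT),      # 0.0, blue on the map
    "intermediate": impact_score(2.0, NULL, WT),   # 0.5
    "wt-like": impact_score(3.0, NULL, WT),        # 1.0, white on the map
    "above wt": impact_score(3.5, NULL, WT),       # 1.25, red on the map
}
```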
Map quality comparison
| Gene | Possible AA changes | Achieved AA changes | Imputation RMSD | Experimental max(s.e.m.) | Refined max(s.e.m.) | Refinement > 0.05 |
|---|---|---|---|---|---|---|
| UBE2I | 3021 | 2563 (85%) | 0.24 | 0.36 | 0.25 | 2.46% |
| SUMO1 | 1919 | 1700 (89%) | 0.25 | 0.19 | 0.17 | 1.06% |
| TPK1 | 4617 | 3181 (69%) | 0.34 | 0.49 | 0.37 | 5.51% |
| CALM1 | 2831 | 1813 (64%) | 0.29 | 0.28 | 0.22 | 6.84% |
Experimental max(s.e.m.): the largest standard error associated with any experimentally measured score in the given dataset; refined max(s.e.m.): the largest standard error associated with any refined score in the given dataset. Refinement > 0.05: the percentage of variants whose scores were changed by more than 0.05 as a result of refinement.
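As a worked check on the table, the percentages in the "Achieved AA changes" column follow from dividing achieved by possible amino acid changes. The sketch below reproduces them from the table values. The `percent_changed` helper illustrates how a "Refinement > 0.05" percentage could be computed. Its name and its toy inputs are assumptions, not the study's data.

```python
# (possible, achieved) amino acid changes, taken from the table above
maps = {
    "UBE2I": (3021, 2563),
    "SUMO1": (1919, 1700),
    "TPK1": (4617, 3181),
    "CALM1": (2831, 1813),
}

coverage = {
    gene: round(100 * achieved / possible)
    for gene, (possible, achieved) in maps.items()
}

def percent_changed(before, after, threshold=0.05):
    """Percentage of scores that moved by more than `threshold`."""
    moved = sum(abs(a - b) > threshold for b, a in zip(before, after))
    return 100.0 * moved / len(before)

# Toy scores before and after refinement (illustrative only)
toy_before = [0.5, 0.9, 1.0, 0.2]
toy_after = [0.5, 0.8, 1.02, 0.2]
toy_pct = percent_changed(toy_before, toy_after)
```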
Figure 4. Functional maps of SUMO1, TPK1, and calmodulin (CALM1/2/3)
Layout and colors as in Fig 3.
Figure 5. DMS functional maps reflect clinical phenotypes
(A) Comparison of (refined) functional scores between rare polymorphisms (gnomAD) and somatic tumor mutations (COSMIC) in UBE2I and SUMO1. Bars show median and quartiles. As normality cannot be assumed for the distributions of fitness scores, a one‐sided two‐sample Wilcoxon–Mann–Whitney test was used: n = {26,31} variants, W = 570.5, P = 3.73 × 10⁻³.
(B) Impact score distributions in calmodulin overlaid with previously observed alleles in CALM1, CALM2, and CALM3: Rare alleles from gnomAD are shown in green; ClinVar alleles classified as pathogenic are shown in red.
(C) Precision‐recall curves for our DMS atlas, PROVEAN, and PolyPhen‐2 with respect to distinguishing gnomAD variants from pathogenic alleles from ClinVar.
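A precision‐recall curve of the kind described here is built by sweeping a score threshold and recording precision and recall at each step. The following is a minimal sketch under stated assumptions: variants scoring below the threshold are called damaging, the scores and labels are invented toy values, and the function name is hypothetical.

```python
def precision_recall(scores, is_pathogenic, threshold):
    """Precision and recall when variants scoring below `threshold` are called damaging."""
    called = [s < threshold for s in scores]
    tp = sum(c and p for c, p in zip(called, is_pathogenic))
    fp = sum(c and not p for c, p in zip(called, is_pathogenic))
    fn = sum((not c) and p for c, p in zip(called, is_pathogenic))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy functional scores and labels (True = pathogenic), illustrative only
scores = [0.1, 0.3, 0.45, 0.6, 0.9, 0.95, 1.0, 1.05]
labels = [True, True, False, True, False, False, False, False]

# One (precision, recall) point per candidate threshold
curve = [(t, *precision_recall(scores, labels, t)) for t in (0.2, 0.5, 0.7)]
for threshold, precision, recall in curve:
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
```

Raising the threshold captures more pathogenic variants (higher recall) at the risk of calling benign variants damaging (lower precision), which is the trade‐off such curves display.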
Invitae VUS classification
| Variant | MAF | sd/rmsd | Imp/ref | Unrefined | DMS | DMS call | Indication |
|---|---|---|---|---|---|---|---|
| D94A | NA | 0.26 | Imputed | NA | 0.46 | Likely damaging | Cardio |
| D96H | NA | 0.26 | Imputed | NA | 0.72 | Likely damaging | Cardio |
| I28V | 10⁻⁵ | 0.05 | Mild ref. | 0.88 | 0.88 | Uncertain | Cardio |
| N98S | NA | 0.05 | Mild ref. | 0.89 | 0.89 | Uncertain | Cardio |
| T35I | 4 × 10⁻⁶ | 0.04 | Mild ref. | 0.93 | 0.93 | Likely benign | Non‐Cardio |
| E48G | NA | 0.05 | Mild ref. | 0.93 | 0.93 | Likely benign | Cardio |
| G26D | NA | 0.06 | Mild ref. | 0.94 | 0.94 | Likely benign | Non‐Cardio |
| T27S | 3 × 10⁻⁵ | 0.05 | Mild ref. | 0.96 | 0.96 | Likely benign | Non‐Cardio |
| V122A | NA | 0.05 | Mild ref. | 0.98 | 0.98 | Likely benign | Non‐Cardio |
| A104G | NA | 0.08 | Mild ref. | 1.00 | 1.00 | Likely benign | Non‐Cardio |
sd/rmsd, standard error (for measured values)/root‐mean‐squared deviation (for imputed values); imp/ref, imputation/refinement; mild ref., mild refinement.