Multiplex assessment of protein variant abundance by massively parallel sequencing.

Kenneth A Matreyek¹, Lea M Starita¹, Jason J Stephany¹, Beth Martin¹, Melissa A Chiasson¹, Vanessa E Gray¹, Martin Kircher¹, Arineh Khechaduri¹, Jennifer N Dines², Ronald J Hause¹, Smita Bhatia³, William E Evans⁴, Mary V Relling⁴, Wenjian Yang⁴, Jay Shendure⁵,⁶, Douglas M Fowler⁷,⁸,⁹.

Abstract

Determining the pathogenicity of genetic variants is a critical challenge, and functional assessment is often the only option. Experimentally characterizing millions of possible missense variants in thousands of clinically important genes requires generalizable, scalable assays. We describe variant abundance by massively parallel sequencing (VAMP-seq), which measures the effects of thousands of missense variants of a protein on intracellular abundance simultaneously. We apply VAMP-seq to quantify the abundance of 7,801 single-amino-acid variants of PTEN and TPMT, proteins in which functional variants are clinically actionable. We identify 1,138 PTEN and 777 TPMT variants that result in low protein abundance, and may be pathogenic or alter drug metabolism, respectively. We observe selection for low-abundance PTEN variants in cancer, and show that p.Pro38Ser, which accounts for ~10% of PTEN missense variants in melanoma, functions via a dominant-negative mechanism. Finally, we demonstrate that VAMP-seq is applicable to other genes, highlighting its generalizability.

Year: 2018 PMID: 29785012 PMCID: PMC5980760 DOI: 10.1038/s41588-018-0122-z

Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330

INTRODUCTION

Every possible nucleotide change that is compatible with life is likely present in the germline of a living human[1]. Some of these variants alter protein activity or abundance, and, consequently, may impact disease risk. However, only ~2% of all presently reported germline missense variants have clinical interpretations[2,3]. Most of the remaining variants, as well as nearly all missense variants not yet observed, are rare and cannot be interpreted using traditional genetic approaches. Computational approaches are insufficiently accurate, and somatic mutations further complicate the picture. These limitations create a major challenge for the clinical use of genomic information. Deep mutational scans, which enable the simultaneous functional characterization of thousands of missense variants of a protein, offer one potential solution to the variant interpretation problem[4-6]. For example, the effects of nearly all possible single amino acid variants of the RING domain of BRCA1 on E3 ligase and BARD1 binding activity were quantified in a single study[7]. In another example, the effects of all possible single amino acid variants of PPARγ on the expression of CD36 in response to different agonists were measured[8]. In both cases, the functional data enabled accurate identification of most known pathogenic variants, suggesting that it could be useful in interpreting newly observed variants. So far, deep mutational scans, including of BRCA1 and PPARγ, have relied on assays specific for each protein’s molecular function. However, developing specific assays for each of the thousands of disease-related proteins is impractical. To overcome this challenge, we sought to devise a functional assay that was both informative of variant effect and generalizable to many proteins. We based our assay on the fact that most proteins, despite their diversity, must be abundant enough to perform their molecular function. 
Variants can interfere with steady-state protein abundance in cells via a variety of mechanisms, including by diminishing thermodynamic stability, altering post-transcriptional regulation or interrupting trafficking. In fact, as much as 75% of the pathogenic variation in monogenic disease is thought to disrupt thermodynamic stability and, consequently, alter abundance[9,10]. Furthermore, low-abundance variants of tumor suppressors can lead to cancer[11,12], while low-abundance variants of drug-metabolizing enzymes can alter drug response[13]. Here, we describe Variant Abundance by Massively Parallel Sequencing (VAMP-seq), which measures the steady-state abundance of protein variants in cultured human cells. We applied VAMP-seq to assess 4,112 single amino acid variants of the tumor suppressor PTEN and 3,689 variants of the enzyme TPMT. Our results show how changes in protein biophysical properties and interactions within and between proteins alter protein abundance in cells. We identify 1,138 previously uncharacterized, low-abundance single amino acid variants of PTEN that are likely to be pathogenic, and 777 TPMT variants that are likely unable to adequately methylate and thereby inactivate thiopurine drugs. We observe selection for low-abundance PTEN variants in cancer and reveal that LRG_311p1:p.Pro38Ser, which accounts for ~10% of PTEN missense variants observed in melanoma, functions via a dominant negative mechanism. Finally, we demonstrate that VAMP-seq can be applied to other clinically important proteins including VKOR, CYP2C9, CYP2C19, MLH1, and PMS2.

RESULTS

Multiplex assessment of PTEN and TPMT variant abundance

Inspired by earlier methods to assess the stability of protein variants in yeast[14] and bacteria[15], and by a microarray-based assay that globally profiled mammalian protein stability[16], we developed VAMP-seq. VAMP-seq is a multiplex assay that uses fluorescent reporters to measure the steady-state abundance of protein variants in cultured human cells (Fig. 1). Each cell expresses a single variant directly fused to EGFP. The stability of the variant dictates the abundance of the EGFP fusion and, accordingly, the green fluorescence signal of the cell. To control for expression, mCherry is either co-transcriptionally or co-translationally expressed.

Figure 1

Overview of Variant Abundance by Massively Parallel Sequencing (VAMP-seq)

A mixed population of cells each expressing one protein variant fused to EGFP is created. The variant dictates the abundance of the variant-EGFP fusion protein, resulting in a range of cellular EGFP fluorescence levels. Cells are then sorted into bins based on their level of fluorescence, and high throughput sequencing is used to quantify every variant in each bin. VAMP-seq scores are calculated from the scaled, weighted average of variants across bins. The resulting sequence-function maps describe the relative intracellular abundance of thousands of protein variants.
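
The caption says only that scores come from a scaled, weighted average of variants across bins. The sketch below shows one way such a score could be computed. It assumes four bins with hypothetical weights from 0.25 to 1.0, and it assumes that scaling places the median nonsense variant at 0 and the median synonymous variant at 1; neither detail is stated in this section.

```python
from statistics import median

# Hypothetical bin weights (an assumption, not taken from the text):
# the lowest-fluorescence bin gets the smallest weight.
BIN_WEIGHTS = [0.25, 0.5, 0.75, 1.0]


def weighted_average(bin_freqs, weights=BIN_WEIGHTS):
    """Frequency-weighted mean of the bin weights for a single variant."""
    total = sum(bin_freqs)
    if total == 0:
        raise ValueError("variant was not observed in any bin")
    return sum(f * w for f, w in zip(bin_freqs, weights)) / total


def scale_score(raw, nonsense_raw, synonymous_raw):
    """Min-max scale a raw weighted average.

    Assumed convention: median nonsense variant -> 0,
    median synonymous (WT-like) variant -> 1.
    """
    low = median(nonsense_raw)
    high = median(synonymous_raw)
    return (raw - low) / (high - low)


# Usage: a variant spread evenly across the four bins.
raw = weighted_average([0.25, 0.25, 0.25, 0.25])
score = scale_score(raw, nonsense_raw=[0.25, 0.25, 0.3],
                    synonymous_raw=[1.0, 0.9, 1.0])
```

Under these assumptions, a variant found only in the lowest bin scores near zero and a variant found only in the highest bin scores near one, which matches the score range described in the text.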

We first evaluated VAMP-seq’s ability to quantify abundance of the tumor suppressor protein PTEN and the enzyme TPMT. Each wild type open reading frame was N-terminally tagged with EGFP and recombined into a single genomic locus of an engineered HEK 293T cell line[17]. We also constructed cell lines expressing known low-abundance variants of each protein. We assessed the EGFP:mCherry ratio by flow cytometry, and found that cells expressing wild type PTEN or TPMT had ~5-fold higher EGFP:mCherry ratios than the known low-abundance variants (Fig. 2a; Supplementary Fig. 1b, c).

Figure 2

VAMP-seq abundance scores for PTEN and TPMT

a, Flow cytometry profiles for PTEN (left) and TPMT (right), with WT (red), known low-abundance variant controls (blue), and the variant libraries (gray) overlaid. Bin thresholds used to sort the library are shown above the plots. Each smoothed histogram was generated from at least 1,500 recombined cells from control constructs, and at least 6,000 recombined cells from the library. b, VAMP-seq abundance score density plots for PTEN (left) and TPMT (right) nonsense variants (blue dotted line), synonymous variants (red dotted line), and missense variants (filled, solid line). The missense variant densities are colored as gradients between the lowest 10% of abundance scores (blue), the WT abundance score (white), and abundance scores above WT (red). c, d, Heatmap of PTEN (c) and TPMT (d) abundance scores, colored according to the scale in b. Variants that were not scored are colored gray. e, f, Number of amino acid substitutions scored at each position for PTEN and TPMT. g, h, Positional median PTEN and TPMT abundance scores, computed for positions with a minimum of 5 variants, are shown as dots. The gray line represents the mean abundance score in a three-residue sliding window. i, j, PTEN and TPMT position-specific PSIC conservation scores are shown as dots, and the gray line represents the mean PSIC score within a three-residue sliding window. k, l, PTEN and TPMT domain architecture is shown, with positions in alpha helices and beta sheets colored cyan and pink, respectively.
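
Panels g and h summarize scores per position: a median over positions with at least five scored variants, plus a mean in a three-residue sliding window. A minimal sketch of that summary follows. The input layout (a dict from position to a list of scores) and the handling of window edges and unscored neighbors are assumptions.

```python
from statistics import mean, median


def positional_medians(scores_by_position, min_variants=5):
    """Median score per position, skipping positions with too few variants."""
    return {
        pos: median(scores)
        for pos, scores in scores_by_position.items()
        if len(scores) >= min_variants
    }


def sliding_mean(values_by_position, window=3):
    """Mean over a centered window of consecutive positions.

    Assumption: neighbors that are missing (sequence ends or unscored
    positions) are simply left out of the average.
    """
    half = window // 2
    smoothed = {}
    for pos in values_by_position:
        neighbors = [
            values_by_position[p]
            for p in range(pos - half, pos + half + 1)
            if p in values_by_position
        ]
        smoothed[pos] = mean(neighbors)
    return smoothed


# Usage with made-up scores: position 3 has only two variants and is dropped.
example = {
    1: [0.1, 0.2, 0.3, 0.4, 0.5],
    2: [1.0, 1.0, 1.0, 1.0, 1.0],
    3: [0.5, 0.5],
    4: [0.0, 0.0, 0.0, 1.0, 1.0],
}
medians = positional_medians(example)
smoothed = sliding_mean(medians)
```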

We next applied VAMP-seq to measure the steady state abundance of thousands of PTEN and TPMT single amino acid variants in parallel. Barcoded, site saturation mutagenesis libraries of each protein were separately recombined into our engineered HEK 293T cell line[17,18]. Cells harboring each library had EGFP:mCherry ratios that spanned the range of our wild type (WT) and known low-abundance variant controls (Fig. 2a). Cells were flow sorted into bins according to their EGFP:mCherry ratio, and high-throughput DNA sequencing was used to quantify each variant’s frequency in each bin. Finally, an abundance score was calculated for each variant based on its distribution across the bins (Fig. 1; Supplementary Table 1). Abundance scores ranged from about zero, indicating total loss of abundance, to about one, indicating WT-like abundance (Fig. 2b). Abundance scores correlated moderately well between replicates (mean Pearson’s r = 0.63, mean Spearman’s ρ = 0.62 for PTEN; and mean r = 0.73, mean ρ = 0.67 for TPMT; Supplementary Fig. 2). To improve accuracy, final abundance scores and confidence intervals were computed from eight replicate experiments. The resulting data set describes the effects of 4,112 of the 7,638 possible single amino acid PTEN variants and 3,689 of the 4,655 possible TPMT variants (Fig. 2c, d; Supplementary Data 1, 2; Supplementary Table 2). VAMP-seq-derived abundance scores were highly correlated with the abundances of protein variants assessed in individual experiments (n = 25, r = 0.96, ρ = 0.96 for PTEN; n = 19, r = 0.75, ρ = 0.61 for TPMT; Supplementary Fig. 3a, b). Furthermore, PTEN variant abundances measured using full-length EGFP and a fifteen-amino-acid split-GFP tag[19] were in agreement (n = 6, r = 0.98, ρ = 0.94; Supplementary Fig. 1d). Finally, our abundance scores were consistent with 41 PTEN and 20 TPMT variant abundance effects assessed by western blotting (Supplementary Fig. 3c, d). 
Thus, VAMP-seq accurately quantifies steady-state protein variant abundance. For both proteins, the distribution of abundance scores was bimodal, with peaks that overlapped WT synonyms and nonsense variants (Fig. 2b). Nonsense variants exhibited consistently low scores, except for those at the extreme N- or C-termini of each protein (Supplementary Fig. 3e). A larger fraction of PTEN variants had low abundance scores than TPMT variants, possibly reflecting the lower thermostability of PTEN (Tm = 40.3 °C) relative to TPMT (Tm = ~60 °C) (Supplementary Fig. 3f)[20,21]. This inverse relationship between the fraction of low-abundance variants and thermostability is consistent with a deep mutational scan of GFP (Tm = ~78 °C), which found relatively few variants with a large effect on fluorescence[22,23]. Median variant abundance scores at each position illustrated tolerance to amino acid substitution (Fig. 2g, h; Supplementary Data 3, 4; Supplementary Table 2), which was inversely related to conservation (ρ = −0.26 and −0.59 for PTEN and TPMT, respectively; Fig. 2i, j; Supplementary Fig. 3g, h). In PTEN, alpha helices and beta sheets were less tolerant to substitution, while flexible loops were highly tolerant (Fig. 2k, l; Supplementary Fig. 3i). In TPMT, beta sheets, which comprise the core of the protein, were less tolerant of substitution (Supplementary Fig. 3j). The abundance data can be explored using an interactive web interface (see URLs).

Thermodynamic stability partly explains variant abundance

Variants can potentially alter protein abundance inside cells via a variety of mechanisms, including by changing thermodynamic stability. We compared our abundance scores to various biochemical and biophysical features and found that hydrophobic packing, which affects thermodynamic stability in vitro[24-26], was a key correlate of abundance. Mutation of WT hydrophobic aromatic, methionine, or long nonpolar aliphatic amino acids produced the largest decreases in abundance for both proteins (Fig. 3a). In fact, WT amino acid hydrophobicity was negatively correlated with abundance (Fig. 3b, WT hydroΦ), whereas mutant amino acid hydrophobicity was positively correlated with abundance (MT hydroΦ). Conversely, mutations of WT amino acids with high relative solvent accessibility (RSA), polarity (WT Polarity), and crystal-structure temperature factor (B-factor), all features associated with polar residues present on the protein surface, were associated with high abundance (Fig. 3b). Consistent with the importance of hydrophobic packing, positions with the lowest average abundance scores were largely in the solvent inaccessible interiors of each protein (Fig. 3c, d). Finally, PTEN abundance scores correlated strongly with in vitro melting temperatures[20] (n = 5, r = 0.97, ρ = 0.90; Supplementary Fig. 4a). These observations, consistent between PTEN and TPMT, suggest that variant thermodynamic stability is a major driver of variant abundance in vivo.
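
The feature comparisons above are reported as Spearman rank correlations between abundance scores and per-variant features such as hydrophobicity or solvent accessibility. As an illustration only, the sketch below computes Spearman's ρ in plain Python by ranking both variables (ties receive the average rank) and taking the Pearson correlation of the ranks. The function names are hypothetical and no real data are used.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, a feature that rises monotonically but nonlinearly with abundance still gives ρ close to 1, which is why rank correlation suits heterogeneous features like these.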

Figure 3

Biochemical features influencing intracellular protein abundance

a, Scatterplots of variant abundance scores averaged over all twenty WT residues (left) or mutant residues (right) for PTEN (x-axis) and TPMT (y-axis). b, A scatterplot of Spearman’s rho values for PTEN (x-axis) or TPMT (y-axis) abundance score correlations with various evolutionary (red), structural (blue), or primary protein sequence (cyan) features (n = 3,411 for PTEN, n = 3,230 for TPMT). See legend of Supplementary Table 2 for information regarding these features. c, d, PTEN (c, PDB: 1d5r) and TPMT (d, PDB: 2h11) crystal structures are shown. Chains are colored according to positional median abundance scores using a gradient between the lowest 10% of positional median abundance scores (blue), the WT abundance score (white), and abundance scores above WT (red). The 20% of positions with the lowest scores are shown as a semi-transparent surface. The substrate mimicking compounds tartrate and S-adenosyl-L-homocysteine are displayed as magenta spheres. e, Low-abundance PTEN residues with predicted hydrogen bonds or salt bridges are shown as sticks with a semi-transparent surface representation. Residues within 11 Å of each other are clustered and colored as discrete groups. The residues in each group are identified by number, followed, in parentheses, by the number of times any variant at the residue is found in the COSMIC database. f, Residues with high abundance scores are shown as semi-transparent red spheres, and known membrane-interacting side-chains shown as opaque cyan spheres. Residues that are both membrane-interacting and have high abundance scores are shown in gray.

Next, we explored the role of polar contacts, using the PTEN structure to identify all side-chains predicted to form hydrogen bonds and ion pairs. Of the 76 positions potentially participating in these interactions, only 26 were mutationally intolerant (Supplementary Fig. 4b). These 26 intolerant positions largely clustered into discrete groups in three-dimensional space (Fig. 3e; Supplementary Fig. 4c). The groups highlighted regions of PTEN particularly important for abundance, and often included positions distant in primary sequence. For example, group 5 positions, along with p.Ser170, mediate inter-domain contacts between the PTEN phosphatase and C2 domains[27], and we found that mutations at these positions resulted in loss of abundance (Fig. 3e). Mutations at these positions also frequently occur in cancer[27]; our data suggest they may compromise function by virtue of their low abundance. Similarly, loss of abundance from abrogation of intra-domain polar contacts may account for the high frequency of mutations at p.Lys66, p.Tyr68, or p.Asp107 (group 2) in cancer (Fig. 3e; Supplementary Fig. 4d). TPMT lacked clusters of intolerant, polar-contact positions, possibly because it is a smaller, single domain protein with a higher melting temperature.

Cell membrane interactions modulate PTEN variant abundance

Though VAMP-seq does not explicitly query post-translational modification, trafficking or partner binding, each of these can impact abundance. Therefore, we searched for signatures of these properties in our abundance data. PTEN mediates the removal of the 3′ phosphate from phosphatidylinositol 3,4,5-trisphosphate (PIP3) to produce phosphatidylinositol 4,5-bisphosphate (PIP2) at the membrane[28]. Membrane interaction is aided by phospholipid-binding positions present in both PTEN domains (Fig. 3f)[29,30]. Furthermore, PTEN membrane binding and activity are negatively regulated by phosphorylation of its unstructured C-terminal tail[28,31]. Active site or C-terminal regulatory phosphosite variants have been found to decrease activity, reduce membrane binding and increase abundance, hinting at the existence of a negative feedback mechanism that degrades membrane-bound, active PTEN[31,32]. We therefore asked whether any PTEN variants increased abundance, perhaps by altering membrane interaction. We identified 41 positions in PTEN that had mean abundance scores higher than WT. Nineteen of these enhanced-abundance positions were in structurally resolved regions, and 58% of them were within 7 Å of known phospholipid-binding positions. In comparison, only 13% of all structurally resolved PTEN positions were within 7 Å of phospholipid-binding positions (Supplementary Fig. 4e). Thus, positions with abundance-enhancing variants tended to be near the membrane-proximal face of PTEN, and included those important for binding PIP3, PIP2 or PI(3)P[30,33,34] (Fig. 3f). Furthermore, phosphomimetic substitutions at the p.Ser385 PTEN C-terminal regulatory phosphosite exhibited the highest abundance scores, whereas positively charged substitutions had low scores, supporting the impact of phosphorylation at this site on abundance (Supplementary Fig. 4f). Thus, many of the enhanced-abundance variants we identified likely disrupt PTEN membrane localization or PIP3 phosphatase function.
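
The proximity comparison above (58% versus 13% of positions within 7 Å of phospholipid-binding positions) reduces to a distance cutoff on structure coordinates. The sketch below shows the calculation under the assumption that each position is represented by a single 3D coordinate, for example a Cα atom; the text does not say which atoms were used, and the coordinates in the usage example are made up.

```python
from math import dist


def fraction_near(query_coords, reference_coords, cutoff=7.0):
    """Fraction of query positions within `cutoff` angstroms of any reference.

    Each coordinate is an (x, y, z) tuple. Representing a residue by one
    point is an assumption of this sketch.
    """
    if not query_coords:
        raise ValueError("no query positions supplied")
    near = sum(
        1
        for q in query_coords
        if any(dist(q, r) <= cutoff for r in reference_coords)
    )
    return near / len(query_coords)


# Usage with toy coordinates: two of four query points fall within 7 A.
reference = [(0.0, 0.0, 0.0)]
query = [(3.0, 4.0, 0.0), (7.0, 0.0, 0.0), (0.0, 0.0, 7.5), (10.0, 10.0, 10.0)]
near_fraction = fraction_near(query, reference)
```

Running the same function once on the enhanced-abundance positions and once on all structurally resolved positions would give the two percentages being compared.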

New, potentially pathogenic low-abundance PTEN variants

VAMP-seq scores can also be used to identify potentially pathogenic variants. To simplify comparisons to clinical variant effects, we classified PTEN missense single nucleotide variants (SNVs) as either low abundance, possibly low abundance, possibly WT-like abundance, or WT-like abundance based on how each variant’s abundance score and confidence interval compared to the distribution of WT synonym scores (Fig. 4a, Supplementary Fig. 5a). Then, we analyzed variants present in public databases of either germline or somatic variation in the light of these abundance classifications.
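
The four-way classification compares each variant's score and confidence interval with the WT synonym distribution. One plausible rule is sketched below. The cutoff (a lower-tail quantile of synonym scores, here the 5th percentile by nearest rank) and the exact role of the interval bounds are assumptions, since this section does not spell them out.

```python
from math import ceil


def synonym_threshold(synonym_scores, quantile=0.05):
    """Hypothetical cutoff: lower-tail quantile of WT synonym scores.

    Uses the nearest-rank method; the 5% default is an assumption.
    """
    ordered = sorted(synonym_scores)
    index = max(0, ceil(quantile * len(ordered)) - 1)
    return ordered[index]


def classify(score, ci_low, ci_high, threshold):
    """Assign one of four abundance classes from a score and its interval."""
    if score < threshold:
        # Below the cutoff: confident only if the whole interval is below it.
        return "low" if ci_high < threshold else "possibly low"
    # At or above the cutoff: confident only if the whole interval is above it.
    return "WT-like" if ci_low >= threshold else "possibly WT-like"


# Usage with made-up synonym scores.
synonyms = [0.9, 1.0, 1.1, 0.8, 1.05, 0.95, 1.2, 0.85, 1.0, 0.7]
cutoff = synonym_threshold(synonyms)
label = classify(0.6, 0.4, 0.8, cutoff)
```

Splitting each side of the cutoff by whether the confidence interval crosses it is what yields the two "possibly" classes for variants with noisier measurements.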

Figure 4

PTEN variant abundance classes across PHTS and cancer

a, A histogram of PTEN abundance scores for all missense variants observed in the experiment, with bars colored according to abundance classification. Abundance scores for three possibly benign variants present in the GnomAD database are shown as dots colored by classification. b, c, d, Abundance score histograms, colored by abundance classification, for PTEN germline variants listed in ClinVar as known pathogenic (b), likely pathogenic (c), or variants of uncertain significance (d). e, PTEN missense and nonsense variants in TCGA and the AACR GENIE project databases are arranged by cancer type. The top bar in each cancer type panel shows the observed frequency of variants in each abundance class as determined using VAMP-seq data. The bottom bar in each cancer type panel shows the expected abundance class frequencies based on cancer type-specific nucleotide substitution rates. Abundance classes are colored blue (low-abundance), light blue (possibly low-abundance), pink (possibly WT-like), or red (WT-like). The p.Pro38Ser variant is additionally colored with yellow stripes. The four known PTEN dominant negative variants are colored yellow. Variants not scored in the experiment are colored gray. n is the number of instances of PTEN variants observed in the indicated cancer type and also scored in our experiments. f, A western blot analysis of cells stably expressing WT or missense variants of N-terminally HA-tagged PTEN. This experiment was independently performed twice with similar results (see Supplementary Fig. 5e).

Heterozygous germline loss of PTEN activity can cause a spectrum of symptoms including multiple hamartomas, carcinoma, and macrocephaly, collectively known as PTEN Hamartoma Tumor Syndrome (PHTS)[35], which includes Cowden Syndrome. 216 PTEN germline missense SNVs are in ClinVar, a submission-driven database of variants identified primarily through clinical testing[3]. 41 of the 216 PTEN missense variants are annotated as pathogenic, 25 of which had abundance scores. Of these 25, 16 (64%) were classified as low abundance (Fig. 4b), a significantly higher proportion than the 24% of scored missense variants that are low abundance (Resampling test, n = 25, P < 0.0001; Fig. 4a; Supplementary Fig. 5b; Supplementary Table 3). Of the remaining nine variants, three were possibly low abundance. Four were active site variants (p.His93Arg, p.Gly129Glu, p.Arg130Leu, and p.Thr131Ile) known to be inactive without loss of abundance. The remaining two variants (p.Asp24Gly and p.Arg234Gln) were distal to the active site and likely alter PTEN function by an unknown mechanism[36,37]. Thus, VAMP-seq-derived abundance scores, where available and combined with structural knowledge of the PTEN active site, reveal >90% of known PTEN pathogenic variants. We could not formally assess the VAMP-seq false positive rate because no PTEN variants are currently classified as benign. However, as has been done before[8], we were able to identify likely non-damaging variants based on their population frequency. Germline PTEN variants cause Cowden Syndrome, a high-penetrance, dominantly inherited Mendelian disease, at a rate of at least ~1 per 200,000 individuals[35,38]. We identified PTEN variants occurring at frequencies higher than expected given the prevalence of Cowden Syndrome, strongly suggesting that they are non-damaging[8,39]. Seven variants passed this threshold, and six were in our dataset (Supplementary Fig. 5c). None were low abundance. 
One was possibly low abundance and two were possibly WT-like abundance. The remaining three, p.Ala79Thr, p.Pro354Gln, and p.Ser294Arg, were WT-like in abundance and had frequencies higher than 5 × 10⁻⁵, strongly suggesting that they are likely to be benign[2] (Fig. 4a). This analysis suggests that the PTEN abundance score data have a low false positive rate. An additional 41 PTEN variants are annotated as likely pathogenic in ClinVar. Of these, 23 had abundance scores, 10 (43%) of which were classified as low abundance (Fig. 4c; Supplementary Fig. 5b). Thus, the likely pathogenic category also had more low-abundance variants than expected (Resampling test, n = 23, P = 0.0188; Supplementary Table 3). The 134 remaining ClinVar variants are of uncertain significance. 83 of these variants had abundance scores, and 22 (27%) were low abundance (Fig. 4d). By providing additional evidence that supports pathogenicity, our abundance data could be used to alter variant clinical interpretations[40] (Supplementary Note, Supplementary Fig. 6). For example, 22 variants of uncertain significance along with 275 possible but not-yet-observed missense variants are low-abundance and could potentially be moved to the likely pathogenic category once observed in the appropriate clinical setting (Supplementary Table 4).
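
The enrichment claims above rest on resampling tests, for example 16 low-abundance variants among 25 scored pathogenic variants against a 24% background. The exact procedure is not described in this section, so the sketch below is one plausible version: draw the same number of variants at random, without replacement, from all scored variants and count how often the draw contains at least as many low-abundance variants as observed. The background list in the usage example is synthetic.

```python
import random


def resampling_p_value(background_is_low, n_draws, observed_low,
                       n_iter=10000, seed=0):
    """One-sided empirical p-value for enrichment of low-abundance variants.

    background_is_low: list of booleans, one per scored variant.
    Sampling without replacement and the +1 correction are assumptions.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        sample = rng.sample(background_is_low, n_draws)
        if sum(sample) >= observed_low:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return (hits + 1) / (n_iter + 1)


# Usage with a synthetic background in which 24% of variants are low abundance.
background = [True] * 24 + [False] * 76
p_enriched = resampling_p_value(background, n_draws=25, observed_low=16,
                                n_iter=2000)
```

With this kind of test the smallest reportable p-value is set by the number of iterations, which is consistent with results being quoted as bounds such as P < 0.0001.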

Abundance data reveals mechanisms of PTEN dysregulation

Somatic inactivation of PTEN by missense variation is an important contributor to multiple types of cancer[41]. We asked whether VAMP-seq-derived abundance data could reveal the contribution of previously reported somatic PTEN variants to tumorigenesis. We collected PTEN missense or nonsense variants found in The Cancer Genome Atlas[42] and the AACR Project GENIE[43], and compared the observed frequencies of PTEN variants of each abundance class to the expected frequencies based on cancer type-specific nucleotide mutation spectra[42]. We observed significantly more low-abundance PTEN variants than expected for every cancer type analyzed (Resampling test, all P values ≤ 0.0032; Fig. 4e; see Supplementary Table 5 for p-values). This pattern suggests that selection for low abundance PTEN variants is a common oncogenic mechanism. Some PTEN variants (e.g. p.Cys124Ser, p.Gly129Glu, p.Arg130Gly, p.Arg130Gln) are inactive but have WT-like abundance. These inactive variants exert a dominant negative effect on PTEN activity, leading to enhanced Akt phosphorylation and enhanced tumorigenesis in mouse models[44-46]. As expected, known dominant negative variants had WT-like or higher abundance scores (p.Cys124Ser = 1.14, p.Arg130Gly = 1.09, p.Gly129Glu = 0.76). Known dominant negative variants were also significantly enriched in cancer, largely driven by the high frequencies of p.Arg130Gly and p.Arg130Gln[44,47] (Fig. 4e; Supplementary Fig. 5c; see Supplementary Table 5 for p-values). Unlike for every other cancer type we examined, melanoma lacked an enrichment of known dominant negative variants. However, p.Pro38Ser was significantly enriched, accounting for 10.4% of PTEN missense variants (Resampling test, n = 77, P < 0.0001; Fig. 4e; Supplementary Fig. 5d; see Supplementary Table 5 for p-values). p.Pro38Ser had been previously observed in melanoma cancer cell lines, yet had never been functionally characterized[48]. 
p.Pro38Ser had a slightly higher abundance score than WT (1.14) in our assay. Based on its prevalence in melanoma and its WT-like abundance, we hypothesized that it might exert a dominant negative effect. Indeed, we found that p.Pro38Ser, like known dominant-negative variants, drove increased Akt phosphorylation in the presence of endogenous wild type PTEN (Fig. 4f; Supplementary Fig. 5e). In contrast, computational predictors suggested that p.Pro38Ser is thermodynamically unstable, highlighting the utility of VAMP-seq (Supplementary Fig. 5f). Overall, our results show that low abundance PTEN variants are important cancer drivers and that p.Pro38Ser, over-represented in melanoma, likely acts as a dominant negative.
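
The comparison of observed and expected class frequencies depends on weighting each possible variant by how often its nucleotide substitution arises in a given cancer type. A minimal sketch of that expectation is given below, assuming each candidate variant is annotated with a substitution type and an abundance class, and that relative substitution rates are supplied per cancer type. All names and numbers are illustrative and do not come from the study.

```python
from collections import defaultdict


def expected_class_frequencies(variants, substitution_rates):
    """Expected share of each abundance class under a mutation spectrum.

    variants: iterable of (substitution_type, abundance_class) pairs.
    substitution_rates: dict of relative rates, e.g. {"C>T": 0.5, ...}.
    """
    totals = defaultdict(float)
    for substitution, abundance_class in variants:
        totals[abundance_class] += substitution_rates[substitution]
    grand_total = sum(totals.values())
    return {cls: weight / grand_total for cls, weight in totals.items()}


# Usage with toy inputs: low-abundance variants arise via the likelier change.
toy_variants = [
    ("C>T", "low"),
    ("C>T", "low"),
    ("A>G", "WT-like"),
    ("C>A", "WT-like"),
]
toy_rates = {"C>T": 0.5, "A>G": 0.25, "C>A": 0.25}
expected = expected_class_frequencies(toy_variants, toy_rates)
```

An observed class frequency well above this expectation is what the text interprets as selection for that class in tumors.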

Implications of TPMT abundance for drug treatment

TPMT is one of 17 pharmacogenes whose genotype can be used to guide drug dosing[49]. Functional TPMT is required to metabolize thiopurine drugs such as 6-mercaptopurine (6-MP) and its prodrug, azathioprine. Thiopurine drugs are used to treat individuals with leukemia, rheumatic disease, inflammatory bowel disease, or rejection in solid organ transplant. Increased exposure to thiopurines causes treatment interruption or even life-threatening myelosuppression and hepatotoxicity. Three known nonfunctional variants of TPMT, NP_000358.1:p.Ala80Pro, p.Ala154Thr and p.Tyr240Cys, are found at high allele frequencies (combined MAF = 0.066) and are responsible for 95% of decreased-function alleles in the population[50]. The drug toxicity to carriers of these variants can be explained, at least in part, by the fact that they result in lower abundance of TPMT relative to wild type[13,21] (Fig. 5a). Accordingly, both abundance scores (Fig. 5a) and individually assessed EGFP:mCherry values (Fig. 2a; Supplementary Fig. 1c) were lower for these nonfunctional variants compared to the WT allele. Since our abundance scores identified known decreased-function alleles, we analyzed the abundance of rare TPMT variants of unknown function.

Figure 5

TPMT variant abundance classes across pharmacogenomics phenotypes

a, A histogram of TPMT abundance scores for all missense variants observed in the experiment, with bars colored according to abundance classification (top; n = 1,529 data points). Abundance scores for variants previously identified and characterized in patients are shown as dots colored by classification. Variants found in gnomAD at frequencies higher than 4 × 10⁻⁶ are also shown (bottom; n = 118 data points). b, A scatterplot of abundance score and mean 6-MP dose tolerated by individuals heterozygous for each variant. Dose intensity is the dose at which 6-MP becomes toxic to the patient before the 100% protocol dose of 75 mg/m². r and ρ denote Pearson’s and Spearman’s correlation coefficients, respectively.

In a clinical study of patients with acute lymphoblastic leukemia (ALL), 884 patients were analyzed by exome array. 278 of these patients also had exome sequencing data available. Red blood cell (RBC) TPMT activity and 6-MP dose intensity, the dose at which each individual became sensitive to 6-MP, were also measured[51]. The three known, high-frequency drug sensitivity variants were identified, along with four rare variants: p.Ser125Leu, p.Gln179His, p.Arg215His and p.Arg226Gln (combined MAF < 0.0053). The mean RBC activity of individuals heterozygous for p.Gln179His, p.Arg215His, and p.Arg226Gln was lower than the mean activity of individuals without TPMT variants, but higher than the activity of individuals heterozygous for the high-frequency drug sensitivity variants (Supplementary Fig. 7a, b). In contrast, RBC activity for p.Ser125Leu was higher than WT. Thiopurine dose intensity, which is affected by TPMT activity, is highly correlated with variant abundance (r = 0.99, ρ = 1, n = 6; Fig. 5b; Supplementary Fig. 7c). Though their RBC activity varied over a wide range, the individuals heterozygous for these rare variants tolerated a higher mean dose of 6-MP than individuals heterozygous for the known sensitivity variants. Additionally, the four rare variants are classified as WT-like based on VAMP-seq abundance data. Individual assessment confirmed that these rare alleles do not affect abundance (Supplementary Fig. 7d). Thus, p.Ser125Leu, p.Gln179His, p.Arg215His and p.Arg226Gln may not be decreased-function variants. Sequencing of the human population[2] and individuals intolerant to thiopurine drugs[52] has revealed an additional 118 rare TPMT variants. These variants (MAF range = 0.000004 – 0.00066) are carried, in aggregate, by 0.2% of the population[2], but the impact of most of these variants on TPMT activity and abundance is unknown[53]. 
We measured abundance scores for 96 of these variants, classifying fourteen (15%) as low abundance and seventeen (18%) as possibly low abundance. When these or any of the other 389 missense variants we classified as low or possibly low abundance are identified in the clinic, the risk for thiopurine toxicity may be elevated. Dose reduction or closer monitoring could minimize toxicity and improve outcomes[50].

General utility of VAMP-seq for assessing variant abundance

To demonstrate that VAMP-seq is applicable to diverse proteins, we evaluated wild type and known or predicted low-abundance variants for seven additional pharmacogenes or “clinically actionable” genes[54,55] (Supplementary Table 6). For CYP2C9, CYP2C19, and VKORC1, we found large differences in the EGFP:mCherry ratios of the wild type and known or predicted low-abundance missense variants (Fig. 6), whereas MLH1 and PMS2 yielded smaller differences. Thus, VAMP-seq could be applied to these five proteins. Furthermore, ~52% of human proteins yielded at least as much fluorescence as MLH1 when expressed as EGFP fusions[16], suggesting that many human proteins are compatible with VAMP-seq (Supplementary Fig. 8). However, BRCA1 and LMNA resulted in low EGFP signal or no difference in the EGFP:mCherry ratio between wild type and known low-abundance variants (Fig. 6 and data not shown). Thus, VAMP-seq will not be applicable in all cases. In particular, proteins that are marginally stable like BRCA1, make large complexes like LMNA, or are secreted and therefore break the link between variant genotype and phenotype, are not amenable to VAMP-seq.

Figure 6

Additional drug- and disease-related genes are compatible with VAMP-seq

Representative flow cytometry EGFP:mCherry smoothed histogram plots for WT (red) and known or predicted destabilized variants (blue) for VKOR, CYP2C9, CYP2C19, MLH1, PMS2, and LMNA. Each smoothed histogram was generated from at least 1,000 recombined cells. This experiment was independently performed three times with similar results.

DISCUSSION

VAMP-seq is a generalizable method for multiplex measurement of steady-state protein variant abundance. Since alterations in abundance may be a general mechanism of pathogenic variation[9,10], an important application of VAMP-seq may be to aid clinical geneticists in understanding the effects of newly discovered missense variants. Indeed, the American College of Medical Genetics suggests that well-established functional assays can provide strong evidence of pathogenicity[40]. Thus, in the context of monogenic diseases where protein inactivation is pathogenic, VAMP-seq-derived abundance data can help to identify pathogenic variants. The utility of VAMP-seq for this purpose is highlighted by the fact that 64% of known PTEN pathogenic missense variants were of low abundance. Furthermore, VAMP-seq revealed 1,138 low-abundance PTEN variants that would likely confer an increased risk of PTEN Hamartoma Tumor Syndrome and 777 low-abundance TPMT variants that would likely require altered drug dosing. If other proteins yielded similar results, VAMP-seq could provide evidence of pathogenicity for greater than half of the pathogenic missense variants we will eventually find as more human genomes are sequenced. Interpretation of somatic variation is more difficult, but functional data can reveal driver variants and, therefore, potential treatments. For example, variation in PTEN, presumably resulting in PTEN loss-of-function, is associated with increased sensitivity to PI3K, AKT, and mTOR inhibitors, and decreased sensitivity to receptor tyrosine kinase inhibitors[56]. Our PTEN abundance data reveal many loss-of-function variants, which could help to clarify the link between PTEN inactivation and altered drug sensitivity, and thus might inform cancer treatment. Furthermore, aided by our abundance data, we identified p.Pro38Ser as a candidate PTEN dominant negative variant in melanoma. 
Since the known dominant negative variants p.Gly129Glu and p.Cys124Ser result in exacerbated oncogenic phenotypes in mice[44,46], p.Pro38Ser status might help to predict tumor aggressiveness. Despite its utility, VAMP-seq has limitations. Bottlenecks in our library generation method were largely responsible for the ~50% of possible PTEN variants missing from the final data set. In the future, early library validation using deep sequencing along with other well-validated library generation methods[8] could improve coverage. Additionally, like any assay, VAMP-seq abundance data is subject to uncertainty. To address this concern, we quantified the uncertainty associated with each abundance score. We suggest that abundance score uncertainty should be taken into consideration, as we did when classifying variant abundance. VAMP-seq relies on fusion of the protein of interest to EGFP. We showed a high concordance between VAMP-seq abundance data and abundance as measured by other methods, but this might not always be the case. Furthermore, VAMP-seq cannot yield insight into variants that are pathogenic because of reduced enzymatic activity, altered localization, or effects on splicing. Thus, while VAMP-seq abundance data is useful for identifying pathogenic variants, it should not be used to conclude that a variant is benign. Generalizable assays like VAMP-seq are a promising way to understand the functional effects of missense variation at scale. In addition to demonstrating its effectiveness for PTEN and TPMT, we provide preliminary evidence that VAMP-seq could be applied to other clinically relevant proteins. Furthermore, repeating VAMP-seq assays in different cell lines could reveal cell-type specific regulation of variant abundance. Comparing variant abundance data in wild type and chaperone knockout cells could reveal what makes a protein a chaperone client. 
Combining VAMP-seq with small molecule modulators of chaperone or protein degradation machinery may even reveal variant-specific treatments that could rescue low-abundance variants. Thus, VAMP-seq greatly expands our ability to measure the impact of missense variants on abundance, a fundamental property that underlies protein function.

ONLINE METHODS

General reagents, DNA oligonucleotides and plasmids

Unless otherwise noted, all chemicals were obtained from Sigma and all enzymes were obtained from New England Biolabs. E. coli were cultured at 37°C in Luria Broth. All cell culture reagents were purchased from ThermoFisher Scientific unless otherwise noted. HEK 293T cells and derivatives thereof were cultured in Dulbecco’s modified Eagle’s medium (DMEM) supplemented with 10% fetal bovine serum (FBS), 100 U/mL penicillin, and 0.1 mg/mL streptomycin. Induction medium was furthermore supplemented with 2 μg/mL doxycycline (Sigma-Aldrich). Cells were passaged by detachment with trypsin–EDTA 0.25%. All synthetic oligonucleotides were obtained from IDT and can be found in Supplementary Table 7. All non-library related plasmid modifications were performed with Gibson assembly[57]. See Supplementary Note for construction of the VAMP-seq expression vectors.

Construction of barcoded, site-saturation mutagenesis libraries for TPMT and PTEN

Site-saturation mutagenesis libraries of TPMT and PTEN were constructed using inverse PCR[18]. See Supplementary Note for a detailed description of construction of the barcoded, site-saturation mutagenesis libraries.

Single Molecule Real Time (SMRT) sequencing to link each TPMT and PTEN variant to its barcode

For both PTEN and TPMT, the relationship between variants and barcodes was established using SMRT sequencing (Pacific Biosciences). See Supplementary Note for a detailed description of variant linking steps using SMRT sequencing.

Integration of single variant clones or barcoded libraries into the HEK293-landing pad cell line

Barcoded variant libraries or single variant clones were recombined into the Tet-on landing pad in engineered HEK 293T TetBxb1BFP Clone4 cells that we generated previously[17]. See Supplementary Note for a detailed description of how variant libraries were integrated into cells.

FACS to bin cells by EGFP:mCherry ratio

Cells harboring variant libraries, prepared as described above, were sorted using a FACSAria III (BD Biosciences) into bins according to the abundance of their expressed, EGFP tagged variant. First, live, single, recombinant cells were selected using forward and side scatter, mCherry and mTagBFP2 signals. Then, a FITC:PE-Texas Red ratiometric parameter in the BD FACSDIVA software was created. A histogram of the FITC:PE-Texas Red ratio was created and gates dividing the library into four equally populated bins based on the ratio were established. The details of replicate sorts can be found in Supplementary Table 1.
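The four equally populated bins described above can be illustrated numerically. The following is only a sketch: in the experiment the gates were drawn in the BD FACSDIVA software, not computed in code, and the function names `quartile_gates` and `assign_bin` are hypothetical. It assumes a list of per-cell EGFP:mCherry (FITC:PE-Texas Red) ratios is available.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_gates(ratios):
    """Return the three cut points that split the ratios into four
    equally populated bins (quartiles of the ratio distribution)."""
    return quantiles(ratios, n=4)


def assign_bin(ratio, gates):
    """Return the bin number (1 = lowest ratio, 4 = highest ratio)
    for a single cell, given the three gate boundaries."""
    return bisect_right(gates, ratio) + 1
```

Bin 1 then holds the cells with the lowest EGFP:mCherry ratio (lowest apparent abundance) and bin 4 the highest.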

Sorted library genomic DNA preparation, barcode amplification and sequencing

For the TPMT experiments, sorted cells were collected by centrifugation and the FACS sheath buffer was aspirated. Cells were transferred into a microfuge tube, pelleted and stored at −20°C. Genomic DNA was prepared using the GentraPrep kit (Qiagen). For each bin, all the purified DNA was spread over eight 25 μL PCR reactions containing Kapa Robust, primers GPS-landing-f (in the genome) and BC-GPS-P7-i#-UMI (3′ of the barcode) to tag the barcodes with a unique molecular index (UMI) and add a sample index. UMI-tagging PCRs were performed using the following conditions: initial denaturation at 95 °C for 2 minutes, followed by three cycles of (95 °C for 15 seconds, 60 °C for 20 seconds, 72 °C for 3 minutes). The eight PCR reactions were pooled and the PCR amplicon was purified using 1× Ampure XP (Beckman Coulter). To shorten the amplicon and add the p5 and p7 Illumina cluster-generating sequences, the UMI-tagged barcodes were then amplified with primers BC-TPMT-P5-v2 and Illumina p7. This PCR was performed with Kapa Robust and SYBR green II on a Bio-Rad mini-opticon qPCR machine; reactions were monitored and removed before saturation of the SYBR green II signal, at around 25 cycles. The amplicons were pooled and gel purified. Barcodes were read twice by paired-end sequencing primers TPMT_Read1 and TPMT_Read2. The UMI and index were sequenced by the index read and primer TPMT_Index using a NextSeq 500 (Illumina). After converting from the BCL to FASTQ format using Illumina’s bcl2fastq version 2.18, the forward, reverse and index reads were concatenated and demultiplexed into a BAM file. Consensus barcodes were called from the forward and reverse reads. To collapse the barcode copies associated with unique UMIs, the UMI (bases 1-10 of the index read) was pasted onto the consensus barcode and unique combinations were identified (sort | uniq -c). 
The barcode from each unique barcode-UMI pair was used to populate a FASTQ file that could be used by the Enrich2 software package to count variants. For the PTEN experiments, sorted cells were replated onto 10 cm plates and allowed to grow for approximately five days. Cells were then collected, pelleted by centrifugation, and stored at −20°C. Genomic DNA was prepared using a DNeasy kit, according to the manufacturer’s instructions (Qiagen) with the addition of a 30 minute incubation at 37°C with RNase in the re-suspension step. Eight 50 μL first-round PCR reactions were each prepared with a final concentration of ~50 ng/μL input genomic DNA, 1× Kapa HiFi ReadyMix, and 0.25 μM of the KAM499/JJS_501a primers. The reaction conditions were 95 °C for 5 minutes, 98 °C for 20 seconds, 60 °C for 15 seconds, 72 °C for 90 seconds, repeat 7 times, 72 °C for 2 minutes, 4 °C hold. Eight 50 μL reactions were combined, bound to AMPure XP (Beckman Coulter), cleaned, and eluted with 40 μL water. 40% of the eluted volume was mixed with 2× Kapa Robust ReadyMix; JJS_seq_F and one of the indexed reverse primers, JJS_seq_R1a through JJS_seq_R12a, were added at 0.25 μM each. Reaction conditions for the second round PCR were 95 °C for 3 minutes, 95 °C for 15 seconds, 60 °C for 15 seconds, 72 °C for 30 seconds, repeat 14 times, 72 °C for 1 minute, 4 °C hold. Amplicons were extracted after separation on a 1.5% TBE/agarose gel using a Quantum Prep Freeze ‘N Squeeze DNA Gel Extraction Kit (Bio-Rad). Extracted amplicons were quantified using a KAPA Library Quantification Kit (Kapa Biosystems) and sequenced on a NextSeq 500 using a NextSeq 500/550 High Output v2 75 cycle kit (Illumina), using primers JJS_read_1, JJS_index_1, and JJS_read_2. Sequencing reads were converted to FASTQ format and de-multiplexed with bcl2fastq. 
Barcode paired sequencing reads for PTEN experiments 1 through 4 were joined using the fastq-join tool within the ea-utils package using the default parameters, whereas only one barcode read was collected for PTEN experiments 5 through 8. Technical amplification and sequencing replicates were conducted for every sample, and compared to assess variability in quantitation stemming from amplification and sequencing. Experiments with poor technical replication across multiple bins were reamplified and resequenced in their entirety, leaving eight replicate experiments with technical replicates shown here (Supplementary Fig. 9). FASTQ files from these technical replicate amplification and sequencing runs were concatenated for analysis with Enrich2[58].
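The UMI collapsing step described above for the TPMT reads (pasting bases 1-10 of the index read onto the consensus barcode and keeping unique combinations, done in the original with `sort | uniq -c`) can be sketched in a few lines. This is an illustrative equivalent rather than the authors' script; the function name `collapse_umis` and the `(barcode, index_read)` input format are assumptions.

```python
from collections import Counter

UMI_LENGTH = 10  # the text states the UMI is bases 1-10 of the index read


def collapse_umis(records):
    """Count each barcode once per distinct UMI.

    records: iterable of (consensus_barcode, index_read) string pairs
             (a hypothetical input format chosen for this sketch).
    Returns a Counter mapping barcode -> number of unique UMIs seen.
    """
    # A set keeps each barcode-UMI combination once, so PCR duplicates
    # that share a UMI collapse to a single observation.
    unique_pairs = {(barcode, index_read[:UMI_LENGTH])
                    for barcode, index_read in records}
    return Counter(barcode for barcode, _umi in unique_pairs)
```

Each barcode's collapsed count would then be written out once per unique barcode-UMI pair before variant counting.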

Barcode counting and variant calling

Enrich2 was used to count the barcodes, associate each barcode with a nucleotide variant, and then translate and count both the unique-nucleotide and unique-amino acid variants[58]. FASTQ files containing either UMI-collapsed barcodes (TPMT) or total barcodes (PTEN) and the barcode-map for each protein were used as input for Enrich2. Enrich2 configuration files for each experiment are available on the GitHub repository (see URLs). Barcodes assigned to variants containing insertions, deletions or multiple amino acid mutations were removed from the analysis.
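The counting and filtering logic described above can be sketched as follows. This is not Enrich2's implementation or API; it is a minimal stand-in, and the record format (`has_indel`, tuple of amino acid changes), the `"WT"` label and the function name `count_single_variants` are all hypothetical choices made for illustration.

```python
from collections import Counter


def count_single_variants(barcodes, barcode_map):
    """Count amino acid variants from a stream of sequenced barcodes.

    barcodes:    iterable of barcode strings read from a FASTQ file.
    barcode_map: dict mapping barcode -> (has_indel, aa_changes), where
                 aa_changes is a tuple of amino acid changes
                 (hypothetical format; empty tuple means no change).
    """
    counts = Counter()
    for barcode in barcodes:
        record = barcode_map.get(barcode)
        if record is None:
            continue  # barcode not present in the barcode-variant map
        has_indel, aa_changes = record
        # Drop insertions, deletions and multiple amino acid mutations.
        if has_indel or len(aa_changes) > 1:
            continue
        # With no amino acid change, the protein is WT at this level
        # (this sketch does not track synonymous variants separately).
        variant = aa_changes[0] if aa_changes else "WT"
        counts[variant] += 1
    return counts
```

Running this once per sorted bin would give the per-bin variant counts used in the scoring step.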

Calculating VAMP-seq scores and classifications

RStudio v1.0.136 was used for all subsequent analysis of the Enrich2 output. The count for each variant in a bin was divided by the sum of counts recorded in that bin to obtain the frequency of each variant (Fv) within that bin. This calculation was repeated for every bin in each replicate experiment. For each experiment, the total count of each variant across the bins was divided by the total count of all variants across the bins to obtain a total frequency value (Fv,total) for each variant for each experiment. This total frequency value was used for filtering low-frequency variants, which we reasoned would be subject to high levels of counting noise, out of the subsequent calculations. We set the Fv,total filtering threshold based on the assumption that accurately scored synonymous variants should create a clear, unimodal distribution around WT. We examined how different minimum Fv,total filtering threshold values affected the spread and central tendency of the synonymous distribution (Supplementary Fig. 10). We empirically selected 1 × 10^−4.75 as the Fv,total filtering threshold value as it minimized the skew and coefficient of variation of the synonymous variant abundance score distribution while retaining the majority of missense variants. Next, for each experiment, a weighted average was calculated for each variant (Wv) passing the Fv,total filtering threshold value using the following equation:

$$W_v = \frac{0.25\,F_{v,\mathrm{bin1}} + 0.5\,F_{v,\mathrm{bin2}} + 0.75\,F_{v,\mathrm{bin3}} + 1\,F_{v,\mathrm{bin4}}}{F_{v,\mathrm{bin1}} + F_{v,\mathrm{bin2}} + F_{v,\mathrm{bin3}} + F_{v,\mathrm{bin4}}}$$

Thus, all weighted average values ranged from a value of 0.25 to 1. 
Finally, for each experiment, an abundance score for each variant (Sv) was obtained by subjecting the weighted average of each variant to min-max normalization, using the weighted average value of WT (Wwt), which was given a score of 1, and the median weighted average value for non-terminal nonsense variants (Wnonsense) at positions 51 through 349 for PTEN, or positions 51 through 219 for TPMT, which was given an abundance score of 0, using the following equation:

$$S_v = \frac{W_v - W_{\mathrm{nonsense}}}{W_{\mathrm{wt}} - W_{\mathrm{nonsense}}}$$

The final abundance score for each variant was calculated by taking the mean of the min-max normalized abundance scores across the eight replicate experiments in which it could have been observed. Only variants which were scored in two or more replicate experiments were retained in the analysis. We implemented this filter because many sources of noise are not captured in count-based estimates of variance and because having replicate-level variance estimates was critical to our abundance classification scheme. A standard error for each abundance score was calculated by dividing the standard deviation of the min-max normalized values for each variant by the square root of the number of replicate experiments in which it was observed. Lastly, the lower bound of the 95% confidence interval was calculated by multiplying the standard error by the 97.5 percentile value of a normal distribution and subtracting this product from the abundance score. The upper bound of the 95% confidence interval was calculated by instead adding the product to the abundance score. Positional VAMP-seq scores were calculated by taking the median of all single amino acid VAMP-seq scores at each position. For both TPMT and PTEN, the distribution of wild type synonyms was used to create VAMP-seq classifications for every variant (see Supplementary Fig. 5a for scheme). 
First, we established a synonymous score threshold by determining the abundance score that separated the 95% most abundant synonymous variants from the 5% lowest abundance synonymous variants (0.71 for PTEN, and 0.72 for TPMT). Variants whose abundance score and upper confidence interval were both below this synonymous threshold value were classified as “low abundance” variants, whereas those with abundance scores below this threshold but upper confidence interval above it were classified as “possibly low abundance”. Variants with scores above this threshold but lower confidence intervals below the threshold were considered “possibly WT-like abundance”. Variants with scores and lower confidence interval above the threshold were classified as “WT-like abundance.” For both TPMT and PTEN, substitution-intolerant positions were determined based on the proportion of variants at the position with scores below the synonymous threshold, determined as described above. Positions where 5 or more variants were scored and greater than 90% of the scores were below the synonymous variant threshold value were considered substitution intolerant. Enhanced abundance positions were determined based on the proportion of variants at the position with scores above the median of the synonymous distribution. Positions where 5 or more variants were scored and more than 5 variants had scores above the median of the synonymous distribution were considered enhanced-abundance positions.
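The scoring and classification steps above can be sketched as a short calculation. This is an illustration under stated assumptions, not the authors' R code: it assumes bin weights of 0.25, 0.5, 0.75 and 1 for bins 1 through 4 (consistent with the stated 0.25 to 1 range of the weighted average), uses the sample standard deviation, treats a score exactly at the threshold as "above", and all function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

BIN_WEIGHTS = (0.25, 0.5, 0.75, 1.0)     # assumed weights for bins 1-4
Z_975 = NormalDist().inv_cdf(0.975)      # 97.5th percentile, about 1.96


def weighted_average(bin_freqs):
    """W_v from a variant's four per-bin frequencies (F_v in each bin)."""
    weighted = sum(w * f for w, f in zip(BIN_WEIGHTS, bin_freqs))
    return weighted / sum(bin_freqs)


def abundance_score(w_v, w_wt, w_nonsense):
    """Min-max normalize so that WT scores 1 and median nonsense scores 0."""
    return (w_v - w_nonsense) / (w_wt - w_nonsense)


def summarize_replicates(scores):
    """Return (mean score, lower 95% CI, upper 95% CI), or None when the
    variant was scored in fewer than two replicate experiments."""
    if len(scores) < 2:
        return None
    score = mean(scores)
    std_err = stdev(scores) / math.sqrt(len(scores))  # sample SD assumed
    return score, score - Z_975 * std_err, score + Z_975 * std_err


def classify(score, lower, upper, threshold):
    """Apply the four-way scheme relative to the synonymous threshold
    (0.71 for PTEN and 0.72 for TPMT in the text)."""
    if score < threshold:
        return "low abundance" if upper < threshold else "possibly low abundance"
    return "WT-like abundance" if lower > threshold else "possibly WT-like abundance"
```

For example, a variant found only in bin 1 has a weighted average of 0.25 and one found only in bin 4 has a weighted average of 1, which matches the range given in the text.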

Assessment of the PTEN library composition

To better understand the sources of bottlenecking in the PTEN experiments, the composition of the PTEN plasmid library preparation used to generate recombinant cells was assessed by determining barcode frequencies using high throughput Illumina sequencing. See Supplementary Note for a description of the steps taken to characterize the PTEN variant library. Metrics regarding the processing of sequencing data for the barcode-variant assignments can be found in Supplementary Table 8.

Variant annotation from online databases

Published western blotting results for PTEN and TPMT variants are listed, along with references, in Supplementary Table 9 and Supplementary Table 10. See Supplementary Note for a description of the online databases that were accessed to obtain PTEN and TPMT variant annotations.

PTEN ClinVar and cancer genomics analyses

Nine PTEN variants were listed in ClinVar as both likely pathogenic and pathogenic. We examined the evidence for these variants – p.His61Arg, p.Tyr68His, p.Leu108Pro, p.Gly127Arg, p.Arg130Leu, p.Arg130Gln, p.Gly132Val, p.Arg173Cys, and p.Arg173His – and following the ACMG/AMP guidelines[40], all nine were deemed to belong in the likely pathogenic category. An additional two variants – p.Arg15Lys and p.Pro96Ser – had an interpretation of uncertain significance along with another interpretation of likely pathogenic or pathogenic, and thus the clinical significance of the variant was listed as “Conflicting interpretations of pathogenicity”. As recommended by the ACMG/AMP guidelines[40], variants with conflicting interpretations were considered variants of unknown significance. Likely non-damaging PTEN variants were identified from the variants observed in gnomAD at allele frequencies rendering them highly unlikely to be causal for Cowden’s Syndrome, under an autosomal dominant model of inheritance with an estimated prevalence in the population of 1:200,000[35,38]. For each PTEN variant observed in gnomAD, a binomial distribution of the total number of alleles successfully sequenced at the site was calculated, using a collective pathogenic allele estimate of 1:400,000, genetic and allelic heterogeneity of 1, and a penetrance of 95%, which are all conservative assumptions[8,39]. Each observed PTEN variant was assessed using the following line of code in RStudio: qbinom(0.99, size = [total alleles genotyped at the site], prob = (1/400000)/0.95). PTEN variants in gnomAD with an observed allele count a full integer above this 99% confidence level of the calculated binomial distribution were considered variants highly unlikely to be causal for Cowden’s Syndrome.
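For readers without R, the `qbinom` filter above can be reproduced with the standard library alone. The sketch below computes the binomial quantile directly (the smallest count whose cumulative probability reaches 0.99) and then applies the "a full integer above" rule, read here as an observed allele count strictly greater than that quantile. The function names are hypothetical, and the 250,000-allele figure mentioned afterwards is an illustrative input, not a value from the study.

```python
import math

PATHOGENIC_ALLELE_FREQ = 1 / 400000   # collective pathogenic allele estimate
PENETRANCE = 0.95


def binom_quantile(q, size, prob):
    """Smallest k with P(X <= k) >= q for X ~ Binomial(size, prob).

    Mirrors what R's qbinom(q, size, prob) returns for 0 < prob < 1.
    Suitable for the small size * prob values used here; a very large
    expected count would underflow the starting term.
    """
    pmf = math.exp(size * math.log1p(-prob))   # P(X = 0)
    cdf = pmf
    k = 0
    while cdf < q and k < size:
        # Recurrence: P(X = k + 1) = P(X = k) * (size - k)/(k + 1) * p/(1 - p)
        pmf *= (size - k) / (k + 1) * prob / (1 - prob)
        k += 1
        cdf += pmf
    return k


def unlikely_causal(allele_count, alleles_genotyped):
    """True when the observed allele count exceeds the 99% quantile."""
    max_credible = binom_quantile(
        0.99, alleles_genotyped, PATHOGENIC_ALLELE_FREQ / PENETRANCE)
    return allele_count > max_credible
```

With 250,000 alleles genotyped at a site, the quantile is 3, so a variant seen four or more times would be flagged as highly unlikely to be causal under these assumptions.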

Statistics and Reproducibility

For all figures, r denotes Pearson’s correlation coefficient, whereas ρ denotes Spearman’s rank correlation coefficient. For our statistical analysis of the enrichments of low-abundance variants in the pathogenic, likely pathogenic, and uncertain significance ClinVar categories we used a resampling approach. We drew 10,000 random samples with replacement, each of a size corresponding to the number of variants scored from each category in ClinVar (pathogenic = 25; likely pathogenic = 23; uncertain significance = 83), from the 1,366 PTEN missense variants (i.e. single nucleotide variants that change an amino acid) with abundance scores. We recorded the frequency of low abundance variants in each round of resampling. Then, we computed the P-value for each category by dividing the number of times the observed frequency of PTEN low-abundance variants fell below the frequencies of low-abundance variants in the resampled sets by 10,000. For our statistical analysis of enrichments of low-abundance, dominant negative, or p.Pro38Ser variants in different cancer types, we first used the rates of single nucleotide transitions and transversions observed in TCGA[42,59] to create mutational probabilities for every possible PTEN missense or nonsense variant. Based on these probabilities we drew 10,000 random samples of PTEN variants, each of a size equal to the number of PTEN variants found in each cancer type (n = 337, 192, 153, 186, 77, 113, and 327 for brain, breast, colorectal, endometrial, melanoma, NSCLC, and uterine cancers, respectively). For each cancer type, this created the null distribution of PTEN variant frequencies based on the mutation spectrum alone. Then, for each cancer type, we computed the P-value by dividing the number of times the observed frequency of low-abundance, dominant negative or p.Pro38Ser variants fell below the frequency of the appropriate type of variants in the resampled sets by 10,000.
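The ClinVar resampling test described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the boolean-flag input and the fixed seed are assumptions. Each round draws a sample with replacement of the category's size from all scored missense variants, and the P-value is the fraction of rounds in which the observed low-abundance frequency falls below the resampled frequency.

```python
import random


def resampling_p_value(low_abundance_flags, category_size,
                       observed_frequency, n_resamples=10_000, seed=0):
    """Empirical P-value for enrichment of low-abundance variants.

    low_abundance_flags: one boolean per scored missense variant
                         (True = classified as low abundance).
    category_size:       number of scored variants in the ClinVar category.
    observed_frequency:  fraction of low-abundance variants in the category.
    """
    rng = random.Random(seed)   # seeded only to make the sketch repeatable
    times_below = 0
    for _ in range(n_resamples):
        # Sample with replacement from all scored missense variants.
        sample = rng.choices(low_abundance_flags, k=category_size)
        resampled_frequency = sum(sample) / category_size
        if observed_frequency < resampled_frequency:
            times_below += 1
    return times_below / n_resamples
```

The cancer-type analysis follows the same pattern, except that variants are drawn according to mutational probabilities, which `random.Random.choices` supports through its `weights` argument.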

Rosetta ΔΔG predictions

Computational predictions of PTEN variant losses in folding energy (i.e. ΔΔGs) were performed using the 2017.08 release of Rosetta. The PTEN protein data bank (PDB) file 1d5r was renumbered to accommodate missing residues, and the TLA ligand was removed. Preminimization of the ensuing file was performed using Rosetta minimize_with_cst, followed by the convert_to_cst_file shell script. Fine grain estimations of folding energy changes upon PTEN mutation were created with Rosetta ddg_monomer[60] using the talaris2014 scoring function, and the following flags: -ddg::weight_file soft_rep_design, -fa_max_dis 9.0, -ddg::iterations 50, -ddg::dump_pdbs true, -ignore_unrecognized_res, -ddg::local_opt_only false, -ddg::min_cst true, -constraints::cst_file input.cst, -ddg::suppress_checkpointing true, -in::file::fullatom, -ddg::mean false, -ddg::min true, -ddg::sc_min_only false, -ddg::ramp_repulsive true, -ddg::output_silent true.

Comparison of TPMT red blood cell activity or dose intensity to abundance scores

Genotypes, TPMT red blood cell activity normalized by cohort, and dose intensity data for 884 ALL patients were provided from the study described in Liu et al.[51]. The mean TPMT red blood cell activity and dose intensity of individuals heterozygous for each unique TPMT variant were calculated. These values were directly compared to abundance scores for that variant from the VAMP-seq assay or the wild-type normalized GFP:mCherry ratio from individual flow cytometry experiments (Fig. 5; Supplementary Fig. 7).

Western blotting

See Supplementary Note for details of the western blotting procedures.

Life Sciences Reporting Summary

Further information on experimental design is available in the Life Sciences Reporting Summary.

Data and code availability

All raw sequence data and function scores are freely available for all academic users, by non-exclusive license under reasonable terms to commercial entities that have committed to open sharing of PTEN and TPMT sequence variants, and under a free non-exclusive license to non-profit entities. The Illumina and PacBio raw sequencing files and barcode-variant maps can be accessed at the NCBI Gene Expression Omnibus (GEO) repository under accession number GSE108727. The data presented in the manuscript are available as Supplementary Data files. Code used for the analyses performed in this work is included as Supplementary Data 5, and also available at http://github.com/FowlerLab/VAMPseq. VAMP-seq scores are available at http://abundance.gs.washington.edu. Code used for subassembly by PacBio is available at http://github.com/shendurelab/AssemblyByPacBio.
