Enrich: software for analysis of protein function by enrichment and depletion of variants.

Douglas M Fowler, Carlos L Araya, Wayne Gerard, Stanley Fields.

Abstract

SUMMARY: Measuring the consequences of mutation in proteins is critical to understanding their function. These measurements are essential in such applications as protein engineering, drug development, protein design and genome sequence analysis. Recently, high-throughput sequencing has been coupled to assays of protein activity, enabling the analysis of large numbers of mutations in parallel. We present Enrich, a tool for analyzing such deep mutational scanning data. Enrich identifies all unique variants (mutants) of a protein in high-throughput sequencing datasets and can correct for sequencing errors using overlapping paired-end reads. Enrich uses the frequency of each variant before and after selection to calculate an enrichment ratio, which is used to estimate fitness. Enrich provides an interactive interface to guide users. It generates user-accessible output for downstream analyses as well as several visualizations of the effects of mutation on function, thereby allowing the user to rapidly quantify and comprehend sequence-function relationships.
AVAILABILITY AND IMPLEMENTATION: Enrich is implemented in Python and is available under a FreeBSD license at http://depts.washington.edu/sfields/software/enrich/. Enrich includes detailed documentation as well as a small example dataset.
CONTACT: dfowler@uw.edu; fields@uw.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2011 PMID: 22006916 PMCID: PMC3232369 DOI: 10.1093/bioinformatics/btr577

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Understanding how variations in protein sequence relate to function is of fundamental importance. Measurement of protein activity is critical for engineering protein function, understanding how mutations relate to disease and gaining insight into catalytic mechanisms (Alper; Kato; Weiss). Efforts to parallelize the measurement of protein activity rely on selecting for a desired function within a library of variants of a protein of interest, using a display-based system that directly links each protein's activity to its encoding DNA sequence (Levin and Weiss, 2006; Pal; Sidhu and Koide, 2007). Selection for function (e.g. ligand binding, catalytic activity or stability) alters the population of displayed proteins, and thus of their associated DNA molecules: DNA sequences encoding highly functional variants are enriched, whereas those encoding poorly functional variants are depleted. Sanger sequencing of library members after selection can reveal a few hundred highly functional variants. Recently, high-throughput sequencing has been used to greatly increase the number of variants assessed (Di Niro; Dias-Neto; Ernst; Fowler; Hietpas; Hinkley; Ravn). Such ‘deep mutational scanning’ (Araya and Fowler, 2011) experiments pose significant analysis challenges. Here, we present Enrich, a tool to address these challenges. Enrich identifies and enumerates unique protein sequences within high-throughput sequencing data. It calculates an enrichment ratio between unselected and selected libraries for each unique variant, and it creates a number of visualizations. Enrich is open-source, freely available and modular, and it creates easy-to-manipulate output files. Thus, users can customize Enrich and perform project-specific analyses.

2 APPROACH

Enrich is implemented in Python and requires ∼2 h to run on a typical dataset on a desktop computer. To facilitate the analysis of multiple datasets in parallel, Enrich can run in a high-performance computing environment managed by the Oracle Grid Engine. Enrich uses the DRMAA distributed resource management API to facilitate extension to other environments (http://drmaa.org/). Enrich supports command line execution as well as an interactive mode that guides users through the configuration and execution of Enrich runs. As input, Enrich takes FASTQ-formatted high-throughput sequencing data files acquired from an unselected and a selected library (Cock). Reads from any sequencing platform can be used, provided they are FASTQ-formatted. If overlapping paired-end reads have been acquired, Enrich corrects each read pair for sequencing error by examining agreement between the two reads. At positions where the reads disagree, the nucleotide with the higher quality score is used. If the two reads have identical quality scores at a position where they disagree, the read pair is removed. More robust error models, such as ShoRAH (Zagordi), could improve error correction, particularly when overlapping paired-end reads are not available. Variant sequences are then identified and enumerated within the unselected and selected libraries, and variants containing insertions or deletions are removed. An enrichment ratio (selected/unselected) is calculated for each variant. Enrichment ratios are evaluated using a two-sided Poisson exact test, which yields a P-value for the significance of enrichment or depletion of each variant. Multiple testing correction is performed using false discovery rates (Storey and Tibshirani, 2003). The resulting q-values enable the user to identify subsets of variants whose frequency is significantly altered by selection.
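The read-pair correction rule described above can be illustrated with a minimal Python sketch. It is not Enrich's own code: the function name `merge_read_pair` is hypothetical, and the sketch assumes that the reverse read has already been reverse-complemented and aligned to the forward read, that both reads cover the same positions, and that quality scores have already been decoded from FASTQ into integers.

```python
from typing import Optional, Sequence


def merge_read_pair(fwd_seq: str, fwd_qual: Sequence[int],
                    rev_seq: str, rev_qual: Sequence[int]) -> Optional[str]:
    """Return a corrected sequence for one overlapping read pair, or None.

    Assumption (not from the paper): index i refers to the same base in
    both reads, i.e. the reverse read is already reverse-complemented.
    """
    if not (len(fwd_seq) == len(rev_seq) == len(fwd_qual) == len(rev_qual)):
        raise ValueError("reads and quality lists must have equal length")

    corrected = []
    for f_base, f_q, r_base, r_q in zip(fwd_seq, fwd_qual, rev_seq, rev_qual):
        if f_base == r_base:
            corrected.append(f_base)   # reads agree: keep the base
        elif f_q > r_q:
            corrected.append(f_base)   # disagreement: higher quality wins
        elif r_q > f_q:
            corrected.append(r_base)
        else:
            return None                # disagreement with tied quality: drop the pair
    return "".join(corrected)


if __name__ == "__main__":
    # Position 2 disagrees (G vs C); the reverse read has the higher quality.
    print(merge_read_pair("ACGT", [30, 30, 10, 30], "ACCT", [30, 30, 35, 30]))  # ACCT
    # Same disagreement, but tied quality scores: the pair is discarded.
    print(merge_read_pair("ACGT", [30, 30, 20, 30], "ACCT", [30, 30, 20, 30]))  # None
```

Returning `None` for a discarded pair keeps the filtering decision with the caller, which can simply skip such pairs when counting variants.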
To accomplish these tasks, the Enrich workflow is divided into seven modules that can run independently or all together (for a more detailed description, see the Supplementary Material). Enrich uses matplotlib to produce any of three visualizations as a starting point for further analyses (Fig. 1). The visualizations show an estimation of library diversity, the position-averaged mutation enrichment and an all-residue enrichment ratio scan. In addition to providing these visualization options, Enrich produces easy-to-use output files that can be carried forward into project-specific analyses. Enrich can take advantage of high-performance computing to conduct many analyses in parallel. Enrich's modular, extensible, Python-based design enables users to customize the software. Enrich facilitates deep mutational scanning, which can be widely applied to the breadth of disciplines that depend on understanding protein sequence–function relationships.
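The enrichment ratio, significance test and multiple testing correction described in this section can be illustrated with a short Python sketch. Several choices here are assumptions rather than details taken from Enrich. The ratio is computed from read frequencies (count divided by library total). The exact test is implemented as the conditional binomial test, one common way of comparing two Poisson counts exactly. The Benjamini-Hochberg adjustment is swapped in as a simpler stand-in for the Storey and Tibshirani q-value. All function names are hypothetical.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def two_sided_exact_p(k, n, p):
    """Two-sided exact P-value: total probability of every outcome that is
    no more likely than the observed one."""
    observed = binom_logpmf(k, n, p)
    tol = 1e-7  # absorbs floating-point noise when comparing log-probabilities
    total = sum(math.exp(lp)
                for lp in (binom_logpmf(i, n, p) for i in range(n + 1))
                if lp <= observed + tol)
    return min(1.0, total)


def enrichment(sel_count, unsel_count, sel_total, unsel_total):
    """Return (enrichment ratio, P-value) for one variant.

    Requires unsel_count > 0; variants absent from the unselected library
    need separate handling, which is not shown here.
    """
    ratio = (sel_count / sel_total) / (unsel_count / unsel_total)
    # Null hypothesis: equal frequency in both libraries, so each read of this
    # variant comes from the selected library with probability p0.
    p0 = sel_total / (sel_total + unsel_total)
    p_value = two_sided_exact_p(sel_count, sel_count + unsel_count, p0)
    return ratio, p_value


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up counts for illustration: variant -> (selected, unselected)
    counts = {"A12G": (120, 40), "L30P": (3, 60), "K7R": (50, 50)}
    sel_total = unsel_total = 10000
    results = {v: enrichment(s, u, sel_total, unsel_total)
               for v, (s, u) in counts.items()}
    adjusted = benjamini_hochberg([p for _, p in results.values()])
    for (variant, (ratio, p)), q in zip(results.items(), adjusted):
        print(variant, round(ratio, 3), p, q)
```

The exact test loops over every possible count, so its cost grows with the number of reads per variant. A production implementation would more likely call a statistics library than sum the distribution by hand.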

Fig. 1.

Enrich visualizations. Enrich produces three visualizations; examples from the dataset included with Enrich are shown here. (a) The diversity within a library is illustrated by a heatmap of the frequency of each position–mutation combination. (b) The position-averaged change in mutational frequency between two libraries is shown. (c) The log2-scaled enrichment ratio for each position–mutation combination is plotted, individually organized both by position and by amino acid (a single amino acid, serine, is shown). Blue dots indicate the enrichment or depletion of substitutions. Red squares correspond to wild-type residues. Grey squares correspond to unobserved mutations.
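The quantity plotted in panel (c), a log2-scaled enrichment ratio for each position–mutation combination, can be sketched in Python as follows. This is an illustration under stated assumptions and not Enrich's implementation: the function names are hypothetical, variants are given as protein sequences of the same length as the wild type (consistent with the removal of insertions and deletions), positions are 0-based, and frequencies are normalized by the total read count of each library.

```python
import math
from collections import Counter


def mutation_frequencies(wild_type, variant_counts):
    """Frequency of each (position, amino acid) substitution in one library.

    variant_counts maps a variant protein sequence to its read count.
    """
    total = sum(variant_counts.values())
    counts = Counter()
    for seq, count in variant_counts.items():
        for pos, (wt_aa, aa) in enumerate(zip(wild_type, seq)):
            if aa != wt_aa:
                counts[(pos, aa)] += count
    return {key: n / total for key, n in counts.items()}


def log2_enrichment(wild_type, unselected, selected):
    """log2(selected frequency / unselected frequency) per substitution.

    Substitutions unobserved in either library are omitted; they correspond
    to the unobserved (grey) cells in the figure.
    """
    f_unsel = mutation_frequencies(wild_type, unselected)
    f_sel = mutation_frequencies(wild_type, selected)
    return {key: math.log2(f_sel[key] / f_unsel[key])
            for key in f_unsel
            if f_unsel[key] > 0 and f_sel.get(key, 0) > 0}


if __name__ == "__main__":
    # Made-up three-residue example for illustration only.
    wild_type = "MKS"
    unselected = {"MKS": 80, "MAS": 10, "MKT": 10}
    selected = {"MKS": 80, "MAS": 20}
    print(log2_enrichment(wild_type, unselected, selected))
```

In this toy example the K-to-A substitution at position 1 doubles in frequency after selection, while the S-to-T substitution at position 2 disappears from the selected library and is therefore reported as unobserved rather than assigned a ratio.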

REFERENCES (10 of 17 listed)

1. G A Weiss; C K Watanabe; A Zhong; A Goddard; S S Sidhu. Rapid mapping of protein functional epitopes by combinatorial alanine scanning. Proc Natl Acad Sci U S A. 2000-08-01.

2. John D Storey; Robert Tibshirani. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003-07-25.

3. Gábor Pál; Jean-Louis K Kouadio; Dean R Artis; Anthony A Kossiakoff; Sachdev S Sidhu. Comprehensive and quantitative mapping of energy landscapes for protein-protein interactions by rapid combinatorial scanning. J Biol Chem. 2006-06-08.

4. A M Levin; G A Weiss. Optimizing the affinity and specificity of proteins with molecular display. Mol Biosyst. 2005-11-08. Review.

5. Sachdev S Sidhu; Shohei Koide. Phage display for engineering and analyzing protein interaction interfaces. Curr Opin Struct Biol. 2007-09-17. Review.

6. Emmanuel Dias-Neto; Diana N Nunes; Ricardo J Giordano; Jessica Sun; Gregory H Botz; Kuan Yang; João C Setubal; Renata Pasqualini; Wadih Arap. Next-generation phage display: integrating and comparing available molecular tools to enable cost-effective high-throughput analysis. PLoS One. 2009-12-17.

7. Carlos L Araya; Douglas M Fowler. Deep mutational scanning: assessing protein function on a massive scale. Trends Biotechnol. 2011-05-10. Review.

8. Hal Alper; Joel Moxley; Elke Nevoigt; Gerald R Fink; Gregory Stephanopoulos. Engineering yeast transcription machinery for improved ethanol tolerance and production. Science. 2006-12-08.

9. Peter J A Cock; Christopher J Fields; Naohisa Goto; Michael L Heuer; Peter M Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2009-12-16. Review.

10. Roberto Di Niro; Ana-Marija Sulic; Flavio Mignone; Sara D'Angelo; Roberta Bordoni; Michele Iacono; Roberto Marzari; Tiziano Gaiotto; Miha Lavric; Andrew R M Bradbury; Luigi Biancone; Dina Zevin-Sonkin; Gianluca De Bellis; Claudio Santoro; Daniele Sblattero. Rapid interactome profiling by massive sequencing. Nucleic Acids Res. 2010-02-09.
