SLiMSearch: a framework for proteome-wide discovery and annotation of functional modules in intrinsically disordered regions.

Izabella Krystkowiak^1,2, Norman E Davey^1,2.

Abstract

The extensive intrinsically disordered regions of higher eukaryotic proteomes contain vast numbers of functional interaction modules known as short linear motifs (SLiMs). Here, we present SLiMSearch, a motif discovery tool that scans a motif consensus, representing the specificity determinants of a motif-binding domain, against a proteome to discover putative novel motif instances. SLiMSearch applies several distinct and complementary approaches exploiting the common properties of SLiMs to predict novel motifs. Consensus matches are annotated with overlapping sequence annotation, including feature information describing protein modular architecture, post-translational modification, structure, sequence variation and experimental characterisation of functional regions. Discriminatory motif attributes such as conservation and accessibility are also calculated. In addition, SLiMSearch provides functional enrichment and evolutionary analysis tools. The enrichment tool analyses GO terms, keywords and interacting partner enrichment to indicate possible motif function. The evolutionary tool evaluates motif taxonomic range and the conservation of motif sequence context. Consensus matches can be filtered based on motif attributes such as accessibility and taxonomic range; or by the localisation, interacting partners or ontology annotation of the peptide-containing protein. SLiMSearch supports a range of species of experimental and therapeutic relevance and is available online at http://slim.ucd.ie/slimsearch/.

Year: 2017 PMID: 28387819 PMCID: PMC5570202 DOI: 10.1093/nar/gkx238

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The higher eukaryotic proteomes contain extensive intrinsically disordered regions and these regions mediate many of the interactions of a protein (1–4). Furthermore, they integrate information encoded in their environment to make regulatory decisions in reaction to cell state changes (5,6). Various estimates have suggested that there may be upwards of one hundred thousand interaction interfaces in the intrinsically disordered regions of the human proteome (3). Yet, only a small portion of the functional elements predicted to reside within these regions have been characterized (3,7). The majority of experimentally characterized modules in disordered regions belong to a class of compact, degenerate and ex nihilo evolvable interaction interfaces known as short linear motifs (SLiMs) (8–10). SLiMs perform many of the regulatory functions associated with intrinsically disordered regions. They are particularly important for the formation of transient protein complexes, modulating protein modification state, controlling protein stability and directing protein subcellular trafficking (4). The vast majority of protein motifs remain undiscovered due to experimental and computational difficulties in characterizing novel motifs (7). Most SLiMs are encoded in a linear region of less than ten amino acids with only three or four core residues determining the majority of the binding affinity and specificity. These core residues extensively contact the motif-binding pocket, and therefore need to be physicochemically compatible with the binding pocket. This constrains the peptide to a limited set of residues at these positions, resulting in a common motif or consensus in the binding partners of the motif-binding pocket.
Consensus searches can be used to discover novel functional SLiMs but their length and the limited number of defined residues make SLiMs difficult to identify; peptides matching the consensus are very likely to occur by chance and, thus, the results are dominated by stochastically occurring non-functional consensus matches (9). The key steps in motif discovery are removing matches that are unlikely to be functional, and annotating the remaining matches with discriminatory data that can be used to prioritize these matches for further experimental validation. In recent work, we leveraged in silico sequence analysis to discover and annotate peptides matching the known specificity determinants of two motif-binding proteins, the APC/C substrate recruitment subunit Cdc20 and the protein phosphatase PP2AB56 holoenzyme (11–13). Discriminatory attributes indicative of motif functionality were used to guide the experimental characterization of several novel motifs, thereby advocating the use of sequence analysis tools to augment experimental motif discovery. Several web-based tools are currently available for the discovery of novel instances of SLiM classes with characterized specificity determinants (14,15) (see Supplementary Table S1 for a detailed list). SLiM instance discovery webservers can be split into methods that scan a single protein with a set of predefined functionally characterized motif consensuses such as ELM (16), QuasiMotiFinder (17) and MiniMotifMiner (18); and those that scan a large set of proteins with a single motif consensus such as SLiMSearch (19), ScanProsite (20), SIRW (21), iELM (22) and DoReMi (23). These tools utilize a range of discriminatory attributes to prioritize consensus matches including sequence context, match conservation, structural context, ontology and interaction data to optimize motif discovery through filtering and ranking.
Here, we introduce a major update to SLiMSearch (19), a web-based tool for the discovery of novel SLiM instances in a proteome. For this release, SLiMSearch has been completely rewritten from top to bottom. A new data management framework allowing automated dataset construction built on a relational database has replaced the previous flat file data storage framework. Novel conservation, functional analysis and filtering functionality have been added allowing complex querying and filtering options. In addition, the current version has been expanded from a single human dataset to 70 species of experimental and therapeutic relevance. SLiMSearch 4.0 is a single web-based framework that consolidates a decade of research into the discriminatory attributes pertinent to motif discovery. The resulting tool produces an intuitive, informative and interactive output that can be used to identify putative functional modules in the disordered regions of a proteome.

MATERIALS AND METHODS

The SLiMSearch framework is a suite of sequence and data analysis tools for motif consensus search, annotation and filtering. The framework takes as input a motif consensus describing the specificity determinants of a motif-binding pocket in regular expression syntax (see Supplementary Material), and a species of interest. This motif consensus can be derived from experimental data such as peptide mutagenesis, peptide arrays, phage display, motif evolutionary analyses or motif structural characterization (7,24) (see Supplementary Tables S2 and S3). The consensus is scanned against the proteome of the chosen species and a sortable list of annotated consensus matches is returned. These consensus matches can be analysed for evolutionary attributes or functional enrichment. Finally, matches can be filtered based on a wide range of discriminatory attributes. The framework was designed to facilitate easy expansion and updating of the underlying data. This allows novel protein sets, sources of discriminatory data and sequence attributes to be added to the resource with little effort. The current implementation covers 69 species including most major model organisms and relevant pathogens; and has a single protein set covering viral proteins (see Supplementary Table S4). An extensive help page including the required input and a detailed description of the output is available on the website. Jobs are stored on the server for 14 days after which they are deleted.
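The consensus scan described above can be illustrated with a minimal Python sketch. It is not the SLiMSearch implementation: the function name `scan_proteome`, the toy sequences and the protein identifiers are hypothetical, and the sketch assumes that the wildcard written as "x" in the text corresponds to "." in standard regular expression syntax.

```python
import re

def scan_proteome(consensus, proteome):
    """Return (protein_id, start, end, peptide) for every consensus match.

    `consensus` is a regular expression such as "[KR].TQT"; `proteome` maps
    (hypothetical) protein identifiers to amino acid sequences. Positions are
    1-based and inclusive. A lookahead is used so that overlapping matches
    are reported as well.
    """
    pattern = re.compile("(?=(" + consensus + "))")
    hits = []
    for protein_id, sequence in proteome.items():
        for m in pattern.finditer(sequence):
            peptide = m.group(1)
            start = m.start() + 1
            hits.append((protein_id, start, start + len(peptide) - 1, peptide))
    return hits

# Toy proteome (invented sequences, for illustration only).
toy_proteome = {
    "PROT_A": "MSDKSTQTDAAKATQTLL",
    "PROT_B": "MAAPPLLGG",
}

hits = scan_proteome("[KR].TQT", toy_proteome)
for hit in hits:
    print(hit)
```

In a real search each hit would then be annotated and filtered; the tuple layout used here is only a convenient stand-in for a richer match record.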

Feature and attribute annotation

The framework accesses a large pre-computed database of protein-centric information to annotate consensus matches with attributes that are strong discriminators for or against motif functionality (see Supplementary Table S5). Furthermore, consensus matches are annotated with information to understand the pre- and post-translational mechanisms controlling their function (5,6,25). For example, SLiMSearch annotates motif instances with overlapping features describing the protein modular architecture such as short linear motifs and domains (8,26), sites of post-translational modifications (27,28) and protein topology information (29); experimentally characterized regions such as solved structures (30), secondary structure assignment (31), mutagenesis, regions of interest and binding sites (32); and sequence variation such as alternative transcription, alternative splicing and SNPs (33,34). Peptide attributes are also quantified, namely peptide disorder propensity (35), solvent accessibility (31) (if an overlapping structure is available), and conservation (36) (see Supplementary Material for details). Furthermore, proteins and regions of proteins that are inaccessible to intracellular proteins (e.g. secreted proteins, extracellular protein regions or transmembrane regions) are also annotated. All feature and attribute annotations can be used to sort the consensus matches. Each annotation is linked to the source data to obtain more details about a feature or attribute of interest. Finally, the consensus matches are linked, by clicking on the peptide or conservation score, to the ProViz protein visualization tool (http://proviz.ucd.ie) allowing the overlapping feature and attribute annotations to be visualized (37).

Evolutionary annotation

There are two major evolutionary discriminators for motif functionality (Figure 1A): conservation over large evolutionary distances (Figure 1B) and high levels of conservation relative to the flanking regions (Figure 1C and D) (36,38–40). SLiMSearch provides conservation metrics to describe these discriminatory attributes. The taxonomic range section provides information about the conservation of the consensus across a set of species. For each species, a consensus match is annotated regarding its presence or absence at the same position in an ortholog alignment (Figure 1A). Conservation of a motif consensus over a large taxonomic range is a pointer towards a region that is constrained and therefore functional. Hence, experimentally characterized functional motifs are conserved over larger taxonomic ranges than uncharacterized consensus matches (where the majority of instances are non-functional) (Figure 1B). The flank conservation annotates the conservation of the consensus match relative to the conservation of the directly flanking regions, and the relative conservation score quantifies the likelihood of this relative conservation statistically. Similar to taxonomic range, relative conservation is a strong discriminator of functional motifs. The defined residues of functional motifs that make a direct contact with the binding partner are generally more conserved than solvent-facing positions and flanking residues (Figure 1C and D). Consequently, in rapidly evolving intrinsically disordered regions, functional motifs are often observed as islands of conservation in a sea of mutations, insertions and deletions. Relative conservation quantifies this property. The flank conservation section also graphically represents the conservation of the sequence context of the consensus match where the level of conservation for each residue within and flanking the match correlates with the colour intensity.
All conservation metrics are built on pre-computed ortholog alignments allowing complex evolutionary information to be rapidly computed and accessed (see Supplementary Materials). The alignment of the region used for the conservation calculation can be directly visualized using the ProViz protein visualization tool by clicking on the peptide (37).
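The two evolutionary discriminators can be sketched on a toy ortholog alignment. This is a simplified illustration only: the function names, the species labels and the aligned sequences are hypothetical, and the relative conservation below is a crude difference of column identities, not the statistical score used by SLiMSearch (which is described in the Supplementary Material).

```python
import re

def taxonomic_range(consensus, alignment, query, start, end):
    """Species whose peptide at the aligned position still matches the consensus.

    `start` and `end` are 0-based, half-open alignment columns of the match.
    """
    pattern = re.compile(consensus)
    conserved = []
    for species, aligned_seq in alignment.items():
        if species == query:
            continue
        peptide = aligned_seq[start:end].replace("-", "")  # drop gap characters
        if pattern.fullmatch(peptide):
            conserved.append(species)
    return conserved

def relative_conservation(alignment, query, start, end, flank=2):
    """Mean column identity inside the match minus that of the flanking columns."""
    query_seq = alignment[query]
    others = [seq for species, seq in alignment.items() if species != query]

    def column_identity(col):
        # Fraction of orthologs sharing the query residue in this column.
        return sum(seq[col] == query_seq[col] for seq in others) / len(others)

    motif_cols = list(range(start, end))
    flank_cols = list(range(max(0, start - flank), start)) + \
        list(range(end, min(len(query_seq), end + flank)))
    motif_mean = sum(column_identity(c) for c in motif_cols) / len(motif_cols)
    flank_mean = sum(column_identity(c) for c in flank_cols) / len(flank_cols)
    return motif_mean - flank_mean

# Toy alignment (invented sequences); the match occupies columns 2 to 6.
alignment = {
    "human": "AAKSTQTDD",
    "mouse": "GAKSTQTDE",
    "fly":   "SPKATQTGG",
    "yeast": "TA-STQADD",
}

conserved = taxonomic_range("[KR].TQT", alignment, "human", 2, 7)
score = relative_conservation(alignment, "human", 2, 7)
print(conserved, round(score, 3))
```

A positive score in this sketch corresponds to the "island of conservation" pattern described above, where the match is better conserved than its immediate sequence context.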

Figure 1.

Benchmarking of evolutionary annotation used in SLiMSearch. (A) Example alignment of a [KR]xTQT Dynein Light Chain binding motif across different species showing the attributes of motif conservation measured by the relative local conservation and taxonomic range. (B) Motif consensus conservation of human motif instances across different species. The motif consensus taxonomic range of the validated human instances in the ELM resource compared to the non-validated instances (Instances in the human proteome which match a motif consensus from the ELM database, but are not annotated as a ‘true positive’ in the ELM database). (C) Relative local conservation (see Supplementary Material) for each residue in the defined, wildcard and flanking regions of a motif for validated instances from the ELM resource. (D) Relative local conservation for each residue in the defined, wildcard and flanking regions of a motif for consensus matches not annotated as validated instances from the ELM resource.

Functional enrichment analysis

A set of motif instances recognized by a given motif-binding partner will by definition share a common interactor; however, they often also share a common function, pathway or localisation (8). SLiMSearch analyses the enrichment of GO terms, keywords and interactors for the set of motif-containing proteins to link a function, localisation or binding partner with a motif consensus. Analysis of the ontologies can allow functions to be established for newly discovered motif classes or allow novel aspects of the biology of a previously characterized motif to be uncovered. Functional analysis of motif consensus search data has two biases that render the result of classical hypergeometric-based enrichment analyses unreliable. Firstly, the probability of seeing a consensus match in a protein is correlated to the length of the amino acid sequence. Therefore, functional annotations that are associated with longer proteins are more likely to be significantly enriched. This can be clearly seen when the median enrichment score (see Supplementary Material) of a GO term is plotted against the average number of disordered amino acids per protein annotated with that GO term for random motif consensus test sets (randomized motif consensuses would not be expected to have enriched functional annotation) (Figure 2A). This bias results in strong over-estimates of the significance of certain GO terms and, as a result, even random motif consensuses will regularly have numerous significantly enriched GO terms (Figure 2B). Secondly, related proteins, due to their sequence similarity, are more likely to share a randomly occurring consensus match. However, related proteins are also more likely to have overlapping functional annotations. Consequently, functional annotations associated with large protein families are also often significantly enriched.
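For reference, the classical analysis that suffers from these biases can be sketched as a one-sided hypergeometric test followed by Benjamini–Hochberg adjustment. The sketch below uses only the Python standard library; the function names and the example numbers are illustrative assumptions, and it deliberately shows the uncorrected baseline, not the search-space-corrected test implemented in SLiMSearch.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy example: 10 proteins in the background, 3 annotated with a term,
# 4 proteins contain a consensus match, 2 of those carry the term.
p = hypergeom_sf(2, 10, 3, 4)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p, adjusted)
```

Because every protein is treated as an equally likely draw, this baseline ignores both protein length and family relationships, which is exactly where the two biases described above enter.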
SLiMSearch includes two functional enrichment tools designed to remove these biases from functional analyses of consensus search results (see Supplementary Materials). The first is a corrected hypergeometric test that accounts for motif search space and evolutionary relationships between proteins, and applies Benjamini–Hochberg correction for multiple testing. The second is a Mann–Whitney rank test analysis using relative conservation scores as the ranking criteria. As functional matches of a motif consensus are generally more conserved than stochastically occurring non-functional instances (Figure 1C and D), biologically relevant functional annotations related to the motif consensus will be enriched for highly conserved motif instances; the non-random nature of this distribution can be captured by the rank test (see Supplementary Figure S3). These novel functional enrichment tools are a clear improvement on the commonly used hypergeometric statistic for functional enrichment analysis and conform closely to the expected distribution for random motif consensus test sets (Figure 2B). Furthermore, when analyzing consensus matches of an experimentally characterized motif family, the functional enrichment tools can correctly return functional annotation associated with the motif family (Figure 2C).
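The idea behind the conservation rank test can be sketched with a plain Mann–Whitney U calculation. This is a minimal illustration under stated assumptions: the function name and the scores are invented, higher scores are taken to mean stronger conservation in this toy, and the p-value uses a normal approximation without tie or continuity correction, which may differ from the procedure described in the Supplementary Material.

```python
from math import erfc, sqrt

def mann_whitney_u(sample, background):
    """One-sided test that `sample` values tend to exceed `background` values.

    Returns (U, p) where p comes from a normal approximation without tie or
    continuity correction.
    """
    n1, n2 = len(sample), len(background)
    pooled = sorted([(v, 0) for v in sample] + [(v, 1) for v in background])
    rank_sum = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal
    return u, p

# Conservation scores of matches in proteins annotated with a GO term
# versus all other matches (invented numbers).
with_term = [0.9, 0.8, 0.7]
without_term = [0.1, 0.2, 0.3, 0.4]
u, p = mann_whitney_u(with_term, without_term)
print(u, p)
```

A small p-value here indicates that matches in proteins carrying the annotation rank consistently higher by conservation than the remaining matches.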

Figure 2.

Benchmarking of the functional enrichment analysis approaches used by SLiMSearch. (A) Plot of the median GO term enrichment scores against the average number of disordered amino acids per protein for GO terms returned from the enrichment analysis of the random benchmarking set (see Supplementary Material). (B) Plot of the average P-value for a GO term against the percentage of GO terms with that P-value or less for the random benchmarking set. In this dataset, which should have no functional motif consensuses and therefore no enriched GO terms, the data points should fall along the diagonal. The classical hypergeometric test clearly diverges from the diagonal and falls under the line; as such, it strongly overpredicts the significance of each GO term. P-values are calculated using classical hypergeometric test with Benjamini–Hochberg correction (classical hypergeometric); hypergeometric test with Benjamini–Hochberg correction with motif search space correction (corrected hypergeometric); and Mann–Whitney U rank test for enrichment analysis based on conservation (QFO) (conservation rank test). (C) The distribution of corrected hypergeometric and conservation rank test P-values of GO terms for consensus searches of ELM class regular expressions (split into extended GO terms annotated in the ELM resource as functionally related to an ELM class, and extended GO terms not annotated for the ELM class), reversed ELM class regular expressions and shuffled ELM class regular expressions. Enrichment analysis performed with motif search space correction (corrected hypergeometric) and based on QFO conservation (conservation rank test). Both analyses used UniRef50 clustering of related proteins. The stars denote the mean value and red plus values denote outliers.

Filtering

Accessibility is a key discriminatory attribute for motif functionality (9,41,42). Consequently, by default the protein search space is restricted to the intrinsically disordered regions of the proteome (as defined by IUPred with a cut-off of 0.4) though this can be modified on the input page. Further accessibility filtering options include surface accessibility (when a structure is available) and overlap with Pfam domains, topology and localisation. Consensus matches falling within these regions are not filtered automatically as the filters are quite coarse and can remove many functional motif instances. For example, surface accessibility filtering can remove motifs solved while bound to the motif-binding pocket; many Pfam domains contain motifs in accessible loops or are family descriptors for conserved disordered regions; and topology and localisation requirements vary depending on the motif class searched. Consequently, by default, consensus matches that are found within these regions are retained; however, they are flagged in the output and can be removed using the quick filtering options in the top right corner of the instances table. SLiMSearch also allows consensus matches to be filtered based on general motif attributes such as motif taxonomic range; based on specific information about the motif-binding partner such as interactors, function or co-localisation; or simply using a list of GO terms or protein accessions. The filtering options allow the user to create a biologically relevant subset of the consensus matches.
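The default disorder restriction can be sketched as follows. The 0.4 cut-off is taken from the text; everything else is an assumption made for illustration: the per-residue scores are invented stand-ins for IUPred output, the function names are hypothetical, and the rule of averaging the score over the matched peptide is one plausible reading and may not be how SLiMSearch applies the cut-off.

```python
def mean_disorder(scores, start, end):
    """Mean per-residue disorder score over a 1-based inclusive range."""
    window = scores[start - 1:end]
    return sum(window) / len(window)

def filter_by_disorder(hits, disorder_scores, cutoff=0.4):
    """Keep hits whose mean disorder score is at or above the cut-off.

    `hits` are (protein_id, start, end, peptide) tuples; `disorder_scores`
    maps protein identifiers to per-residue score lists.
    """
    return [
        hit for hit in hits
        if mean_disorder(disorder_scores[hit[0]], hit[1], hit[2]) >= cutoff
    ]

# Invented scores: the first 10 residues are disordered, the rest ordered.
disorder_scores = {"PROT_A": [0.8] * 10 + [0.2] * 8}
hits = [("PROT_A", 4, 8, "KSTQT"), ("PROT_A", 12, 16, "KATQT")]

disordered_hits = filter_by_disorder(hits, disorder_scores)
print(disordered_hits)
```

Raising or lowering `cutoff` mirrors the option on the input page to modify the default restriction.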
For example, SLiMSearch can find Type I WW domain consensus matches ([LP]PxY): in intrinsically disordered and intracellular portions of a protein; in the Caenorhabditis elegans proteome that are also conserved in Drosophila melanogaster; in proteins annotated as ‘transcription’ or ‘hippo signaling’; in a protein that is known to interact with a protein containing a WW domain; or in a protein that shares at least one GO term with the WW domain-containing Yes-associated protein 1 (YAP1). Filtering options can also be chained to allow complex queries to be performed. For example, SLiMSearch can return all human PIP box motif consensus matches (Qxx[IL]xx[FHY][FHY]) found in a nuclear protein, annotated with the keyword ‘DNA repair’, conserved outside mammals, that occur in previously characterized PCNA interactors.
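Chained filtering of this kind can be thought of as a conjunction of simple predicates over annotated matches. The sketch below is an illustration only: the `Match` record, the field names, the protein identifiers and the annotations are all hypothetical and do not reflect the SLiMSearch data model.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Match:
    protein: str
    peptide: str
    localisation: set = field(default_factory=set)
    keywords: set = field(default_factory=set)
    conserved_in: set = field(default_factory=set)
    interactors: set = field(default_factory=set)

def chain(*predicates):
    """Combine predicates so that a match must satisfy all of them."""
    return lambda match: all(p(match) for p in predicates)

# PIP box consensus from the text, with "x" written as "." in regex syntax.
pip_box = re.compile("Q..[IL]..[FHY][FHY]")

matches = [
    Match("PROT_1", "QKTLDSFF", {"nucleus"}, {"DNA repair"}, {"mammals", "birds"}, {"PCNA"}),
    Match("PROT_2", "QAAIGGYF", {"cytoplasm"}, {"DNA repair"}, {"mammals"}, {"PCNA"}),
    Match("PROT_3", "QSSLEEFH", {"nucleus"}, {"transport"}, {"mammals", "fish"}, set()),
]

query = chain(
    lambda m: pip_box.fullmatch(m.peptide) is not None,
    lambda m: "nucleus" in m.localisation,
    lambda m: "DNA repair" in m.keywords,
    lambda m: bool(m.conserved_in - {"mammals"}),  # conserved outside mammals
    lambda m: "PCNA" in m.interactors,
)

selected = [m.protein for m in matches if query(m)]
print(selected)
```

Each predicate corresponds to one filtering option, so adding or removing a line in `chain(...)` widens or narrows the query without changing the rest of the code.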

DISCUSSION

SLiMSearch is an interactive and information-rich yet intuitive motif discovery tool, accessible through a simple motif search interface. The framework searches characterized motif specificity determinants to identify putative novel motif instances. Instances are annotated with accessibility information, evolutionary attributes and experimental information to simplify the process of selecting instances for further validation. As such, SLiMSearch is a powerful tool to aid biologists in building hypotheses and designing experiments by simplifying the analysis of the functional and evolutionary features of motifs (7).

AVAILABILITY

SLiMSearch is available at http://slim.ucd.ie/slimsearch/.

42 in total

1. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

2. Local structural disorder imparts plasticity on linear motifs.

Authors: Monika Fuxreiter; Peter Tompa; István Simon
Journal: Bioinformatics Date: 2007-03-25 Impact factor: 6.937

3. The switches.ELM resource: a compendium of conditional regulatory interaction interfaces.

Authors: Kim Van Roey; Holger Dinkel; Robert J Weatheritt; Toby J Gibson; Norman E Davey
Journal: Sci Signal Date: 2013-04-02 Impact factor: 8.192

Review 4. A million peptide motifs for the molecular biologist.

Authors: Peter Tompa; Norman E Davey; Toby J Gibson; M Madan Babu
Journal: Mol Cell Date: 2014-07-17 Impact factor: 17.970

Review 5. Intrinsically disordered proteins in cellular signalling and regulation.

Authors: Peter E Wright; H Jane Dyson
Journal: Nat Rev Mol Cell Biol Date: 2015-01 Impact factor: 94.444

6. The ABBA motif binds APC/C activators and is shared by APC/C substrates and regulators.

Authors: Barbara Di Fiore; Norman E Davey; Anja Hagting; Daisuke Izawa; Jörg Mansfeld; Toby J Gibson; Jonathon Pines
Journal: Dev Cell Date: 2015-02-09 Impact factor: 12.270

7. Linear motifs confer functional diversity onto splice variants.

Authors: Robert J Weatheritt; Norman E Davey; Toby J Gibson
Journal: Nucleic Acids Res Date: 2012-05-25 Impact factor: 16.971

8. ProViz-a web-based visualization tool to investigate the functional and evolutionary features of protein sequences.

Authors: Peter Jehl; Jean Manguy; Denis C Shields; Desmond G Higgins; Norman E Davey
Journal: Nucleic Acids Res Date: 2016-04-16 Impact factor: 16.971

9. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

10. The Pfam protein families database: towards a more sustainable future.

Authors: Robert D Finn; Penelope Coggill; Ruth Y Eberhardt; Sean R Eddy; Jaina Mistry; Alex L Mitchell; Simon C Potter; Marco Punta; Matloob Qureshi; Amaia Sangrador-Vegas; Gustavo A Salazar; John Tate; Alex Bateman
Journal: Nucleic Acids Res Date: 2015-12-15 Impact factor: 16.971
