Minimotif Miner 4: a million peptide minimotifs and counting.

Kenneth F Lyon1, Xingyu Cai2, Richard J Young1, Abdullah-Al Mamun2, Sanguthevar Rajasekaran2, Martin R Schiller1.   

Abstract

Minimotif Miner (MnM) is a database and web system for analyzing short functional peptide motifs, termed minimotifs. We present an update to MnM that grows the database from ∼300 000 to >1 000 000 minimotif consensus sequences and instances. This growth comes largely from updating data from existing databases and from annotating articles that use high-throughput approaches to analyze different types of post-translational modifications. Another update maps human proteins and their minimotifs to known human variants from dbSNP, build 150. MnM 4 can now be used to generate mechanistic hypotheses about how human genetic variation affects minimotifs and outcomes. One example of the utility of the combined minimotif/SNP tool identifies a loss-of-function missense SNP in a ubiquitylation minimotif encoded in the excision repair cross-complementing 2 (ERCC2) nucleotide excision repair gene. This SNP reaches genome-wide significance for many types of cancer, and the variant identified with MnM 4 reveals a more detailed mechanistic hypothesis concerning the role of ERCC2 in cancer. Other updates to the web system include a new architecture with migration of the web system and database to Docker containers for better performance and management. Weblinks: minimotifminer.org and mnm.engr.uconn.edu.
© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2018        PMID: 29140456      PMCID: PMC5753208          DOI: 10.1093/nar/gkx1085

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Minimotifs are short peptide sequences that are important in evolution and human disease (1–6). Minimotifs play a critical role in the interaction of proteins with other proteins and molecules. We refer to an occurrence of a minimotif peptide sequence in a protein or peptide as an instance; sometimes, through analysis of multiple instances, a consensus sequence pattern is identified that accounts for observed variation and defines observed degeneracy. In 2006, we released the original Minimotif Miner (MnM), which had mostly consensus sequences (n = ∼300) annotated from the literature (7). In the 2008 release of MnM 2, the database grew to >5000 minimotifs and we started including instances (8). We had been using a minimotif syntax put forth by the Seefeld Convention (9), but recognized several deficiencies, which led us to propose a new model for minimotifs with a revised syntax that addressed these shortcomings (10). We later revised this model, proposing inclusion of structure in the minimotif definition and other minor modifications for MnM 3 in 2011 (11,12). The MnM 3 database utilizing this model grew to ∼300 000 minimotifs. At the beginning of the MnM project we recognized that predictions from minimotif consensus sequences were excellent at identifying true positives but lacked specificity, and only a small percentage of new minimotif predictions were accurate (7). This led us to start generating filters that improved accuracy. We first used proteome frequency, protein surface prediction and evolutionary conservation filters, with modest effect (7). Next, we sequentially tested several filters for protein–protein interactions, related molecular functions, genetic interactions and secondary structure, each yielding modest increases in accuracy (11,13–15). However, when all these filters were combined, an overall prediction accuracy of 90% was achieved (16).
This accuracy is nonetheless limited by the need for additional knowledge about the protein from other databases. Other groups have investigated other methods for improved accuracy as well (17–20). In 2007, investigators from the eukaryotic linear motif resource group suggested there may be over a million minimotif instances in the cell (21). At the time we agreed. However, we now suggest that this may be a vast underestimate. This hypothesis is supported by our report here that the MnM database has grown to >1 million instances. Furthermore, if we consider the number of minimotifs in proteins like Tat and Nef from human immunodeficiency virus (22–24), which seem to completely decorate the surface of the protein, there are likely at least another 1–2 orders of magnitude of minimotif instances in the cell. Given this prevalence in the cell, their functions are likely pervasive in most, if not all, cell processes. Herein, we provide more details on updating the MnM database and web system to version 4.

MATERIALS AND METHODS

We used separate parsers to extract data from each of four sources of minimotifs: UniProtKB, PhosphoSitePlus, MEROPS and manually annotated entries (25–27). We also used the proteomics standards initiative–modification (PSI-MOD) database from the protein information resource to accurately describe post-translational modifications (PTMs) with this ontology (28). More than 1.3 million missense single nucleotide polymorphisms (SNPs) from the single nucleotide polymorphism database (dbSNP) build 150 (2) were added to update MnM 4 (29). The integrated SNP data are used to identify minimotifs that are variable in the human population, since minimotifs play a role in evolution and disease (1,2). To identify molecular functions and cell processes that are enriched with minimotifs, a Gene Ontology (GO) analysis was performed on 12 405 human RefSeq proteins using the GOrilla gene ontology enrichment analysis tool (30,31). GOrilla produces P- and q-values for enrichment, where the q-value is a P-value with a correction for the false discovery rate due to multiple testing. Binplots were created using ggplot2 in the R programming environment (http://link.springer.com/10.1007/978-0-387-98141-3; https://www.R-project.org). External databases and new annotations were mined and queried with a combination of Java (Sun Microsystems), MySQL (my structured query language) and Python (Python Software Foundation) custom programs. MnM 4 was deployed on a mirrored server with a Linux container architecture (Supplementary Methods and Supplementary Figure S1).
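To make the relationship between P- and q-values concrete, the following is a minimal sketch of a false discovery rate correction. It assumes the Benjamini-Hochberg procedure, which the text above does not name, and the P-values in the usage block are made up for illustration; they are not values from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input P-values."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, so that q-values never
    # decrease as the rank of the P-value increases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


if __name__ == "__main__":
    # Hypothetical P-values for five GO terms (illustration only).
    example_p = [0.001, 0.008, 0.039, 0.041, 0.20]
    for p, q in zip(example_p, benjamini_hochberg(example_p)):
        print(f"P = {p:.3f}  q = {q:.5f}")
```

Each q-value is at least as large as its P-value, which is why the q-value column of an enrichment table is the more conservative of the two.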

RESULTS

Two strategies were used to update the MnM database. First, parsers were modified or built to extract information into the MnM database model. Source databases were the universal protein resource (UniProt), PhosphoSitePlus and MEROPS (25–27). A total of 765 503 minimotifs were added to the database. Second, because many additional minimotifs have been published since the last MnM release, PubMed was searched with relevant search terms to identify high-throughput affinity mass spectrometry papers that contained hundreds or thousands of minimotifs. This produced 27 017 minimotifs annotated from 15 papers. Collectively, MnM has continued to grow and now holds over 1 million minimotifs in >180 000 unique proteins (Table 1). The data come from 15 152 research articles. As expected, most of the minimotifs are instances (n = 1 059 542), with 894 consensus sequences. However, the rich source of instances presents an opportunity for a more standardized approach for generating consensus sequences and position-specific scoring matrices.
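As an illustration of how a set of instances could be turned into a consensus sequence and a position-specific scoring matrix, here is a minimal sketch. It is not the MnM procedure: the toy peptides, the 75% agreement threshold, the pseudocount and the uniform background are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_counts(instances):
    """Count residues per column for equal-length, pre-aligned peptide instances."""
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("instances must be aligned to the same length")
    return [Counter(s[i] for s in instances) for i in range(length)]


def consensus(instances, min_fraction=0.75):
    """Most common residue per column, or 'x' when no residue reaches min_fraction."""
    n = len(instances)
    out = []
    for counts in position_counts(instances):
        residue, count = counts.most_common(1)[0]
        out.append(residue if count / n >= min_fraction else "x")
    return "".join(out)


def pssm(instances, pseudocount=1.0):
    """Log2-odds score per residue and column against a uniform background (assumed)."""
    n = len(instances)
    background = 1.0 / len(AMINO_ACIDS)
    denom = n + pseudocount * len(AMINO_ACIDS)
    return [
        {aa: math.log2(((counts[aa] + pseudocount) / denom) / background)
         for aa in AMINO_ACIDS}
        for counts in position_counts(instances)
    ]


def score(matrix, peptide):
    """Sum the per-column scores of a peptide of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(matrix, peptide))


if __name__ == "__main__":
    toy_instances = ["PPLP", "PALP", "PKVP", "PTRP"]  # made-up PxxP-like peptides
    print(consensus(toy_instances))  # prints "PxxP"
    matrix = pssm(toy_instances)
    print(score(matrix, "PQLP") > score(matrix, "AQLA"))  # prints "True"
```

Unlike the consensus string, the matrix keeps the residue preferences at degenerate positions, so candidate peptides can be ranked rather than only accepted or rejected.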
Table 1.

A decade of growth of the MnM database

Category | MnM | MnM 2 | MnM 3 | MnM 4
Total minimotifs | 462 | 5089 | 294 933 | 1 060 436
Consensus | 312 | 858 | 880 | 894
Instance | 44 | 4229 | 294 053 | 1 059 542
Activity classes
Post-translational modifications | 116 | 663 | 210 949 | 912 735
Binding | 162 | 4689 | 4922 | 147 654
Trafficking | 34 | 195 | 228 | 229
Required for cell process |  |  | 47 | 47
Unique minimotif
Sequences | 312 | 2224 | 185 833 | 590 589
Proteins | <312 | 1211 | 49 671 | 182 868
Targets | <312 | 687 | 2620 | 4586
C-Terminal minimotifs | ND | ND | ND | 12 808

ND = not determined.

Enrichment analysis was performed for GO functions, processes and components in the 12 405 human reference sequence (RefSeq) proteins that contained minimotifs (30–32). The 10 GO categories with the most significant (lowest) P-values are listed in Table 2 and the top 50 are in Supplementary Table S1. These tables also contain P- and q-values for term enrichment. Not surprisingly, the most enriched GO terms were related to receptor activity and signal transduction (Table 2).
Table 2.

Enrichment of Gene Ontology terms for human proteins with minimotifs¹

Description | P-value | q-value | Type
Olfactory receptor activity | 2.89E-265 | 1.14E-261 | function
Detection of chemical stimulus involved in perception of smell | 2.89E-265 | 3.95E-261 | process
Detection of stimulus involved in sensory perception | 8.09E-232 | 2.76E-228 | process
G-protein coupled receptor activity | 2.84E-213 | 5.6E-210 | function
Detection of stimulus | 1.82E-207 | 4.98E-204 | process
Transmembrane signaling receptor activity | 1.17E-190 | 1.53E-187 | function
Transmembrane receptor activity | 2.07E-189 | 2.05E-186 | function
G-protein coupled receptor signaling pathway | 8.6E-187 | 1.96E-183 | process
Signaling receptor activity | 6.33E-182 | 5E-179 | function
Molecular transducer activity | 4.6E-161 | 2.6E-158 | function

¹ Supplementary Table S1 is a list of the top 50 enriched GO functions.

Since the size of the MnM database grew severalfold in MnM 4, the nature of the minimotif activities was assessed and quantified through database querying. Most of the minimotifs in MnM 4 (86%) are for PTMs, with most of the remaining minimotifs involved in intermolecular binding interactions (Figure 1A and B). Most of the minimotifs were instances, with <1000 consensus sequences. The bias toward PTMs and instances likely reflects the increased application of high-throughput experimental approaches, primarily affinity mass spectrometry. The new MnM 4 database contains ∼12 000 new C-terminal minimotifs on the last 10 amino acids of a protein, and thus is a source of new information for the C-terminome minimotif database (33).
Figure 1.

Binplots representing counts of minimotif activity classes. Instances of major minimotif modification activities (A), the 14 most common modification activity subclasses in MnM 4 (B), and the 12 new manually annotated modification activity subclasses for MnM 4 (C).

In MnM 4, approximately half of the PTM subactivities were for phosphorylation sites, with acetylation, glycosylation, methylation, ubiquitylation and proteolysis subactivities each contributing approximately equally to ∼35% of the database (Figure 1A). Other activities were not as commonly observed. Some of the growth in MnM 4 came from annotation of published papers using new custom-built parsers. This approach and new database imports reduced the bias of MnM 4 toward phosphorylation, with growth in other modification subactivities such as acetylation, methylation and crotonylation (Figure 1C).

There have been several minor modifications in MnM 4. The website architecture was migrated to Docker containers to improve performance and management (Supplementary Methods). MnM 4 maintains flexible navigation among pages. The MnM homepage and results page were updated with a new look, and unnecessary details were removed (Supplementary Figure S2). PSI-MOD and RefSeq were updated. A search progress indicator was added to track progress after submission. For the SNP functions, identifiers for dbSNP (rs number) have been added to the output (Figure 2C).
Figure 2.

Outputs of SNP functional analysis for excision repair cross-complementing 2 (ERCC2) with MnM. (A) MnM 3 output of ERCC2 with minimotifs highlighted magenta and SNPs highlighted dark blue or green. The table shows a list of the eight SNPs that are indicated in the sequence window. (B) Example rows of MnM 3 output showing minimotifs introduced (red font) or eliminated (green font) by an SNP that is selected in the sequence window. (C) MnM 4 output of ERCC2 with minimotifs highlighted magenta and SNPs highlighted dark blue or green. The table shows a list of the 29 SNPs that are indicated in the sequence window.

The ‘Show SNPs that change minimotifs’ function introduced in a previous version of MnM has become much more interesting given the explosive growth of human whole exome and whole genome sequencing. MnM has a menu selection to show SNPs in the protein sequence window on the query results page. In this window a user can click one or more SNPs to reveal the amino acid encoded by the missense variant. Upon selection of the ‘View new minimotifs from SNPs’ menu item, a new table displays a list of minimotifs introduced and eliminated by the SNP. Since the original release of this function, dbSNP has grown by several orders of magnitude and the number of missense minimotifs has grown from ∼182 099 in MnM 3 to 1 291 434 in MnM 4 after the update with dbSNP, build 150. To explore the utility of the updated SNP tool, we analyzed ERCC2, a gene strongly associated with several types of cancer. The ERCC2 gene encodes a helicase in the TFIIH complex essential for DNA repair through the nucleotide excision repair pathway. The ERCC2 protein (NP_000391) was analyzed with MnM 4, producing 100 predicted minimotifs. A total of 37 of these had high combined filter scores, suggesting that these are accurate predictions. We next searched for those motifs that had a SNP in a critical position.
In MnM 3, ERCC2, typical of other protein queries, had <10 SNPs (n = 8) that changed 38 minimotifs (Figure 2A). However, the SNP update of ∼1.3 million missense mutations in MnM 4 produces ∼3.5-fold more SNPs in ERCC2 minimotifs (Figure 2C, n = 29) with 211 predicted minimotif changes, reflecting the large growth of dbSNP (29). As in MnM 3, the new MnM version codifies minimotifs introduced by a SNP with the font colored green and minimotifs eliminated by a SNP with the font colored red (Figure 2B). One of these SNPs of interest was a variant (rs13181) encoding the missense substitution K751Q. The affected minimotif was a ubiquitylation site (PSI-MOD: MOD:01148) at position K751. MnM 4 reveals that this SNP changes the ubiquitylated lysine to glutamine (K751Q). Most certainly, this variant creates a loss of function for ubiquitylation at this site. Most often C-terminal ubiquitylation is involved in degradation of the protein; thus, loss of this site would be expected to increase protein expression. This SNP is of high interest because it reached genome-wide significance in several genome-wide association studies and meta-analyses for lung, breast, pancreatic, esophageal and ovarian cancers, melanoma and glioma (34–42). Moreover, the variant is observed at relatively high allele frequency (33%) in ∼60 000 sequences from the Exome Aggregation Consortium (ExAC) browser (43). From these published works and our minimotif analysis we are able to generate a new hypothesis that this critical variant eliminates a ubiquitylation site, affecting the degradation of ERCC2. Since this gene is crucial for nucleotide excision repair, this minimotif may provide a clue as to the mechanism of loss of excision repair functions in several cancers.
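The idea of listing minimotifs introduced or eliminated by a missense SNP can be sketched as a comparison of consensus matches before and after the substitution. The code below is a simplified illustration and not the MnM implementation: the toy sequence, the two regular-expression patterns and all function names are hypothetical, and no MnM filters are applied.

```python
import re


def apply_missense(sequence, position, ref, alt):
    """Return the variant sequence for a 1-based missense substitution."""
    found = sequence[position - 1]
    if found != ref:
        raise ValueError(f"expected {ref} at position {position}, found {found}")
    return sequence[:position - 1] + alt + sequence[position:]


def find_motifs(sequence, patterns):
    """Return a set of (motif name, 1-based start) for every match, overlaps included."""
    hits = set()
    for name, pattern in patterns.items():
        # A zero-width lookahead lets overlapping matches be reported.
        for match in re.finditer(f"(?=(?:{pattern}))", sequence):
            hits.add((name, match.start() + 1))
    return hits


def motif_changes(sequence, position, ref, alt, patterns):
    """Compare motif matches in the reference and variant sequences."""
    before = find_motifs(sequence, patterns)
    after = find_motifs(apply_missense(sequence, position, ref, alt), patterns)
    return {"introduced": after - before, "eliminated": before - after}


if __name__ == "__main__":
    toy_sequence = "MAPSKPLRGASTV"                    # made-up peptide
    toy_patterns = {"PxxP": "P..P", "RRxS": "RR.S"}   # made-up consensus patterns
    print(motif_changes(toy_sequence, 9, "G", "R", toy_patterns))
    print(motif_changes(toy_sequence, 6, "P", "L", toy_patterns))
```

In this toy case G9R introduces an RRxS match starting at position 8, and P6L eliminates the PxxP match starting at position 3. A substitution at a wildcard position, such as S4A, changes neither set because matches are keyed on motif name and start position only.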

DISCUSSION

In this paper we report the growth of the MnM database to more than a million minimotifs. However, we think that we are still just in the early stages of minimotif discovery. There are examples of proteins where the entire protein surface is covered with minimotifs; see examples for tp53, RNAP II, and Histone 3 (44). Furthermore, it is clear that minimotif sites overlap with protein-protein interaction sites, as well as with other minimotifs. Their prevalence implies that proteins are much more than simplistic switches for turning on or off an enzymatic function or cell process, but rather a much more complicated functional unit, like an integrated circuit computer chip with many minimotif and protein interaction inputs and outputs. In this role, minimotifs amplify interconnections within the cellular network. As the number of minimotifs discovered continues to grow, one important question that arises is why minimotifs are not frequently turning up as a vulnerability in disease. There are examples where minimotifs are important in rare disorders or infectious disease, although this is not commonly observed (2,20,45). Minimotifs are a source of functional genetic variation and important targets of selection and evolution (1). Despite their apparent importance, their generally minimal association with disease can be reconciled by the explanation that minimotifs provide functional redundancy and network robustness, such that loss of function of a single motif only impacts cellular function when it is at a point of network vulnerability. This would explain why minimotifs are mutated in only a few rare disorders. Proteins with multiple minimotifs engaging the same target protein (e.g. many SH3 domain/PxxP interactions) are examples of this encoded robustness, where mutation of one of many PxxP minimotifs in a protein is not likely to significantly influence its interaction with an SH3 protein target (46).
This scenario would also explain the minimal influence of minimotif mimetic drugs as disease therapeutics. One exciting possibility is that minimotifs may contribute to genetic risk in common disorders; however, this will require future study. To help facilitate the study of minimotifs in these types of roles, we have enhanced the functionality of MnM 4 to identify minimotifs that are variable in the human population. By identifying those minimotif residues that are both covalently modified and changed to an amino acid with a chemistry not consistent with that PTM, loss-of-function minimotifs in the human population can be confidently inferred, as in the ERCC2 ubiquitylation example we highlight herein. We hope the tool to investigate SNPs in minimotifs in MnM 4 will help facilitate study of the roles of minimotifs in selection, evolution, network function and disease, and potentially identify targets for novel therapeutics.
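The inference rule described above (a covalently modified residue replaced by an amino acid that cannot carry that modification) can be expressed as a small check. This is a sketch under stated assumptions: the acceptor table lists only the canonical acceptor residues for three PTM types and is a simplification introduced here, not a table from MnM or PSI-MOD, and the function names are hypothetical.

```python
# Canonical acceptor residues per PTM type. Illustrative simplification only;
# non-canonical acceptors are deliberately ignored.
PTM_ACCEPTORS = {
    "ubiquitylation": {"K"},
    "phosphorylation": {"S", "T", "Y"},
    "lysine acetylation": {"K"},
}


def parse_substitution(label):
    """Split a label such as 'K751Q' into (reference, 1-based position, variant)."""
    return label[0], int(label[1:-1]), label[-1]


def classify_variant(ptm_type, modified_position, substitution):
    """Classify a missense substitution relative to an annotated PTM site."""
    ref, position, alt = parse_substitution(substitution)
    if position != modified_position:
        return "not at modified residue"
    acceptors = PTM_ACCEPTORS[ptm_type]
    if ref not in acceptors:
        raise ValueError(f"{ref} is not a listed acceptor for {ptm_type}")
    if alt in acceptors:
        return "PTM site retained"
    return "predicted loss of PTM site"


if __name__ == "__main__":
    # The ERCC2 example from the text: ubiquitylation at K751, variant K751Q.
    print(classify_variant("ubiquitylation", 751, "K751Q"))
```

For the ERCC2 example this returns "predicted loss of PTM site", because glutamine is not a lysine acceptor. A serine-to-threonine change at a phosphorylation site would instead be reported as retained, although in practice the modifying enzyme may still discriminate between the two residues.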
  46 in total

1.  Loops govern SH2 domain specificity by controlling access to binding pockets.

Authors:  Tomonori Kaneko; Haiming Huang; Bing Zhao; Lei Li; Huadong Liu; Courtney K Voss; Chenggang Wu; Martin R Schiller; Shawn Shun-Cheng Li
Journal:  Sci Signal       Date:  2010-05-04       Impact factor: 8.192

2.  Natural variability of minimotifs in 1092 people indicates that minimotifs are targets of evolution.

Authors:  Kenneth F Lyon; Christy L Strong; Steve G Schooler; Richard J Young; Nervik Roy; Brittany Ozar; Mark Bachmeier; Sanguthevar Rajasekaran; Martin R Schiller
Journal:  Nucleic Acids Res       Date:  2015-06-11       Impact factor: 16.971

Review 3.  A million peptide motifs for the molecular biologist.

Authors:  Peter Tompa; Norman E Davey; Toby J Gibson; M Madan Babu
Journal:  Mol Cell       Date:  2014-07-17       Impact factor: 17.970

4.  A computational tool for identifying minimotifs in protein-protein interactions and improving the accuracy of minimotif predictions.

Authors:  Sanguthevar Rajasekaran; Jerlin Camilus Merlin; Vamsi Kundeti; Tian Mi; Aaron Oommen; Jay Vyas; Izua Alaniz; Keith Chung; Farah Chowdhury; Sandeep Deverasatty; Tenisha M Irvey; David Lacambacal; Darlene Lara; Subhasree Panchangam; Viraj Rathnayake; Paula Watts; Martin R Schiller
Journal:  Proteins       Date:  2010-10-11

5.  Single nucleotide polymorphisms (SNPs) of ERCC2, hOGG1, and XRCC1 DNA repair genes and the risk of triple-negative breast cancer in Polish women.

Authors:  Beata Smolarz; Marianna Makowska; Dariusz Samulak; Magdalena M Michalska; Ewa Mojs; Maciej Wilczak; Hanna Romanowicz
Journal:  Tumour Biol       Date:  2014-01-09

Review 6.  A Review of Functional Motifs Utilized by Viruses.

Authors:  Haitham Sobhy
Journal:  Proteomes       Date:  2016-01-21

7.  UniProt: the universal protein knowledgebase.

Authors: 
Journal:  Nucleic Acids Res       Date:  2016-11-29       Impact factor: 16.971

8.  GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists.

Authors:  Eran Eden; Roy Navon; Israel Steinfeld; Doron Lipson; Zohar Yakhini
Journal:  BMC Bioinformatics       Date:  2009-02-03       Impact factor: 3.169

9.  Proteome-wide analysis of human disease mutations in short linear motifs: neglected players in cancer?

Authors:  Bora Uyar; Robert J Weatheritt; Holger Dinkel; Norman E Davey; Toby J Gibson
Journal:  Mol Biosyst       Date:  2014-10

10.  Twenty years of the MEROPS database of proteolytic enzymes, their substrates and inhibitors.

Authors:  Neil D Rawlings; Alan J Barrett; Robert Finn
Journal:  Nucleic Acids Res       Date:  2015-11-02       Impact factor: 16.971
