Florian Gnad1, Jeremy Gunawardena, Matthias Mann. 1. Department of Proteomics and Signal Transduction, Max-Planck-Institute for Biochemistry, Am Klopferspitz 18, D-82152 Martinsried, Germany.
Abstract
The primary purpose of PHOSIDA (http://www.phosida.com) is to manage posttranslational modification sites of various species ranging from bacteria to human. Since its last report, PHOSIDA has grown significantly in size and evolved in scope. It comprises more than 80,000 phosphorylated, N-glycosylated or acetylated sites from nine different species. All sites are obtained from high-resolution mass spectrometric data using the same stringent quality criteria. One of the main distinguishing features of PHOSIDA is the provision of a wide range of analysis tools. PHOSIDA comprises three main components: the database environment, the prediction platform and the toolkit section. The database environment integrates and combines high-resolution proteomic data with multiple annotations. High-accuracy species-specific phosphorylation and acetylation site predictors, trained on the modification sites contained in PHOSIDA, allow the in silico determination of modified sites on any protein on the basis of the primary sequence. The toolkit section contains methods that search for sequence motif matches or identify de novo consensus sequences from large-scale data sets.
Many cellular events are controlled by the posttranslational modification (PTM) of specific proteins in the proteome. For example, almost all signaling pathways are controlled by reversible phosphorylation, ubiquitination and other PTMs (1,2). In recent years, mass spectrometry (MS)-based proteomics has proven a powerful and generic tool to study these events on a global scale (3). PHOSIDA provides a repository for such modification sites and a systematic approach to protein and site annotation that requires integrating and standardizing data from various sources. It started in 2006, when the Mann laboratory described a generic, quantitative and high-resolution MS technology for the identification and quantitation of phosphorylation sites as a function of stimulus and time (4). Human cells were stimulated with EGF and site specific phosphorylation dynamics were determined and the resulting MS data were recorded in the PHOSIDA database. This study provided a blueprint for many subsequent large scale phosphoproteomic studies in the Mann group (5–7) and the impetus to develop PHOSIDA into a comprehensive and integrative environment. Initially, the main purpose of PHOSIDA was to make the identified phosphorylation data publicly and easily accessible. This explains its original name ‘PHOSIDA—the PHOsphorylation SIte DAtabase’. However, with the rapidly increasing number of identified PTM sites from different species, the systematic integration of various annotation data and the provision of multiple analysis tools, PHOSIDA became much more than a mere database. The first extensions including the evolutionary analysis and prediction of human phosphorylation sites were reported in 2007 (8). Since then many additional data sets and features have been integrated into PHOSIDA. The current version manages more than 70 000 phosphorylation sites, and the largest acetylome (9) and N-glycoproteome (10) determined so far. 
Due to these extensions we now rename PHOSIDA to the ‘Posttranslational Modification Site Database’. It contains modification sites of human, mouse, fly, worm and yeast proteins, and is also the most comprehensive repository of prokaryotic phosphoproteomes. To our knowledge, PHOSIDA and Uniprot (11) are the only resources that manage differently modified proteins from such a variety of species. However, Uniprot does not provide the same level of detail on MS and site-specific data. Phospho.ELM (12), PhosphoSite (13), HPRD (14) and dbPTM (15) are further comprehensive databases that contain phosphorylation sites from different projects. Each of these websites has unique features regarding data integration, data representation and annotation. Importantly, the identification of all integrated PTM sites in PHOSIDA has been based on high-accuracy mass spectrometry measurements using very strict detection criteria (16). This ensures very small false positive rates, which are not inflated by diverse data sets analyzed by different criteria. Furthermore, the inclusion of quantitative PTM dynamics is a unique feature of PHOSIDA.
The presentation of different annotation data and the provision of analysis tools in PHOSIDA as an integrative platform has recently allowed the large-scale evolutionary analysis of phosphorylation in all domains of life (17). Based on integrated phylogenetic relationships, global sequence alignments and structure information, we found that most of the identified eukaryotic phosphoproteins were already present in the earliest forms of life. However, their regulation via phosphorylation evolved after the divergence between single- and multi-cellular species. Even the worm phosphoproteome was found to be very distinct from the phosphoproteomes of higher eukaryotes, which is in concordance with the evolution of the corresponding kinase families. 
As another example, the high-accuracy predictors and sequence analysis tools have proven helpful in diverse studies (18–20). Here we describe the current version of PHOSIDA, which contains three main components: the integrative database management, the assembly of species-specific modification predictors and the analysis toolkit.
DATABASE ENVIRONMENT
Posttranslational modification site data sets
Initially PHOSIDA contained more than 6000 phosphorylation sites from HeLa cells exposed to growth factor stimulation. This data set presented the largest identified phosphoproteome at the time but this number has been exceeded in many subsequent large-scale studies, with no sign of saturation (Table 1). The current version of PHOSIDA contains an additional 20 000 human phosphorylation sites from kinase enriched samples (21) or of quantified dynamics during the cell cycle (5,7). Furthermore about 25 000 mouse phosphorylation sites are managed by PHOSIDA. These sites were derived from liver cells (6), melanoma tissue (22), brain (23) or macrophages (24). The global phosphoproteomes of fly (25), worm (26) and yeast (27) have also been added, so that PHOSIDA covers several representative eukaryotic species. Interestingly, the overlaps between different phosphoproteomes are relatively low on the site level, as demonstrated in the associated studies. This underlines the vast extent of phosphorylation in the cell. It also indicates that the identification of the complete eukaryotic phosphoproteome is far from being achieved. While thousands of serine/threonine and tyrosine phosphorylation sites can be detected in eukaryotic cells, measured prokaryotic phosphoproteomes generally do not comprise more than 100 sites. The relatively low extent of phosphorylation can be observed for both gram-negative and gram-positive bacteria such as Escherichia coli (28) and Bacillus subtilis (29), respectively. In the third domain of life, the archaeans, the detected serine/threonine and tyrosine phosphoproteome is likewise limited to 75 sites (30). Notably, mitochondria—the eukaryotic organelles with prokaryotic origin—also comprise a relatively sparse phosphoproteome, which classifies them with the prokaryotes rather than with other mammalian organelles (17).
Table 1. Number of identified posttranslationally modified proteins, peptides and sites
Modification      Species                     Proteins   Peptides   Sites
Phosphorylation   Homo sapiens                8283       23 130     24 262
Phosphorylation   Mus musculus                9234       24 604     25 085
Phosphorylation   Drosophila melanogaster     2379       8777       10 043
Phosphorylation   Caenorhabditis elegans      2373       6926       6780
Phosphorylation   Saccharomyces cerevisiae    1118       4457       3620
Phosphorylation   Lactococcus lactis          63         99         73
Phosphorylation   Bacillus subtilis           78         102        76
Phosphorylation   Escherichia coli            79         104        81
Phosphorylation   Halobacterium salinarium    62         100        75
Acetylation       Homo sapiens                1750       4219       3600
N-Glycosylation   Mus musculus                2352       8681       6367
In addition to phosphorylation data, 3600 acetylated lysines (9) and 6367 N-glycosylated asparagines (10) have been uploaded to PHOSIDA. To allow the species-specific retrieval of PTM sites from studies that employed different databases for identification, detected modified peptides and proteins are regularly reassigned to up-to-date database versions.
Another unique feature of PHOSIDA is the uniform quality of the data. Acceptance of all PTM sites was based on high-accuracy mass spectrometry with stringent criteria, yielding a very low false positive rate in the whole repository. Additionally, the online application enables retrieval of modified sites from other sources including the Swiss-Prot database, Phospho.ELM and PhosphoSite. The collaboration with Phospho.ELM, another specialized modification site database, has proven particularly fruitful, ensuring up-to-date linkage. This could provide a model of PTM exchange akin to the exchanges of mainstream protein and gene databases. The following syntax provides a link directly to the annotation information of any eukaryotic protein of interest: http://141.61.102.18/phosida/index.aspx?query=[Uniprot accession number].
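The link syntax above can be assembled programmatically. The following is a minimal sketch, assuming the address quoted in the text is still served; the function name `phosida_link` is hypothetical and not part of PHOSIDA, and P00533 (the Uniprot accession of the human EGF receptor) is used only as an example input.

```python
from urllib.parse import urlencode

# Base address as quoted in the text; whether it still resolves is not checked here.
PHOSIDA_BASE = "http://141.61.102.18/phosida/index.aspx"


def phosida_link(uniprot_accession: str) -> str:
    """Build the direct annotation link for one Uniprot accession number."""
    accession = uniprot_accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    # urlencode escapes any character that is not safe inside a query string.
    return f"{PHOSIDA_BASE}?{urlencode({'query': accession})}"


print(phosida_link("P00533"))
```

The helper only constructs the address; it does not contact the server or verify that the accession exists in the database.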
Searching and browsing
For each species users can search for any protein of interest via accession number, gene name, description or sequence. As one of the major improvements, one can now browse for all posttranslationally modified proteins that were identified in a particular experiment or cell type (Figure 1). In addition, a gene ontology filter allows the retrieval of modified proteins that are localized in a certain cellular compartment or have a specified molecular function. The gene ontology data were derived from the AMIGO website (31,32). For example, users can browse for all protein kinases that are both phosphorylated and N-glycosylated in the mouse brain and localized at the plasma membrane.
Figure 1.
The new browsing function allows the searching for posttranslationally modified proteins that were identified in a particular experiment or cell type. Furthermore, the gene ontology filter enables users to search for modified proteins with specific cellular localization and molecular function (left panel). Selecting one of the resulting protein entries (middle panel) yields the protein annotation web page (right panel). In addition to general protein information, identified posttranslational modification sites are listed. Clicking on one of the site buttons results in the provision of site-specific information. In the illustrated example, searching for protein kinases that are both phosphorylated and N-glycosylated in the mouse brain and localized in the plasma membrane yields a list of proteins that match the specified search criteria. One of these proteins is the insulin receptor.
Posttranslational modification site information and integrated annotation data
For each modified protein, the user is presented with features such as description, gene symbol, sequence, accession numbers from various databases and gene ontology annotation. For the latter category, the full terms (e.g. ‘ATPase activity’) are given with links to the corresponding entry of the gene ontology website. In the case of eukaryotic Swiss-Prot annotated proteins, motifs, domains, modified sites from other sources and associated literature references with links to the PubMed site are provided. The integration of annotation data has proven very informative; for example, it allows immediate visualization of PTMs that occur in a certain domain. Since the localization of sites within the detected posttranslationally modified peptide is sometimes ambiguous, we developed a probability-based localization score (4). It reflects the probability that each site within the peptide is posttranslationally modified, given its fragmentation spectra. ‘Class I sites’ are defined by a minimum localization probability of 0.75. If this score is lower than 0.75, the site is enclosed in brackets in PHOSIDA. For each PTM site, the corresponding localization scores, the surrounding sequence, matching sequence motifs and the predicted secondary structure and accessibility are provided (Figure 2, left panel). As a major difference from previous releases, PHOSIDA now shows whether the specified site was detected in a certain experiment or cell type, if applicable (Figure 2, right panel). Clicking on one of the corresponding buttons lists the corresponding identified peptides and quantitative data, if available. For the most recent studies, the related spectra are shown for additional validation. The indication of the occurrence of PTM sites in cancer cell lines or normal tissues is a striking feature of PHOSIDA. As previously discussed (33), modification states observed in cell lines might not occur in normal or cancer tissues, and their interpretation might therefore be misleading. 
The ‘help’ section lists all sample conditions used in the associated experiments.
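The Class I rule described above amounts to a simple threshold test on the localization probability. The sketch below illustrates it under stated assumptions: the function name `format_site`, the use of round brackets and the example probabilities are illustrative choices, not taken from PHOSIDA itself.

```python
CLASS_I_THRESHOLD = 0.75  # minimum localization probability of a Class I site (from the text)


def format_site(residue: str, position: int, localization_probability: float) -> str:
    """Label a candidate site; sites below the Class I threshold are bracketed."""
    label = f"{residue}{position}"
    if localization_probability >= CLASS_I_THRESHOLD:
        return label
    # The bracket style is an assumption; the text only says 'enclosed in brackets'.
    return f"({label})"


# Hypothetical peptide with two candidate residues competing for one modification.
candidates = [("S", 1039, 0.92), ("T", 1041, 0.08)]
print([format_site(*candidate) for candidate in candidates])
```

A probability of exactly 0.75 is treated as Class I here, because the text defines 0.75 as the minimum value.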
Figure 2.
On the site level PHOSIDA provides the surrounding sequence, matching motifs, the predicted secondary structure, the predicted accessibility, the corresponding identified peptides and the identification state in certain cell types (left panel: N-glycosylated asparagine on position 426 of the mouse glutamate receptor 2 subunit) or experiments (right panel: serine on position 1039 of the human EGF receptor).
The evolutionary section provides information about the phylogenetic relationships between modified proteins and homologs in other species, as described (8). Based on global alignments, the amino acid conservation of all identified sites in orthologous proteins is displayed. The aligned surrounding sequences show whether matching motifs are conserved. In contrast to the previous version of PHOSIDA, we now use the phylogenetic relationships of 36 eukaryotes provided by the Ensembl Compara database (34).
POSTTRANSLATIONAL MODIFICATION SITE PREDICTION
Target serine/threonine/tyrosine sites are generally recognized by kinases and phosphatases through linear sequence patterns (motifs). Analogously, lysine acetylation sites are recognized by acetyltransferases and deacetylases. Various machine learning approaches try to predict phosphorylation sites. For example, Scansite (35) uses a profile method, whereas the prediction system NetPhos (36) is based on neural networks. NetPhosK (37) aims to predict phosphorylation sites along with their corresponding kinase. Each prediction method is unique regarding the underlying machine learning method, input sets used for training and usability. The main advantages of the PHOSIDA predictors are the high quality of the PTM sites used as input sets for training, the species specificity, and the particular effort invested in user-friendliness.
Using the integrated high-resolution MS data sets from our large-scale studies, we developed support vector machines to predict phosphorylation and acetylation sites on the basis of the primary sequence. PHOSIDA contains phosphorylation site predictors for yeast, worm, fly, mouse and human. As previously shown (38,39), phosphorylation prediction accuracy can increase with the addition of further features such as structure and conservation. However, in our studies the prediction accuracies were already high on the basis of the surrounding sequences. The addition of further information including structural constraints and conservation yielded only a slight increase in prediction accuracy (8). The species specificity of the predictors proved to be crucial, as the accuracy decreases with input sets from distantly related species (25). For example, the accuracy of identifying yeast phosphosites using the human phosphorylation site predictor is comparatively low. 
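To illustrate how surrounding sequences can serve as input to a support vector machine, the sketch below extracts a window around each candidate residue and converts it into a binary vector. The flank width, the padding symbol, the one-hot encoding and all function names are assumptions made for illustration; the text does not specify the encoding used by the PHOSIDA predictors.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"  # placeholder for positions beyond a protein terminus (assumption)


def window(sequence: str, index: int, flank: int = 6) -> str:
    """Return the residue at `index` (0-based) with `flank` residues on each side."""
    left = sequence[max(0, index - flank):index]
    right = sequence[index + 1:index + 1 + flank]
    return left.rjust(flank, PAD) + sequence[index] + right.ljust(flank, PAD)


def one_hot(win: str) -> list:
    """Encode a window as 20 binary values per position; padding encodes as zeros."""
    vector = []
    for residue in win:
        vector.extend(1 if residue == reference else 0 for reference in AMINO_ACIDS)
    return vector


def candidate_sites(sequence: str, targets: str = "STY", flank: int = 6):
    """Yield (1-based position, window) for every candidate residue in the sequence."""
    for index, residue in enumerate(sequence):
        if residue in targets:
            yield index + 1, window(sequence, index, flank)


for position, win in candidate_sites("MKSPTAY", flank=3):
    print(position, win, sum(one_hot(win)))
```

The resulting vectors could be passed to any support vector machine implementation together with labels for modified and unmodified sites.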
Newly identified sites are continuously used to generate larger species-specific training sets. However, the addition of further sites results in only a slight increase in accuracy. Moreover, PHOSIDA provides a mouse acetylation site predictor with 78% precision at 78% recall (40). A recent study has shown that PHOSIDA outperforms other acetylation site predictors including LysAcet (41) and PredMod (42). However, their accuracies might also increase upon training with current large-scale high-quality data sets. As more lysine acetylomes of other species are mapped by high-resolution MS, it will be interesting to see whether the PHOSIDA predictor is also capable of identifying acetylation sites from distantly related organisms.
To predict the occurrence of phosphorylated or acetylated sites on a single protein of interest, one can insert either its plain protein sequence without further annotation or a sequence entry in FASTA format. Addressing web users’ feedback, PHOSIDA now allows the prediction of PTM sites on multiple proteins in FASTA format (Figure 3). Users can set a desired cutoff directly on the precision-recall curve. Restrictions on the format of input sequences are described in the corresponding help section accessible via the ‘question mark’ button.
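Choosing a cutoff on the precision-recall curve trades the fraction of correct predictions against the fraction of true sites recovered. The following sketch shows how both values follow from a score cutoff; the function name and the scored examples are invented for illustration and do not come from PHOSIDA.

```python
def precision_recall(scored_sites, cutoff):
    """Compute precision and recall for (score, is_modified) pairs at a score cutoff."""
    true_pos = false_pos = false_neg = 0
    for score, is_modified in scored_sites:
        predicted = score >= cutoff
        if predicted and is_modified:
            true_pos += 1
        elif predicted:
            false_pos += 1
        elif is_modified:
            false_neg += 1
    # Both ratios are reported as 0.0 when their denominator is zero (a convention chosen here).
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical predictor scores paired with the experimentally observed state.
scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
          (0.4, False), (0.3, True), (0.2, False)]
for cutoff in (0.5, 0.75):
    print(cutoff, precision_recall(scored, cutoff))
```

In this toy set, raising the cutoff from 0.5 to 0.75 increases precision at the cost of recall, which is the trade-off a user makes when moving along the curve.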
Figure 3.
Species-specific phosphorylation or acetylation site predictors allow the in silico identification of proteins based on the primary sequence. Users can insert the sequence of a single protein or the sequences of multiple proteins in FASTA format (left panel). Using specified precision recall values, the predicted posttranslational modification sites are listed (right panel).
TOOLKIT SECTION
The recently established toolkit section contains various sequence analysis methods. The ‘motif matcher’ searches for matching motifs in any sequence of interest (Figure 4). The underlying repository of annotated sequence motifs contains recognition patterns related to phosphorylation, N-glycosylation and SUMOylation. Alternatively, users can define their own motif and determine matching sites.
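Matching a user-defined motif against a sequence can be sketched with regular expressions. The example below is not the PHOSIDA implementation; the function name `find_motif` is hypothetical, and the pattern used is the widely known N-glycosylation sequon N-X-S/T with X being any residue except proline.

```python
import re


def find_motif(sequence: str, pattern: str):
    """Return (1-based start, matched text) for every match, including overlapping ones."""
    # A lookahead wrapped around the pattern lets matches overlap each other.
    regex = re.compile(f"(?=({pattern}))")
    return [(match.start() + 1, match.group(1)) for match in regex.finditer(sequence)]


SEQUON = "N[^P][ST]"  # N-glycosylation sequon written as a regular expression

print(find_motif("MANGSNPTLNNTS", SEQUON))
# → [(3, 'NGS'), (10, 'NNT'), (11, 'NTS')]
```

The candidate at position 6 (NPT) is skipped because proline follows the asparagine, while the two overlapping matches at positions 10 and 11 are both reported.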
Figure 4.
The Motif Matcher searches for sequence matches with annotated motifs including kinase recognition patterns. Users can insert a single sequence or multiple sequences (left panel) to find motif matches (right panel).
Furthermore, we created a ‘motif finder’ for the de novo identification of protein phosphorylation sequence motifs from large-scale data sets on the basis of bootstrap statistics. Briefly, sequences surrounding non-phosphorylated serines, threonines and tyrosines are randomly selected from species-specific protein databases in iterative steps. The resulting bootstrap distributions reflect the frequencies of amino acids at certain positions relative to the site. Significantly overrepresented phosphorylation motifs are identified by comparing the position-specific amino acid frequencies in the sequences surrounding phosphorylation sites (positive set) with the corresponding calculated bootstrap distributions (background set). Identified protein sequence motifs are then scanned for matches with annotated kinase motifs. For each derived motif, the corresponding score reflects the difference between the frequency of the given amino acid at the specified position and the mean of the corresponding background bootstrap distribution, measured in number of standard deviations of the bootstrap distribution.
Resulting sequence logos visualize the significance of position-specific amino acid frequencies from given phosphorylation data sets. Amino acids are displayed in the sequence logo if their frequency at a given position is higher than the mean of the corresponding background distribution. The height of the amino acid letter is relative to the highest motif identification score.
PHOSIDA currently provides background sets consisting of non-phosphorylated sites of 47 eukaryotes, and we intend to add further precalculated background sets to our database in the near future (Figure 5). 
The background sets are limited to eukaryotic species, as the application to the sparse prokaryotic phosphoproteomes does not yield any significant sequence patterns. The required input sets are phosphorylation sites with their six surrounding residues (to both termini). Phosphosite entries have to be separated via new lines. The input set can be a mixture of phosphoserines, phosphothreonines and phosphotyrosines, and has to contain a minimum of 100 instances for at least one phosphorylated amino acid type. Our online method applies the algorithm to each phosphorylated amino acid (S/T/Y) separately. Consequently, the web user can specify different score and occurrence cutoffs for each phosphorylated amino acid. However, the specified cutoffs have to satisfy minimum values (minimum proportional occurrence of 5% and minimum score of 15). The application to eukaryotic phosphoproteomes shows that many annotated kinase motifs are covered by our de novo method. More detailed descriptions of the algorithm and the usage of the motif finder are available via the help section of PHOSIDA.
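The score described above, the distance between the observed frequency and the bootstrap mean in units of the bootstrap standard deviation, can be sketched as follows. Sampling with replacement, the sample size, the number of rounds, the function names and the toy data are all assumptions; the actual procedure is documented in the PHOSIDA help section.

```python
import random
import statistics

CENTER = 6  # index of the phosphorylated residue in a 13-residue window (six flanking residues)


def position_frequency(windows, position, amino_acid):
    """Fraction of windows that carry `amino_acid` at `position`."""
    hits = sum(1 for win in windows if win[position] == amino_acid)
    return hits / len(windows)


def bootstrap_score(positive, background, position, amino_acid, rounds=200, seed=0):
    """Observed frequency minus bootstrap mean, in bootstrap standard deviations."""
    rng = random.Random(seed)
    observed = position_frequency(positive, position, amino_acid)
    draws = []
    for _ in range(rounds):
        # Resample the background with replacement to the size of the positive set.
        sample = rng.choices(background, k=len(positive))
        draws.append(position_frequency(sample, position, amino_acid))
    mean = statistics.fmean(draws)
    spread = statistics.stdev(draws)
    if spread == 0:
        return 0.0  # no variation in the background; treated as not informative here
    return (observed - mean) / spread


# Toy data: proline at position +1 in 80% of the positive set but 10% of the background.
positive = ["AAAAAASPAAAAA"] * 80 + ["AAAAAASAAAAAA"] * 20
background = ["AAAAAASPAAAAA"] * 10 + ["AAAAAASAAAAAA"] * 90
print(round(bootstrap_score(positive, background, CENTER + 1, "P"), 1))
```

With these toy frequencies the proline at position +1 scores well above the minimum score of 15 mentioned in the text, whereas a residue that is depleted relative to the background receives a negative score.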
Figure 5.
The Motif Finder identifies significantly overrepresented consensus sequences in given large-scale phospho data sets. It compares the position specific amino acid frequencies in the input set (left panel) with the ones of the background set. Based on bootstrap statistics the motif finder extracts de novo sequence motifs and matches them with annotated kinase motifs (right panel).
FUTURE PLANS AND CONCLUSIONS
As a dynamic database, PHOSIDA is continuously extended and upgraded. The integration of multiple PTM data sets from various species and cell types along with various annotation data makes PHOSIDA a unique environment. In particular the provision of analysis tools and predictors proved to be very helpful. The feedback of the scientific community has increased the usefulness of PHOSIDA consistently over the last four years. While further large-scale data sets will be integrated in the future, we intend to expand the toolkit section. We aim to implement additional analysis tools and extend the motif finder to identify de novo motifs for any PTM. The ‘news’ section provides information about changes and updates on a regular basis. Moreover, users can subscribe to a newsletter.
FUNDING
National Institutes of Health Grant R01 GM081578-02 on “Complex dynamics in multisite phosphorylation”. PROSPECTS, a 7th framework program of the European Union (grant agreement HEALTH-F4-2008-201648/PROSPECTS). Funding for open access charge: Max-Planck Society.
Conflict of interest statement. None declared.