
The Predikin webserver: improved prediction of protein kinase peptide specificity using structural information.

Abstract

The Predikin webserver allows users to predict substrates of protein kinases. The Predikin system is built from three components: a database of protein kinase substrates that links phosphorylation sites with specific protein kinase sequences; a Perl module to analyse query protein kinases; and a web interface through which users can submit protein kinases for analysis. The Predikin Perl module provides methods to (i) locate protein kinase catalytic domains in a sequence, (ii) classify them by type or family, (iii) identify substrate-determining residues, (iv) generate weighted scoring matrices using three different methods, (v) extract putative phosphorylation sites in query substrate sequences and (vi) score phosphorylation sites for a given kinase, using optional filters. The web interface provides user-friendly access to each of these functions and allows users to rapidly obtain a set of predictions that they can export for further analysis. The server is available at http://predikin.biosci.uq.edu.au.

Substances:
Peptides
Protein Kinases

Year: 2008 PMID: 18477637 PMCID: PMC2447752 DOI: 10.1093/nar/gkn279

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Enzymes of the eukaryotic protein kinase superfamily phosphorylate serine, threonine or tyrosine residues in proteins. Protein kinases and their substrates form complex networks that regulate essentially every eukaryotic cellular process (1). Defects in phosphorylation networks result in numerous disease states, making protein kinases important pharmacological targets. To ensure signalling fidelity, protein kinases act on discrete sets of substrates. Two major factors are responsible for substrate recognition (2): substrate recruitment, encompassing any process that promotes kinase-substrate encounters; and peptide specificity, the preference for particular residues surrounding the phosphorylation site. We have previously developed a method, named Predikin, to predict the peptide specificity of protein serine–threonine kinases (3). Predikin identifies key conserved substrate-determining residues (SDRs) in the protein kinase substrate-binding pocket. The region of the substrate contacted by these residues corresponds to the heptapeptide sequence comprising positions −3 to +3 relative to the phosphorylated residue, so the physicochemical properties of SDRs can be used to predict which heptapeptides are the best substrates for a particular protein kinase. We have recently revised and expanded the entire Predikin codebase and provided access to Predikin via a new webserver. The Predikin webserver is built from three components: (i) Predikin.pm, a Perl module that provides data and methods for the analysis of protein kinase and substrate sequences; (ii) PredikinDB, a database of protein kinases and their substrates and (iii) the website user interface. In this article, we provide a brief description of the methods, capabilities and usage of the Predikin webserver.

METHODS

Protein kinase sequence analysis

Users begin a Predikin analysis by submitting a protein kinase sequence in FASTA format. A Perl module, Predikin.pm, provides data and methods for the analysis of both protein kinase and substrate sequences. The protein kinase is analysed using the following methods: (i) assignment of protein kinase type (serine–threonine, CMGC or tyrosine kinase) using a regular expression match based on Prosite patterns (4); (ii) classification by Kinase Sequence Database (KSD) family (5); (iii) classification by PANTHER database family (6) and (iv) identification of the key SDRs. The module makes extensive use of the BioPerl library (7), HMM libraries and the HMMER package (8). SDRs in the query kinase are located using an alignment of the kinase sequence with an HMM profile model of the kinase catalytic domain (S_TKc, accession SM00220) from the SMART database (9). KSD family is assigned using the HMMER tool hmmpfam to compare the kinase sequence with a set of HMMs built from KSD family alignments. PANTHER family is assigned using HMM families and the pantherScore program, both obtained from the PANTHER database website. Having characterized the catalytic domain of the query kinase, Predikin moves to the next step in the procedure: calculation of kinase-specific weight matrices for substrate prediction.
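The type-assignment step (i) can be illustrated with a short Python sketch that translates a pattern written in Prosite syntax into a regular expression and searches a kinase sequence with it. This is not the Predikin.pm code (which is written in Perl): the function names are hypothetical, and the two patterns are simplified illustrations of catalytic-loop signatures, not the patterns used by Predikin or the exact Prosite entries. The CMGC type is omitted for brevity.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite-syntax pattern into a Python regular expression."""
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        # optional '<' anchor, core symbol, optional (n) or (n,m) repeat, optional '>' anchor
        m = re.fullmatch(r"(<?)(.+?)(?:\((\d+(?:,\d+)?)\))?(>?)", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        start, core, repeat, end = m.groups()
        if core == "x":                 # x: any residue
            core = "."
        elif core.startswith("{"):      # {AB}: any residue except A or B
            core = "[^" + core[1:-1] + "]"
        # single letters and [ABC] classes are already valid regex syntax
        regex.append(("^" if start else "") + core
                     + ("{" + repeat + "}" if repeat else "")
                     + ("$" if end else ""))
    return "".join(regex)

# Illustrative, simplified patterns (assumptions, not the Predikin patterns).
KINASE_TYPE_PATTERNS = {
    "serine-threonine": "[LIVMFYC]-x-[HY]-x-D-[LIVMFY]-K-x(2)-N",
    "tyrosine": "[LIVMFYC]-x-[HY]-x-D-[LIVMFY]-[RSTAC]-x(2)-N",
}

def classify_kinase_type(sequence: str):
    """Return the first kinase type whose pattern occurs in the sequence, else None."""
    for kinase_type, pattern in KINASE_TYPE_PATTERNS.items():
        if re.search(prosite_to_regex(pattern), sequence.upper()):
            return kinase_type
    return None
```

For example, `classify_kinase_type("MGAAIVHRDLKPENLLLD")` finds the lysine-containing motif and returns the serine–threonine label, whereas a sequence with no match returns `None`.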

Kinase-specific weight matrices

The unique feature of Predikin compared with existing methods is that it permits substrate prediction based solely on kinase sequence, as opposed to providing predictions only for a kinase family. This is achieved by querying PredikinDB, the MySQL database backend to the webserver. PredikinDB contains three linked tables that describe protein kinases, their substrates and phosphorylation sites. The data in PredikinDB are derived from the UniProt database using custom parsers written in Perl. PredikinDB is updated automatically at regular intervals using a pipeline of scripts that download UniProt files, parse and generate the database tables. The key feature of PredikinDB is that where possible, phosphorylation sites are linked with the sequence of the kinase acting at the site. This is achieved by parsing the UniProt MOD_RES line for a kinase name (e.g. ‘by PKA’) and comparing it with a list of gene names for kinases from the same organism as the substrate sequence. PredikinDB currently contains 2335 serine, threonine and tyrosine residues that are annotated as phosphoresidues (with UniProt evidence level ‘experimental’, ‘by similarity’, ‘probable’ or ‘potential’) in 1116 proteins and that are linked to a specific protein kinase sequence. 11 999 sites are also annotated as experimental by the phospho.ELM database (10), of which 690 are linked to a kinase. Linking phosphorylation sites with kinase sequences allows the retrieval of phosphorylation sites from the database, where (i) the kinase is known; (ii) the phosphorylation site is annotated with high confidence and (iii) the kinase has similarity in the catalytic domain (as measured by SDRs, KSD or PANTHER family) to those of the query kinase. Users can specify a minimum confidence value for the phosphorylation sites used in scoring matrices (phospho.ELM experimental; UniProt experimental, by similarity, probable or potential) and can also specify that only non-redundant sites be retrieved (homology reduction). 
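The linking of annotated sites to kinase sequences can be sketched in Python as follows. The sketch assumes that the description field of a MOD_RES line has the form "Phosphoserine; by PKA (By similarity)." and that a missing qualifier means experimental evidence; both are assumptions about the UniProt flat-file format, not details taken from the PredikinDB parsers. The names `parse_mod_res`, `link_to_kinase` and the accession `KIN0001` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class SiteAnnotation(NamedTuple):
    residue: str      # 'serine', 'threonine' or 'tyrosine'
    kinases: tuple    # kinase names found after 'by', possibly empty
    evidence: str     # 'experimental', 'by similarity', 'probable' or 'potential'

# Assumed layout of the MOD_RES description text (see lead-in).
_DESC = re.compile(
    r"Phospho(?P<residue>serine|threonine|tyrosine)"
    r"(?:;\s*by\s+(?P<kinases>[^;(.]+?))?"
    r"\s*(?:\((?P<evidence>By similarity|Probable|Potential)\))?"
    r"\.?",
    re.IGNORECASE,
)

def parse_mod_res(description: str) -> Optional[SiteAnnotation]:
    """Parse a MOD_RES description; return None if it is not a phosphoresidue."""
    m = _DESC.fullmatch(description.strip())
    if m is None:
        return None
    names = m.group("kinases")
    kinases = tuple(re.split(r"\s*,\s*|\s+and\s+|\s+or\s+", names.strip())) if names else ()
    # Assumption: no qualifier means the annotation is experimental.
    evidence = (m.group("evidence") or "experimental").lower()
    return SiteAnnotation(m.group("residue").lower(), kinases, evidence)

def link_to_kinase(annotation: SiteAnnotation, gene_names: dict) -> list:
    """Match kinase names against gene names of kinases from the substrate's organism.

    gene_names is a hypothetical mapping of gene name -> kinase sequence accession.
    """
    lookup = {name.upper(): accession for name, accession in gene_names.items()}
    return [lookup[k.upper()] for k in annotation.kinases if k.upper() in lookup]
```

A site whose kinase name is not in the gene-name list, for example an autophosphorylation note, is parsed but remains unlinked.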
The sites are then aligned and used to construct position weight matrices (Figure 1) by comparing the frequency of an amino acid at each position in the alignment with the frequency in all substrate sequences for the type (serine–threonine, CMGC or tyrosine kinase) of the query kinase. The matrices can then be used to score potential phosphorylation sites in putative substrates of the kinase.
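A minimal Python sketch of this matrix construction is given here. The text states only that position-specific frequencies are compared with background frequencies; the log2 ratio and the pseudocount used below are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frequency_matrix(sites, pseudocount=1.0):
    """Per-position amino acid frequencies for aligned, equal-length peptides."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for position in range(length):
        counts = Counter(site[position] for site in sites)
        matrix.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return matrix

def background_frequencies(sequences, pseudocount=1.0):
    """Amino acid frequencies over all substrate sequences for one kinase type."""
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def weight_matrix(sites, background, pseudocount=1.0):
    """Assumed weighting: log2 of observed frequency over background frequency."""
    return [
        {aa: math.log2(column[aa] / background[aa]) for aa in AMINO_ACIDS}
        for column in frequency_matrix(sites, pseudocount)
    ]
```

With this weighting, a residue that is enriched at a position relative to the background receives a positive weight and a depleted residue receives a negative one.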

Figure 1.

Frequency (upper) and weight (lower) matrices generated by Predikin to score potential substrates of protein kinase Cla4p from Saccharomyces cerevisiae, using the method of classification by KSD family.

Substrate prediction

The Predikin.pm Perl module provides methods to score potential phosphorylation sites using the weight matrices generated for the query kinase sequence. The user uploads putative substrates in FASTA format, from which all peptides with the sequence XXX[ST]XXX (serine–threonine and CMGC kinases) or XXXYXXX (tyrosine kinases) are extracted. These sites can then be scored using one of the SDR, KSD or PANTHER matrices for the query kinase. A cutoff score below which results are not reported can be specified. In addition, the DisEMBL (11) and TMHMM (12) packages can be employed as filters to discriminate against putative phosphorylation sites on the basis of low intrinsic disorder and location within a transmembrane helix, respectively. Analysis of experimentally validated phosphorylation sites in PredikinDB shows that over 90% are found in a disordered region as predicted by at least one of DisEMBL's three algorithms and <0.1% are located in a TMHMM-predicted helix. The analysis is available as Supplementary data at the website. The output from Predikin is a table (Figure 2) containing identifiers for the kinase, catalytic domain and substrate, the location of the potential phosphorylated residue, the heptapeptide XXX[STY]XXX and a relative score between 0 and 100 indicating the likelihood of phosphorylation by the kinase. We have performed an evaluation of Predikin scores using kinase-substrate pairs from PredikinDB to determine how well Predikin discriminates known phosphorylation sites of a kinase from unknown sites. The evaluation procedure is provided as Supplementary data at the website. Briefly, sites linked to a kinase were retrieved from PredikinDB and randomly divided into test (10%) and training (90%) sets. All XXX[STY]XXX sites in each test set substrate were scored by generating a scoring matrix for the corresponding kinase, omitting those sites in the training set linked to the same kinase. 
Known/unknown sites were labelled 1/0, respectively and redundant sites (same peptide, same kinase and so same score) were discarded. The procedure was repeated 100 times to obtain 100 samples of scores and labels for each scoring method/kinase type combination.
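The site extraction and scoring steps described in this section can be sketched in Python as follows. How Predikin converts raw matrix scores into the relative 0 to 100 score is not stated here, so the min-max scaling below is an assumption, as are the function names and the treatment of residues missing from the matrix.

```python
import re

# Lookahead groups so that overlapping heptapeptides are all reported.
SITE_PATTERNS = {
    "serine-threonine": re.compile(r"(?=([A-Z]{3}[ST][A-Z]{3}))"),
    "CMGC": re.compile(r"(?=([A-Z]{3}[ST][A-Z]{3}))"),
    "tyrosine": re.compile(r"(?=([A-Z]{3}Y[A-Z]{3}))"),
}

def extract_sites(sequence, kinase_type):
    """Return (1-based position of the central residue, heptapeptide) pairs."""
    pattern = SITE_PATTERNS[kinase_type]
    return [(m.start() + 4, m.group(1)) for m in pattern.finditer(sequence.upper())]

def relative_score(peptide, weights):
    """Scale the summed weights to 0-100 (assumed min-max scaling).

    weights is a list of 7 dicts mapping residue -> weight; a residue absent
    from a column is given that column's lowest weight (an assumption).
    """
    raw = sum(col.get(aa, min(col.values())) for aa, col in zip(peptide, weights))
    best = sum(max(col.values()) for col in weights)
    worst = sum(min(col.values()) for col in weights)
    if best == worst:
        return 0.0
    return 100.0 * (raw - worst) / (best - worst)

def predict(sequence, kinase_type, weights, cutoff=0.0):
    """Score every candidate site and keep those at or above the cutoff."""
    hits = [(pos, pep, relative_score(pep, weights))
            for pos, pep in extract_sites(sequence, kinase_type)]
    return [hit for hit in hits if hit[2] >= cutoff]
```

Sites within three residues of either terminus are not reported by this sketch, because a full heptapeptide cannot be formed around them.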

Figure 2.

A sample prediction generated by the Predikin webserver. Predikin SDR scores for the protein kinase Cla4p from Saccharomyces cerevisiae are shown for potential phosphorylation sites in Cla4p and yeast protein YOL113W.

Area under receiver operating curve (AROC) values, obtained by plotting true positive (annotated sites) versus false positive (unannotated sites) rates as the score threshold is successively lowered (13), ranged from 0.71 ± 0.98 SD (tyrosine kinases, KSD scores) to 0.93 ± 0.02 SD (CMGC kinases, SDR scores), depending on the Predikin scoring method used and the kinase type. These values indicate that Predikin is effective at distinguishing true sites. Detailed comparison with other methods (GPS, KinasePhos, NetPhosK, PPSP and Scansite) is beyond the scope of this article; a preliminary AROC analysis indicates that Predikin performs as well as or better than other phosphorylation site predictors. However, we emphasize that such comparisons are of limited value, particularly as the other methods can only assign a kinase family to a query substrate, whereas Predikin predicts substrates based solely on the query kinase sequence.
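The AROC statistic can be computed from scores and 1/0 labels with a few lines of Python. The sketch below uses the pairwise (rank-based) formulation, which gives the same area as integrating the ROC curve obtained by lowering the threshold; it is offered as an illustration and is not claimed to be the procedure used for the published evaluation. The function names are hypothetical.

```python
import statistics

def aroc(scores, labels):
    """Area under the ROC curve from scores and 1/0 labels.

    Equals the probability that a randomly chosen known site (label 1)
    scores higher than a randomly chosen unknown site (label 0),
    counting ties as one half.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one known and one unknown site")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def summarize(aroc_values):
    """Mean and sample standard deviation over repeated random splits."""
    return statistics.mean(aroc_values), statistics.stdev(aroc_values)
```

A value of 1.0 means every known site outscores every unknown site, and 0.5 corresponds to random ranking.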

Web server implementation

The Predikin webserver user interface is built using the open-source Joomla content management system (CMS; http://www.joomla.org). Forms are designed using the Joomla Facile Forms component (http://www.facileforms.biz). The Joomla CMS provides a convenient modular approach to website design, making it easy to add features such as user management, custom forms, documentation and discussion forums. Joomla is written in PHP, so the PECL PHP embedded Perl extension (http://pecl.php.net/perl) is employed to allow communication between the webserver and the Predikin Perl module. The webserver is primarily designed for users interested in a small set of kinases and potential substrates, identified in an experimental screen. However, users can acquire a large set of substrate predictions for a kinase quite rapidly, since all predictions for a session are stored in a temporary database table and can be exported as comma-separated text for easy import to other applications and further analysis. Users with more complex requirements (such as genome-scale prediction of substrates for kinases) may wish to use the standalone Predikin Perl module and are encouraged to contact us for more information.

DISCUSSION

The Predikin webserver provides user-friendly access to the improved and enhanced Predikin prediction system. A number of existing tools such as Scansite (14), KinasePhos (15), NetPhosK (16), NetworKIN (17), GPS/GPS2 (18) and PPSP (19) are available to predict protein kinase substrates. The fundamental difference between these tools and Predikin is that analysis using the other tools begins with a substrate sequence, which has to be assigned to a limited number of pre-assigned kinase families. Predikin, on the other hand, uses the kinase sequence to build scoring matrices based on key residues in the kinase catalytic domain that are known from structural analysis to interact with the substrate phosphorylation site. It can therefore make substrate predictions for any protein kinase based on sequence alone, provided that phosphorylation sites of similar kinases are present in the PredikinDB database. New features and enhancements provided by the revised Predikin code include (i) more reliable determination of SDRs through the use of profile HMM alignments; (ii) filters to prescreen potential phosphorylation sites based on accessibility and disorder; (iii) three methods to generate kinase-specific scoring matrices based on SDRs, KSD or PANTHER family and (iv) use of the PredikinDB database, which is updated continually with new annotated phosphorylation sites and links sites with specific kinase sequences rather than kinase families, so forming the basis for substrate prediction using kinase features. Predikin provides a range of applications, such as predicting candidate substrates for a protein kinase, candidate protein kinases for a substrate and the assignment of protein kinases to their substrates in large datasets.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

19 in total

1. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

2. Structural basis and prediction of substrate specificity in protein serine/threonine kinases.

Authors: Ross I Brinkworth; Robert A Breinl; Bostjan Kobe
Journal: Proc Natl Acad Sci U S A Date: 2002-12-26 Impact factor: 11.205

3. Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs.

Authors: John C Obenauer; Lewis C Cantley; Michael B Yaffe
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

4. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

5. A kinase sequence database: sequence alignments and family assignment.

Authors: Oleksandr Buzko; Kevan M Shokat
Journal: Bioinformatics Date: 2002-09 Impact factor: 6.937

6. Systematic discovery of in vivo phosphorylation networks.

Authors: Rune Linding; Lars Juhl Jensen; Gerard J Ostheimer; Marcel A T M van Vugt; Claus Jørgensen; Ioana M Miron; Francesca Diella; Karen Colwill; Lorne Taylor; Kelly Elder; Pavel Metalnikov; Vivian Nguyen; Adrian Pasculescu; Jing Jin; Jin Gyoon Park; Leona D Samson; James R Woodgett; Robert B Russell; Peer Bork; Michael B Yaffe; Tony Pawson
Journal: Cell Date: 2007-06-14 Impact factor: 41.582

7. Profile hidden Markov models.

Authors: S R Eddy
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

8. Protein disorder prediction: implications for structural proteomics.

Authors: Rune Linding; Lars Juhl Jensen; Francesca Diella; Peer Bork; Toby J Gibson; Robert B Russell
Journal: Structure Date: 2003-11 Impact factor: 5.006

9. Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins.

Authors: Francesca Diella; Scott Cameron; Christine Gemünd; Rune Linding; Allegra Via; Bernhard Kuster; Thomas Sicheritz-Pontén; Nikolaj Blom; Toby J Gibson
Journal: BMC Bioinformatics Date: 2004-06-22 Impact factor: 3.169

10. PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory.

Authors: Yu Xue; Ao Li; Lirong Wang; Huanqing Feng; Xuebiao Yao
Journal: BMC Bioinformatics Date: 2006-03-20 Impact factor: 3.169
