
dbPTM: an information repository of protein post-translational modification.

Tzong-Yi Lee, Hsien-Da Huang, Jui-Hung Hung, Hsi-Yuan Huang, Yuh-Shyong Yang, Tzu-Hao Wang.

Abstract

dbPTM is a database that compiles information on protein post-translational modifications (PTMs), such as the catalytic sites, solvent accessibility of amino acid residues, protein secondary and tertiary structures, protein domains and protein variations. The database includes all of the experimentally validated PTM sites from Swiss-Prot, Phospho.ELM and O-GLYCBASE. Only a small fraction of Swiss-Prot proteins are annotated with experimentally verified PTMs. Although Swiss-Prot provides rich information about PTMs, other structural properties and functional information of proteins are also essential for elucidating protein mechanisms. dbPTM systematically identifies sites of three major types of protein PTM (phosphorylation, glycosylation and sulfation) in Swiss-Prot proteins by refining our previously developed prediction tool, KinasePhos (http://kinasephos.mbc.nctu.edu.tw/). Solvent accessibility and secondary structure of residues are also computationally predicted and mapped to the PTM sites. The resource is now freely available at http://dbPTM.mbc.nctu.edu.tw/.


Year:  2006        PMID: 16381945      PMCID: PMC1347446          DOI: 10.1093/nar/gkj083

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Protein post-translational modification (PTM) is an extremely important cellular control mechanism because it may alter proteins' physical and chemical properties, folding, conformation, distribution, stability, activity and consequently, their functions (1). Examples of the biological effects of protein modifications include phosphorylation for signal transduction, attachment of fatty acids for membrane anchoring and association, and glycosylation for changing protein half-life, targeting substrates, and promoting cell-cell and cell-matrix interactions. With the accelerating progress in proteomics, biological knowledgebases containing a wealth of information, in particular protein modifications, are playing crucial roles in cell regulation research (2). The Swiss-Prot knowledge base (3) includes as much modification information as is available with consistency and structure, allowing easy retrieval by biologists. Phospho.ELM (1), which was developed as part of the ELM (Eukaryotic Linear Motif) resource, is a new resource containing experimentally verified phosphorylation sites that were manually curated from the literature. O-GLYCBASE (4) is a database of glycoproteins, most of which include experimentally verified O-linked glycosylation sites. The RESID protein modification database is a comprehensive collection of annotations and structures for protein modifications and cross-links including pre-, co- and post-translational modifications (5). The RESID database provides modification information, literature citations, Gene Ontology (GO) cross-references, protein sequence database feature table annotations, structure diagrams and molecular models. Each RESID entry presents a protein with a chemically unique modification and indicates how the modification is currently annotated in the Swiss-Prot (6). In this study, we collect the known PTM information from external biological data sources. 
Since only a small fraction of Swiss-Prot proteins are annotated with experimentally verified PTMs, we also developed computational tools to comprehensively identify phosphorylation, glycosylation and sulfation sites in the Swiss-Prot proteins. Protein structural properties and functional information, such as the solvent accessibility of residues, protein variations, non-synonymous single nucleotide polymorphisms (SNPs), protein tertiary structures and protein functional domains, are provided for researchers who are investigating protein PTM mechanisms. A web query interface and graphical visualization were designed and implemented to facilitate access to the database content.

DATA GENERATION

The data generation flow of dbPTM is depicted in Figure 1. It comprises three major components: integration of external known PTM databases, learning and prediction of PTM sites, and structural or functional annotations. The experimentally validated PTM data were extracted from Swiss-Prot (3), Phospho.ELM (1) and O-GLYCBASE (4). The experimentally verified PTM sites were used to train computational models that identify putative PTM sites in the Swiss-Prot proteins. Additional structural properties and functional information, such as protein tertiary structures, protein secondary structures, solvent accessibility of residues, protein functional domains, protein variations and non-synonymous SNPs, are also annotated on the Swiss-Prot proteins. The detailed data generation flow is described below.
Figure 1

The data generation flow of the dbPTM database.

Integration of external known PTM databases

Three external biological databases related to protein PTM information, Swiss-Prot (3), Phospho.ELM (1) and O-GLYCBASE (4), are integrated into the proposed resource. Both the experimentally validated PTM sites and the putative PTM sites, which are annotated as ‘by similarity’, ‘potential’ or ‘probable’ in the ‘MOD_RES’ fields, have been extracted from the Swiss-Prot database (3). As summarized in Table 1, release 46.0 of Swiss-Prot contributes 11 025 experimentally validated PTM sites within 4921 proteins, and 72 308 putative PTM sites within 31 026 proteins. The Phospho.ELM entries store information about substrate proteins for which the exact positions of the residues phosphorylated by cellular kinases are known. A total of 1703 experimentally verified phosphorylation sites within 556 proteins were obtained from Phospho.ELM version 2 (1). O-GLYCBASE (4) version 6.00 provides 242 glycoproteins containing 2765 experimentally verified O-linked, N-linked and C-linked glycosylation sites. Of these, 185 glycoproteins correspond to Swiss-Prot proteins and carry 2353 experimentally verified glycosylation sites.
Table 1

The list of the integrated external data sources

| Databases | Description | Statistics |
| --- | --- | --- |
| Swiss-Prot (3,12) | Experimental PTMs | 11 025 PTM sites within 4921 proteins |
| | Putative PTMs | 72 308 PTM sites within 31 026 proteins |
| | Proteins | 176 469 Proteins |
| | Protein variants | 32 101 Variants corresponding to 6115 proteins |
| PhosphoELM (1) | Experimental phosphorylation sites | 1703 PTM sites within 556 proteins |
| O-GLYCBASE (4) | Experimental glycosylation sites | 2353 PTM sites within 185 glycoproteins |
| RESID (5) | PTMs | 373 PTM types |
| InterPro (14) | Protein domain | 1 113 928 Entries corresponding to 161 988 Swiss-Prot entries |
| Ensembl (13) | Human variations | 23 378 Non-synonymous SNPs within 7230 Swiss-Prot human proteins |
| PDB (15) | Protein structures | 31 721 Entries corresponding to 6806 Swiss-Prot proteins |
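
The split between experimentally validated and putative Swiss-Prot sites described above depends on the qualifier attached to each ‘MOD_RES’ annotation. The following Python fragment is a minimal sketch of that classification, not the authors' extraction code. It assumes a simplified, whitespace-separated feature line of the form `FT   MOD_RES   start   end   description`; the function name `classify_mod_res` and the example lines are hypothetical.

```python
import re

# Qualifiers that mark a non-experimental annotation, as listed in the text.
PUTATIVE_QUALIFIERS = ("by similarity", "potential", "probable")


def classify_mod_res(ft_line):
    """Return (position, modification, status) for one MOD_RES feature line.

    Assumes a simplified layout 'FT MOD_RES <start> <end> <description>'
    with an integer start position; real entries may need stricter parsing.
    Lines that are not MOD_RES features yield None.
    """
    fields = ft_line.split(None, 4)
    if len(fields) < 5 or fields[0] != "FT" or fields[1] != "MOD_RES":
        return None
    position = int(fields[2])
    description = fields[4].strip().rstrip(".")
    is_putative = any(q in description.lower() for q in PUTATIVE_QUALIFIERS)
    # Remove a trailing qualifier such as ' (By similarity)' from the name.
    name = re.sub(r"\s*\((by similarity|potential|probable)\)\s*$", "",
                  description, flags=re.IGNORECASE)
    return position, name, "putative" if is_putative else "experimental"


# Hypothetical example lines, not taken from a real Swiss-Prot entry.
example_lines = [
    "FT   MOD_RES      15     15       Phosphoserine.",
    "FT   MOD_RES      42     42       Phosphothreonine (By similarity).",
    "FT   MOD_RES     101    101       Sulfotyrosine (Potential).",
]
records = [classify_mod_res(line) for line in example_lines]
```

Under these assumptions, the first line is classified as experimental and the other two as putative.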

Learning and prediction of PTM sites

To provide PTM information for Swiss-Prot proteins that lack PTM annotation, we integrated several computational tools for identifying PTMs in the Swiss-Prot proteins. Our previous work, KinasePhos (7), uses profile hidden Markov models (HMMs) to identify kinase-specific phosphorylation sites with ∼87% prediction accuracy (8), and was compared with several phosphorylation prediction tools, such as NetPhos (9), DISPHOS (10) and rBPNN (11) (see Supplementary Table S1). KinasePhos is integrated and used to detect the kinase-specific phosphorylation sites in the Swiss-Prot proteins. To reduce the number of false positive predictions by KinasePhos, we set the predictive parameters to the values at which the prediction specificity is 100% (8). As depicted in Supplementary Figure S1, a KinasePhos-like method, analogous to the KinasePhos approach for phosphorylation sites, was designed and implemented to learn models for predicting sulfation sites, N-linked glycosylation sites and C-linked glycosylation sites (Table 2). We used the 144 known sulfation sites of tyrosine, 1790 N-linked glycosylation sites of asparagine and 49 C-linked glycosylation sites of tryptophan to evaluate the performance of the KinasePhos-like prediction tools. The results suggest that all three KinasePhos-like tools exhibit high prediction precision, sensitivity and specificity (Supplementary Table S2).
Table 2

The list of the integrated annotated tools

| Tools | Description |
| --- | --- |
| KinasePhos (7) | Identifying kinase-specific phosphorylation sites |
| KinasePhos-like sulfation | Identifying sulfation sites |
| KinasePhos-like N-linked glycosylation | Identifying N-linked glycosylation sites |
| KinasePhos-like C-linked glycosylation | Identifying C-linked glycosylation sites |
| DSSP (16) | Calculating the secondary structure and solvent accessibility of residues |
| RVP-net (17) | Predicting the solvent accessibility of residues |
| PSIPRED (18) | Predicting the protein secondary structures |
| Weblogo (15) | Generating sequence logo for PTM substrates |
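
Setting the predictive parameters so that specificity is 100% can be illustrated with a score cutoff. The Python sketch below shows one way to do this: pick the cutoff that rejects every known negative site, then report precision, sensitivity and specificity at that cutoff. It is an illustration of the idea, not the KinasePhos parameter optimization itself. The scores and function names are hypothetical.

```python
def threshold_for_full_specificity(negative_scores):
    """Smallest cutoff that rejects every known negative site.

    A site is predicted positive when its score is strictly greater than
    the cutoff, so the maximum negative score gives no false positives
    on this evaluation set (specificity = 100%).
    """
    return max(negative_scores)


def evaluate(positive_scores, negative_scores, cutoff):
    """Return (precision, sensitivity, specificity) at a given cutoff."""
    tp = sum(s > cutoff for s in positive_scores)
    fn = len(positive_scores) - tp
    fp = sum(s > cutoff for s in negative_scores)
    tn = len(negative_scores) - fp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return precision, sensitivity, specificity


# Hypothetical model scores (for example HMM bit scores); not paper data.
positives = [12.1, 9.4, 7.8, 3.2, 10.5]
negatives = [1.0, 4.6, 2.3, 0.5]

cutoff = threshold_for_full_specificity(negatives)
precision, sensitivity, specificity = evaluate(positives, negatives, cutoff)
```

The trade-off is visible in the toy data: one true site scores below the strict cutoff and is missed, so sensitivity falls while specificity and precision stay at 100%.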

Structural and functional annotations

To provide more effective information about protein structural and functional annotations relevant to protein PTM, a variety of biological databases, such as Swiss-Prot (12), Ensembl (13), InterPro (14), PDB (15) and RESID (5), are integrated. Protein variation is the change of amino acids in polypeptides. As summarized in Table 1, Swiss-Prot contributes 32 101 protein variants corresponding to 6115 proteins, where 47 variant residues are located at PTM sites and 267 variant residues lie within four residues (−4 to +4) of 236 PTM sites. A single amino acid polymorphism (SAP) is the amino acid change that results from a non-synonymous SNP in the genomic sequence. Amino acid variants may have an impact on protein folding, active sites, or the overall solubility and stability of a protein. SAP is the type of variation most frequently related to human diseases (12). Therefore, when amino acid variations occur at PTM sites or in the surrounding residues, they may affect the recognition of PTM sites by catalytic kinases. A total of 23 378 human non-synonymous SNPs located in 7230 Swiss-Prot human proteins were obtained from the variation part of the Ensembl database (13). InterPro provides 1 113 928 entries corresponding to 161 988 Swiss-Prot proteins. We found that about 65% of Swiss-Prot annotated PTM sites are located in InterPro annotated protein domains. The RESID (5) protein modification database is integrated into dbPTM to provide PTM-related information, such as mass difference, chemical formula, enzymatic activities, literature citations, GO cross-references, structure diagrams and molecular models. The latest version of PDB contains 31 721 tertiary structures corresponding to 6806 Swiss-Prot protein entries (Table 1).
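
The counts of variants at PTM sites and in the surrounding −4 to +4 window rest on a simple positional comparison. A minimal Python sketch of that comparison follows, assuming PTM sites and variants are given as 1-based residue numbers on the same protein; the function name and the positions are hypothetical.

```python
def variants_near_ptm_sites(ptm_positions, variant_positions, window=4):
    """Split variants into those at a PTM site and those flanking one.

    'Flanking' means within `window` residues of a PTM site (the text
    uses -4 to +4) but not on the site itself.
    """
    ptm_set = set(ptm_positions)
    unique_variants = set(variant_positions)
    at_site = sorted(v for v in unique_variants if v in ptm_set)
    flanking = sorted(
        v for v in unique_variants
        if v not in ptm_set
        and any(abs(v - p) <= window for p in ptm_set)
    )
    return at_site, flanking


# Hypothetical positions on a single protein (1-based residue numbers).
ptm_sites = [15, 42, 101]
variants = [15, 40, 47, 99, 200]

at_site, flanking = variants_near_ptm_sites(ptm_sites, variants)
```

Here the variant at residue 15 coincides with a PTM site, residues 40 and 99 fall inside a flanking window, and residues 47 and 200 are outside every window.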
For the proteins with known tertiary structures, the DSSP (16) program was used to extract the true secondary structure and solvent accessibility for those 6806 Swiss-Prot proteins. Solvent accessibility of amino acid residues is important for both the structure and function of proteins, especially for the PTMs studied in this investigation. Protein secondary structure is the regular arrangement of amino acid residues in a segment of a polypeptide chain, where each amino acid is assigned a structure state: helix (H), strand (E) or coil (C). A total of 1124 experimentally verified PTMs have a true secondary structure and solvent accessibility. However, only ∼4% of Swiss-Prot proteins have known tertiary structures. For proteins without known tertiary structures, two previously published tools, RVP-net (17) and PSIPRED (18), were applied to predict the solvent accessibility and the secondary structure, respectively (see Table 2). RVP-net (17) is a feed-forward neural network that predicts a real value, ranging from 0 to 100%, for the accessible surface area (ASA) of each amino acid residue, based on its neighborhood information. We applied the RVP-net program (17) to predict the real-valued ASA for the amino acid residues of all Swiss-Prot proteins. Using the suggested threshold (17) of 25%, residues with larger ASA values are viewed as surface residues.
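
The 25% ASA threshold acts as a filter on predicted sites. The sketch below, in Python, assumes per-residue ASA values are available as percentages keyed by residue number. Residues without an ASA value are treated as buried, which is an assumption of this example and not something the text specifies. All names and values are hypothetical.

```python
SURFACE_ASA_THRESHOLD = 25.0  # percent; the cutoff quoted in the text


def surface_sites(predicted_sites, asa_by_position,
                  threshold=SURFACE_ASA_THRESHOLD):
    """Keep predicted PTM sites whose ASA is strictly above the threshold.

    Positions missing from `asa_by_position` are treated as buried
    (ASA = 0), an assumption made only for this illustration.
    """
    return [
        pos for pos in predicted_sites
        if asa_by_position.get(pos, 0.0) > threshold
    ]


# Hypothetical predicted sites and ASA values (percent) for one protein.
predicted = [15, 42, 101, 130]
asa = {15: 62.0, 42: 8.5, 101: 25.0, 130: 30.1}

exposed = surface_sites(predicted, asa)
```

A residue with an ASA of exactly 25% is not counted as a surface residue here, because the comparison is strict.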

DATA STATISTICS

The statistics of the experimentally verified PTMs and the putative PTMs compiled in the dbPTM resource are shown in Table 3. In total, dbPTM contains 14 057 known PTM sites and 772 154 putative PTM sites. The parameters of the predictive tools (KinasePhos, KinasePhos-like Sulfation and KinasePhos-like Glycosylation, for the prediction of phosphorylation, sulfation and glycosylation sites, respectively) are set to the values at which the predictive specificity is 100% during the parameter optimization of the trained models (8). The numbers of putative phosphorylation and sulfation sites whose substrate residues have an ASA >25% (that is, residues located at the protein surface) are 652 756 and 13 315, respectively. A total of 33 887 N-linked glycosylation sites of asparagine and C-linked glycosylation sites of tryptophan are predicted.
Table 3

The data statistics of the dbPTM database

| PTM types | Substrates | No. of known PTMs | No. of putative PTMs | Total |
| --- | --- | --- | --- | --- |
| Phosphorylation | Serine, threonine, tyrosine, aspartate, histidine or cysteine | 3367 | 5852 | |
| | Serine, threonine and tyrosine (predicted in this resource, ASA > 0%) | | 1 346 067 | 661 975 (ASA > 25%) |
| | Serine, threonine and tyrosine (predicted in this resource, ASA > 25%) | | 652 756 | |
| Glycosylation | N-linked, O-linked and C-linked glycosylation | 4586 | 55 059 | |
| | N-linked asparagine and C-linked tryptophan (predicted in this resource, ASA > 0%) | | 43 894 | 94 132 (ASA > 25%) |
| | N-linked asparagine and C-linked tryptophan (predicted in this resource, ASA > 25%) | | 33 887 | |
| Sulfation | Serine, threonine and tyrosine | 144 | 413 | |
| | Tyrosine (predicted in this resource, ASA > 0%) | | 189 457 | 13 872 (ASA > 25%) |
| | Tyrosine (predicted in this resource, ASA > 25%) | | 13 315 | |
| Lipidation | GPI-anchor, N-terminal myristoylation and palmitoylation | 520 | 4688 | 5208 |
| Acetylation | N-terminal of some residues and side chain of lysine or cysteine | 1019 | 1580 | 2599 |
| Amidation | Generally at the C-terminal of a mature active peptide after oxidative cleavage of last glycine | 1554 | 523 | 2077 |
| Methylation | Generally of N-terminal phenylalanine, side chain of lysine, arginine, histidine, asparagine or glutamate and C-terminal cysteine | 455 | 1105 | 1560 |
| Hydroxylation | Generally of asparagine, aspartate, proline or lysine | 816 | 515 | 1331 |
| Pyrrolidone carboxylic acid | N-terminal glutamine which has formed an internal cyclic lactam | 567 | 408 | 975 |
| Gamma-carboxyglutamic acid | 4-Carboxyglutamate | 343 | 263 | 606 |
| Trimethylation | N6-methylated lysine, N6,N6,N6-trimethyllysine, N,N,N-trimethylalanine | 158 | 294 | 452 |
| Blocked | Unidentified N- or C-terminal blocking group | 108 | 10 | 118 |
| FAD | O-8alpha-FAD tyrosine, Pros-8alpha-FAD histidine, S-8alpha-FAD cysteine and Tele-8alpha-FAD histidine | 12 | 77 | 89 |
| S-nitrosylation | S-nitrosocysteine | 5 | 59 | 64 |
| Formylation | Of the N-terminal methionine | 35 | 27 | 62 |
| Deamidation | Deamidated asparagine and deamidated glutamine (needs to be followed by a G) | 33 | 18 | 51 |
| Citrullination | Citrulline | 7 | 41 | 48 |
| Others | | 328 | 1274 | 1134 |
| Total | | 14 057 | 772 154 | 786 211 |

ASA, accessible surface area.
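
The Total column for the predicted PTM types can be read as the sum of the known sites, the putative sites annotated in the source databases and the sites predicted in this resource with ASA > 25%. The short Python check below applies that reading to the phosphorylation and sulfation rows. The assignment of the flattened table figures to these three columns is our interpretation of Table 3, and the dictionary keys are hypothetical names.

```python
# Figures read from Table 3; the split into three components is an
# interpretation of the flattened table, not a statement by the authors.
table3_rows = {
    "Phosphorylation": {
        "known": 3367,
        "putative_annotated": 5852,
        "predicted_asa_gt_25": 652_756,
        "total": 661_975,
    },
    "Sulfation": {
        "known": 144,
        "putative_annotated": 413,
        "predicted_asa_gt_25": 13_315,
        "total": 13_872,
    },
}


def recomputed_total(row):
    """Sum the three components that are assumed to make up the total."""
    return (row["known"] + row["putative_annotated"]
            + row["predicted_asa_gt_25"])


consistent = {
    name: recomputed_total(row) == row["total"]
    for name, row in table3_rows.items()
}
```

For phosphorylation this gives 3367 + 5852 + 652 756 = 661 975, and for sulfation 144 + 413 + 13 315 = 13 872, matching the listed totals.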

INTERFACE

To facilitate the use of the dbPTM resource, we developed a website for users to browse and search its content. As depicted in Supplementary Figure S2, the user can select a particular type of PTM for browsing. When the user clicks on a PTM entry, a window pops up showing the solvent accessibility of the residues, the secondary structures and the flanking sequence of the PTM site. The search pages allow users to query the database by Swiss-Prot ID and protein name. The interface also presents structural properties and functional information for the resulting proteins, such as the solvent accessibility of residues, non-synonymous variations, protein domains and protein secondary structures. Furthermore, the positional relationships among the PTMs, protein structural properties and protein functional information are graphically displayed (Figure 2).
Figure 2

The graphical interface reveals the PTMs, the solvent accessibility of the residues, protein variations, protein secondary structures and protein functional domains.

Generally, a 3D presentation is an effective way to reveal the PTM information in the context of protein tertiary structures. For this purpose, we developed a protein structure viewer for the visualization of protein tertiary structures and especially of the post-translationally modified residues. As shown in Supplementary Figure S3, the visualization tool provides a comprehensive view of the whole protein structure and marks residues that are annotated as PTM sites. This visualization tool is implemented as a client-side tool based on the OpenGL pipeline. The protein structures and the annotated residues are visualized in two different ways, depending on the user's platform. Users of MS Windows can download the installable package of Silver. After Silver is installed, the protein tertiary structures and the PTM sites are displayed graphically, as shown in Supplementary Figure S3. Alternatively, users of other platforms, such as Mac OS X, Linux and Solaris, can download the PDB structure and the RasMol scripts for labeling the PTM sites.
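
The scripts distributed by dbPTM are not reproduced in this article. As a rough illustration of what a script that highlights PTM residues might contain, the Python sketch below writes a short RasMol command sequence that shows the backbone and renders selected residues as coloured spheres. The function name, file name and residue numbers are hypothetical, and the choice of display commands is an assumption of this example.

```python
def rasmol_script(pdb_filename, ptm_residue_numbers, colour="red"):
    """Build a RasMol script that highlights the given residue numbers.

    Uses basic RasMol commands (load, wireframe, backbone, select,
    spacefill, colour). Chain identifiers are ignored in this sketch.
    """
    lines = [
        f"load pdb {pdb_filename}",
        "wireframe off",
        "backbone",
    ]
    for number in sorted(set(ptm_residue_numbers)):
        # Select one residue by number and draw it as coloured spheres.
        lines += [f"select {number}", "spacefill", f"colour {colour}"]
    lines.append("select all")
    return "\n".join(lines) + "\n"


# Hypothetical structure file and PTM positions.
script_text = rasmol_script("example.pdb", [101, 15])
```

The returned text can be saved to a file and run from within RasMol with its `script` command.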

CONCLUSIONS

The proposed resource not only integrates the experimentally validated PTM information, but also computationally annotates the Swiss-Prot proteins with putative phosphorylation, glycosylation and sulfation sites. Furthermore, PTM-related protein structural properties and functional information, such as solvent accessibility of amino acid residues, protein variations, protein secondary structures, protein tertiary structures and protein domains, are provided to facilitate research on protein PTMs. One of the prospective goals for dbPTM is to integrate more efficient prediction tools for other types of PTM in addition to phosphorylation, sulfation and N- and C-linked glycosylation. Protein sequence databases other than Swiss-Prot can also be considered and annotated with post-translational modifications by the proposed resource.

AVAILABILITY

The dbPTM resource will be regularly maintained and updated. The resource is now freely available at http://dbPTM.mbc.nctu.edu.tw/.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.
  18 in total

1.  Sequence and structure-based prediction of eukaryotic protein phosphorylation sites.

Authors:  N Blom; S Gammeltoft; S Brunak
Journal:  J Mol Biol       Date:  1999-12-17       Impact factor: 5.469

2.  The PSIPRED protein structure prediction server.

Authors:  L J McGuffin; K Bryson; D T Jones
Journal:  Bioinformatics       Date:  2000-04       Impact factor: 6.937

3.  Annotation of post-translational modifications in the Swiss-Prot knowledge base.

Authors:  Nathalie Farriol-Mathis; John S Garavelli; Brigitte Boeckmann; Séverine Duvaud; Elisabeth Gasteiger; Alain Gateau; Anne-Lise Veuthey; Amos Bairoch
Journal:  Proteomics       Date:  2004-06       Impact factor: 3.984

4.  The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants.

Authors:  Yum L Yip; Holger Scheib; Alexander V Diemand; Alexandre Gattiker; Livia M Famiglietti; Elisabeth Gasteiger; Amos Bairoch
Journal:  Hum Mutat       Date:  2004-05       Impact factor: 4.878

5.  Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.

Authors:  W Kabsch; C Sander
Journal:  Biopolymers       Date:  1983-12       Impact factor: 2.505

6.  O-GLYCBASE version 4.0: a revised database of O-glycosylated proteins.

Authors:  R Gupta; H Birch; K Rapacki; S Brunak; J E Hansen
Journal:  Nucleic Acids Res       Date:  1999-01-01       Impact factor: 16.971

7.  KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.

Authors:  Hsien-Da Huang; Tzong-Yi Lee; Shih-Wei Tzeng; Jorng-Tzong Horng
Journal:  Nucleic Acids Res       Date:  2005-07-01       Impact factor: 16.971

8.  Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins.

Authors:  Francesca Diella; Scott Cameron; Christine Gemünd; Rune Linding; Allegra Via; Bernhard Kuster; Thomas Sicheritz-Pontén; Nikolaj Blom; Toby J Gibson
Journal:  BMC Bioinformatics       Date:  2004-06-22       Impact factor: 3.169

9.  The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema.

Authors:  Nita Deshpande; Kenneth J Addess; Wolfgang F Bluhm; Jeffrey C Merino-Ott; Wayne Townsend-Merino; Qing Zhang; Charlie Knezevich; Lie Xie; Li Chen; Zukang Feng; Rachel Kramer Green; Judith L Flippen-Anderson; John Westbrook; Helen M Berman; Philip E Bourne
Journal:  Nucleic Acids Res       Date:  2005-01-01       Impact factor: 16.971

10.  Ensembl 2005.

Authors:  T Hubbard; D Andrews; M Caccamo; G Cameron; Y Chen; M Clamp; L Clarke; G Coates; T Cox; F Cunningham; V Curwen; T Cutts; T Down; R Durbin; X M Fernandez-Suarez; J Gilbert; M Hammond; J Herrero; H Hotz; K Howe; V Iyer; K Jekosch; A Kahari; A Kasprzyk; D Keefe; S Keenan; F Kokocinsci; D London; I Longden; G McVicker; C Melsopp; P Meidl; S Potter; G Proctor; M Rae; D Rios; M Schuster; S Searle; J Severin; G Slater; D Smedley; J Smith; W Spooner; A Stabenau; J Stalker; R Storey; S Trevanion; A Ureta-Vidal; J Vogel; S White; C Woodwark; E Birney
Journal:  Nucleic Acids Res       Date:  2005-01-01       Impact factor: 16.971

