GolgiP: prediction of Golgi-resident proteins in plants.

Wen-Chi Chou1, Yanbin Yin, Ying Xu.   

Abstract

We present a novel Golgi-prediction server, GolgiP, for computational prediction of both membrane- and non-membrane-associated Golgi-resident proteins in plants. We employ a support vector machine-based classification method to predict such Golgi proteins from three types of information: dipeptide composition, transmembrane domain(s) (TMDs) and functional domain(s) of a protein. The functional domain information is generated by searching against the Conserved Domains Database, and the TMD information includes the number of TMDs, the length of the TMDs and the number of TMDs at the N-terminus of a protein. Using GolgiP, we have made genome-scale predictions of Golgi-resident proteins in 18 plant genomes and carried out a preliminary analysis of the predicted data. AVAILABILITY: The GolgiP web service is publicly available at http://csbl1.bmb.uga.edu/GolgiP/.

Year:  2010        PMID: 20733061      PMCID: PMC2944200          DOI: 10.1093/bioinformatics/btq446

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

The Golgi apparatus is an essential organelle found in most, if not all, eukaryotic cells, serving as an intermediate station of the secretory pathway that transports proteins out of a cell. The Golgi is also a major site for protein post-translational modification (e.g. glycosylation; Nilsson et al., 2009) and for the synthesis of various polysaccharides. Plant cell walls are composed mainly of lignins, glycosylated proteins and polysaccharides, most of which are synthesized in the Golgi (Lerouxel et al., 2006). Identifying Golgi-resident proteins is a challenging and highly important problem for understanding the biological processes that take place in the Golgi. Although 1183 mouse and human Golgi-resident proteins have been identified (Sprenger et al., 2007), only a little over 400 plant Golgi proteins have been identified experimentally. A key challenge is that plant Golgi proteins do not seem to have obvious targeting signals, unlike proteins targeted to other cellular compartments such as the nucleus or the extracellular space. Most existing computational tools for subcellular localization prediction are designed for general use, and their predictions for Golgi-resident proteins are less than adequate (Sprenger et al., 2006). Only one program has been designed specifically for predicting Golgi-localized proteins, and it covers only transmembrane Golgi proteins (Yuan and Teasdale, 2002). However, only 25% of the Golgi proteins of Arabidopsis thaliana are estimated to contain transmembrane regions (Schwacke et al., 2003), which indicates the inadequacy of the current programs. Based on this consideration, we have designed a support vector machine (SVM)-based classifier, called GolgiP, to predict both transmembrane and non-transmembrane Golgi-localized proteins. GolgiP currently provides multiple models for predicting plant Golgi proteins, so that users can choose one according to their specific needs.

2 METHODS AND DATASET

We collected a large dataset comprising 402 known Golgi proteins and 5703 known non-Golgi proteins from A.thaliana (91.2%), Oryza sativa (8.2%) and other plants (0.7%), drawn from the SUBA (Heazlewood et al., 2007) and UniProt (Apweiler et al., 2004) databases and from manual curation of the published literature. The non-Golgi proteins are proteins that have subcellular localization annotations in these databases, but not in the Golgi. Redundant sequences were removed with CD-hit using 95% sequence identity as the cut-off (Li and Godzik, 2006). The dataset was randomly partitioned so that four-fifths of the data were used to train the classifier and the remaining one-fifth was used to test the trained classifier.

To train an SVM-based classifier for Golgi proteins, we examined three sets of features, all computed from protein sequences. The first set is the dipeptide composition (DiAA): for each protein in our training set, we calculated the composition of dipeptides. The second set describes transmembrane domains (TMDs). We used TMHMM (Krogh et al., 2001) and Phobius (Kall et al., 2004) to predict the number of TMDs, the average length of TMDs, the number of TMDs within the N-terminal region of 70 amino acids, the length of the first TMD within the N-terminal region, and the orientation of the N-terminus (i.e. on the cytosolic side or on the Golgi lumen side). The third set describes functional domains (FunD). We searched the proteins in our datasets against the Conserved Domains Database (CDD) using RPS-BLAST (Marchler-Bauer et al., 2009) with an e-value cutoff <0.01. We did this because the Golgi apparatus is where proteins receive post-translational modifications such as glycosylation (Nilsson et al., 2009) and where most polysaccharides are synthesized (Nilsson et al., 2009). In addition, Komatsu et al. (2007) found that the distributions of functional categories of proteins differ among membranes such as the plasma membrane, vascular membrane and Golgi membrane. Hence, enzymes for Golgi-related activities are expected to be located in the Golgi. The conserved domains found for the Golgi proteins were then collected as the third set of features.

We used the LIBSVM package (Fan et al., 2005) to train the classifier, with the Radial Basis Function kernel, and tuned the cost (c) and gamma (γ) parameters to optimize classification performance on the training dataset.
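The first two feature sets can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: dipeptides are counted over the 20 standard amino acids and normalised by the number of valid dipeptides in the sequence, a TMD counts as N-terminal if it starts within the first 70 residues, and TMD predictions arrive as a list of (start, end) positions. The function names and the input format are hypothetical; this does not parse real TMHMM or Phobius output and is not the authors' implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered amino acid pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element vector of dipeptide frequencies.

    Assumption: pairs containing non-standard residues (e.g. X) are skipped,
    and counts are divided by the number of valid pairs.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def tmd_features(segments, n_term_len=70):
    """Summarise predicted TMDs given as 1-based inclusive (start, end) pairs.

    Assumption: a TMD belongs to the N-terminal region if it starts within
    the first `n_term_len` residues.
    """
    lengths = [end - start + 1 for start, end in segments]
    n_term = [end - start + 1 for start, end in sorted(segments)
              if start <= n_term_len]
    return {
        "n_tmd": len(lengths),
        "mean_tmd_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "n_tmd_nterm": len(n_term),
        "first_nterm_tmd_length": n_term[0] if n_term else 0,
    }


# Usage with a made-up sequence and made-up TMD positions.
vector = dipeptide_composition("MKTLLVACAC")
print(len(vector))
print(tmd_features([(10, 30), (100, 122)]))
```

Vectors of this kind, one per protein, are what an SVM with a Radial Basis Function kernel would take as input. The orientation of the N-terminus and the functional domain features are omitted here because they depend on the output of external tools.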

3 RESULTS AND DISCUSSION

We trained three SVM classification models, one on each of the three sets of sequence features described above, and a comprehensive model on all three sets combined. The training performances are shown in the Supplementary Material. Using the test dataset, we compared these models with other Golgi protein prediction tools: PSORT (Nakai and Horton, 1999), WoLF PSORT (Horton et al., 2007) and Yuan's Golgi predictor (Yuan and Teasdale, 2002). As shown in Table 1, Yuan's Golgi predictor has good sensitivity but the lowest specificity and the lowest accuracy. PSORT and WoLF PSORT, two general subcellular localization prediction tools, show a moderate level of classification performance, which our analysis suggests may not be adequate for a plant Golgi protein predictor. The GolgiP comprehensive and functional domain models show the best overall performance, with the highest accuracy (96.73% and 97.21%) and Matthews correlation coefficient (MCC; 0.73 and 0.75).
Table 1. Evaluation of Golgi protein prediction tools

Tools                  Sensitivity (%)  Specificity (%)  Accuracy (%)  MCC
Yuan                   71.64            23.18            26.37         −0.03
WoLF PSORT             15.92            92.69            87.63         0.08
GolgiP-TMD             61.73            67.75            67.75         0.15
PSORT                  43.53            83.10            80.49         0.17
GolgiP-DiAA            71.64            80.76            80.16         0.31
GolgiP-Comprehensive   72.84            98.42            96.73         0.73
GolgiP-FunD            57.50            100.00           97.21         0.75

The performances are sorted by MCC.
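For reference, the four measures reported in Table 1 follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions. The function name is hypothetical and the counts in the usage example are invented for illustration; they are not taken from the paper's test set.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Compute sensitivity, specificity, accuracy and MCC from raw counts.

    tp/fn: Golgi proteins predicted as Golgi / non-Golgi.
    fp/tn: non-Golgi proteins predicted as Golgi / non-Golgi.
    """
    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is empty.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Illustrative counts only (hypothetical, not from the paper).
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because non-Golgi proteins greatly outnumber Golgi proteins in the dataset, accuracy is dominated by specificity. MCC takes all four counts into account, which is one reason to sort the tools by it.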

We applied GolgiP with the functional domain model to predict Golgi proteins in 18 selected fully sequenced plant genomes using the same cutoff. We chose the functional domain model because it has the best specificity and therefore tends to avoid false positives in genome-wide prediction. The numbers and percentages of predicted Golgi proteins in these organisms are shown in Table 2. Across algae, moss, monocot and dicot plants, predicted Golgi proteins account on average for 7.25% of all proteins encoded by these genomes. The stability of this percentage across genomes indirectly suggests that our predictions are reliable. The percentage of Golgi proteins remains similar from lower to higher plant species, which may suggest that the functionality of the Golgi apparatus evolved and matured fairly early in plant evolution.
Table 2. Application of the GolgiP program on 18 plant genomes

Clade        Species                             # predicted Golgi proteins/# total proteins  Percentage
Red algae    Cyanidioschyzon merolae 10D         430/5014                                     8.58
Green algae  Micromonas pusilla CCMP1545         716/10475                                    6.84
Green algae  Micromonas strain RCC299            833/9815                                     8.49
Green algae  Ostreococcus lucimarinus            706/7651                                     9.23
Green algae  Ostreococcus tauri                  656/7725                                     8.49
Green algae  Chlamydomonas reinhardtii           982/14598                                    6.73
Green algae  Volvox carteri f. nagariensis       1025/15544                                   6.59
Moss         Physcomitrella patens ssp patens    2344/35938                                   6.52
Spike moss   Selaginella moellendorffii          2912/34697                                   8.39
Monocot      Oryza sativa                        4240/67393                                   6.29
Monocot      Brachypodium distachyon             2446/32255                                   7.58
Monocot      Sorghum bicolor                     2197/35899                                   6.12
Monocot      Zea mays                            4748/75387                                   6.30
Dicot        Vitis vinifera                      2008/30434                                   6.60
Dicot        Arabidopsis thaliana                2727/33410                                   8.16
Dicot        Medicago truncatula                 1856/30028                                   6.18
Dicot        Glycine max                         5262/75778                                   6.94
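The percentage column of Table 2 is the number of predicted Golgi proteins divided by the total number of proteins in the genome. A short sketch of that arithmetic, using three rows copied from the table (the function and variable names are hypothetical):

```python
def golgi_percentage(predicted, total):
    """Percentage of a genome's proteins that are predicted to be Golgi-resident."""
    return 100.0 * predicted / total


# (predicted Golgi proteins, total proteins), copied from Table 2.
table2_rows = {
    "Arabidopsis thaliana": (2727, 33410),
    "Oryza sativa": (4240, 67393),
    "Zea mays": (4748, 75387),
}

for species, (predicted, total) in table2_rows.items():
    print(f"{species}: {golgi_percentage(predicted, total):.2f}%")
```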
In conclusion, we developed a Golgi protein prediction tool, GolgiP, and demonstrated its superior performance over existing prediction servers in predicting plant Golgi proteins. In addition, our predictions across multiple plant genomes provide an estimate of the percentage of Golgi proteins in different plant organisms, which is in general agreement with previous estimates.
REFERENCES (10 of 15 shown)

1. A Krogh; B Larsson; G von Heijne; E L Sonnhammer. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol, 2001-01-19.

2. K Nakai; P Horton. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem Sci, 1999-01.

3. Rolf Apweiler; Amos Bairoch; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res, 2004-01-01.

4. Lukas Käll; Anders Krogh; Erik L L Sonnhammer. A combined transmembrane topology and signal peptide prediction method. J Mol Biol, 2004-05-14.

5. Setsuko Komatsu; Hirosato Konishi; Makoto Hashimoto. The proteomics of plant cell membranes (review). J Exp Bot, 2006-06-27.

6. Weizhong Li; Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 2006-05-26.

7. Tommy Nilsson; Catherine E Au; John J M Bergeron. Sorting out glycosylation enzymes in the Golgi apparatus (review). FEBS Lett, 2009-10-28.

8. Olivier Lerouxel; David M Cavalier; Aaron H Liepman; Kenneth Keegstra. Biosynthesis of plant cell wall polysaccharides - a complex process (review). Curr Opin Plant Biol, 2006-10-02.

9. Rainer Schwacke; Anja Schneider; Eric van der Graaff; Karsten Fischer; Elisabetta Catoni; Marcelo Desimone; Wolf B Frommer; Ulf-Ingo Flügge; Reinhard Kunze. ARAMEMNON, a novel database for Arabidopsis integral membrane proteins. Plant Physiol, 2003-01.

10. Joshua L Heazlewood; Robert E Verboom; Julian Tonti-Filippini; Ian Small; A Harvey Millar. SUBA: the Arabidopsis Subcellular Database. Nucleic Acids Res, 2006-10-28.
