GolgiP: prediction of Golgi-resident proteins in plants.

Abstract

UNLABELLED: We present a novel Golgi-prediction server, GolgiP, for computational prediction of both membrane- and non-membrane-associated Golgi-resident proteins in plants. We employ a support vector machine-based classification method that uses three types of information: dipeptide composition, transmembrane domains (TMDs) and functional domains of a protein. The functional domain information is generated by searching against the Conserved Domains Database, and the TMD information includes the number of TMDs, the length of the TMDs and the number of TMDs at the N-terminus of a protein. Using GolgiP, we have made genome-scale predictions of Golgi-resident proteins in 18 plant genomes and carried out a preliminary analysis of the predicted data. AVAILABILITY: The GolgiP web service is publicly available at http://csbl1.bmb.uga.edu/GolgiP/.

Year: 2010 PMID: 20733061 PMCID: PMC2944200 DOI: 10.1093/bioinformatics/btq446

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The Golgi apparatus is an essential organelle found in most, if not all, eukaryotic cells, serving as an intermediate station of the secretory pathway that transports proteins out of the cell. The Golgi is also a major site of protein post-translational modification (e.g. glycosylation; Nilsson et al., 2009) and of the synthesis of various polysaccharides. Plant cell walls consist mainly of lignins, glycosylated proteins and polysaccharides, most of which are synthesized in the Golgi (Lerouxel et al., 2006). Identifying Golgi-resident proteins is a challenging and highly important problem for understanding the biological processes that take place in the Golgi. Although 1183 mouse and human Golgi-resident proteins have been identified (Sprenger et al., 2007), only a little over 400 plant Golgi proteins have been identified experimentally. A key challenge is that plant Golgi proteins do not appear to carry obvious targeting signals of the kind found in proteins targeted to other cellular compartments, such as the nucleus or the extracellular space. Most existing computational tools for subcellular localization are designed for general-purpose prediction, and their predictions for Golgi-resident proteins are less than adequate (Sprenger et al., 2006). Only one program has been designed specifically to predict Golgi-localized proteins, and it covers only transmembrane Golgi proteins (Yuan and Teasdale, 2002). However, only 25% of the Golgi proteins of Arabidopsis thaliana are estimated to contain transmembrane regions (Schwacke et al., 2003), which indicates the inadequacy of the current programs. With this in mind, we have designed a support vector machine (SVM)-based classifier, called GolgiP, to predict both transmembrane and non-transmembrane Golgi-localized proteins. GolgiP currently provides multiple models for predicting plant Golgi proteins, so that a user can choose one according to specific needs.

2 METHODS AND DATASET

We have collected a large dataset comprising 402 known Golgi proteins and 5703 known non-Golgi proteins of A. thaliana (91.2%), Oryza sativa (8.2%) and other plants (0.7%), taken from the SUBA (Heazlewood et al., 2007) and UniProt (Apweiler et al., 2004) databases and manually curated from the published literature. The non-Golgi proteins are proteins that have subcellular localization annotations in these databases, but not in the Golgi. Redundant sequences were removed with CD-hit using 95% sequence identity as the cut-off (Li and Godzik, 2006). The dataset was then randomly partitioned: four-fifths of the data was used to train the classifier and the remaining one-fifth to test it.

To train an SVM-based classifier for Golgi proteins, we examined three different sets of features, all computed from protein sequences. The first set is the dipeptide composition (DiAA), calculated for each protein in our training set. The second set is related to transmembrane domains (TMDs). We used TMHMM (Krogh et al., 2001) and Phobius (Käll et al., 2004) to predict the number of TMDs, the average length of the TMDs, the number of TMDs within the N-terminal region consisting of 70 amino acids, the length of the first TMD within the N-terminal region, and the orientation of the N-terminus (i.e. on the cytosolic side or on the Golgi lumen side). The third set is related to functional domains (FunD). We searched the proteins in our datasets against the Conserved Domains Database (CDD) using RPS-BLAST (Marchler-Bauer et al., 2009) with an E-value cutoff of <0.01. We did this because the Golgi apparatus is where proteins receive post-translational modifications such as glycosylation (Nilsson et al., 2009), and where most polysaccharides are synthesized (Nilsson et al., 2009). In addition, Komatsu et al. (2007) found that the distributions of functional categories of proteins differ between membranes such as the plasma membrane, vascular membrane and Golgi membrane. Hence, enzymes for Golgi-related activities are expected to be located in the Golgi. The conserved domains found in the Golgi proteins were collected as the third set of features.

We applied the LIBSVM package (Fan et al., 2005) to train the classifier. We used the Radial Basis Function kernel, and tuned the cost (c) and gamma (γ) parameters to optimize the classification performance on the training dataset.
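The first two feature sets can be illustrated with a minimal Python sketch. It is not the GolgiP implementation: the function names are hypothetical, the normalization of dipeptide counts to frequencies is an assumption (the text does not state it), and a TMD is assumed to lie in the N-terminal region when it ends within the first 70 residues. TMD segments are assumed to be already available as (start, end) pairs, for example parsed from TMHMM or Phobius output; the N-terminus orientation feature is omitted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return 400 dipeptide frequencies for a protein sequence (assumed normalization)."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues such as X
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def tmd_features(tmds, n_term_len=70):
    """Summarise predicted TMDs given as 1-based inclusive (start, end) pairs."""
    tmds = sorted(tmds)
    lengths = [end - start + 1 for start, end in tmds]
    # Assumption: a TMD is "N-terminal" if it ends within the first n_term_len residues.
    n_term = [(start, end) for start, end in tmds if end <= n_term_len]
    return {
        "n_tmd": len(tmds),
        "mean_tmd_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "n_tmd_nterm": len(n_term),
        "first_nterm_tmd_len": n_term[0][1] - n_term[0][0] + 1 if n_term else 0,
    }


# Hypothetical inputs, for illustration only.
vector = dipeptide_composition("MKKLLA")
features = tmd_features([(12, 34), (90, 110)])
```

The resulting vectors could then be concatenated and passed to an SVM trainer; that step is left out here because it depends on the LIBSVM setup, which the text does not detail.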

3 RESULTS AND DISCUSSION

We used the three sets of sequence features described above to train three SVM classification models. In addition, we combined all three sets of features to train a comprehensive model. The training performances are shown in Supplementary Material. Using the test dataset, we compared these models with other Golgi protein prediction tools, including PSORT (Nakai and Horton, 1999), WoLF PSORT (Horton et al., 2007) and Yuan's Golgi predictor (Yuan and Teasdale, 2002). As shown in Table 1, Yuan's Golgi predictor has good sensitivity but the lowest specificity and the lowest accuracy. PSORT and WoLF PSORT are general subcellular localization prediction tools and show a moderate level of classification performance, which, based on our analysis, may not be adequate for a plant Golgi protein predictor. Our program, GolgiP, shows the best overall performance: its functional-domain and comprehensive models achieve the highest accuracy and Matthews correlation coefficient (MCC).

Table 1.

Evaluation of Golgi protein prediction tools

Tools	Sensitivity (%)	Specificity (%)	Accuracy (%)	MCC
Yuan	71.64	23.18	26.37	−0.03
WoLF PSORT	15.92	92.69	87.63	0.08
GolgiP-TMD	61.73	67.75	67.75	0.15
PSORT	43.53	83.10	80.49	0.17
GolgiP-DiAA	71.64	80.76	80.16	0.31
GolgiP-Comprehensive	72.84	98.42	96.73	0.73
GolgiP-FunD	57.50	100.00	97.21	0.75

The performances are sorted by MCC.
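For readers who want to reproduce measures of this kind, the sketch below computes sensitivity, specificity, accuracy and MCC from confusion-matrix counts using their standard definitions. The function name and the example counts are hypothetical; the per-tool counts behind Table 1 are not given in the text.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification measures from confusion-matrix counts.

    Requires at least one positive (tp + fn > 0) and one negative (tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, fn=2, tn=85, fp=5)
```

Because the test set is dominated by non-Golgi proteins, accuracy mostly tracks specificity, which is one reason MCC is a useful sort key for a table like this.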

We have applied GolgiP with the functional domain model to predict Golgi proteins in 18 selected fully sequenced plant genomes, using the same cutoff. We chose the functional domain model because it has the best specificity and therefore tends to avoid false positives in this genome-wide prediction. The numbers and percentages of predicted Golgi proteins in these organisms are shown in Table 2. Across algae, moss, monocot and dicot plants, the average percentage of predicted Golgi proteins among all proteins encoded by these genomes is 7.25%. The stability of this percentage across different genomes indirectly suggests the reliability of our predictions. The percentage of Golgi proteins is similar from lower to higher plant species, which may suggest that the functionality of the Golgi apparatus evolved and matured fairly early in plant evolution.

Table 2.

Application of the GolgiP program on 18 plant genomes

Clade	Species	# predicted Golgi proteins/# Total proteins	Percentage
Red algae	Cyanidioschyzon merolae 10D	430/5014	8.58
Green algae	Micromonas pusilla CCMP1545	716/10475	6.84
Green algae	Micromonas strain RCC299	833/9815	8.49
Green algae	Ostreococcus lucimarinus	706/7651	9.23
Green algae	Ostreococcus tauri	656/7725	8.49
Green algae	Chlamydomonas reinhardtii	982/14598	6.73
Green algae	Volvox carteri f. nagariensis	1025/15544	6.59
Moss	Physcomitrella patens ssp patens	2344/35938	6.52
Spike moss	Selaginella moellendorffii	2912/34697	8.39
Monocot	Oryza sativa	4240/67393	6.29
Monocot	Brachypodium distachyon	2446/32255	7.58
Monocot	Sorghum bicolor	2197/35899	6.12
Monocot	Zea mays	4748/75387	6.30
Dicot	Vitis vinifera	2008/30434	6.60
Dicot	Arabidopsis thaliana	2727/33410	8.16
Dicot	Medicago truncatula	1856/30028	6.18
Dicot	Glycine max	5262/75778	6.94
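The percentage column is the number of predicted Golgi proteins divided by the total number of proteins in each genome. A small Python sketch that recomputes it for three rows of Table 2 is given below; the dictionary layout and function name are illustrative choices, not part of GolgiP.

```python
# (predicted Golgi proteins, total proteins) taken from three rows of Table 2.
TABLE2_ROWS = {
    "Cyanidioschyzon merolae 10D": (430, 5014),
    "Oryza sativa": (4240, 67393),
    "Arabidopsis thaliana": (2727, 33410),
}


def percentage(predicted, total):
    """Percentage of predicted Golgi proteins, rounded to two decimals."""
    return round(100.0 * predicted / total, 2)


percentages = {
    species: percentage(predicted, total)
    for species, (predicted, total) in TABLE2_ROWS.items()
}
```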

In conclusion, we have developed a Golgi protein prediction tool, GolgiP, and demonstrated that it outperforms the existing prediction servers in predicting plant Golgi proteins. In addition, our predictions across multiple plant genomes provide an estimate of the percentage of Golgi proteins in different plant organisms, which is in general agreement with previous estimates.

15 in total

1. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

2. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization.

Authors: K Nakai; P Horton
Journal: Trends Biochem Sci Date: 1999-01 Impact factor: 13.807

3. UniProt: the Universal Protein knowledgebase.

Authors: Rolf Apweiler; Amos Bairoch; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971


4. A combined transmembrane topology and signal peptide prediction method.

Review 5. The proteomics of plant cell membranes.

6. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Review 7. Sorting out glycosylation enzymes in the Golgi apparatus.

Review 8. Biosynthesis of plant cell wall polysaccharides - a complex process.

9. ARAMEMNON, a novel database for Arabidopsis integral membrane proteins.

10. SUBA: the Arabidopsis Subcellular Database.
