PlantLoc: an accurate web server for predicting plant protein subcellular localization by substantiality motif.

Shengnan Tang, Tonghua Li, Peisheng Cong, Wenwei Xiong, Zhiheng Wang, Jiangming Sun.

Abstract

Knowledge of the subcellular localizations (SCLs) of plant proteins relates to their functions and aids in understanding the regulation of biological processes at the cellular level. We present PlantLoc, a highly accurate and fast webserver for predicting the multi-label SCLs of plant proteins. The PlantLoc server has two innovative features: it builds localization motif libraries by a recursive method that uses neither alignment nor Gene Ontology information, and it uses a simple architecture that identifies plant protein SCLs rapidly and accurately without a machine learning algorithm. PlantLoc reports the predicted SCLs, confidence estimates, the substantiality motif and its position on the sequence. Benchmarked on a new test dataset against other plant SCL prediction webservers, PlantLoc achieved the highest accuracy in identifying plant protein SCLs (overall accuracy of 80.8%). The ability of PlantLoc to predict multiple sites was also significantly higher than that of any other webserver. The predicted substantiality motifs of queries also have great potential for analysing relationships with protein functional regions. The PlantLoc server is available at http://cal.tongji.edu.cn/PlantLoc/.

Year: 2013 PMID: 23729470 PMCID: PMC3692052 DOI: 10.1093/nar/gkt428

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Because the subcellular localization (SCL) of a protein is highly correlated with its function, interaction partners and biological processes, it is an active research topic in biology. According to the UniProtKB database released in March 2012 (1,2), there were 1 027 477 original viridiplantae protein entries. However, only 15 792 entries have any experimentally annotated SCL. Determining the localization of a new protein using experimental methods is both time consuming and expensive, so computational prediction of SCLs has become a necessary alternative (3,4). In recent years, various methods have been developed to predict protein SCLs. These approaches may be classified into categories according to the features they exploit, as follows. (i) Features generated from sequence information, such as amino acid composition (5–10), N-terminal sequence (11), pseudo-amino acid composition (5,12) and PSSM (position-specific scoring matrix). (ii) Features generated from Gene Ontology (GO) annotations (13–15) or textual information (15) from Swiss-Prot keywords. (iii) Features generated by hybrid methods, which usually combine sequence information and annotation information (16–20). Although these features play important roles in prediction, the accuracy of prediction still needs improvement, especially for plant proteins (21–23). Additionally, the complexity of the predictive models makes it difficult for users to understand why a prediction was made. The concept of the localization motif (LM) was proposed previously (24), and our group recently proposed the LM as a novel feature (25,26). An LM was defined as a gapped or ungapped fragment of amino acids that forms a conserved pattern in a subcellular domain and occurs in the N-terminal peptides of sequences. We confirmed that LMs could be used as features to accurately predict protein SCLs with a support vector machine, and that they can be used directly to predict SCL without other information.

Here, we present PlantLoc, a highly accurate and fast web server for predicting the multiple-site SCLs of plant proteins from sequences without any annotation information or any machine learning algorithm. It provides predicted SCL results, confidence estimates, the substantiality motifs and their positions on the query sequences. PlantLoc generated the LMs from a training dataset (4412 entries). The obtained LMs constituted 11 libraries for 11 SCLs of plant proteins. According to the hit numbers of all LM libraries, PlantLoc gives the probabilities of the query sequence for each localization domain. On a new benchmark test dataset of 230 entries, the overall accuracy of PlantLoc was 80.8%, higher than that of six other plant SCL prediction webservers [cf. iLoc-Plant (21): 22.2%; mGOASVM (19): 51.3%; WegoLoc (17): 57.0%; YLoc (20): 47.8%; ngLOC (18): 26.5%; and WoLF PSORT (27): 61.7%]. Additionally, we tested PlantLoc’s ability to predict multiple localizations. PlantLoc reliably predicted the proteins with multiple localizations and outperformed the best predictors in this area. Moreover, the obtained substantiality motifs are conserved patterns in the protein sequence, which help users understand why a prediction was made and support further biological function analysis.

MATERIALS AND METHODS

Protein sequences were collected from the UniProtKB/Swiss-Prot protein knowledgebase (http://www.uniprot.org/) according to the annotation information in the CC (comment or notes) and OC (organism classification) fields. Some proteins may simultaneously exist in two or more SCLs. Training datasets were collected from the UniProtKB/Swiss-Prot release of March 2012. Plant proteins can be localized in the chloroplast (CHL), cell wall (CEL), cytoplasm (CYT), endoplasmic reticulum (END), extracellular space (EXC), Golgi apparatus (GOL), mitochondrion (MIT), nucleus (NUC), peroxisome (PER), plasma membrane (PLA) and vacuole (VAC). To reduce homology bias, the culling program CD-HIT (http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi) was used with a 60% sequence identity cutoff to winnow out sequences that were redundant with any other sequence in the same SCL (Figure 1). A test set (STest) was collected from UniProtKB entries released from March 2012 to October 2012, and its redundancy was reduced to 60% with reference to the training dataset. The total number of different proteins in STest was 206 and the total number of locations was 230. A multiple-site dataset called SMS included 24 proteins with 51 localizations.

Figure 1.

The numbers of plant proteins for the training and testing datasets. The experimentally annotated training set (14 401 entries), the reduced-redundancy training set STrain (4436 entries, the sum of the numbers in the cylinders) and the test set STest (230 entries) are divided by time (bottom). The names and the numbers of sequences of the 11 subcellular domains of plant proteins are shown in different colors (top). The 36 651 entries (bottom middle) obtained by ‘similarity’ annotation were used in the selection of LMs (see text).

A substantiality motif was defined as an assembly of sequence fragments that serve as interpretable characteristics of an SCL. A substantiality motif for a query sequence was generated by assembling the LMs that hit the query; the LMs act like bricks from which flexible substantiality motifs are constructed. There were three steps to obtain a substantiality motif.

In the first step, the LM program was run to generate LM candidates from the training dataset STrain. N-terminal sequence information has been widely used for predicting SCL (7,28–30). In previous studies we also found that most of the motifs were positioned near the N-terminus of protein sequences, where signal peptides are generally considered to be present (26). Detailed information is given in Supplementary Material (A). If a sequence is shorter than 200 amino acids, the whole sequence is used for extraction of LM candidates. If a sequence is longer than 200 amino acids, only the 200 N-terminal residues are considered. In previous studies, an algorithm that could extract local combinational variables with fixed locations from aligned sequences was successfully used to predict DNA binding and protein shape strings (31,32). Here, the algorithm was developed further to extract LM candidates from unequal-length sequences at different locations. There were two parameters in this procedure, the found number threshold and the span. The found number is the number of times a given LM candidate is present in a subcellular domain. If the found number threshold is low, the coverage of training sequences will be high and the number of LM candidates will be very large. The span is the number of gaps between two residues that is used in the generation of 2-length seeds (Figure 2). The larger the span, the more 2-length seeds are generated. All 2-length seeds with frequencies greater than or equal to the given found number threshold were enumerated. Then 2-length seeds were merged according to the same prefix-of-seed and suffix-of-seed (marked in Figure 2, left), and 3-length seeds were generated. A new 3-length seed survives if its frequency is greater than or equal to the found number threshold. This iteration was repeated until no seeds survived. Finally, all surviving seeds were collected as LM candidates.
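The enumeration and merging procedure for LM candidates can be illustrated with a minimal Python sketch. It is not the authors' C# LM program. The function names, the use of '.' to mark a gap position, the reading of the found number as the number of sequences that contain a pattern, and the choice to keep the survivors of every round are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

N_TERM = 200  # only the first 200 residues are scanned, as stated in the text


def two_length_seeds(seq, max_span):
    """Return all residue pairs in seq with 0..max_span gap positions ('.')."""
    seeds = set()
    for i, residue in enumerate(seq):
        for gap in range(max_span + 1):
            j = i + gap + 1
            if j >= len(seq):
                break
            seeds.add(residue + "." * gap + seq[j])
    return seeds


def found_number(pattern, seqs):
    """Number of sequences that contain the pattern (assumed definition)."""
    rx = re.compile(pattern)  # '.' already means "any residue" in a regex
    return sum(1 for s in seqs if rx.search(s))


def lm_candidates(sequences, threshold, max_span):
    """Enumerate 2-length seeds, then merge survivors until none survive."""
    seqs = [s[:N_TERM] for s in sequences]
    counts = Counter()
    for s in seqs:
        counts.update(two_length_seeds(s, max_span))
    current = {p for p, n in counts.items() if n >= threshold}
    survivors = set(current)
    while current:
        # Index seeds by prefix-of-seed (the seed without its last residue).
        by_prefix = defaultdict(list)
        for seed in current:
            by_prefix[seed[:-1].rstrip(".")].append(seed)
        merged = set()
        for seed in current:
            # suffix-of-seed: the seed without its first residue.
            suffix = seed[1:].lstrip(".")
            for other in by_prefix.get(suffix, []):
                merged.add(seed + other[len(suffix):])
        current = {p for p in merged if found_number(p, seqs) >= threshold}
        survivors |= current
    return survivors


toy = ["MKTLLA", "MKTLLV", "MKALLA"]
print(sorted(lm_candidates(toy, threshold=3, max_span=1)))
# ['K.L', 'K.LL', 'LL', 'MK', 'MK.L', 'MK.LL']
```

In the toy run, the seeds MK and K.L share the residue K as suffix and prefix, so they merge into MK.L. The merged seed survives because it still occurs in all three sequences.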

Figure 2.

The process of the LM library building strategy. The LM algorithm is an enumeration and merging procedure; prefix-of-seed and suffix-of-seed are marked in tangerine. In LM selection, the negative set is enlarged. The LM libraries contain LMs (expressed as characters) and their frequencies in the training sets.

In the second step, LMs were selected from millions of generated candidates. For a subcellular domain, an LM is assigned to this domain if it matches only sequences in this domain and does not match any sequence in any other domain. To reduce false positive rates, which are an inherent weakness of customary motif discovery algorithms, we enlarged the dataset and especially the negative set. The datasets used in this step included not only proteins with ‘experimentally’ annotated SCLs (14 401 entries) but also those annotated ‘by similarity’ (36 651 entries of 11 SCLs) from UniProtKB released in March 2012. For each subcellular domain, the sequences belonging to this domain were considered as the positive set and the sequences of all other domains were considered as the negative set (Figure 2, middle). Thus the number of LMs was lowered and the false positive rates were greatly reduced. After selection, the LMs and their frequencies in the training sets for a specific SCL constituted an LM library (Figure 2, right). There were 11 libraries for plant proteins.

In the third step, a query sequence was identified as belonging to a subcellular domain according to the hit numbers of LMs in the LM libraries. When a library had the highest hit number, the query was identified as belonging to this domain. When more than one library had equal highest hit numbers and the hit numbers were above a threshold (say 10), the query was identified as having multiple sites. The substantiality motif of a query was assembled from the hit LMs of a specific domain and was shown in the output (Figure 2, bottom). When more than 10 LMs were hit, only the 10 LMs with the highest frequencies were shown. In the output section of PlantLoc, the hit number is expressed as a relative probability. If the hit number reaches the threshold, the probability is defined as 100%. Our approach shows which substantiality motifs are present and where they are located on the query sequences, and this has great potential for analysing relationships with protein functional regions.
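The selection and identification steps can be sketched in Python as follows. This is an illustrative reading of the text, not PlantLoc's implementation. The names `select_lms` and `predict` are hypothetical, and several details are assumptions: the linear scaling of the relative probability below the threshold, the restriction of the query to its first 200 residues, and the handling of ties below the threshold (the tied domains are returned but not flagged as multiple-site).

```python
import re


def select_lms(candidates, positives, negatives):
    """Keep candidates that match no negative sequence; store their frequency."""
    library = {}
    for pattern in candidates:
        rx = re.compile(pattern)
        if any(rx.search(s) for s in negatives):
            continue  # matches another domain, so it is not domain specific
        freq = sum(1 for s in positives if rx.search(s))
        if freq:
            library[pattern] = freq
    return library


def predict(query, libraries, threshold=10, n_term=200, max_shown=10):
    """Count LM hits per library and rank the domains by hit number."""
    region = query[:n_term]  # assumption: only the N-terminal region is scanned
    domains = {}
    for domain, library in libraries.items():
        hits = []
        for pattern, freq in library.items():
            match = re.search(pattern, region)
            if match:
                hits.append((freq, pattern, match.start()))
        hits.sort(key=lambda h: (-h[0], h[1]))  # highest frequency first
        domains[domain] = {
            "hits": len(hits),
            # Assumed scaling: linear up to the threshold, then capped at 100%.
            "probability": 100.0 * min(len(hits), threshold) / threshold,
            "motifs": [(p, pos) for _, p, pos in hits[:max_shown]],
        }
    best = max((d["hits"] for d in domains.values()), default=0)
    top = sorted(n for n, d in domains.items() if best > 0 and d["hits"] == best)
    return {
        "predicted": top,
        "multi_site": len(top) > 1 and best >= threshold,
        "domains": domains,
    }


chl = select_lms({"MK", "K.L", "LL"}, ["MKTLLA", "MKALLA"], ["GGLLGG", "MRTAAV"])
result = predict("MKSLQQ", {"CHL": chl, "NUC": {"GG": 3}}, threshold=2)
print(chl, result["predicted"], result["domains"]["CHL"]["probability"])
```

In this toy example the candidate LL is discarded because it also matches a negative sequence, which mirrors the purpose of enlarging the negative set.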

RESULTS

For PlantLoc the threshold number was set in the range of 0–10 according to the numbers of proteins in the training subcellular domains. When a subcellular domain had <100 proteins, the threshold was set to 2. The span was set to 0–10. We tested the performance of PlantLoc on the training datasets. We also tested PlantLoc’s ability to predict multiple localization sites with 5-fold cross-validation [Supplementary Material (B)]. The overall accuracy on STrain was 96.3%. Because the LMs were generated and selected from the training dataset, such high accuracy is to be expected.

We then tested the performance of PlantLoc on the independent dataset STest and compared it with six other SCL predictors [iLoc-Plant (21), mGOASVM (19), WegoLoc (17), YLoc (20), ngLOC (18) and WoLF PSORT (27)], which are based on homology, motif and GO methods. WoLF PSORT (27) converted protein amino acid sequences into numerical localization features, such as sorting signals, amino acid composition and functional motifs. YLoc (20) derived many features from amino acid composition and pseudo composition; in addition, it included PROSITE motifs and GO terms from close homologues. iLoc-Plant (21), proposed by the Chou group, used PSSM, GO and sequential evolution information. WegoLoc (17) was a homology-based and weighted GO-based approach. mGOASVM (19) also used homology and GO information. ngLOC (18) was developed on fixed-length peptides, called n-grams. The individual prediction performance was evaluated using the overall accuracy. Except for the WoLF PSORT webserver, the others provided only one or two predicted SCLs. Some webservers, such as ngLOC and WoLF PSORT, provided two or more probable SCLs. The evaluation results are summarized in Figure 3. For WoLF PSORT, the first and second predicted SCLs are defined as predicted SCLs. The overall accuracy of PlantLoc was 80.8%, which is much higher than that of the other methods. Moreover, when both predicted and probable SCLs were counted, the results were as follows: WoLF PSORT 74.3%, ngLOC 59.6% and PlantLoc 90.4%. The results in Figure 3 show that PlantLoc performed markedly better than the existing predictors based on motif, homology and GO information, with the capacity to deal with multi-label plant proteins.

Figure 3.

Comparison of results with other methods on STest.

We tested PlantLoc’s ability to predict multiple localization sites on SMS and compared its performance with that of six other webservers (Table 1). PlantLoc performed markedly better than any of the existing predictors with the capacity to deal with multi-label plant proteins. We defined the multiple-site accuracy (MSA) and the ratio of accuracy (RA) to assess the accuracy of predicting multiple sites, as follows:

$$\mathrm{MSA} = \frac{\mathrm{NCPS}}{\mathrm{TNRMS}} \times 100\%, \qquad \mathrm{RA} = \frac{\mathrm{NCPS}}{\mathrm{TNPS}} \times 100\%$$

where NCPS is the number of correctly predicted sites, TNRMS is the total number of real multiple sites and TNPS is the total number of predicted sites.

Table 1.

Performance of PlantLoc and other webservers on SMS

	PlantLoc	iLoc-Plant	YLoc	mGOASVM	ngLOC	WoLF PSORT	WegoLoc
MSA (%)	86.3	25.5	35.3	41.2	39.2	58.8	64.7
RA (%)	100.0	50.0	75.0	72.4	27.8	42.9	45.8

WegoLoc provided confidence estimates for each location; for statistical reasons, its first three localizations were taken as its prediction result. For the other five webservers, all prediction results were taken as the prediction result. For PlantLoc, MSA = 86.3% and RA = 100.0%. The ability of PlantLoc to predict multiple sites was better than that of the other six webservers.
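The two measures can be computed with a few lines of Python. The sketch below assumes that MSA is the ratio NCPS/TNRMS and RA is the ratio NCPS/TNPS, both expressed as percentages; this reading is consistent with Table 1 (for example, 44 correct sites out of the 51 real sites in SMS gives 86.3%). The function name and the dictionary layout are hypothetical.

```python
def msa_and_ra(true_sites, predicted_sites):
    """Return (MSA, RA) in percent.

    true_sites and predicted_sites map a protein ID to a set of SCL labels.
    """
    # NCPS: predicted sites that agree with the annotated sites.
    ncps = sum(
        len(sites & predicted_sites.get(protein, set()))
        for protein, sites in true_sites.items()
    )
    # TNRMS: all annotated sites; TNPS: all sites the predictor reported.
    tnrms = sum(len(sites) for sites in true_sites.values())
    tnps = sum(len(predicted_sites.get(protein, set())) for protein in true_sites)
    msa = 100.0 * ncps / tnrms
    ra = 100.0 * ncps / tnps if tnps else 0.0
    return msa, ra


truth = {"P1": {"CYT", "NUC"}, "P2": {"CHL", "MIT"}}
predicted = {"P1": {"CYT", "NUC"}, "P2": {"CHL"}}
print(msa_and_ra(truth, predicted))  # (75.0, 100.0)
```

A predictor that reports many sites per protein can raise MSA while lowering RA, which is why the two measures are reported together.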

WEBSERVER

The PlantLoc server is freely available at http://cal.tongji.edu.cn/PlantLoc/. The PlantLoc webserver runs on a 64-bit Windows server with 2.0 GHz Intel Xeon processors (four cores). It is composed of a front-end web application and a back-end execution cluster. The front-end is written in Java and JavaServer Pages and uses the Microsoft SQL Server database. The LM software was developed in C# and can be freely downloaded. A template Perl program is provided to make it easier for users to submit sequences and parse the results.

Input description

Users can enter a FASTA format protein sequence in the textbox or upload a FASTA format file (up to 50 entries). Users can then bookmark the ‘MY TASK’ page (processing takes about 30 s per sequence) and access it at a later time. If a user provides an email address (optional), the address is used as an ID to retrieve the results of all tasks the user has ever submitted. For large-scale predictions, the standalone program for finding LMs can also be downloaded from the webserver.
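Before submitting a file, a user may want to check locally that it is valid FASTA and stays within the 50-entry limit. The following Python sketch shows one way to do this. It is a hypothetical client-side helper, not part of the PlantLoc server or of the Perl template mentioned above.

```python
def read_fasta(text, max_entries=50):
    """Parse FASTA text into (header, sequence) pairs and enforce a size limit."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            chunks.append(line.upper())
    if header is not None:
        entries.append((header, "".join(chunks)))
    if len(entries) > max_entries:
        raise ValueError(f"{len(entries)} entries exceed the limit of {max_entries}")
    return entries


example = ">seq1 toy protein\nMKTLL\nAVG\n>seq2\nmkallav\n"
print(read_fasta(example))
# [('seq1 toy protein', 'MKTLLAVG'), ('seq2', 'MKALLAV')]
```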

Output description

Each prediction result provides a graphic representation of the probabilities of the predicted SCLs (Figure 4). The identified SCLs, probable SCLs and the substantiality motifs, including their positions on the sequence, are shown in Figure 4. When substantiality motifs are found in the query sequence, they are represented as amino acid characters; the remaining positions are represented as dots. All the substantiality motifs can be downloaded by pressing ‘download’ in the result file.
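The dotted representation of substantiality motifs can be reproduced with a short Python sketch. The function name and the input format, a list of (pattern, start position) pairs with '.' marking gap positions, are assumptions for illustration and do not describe the server's actual output code.

```python
def render_motifs(sequence, motif_hits):
    """Show motif residues at their positions and dots everywhere else.

    motif_hits is a list of (pattern, start) pairs; '.' in a pattern is a gap.
    """
    line = ["."] * len(sequence)
    for pattern, start in motif_hits:
        for offset, symbol in enumerate(pattern):
            if symbol != ".":
                # Copy the residue from the query so the line stays aligned.
                line[start + offset] = sequence[start + offset]
    return "".join(line)


query = "MKTLLAVG"
print(query)
print(render_motifs(query, [("K.L", 1), ("AV", 5)]))
# MKTLLAVG
# .K.L.AV.
```

Printing the query above the rendered line keeps motif residues aligned with their positions, which makes it easy to compare them with annotated regions.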

Figure 4.

A screenshot of a PlantLoc output and the obtained substantiality motifs of Q8S8N6 (protein ID). The probability of prediction is expressed as a graph. The identified SCL(s) and probable localization(s) are shown.

The substantiality motifs are very important because they tell the user why a protein is predicted to be in a given localization. Figure 5 shows an example of a potential relationship between substantiality motifs and functional regions annotated by UniProtKB. The substantiality motifs (V.ALN. … L, KYCG. . Y. GCP. E. PCD. . D. CC) were localized on the signal peptide and the metal-binding annotation region of the sp_Q8S8N6_PIA2A_ARATH_Phospho sequence. The substantiality motifs may be related to the signal peptide and functional regions, which may explain why they yield more accurate results. Moreover, once users have found a substantiality motif, they can use the UniProtKB database to find potential functional regions for further research.

Figure 5.

The substantiality motifs with annotation by UniProtKB. Characters colored blue are annotated from UniProtKB. Characters colored red are substantiality motifs for EXC and GOL.

DISCUSSION

The PlantLoc server has two innovative features: it builds LM libraries by a recursive method without alignment or GO information, and it uses a simple architecture for rapidly and accurately identifying plant protein SCLs without a machine learning algorithm. In contrast to other webservers, PlantLoc performs well not only on single localizations but also on multiple-site proteins. The substantiality motifs can explain why a prediction is made and which motifs are responsible for it. The substantiality motifs will be important for users in further functional analysis and can be applied to a wide range of sequence identities, and so provide a practical tool for biologists.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Natural Science Foundation of China (NSFC) [20705024, 21275108]. Funding for open access charge: NSFC.

Conflict of interest statement. None declared.

32 in total

1. Extensive feature detection of N-terminal protein sorting signals.

Authors: Hideo Bannai; Yoshinori Tamada; Osamu Maruyama; Kenta Nakai; Satoru Miyano
Journal: Bioinformatics Date: 2002-02 Impact factor: 6.937

2. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST.

Authors: Manoj Bhasin; G P S Raghava
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

3. WegoLoc: accurate prediction of protein subcellular localization using weighted Gene Ontology terms.

Authors: Sang-Mun Chi; Dougu Nam
Journal: Bioinformatics Date: 2012-01-31 Impact factor: 6.937

4. Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning.

Authors: Suyu Mei
Journal: J Theor Biol Date: 2012-06-27 Impact factor: 2.691

5. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.

Authors: O Emanuelsson; H Nielsen; S Brunak; G von Heijne
Journal: J Mol Biol Date: 2000-07-21 Impact factor: 5.469

6. ngLOC: software and web server for predicting protein subcellular localization in prokaryotes and eukaryotes.

Authors: Brian R King; Suleyman Vural; Sanjit Pandey; Alex Barteau; Chittibabu Guda
Journal: BMC Res Notes Date: 2012-07-10

7. Update on activities at the Universal Protein Resource (UniProt) in 2013.

Authors:
Journal: Nucleic Acids Res Date: 2012-11-17 Impact factor: 16.971

8. DSP: a protein shape string and its profile prediction server.

Authors: Jiangming Sun; Shengnan Tang; Wenwei Xiong; Peisheng Cong; Tonghua Li
Journal: Nucleic Acids Res Date: 2012-05-02 Impact factor: 16.971

9. LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST.

Authors: Dan Xie; Ao Li; Minghui Wang; Zhewen Fan; Huanqing Feng
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

10. mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines.

Authors: Shibiao Wan; Man-Wai Mak; Sun-Yuan Kung
Journal: BMC Bioinformatics Date: 2012-11-06 Impact factor: 3.169
