Predicting protein subcellular localization: past, present, and future.

Abstract

Functional characterization of every protein is a major challenge of the post-genomic era. Proteomics, the large-scale analysis of a cell's proteins, seeks to provide these proteins with reliable annotations regarding their interaction partners and functions in the cellular machinery. An important step toward this goal is to determine the subcellular localization of each protein. Eukaryotic cells are divided into subcellular compartments, or organelles. Transport across the membrane into the organelles is a highly regulated and complex cellular process. Predicting subcellular localization by computational means has been an area of lively activity in recent years. The publicly available prediction methods differ mainly in four aspects that matter to the user: the underlying biological motivation, the computational method used, the localization coverage, and the reliability. This review provides a short description of the main events in the protein sorting process and an overview of the most commonly used methods in this field.

Substances: Proteins

Year: 2004 PMID: 15901249 PMCID: PMC5187447 DOI: 10.1016/s1672-0229(04)02027-3

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

Large-scale genomic and proteomic efforts worldwide have produced a massive amount of sequence data. Annotating these sequences has been a major driving force in molecular and computational biology (1, 2). Functional annotation projects seek to elucidate the potential roles that proteins play in a cellular context, such as in metabolic pathways and interaction networks. Eukaryotic cells can synthesize up to 10,000 different kinds of proteins, all of which are destined for one or more pre-determined target organelles. Proteins have evolved to function optimally in a specific subcellular localization; hence, the correct transport of a protein to its final destination is crucial to its function. The process of directing a newly synthesized protein to its target organelle is often referred to as protein targeting or protein sorting. Failure to transport a protein correctly has proven to be a key event behind several human diseases, such as cancer and Alzheimer’s disease (3, 4, 5). Computational methods aiming to assign subcellular localization in an automated and high-throughput fashion provide an appealing complement to experimental techniques. The development of methods for predicting subcellular location has been an area of great activity in recent years and has long been seen as the detective work of a bioinformatician. The enormous complexity of the protein sorting process, the existence of alternative transportation pathways, and the lack of complete data for every organelle present great challenges to eager developers of prediction methods. In this review, we describe the main events of the protein sorting process, provide an overview of the computational contributions made to this field, and finally give a few guiding words to potential users.

Biological Background

There are at least ten main subcellular localizations in eukaryotes, several of which can be further subdivided into intraorganellar compartments. Bacteria, on the other hand, consist of a single intracellular space and a plasma membrane. The organelles have distinct, well-defined, and complementary functions in the cellular machinery, and are thought to have evolved from ancestral bacterial endosymbionts of prokaryotic cells. Most proteins in the cell are encoded in the nuclear DNA; only a small subset is encoded in the chloroplastic and mitochondrial DNA. An elaborate and highly selective system of sorting and transportation mechanisms guides each protein to its final destination (9, 10). The cytoplasm surrounds the nucleus and is the place where mRNA is translated into protein. Proteins in the cytoplasm can enter the secretory pathway (SP), be directed to other, non-secretory pathway (nSP) organelles, or remain in the cytoplasm. The intracellular routing of non-cytoplasmic proteins was traced in pioneering experimental studies by George Palade, who received the Nobel Prize for his work in 1974. Proteins of the secretory pathway carry a targeting sequence in their precursor protein sequences and are transported co-translationally across the endoplasmic reticulum (ER) membrane. Proteins in the ER are further transported into the Golgi apparatus, plasma membrane, lysosome, vacuole, or the extracellular space, unless they carry an ER-retention sequence. Vesicular carriers are often employed for transporting the proteins and have been shown to shuttle between the ER and the Golgi apparatus (12, 13). The nSP proteins are synthesized on free cytoplasmic ribosomes and are transported from the cytoplasm post-translationally if they carry specific N-terminal targeting sequences for the chloroplast (cTP), mitochondria (mTP), or the lysosome (14, 15). 
These targeting sequences are usually cleaved from the mature protein sequence by specific signal sequence peptidases (16, 17, 18) once the protein has reached its final destination. Additional intrinsic sequences present within the mature protein, such as the hydrophobic stop-transfer sequence, can initiate membrane insertion of transmembrane proteins. Secondary targeting sequences occur in chloroplasts and enable further intraorganellar transport. All nuclear proteins have to be imported from the cytoplasm. This import is facilitated as the nuclear pore complex recognizes a nuclear localization signal (NLS), which is present only in nuclear proteins (20, 21). The NLS is a short stretch of four to eight, usually positively charged, amino acids, and can be encoded as one fragment (monopartite) or split into two fragments (bipartite). The NLS has been precisely defined in several nuclear proteins, but there are also nuclear proteins that appear to have no NLS at all. Peroxisomal proteins are also imported from the cytoplasm and carry a short C-terminal signal sequence that facilitates transport across the peroxisomal membrane. Furthermore, post-translational modifications, such as glycosylation, also play an important role in further protein transport (23, 24, 25). Common to all signal sequences is that they show high specificity and evolutionary conservation. The conservation is not necessarily evident within the primary amino acid sequence, but rather indirectly at the level of the biochemical properties of the amino acids. Some of the targeting sequences show a tendency to form some degree of secondary structure, such as an amphipathic alpha helix or beta sheet (27, 28). The correct cleavage of the targeting peptides (TPs) is highly dependent on the primary sequence, and a few direct sequence motifs have been identified. The organelles present unique biological conditions to the proteins. 
During the course of evolution, the only mutations that have been accepted are the ones from which the cell benefits. It has been observed that proteins from different organelles differ in their overall amino acid composition; hence the underlying hypothesis can be formed that each protein has evolved over time to function optimally in a certain subcellular localization.
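The overall amino acid composition mentioned above is simply the fraction of each of the 20 standard residues in a sequence. The following is a minimal sketch of how such a composition vector might be computed; the function name and the toy sequence are hypothetical and are not taken from any of the published methods.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a dict mapping each standard residue to its fraction in the sequence.

    Non-standard symbols (e.g. X) are ignored, which is an assumption of this sketch.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical toy sequence, not a real protein.
toy_sequence = "MKKLLRRA"
composition = amino_acid_composition(toy_sequence)
```

The resulting 20 fractions sum to one and can be compared between proteins from different organelles.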

Computational Approaches

Computational methods for predicting protein subcellular localization can generally be divided into four categories: prediction methods based on (i) the overall protein amino acid composition, (ii) known targeting sequences, (iii) sequence homology and/or motifs, and (iv) a combination of several sources of information from the first three categories (hybrid methods). (i) The pioneering work in using the overall amino acid composition for prediction was done by Nakashima and Nishikawa, who presented a method for discriminating between intracellular and extracellular proteins. Using the distance between the overall amino acid composition vectors, Cedano et al. presented ProtLock for predicting five classes of subcellular localizations (extracellular, intracellular, integral membrane, anchored membrane, and nuclear). Reinhardt and Hubbard presented NNPSL, an approach using artificial neural networks (ANNs) for predicting four eukaryotic (cytoplasmic, extracellular, mitochondrial, and nuclear) and three prokaryotic (cytoplasmic, extracellular, and periplasmic) subcellular localizations. Several alternative algorithms have been applied to the data set presented by Reinhardt and Hubbard, including Kohonen’s self-organizing maps, Support Vector Machines (SVMs), and Markov chain models. Further developments in the use of the overall amino acid composition have been made by reflecting sequence order effects. Chou et al. presented an SVM-based method for predicting twelve different subcellular localizations, taking sequence order effects into account (36, 37). A similar approach was recently presented by Park and Kanehisa, who described the PLOC method. Fuzzy k-NNs were applied by Huang et al. to the dipeptide composition of the whole protein sequence for eleven different localizations. 
The CELLO method enables prediction of five subcellular localizations in Gram-negative bacteria (cytoplasm, inner membrane, periplasm, outer membrane, and extracellular space), based on the composition of peptides of varying lengths (n-peptide composition). Andrade et al. were the first to incorporate structural information into the amino acid composition vectors. The surface composition of eukaryotic proteins with known structure was used to distinguish between nuclear, extracellular, and cytoplasmic proteins. The rationale behind this approach is that the interiors of proteins have stayed fairly constant during evolution, whereas surface residues have adapted to certain biochemical environments. (ii) The most comprehensive method based on N-terminal targeting sequences is TargetP, which allows for prediction of chloroplast, mitochondrial, secretory pathway, and other proteins. TargetP can be seen as an integration of the SignalP and ChloroP methods; all three methods have been presented by the group of Gunnar von Heijne. MitoProt and Predotar (http://www.inra.fr/predotar) are two methods that both specifically discriminate chloroplast from mitochondrial proteins. Another method in this category is iPSORT, which offers prediction of the same localization categories as TargetP. iPSORT uses knowledge-based rules for prediction based on protein sequence features derived from the AAindex database. (iii) Marcotte et al. presented a method that assigns the subcellular localization by constructing phylogenetic profiles of the proteins. Mott et al. used SMART domains for predicting cytoplasmic, secreted, and nuclear proteins. PredictNLS is a method specialized in recognizing nuclear proteins, based on a collection of nuclear localization signals (NLSs). A nearest neighbour approach using the composition of functional domains has also been presented and tested on the Reinhardt and Hubbard data set. 
Proteome Analyst, presented by Lu et al., is based on SWISS-PROT keywords and the annotation of homologous proteins. This method is similar to the LOCkey and LOChom methods described by Nair and Rost in 2002. PSLT is a recently presented method that uses Bayesian networks and InterPro motifs for predicting ten subcellular localizations. (iv) PSORT, presented in 1992, was one of the first methods developed for predicting subcellular localization. PSORT uses the overall amino acid composition, N-terminal targeting sequence information, and motifs, and is hence considered a hybrid approach. This method uses a set of knowledge-based “if-then” rules and predicts 14 animal and 17 plant subcellular localizations. Extensions of the PSORT method include PSORT II (a modified decision algorithm) and PSORT-B (with a focus on bacterial proteins). ESLpred is an SVM-based method that was developed using the Reinhardt and Hubbard data set and combines the dipeptide composition with PSI-BLAST scores. Drawid and Gerstein presented a method that incorporates information about sequence motifs, overall sequence properties (e.g. isoelectric points and surface composition), and mRNA expression levels. Their method is based on a Bayesian prediction model and was tested on the yeast genome. MITOPRED is a method specialized for predicting mitochondrial proteins, which is based on Pfam domains and amino acid composition. Several of the described methods are available as online prediction servers. A compiled list of methods, associated URLs, and references is given in Table 1. Since the methods have different localization coverage and different means of assessing their accuracy, it is not possible to compare all methods directly against each other. Accuracy issues of prediction are considered in the Discussion section below.
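To make category (i) more concrete, the sketch below assigns a query sequence to the localization whose mean composition vector lies closest. It uses a plain Euclidean nearest-centroid rule, which is an assumption chosen for brevity and not the distance measure or algorithm of any method reviewed here. The training sequences and labels are invented toy data.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the 20-dimensional amino acid composition vector of a sequence."""
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def centroid(vectors):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def train(labelled_sequences):
    """Build one mean composition vector per localization label."""
    by_label = {}
    for sequence, label in labelled_sequences:
        by_label.setdefault(label, []).append(composition(sequence))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(centroids, sequence):
    """Return the label whose centroid is nearest to the query composition."""
    query = composition(sequence)
    return min(centroids, key=lambda label: euclidean(centroids[label], query))


# Invented toy data: charged sequences labelled "nuclear",
# hydrophobic sequences labelled "membrane". Not real proteins.
TOY_TRAINING = [
    ("KKRRKKEEDDKR", "nuclear"),
    ("RKRKEEKKRDDK", "nuclear"),
    ("LLAVVIILLGAF", "membrane"),
    ("AVLLIIGFLLVA", "membrane"),
]

model = train(TOY_TRAINING)
charged_prediction = predict(model, "KRKKEEDKRK")
hydrophobic_prediction = predict(model, "LLVVAAII")
```

The published methods replace the nearest-centroid step with neural networks, SVMs, or other classifiers, and some extend the feature vector with dipeptide or sequence order information.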

Table 1

Prediction Methods Available Online

Method	URL	Ref.
CELLO	http://cello.life.nctu.edu.tw/	40
ChloroP	http://www.cbs.dtu.dk/services/ChloroP/	43
ESLpred	http://www.imtech.res.in/raghava/eslpred/	59
iPSORT	http://hc.ims.u-tokyo.ac.jp/iPSORT/	45
MITOPRED	http://mitopred.sdsc.edu/	62
MitoProt	http://ihg.gsf.de/ihg/mitoprot.html	44
NNPSL	http://www.doe-mbi.ucla.edu/%7Eastrid/astrid.html	32
PLOC	http://www.genome.jp/SIT/ploc.html	38
predictNLS	http://cubic.bioc.columbia.edu/predictNLS/	50
Predotar	http://genoplante-info.infobiogen.fr/predotar/
Proteome Analyst	http://www.cs.ualberta.ca/%7Ebioinfo/PA/Sub/index.html	52
PSORT	http://psort.ims.u-tokyo.ac.jp/form.html	56
PSORT II	http://psort.ims.u-tokyo.ac.jp/form2.html	57
PSORT-B	http://www.psort.org/psortb/	58
SignalP	http://www.cbs.dtu.dk/services/SignalP/	42
SubLoc	http://www.bioinfo.tsinghua.edu.cn/SubLoc/	34
TargetP	http://www.cbs.dtu.dk/services/TargetP/	41

Discussion

Currently available prediction methods differ in three main aspects critical to the user: the underlying biological model, the localization coverage, and the prediction accuracy. The biological model can be fairly simple, as is the case for predictions based on the overall amino acid composition, or more complex, as is the model underlying the knowledge-based PSORT prediction system. The localization coverage differs immensely and ranges from methods predicting just a few localizations to methods predicting all possible localizations. Prediction accuracy can be seen either as the overall accuracy of a method or as the individual accuracy for each predicted localization. It is often the case that some localizations can be predicted with fairly high accuracy, whereas others cannot. Furthermore, most methods have been trained using different data sets or training procedures, which makes a fair benchmark comparison a daunting task. Methods based on targeting sequences, such as TargetP and iPSORT, generally predict only four plant and three non-plant localizations (low coverage) but have relatively high prediction accuracy. However, it should be pointed out that two of the predicted categories, SP and others, are not subcellular localizations. The SP category contains proteins from at least six different subcellular localizations, and the category of others from at least three. Only if the prediction is chloroplast or mitochondrial can a specific localization be assigned to the query protein. In these cases it might even be a good choice to use a more specialized method, such as MITOPRED, that predicts mitochondrial proteins with high accuracy. Predictions based on targeting sequences alone are complicated by the fact that it is hard to determine the presence of a targeting sequence. Methods based on the overall amino acid composition vary greatly in their coverage. 
The underlying biological model is fairly simple, which is probably one of the main reasons for the avalanche of different computational algorithms applied to the data set by Reinhardt and Hubbard, where four localizations are represented (32, 33, 34, 35, 51, 59). This data set has a high level of sequence homology (up to 90%) and essentially all methods perform equally, with only minor deviations. Some of the algorithms used are more prone to overfitting than others, since different types of cross-validation schemes have been used. Other methods in this category with higher localization coverage have comparable overall accuracies and are a better choice if very little is known about the protein at hand. Methods based on direct sequence homology are in some cases very accurate. These methods rely on finding a highly similar protein with a known subcellular localization annotation. Such predictions neglect protein-specific features that can be learned from a training data set, and hence also the events in the sorting process. The drawback is that if no homologous protein with an annotated localization is available, the result is left to chance. Parallels can be drawn to protein structure prediction, which can be fairly reliable if a homologous protein with known structure is available. High sequence homology is often an indication that the proteins are similar in both structure and function, but not necessarily that they share the same localization. Hybrid methods usually offer prediction of a wider range of subcellular localizations and are the methods of choice when very little is known about the protein of interest. Methods providing a verbose output of the prediction results can be recommended, since these may give detailed information about potentially detected motifs and targeting sequences. 
A sophisticated prediction method should strive towards mimicking the biological process of protein sorting, an important step on the way to simulating a small component of the systems biology of the cell. Machine learning plays an important role in the development and implementation of the complex underlying biological models, which can also be seen from its frequent application within this field. As the accuracy of the methods continues to increase, it will become more interesting to take a closer look at the relatively small proportion of misclassified proteins. It is likely that these proteins are key players at the interfaces of the organelles and they may provide clues to alternative protein sorting routes. There is a need to constantly update methods and to extract new data sets for training the prediction models. An example is the data set used by Reinhardt and Hubbard, extracted from SWISS-PROT release 33.0 (52,205 sequences and 15,775 subcellular localization annotations), which is still being used to develop new methods. The current release, SWISS-PROT 44.1, contains 122,750 protein entries with a total of 80,562 subcellular localization annotations. Hence, the advice to interested users is to choose a method based on an up-to-date data set. For a general biologist, it is hard to assess the choice of algorithm and whether the accuracy estimation has been done in a sound way. The typical pitfalls here are overly homologous sequences within the training data and the cross-validation schemes used to estimate the overall performance of the method. In conclusion, understanding and predicting protein subcellular localization is a field of research where a lot has happened recently, both experimentally and computationally. It is clear that several challenges lie ahead, especially in translating known facts from experimental biology into reasonable computational models while avoiding oversimplification. 
A further challenge is the prediction of proteins known to shuttle between compartments, which in principle can be ascribed to two subcellular localizations (64, 65). There is no doubt that numerous new approaches, both in terms of algorithms and biological motivations, will be presented in this field in the near future. Prediction of subcellular localization is very likely to be one building block in systems biology approaches, aiming to understand the broader aspects of molecular biology.
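The distinction drawn in the Discussion between the overall accuracy of a method and the individual accuracy for each localization can be illustrated with a small sketch. Here the per-localization accuracy is taken to be the fraction of proteins of a given true localization that are predicted correctly, which is one possible definition and an assumption of this example. The labels and predictions are invented.

```python
from collections import defaultdict


def accuracy_report(true_labels, predicted_labels):
    """Return (overall accuracy, dict of per-localization accuracy).

    Per-localization accuracy is computed over the proteins whose true
    localization is that label.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    overall = sum(correct.values()) / len(true_labels)
    per_localization = {label: correct[label] / total[label] for label in total}
    return overall, per_localization


# Invented example: an imbalanced test set of ten proteins.
y_true = ["cytoplasmic"] * 8 + ["mitochondrial"] * 2
y_pred = ["cytoplasmic"] * 8 + ["cytoplasmic", "mitochondrial"]

overall, per_localization = accuracy_report(y_true, y_pred)
```

In this toy case the overall accuracy is 0.9, while only half of the mitochondrial proteins are recovered, which shows how a single overall figure can hide poor performance on a sparsely represented localization.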

65 in total

1. Extensive feature detection of N-terminal protein sorting signals.

Authors: Hideo Bannai; Yoshinori Tamada; Osamu Maruyama; Kenta Nakai; Satoru Miyano
Journal: Bioinformatics Date: 2002-02 Impact factor: 6.937

2. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites.

Authors: O Emanuelsson; H Nielsen; G von Heijne
Journal: Protein Sci Date: 1999-05 Impact factor: 6.725

Review 3. Protein transport in plant cells: in and out of the Golgi.

Authors: Ulla Neumann; Federica Brandizzi; Chris Hawes
Journal: Ann Bot Date: 2003-08 Impact factor: 4.357

4. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST.

Authors: Manoj Bhasin; G P S Raghava
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

Review 5. Mechanisms of protein import and routing in chloroplasts.

Authors: Paul Jarvis; Colin Robinson
Journal: Curr Biol Date: 2004-12-29 Impact factor: 10.834

Review 6. Nuclear import-export: in search of signals and mechanisms.

Authors: E A Nigg; P A Baeuerle; R Lührmann
Journal: Cell Date: 1991-07-12 Impact factor: 41.582

7. AAindex: Amino Acid Index Database.

Authors: S Kawashima; H Ogata; M Kanehisa
Journal: Nucleic Acids Res Date: 1999-01-01 Impact factor: 16.971

Review 8. Import and routing of nucleus-encoded chloroplast proteins.

Authors: K Cline; R Henry
Journal: Annu Rev Cell Dev Biol Date: 1996 Impact factor: 13.827

9. Relation between amino acid composition and cellular location of proteins.

Authors: J Cedano; P Aloy; J A Pérez-Pons; E Querol
Journal: J Mol Biol Date: 1997-02-28 Impact factor: 5.469

Review 10. Glycoproteins: what are the sugar chains for?

Authors: J C Paulson
Journal: Trends Biochem Sci Date: 1989-07 Impact factor: 13.807
