Literature DB >> 24438075

Plus ça change - evolutionary sequence divergence predicts protein subcellular localization signals.

Yoshinori Fukasawa, Ross K K Leung, Stephen K W Tsui, Paul Horton¹.

Abstract

BACKGROUND: Protein subcellular localization is a central problem in understanding cell biology and has been the focus of intense research. In order to predict localization from amino acid sequence a myriad of features have been tried: including amino acid composition, sequence similarity, the presence of certain motifs or domains, and many others. Surprisingly, sequence conservation of sorting motifs has not yet been employed, despite its extensive use for tasks such as the prediction of transcription factor binding sites.
RESULTS: Here, we flip the problem around, and present a proof of concept for the idea that the lack of sequence conservation can be a novel feature for localization prediction. We show that for yeast, mammal and plant datasets, evolutionary sequence divergence alone has significant power to identify sequences with N-terminal sorting sequences. Moreover sequence divergence is nearly as effective when computed on automatically defined ortholog sets as on hand curated ones. Unfortunately, sequence divergence did not necessarily increase classification performance when combined with some traditional sequence features such as amino acid composition. However a post-hoc analysis of the proteins in which sequence divergence changes the prediction yielded some proteins with atypical (i.e. not MPP-cleaved) matrix targeting signals as well as a few misannotations.
CONCLUSION: We report the results of the first quantitative study of the effectiveness of evolutionary sequence divergence as a feature for protein subcellular localization prediction. We show that divergence is indeed useful for prediction, but it is not trivial to improve overall accuracy simply by adding this feature to classical sequence features. Nevertheless we argue that sequence divergence is a promising feature and show anecdotal examples in which it succeeds where other features fail.

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2014 PMID： 24438075 PMCID： PMC3906766 DOI： 10.1186/1471-2164-15-46

Source DB: PubMed Journal: BMC Genomics ISSN： 1471-2164 Impact factor: 3.969

Background

Since proper subcellular localization is a prerequisite for protein function, there is a high demand for accurate and complete localization annotation of all proteins [1]. Although proteomics data has allowed large scale determination of protein localization for model organisms [2,3], no experimental evidence is available for the vast majority of organisms. Although sequence similarity can be a good indicator of identical localization site [4], distant similarity is not [5], and thus for many proteins we must rely on computer prediction. In cells, the localization of proteins is largely determined by “zip-code” like sorting signals, encoded in their amino acid sequence [6]. Unfortunately these sorting signals seem to be only very loosely determined, accepting very diverse sequences, subject to some constraints on their physico-chemical properties [7]. Among those signals, the most well-known sorting signal is the signal peptide of secretory path proteins. A typical signal peptide spans 15–30 amino acids near the N-terminus. Signal peptides typically show three distinct blocks: the n-region containing positively charged residues, the h-region mainly consisting of hydrophobic residues, and the c-region which includes polar uncharged residues and a weakly conserved cleavage motif [8]. Similarly, the targeting signals of mitochondria and chloroplasts are also N-terminally coded [7], and cleaved after import to their final location. In the mitochondria matrix, the N-terminal signal is usually cleaved off by the Mitochondrial Processing Peptidase MPP [9,10], while the corresponding chloroplast targeting N-terminal signals are processed by an analogous protease in the chloroplast stroma [10]. Like signal peptides, these signals are often poorly conserved and difficult to align properly between orthologs [11]. Although some consensus motif has been reported for mitochondrial targeting signals [12,13], it is information poor and produces too many false positives to be used for reliable prediction. To date, an impressive number of methods have been developed for protein sorting prediction. For example, in 2004 a survey already listed dozens of methods employing fifteen broad categories of features [14]; from commonly used ones such as amino acid composition [15-19] (and many more) to rare categories such as sequence periodicity [20] and mRNA expression level [21]. Sequence similarity as defined by programs such as BLASTP has been explored as a feature for signal peptide detection [22]. Among these features, amino acid composition is attractive due to its simplicity. The significant correlation between amino acid composition and sub-cellular location is partially causative and partially due to indirect effects such as adaption of surface residues to the pH of the protein’s localization site [23]. The one feature conspicuously missing from this list has been evolutionary sequence conservation, despite the fact that it has seen extensive use in sequence analysis from the prediction of transcription factor binding sites [24], to short linear motifs in proteins [25] and functional RNA [26]. Although profile feature methods which indirectly reflect evolutionary conservation have been employed [27], sequence conservation per se has not – presumably because sorting signals are indeed not well conserved at the sequence level. Here, we propose that instead of looking for sequence conservation of sorting signals, a more effective approach is to exploit their high evolutionary sequence divergence. In this paper we first describe our datasets of yeast, animal and plant proteins with their orthologs, divergence and other features we used for classification, and the classifiers we employed. Then, we present a simple statistical feature analysis followed by performance evaluation of localization prediction for various combinations of features, classifiers and datasets. Unfortunately, combining other features with our sequence divergence did not lead to a systematic improvement in overall performance. However we show that consideration of sequence divergence is critical for correct prediction in certain cases and can sometimes flag non-cleaved or misannotated targeting signals. Finally we discuss future directions and conclude.

Methods

Sorting signal classes

We mainly focused on the two most common N-terminal sorting signals: Signal Peptides (SP), targeting proteins to the endoplasmic reticulum and Matrix Targeting Signals (MTS) which target proteins to the matrix (inner compartment) of the mitochondria. In the plant dataset, we also consider Chloroplast Transit Peptides (CTP). All of these signals reside near the N-terminus but in general have different properties and are effectively discriminated by the cell. In some cases however, the N-terminal “signal” can be ambiguous. In particular many examples are known in which the same amino acid sequence directs some copies of a protein to the mitochondria and others to the chloroplast [28,29]. Nevertheless these examples still constitute only a small percentage of proteins and therefore we simplify the analysis by treating N-terminal sorting signal identification as a simple three- or four-way classification problem: {MTS, SP, (CTP), no signal}. Other types of N-terminal sorting signals exist, for example the PTS2 signal targeting proteins to the peroxisome [30], but the number of proteins using such signals is much smaller than those using the SP, MTS or CTP signals. The sorting signal class labels we use in our datasets are partially based on direct experimental evidence. In the dataset of S.cerevisiae, we used UniProtKB/Swiss-Prot [31] to assign localization class labels, augmented by MTS containing proteins determined in the proteomics experiment of Vögtle et al. [32]. Because only a small number of SP’s have been directly confirmed experimentally, we also included proteins whose SP is inferred in the database and predicted positive by SignalP [33]. We used proteins annotated to localize to the cytosol or nucleus as proteins without N-terminal signals. To reduce bias in training and accuracy estimation, we used BLASTClust 2.2.22 [34] to remove redundant sequences with a setting of 20% identity. For proteins in human and a few plant species we adopted the dataset of Predotar [35] and for plants augmented that small number by experimental proteomics data determined in the mass spectrometry experiment of Huang et al. [11].

Dataset

Organisms used

We gathered protein sequences from 11 relatively diverse and well annotated representative species of the three phylogenetic divisions: yeast, mammal and plant respectively (Table 1). The 11 mammal species and most of the plant species are annotated reference proteomes in UniProt, but a few of the plant species are only included in UniProt as complete, but not fully annotated, proteomes. Note that our “plant” dataset contains the unicellular green algae Chlamydomonas reinhardtii, which is not a typical plant but is classified in the “viridiplantae” kingdom.

Table 1

List of species used to define orthologs in each phylogenetic category

S. cerevisiae	H. sapiens	A. thaliana
Saccharomyces castellii	Gorilla gorilla	Glycine max
Saccharomyces kluyveri	Otolemur garnettii	Ricinus communis
Kluyveromyces waltii	Mus musculus	Populus trichocarpa
Ashbya gossypii	Oryctolagus cuniculus	Vitis vinifera
Candida glabrata	Sus scrofa	Sorghum bicolor
Kluyveromyces lactis	Ailuropoda melanoleuca	Brachypodium distachyon
Zygosaccharomyces rouxii	Myotis lucifugus	Oryza sativa
Kluyveromyces thermotolerans	Loxodonta africana	Selaginella moellendorffii
Saccharomyces bayanus	Sarcophilus harrisii	Physcomitrella patens
Kluyveromyces polysporus	Ornithorhynchus anatinus	Chlamydomonas reinhardtii

The species listed at top are the reference species used to determine the subcellular localization site class labels. In the case of plants, one of G. max, O. sativa and C. reinhardtii were used as the reference species for proteins for which no annotation was available in A. thaliana.

List of species used to define orthologs in each phylogenetic category The species listed at top are the reference species used to determine the subcellular localization site class labels. In the case of plants, one of G. max, O. sativa and C. reinhardtii were used as the reference species for proteins for which no annotation was available in A. thaliana. In each of the three divisions we designated one species as the “reference” species. We used information in proteins from the non-reference species only for computation of sequence divergence (via ortholog multiple sequence alignments). We chose S.cere., H. sapiens, and A. thaliana as the reference species for yeast, animals and plants respectively, because they have the most complete annotation. However for plants even A. thaliana has rather limited annotation of SPs, so in order to increase the plant dataset size we used other species as the reference species in some cases.

Ortholog determination

We performed some experiments on hand curated ortholog sets downloaded from the Yeast Gene Order Browser (YGOB) [36], but also computed ortholog sets for each of the three phylogenetic divisions. Automatic identification of orthologs is a complex subject for which many sophisticated methods have been developed, the most suitable one being application dependent [37]. For this study, we adopted a simple procedure based on reciprocal best hits (RBHs) [38]. Formally, proteins P and P′ from species S and S′ respectively, are RBHs if P is more similar to P′ than any other protein in S′ and P′ is more similar to P than any other protein in S. We define the ortholog set of a reference species protein as all of its RBHs. When computing RBHs it is important that proteins from as many organisms as possible are included; but in the end we only have use for those ortholog sets in which the reference species is annotated, so in general we discarded the rest. However, in the case of plant, we attempted to rescue those discarded sequences by also trying O. sativa, G. max and C. reinhardtii in turn as the reference species. In computing the similarity scores for RBH we chose to use global alignment rather than local alignment. Our motivation for this was: 1) sorting signals often appear on the N- or C-terminal region of proteins, so differences in those regions may indicate a different localization of the “ortholog”, and 2) for multiple domain proteins, strong similarity in one domain may not imply the same localization site (or signal). We used the heuristic but fast USEARCH [39] program with its default parameters to compute the global similarity scores. Table 2 summarizes the datasets.

Table 2

The number of ortholog sets by localization class in each phylogenetic division

Localization	S.cere. curated	S.cere. RBH	H.sapiens RBH	Plants RBH
class	orthologs
MTS	179	219	81	61
SP	53	73	169	15
CTP	N/A	N/A	N/A	97
N-signal-free	450	560	415	99

For each ortholog dataset, the number of ortholog sets in each localization class is listed. RBH orthologs are defined by the reciprocal best hit method.

The number of ortholog sets by localization class in each phylogenetic division For each ortholog dataset, the number of ortholog sets in each localization class is listed. RBH orthologs are defined by the reciprocal best hit method.

Multiple alignment

We computed multiple alignments for each of the 4 orthologs sets (1 curated and 3 automatic) by aligning with the MAFFT program [40], using “LINSI”, its most accurate mode. Hereafter, we denote these alignments as “orthoMSA” in general, and as “autoOrthoMSA” when specifically referring to multiple alignments of automatically generated ortholog sets. The number of sequences in the automatically generated ortholog sets generally differs from the YGOB based sets, however, it seems that the distribution of the divergence score stabilizes when the number of sequences exceeds three (Figure 1), therefore we decided to include ortholog sets with at least four sequences.

Figure 1

Relationship between mean divergence score and the number of sequence in MSA’s. A box plot illustrating the mean, quartiles and range of the column entropy score for MSA’s in the yeast autoOrthoMSA dataset partitioned by the number of sequences in the MSA.

Relationship between mean divergence score and the number of sequence in MSA’s. A box plot illustrating the mean, quartiles and range of the column entropy score for MSA’s in the yeast autoOrthoMSA dataset partitioned by the number of sequences in the MSA.

Features for classification

Column entropy score

Several measures have been suggested for scoring evolutionary sequence conservation (or conversely divergence) [41,42]. Here we adopt a simple Shannon entropy based score. The Shannon entropy H(i) of the ith column of an orthoMSA is defined as: where A denotes the set of 20 amino acid characters plus gap characters, and F(i,j) denotes the frequency of character j in column i of an orthoMSA. Note that when multiple gap characters are present in a column, we consider each to be a unique character. For example, the entropy of an orthoMSA column ‘{L, L, I, -, -}’ is computed as one character (the ‘L’) with frequency 0.4 and three characters with frequency 0.2, because we treat the two ‘-’ characters as distinct. We adopted this treatment of gap characters so that the divergence of orthoMSA columns with many gaps is considered high (we also tried using straight entropy, but the results, not shown, were slightly worse). The range of this divergence score runs from 0 to log2n, where n is the number of sequences.

Divergence based features

For many orthoMSA’s, the entropy often varies widely from column to column. Therefore, we defined a number of evolutionary divergence features based on a smoothed entropy score, , defined as the average entropy score for columns in the interval [ i,j]. For example we define the local divergence (LD) of an orthoMSA at position k as . Another feature we defined is NCdiff, the average difference in divergence between in the first 20 residues and residues 80 to 99. Our motivation for this definition was the hope that subtracting the divergence from residues 80 to 99 would approximately normalize the feature when comparing proteins with different overall rates of evolution. These features are summarized in Table 3.

Table 3

List of entropy derived features

Feature name	Quantity
LD(i)	H¯i−10,i+10
N_raw20	H¯1,20
N_raw40	H¯1,40
N_raw80-99	H¯80,99
μ_w	Average of H¯window for all length w windows
σ_w	Standard deviation of H¯window for all length w windows
NCdiff	N_raw20−N_raw80-99
N20	(Nraw20−μ20)σ20 (z-score normalized)
N40	(Nraw40−μ40)σ40 (z-score normalized)
N80-99	(Nraw80-99−μ20)σ20 (z-score normalized)

List of entropy derived features

Physico-chemical propensities

To explore the possibility of combining sequence divergence with standard features used in protein localization prediction, we defined three features computed from the first 20 or 40 N-terminal residues of each S.cere. protein: 1) the number of positively charged residues (#pos), 2) the number of negatively charged residues (#neg), and 3) the average hydrophobicity as measured by the Kyte-Doolittle [43] index (Hphob).

Amino acid composition

Amino acid composition is another standard feature for protein localization. We tested this feature computed on the first 20 residues, the first 40 residues, and the entire protein sequence.

Classifiers

Majority class classifier

The majority class classifier unconditionally predicts all examples to belong to the most common class. Its accuracy is equal to the fraction of examples belonging to the most common class.

J48

J48 is a version of the C4.5 decision tree induction algorithm of Quinlan [44,45], implemented in the Weka software package [46]. We used the default value of 0.25 for the confidence factor, which controls the complexity of the induced tree.

Support vector machine

The Support Vector Machine (SVM) [47] is perhaps the most popular classifier in current bioinformatics work. In its basic form it is a linear, binary classifier, but it has been extended to non-linear, multiclass classification. In this project, we used the LIBSVM implementation [48]. We used the Gaussian radial basis kernel function with default γ value (1.0/# number of features). We used 50.0 for the SVM cost parameter C, because with the default cost parameter (1.0) prediction by RBF kernel failed for some features. In our study we conducted binary and 3-class classification. For multiclass discrimination LIBSVM adopts the “one-versus-one” method, in which a separate SVM is learned for each pair of classes, and majority voting among those SVM’s is used when classifying examples [49].

Measuring the influence of divergence features

As reported in the Results section, we performed a post-hoc analysis of proteins for which the divergence features greatly influenced the prediction outcome. To do this we needed to compare 6 numbers (three SVM scores {MTS vs SP, MTS vs none, SP vs none} each computed with and without the divergence features) into a measure of how much the divergence features influenced the prediction. Because the SVM scores are not given directly as probabilities and each individual SVM addresses a different subset of classes, it is not trivial to derive a well-principled way to do this. As described in more detail in the Additional file 1, we chose to define this in terms of exponential loss-based decoding [50]. We do not claim that this is necessarily the best measure, but it appears to give reasonable results. Fortunately, for our purposes it is enough that truly large differences are assigned in a roughly suitable order.

Quantifying feature importance

We used the so called “information gain” to quantify the importance of each feature. Information gain is a simple measure of the predictive power of a feature in isolation (i.e. without consideration of its relationship to other features), defined as: where C and F denote class and feature respectively. H(C) the denotes information theoretic entropy of the overall distribution of the class labels, while H(C|F) denotes the conditional entropy of the class label when feature F is given. A larger information gain indicates greater predictive power. Because the divergence based features have a large number of possible values, we first binned those values into a smaller number by the method of Fayyad & Irani [51].

Classification performance evaluation

Accuracy is not always the most meaningful measure of performance for skewed datasets (i.e. datasets with a very uneven number of examples from different classes) [52]. Therefore we report several measures in addition to accuracy.

Matthews correlation coefficient

The Matthews correlation coefficient, MCC [53,54], is a measure of performance for binary classification defined as follows: where “T” and “F” stand for “true” and “false”, while “N” and “P” stand for “negative” and “positive”. Equivalently, MCC can be defined as the Pearson’s correlation coefficient of the binary vector of class labels compared to the binary vector of predicted class labels. MCC ranges from 1.0 for perfect prediction to -1.0 for perfect inverse prediction. Note that the MCC of the majority class classifier is identically zero, as is the expected value of MCC under random prediction.

Area under the ROC curve

The Area under the curve (AUC) for a receiver operating characteristics (ROC) graph is a widely used metric to evaluate binary classification accuracy [55]. The usual way to generate an ROC plot is to rank instances by their predicted scores with increasing threshold values, plotting true positive rate (y-axis) versus false positive rate (x-axis). AUC ranges from 0 to 1.0, with perfect prediction yielding 1.0 and perfectly wrong prediction 0.0. AUC can be interpreted as the probability that a classifier is able to distinguish a randomly chosen positive example from a randomly chosen negative example [56]. For this task, the majority class classifier gives no information over coin flipping and therefore can be considered to yield an AUC of 0.5.

Results

Feature analysis

N-terminal sorting signals are evolutionary divergent

It is well known that N-terminal sorting signals exhibit relatively low sequence conservation [57]. As shown in Figure 2, this phenomenon is particularly clear for the mitochondrial heat shock protein, mtHSP70, in which the main part of the protein is highly conserved but the N-terminal region is highly divergent. Figure 3 quantifies this trend for the proteins in the YGOB ortholog set.

Figure 2

Figure 3

Local divergence score over N-terminal region. Average local divergence scores are shown for the 100 residue N-terminal region of: MTS containing, SP containing, and N-signal-free proteins. Top left panel is calculated from orthologs of yeast curated dataset, and the others from automatically collected orthologs. For the plant dataset, CTP containing proteins are also shown. The error bars denote standard error. For clarity, error bars are only shown for every fifth position.

An example of MTS containing protein. A multiple sequence alignment of the protein mtHSP70 (UniProt accession P0CS90) and its orthologs from five species of yeast. The red box indicates the cleaved MTS in S.cere. Conserved positions are colored by Jalview. Local divergence score over N-terminal region. Average local divergence scores are shown for the 100 residue N-terminal region of: MTS containing, SP containing, and N-signal-free proteins. Top left panel is calculated from orthologs of yeast curated dataset, and the others from automatically collected orthologs. For the plant dataset, CTP containing proteins are also shown. The error bars denote standard error. For clarity, error bars are only shown for every fifth position.

Estimate of importance of each feature

As a rough estimate of feature importance, we computed the information gain for each feature (Figure 4). The two highest scoring features are the physico-chemical features #neg and Hphob, but the LD features near the N-terminus also show information gain significantly greater than zero.

Figure 4

Importance of each feature. The importance of each attribute as estimated by information gain is shown for the YGOB ortholog set. At left, the divergence related scores are shown by light blue color lines. For local divergence features LD(i), only the residue number i is listed. Dark blue colored lines denote standard features of the N-terminal 40 residues such as physico-chemical properties or amino acid composition. The suffix “f” denotes amino acid composition from the full length of the protein.

Sequence divergence is not redundant to physico-chemical trends or amino acid composition

To be promising as a feature for prediction, it is desirable that evolutionary sequence diversity not be perfectly correlated with other features. To investigate this we plotted LD(13), the divergence feature with the highest information gain, against Hphob, #neg and the arginine composition (the three highest scoring standard features in the 40 residue N-terminal region) (Figure 5). Although there may be some relationship, the feature pairs do not appear highly correlated.

Figure 5

Correlation between divergence and physico-chemical properties. Scatter plots of LD(13) (on the vertical axis) vs physico-chemical property (A) average hydrophobiciy, (B) number of negatively charged residues and (C) arginine composition for the YGOB ortholog set (MTS proteins are shown in red, SP in blue and N-signal-free proteins in green).

Correlation between divergence and physico-chemical properties. Scatter plots of LD(13) (on the vertical axis) vs physico-chemical property (A) average hydrophobiciy, (B) number of negatively charged residues and (C) arginine composition for the YGOB ortholog set (MTS proteins are shown in red, SP in blue and N-signal-free proteins in green).

Divergence predicts the presence of N-terminal signals

We tested whether sequence divergence can be used to distinguish between proteins with an N-terminal localization signal (MTS or SP) and those with none. As shown in Table 4, for this binary classification task, sequence divergence alone allows for significantly higher prediction accuracy than randomized control experiments or the majority class fraction (66.0%) in the yeast dataset.

Table 4

Performance of N-signal vs N-signal-free protein binary classification

	Mean accuracy	Mean AUC	Mean MCC
J48	72.49±3.30	0.68±0.09	0.40±0.09
- (randomized)	65.85±0.66	0.50±0.01	0.00±0.03
SVM	74.64±2.38	0.68±0.03	0.40±0.06
- (randomized)	66.19±0.09	0.50±0.00	0.00±0.00
The majority class fraction	65.98%	N/A	N/A

Three classification performance measures when using only divergence features are shown for the discrimination of N-signal containing and N-signal-free proteins (yeast curated ortholog sets). AUC denotes the area under the ROC curves. (randomized) indicates the values obtained with the localization class labels randomly shuffled 100 times. For each measure the average and standard deviation is shown over the 5 folds of the cross-validation, or 500 (5 × 100 trials) folds in the case of the randomized data.

Performance of N-signal vs N-signal-free protein binary classification Three classification performance measures when using only divergence features are shown for the discrimination of N-signal containing and N-signal-free proteins (yeast curated ortholog sets). AUC denotes the area under the ROC curves. (randomized) indicates the values obtained with the localization class labels randomly shuffled 100 times. For each measure the average and standard deviation is shown over the 5 folds of the cross-validation, or 500 (5 × 100 trials) folds in the case of the randomized data.

Divergence distinguishes SP vs. MTS vs. N-signal-free

Although the sequence divergence profile of SP’s and MTS’s appear similar when averaged (Figure 3), we found that sequence divergence is still somewhat effective for the three-way classification of SP vs MTS vs N-signal-free. As shown in Table 5 the performance with divergence features is slightly better than the majority class fraction (66.0%) and also slightly improves the performance when added to the physico-chemical features in N-terminal 40 residues or amino acid composition in either N-terminal 40 or full length (Additional file 1).

Table 5

Performance of 3-way classification using SVM classifier

	Divergence		Classical features		Combination
	AUC	MCC	AUC	MCC	AUC	MCC
MTS	0.67±0.03	0.36±0.06	0.87±0.03	0.76±0.05	0.87±0.03	0.77±0.03
SP	0.50±0.00	0.00±0.00	0.81±0.08	0.70±0.11	0.90±0.06	0.83±0.07
N-signal-free	0.66±0.02	0.36±0.03	0.85±0.03	0.72±0.05	0.87±0.02	0.77±0.03
% accuracy	70.82±1.61		87.24±1.86		89.30±0.66

The 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way classification on the yeast curated ortholog dataset. Classical features are computed based on the N-terminal 40 residues.

Performance of 3-way classification using SVM classifier The 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way classification on the yeast curated ortholog dataset. Classical features are computed based on the N-terminal 40 residues. The ratio of examples in our dataset is 8.5:3.4:1, for N-signal-free, MTS and SP containing proteins respectively. Skewed datasets are known to complicate both learning and performance evaluation [52]. Therefore we also measured performance on a dataset with uniform class occupancy, created by randomly discarding all but 53 proteins from each class. As shown in Table 6, in this experiment the divergence feature only performance (63%) is much higher than the majority class fraction (33%), and the divergence features also contribute more to the performance when combined with the standard features (Table 6).

Table 6

Performance on balanced dataset for MTS vs SP vs N-signal-free protein prediction using SVM classifier

	Divergence		Classical features		Combination
	AUC	MCC	AUC	MCC	AUC	MCC
MTS	0.67±0.10	0.35±0.20	0.84±0.07	0.68±0.13	0.88±0.05	0.78±0.09
SP	0.71±0.09	0.41±0.16	0.92±0.05	0.85±0.10	0.94±0.01	0.88±0.03
N-signal-free	0.79±0.07	0.60±0.13	0.78±0.09	0.57±0.18	0.86±0.07	0.74±0.13
% accuracy	62.86±5.84		79.92±5.54		86.19±4.67

The 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way classification on a balanced dataset (53 proteins from each class, yeast curated orthologs).

Performance on balanced dataset for MTS vs SP vs N-signal-free protein prediction using SVM classifier The 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way classification on a balanced dataset (53 proteins from each class, yeast curated orthologs). We further tested the prediction power of divergence features when combined with classical features computed on a 20 residue N-terminal instead of 40 (which might be too long for the SP class). In this experiment, divergence features improved the performance only slightly when combined with other standard features (Table 7). We also computed the confusion matrix for this dataset (Table 8) and the other datasets investigated in the study (Additional file 1: Tables S14–S25).

Table 7

Performance of 3-way classification using SVM classifier (feature length 20)

	Divergence		Classical features		Combination
	AUC	MCC	AUC	MCC	AUC	MCC
MTS	0.67±0.03	0.36±0.06	0.89±0.02	0.80±0.02	0.89±0.01	0.81±0.02
SP	0.50±0.00	0.00±0.00	0.97±0.03	0.92±0.07	0.98±0.03	0.97±0.04
N-signal-free	0.66±0.02	0.36±0.03	0.90±0.01	0.81±0.02	0.90±0.01	0.83±0.02
% accuracy	70.82±1.61		91.49±1.26		92.23±1.25

The 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way classification on our entire yeast curated ortholog dataset. Classical features are calculated from N-terminal 20 amino acids.

Table 8

Confusion Matrix from 3-way classification using SVM classifier (feature length 20)

	Divergence			Classical features			Combination
Predicted →	MTS	SP	N-signal-free	MTS	SP	N-signal-free	MTS	SP	N-signal-free
MTS	83	0	96	148	1	30	144	0	35
SP	16	0	37	0	50	3	1	51	1
N-signal-free	50	0	400	20	4	426	15	1	434

Confusion matrix of the 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way classification on our entire yeast curated ortholog dataset. Classical features are calculated from N-terminal 20 amino acids.

Performance of 3-way classification using SVM classifier (feature length 20) The 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way classification on our entire yeast curated ortholog dataset. Classical features are calculated from N-terminal 20 amino acids. Confusion Matrix from 3-way classification using SVM classifier (feature length 20) Confusion matrix of the 5-fold cross-validation performance of an SVM classifier using: divergence features only, classical features only, and the two combined; is shown for three-way classification on our entire yeast curated ortholog dataset. Classical features are calculated from N-terminal 20 amino acids.

Divergence computed from automatically generated ortholog sets is consistent with the hand curated dataset

Although the YGOB based dataset convincingly demonstrates that the divergence score has discriminative power for N-terminal signal prediction, it covers only 11 yeast species and requires hand curation. Thus as described in the Methods section, in this work we adopted a simple procedure based on reciprocal best hit relationships to obtain automatically generated ortholog sets as well (Table 2). In yeast, the average divergence score at each positions is similar to the score from the YGOB ortholog set, and the overall tendency looks similar for animals and plants (Figure 3). Interestingly, CTP shows a high and longer region of elevated divergence, consistent with previous observations that CTPs tend to be longer than MTSs [11]. Additionally, we note that the score range of the human autoOrthoMSA’s is significantly different from those of yeast or plants. This is expected because divergence amongst yeast sequences is at least as large as that of the chordates [58], so divergence in mammals should be smaller.

Divergence computed from autoOrthoMSA also predicts N-terminal signals

First, we confirmed whether or not divergence features can be applied to a simple binary classification: discrimination between N-terminal signal containing proteins and N-signal-free proteins. Although the ratio of positive to negative examples in each dataset differs, the result of prediction by divergence features alone is higher than majority class classifier for all datasets (Table 9).

Table 9

Performance of N-signal vs N-signal-free protein binary classification on automatically collected orthologs

Yeast dataset	Mean accuracy	Mean AUC	Mean MCC
J48	71.47±5.00	0.67±0.07	0.36±0.12
SVM	75.35±3.49	0.71±0.04	0.44±0.08
The majority class fraction	65.23%	N/A	N/A
Human dataset
J48	69.32±4.10	0.72±0.07	0.43±0.09
SVM	72.28±5.95	0.72±0.06	0.43±0.12
The majority class fraction	62.41%	N/A	N/A
Plant dataset
J48	79.41±6.03	0.75±0.06	0.55±0.13
SVM	83.47±4.01	0.79±0.04	0.64±0.09
The majority class fraction	63.60%	N/A	N/A

Three classification performance measures when using only divergence features are shown for the discrimination of N-signal containing and N-signal-free proteins on automatically collected orthologs. AUC denotes the area under the ROC curves. For each measure the average and standard deviation is shown over the 5 folds of the cross-validation.

Performance of N-signal vs N-signal-free protein binary classification on automatically collected orthologs Three classification performance measures when using only divergence features are shown for the discrimination of N-signal containing and N-signal-free proteins on automatically collected orthologs. AUC denotes the area under the ROC curves. For each measure the average and standard deviation is shown over the 5 folds of the cross-validation. Next, we tested the predictive power of divergence in three-way classification on a dataset balanced to have equal class frequency (Table 10). It is evident that on balanced datasets, divergence also shows significant predictive power in distinguishing between the two different kinds of N-terminal signals, even for the relatively closely related mammal species.

Table 10

Performance for 3-way classification using SVM classifier on automatically collected orthologs

	F _DivYeast (73)		F _DivHuman (81)
	AUC	MCC	AUC	MCC
MTS	0.65±0.09	0.31±0.18	0.66±0.05	0.31±0.11
SP	0.60±0.07	0.19±0.14	0.70±0.08	0.40±0.15
N-signal-free	0.66±0.08	0.35±0.15	0.69±0.06	0.39±0.11
% accuracy	51.63±7.21		57.61±4.71

The 5-fold cross-validation performance of an SVM classifier using divergence features is shown for three-way classification on the automatically generated ortholog dataset for yeasts and mammals. The number of examples is given in parenthesis at top.

Performance for 3-way classification using SVM classifier on automatically collected orthologs The 5-fold cross-validation performance of an SVM classifier using divergence features is shown for three-way classification on the automatically generated ortholog dataset for yeasts and mammals. The number of examples is given in parenthesis at top. In plants, the divergence score can also discriminate between the three possible kinds of N-terminal signals better than random. However, there are only 15 experimentally validated SPs in this phylogenetic category (Table 2). Since this small sample size leads to a high statistical variance, we also computed the performance on balanced 3-way classification of MTS vs CTP vs N-signal-free (Table 11).

Table 11

Performance on balanced plant dataset using SVM classifier on automatically collected orthologs

	F _DivPlant 4 classes (15)		F _DivPlant 3 classes (61)
	AUC	MCC	AUC	MCC
MTS	0.62±0.11	0.24±0.21	0.66±0.08	0.35±0.14
SP	0.78±0.11	0.58±0.23	N/A	N/A
CTP	0.73±0.16	0.43±0.31	0.77±0.12	0.51±0.23
N-signal-free	0.80±0.14	0.72±0.20	0.81±0.09	0.67±0.13
% accuracy	60.00±9.13		66.22±10.11

The 5-fold cross-validation performance of an SVM classifier using divergence features is shown for three-way classification on balanced sets of (automatically generated) plant orthologs with or without the SP class. The number of examples is given in parenthesis at top.

Performance on balanced plant dataset using SVM classifier on automatically collected orthologs The 5-fold cross-validation performance of an SVM classifier using divergence features is shown for three-way classification on balanced sets of (automatically generated) plant orthologs with or without the SP class. The number of examples is given in parenthesis at top. In the Additional file 1 we list cross-validated performance estimates on various combinations of datasets and features. From these we draw two conclusions: in most cases divergence features slightly improve prediction when combined with standard features and in general computing standard features on the N-terminal 20 residues leads to higher accuracy than computing on 40 residues.

Post-hoc analysis of proteins for which divergence strongly influences the prediction result

In this section we discuss proteins for which the use of divergence features strongly affects the results. The ortholog MSA’s of all proteins mentioned in this section are available in the Additional file 2.

Divergence features may help flag misannotation

Prior to this work, evolutionary divergence has not been applied systematically to N-terminal signal prediction. However we expected that it might be able to capture interesting examples not revealed by other features. To investigate this, we ranked instances whose SVM prediction changes drastically depending on whether or not divergence features are used. Because of its rich annotation, we focused on S.cere., using the automatically defined ortholog set. The prediction result of 43 proteins changed depending on whether divergence features were added to conventional features. For these 43 proteins, we used the SVM numerical scores to rank the size of the effect as explained in the Additional file 1 (ranked list in Additional file 3: Table S1). The ortholog set multiple sequence alignments for these proteins are also available in the Additional file 2 in html form. In general, prediction differences are observed between the MTS and N-signal-free classes. The most highly affected protein is mitochondrial alanine tRNA ligase, ALA1 (P40825), which is predicted to have an MTS when sequence divergence features are used. Upon closer inspection we discovered that the sequence we used for this protein should in fact have been labeled as an MTS containing protein, but our dataset based on an earlier version of UniProtKB/Swiss-Prot contained mistaken annotation which holds for an alternative translation start site. Thus in this case sequence divergence yields the correct answer. PTP1 (P25044) is another protein whose prediction changes from N-signal-free to MTS when divergence is considered. Following UniProtKB/Swiss-Prot, we treated it as a cytoplasmic protein, but there is no reference given for this annotation. Moreover PTP1 is identified as a mitochondrial protein by two large-scale experiments. This is suggestive that it may have a mitochondrial localization, although even in that case it would not necessarily have an MTS. Hopefully future work will clarify if this is another case in which divergence features flagged misannotations in our dataset.

Divergence features may help detect mitochondrial proteins with non-classical MTS signals

FMP52 (P40008) is a protein included in our dataset for which the SVM with standard features predicts an MTS but the SVM with divergence features predicts N-signal-free. As shown in Figure 6, FMP52’s N-terminal region is not divergent like typical MTS’s, especially very near the N-terminus. FMP52 is indeed a mitochondrial protein, but upon closer scrutiny we discovered a previous report that it strongly associates with the outer membrane [59] — and therefore is unlikely to have a matrix targeting MTS. Moreover, FMP52 is one of the non-MTS containing proteins in the yeast proteomic analysis [32]. Swiss-Prot does annotate FMP52 with an MTS (1-44), but we could not find a reference or supporting information for this MTS annotation; therefore, we conclude that it is unlikely to have MTS. CYM1 (P32898) is another interesting example which has been reported to localize in the intermembrane space and not to be processed by mitochondrial proteases [60]. Since MTS is a cleavable targeting signal for the matrix, the intermembrane space localization and lack of proteolytic cleavage of CYM1 suggests its N-terminal signal is not a typical classical MTS.

Figure 6

MSA of FMP52 and its orthologs in 11 yeast species. Multiple sequence alignment of FMP52 in S.cerevisiae and its orthologs in other 10 yeast species. The red boxed region shows annotated MTS of FMP52. The conserved positions are colored by Jalview.

MSA of FMP52 and its orthologs in 11 yeast species. Multiple sequence alignment of FMP52 in S.cerevisiae and its orthologs in other 10 yeast species. The red boxed region shows annotated MTS of FMP52. The conserved positions are colored by Jalview. MrpL19 (P53875) is another case in which sequence divergence features highlight a ribosomal mitochondrial protein which does not appear to have a classical MTS signal. According to both UniProtKB/Swiss-Prot annotation and a large-scale proteomics experiment [32] MrpL19 has an MTS, but the annotated “MTS” is unusually long and lacks an arginine in position -2, which is normally observed in MPP cleavage sites [9]. Moreover the N-terminal sequence of MrpL19 is very well conserved not only in yeasts but even in bacteria. Indeed the three dimensional structure of rplK, a homolog of MrpL19 in E.coli, has been solved and it is evident that the two proteins have a similar structured N-terminal. Taken together the evidence suggests that MrpL19 may not have an N-terminal mitochondrial localization signal, but rather be imported via an alternative pathway. On the other hand, we also observed ribosomal mitochondrial proteins whose N-terminal is poorly conserved. One example is MrpL32 (P25348), which cannot be predicted as having an MTS by standard tools such as TargetP [61] or Predotar [35], nor by our SVM’s trained without divergence features. MrpL32 shows a high divergence in its N-terminal region (Figure 7) and is predicted to have an MTS by our SVM when using divergence features. A literature search revealed that MrpL32 does indeed have an MTS, but it is unusual in the sense that it is cleaved by the protease m-AAA [62,63] instead of MPP. Mrp7 (P12687) is a similar case. Like MrpL32, Mrp7 is also a component of a large ribosomal subunit and is not predicted to have an MTS by TargetP, Predator, nor by our SVM without divergence features, but is predicted to have an MTS when divergence features are used. In UniProtKB/Swiss-Prot, Mrp7 is annotated as having an MTS, and indeed the processing of Mrp7 by MPP has been reported multiple times [32,64]. So in this case high sequence divergence allows an MTS to be correctly predicted.

Figure 7

MSA of MrpL32 and its orthologs in 11 yeast species. Multiple sequence alignment of MrpL32 in S.cerevisiae and its orthologs in 10 other yeast species. The red boxed region shows MTS of MrpL32. The conserved positions are colored by Jalview.

MSA of MrpL32 and its orthologs in 11 yeast species. Multiple sequence alignment of MrpL32 in S.cerevisiae and its orthologs in 10 other yeast species. The red boxed region shows MTS of MrpL32. The conserved positions are colored by Jalview. Another case worth discussing is IMO32 (P53219), which has recently been reported to be processed by the intermediate protease Oct1 (after MPP) in the matrix [65]. It is unusual in that its inferred MPP cleavage site represents a rare exception to the almost invariant presence of arginine at the -2 position. IMO32 is predicted as an MTS by Predator [35] and our SVM when we use divergence, but not by our SVM without divergence features, nor by TargetP [61].

Discussion

Although strong sequence similarity is a widely used indicator of co-localization, characteristically low sequence conservation in signal sequence regions has not been utilized for prediction. Other authors have noted the low sequence conservation of N-terminal sorting signals such as MTS sequences [66], but our work reported here is the first investigation of the utility of sequence divergence as a predictive feature for N-terminal sorting signals. Our method requires defining an ortholog set for each gene. The YGOB curated dataset for 11 yeast species is a reliable way to obtain orthologs, but this kind of database is not available for most species. We show that a simple reciprocal best hit method identified orthologs with sufficient reliability for the purposes of computing sequence diversity. One avenue for future research is to relax the requirement of global alignment reciprocal best hit designed to find orthologs, and simply use for (possibly paralogous) homologous sequences. In this study we chose to focus on orthologs because paralogs often have distinct localization sites. For example, Rosso et al. [67] describe the interesting case of the human glutamate dehydrogenases GLUD1 and GLUD2. These paralogs result from a gene duplication event, but GLUD1 localizes to both the cytosol and the mitochondria while GLUD2 localizes exclusively to the mitochondria. Interestingly, the N-terminal region of GLUD2, which functions as an MTS, has evolved faster than GLUD1 [67]. Since we made a few somewhat arbitrary choices when defining divergence features, we performed an post hoc analysis to see if simply tuning those parameters would significantly affect the prediction accuracy. Namely, we investigated the effect of the changing the window length and position of the downstream normalizing window used to define NCdiff, but found that prediction accuracy is not strongly dependent on the exact value of these parameters (Additional file 1: Figures S1,S2). Another potential weakness of our method is the simple entropy based definition we used for sequence divergence, which ignores the phylogenetic relationship of the species involved. Many sophisticated measures have been proposed to quantify the degree of sequence conservation [42]. We did experiment with some of them, such as the Jensen-Shannon divergence [68] to try to improve prediction, but without success (results not shown). However we did not extensively explore the possibilities and believe that the simple entropy score employed here probably can be improved upon. On the other hand, we did provide quantitative evidence that the entropy divergence score has considerable predictive power by itself. The examples ALA1 and FMP52 show that divergence can flag proteins (typically mitochondrial ones) with misannotated MTS information and give a hint regarding which compartment of the mitochondria they localize to. Examples like MrpL32, show that when the predictions of standard predictors are inconsistent with the degree of sequence divergence, non-typical MTS’s, processing proteases or alternative mitochondrial localization pathways may be indicated. One weakness in our datasets is that many of our SP proteins are not experimentally validated, but rather annotated as SP proteins due to UniProtKB/Swiss-Prot annotation and prediction from amino acid sequence with SignalP [33] in the yeast dataset. This unfortunate circularity (predicting predictions) is unavoidable because: 1) only a handful of SP’s have been experimentally verified, and 2) the presence of SP’s cannot be reliably inferred exclusively from localization site for most S.cere. proteins. It may be reasonable to assume that secreted proteins all have SP’s, but S.cere. secretes very few proteins (the Swiss-Prot derived WoLF PSORT [69] dataset lists only six). Proteins which localize to the E.R. or Golgi body generally posses SPs, but many proteins annotated as E.R. or Golgi are non-SP containing peripheral membrane proteins, which localize to the periphery of these organelles. However, the risk of incorrect conclusion resulted from employing non-verified SP data is small. First, this problem only applies to the SP class, as recent proteomics data has provided direct measurement of many MTS’s [11,32]. Second, given the intense study of S.cere. and the continued scrutiny of UniProtKB/Swiss-Prot by the research community, we find it unlikely that a large fraction of the SP proteins in our dataset are incorrectly labeled. Third, our argument is not completely circular. SignalP prediction is based on physico-chemical features but not divergence (or conservation) for prediction, and the results shown in Figure 5 suggest physico-chemical features do not correlate very closely with sequence divergence.

Conclusion

We find it rather remarkable that the accuracy of balanced 3-way prediction can be improved to more than 50% just by using simply defined sequence divergence features, while otherwise completely hiding the amino acid sequence of the protein. Although we readily admit the limited scope of this work, it is the first to quantitatively explore sequence divergence as a feature for localization signal prediction. We feel confident that our observation will stand the test of time, as more and more organisms are fully sequenced.

Note

A preliminary version of this work appeared as a conference proceedings paper [70].

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

YF performed most of the study and wrote much of the manuscript. RL helped with initial attempts at automatic ortholog set determination. PH conceived of the study and wrote some of the manuscript. All authors contributed to discussion and have read and approved the final manuscript.

Additional file 1

Supplementary Text. Contains the supplementary text with tables and figures. Click here for file

Additional file 2

MSA’s of proteins for which sequence divergence changes predicted localization signals. Contains links to ortholog multiple sequence alignments of each protein in Additional file 3: Table S1. Click here for file

Additional file 3

List of proteins for which sequence divergence changes predicted localization signals. A tab separated values file listing proteins and their prediction scores with and without the use of divergence features. Click here for file

55 in total

1. A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome.

Authors: A Drawid; M Gerstein
Journal: J Mol Biol Date: 2000-08-25 Impact factor: 5.469

Review 2. A reevaluation of dual-targeting of proteins to mitochondria and chloroplasts.

Authors: Christopher Carrie; Ian Small
Journal: Biochim Biophys Acta Date: 2012-06-07

Review 3. Prediction of subcellular locations of proteins: where to proceed?

Authors: Kenichiro Imai; Kenta Nakai
Journal: Proteomics Date: 2010-11-02 Impact factor: 3.984

4. Global analysis of the mitochondrial N-proteome identifies a processing peptidase critical for protein stability.

Authors: F-Nora Vögtle; Stefanie Wortelkamp; René P Zahedi; Dorothea Becker; Claudia Leidhold; Kris Gevaert; Josef Kellermann; Wolfgang Voos; Albert Sickmann; Nikolaus Pfanner; Chris Meisinger
Journal: Cell Date: 2009-10-16 Impact factor: 41.582

5. Presequence-dependent folding ensures MrpL32 processing by the m-AAA protease in mitochondria.

Authors: Florian Bonn; Takashi Tatsuta; Carmelina Petrungaro; Jan Riemer; Thomas Langer
Journal: EMBO J Date: 2011-05-24 Impact factor: 11.598

6. Relation between amino acid composition and cellular location of proteins.

Authors: J Cedano; P Aloy; J A Pérez-Pons; E Querol
Journal: J Mol Biol Date: 1997-02-28 Impact factor: 5.469

7. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes.

Authors: L McCue; W Thompson; C Carmack; M P Ryan; J S Liu; V Derbyshire; C E Lawrence
Journal: Nucleic Acids Res Date: 2001-02-01 Impact factor: 16.971

8. Role of the novel metallopeptidase Mop112 and saccharolysin for the complete degradation of proteins residing in different subcompartments of mitochondria.

Authors: Melanie Kambacheld; Steffen Augustin; Takashi Tatsuta; Stefan Müller; Thomas Langer
Journal: J Biol Chem Date: 2005-03-16 Impact factor: 5.157

9. Characterization of the signal peptide at the amino terminus of the rat peroxisomal 3-ketoacyl-CoA thiolase precursor.

Authors: T Tsukamoto; S Hata; S Yokota; S Miura; Y Fujiki; M Hijikata; S Miyazawa; T Hashimoto; T Osumi
Journal: J Biol Chem Date: 1994-02-25 Impact factor: 5.157

10. The m-AAA protease defective in hereditary spastic paraplegia controls ribosome assembly in mitochondria.

Authors: Mark Nolden; Sarah Ehses; Mirko Koppen; Andrea Bernacchia; Elena I Rugarli; Thomas Langer
Journal: Cell Date: 2005-10-21 Impact factor: 41.582

6 in total

1. Mining the Cicer arietinum genome for the mildew locus O (Mlo) gene family and comparative evolutionary analysis of the Mlo genes from Medicago truncatula and some other plant species.

Authors: Reena Deshmukh; V K Singh; Brahma Deo Singh
Journal: J Plant Res Date: 2016-11-30 Impact factor: 2.629

2. Transcriptome analysis of Leucojum aestivum and identification of genes involved in norbelladine biosynthesis.

Authors: Laurence Tousignant; Aracely Maribel Diaz-Garza; Bharat Bhusan Majhi; Sarah-Eve Gélinas; Aparna Singh; Isabel Desgagne-Penix
Journal: Planta Date: 2022-01-03 Impact factor: 4.116

3. MitoFates: improved prediction of mitochondrial targeting sequences and their cleavage sites.

Authors: Yoshinori Fukasawa; Junko Tsuji; Szu-Chin Fu; Kentaro Tomii; Paul Horton; Kenichiro Imai
Journal: Mol Cell Proteomics Date: 2015-02-10 Impact factor: 5.911

4. Cytosolic and Nuclear Co-localization of Betalain Biosynthetic Enzymes in Tobacco Suggests that Betalains Are Synthesized in the Cytoplasm and/or Nucleus of Betalainic Plant Cells.

Authors: Ning Chen; Zhi-Hai Yu; Xing-Guo Xiao
Journal: Front Plant Sci Date: 2017-05-18 Impact factor: 5.753

Review 5. A Brief History of Protein Sorting Prediction.

Authors: Henrik Nielsen; Konstantinos D Tsirigos; Søren Brunak; Gunnar von Heijne
Journal: Protein J Date: 2019-06 Impact factor: 2.371

6. Evolutionary Interplay between Symbiotic Relationships and Patterns of Signal Peptide Gain and Loss.

Authors: Peter Hönigschmid; Nadya Bykova; René Schneider; Dmitry Ivankov; Dmitrij Frishman
Journal: Genome Biol Evol Date: 2018-03-01 Impact factor: 3.416

6 in total