Yoshinori Fukasawa, Ross K K Leung, Stephen K W Tsui, Paul Horton.
Abstract
BACKGROUND: Protein subcellular localization is a central problem in understanding cell biology and has been the focus of intense research. To predict localization from amino acid sequence, a myriad of features have been tried, including amino acid composition, sequence similarity, the presence of certain motifs or domains, and many others. Surprisingly, sequence conservation of sorting motifs has not yet been employed, despite its extensive use for tasks such as the prediction of transcription factor binding sites.
Year: 2014 PMID: 24438075 PMCID: PMC3906766 DOI: 10.1186/1471-2164-15-46
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
List of species used to define orthologs in each phylogenetic category
The species listed at the top are the reference species used to determine the subcellular localization site class labels. In the case of plants, one of G. max, O. sativa and C. reinhardtii was used as the reference species for proteins for which no annotation was available in A. thaliana.
The number of ortholog sets by localization class in each phylogenetic division
| Localization class | Yeast (curated) | Yeast (RBH) | Human (RBH) | Plant (RBH) |
| MTS | 179 | 219 | 81 | 61 |
| SP | 53 | 73 | 169 | 15 |
| CTP | N/A | N/A | N/A | 97 |
| N-signal-free | 450 | 560 | 415 | 99 |
For each ortholog dataset, the number of ortholog sets in each localization class is listed. RBH orthologs are defined by the reciprocal best hit method.
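The reciprocal best hit criterion named in the caption can be illustrated with a minimal sketch. The best-hit tables, the gene names, and the `reciprocal_best_hits` helper are hypothetical; the text here does not say which similarity search or score cutoff was used to find best hits.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical best-hit tables, e.g. parsed from an all-vs-all similarity search.
ref_to_other = {"geneA": "x1", "geneB": "x2", "geneC": "x2"}
other_to_ref = {"x1": "geneA", "x2": "geneB"}

rbh_pairs = reciprocal_best_hits(ref_to_other, other_to_ref)
print(rbh_pairs)  # geneC is dropped: x2's best hit is geneB, not geneC
```

Requiring agreement in both directions is what filters out one-sided matches such as `geneC` in this toy input.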
Figure 1. Relationship between mean divergence score and the number of sequences in MSAs. A box plot illustrating the mean, quartiles and range of the column entropy score for MSAs in the yeast autoOrthoMSA dataset, partitioned by the number of sequences in the MSA.
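As a rough illustration of a column entropy score, the sketch below computes the Shannon entropy of each alignment column and averages it over the alignment. This is an assumption: the exact divergence formula, gap handling, and any normalization used by the authors are not given in this text, and the toy alignment is invented.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())


def mean_column_entropy(msa):
    """Average column entropy over an MSA given as equal-length strings."""
    columns = list(zip(*msa))
    return sum(column_entropy(col) for col in columns) / len(columns)


# Invented four-sequence alignment, for illustration only.
toy_msa = ["MLSA", "MLTA", "MISA", "MVSG"]
print(round(mean_column_entropy(toy_msa), 3))
```

A fully conserved column scores 0 bits, and the maximum attainable entropy grows with the number of sequences, which is one reason to examine the score as a function of MSA depth.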
List of entropy derived features
| LD(i) | |
| Average of | |
| Standard deviation of | |
| NCdiff | |
Figure 2. An example of an MTS-containing protein. A multiple sequence alignment of the protein mtHSP70 (UniProt accession P0CS90) and its orthologs from five species of yeast. The red box indicates the cleaved MTS in S. cerevisiae. Conserved positions are colored by Jalview.
Figure 3. Local divergence score over the N-terminal region. Average local divergence scores are shown for the 100-residue N-terminal region of MTS-containing, SP-containing, and N-signal-free proteins. The top left panel is calculated from the yeast curated ortholog dataset, and the others from automatically collected orthologs. For the plant dataset, CTP-containing proteins are also shown. The error bars denote standard error. For clarity, error bars are only shown for every fifth position.
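The per-position averages and standard errors described in this caption can be sketched as follows. The input layout (one list of local divergence scores per protein, all of equal length) and the helper names are assumptions made for illustration; the scores shown are invented.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a sequence of numbers."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        return mean, 0.0
    return mean, statistics.stdev(values) / math.sqrt(len(values))


def positional_profile(score_rows):
    """Per-position (mean, SEM) across proteins; each row holds one protein's scores."""
    return [mean_and_sem(column) for column in zip(*score_rows)]


# Invented scores for two proteins over two positions.
profile = positional_profile([[1.0, 2.0], [3.0, 2.0]])
print(profile)
```

Plotting the SEM only at every fifth position, as the caption describes, would then amount to slicing `profile[::5]`.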
Figure 4. Importance of each feature. The importance of each attribute, as estimated by information gain, is shown for the YGOB ortholog set. At left, the divergence-related scores are shown as light blue lines. For local divergence features LD(i), only the residue number i is listed. Dark blue lines denote standard features of the N-terminal 40 residues, such as physico-chemical properties or amino acid composition. The suffix “f” denotes amino acid composition computed from the full length of the protein.
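Information gain for a numeric feature can be sketched as the drop in class-label entropy after splitting on a threshold. How the original analysis discretized numeric attributes is not stated here, so the single fixed threshold, the helper names, and the toy values below are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels, threshold):
    """Entropy reduction from splitting the labels at feature <= threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - conditional


# Invented LD-like scores for four proteins.
scores = [0.1, 0.2, 0.8, 0.9]
classes = ["N-signal-free", "N-signal-free", "MTS", "MTS"]
gain = information_gain(scores, classes, threshold=0.5)
print(gain)
```

A feature that separates the classes perfectly recovers the full label entropy (1 bit for two balanced classes), while an uninformative feature yields a gain near zero.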
Figure 5. Correlation between divergence and physico-chemical properties. Scatter plots of LD(13) (on the vertical axis) vs. physico-chemical properties: (A) average hydrophobicity, (B) number of negatively charged residues and (C) arginine composition, for the YGOB ortholog set. MTS proteins are shown in red, SP in blue and N-signal-free proteins in green.
Performance of N-signal vs N-signal-free protein binary classification
| J48 | 72.49±3.30 | ||
| - (randomized) | 65.85±0.66 | 0.50±0.01 | 0.00±0.03 |
| SVM | |||
| - (randomized) | 66.19±0.09 | 0.50±0.00 | 0.00±0.00 |
| The majority class fraction | 65.98 | N/A | N/A |
Three classification performance measures when using only divergence features are shown for the discrimination of N-signal-containing and N-signal-free proteins (yeast curated ortholog sets). AUC denotes the area under the ROC curve. (randomized) indicates the values obtained with the localization class labels randomly shuffled 100 times. For each measure, the average and standard deviation are shown over the 5 folds of the cross-validation, or 500 (5 × 100 trials) folds in the case of the randomized data.
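The majority class fraction and the label-shuffling control can be sketched as below. The class counts are taken from the first column of the ortholog-set count table (179 MTS, 53 SP, 450 N-signal-free), which reproduce the 65.98 reported here; the helper names and the seeding scheme are assumptions, and no classifier is trained in this sketch.

```python
import random
from collections import Counter

# Ortholog-set counts from the first column of the count table.
counts = {"MTS": 179, "SP": 53, "N-signal-free": 450}

# Binary labels: MTS and SP proteins both carry an N-terminal signal.
labels = (["N-signal"] * (counts["MTS"] + counts["SP"])
          + ["N-signal-free"] * counts["N-signal-free"])


def majority_fraction(labels):
    """Accuracy of always predicting the most frequent class."""
    return max(Counter(labels).values()) / len(labels)


def shuffled(labels, seed):
    """Return a permuted copy of the labels (one randomization trial)."""
    rng = random.Random(seed)
    permuted = list(labels)
    rng.shuffle(permuted)
    return permuted


baseline = 100 * majority_fraction(labels)
print(f"{baseline:.2f}")  # 65.98

# 100 randomized label sets; each would be passed through the same cross-validation.
trials = [shuffled(labels, seed) for seed in range(100)]
```

Shuffling preserves the class proportions while breaking any link between features and labels, so a classifier evaluated on such data should score near the majority class fraction.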
Performance of 3-way classification using SVM classifier
| | ||||||
| MTS | 0.67±0.03 | 0.36±0.06 | 0.76±0.05 | |||
| SP | 0.50±0.00 | 0.00±0.00 | 0.81±0.08 | 0.70±0.11 | ||
| N-signal-free | 0.66±0.02 | 0.36±0.03 | 0.85±0.03 | 0.72±0.05 | ||
| 70.82±1.61 | 87.24±1.86 | |||||
The 5-fold cross-validation performance of an SVM classifier using divergence features only, classical features only, and the two combined is shown for three-way classification on the yeast curated ortholog dataset. Classical features are computed based on the N-terminal 40 residues.
Performance on balanced dataset for MTS vs SP vs N-signal-free protein prediction using SVM classifier
| | ||||||
| MTS | 0.67±0.10 | 0.35±0.20 | 0.84±0.07 | 0.68±0.13 | ||
| SP | 0.71±0.09 | 0.41±0.16 | 0.92±0.05 | 0.85±0.10 | ||
| N-signal-free | 0.79±0.07 | 0.60±0.13 | 0.78±0.09 | 0.57±0.18 | ||
| 62.86±5.84 | 79.92±5.54 | |||||
The 5-fold cross-validation performance of an SVM classifier using divergence features only, classical features only, and the two combined is shown for three-way classification on a balanced dataset (53 proteins from each class, yeast curated orthologs).
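One simple way to obtain such a balanced set is to down-sample every class to the size of the smallest one (53, the SP class). The text does not say how the 53 proteins per class were chosen, so random down-sampling, the `balance_by_downsampling` helper, and the placeholder identifiers are assumptions.

```python
import random


def balance_by_downsampling(items_by_class, seed=0):
    """Randomly down-sample every class to the size of the smallest class."""
    rng = random.Random(seed)
    target = min(len(items) for items in items_by_class.values())
    return {cls: rng.sample(items, target) for cls, items in items_by_class.items()}


# Placeholder identifiers sized like the yeast curated classes (179, 53, 450).
dataset = {
    "MTS": [f"mts_{i}" for i in range(179)],
    "SP": [f"sp_{i}" for i in range(53)],
    "N-signal-free": [f"free_{i}" for i in range(450)],
}

balanced = balance_by_downsampling(dataset)
print({cls: len(items) for cls, items in balanced.items()})
```

With equal class sizes, the majority class baseline for three-way classification drops to one third, which makes accuracy figures easier to interpret.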
Performance of 3-way classification using SVM classifier (feature length 20)
| | ||||||
| MTS | 0.67±0.03 | 0.36±0.06 | 0.80±0.02 | |||
| SP | 0.50±0.00 | 0.00±0.00 | 0.97±0.03 | 0.92±0.07 | ||
| N-signal-free | 0.66±0.02 | 0.36±0.03 | 0.81±0.02 | |||
| 70.82±1.61 | 91.49±1.26 | |||||
The 5-fold cross-validation performance of an SVM classifier using divergence features only, classical features only, and the two combined is shown for three-way classification on our entire yeast curated ortholog dataset. Classical features are calculated from the N-terminal 20 amino acids.
Confusion Matrix from 3-way classification using SVM classifier (feature length 20)
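| Actual class | Divergence: MTS | Divergence: SP | Divergence: N-signal-free | Classical: MTS | Classical: SP | Classical: N-signal-free | Combined: MTS | Combined: SP | Combined: N-signal-free |
| MTS | 83 | 0 | 96 | 148 | 1 | 30 | 144 | 0 | 35 |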
| SP | 16 | 0 | 37 | 0 | 50 | 3 | 1 | 51 | 1 |
| N-signal-free | 50 | 0 | 400 | 20 | 4 | 426 | 15 | 1 | 434 |
The confusion matrix from 5-fold cross-validation of an SVM classifier using divergence features only, classical features only, and the two combined is shown for three-way classification on our entire yeast curated ortholog dataset. Classical features are calculated from the N-terminal 20 amino acids.
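The pooled accuracy and per-class recall implied by this confusion matrix can be recomputed with a short sketch. It assumes that rows are actual classes and that each group of three columns gives predicted MTS, SP and N-signal-free counts for the divergence, classical and combined feature sets in that order; the row sums (179, 53, 450) are consistent with that reading. The helper names are illustrative.

```python
CLASSES = ["MTS", "SP", "N-signal-free"]

# Rows: assumed actual class; columns: assumed predicted class (CLASSES order).
confusion = {
    "divergence": [[83, 0, 96], [16, 0, 37], [50, 0, 400]],
    "classical": [[148, 1, 30], [0, 50, 3], [20, 4, 426]],
    "combined": [[144, 0, 35], [1, 51, 1], [15, 1, 434]],
}


def accuracy(matrix):
    """Fraction of examples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """Fraction of each actual class that was predicted correctly."""
    return {cls: row[i] / sum(row) for i, (cls, row) in enumerate(zip(CLASSES, matrix))}


for name, matrix in confusion.items():
    print(name, f"{100 * accuracy(matrix):.2f}", recall_per_class(matrix))
```

Pooled accuracy computed this way can differ in the last digit from the fold-averaged values reported in the performance table, since averaging over folds weights each fold equally.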
Performance of N-signal vs N-signal-free protein binary classification on automatically collected orthologs
| Yeast dataset | | | |
| J48 | 71.47±5.00 | 0.67±0.07 | 0.36±0.12 |
| SVM | |||
| The majority class fraction | 65.23 | N/A | N/A |
| Human dataset | | | |
| J48 | 69.32±4.10 | ||
| SVM | |||
| The majority class fraction | 62.41 | N/A | N/A |
| Plant dataset | | | |
| J48 | 79.41±6.03 | 0.75±0.06 | 0.55±0.13 |
| SVM | |||
| The majority class fraction | 63.60 | N/A | N/A |
Three classification performance measures when using only divergence features are shown for the discrimination of N-signal-containing and N-signal-free proteins on automatically collected orthologs. AUC denotes the area under the ROC curve. For each measure, the average and standard deviation are shown over the 5 folds of the cross-validation.
Performance for 3-way classification using SVM classifier on automatically collected orthologs
| | ||||
| MTS | 0.65±0.09 | 0.31±0.18 | 0.66±0.05 | 0.31±0.11 |
| SP | 0.60±0.07 | 0.19±0.14 | 0.70±0.08 | 0.40±0.15 |
| N-signal-free | 0.66±0.08 | 0.35±0.15 | 0.69±0.06 | 0.39±0.11 |
| 51.63±7.21 | 57.61±4.71 | |||
The 5-fold cross-validation performance of an SVM classifier using divergence features is shown for three-way classification on the automatically generated ortholog datasets for yeasts and mammals. The number of examples is given in parentheses at the top.
Performance on balanced plant dataset using SVM classifier on automatically collected orthologs
| | ||||
| MTS | 0.62±0.11 | 0.24±0.21 | 0.66±0.08 | 0.35±0.14 |
| SP | 0.78±0.11 | 0.58±0.23 | N/A | N/A |
| CTP | 0.73±0.16 | 0.43±0.31 | 0.77±0.12 | 0.51±0.23 |
| N-signal-free | 0.80±0.14 | 0.72±0.20 | 0.81±0.09 | 0.67±0.13 |
| 60.00±9.13 | 66.22±10.11 | |||
The 5-fold cross-validation performance of an SVM classifier using divergence features is shown for classification on balanced sets of (automatically generated) plant orthologs with or without the SP class. The number of examples is given in parentheses at the top.
Figure 6. MSA of FMP52 and its orthologs in 11 yeast species. Multiple sequence alignment of FMP52 in S. cerevisiae and its orthologs in 10 other yeast species. The red boxed region shows the annotated MTS of FMP52. The conserved positions are colored by Jalview.
Figure 7. MSA of MrpL32 and its orthologs in 11 yeast species. Multiple sequence alignment of MrpL32 in S. cerevisiae and its orthologs in 10 other yeast species. The red boxed region shows the MTS of MrpL32. The conserved positions are colored by Jalview.