Marco Anteghini, Vitor Martins Dos Santos, Edoardo Saccenti.
Abstract
Peroxisomes are ubiquitous membrane-bound organelles, and aberrant localisation of peroxisomal proteins contributes to the pathogenesis of several disorders. Many computational methods assign protein sequences to subcellular compartments, but no tool is specifically tailored to the sub-localisation (matrix vs. membrane) of peroxisomal proteins. We present In-Pero, a new method for predicting the sub-peroxisomal localisation of proteins. In-Pero combines standard machine learning approaches with recently proposed multi-dimensional deep-learning representations of the protein amino-acid sequence. It achieved a classification accuracy above 0.9 in predicting peroxisomal matrix and membrane proteins. The method is trained and tested using a double cross-validation approach on a curated data set comprising 160 peroxisomal proteins with experimental evidence for sub-peroxisomal localisation. We further show that the proposed approach can be easily adapted (In-Mito) to the prediction of mitochondrial protein localisation, obtaining performances for certain classes of proteins (matrix and inner-membrane) superior to existing tools.
Keywords: machine learning; neural networks; protein sequence encoding and embedding; sub-mitochondrial localisation; sub-peroxisomal localisation; subcellular localisation
Year: 2021 PMID: 34203866 PMCID: PMC8232616 DOI: 10.3390/ijms22126409
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. (A) The eukaryotic cell and its organelles and compartments: (1) Nucleolus, (2) Nucleus, (3) Ribosome, (4) Peroxisome, (5) Rough endoplasmic reticulum, (6) Golgi apparatus, (7) Cytoskeleton, (8) Smooth endoplasmic reticulum, (9) Mitochondrion, (10) Vacuole, (11) Cytoplasm, (12) Lysosome, (13) Vesicles. (B) The peroxisome and its structure, showing the lipid bilayer membrane, the inner matrix and the crystalloid core (not always present). Peroxisomal proteins can be divided into two groups, matrix and membrane proteins, depending on their localisation. Membrane proteins are attached to the inner or outer surface of the membrane or span the bilayer (trans-membrane proteins). Panel A is partially adapted from en.wikipedia.org/wiki/Ribosome#/media/File:Animal_Cell.svg, accessed on 18 February 2021.
Step forward feature selection for each of the compared methods. (a) Logistic Regression (LR) performances; (b) Support Vector Machines (SVM) performances; (c) Partial Least Squares–Discriminant Analysis (PLS–DA) performances; (d) Random Forest (RF) performances. The analysed encodings and embeddings are protein one-hot encoding (1HOT), residue physical-chemical properties encoding (PROP), position specific scoring matrix (PSSM), Unified Representation (UniRep) and Sequence to Vector (SeqVec). The results refer to the Double Cross Validation (DCV) procedure performed for each iteration of the forward feature selection (Section 4.5). The (inner) score refers to the inner loop of the DCV, while (outer) refers to the outer loop. The performances are reported in terms of F1 score, BACC, MCC and ACC (see Section 4.8 and SM).
| (a) LR | F1 (inner) | F1 (outer) | BACC | MCC | ACC |
|---|---|---|---|---|---|
| 1HOT | 0.577 | 0.623 ± 0.071 | 0.618 ± 0.075 | 0.269 ± 0.143 | 0.809 ± 0.036 |
| PROP | 0.607 | 0.595 ± 0.109 | 0.591 ± 0.093 | 0.213 ± 0.222 | 0.794 ± 0.054 |
| PSSM | 0.615 | 0.575 ± 0.067 | 0.604 ± 0.089 | 0.177 ± 0.144 | 0.719 ± 0.040 |
| | | | | | |
| SeqVec | 0.792 | 0.712 ± 0.068 | 0.726 ± 0.079 | 0.427 ± 0.140 | 0.825 ± 0.042 |
| UniRep + 1HOT | 0.636 | 0.648 ± 0.103 | 0.650 ± 0.111 | 0.312 ± 0.204 | 0.806 ± 0.061 |
| UniRep + PROP | 0.614 | 0.595 ± 0.104 | 0.589 ± 0.093 | 0.234 ± 0.217 | 0.812 ± 0.040 |
| UniRep + PSSM | 0.634 | 0.615 ± 0.100 | 0.615 ± 0.100 | 0.201 ± 0.166 | 0.738 ± 0.042 |
| | | | | | |

| (b) SVM | F1 (inner) | F1 (outer) | BACC | MCC | ACC |
|---|---|---|---|---|---|
| 1HOT | 0.624 | 0.693 ± 0.130 | 0.713 ± 0.139 | 0.396 ± 0.261 | 0.819 ± 0.070 |
| PROP | 0.634 | 0.616 ± 0.108 | 0.606 ± 0.094 | 0.274 ± 0.226 | 0.819 ± 0.041 |
| PSSM | 0.631 | 0.602 ± 0.087 | 0.623 ± 0.102 | 0.217 ± 0.178 | 0.750 ± 0.044 |
| UniRep | 0.775 | 0.768 ± 0.077 | 0.755 ± 0.099 | 0.544 ± 0.162 | 0.869 ± 0.036 |
| | | | | | |
| SeqVec + 1HOT | 0.680 | 0.757 ± 0.079 | 0.774 ± 0.103 | 0.527 ± 0.166 | 0.856 ± 0.042 |
| SeqVec + PROP | 0.648 | 0.597 ± 0.114 | 0.589 ± 0.099 | 0.218 ± 0.229 | 0.812 ± 0.044 |
| SeqVec + PSSM | 0.634 | 0.614 ± 0.091 | 0.639 ± 0.110 | 0.244 ± 0.188 | 0.756 ± 0.041 |
| | | | | | |

| (c) PLS–DA | F1 (inner) | F1 (outer) | BACC | MCC | ACC |
|---|---|---|---|---|---|
| 1HOT | 0.452 | 0.452 ± 0.005 | 0.500 ± 0.001 | 0.001 ± 0.001 | 0.825 ± 0.015 |
| PROP | 0.551 | 0.582 ± 0.086 | 0.575 ± 0.065 | 0.249 ± 0.198 | 0.831 ± 0.032 |
| PSSM | 0.542 | 0.592 ± 0.133 | 0.582 ± 0.092 | 0.277 ± 0.290 | 0.844 ± 0.044 |
| | | | | | |
| SeqVec | 0.759 | 0.707 ± 0.081 | 0.695 ± 0.080 | 0.419 ± 0.160 | 0.844 ± 0.044 |
| UniRep + 1HOT | 0.478 | 0.471 ± 0.051 | 0.502 ± 0.034 | 0.002 ± 0.112 | 0.806 ± 0.023 |
| UniRep + PROP | 0.478 | 0.471 ± 0.051 | 0.502 ± 0.034 | 0.267 ± 0.128 | 0.825 ± 0.032 |
| UniRep + PSSM | 0.564 | 0.616 ± 0.110 | 0.599 ± 0.075 | 0.326 ± 0.233 | 0.850 ± 0.041 |
| | | | | | |

| (d) RF | F1 (inner) | F1 (outer) | BACC | MCC | ACC |
|---|---|---|---|---|---|
| 1HOT | 0.569 | 0.401 ± 0.077 | 0.523 ± 0.050 | 0.046 ± 0.089 | 0.450 ± 0.124 |
| PROP | 0.631 | 0.572 ± 0.016 | 0.564 ± 0.012 | 0.203 ± 0.090 | 0.812 ± 0.020 |
| PSSM | 0.618 | 0.585 ± 0.110 | 0.567 ± 0.088 | 0.261 ± 0.261 | 0.819 ± 0.064 |
| | | | | | |
| SeqVec | 0.695 | 0.691 ± 0.035 | 0.720 ± 0.053 | 0.407 ± 0.790 | 0.800 ± 0.042 |
| UniRep + 1HOT | 0.728 | 0.703 ± 0.063 | 0.765 ± 0.089 | 0.443 ± 0.139 | 0.794 ± 0.032 |
| UniRep + PROP | 0.710 | 0.692 ± 0.093 | 0.731 ± 0.113 | 0.403 ± 0.192 | 0.806 ± 0.041 |
| UniRep + PSSM | 0.699 | 0.743 ± 0.100 | 0.776 ± 0.128 | 0.501 ± 0.209 | 0.844 ± 0.052 |
| | | | | | |
| UniRep + SeqVec + 1HOT | 0.774 | 0.721 ± 0.108 | 0.738 ± 0.121 | 0.456 ± 0.214 | 0.844 ± 0.044 |
| | | | | | |
| UniRep + SeqVec + PSSM | 0.741 | 0.733 ± 0.123 | 0.754 ± 0.136 | 0.480 ± 0.242 | 0.850 ± 0.054 |
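
The four scores reported in the table (F1, BACC, MCC and ACC) can all be derived from a binary confusion matrix. The following is a minimal sketch, not the authors' implementation: the function name `binary_metrics`, the choice of computing F1 for a single positive class, and the toy labels are assumptions made for illustration (the exact definitions used in the paper are given in its Section 4.8).

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Compute ACC, F1, BACC and MCC for a two-class problem.

    Hypothetical helper for illustration; F1 is computed for the
    `positive` class only, which may differ from the paper's averaging.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    acc = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    bacc = (recall + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "F1": f1, "BACC": bacc, "MCC": mcc}


# Toy, made-up labels: 8 majority-class and 2 minority-class proteins,
# with a classifier that always predicts the majority class.
toy_true = [0] * 8 + [1] * 2
toy_pred = [0] * 10
toy_scores = binary_metrics(toy_true, toy_pred)
print(toy_scores)
```

On this toy imbalanced set, always predicting the majority class gives ACC = 0.8 but BACC = 0.5 and MCC = 0, which is why accuracy alone can look good for an encoding whose BACC and MCC indicate chance-level discrimination.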
Figure 2. Correlation between the UniRep (1900 features) and the SeqVec (1024 features) protein sequence embeddings. Pearson’s linear correlation is calculated over 160 protein sequences. The two embeddings are uncorrelated.
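
A feature-by-feature Pearson correlation between two embeddings of the same proteins can be sketched as follows. This is an illustrative sketch under assumptions: the helper names `pearson` and `cross_correlation` are hypothetical, and the tiny matrices are made-up stand-ins for the 160 × 1900 (UniRep) and 160 × 1024 (SeqVec) matrices described in the caption.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant feature: correlation undefined, report 0
    return cov / (sd_x * sd_y)


def cross_correlation(emb_a, emb_b):
    """Correlate every feature of emb_a with every feature of emb_b.

    Both inputs are lists of rows (one row per protein); the result has
    shape (features in emb_a) x (features in emb_b).
    """
    cols_a = list(zip(*emb_a))
    cols_b = list(zip(*emb_b))
    return [[pearson(ca, cb) for cb in cols_b] for ca in cols_a]


# Made-up toy embeddings: 4 proteins, 2 and 3 features respectively.
toy_a = [[0.1, 1.0], [0.4, 0.8], [0.3, 0.2], [0.9, 0.5]]
toy_b = [[1.0, 0.0, 0.3], [0.2, 0.1, 0.9], [0.5, 0.7, 0.4], [0.6, 0.2, 0.8]]
toy_corr = cross_correlation(toy_a, toy_b)
print(len(toy_corr), "x", len(toy_corr[0]))
```

With the real embeddings, the resulting matrix would have 1900 rows and 1024 columns, one entry per pair of UniRep and SeqVec features.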
Figure 3. (A) In-Pero workflow: (1) Input as a protein FASTA sequence. (2) Sequence representation via DL encoding, in particular by concatenating UniRep and SeqVec. (3) Support Vector Machines-based classification. (4) Output of the sub-peroxisomal location of the queried protein. (B) Example of a typical execution, with 6 sequences contained in the candidates.fasta file: the sub-peroxisomal classification of each protein is given.
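
The four workflow steps can be sketched in a few lines. This is not the In-Pero code: `read_fasta`, `featurise`, `classify_fasta` and the two placeholder embedders are hypothetical names, the placeholder embedders return zero vectors instead of calling UniRep or SeqVec, and the classifier is passed in as a plain callable standing in for the trained SVM. Only the embedding sizes (1900 and 1024) are taken from the text.

```python
UNIREP_DIM = 1900  # UniRep embedding size stated in the Figure 2 caption
SEQVEC_DIM = 1024  # SeqVec embedding size stated in the Figure 2 caption


def read_fasta(text):
    """Step 1: parse FASTA text into {identifier: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else "seq%d" % (len(records) + 1)
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


def placeholder_unirep(sequence):
    """Stand-in for a real UniRep embedder (assumption: returns zeros)."""
    return [0.0] * UNIREP_DIM


def placeholder_seqvec(sequence):
    """Stand-in for a real SeqVec embedder (assumption: returns zeros)."""
    return [0.0] * SEQVEC_DIM


def featurise(sequence, embed_a=placeholder_unirep, embed_b=placeholder_seqvec):
    """Step 2: concatenate the two embeddings into one feature vector."""
    return list(embed_a(sequence)) + list(embed_b(sequence))


def classify_fasta(text, classifier):
    """Steps 3 and 4: apply a classifier callable to each sequence."""
    return {name: classifier(featurise(seq))
            for name, seq in read_fasta(text).items()}


demo_fasta = ">P1 demo protein\nMKT\nAAL\n>P2\nGGS\n"
print(len(featurise("MKT")))  # 1900 + 1024 features
```

In a real pipeline the placeholders would be replaced by the actual embedding models and `classifier` by the prediction method of a fitted model.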
Transmembrane and Membrane proteins found with both methods (In-Pero and TMHMM). The seven Transmembrane proteins showing high prediction scores with TMHMM are in bold.
| Membrane | Transmembrane | |
|---|---|---|
| O82399 | Q8H191 | |
| P36168 | Q8K459 | |
| P90551 | Q8VZF1 | |
| Q12524 | Q9LYT1 | |
| Q4WR83 | Q9NKW1 | |
| Q75LJ4 | Q9S9W2 | |
| Q9SKX5 | | |
Comparison with DeepMito and DeepPred-SubMito (DP-SM) based on the SM424-18 data set (Data A) and the SubMitoPred data set (Data B). The results are reported in terms of Matthews Correlation Coefficient (MCC). The four mitochondrial compartments are outer membrane (O), inner membrane (I), intermembrane space (T) and matrix (M). Given the similar performances of our predictor (In-Mito) implemented with Logistic Regression (LR) and Support Vector Machines (SVM), we report both. The best performances are highlighted in bold font.
| Data A | MCC(O) | MCC(I) | MCC(T) | MCC(M) |
|---|---|---|---|---|
| DeepMito | 0.460 | 0.470 | 0.530 | 0.650 |
| DP-SM | | 0.490 | | 0.560 |
| In-Mito (LR) | 0.680 | | 0.690 | |
| In-Mito (SVM) | 0.640 | 0.690 | 0.620 | 0.800 |
| Data B | MCC(O) | MCC(I) | MCC(T) | MCC(M) |
| SubMitoPred | 0.420 | 0.340 | 0.190 | 0.510 |
| DeepMito | 0.450 | 0.680 | 0.540 | 0.790 |
| DP-SM | | 0.690 | | 0.730 |
| In-Mito (LR) | 0.690 | 0.750 | 0.620 | |
| In-Mito (SVM) | 0.650 | | 0.540 | 0.840 |
Figure 4. Overview of the full analysis for the predictor pipeline development. Data curation: retrieval and selection of peroxisomal protein sequences (see Section 4.2). Feature extraction: conversion of protein sequences to standard encodings, namely: one-hot encoding (1HOT), residue physical-chemical properties encoding (PROP), position specific scoring matrix (PSSM), unified representation (UniRep), sequence-to-vector (SeqVec). Full comparison: application of classification algorithms (Section 4.6) and selection of the best combination(s) of sequence encodings and embeddings using step forward feature selection (see Section 4.5).
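
Step forward feature selection over encodings can be sketched as a greedy loop: start from an empty set, add the encoding that gives the best score, and repeat. The sketch below is an assumption-laden illustration, not the procedure of Section 4.5: the function name, the stopping rule (stop when no candidate improves the score), the generic encoding names and the lookup table of scores are all invented. In practice `score_fn` would run the double cross-validation for a given combination of encodings.

```python
def step_forward_selection(candidates, score_fn):
    """Greedy forward selection over a list of candidate encodings.

    score_fn takes a list of encoding names and returns a score
    (higher is better). Stops when adding any remaining candidate
    does not improve the best score (an assumed stopping rule).
    """
    selected = []
    remaining = list(candidates)
    best_score = float("-inf")
    history = []
    while remaining:
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top = max(scored, key=lambda item: item[0])
        if top_score <= best_score:
            break
        selected.append(top)
        remaining.remove(top)
        best_score = top_score
        history.append((tuple(selected), top_score))
    return selected, best_score, history


# Invented scores for three generic encodings (not results from the paper).
TOY_SCORES = {
    frozenset(["enc_a"]): 0.60,
    frozenset(["enc_b"]): 0.70,
    frozenset(["enc_c"]): 0.50,
    frozenset(["enc_a", "enc_b"]): 0.80,
    frozenset(["enc_a", "enc_c"]): 0.65,
    frozenset(["enc_b", "enc_c"]): 0.75,
    frozenset(["enc_a", "enc_b", "enc_c"]): 0.78,
}


def toy_score(subset):
    """Look up the invented score for a combination of encodings."""
    return TOY_SCORES[frozenset(subset)]


toy_selected, toy_best, toy_history = step_forward_selection(
    ["enc_a", "enc_b", "enc_c"], toy_score)
print(toy_selected, toy_best)
```

With these invented scores the loop picks `enc_b` first, then adds `enc_a`, and stops because adding `enc_c` would lower the score.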
Hyperparameters for the grid searches.
| Classifier | Hyperparameters |
|---|---|
| SVM | C: logspace(-2, 10, 13); gamma: logspace(-9, 3, 13); kernel: ['linear', 'poly', 'rbf', 'sigmoid'] |
| RF | n_estimators: [15, 25, 50, 75, 100, 200, 300]; criterion: ['gini', 'entropy']; max_depth: [2, 5, 10, None]; min_samples_split: [2, 4, 8, 10]; max_features: ['sqrt', 'auto', 'log2'] |
| PLS-DA | n_components: [2, 5, 10, 15, 20, 25, 30] |
| LR | penalty: ['l1', 'l2']; solver: ['liblinear', 'saga']; C: logspace(-3, 9, 13) |
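
The grids above can be written as plain dictionaries to see how many combinations each search covers. This is a sketch under assumptions: `logspace`, `grid_size` and `iter_grid` are hypothetical pure-Python helpers (the first mimics base-10 log spacing), and the dictionary layout is chosen to resemble the parameter-grid format accepted by grid-search utilities such as scikit-learn's `GridSearchCV`; the original code is not shown in the source.

```python
from itertools import product


def logspace(start, stop, num):
    """Return `num` values 10**e with e evenly spaced in [start, stop]."""
    step = (stop - start) / (num - 1)
    return [10.0 ** (start + i * step) for i in range(num)]


GRIDS = {
    "SVM": {
        "C": logspace(-2, 10, 13),
        "gamma": logspace(-9, 3, 13),
        "kernel": ["linear", "poly", "rbf", "sigmoid"],
    },
    "RF": {
        "n_estimators": [15, 25, 50, 75, 100, 200, 300],
        "criterion": ["gini", "entropy"],
        "max_depth": [2, 5, 10, None],
        "min_samples_split": [2, 4, 8, 10],
        # 'auto' is listed in the table; newer scikit-learn releases
        # may no longer accept it (assumption, check your version).
        "max_features": ["sqrt", "auto", "log2"],
    },
    "PLS-DA": {"n_components": [2, 5, 10, 15, 20, 25, 30]},
    "LR": {
        "penalty": ["l1", "l2"],
        "solver": ["liblinear", "saga"],
        "C": logspace(-3, 9, 13),
    },
}


def grid_size(grid):
    """Number of hyperparameter combinations in one grid."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


def iter_grid(grid):
    """Yield every combination of a grid as a {name: value} dict."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


for model, grid in GRIDS.items():
    print(model, grid_size(grid))
```

Counting combinations this way makes the cost of each search explicit: every combination is refitted in each inner fold of the double cross-validation, so the SVM and RF grids dominate the run time.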