Vineet Thumuluri, José Juan Almagro Armenteros, Alexander Rosenberg Johansen, Henrik Nielsen, Ole Winther.
Abstract
The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.
Year: 2022 PMID: 35489069 PMCID: PMC9252801 DOI: 10.1093/nar/gkac278
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
Figure 1. An example snippet from the results page on the webserver. The prediction summary, which consists of the predicted subcellular localizations and sorting signals, can be downloaded as a comma-separated values (CSV) file at the top of the page. The image or attention values of each plot can be downloaded separately. All predicted subcellular localization and sorting signal labels are listed, along with the prediction score table. The predicted localizations in the table are highlighted in green. If no score crosses the threshold, the label closest to the threshold is chosen. High values in the logo-like plot signify regions in the sequence that are important for localization prediction and may correspond to sorting signals. The plot is meant to serve as a guideline; specialized tools such as SignalP or TargetP can be used for a more detailed and accurate analysis of these signals.
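The decision rule described in the Figure 1 caption (report every label whose score crosses the threshold, otherwise fall back to the label closest to the threshold) can be illustrated with a minimal sketch. The function name `select_labels`, the use of one threshold per label, and the example scores are assumptions made for illustration; the webserver's actual thresholds and implementation are not given here.

```python
def select_labels(scores, thresholds):
    """Pick predicted labels from per-label scores.

    scores, thresholds: dicts mapping label name -> float.
    Both the per-label thresholds and the dict layout are illustrative
    assumptions, not the webserver's actual data format.
    """
    # Keep every label whose score reaches its threshold (multi-label output).
    chosen = [label for label, score in scores.items()
              if score >= thresholds[label]]
    if chosen:
        return chosen
    # Fallback: no score crossed its threshold, so return the single label
    # with the smallest shortfall (threshold minus score).
    closest = min(scores, key=lambda label: thresholds[label] - scores[label])
    return [closest]


if __name__ == "__main__":
    example_scores = {"Cytoplasm": 0.70, "Nucleus": 0.60, "Mitochondrion": 0.10}
    example_thresholds = {"Cytoplasm": 0.5, "Nucleus": 0.5, "Mitochondrion": 0.5}
    print(select_labels(example_scores, example_thresholds))
```

With this rule every protein receives at least one label, which matches the caption's statement that a label is chosen even when no score crosses the threshold.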
Figure 2. DeepLoc 2.0 uses a transformer-based protein language model to encode the input amino acid sequence. An interpretable attention pooling mechanism then produces a sequence representation. The two prediction heads use this representation to predict multiple labels for both the 10-type subcellular localization and 9-type sorting signal prediction tasks. Source of cell diagram: https://commons.wikimedia.org/wiki/File:Simple_diagram_of_plant_cell_(blank).svg, attribution: domdomegg, CC BY 4.0
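The attention pooling step in Figure 2 can be sketched generically: each residue embedding receives a scalar score, the scores are normalized with a softmax, and the sequence representation is the weighted sum of the embeddings. This is a minimal sketch of one common form of attention pooling, assuming a single learned scoring vector (`attention_vector`, a hypothetical name); the exact mechanism used in DeepLoc 2.0 is not specified in this text.

```python
import math


def attention_pool(residue_embeddings, attention_vector):
    """Pool per-residue embeddings into one sequence representation.

    residue_embeddings: list of equal-length float lists, one per residue.
    attention_vector: float list of the same length (assumed learned).
    Returns (pooled_vector, weights); weights has one entry per residue.
    """
    # One scalar score per residue: dot product with the scoring vector.
    scores = [sum(e * w for e, w in zip(emb, attention_vector))
              for emb in residue_embeddings]
    # Numerically stable softmax over sequence positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Weighted sum of the embeddings gives the pooled representation.
    dim = len(residue_embeddings[0])
    pooled = [sum(w * emb[d] for w, emb in zip(weights, residue_embeddings))
              for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    pooled, weights = attention_pool(toy_embeddings, [2.0, 0.0])
    print(pooled, weights)
```

The returned `weights` form a distribution over sequence positions, which is what makes this kind of pooling inspectable along the sequence.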
Table 1. Results on the SwissProt CV dataset
| | Counts | DeepLoc 1.0 β | YLoc+ α | DeepLoc 2.0 (ESM1b) | DeepLoc 2.0 (ProtT5) |
|---|---|---|---|---|---|
| Type | | Single | Multi | Multi | Multi |
| Pred. Num. Labels (Actual: 1.27) | | 1.00 ± 0.00 | 1.57 ± 0.02 | 1.27 ± 0.02 | 1.26 ± 0.02 |
| Accuracy | 28303 | 0.48 ± 0.01 | 0.32 ± 0.02 | 0.53 ± 0.02 | |
| Jaccard | 28303 | 0.56 ± 0.01 | 0.50 ± 0.01 | 0.68 ± 0.01 | |
| MicroF1 | 28303 | 0.58 ± 0.02 | 0.56 ± 0.01 | 0.72 ± 0.01 | |
| MacroF1 | 28303 | 0.47 ± 0.01 | 0.42 ± 0.01 | 0.64 ± 0.01 | |
| MCC per location (↑ is better) | | | | | |
| Cytoplasm | 9870 | 0.45 ± 0.02 | 0.38 ± 0.02 | 0.61 ± 0.01 | |
| Nucleus | 9720 | 0.46 ± 0.02 | 0.42 ± 0.02 | 0.66 ± 0.02 | |
| Extracellular | 3301 | 0.78 ± 0.05 | 0.61 ± 0.05 | | |
| Cell membrane | 4187 | 0.53 ± 0.02 | 0.44 ± 0.02 | 0.64 ± 0.01 | |
| Mitochondrion | 2590 | 0.58 ± 0.04 | 0.47 ± 0.02 | 0.73 ± 0.03 | |
| Plastid | 1047 | 0.69 ± 0.04 | 0.72 ± 0.02 | 0.88 ± 0.01 | |
| Endoplasmic reticulum | 2180 | 0.32 ± 0.04 | 0.17 ± 0.04 | 0.52 ± 0.01 | |
| Lysosome/Vacuole | 1496 | 0.06 ± 0.05 | 0.07 ± 0.03 | 0.24 ± 0.03 | |
| Golgi apparatus | 1279 | 0.20 ± 0.04 | 0.11 ± 0.04 | | 0.34 ± 0.05 |
| Peroxisome | 304 | 0.15 ± 0.04 | 0.05 ± 0.02 | 0.48 ± 0.05 | |
Bold values indicate the best score
α = GO-terms were not used
β = Retrained on this dataset
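The metrics reported in the table above can be computed from sets of true and predicted labels. The following is a minimal sketch using standard definitions, assuming that "Accuracy" means exact-match (subset) accuracy and that "Jaccard" is averaged per protein; these readings, the function names, and the toy data are assumptions and are not confirmed by the text.

```python
import math


def subset_accuracy(y_true, y_pred):
    """Fraction of proteins whose predicted label set matches exactly."""
    return sum(set(t) == set(p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_jaccard(y_true, y_pred):
    """Per-protein |intersection| / |union|, averaged over proteins."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        t, p = set(t), set(p)
        union = t | p
        total += len(t & p) / len(union) if union else 1.0
    return total / len(y_true)


def confusion_counts(y_true, y_pred, label):
    """Binary confusion counts (tp, fp, fn, tn) for a single label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        in_true, in_pred = label in t, label in p
        if in_true and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """Micro F1 pools counts over labels; macro F1 averages per-label F1."""
    counts = [confusion_counts(y_true, y_pred, label) for label in labels]
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    macro = sum(f1(c[0], c[1], c[2]) for c in counts) / len(labels)
    return micro, macro


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for one label (0.0 if undefined)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data, purely illustrative.
    truth = [{"Cytoplasm", "Nucleus"}, {"Nucleus"}, {"Mitochondrion"}]
    preds = [{"Cytoplasm"}, {"Nucleus"}, {"Mitochondrion", "Nucleus"}]
    names = ["Cytoplasm", "Nucleus", "Mitochondrion"]
    print(subset_accuracy(truth, preds), mean_jaccard(truth, preds))
    print(micro_macro_f1(truth, preds, names))
    print(mcc(*confusion_counts(truth, preds, "Nucleus")))
```

Macro F1 weights every location equally, so rare classes such as Peroxisome influence it as much as Cytoplasm, whereas micro F1 is dominated by the frequent locations.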
Table 2. Results on the HPA independent test set
| | Count | YLoc+ (Animal) α | DeepLoc 1.0 β | Fuel-mLoc (Euk) γ, θ | LAProtT5 | DeepLoc 2.0 (ESM1b) | DeepLoc 2.0 (ProtT5) |
|---|---|---|---|---|---|---|---|
| Type | | Multi | Single | Multi | Single | Multi | Multi |
| Pred. Num. Labels (Actual: 1.22) | | 1.44 | 0.89 | 1.00 | 0.94 | 1.15 | 1.21 |
| Accuracy | 1717 | 0.23 | 0.37 | 0.38 | | 0.34 | 0.39 |
| Jaccard | 1717 | 0.41 | 0.42 | 0.46 | 0.52 | 0.48 | |
| MicroF1 | 1717 | 0.51 | 0.46 | 0.52 | 0.56 | 0.57 | |
| MacroF1 | 1717 | 0.34 | 0.35 | 0.39 | 0.43 | 0.44 | |
| MCC per location (↑ is better) | | | | | | | |
| Cytoplasm | 562 | 0.14 | 0.23 | 0.23 | 0.33 | 0.29 | |
| Nucleus | 893 | 0.20 | 0.28 | 0.41 | | 0.41 | 0.44 |
| Cell membrane | 287 | 0.20 | 0.23 | 0.32 | 0.30 | 0.34 | |
| Mitochondrion | 196 | 0.37 | 0.39 | 0.33 | 0.59 | | 0.56 |
| Endoplasmic reticulum | 77 | 0.12 | | 0.14 | 0.22 | 0.20 | 0.17 |
| Golgi apparatus | 86 | 0.08 | 0.10 | 0.24 | 0.26 | 0.17 | |
Bold values indicate the best score
α = GO-terms were not used
β = Retrained on the new CV dataset
γ = using local implementation
θ = using reduced ProSeq database
Table 3. Results of signal type prediction; cross-validation
| | DeepLoc 2.0 (ESM1b) | DeepLoc 2.0 (ProtT5) | Specialized predictor |
|---|---|---|---|
| MicroF1 | 0.87 ± 0.01 | 0.87 ± 0.02 | |
| MacroF1 | 0.80 ± 0.02 | 0.80 ± 0.03 | |
| Accuracy | 0.78 ± 0.02 | 0.79 ± 0.03 | |
| MCC per signal (↑ is better) | | | |
| SP | 0.89 ± 0.03 | 0.90 ± 0.03 | 0.87 ± 0.02 |
| TM | 0.71 ± 0.07 | 0.66 ± 0.05 | - |
| MT | 0.93 ± 0.02 | 0.93 ± 0.03 | 0.94 ± 0.04 |
| CH | 0.85 ± 0.07 | 0.86 ± 0.09 | 0.96 ± 0.03 |
| TH | 0.86 ± 0.08 | 0.80 ± 0.08 | 0.98 ± 0.04 |
| NLS | 0.65 ± 0.06 | 0.66 ± 0.01 | - |
| NES | 0.49 ± 0.20 | 0.46 ± 0.17 | - |
| PTS | 0.85 ± 0.06 | 0.90 ± 0.05 | - |
| GPI | 0.85 ± 0.06 | 0.86 ± 0.06 | 0.91 ± 0.01 |
SP = Signal peptide, TM = First transmembrane domain, MT = Mitochondrial transit peptide, CH = Chloroplast transit peptide, TH = Thylakoidal transit peptide, NLS = Nuclear localization signal, NES = Nuclear export signal, PTS = Peroxisomal targeting signal, GPI = GPI-anchor
Table 4. Quantitative comparison of interpretable attention; cross-validation
| | DeepLoc 1.0 β | DeepLoc 2.0 (ESM1b) | DeepLoc 2.0 (ProtT5) |
|---|---|---|---|
| KL Div (↓ is better) | | | |
| SP | 1.31 ± 0.57 | 1.04 ± 0.91 | 0.99 ± 0.86 |
| TM | 1.99 ± 0.81 | 1.13 ± 1.14 | 1.12 ± 1.03 |
| MT | 0.92 ± 0.38 | 0.51 ± 0.54 | 0.50 ± 0.48 |
| CH | 0.74 ± 0.33 | 0.32 ± 0.52 | 0.31 ± 0.31 |
| TH | 0.90 ± 0.31 | 0.19 ± 0.29 | 0.24 ± 0.16 |
| NLS | 3.11 ± 1.02 | 2.63 ± 1.52 | 2.60 ± 1.32 |
| NES | 3.97 ± 1.22 | 4.04 ± 1.51 | 3.88 ± 1.44 |
| PTS | 4.90 ± 0.93 | 0.85 ± 1.29 | 0.72 ± 1.05 |
| GPI | 2.30 ± 0.79 | 1.59 ± 0.73 | 1.85 ± 0.47 |
β = Retrained on the new CV dataset
Abbreviations same as in Table 3
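A KL divergence between an attention profile and an annotated sorting signal can be computed as in the minimal sketch below. It assumes the reference distribution is uniform over the annotated signal positions and that the divergence is taken from the signal distribution to the normalized attention; both choices, along with the function name `signal_attention_kl` and its half-open `[signal_start, signal_end)` indexing, are assumptions, since the text does not state how the reported values were derived.

```python
import math


def signal_attention_kl(attention, signal_start, signal_end, eps=1e-12):
    """KL(signal || attention) over sequence positions.

    attention: non-negative attention values, one per residue.
    signal_start, signal_end: half-open range [start, end) of the annotated
        sorting signal (0-based); this indexing is an illustrative choice.
    The reference distribution is assumed uniform over the signal region.
    """
    length = signal_end - signal_start
    if length <= 0:
        raise ValueError("signal region must contain at least one position")
    total = sum(attention)
    p = 1.0 / length  # uniform mass on each annotated position
    kl = 0.0
    for i in range(signal_start, signal_end):
        q = attention[i] / total  # normalized attention at position i
        # Clamp q so zero attention gives a large finite penalty, not an error.
        kl += p * math.log(p / max(q, eps))
    return kl


if __name__ == "__main__":
    # Toy profile: most attention falls on the first three residues.
    profile = [0.3, 0.3, 0.3, 0.05, 0.05]
    print(signal_attention_kl(profile, 0, 3))
```

Under these assumptions the value is zero when all attention falls evenly on the annotated signal and grows as attention drifts away from it, which is consistent with lower values being better.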