Yuexu Jiang, Duolin Wang, Weiwei Wang, Dong Xu.
Abstract
The accurate annotation of protein localization is crucial both for understanding protein function and for a broad range of applications such as pathological analysis and drug design. Since most proteins do not have experimentally determined localization information, the computational prediction of protein localization has been an active research area for more than two decades. In particular, recent machine-learning advancements have fueled the development of new methods in protein localization prediction. In this review paper, we first categorize the main features and algorithms used for protein localization prediction. Then, we summarize a list of protein localization prediction tools in terms of their coverage, characteristics, and accessibility to help users find suitable tools based on their needs. Next, we evaluate some of these tools on a benchmark dataset. Finally, we provide an outlook on the future exploration of protein localization methods.
Keywords: Computational methods; Protein localization prediction; Review
Year: 2021; PMID: 34765098; PMCID: PMC8564054; DOI: 10.1016/j.csbj.2021.10.023
Source DB: PubMed; Journal: Comput Struct Biotechnol J; ISSN: 2001-0370; Impact factor: 7.271
Fig. 1. Relationships among the data, features, models, and prediction outputs in the computational prediction of protein localization. Sequence data can be converted into different features before being fed to a classifier model. Some classification models take raw data (e.g., one-hot encoding of protein sequences for deep learning) as input, while others use engineered features. Localization prediction (at the subcellular and/or suborganellar level) is the most common output. Some methods also provide by-product predictions such as target peptides, signal peptide cleavage sites, and mechanism interpretability at amino-acid-level resolution (AAI). Homology-based methods are special in that they can make predictions directly from homology-based features, such as the GO terms of homologous proteins.
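The one-hot encoding mentioned in the caption can be sketched as follows. This is a minimal illustration, not the preprocessing of any specific tool in this review: the residue ordering in `AMINO_ACIDS`, the function name `one_hot_encode`, and the choice to encode non-standard residues as all-zero rows are assumptions made for the example.

```python
# Hypothetical, fixed ordering of the 20 standard amino acids (an assumption;
# real tools may use a different order or extra symbols).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element row per residue, with a single 1 at the residue's index.

    Residues outside the 20 standard letters (e.g. X) become all-zero rows.
    """
    rows = []
    for aa in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(aa)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("MKTAX")
print(len(encoded), len(encoded[0]))   # 5 20
print([sum(row) for row in encoded])   # [1, 1, 1, 1, 0]
```

A deep-learning classifier would typically consume the resulting length-by-20 matrix directly, whereas feature-engineered methods would replace this step with descriptors such as PSSM profiles.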
Summary of protein localization prediction tools.
| Tool | Cov_lv1 | Cov_lv2 | Species kingdom | Algorithm | Metrics | Year | Web server | Standalone |
|---|---|---|---|---|---|---|---|---|
| BUSCA | 1–4,7,11–14 | | Eu,Pro | Integrated method | F1, MCC | 2018 | | |
| CELLO2GO | 1–6,8–11,15 | | Eu,Pro,V | SVM and homology search | Acc | 2014 | | |
| MULocDeep | 1–10 | 1–10 | Eu | LSTM + attention | Acc, MCC, Rec, Prec, ROC_AUC, P&R_AUC | 2021 | √ | |
| DeepLoc | 1–10 | | Eu | CNN + LSTM + attention | Acc, MCC, Gorodkin measure | 2017 | | |
| TargetP 2.0 | SP,4,7 | | Eu,Pro | LSTM + attention | Prec, Rec, F1, MCC | 2019 | | |
| MU-LOC | 4 | | P | SVM and neural network | Acc, Prec, F1, MCC | 2018 | √ | |
| LocTree3 | 1–4,6–11 | | Eu,Pro | SVM and homology search | Acc, Std | 2014 | | |
| MitoFates | 4 | | Eu | SVM | Prec, Rec, MCC, ROC_AUC | 2015 | √ | |
| LOCALIZER | 1,4,7 | | P | SVM | SN, SP, PPV, MCC, Acc | 2017 | √ | |
| SignalP 5.0 | SP | | Eu,Pro | CNN, bidirectional LSTM, and CRF | MCC, Rec, Prec | 2019 | √ | |
| DeepSig | SP | | Eu,Bac | CNN and CRF | MCC, FPR, F1 | 2018 | √ | |
| PSORTb 3.0 | 2,3,14–16 | | Bac | SVM and homology search | Prec, Rec, Acc, MCC | 2010 | √ | |
| WoLF PSORT | 1–4,7,11 | | Eu | k-NN classifier | Acc | 2007 | | |
| SubCons | 1–4,6,8–11 | | Hum | Integrated method | F1, MCC | 2017 | | |
| TPpred 3.0 | 4,7 | | Eu | Integrated method | MCC, Prec, Rec | 2015 | √ | |
| MultiLoc2 | 1–4,6–11 | | Eu | SVM | SN, SP, MCC | 2009 | √ | |
| YLoc | 1–4,6–11 | | Eu | Naïve Bayes and entropy-based discretization | F1, Acc | 2010 | √ | |
| SCLpred-EMS | SP | | Eu | Neural network | SP, SN, FPR, MCC | 2020 | | |
| ERPred | 6 | | Eu | SVM | Acc, SN, SP, MCC | 2017 | √ | |
| SeqVec | 1–10 | | Eu | Language Model + FNN | Acc, MCC, FPR | 2019 | √ | |
| ProtTrans | 1–10 | | Eu | Language Model + FNN | Acc | 2020 | √ | |
| LA | 1–10 | | Eu | Language Model + attention | Acc | 2021 | √ | |
| DeepMito | 4 | | Eu | CNN | MCC, GCC | 2019 | √ | |
| SubGolgi v2 | 8 | 8 | Eu | SVM | SN, Acc, MCC | 2013 | | |
| TetraMito | 4 | | Eu | SVM | SN, Acc, MCC | 2013 | | |
| Schloro | 7 | | P | SVM | Acc, Rec, Prec, F1, ROC_AUC, MCC | 2017 | √ | |
| SubMitoPred | 4 | 4 | Eu | SVM | Acc | 2017 | √ | |
| SubNucPred | 1 | | Eu | SVM | Acc, SN, SP, MCC | 2014 | √ | |
The localization coverage codes are: 1. nucleus; 2. cytoplasm; 3. extracellular; 4. mitochondrion; 5. cell membrane; 6. endoplasmic reticulum; 7. plastid/chloroplast; 8. Golgi apparatus; 9. lysosome/vacuole; 10. peroxisome; 11. plasma membrane; 12. organelle membrane; 13. endomembrane system; 14. outer membrane; 15. periplasmic; 16. cell wall; SP. secretory pathway.
Cov_lv1 lists the subcellular localizations covered by a tool, and Cov_lv2 lists the organelles for which the tool also provides suborganellar localization predictions.
The species kingdom codes are: Eu (Eukaryota, including animal, plant, and fungi); Pro (Prokaryota, including Bacteria and Archaea); V (Virus); P (Plant); Bac (Bacteria); Hum (Human).
The metrics codes are: MCC (Matthews correlation coefficient), Acc (accuracy), SN (sensitivity), SP (specificity), Prec (precision), Rec (recall), ROC_AUC (area under the receiver operating characteristic curve), P&R_AUC (area under the precision-recall curve), GCC (generalized correlation coefficient), PPV (positive predictive value), FPR (false positive rate).
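For readers comparing tools across the Metrics column, the threshold-based codes above can all be derived from the four counts of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `binary_metrics`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions of this illustration, not taken from any listed tool. F1, which appears in the table but not in the code list, is included with its usual definition.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumed convention for this sketch: undefined ratios are reported as 0.0.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # SN, identical to recall (Rec)
    specificity = ratio(tn, tn + fp)   # SP
    precision = ratio(tp, tp + fp)     # Prec, identical to PPV
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": ratio(tp + tn, tp + fp + tn + fn),
        "SN": sensitivity,
        "Rec": sensitivity,
        "SP": specificity,
        "Prec": precision,
        "PPV": precision,
        "FPR": ratio(fp, fp + tn),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up counts, purely for illustration.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because SN and Rec (and likewise Prec and PPV) are the same quantity under different names, tools that report different codes in the table may still be directly comparable on those measures.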
Fig. 2. Evaluations of protein localization methods/tools. The criterion is the overall prediction accuracy for 10 main localizations. DeepLoc_PSSM and DeepLoc_BLOSUM are DeepLoc methods with PSSM and BLOSUM62 embedding, respectively. ProtT5_MLP and ProtBert_MLP are the simple feed-forward neural networks of the ProtTrans method, using pre-trained embeddings from T5 and BERT, respectively. ProtT5_LA and ProtBert_LA use the same two pre-trained models, each followed by an attention-based neural network.
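The evaluation criterion in the caption, overall accuracy across localization classes, can be sketched as the fraction of proteins whose predicted class matches the annotated one. The function name `overall_accuracy` and the toy labels below are hypothetical and are not drawn from the benchmark used in the figure.

```python
def overall_accuracy(true_labels, predicted_labels):
    """Fraction of proteins whose predicted localization equals the annotated one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("no proteins to evaluate")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy annotations and predictions, invented for illustration only.
y_true = ["nucleus", "cytoplasm", "mitochondrion", "peroxisome", "nucleus"]
y_pred = ["nucleus", "nucleus", "mitochondrion", "peroxisome", "cytoplasm"]
print(overall_accuracy(y_true, y_pred))  # 0.6
```

Overall accuracy weights every protein equally, so on class-imbalanced benchmarks it is usually read alongside a measure such as MCC.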