Felix Teufel, José Juan Almagro Armenteros, Alexander Rosenberg Johansen, Magnús Halldór Gíslason, Silas Irby Pihl, Konstantinos D Tsirigos, Ole Winther, Søren Brunak, Gunnar von Heijne, Henrik Nielsen.
Abstract
Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.
Year: 2022 PMID: 34980915 PMCID: PMC9287161 DOI: 10.1038/s41587-021-01156-3
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 68.164
Fig. 1: Modeling SP structure using protein LMs.
a, Region structures of the five SP types. Twin arginine (RR)-translocated SPs feature a twin-arginine motif, while SPs cleaved by SPase II feature a C-terminal lipobox. Sec/SPIII SPs have no substructure. b, Protein LM training procedure. BERT learns protein features by predicting masked amino acids in sequences from UniRef100. c, t-Distributed stochastic neighbor embedding (t-SNE) projection of protein representations before prediction training. Different SP types form distinct clusters, separated from sequences without SPs. d, SignalP 6.0 architecture. An amino acid sequence is passed through the LM, and the resulting representation serves as input for the CRF, which predicts region probabilities at each position and the SP type. CS, cleavage site.
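The masked-residue objective described in panel b can be illustrated with a minimal sketch. It only shows how a training example might be built: some positions in a protein sequence are hidden, and the model is asked to recover the original amino acids at those positions. The function name `mask_sequence`, the mask symbol `#`, and the 15% masking fraction are assumptions chosen for illustration; the caption does not specify them, and this is not the authors' training code.

```python
import random

MASK = "#"  # hypothetical mask symbol, not taken from the paper


def mask_sequence(seq, mask_fraction=0.15, seed=0):
    """Hide a fraction of residues and return (masked_seq, targets).

    targets maps each masked position to the original amino acid,
    which is what a masked language model would be trained to predict.
    The 15% default is an assumption for this sketch.
    """
    if not seq:
        return "", {}
    rng = random.Random(seed)
    # Mask at least one residue so every example carries a training signal.
    n_mask = max(1, round(len(seq) * mask_fraction))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = seq[pos]
        masked[pos] = MASK
    return "".join(masked), targets


if __name__ == "__main__":
    # Made-up example sequence, used only to exercise the function.
    example = "MKKTAIAIAVALAGFATVAQAAPKD"
    masked, targets = mask_sequence(example)
    print(masked)
    print(targets)
```

Restoring the residues in `targets` at their positions reproduces the input sequence, so the pair fully describes one prediction task.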
Fig. 2: SignalP 6.0 shows strong performance on all types and organism groups.
a, SP detection performance (ARC, Archaea; EUK, Eukarya; NEG, Gram-negative bacteria; POS, Gram-positive bacteria). SignalP 6.0 substantially improves performance on underrepresented types. b, CS prediction performance. SignalP 6.0 has improved precision for all categories. c, Dependence of performance on identity to sequences in the training data. At sequence identities lower than 60%, SignalP 6.0 outperforms SignalP 5.0.
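As a rough illustration of what CS precision in panel b measures, the sketch below scores predicted cleavage-site positions against annotated ones. Precision is taken as the fraction of predicted sites that are correct, and recall as the fraction of annotated sites that are recovered. The function name `cs_precision_recall`, the use of `None` for "no cleavage site", and the optional `tolerance` window are assumptions made for this example; the caption does not state the exact evaluation protocol.

```python
def cs_precision_recall(true_cs, pred_cs, tolerance=0):
    """Compute (precision, recall) for cleavage-site (CS) predictions.

    true_cs and pred_cs hold one entry per sequence: an integer CS
    position, or None when there is no (predicted) cleavage site.
    A prediction counts as correct when it lies within `tolerance`
    residues of the annotated position (an assumption of this sketch).
    """
    correct = n_pred = n_true = 0
    for true_pos, pred_pos in zip(true_cs, pred_cs):
        if pred_pos is not None:
            n_pred += 1
        if true_pos is not None:
            n_true += 1
        if (
            true_pos is not None
            and pred_pos is not None
            and abs(true_pos - pred_pos) <= tolerance
        ):
            correct += 1
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_true if n_true else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up positions, used only to exercise the function.
    annotated = [20, 25, None, 30]
    predicted = [20, 27, 18, None]
    print(cs_precision_recall(annotated, predicted))
    print(cs_precision_recall(annotated, predicted, tolerance=2))
```

Widening `tolerance` can only keep or raise both scores, since every prediction that is correct under a strict window stays correct under a looser one.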