
Computational prediction of secreted proteins in gram-negative bacteria.

Xinjie Hui¹, Zewei Chen¹, Junya Zhang¹, Moyang Lu², Xuxia Cai¹, Yuping Deng¹, Yueming Hu¹, Yejun Wang^1,3.

Abstract

Gram-negative bacteria harness multiple protein secretion systems and secrete a large proportion of the proteome. Proteins can be exported to the periplasmic space, integrated into a membrane, transported into the extracellular milieu, or translocated into the cytoplasm of contacting cells. Accurate, genome-wide annotation of the secreted proteins and their secretion pathways is therefore important. In this review, we systematically classify the secreted proteins according to the types of secretion systems in Gram-negative bacteria, summarize the known features of these proteins, and review the algorithms and tools for their prediction.


Keywords: Gram-negative bacteria; Prediction; Protein secretion system; Secreted protein; Transmembrane protein

Year: 2021 PMID: 33897982 PMCID: PMC8047123 DOI: 10.1016/j.csbj.2021.03.019

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Gram-negative (or diderm) bacteria contain two phospholipid membranes. The outer membrane encloses individual cells and separates them from the extracellular environment, while the inner membrane separates the bacterial cytoplasm from the periplasm, the space between the two cell membranes. Bacterial cells also have some constitutive protrusions inserted in or attached to the cell surface. More than one third of bacterial proteins are translocated out of the cytoplasm where they are synthesized [1]. According to the destination of the substrate proteins, the translocation process can be classified into three major types: exportation, secretion and membrane-retention [2]. Exportation involves only active passage through the inner membrane, secretion means crossing the outer membrane or both cell membranes completely, and membrane-retention refers specifically to a trans-membrane process after which the substrate protein is inserted in the membrane. Therefore, by a strict definition, a secreted protein should have gone through an active translocation process from the cytoplasm to the extracellular environment. In a broad sense, however, proteins undergoing any of the translocation processes described above are called secreted proteins. There are also proteins called ‘effectors’, which specifically refer to those translocated from the bacterial cytoplasm directly into other cells (eukaryotic or other bacterial cells) via a transmembrane device that contacts the other cell at its distal pole. In this review, we use the broad definition, and secreted proteins include the strictly secreted proteins, transmembrane proteins, surface-associated proteins or subunits of surface appendages, periplasmic proteins and translocated effectors (Fig. 1).

Fig. 1

Subcellular localization of Gram-negative bacterial proteins. The dashed arrows show the translocation processes of the proteins.

Bacteria employ multiple means to secrete proteins (Fig. 2; Table 1). The mechanisms of protein secretion in Gram-negative bacteria can be summarized in three categories: (1) one-step, two-membrane spanning secretion, (2) two-step, two-membrane spanning secretion, and (3) inner membrane spanning export. Accordingly, the protein secretion systems are divided into three major types: two-membrane spanning secretion systems, inner membrane spanning exporters and outer membrane spanning secretion systems. Based on the destination of the secreted proteins, the two-membrane spanning secretion systems are further classified into two sub-classes, trans-membrane secretion systems and trans-membrane translocation systems. Trans-membrane secretion systems only secrete substrate proteins outside the bacterial cell and include the well-known Type I secretion systems (T1SSs), T2SSs and T9SSs (Bacteroidetes PorSS), while trans-membrane translocation systems deliver bacterial substrate proteins into contacting cells. T3SSs, T4SSs and T6SSs are all trans-membrane translocation systems. Inner membrane spanning exporters transport proteins through the Sec or Tat pathway. Outer membrane spanning secretion systems are exemplified by T5SSs, T7SSs (Chaperone-Usher pilus secretion), T8SSs (curli secretion), etc.

Fig. 2

Table 1

Overview of protein secretion systems and the substrate features in Gram-negative bacteria.

Secretion system	Secretion step(s)	Membrane spanning	Secretion signal	Substrate state
Sec	1	Inner	N-terminus	Unfolded
Tat	1	Inner	N-terminus	Folded
T1SS	1	Inner + Outer	C-terminus	Unfolded
T2SS	2 (Sec/Tat)	Inner + Outer	N-terminus	Folded
T3SS¹	1 or 2 (Sec)	Inner + Outer (+Host)	N-terminus	Unfolded
T4SS²	1	Inner + Outer (+Host)	C-terminus	Unfolded
T6SS	1	Inner + Outer + Host	N-terminus?	Folded
T5SS	2 (Sec)	Outer	N-terminus	Unfolded
Pili/ T7SS	2 (Sec)	Outer	N-terminus	Folded
Curli/ T8SS	2 (Sec)	Outer	N-terminus	Unfolded
T9SS	2 (Sec)	Inner + Outer	C-terminus	Folded

Notes: 1 T3SSs include non-flagella T3SSs and flagella T3SSs. Non-flagella T3SSs are translocation systems delivering substrates into host cells in one step, while flagella T3SSs involve two steps to secrete substrates extracellularly. 2 T4SSs translocate substrate proteins into host cells like T3SSs, or transport the proteins into extracellular milieu.
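
For genome annotation scripts it can be convenient to hold Table 1 as a small lookup structure. The following is only a sketch: the values are copied from Table 1, while the names `SecretionSystem`, `TABLE1` and `systems_with` are hypothetical and chosen for illustration. The uncertain T6SS signal is kept as "N-terminus?" so that it is not matched by a query for a confirmed N-terminal signal.

```python
from typing import NamedTuple, Optional


class SecretionSystem(NamedTuple):
    steps: str            # secretion step(s), as written in Table 1
    membranes: str        # membrane(s) spanned
    signal: str           # location of the secretion signal
    substrate_state: str  # folding state of the substrate during transport


# Rows transcribed from Table 1 (footnote markers dropped).
TABLE1 = {
    "Sec": SecretionSystem("1", "Inner", "N-terminus", "Unfolded"),
    "Tat": SecretionSystem("1", "Inner", "N-terminus", "Folded"),
    "T1SS": SecretionSystem("1", "Inner + Outer", "C-terminus", "Unfolded"),
    "T2SS": SecretionSystem("2 (Sec/Tat)", "Inner + Outer", "N-terminus", "Folded"),
    "T3SS": SecretionSystem("1 or 2 (Sec)", "Inner + Outer (+Host)", "N-terminus", "Unfolded"),
    "T4SS": SecretionSystem("1", "Inner + Outer (+Host)", "C-terminus", "Unfolded"),
    "T6SS": SecretionSystem("1", "Inner + Outer + Host", "N-terminus?", "Folded"),
    "T5SS": SecretionSystem("2 (Sec)", "Outer", "N-terminus", "Unfolded"),
    "Pili/T7SS": SecretionSystem("2 (Sec)", "Outer", "N-terminus", "Folded"),
    "Curli/T8SS": SecretionSystem("2 (Sec)", "Outer", "N-terminus", "Unfolded"),
    "T9SS": SecretionSystem("2 (Sec)", "Inner + Outer", "C-terminus", "Folded"),
}


def systems_with(signal: Optional[str] = None, state: Optional[str] = None) -> list:
    """Return system names whose signal location and substrate state match exactly."""
    return [
        name
        for name, row in TABLE1.items()
        if (signal is None or row.signal == signal)
        and (state is None or row.substrate_state == state)
    ]


if __name__ == "__main__":
    print(systems_with(signal="C-terminus"))
    print(systems_with(signal="C-terminus", state="Unfolded"))
```

A query such as `systems_with(signal="C-terminus")` returns the systems for which Table 1 lists a C-terminal signal, which is a useful first filter before choosing a system-specific predictor.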

Secreted proteins and their transport pathways. The secretion machines are multi-protein complexes with different protein components. The protein transport processes are also indicated: the Sec and Tat pathways export proteins from the bacterial cytoplasm to the periplasm or inner membrane; the Lol pathway transports proteins from the periplasmic side of the inner membrane to the periplasmic side of the outer membrane; the Bam and Tam systems insert periplasmic proteins into the outer membrane; T1SSs transport proteins from the bacterial cytoplasm to the extracellular space; T2SSs and T9SSs transport periplasmic proteins to the extracellular space; and T3SSs, T4SSs and T6SSs translocate proteins from the bacterial cytoplasm directly into the host cell cytoplasm. T5SSs are autotransporters that transport themselves extracellularly. The pili and curli proteins are transported across the bacterial outer membrane through T7SSs and T8SSs, respectively. The protein names or component types are shown for each secretion system. OMF, Outer Membrane Factor; MFP, Membrane Fusion Protein; IMC, Inner Membrane Component; SRP, Signal Recognition Particle.

Effective recognition of the proteins secreted through different systems is important, as it facilitates the annotation of bacterial genomes, the exploration of mechanisms underlying bacterial life processes, and the prevention and control of bacterial infections and associated diseases. In recent decades, bioinformatic algorithms and methods have been introduced into the field and have developed explosively, promoting the identification of proteins secreted through different systems in a large variety of bacteria. This review summarizes current knowledge of the features of different secreted proteins, and the main progress of bioinformatic applications in predicting proteins secreted by different mechanisms in Gram-negative bacteria. At present, about a dozen protein secretion systems have been reported in Gram-negative bacteria, including two inner membrane spanning export systems (Sec and Tat pathways), type I-IX secretion systems (T1SSs ~ T9SSs), and newer ones. The Sec and Tat pathways are the main mechanisms mediating protein transport from the cytoplasm to the periplasm [3]. The periplasmic proteins can be further transported to the extracellular matrix by certain secretion systems such as T2SSs, T5SSs and T9SSs, be secreted onto the bacterial surface, such as pili secreted by T7SSs and curli secreted by T8SSs, or stay in the periplasmic space. T1SSs, T3SSs, T4SSs and T6SSs represent the major one-step two-membrane secretion systems. We first review the exporters and the tools predicting proteins exported across the inner membrane, followed by the secretion systems spanning the outer membrane or both membranes and the corresponding substrate prediction methods. Transmembrane protein prediction algorithms are also summarized.

Inner membrane spanning exporters

In Gram-negative bacteria, there are two classical (Sec and Tat) and some non-classical inner membrane-spanning protein-exporting pathways.

The general secretion (Sec) pathway

Brief summary

The Sec pathway is a universal protein export mechanism employed by Archaebacteria, Eubacteria and Eukaryota [3]. The Sec system is built around a central component, SecYEG, which forms a protein-conducting channel and mediates the translocation of unfolded proteins into or across the plasma membrane. In Gram-negative bacteria, most periplasmic proteins and inner membrane proteins (IMPs) are exported through the SecYEG translocon, vectorially or laterally [3], [4]. IMPs and periplasmic proteins use different targeting mechanisms, i.e., the co-translational and post-translational modes, respectively [1]. For IMPs, export involves a co-translational targeting process mediated by both the signal recognition particle (SRP) and its membrane receptor FtsY. SRP binds to the N-terminal transmembrane helix (TMH) of the exported protein, forms a ribosome-nascent chain-SRP-FtsY complex, and targets the protein to the SecYEG channel. SecYEG can mediate export and insertion of the targeted protein into the inner membrane independently or in cooperation with the membrane protein insertase YidC [1], [4]. For proteins translocated into the periplasm, a post-translational mode of export is adopted, in which an essential ATPase motor, SecA, recognizes the exported proteins with high affinity and powers the transmembrane export. Other proteins can also participate in sorting, targeting and translocation, e.g., the chaperones aiding pre-protein targeting (trigger factor or SecB), the auxiliary components enhancing translocation efficiency (SecDF–YajC), etc. [1], [3]. For most proteins the two export modes are mutually exclusive, but not absolutely so. Some IMPs, e.g. RodZ, were found to use the co-translational mode but to be targeted by SecA rather than SRP [5], [6], [7].

Molecular features of the proteins secreted through Sec pathway

The N-terminal signal peptides (SPs) of proteins exported via the Sec pathway are important for targeting, show some atypical sequence patterns, and have been exploited for prediction of this type of secreted protein [1]. A typical SP comprises 5–30 amino acids, which can be divided into 3 parts: a positively charged amino-terminus (N-region), a hydrophobic functional domain (H-region) and a negatively charged polar carboxyl-terminus (C-region) where the cleavage site is located (Fig. 3A). Two types of signal peptidases (SPases) can cleave SPs, SPase I and SPase II, which recognize different cleavage sites. SPase I cleaves the classical SPs while SPase II cleaves the SPs of lipoproteins. Positions −1 and −3 of the SPase cleavage site are often occupied by non-bulky polar amino acids (AXA pattern). Lipoprotein substrates show a pattern L[AS][GA]C at the −3 to +1 positions. This motif is recognized and cleaved by SPase II, and the cysteine at the +1 position is lipid modified following translocation [3]. The positively charged N-region and the hydrophobic helical H-region interact with phospholipids, and are recognized by SRP, SecA or trigger factor. Signal sequences with a more hydrophobic H-region show increased binding affinity for SRP [1], [3].
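
The two cleavage-site patterns described above can be illustrated with a minimal pattern-matching sketch. It is not how SignalP or LipoP work, and on its own it over-predicts, since it ignores the N- and H-regions. The function name, the 35-residue search window and the two toy sequences are assumptions made for illustration, and the AXA pattern is taken literally as Ala at positions −3 and −1.

```python
import re

# Lipobox L[AS][GA]C at positions -3..+1: SPase II cleaves just before the Cys.
# Lookaheads are used so that overlapping matches are not skipped.
LIPOBOX = re.compile(r"(?=L[AS][GA]C)")
# Literal AXA pattern: Ala at -3 and -1, any residue at -2 (SPase I).
AXA = re.compile(r"(?=A.A)")


def candidate_cleavage_sites(seq, max_sp_len=35):
    """List (spase_type, mature_start) candidates in the N-terminal window.

    mature_start is the 0-based index of the +1 residue, i.e. the first
    residue of the mature protein. max_sp_len is an assumed search window.
    """
    window = seq[:max_sp_len]
    hits = []
    for m in LIPOBOX.finditer(window):
        hits.append(("SPase II", m.start() + 3))  # the Cys is the +1 residue
    for m in AXA.finditer(window):
        hits.append(("SPase I", m.start() + 3))   # residue after A-X-A is +1
    return hits


if __name__ == "__main__":
    # Toy sequences for illustration only.
    print(candidate_cleavage_sites("MKATKLVLGAVILGSTLLAGCSSN"))
    print(candidate_cleavage_sites("MKKTAIAIAVALAGFATVAQAAPKD"))
```

The second toy sequence yields several AXA candidates inside the hydrophobic stretch as well as the C-terminal one, which shows why the prediction tools discussed in this review combine cleavage-site motifs with models of the N- and H-regions.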

Fig. 3

Sequence features of Sec/Tat SPs. (A) Sec-dependent SPs. There are two types of Sec-SPs, classical (top) and lipoprotein ones (bottom). Both are composed of an N-terminal region (blue), a hydrophobic region (dark blue) and a C-terminal region (grey). ‘+’ marks the positively charged region. The residue composition patterns of the C-terminal cleavage sites and the corresponding SPases are shown. (B) Tat-dependent SPs. SPs targeted to the Tat pathway have sequence features similar to Sec-SPs, but generally have longer N-terminal regions, which often contain a conserved motif with two consecutive arginine residues. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

It should be noted that some secretory proteins, e.g., autotransporters, have N-terminal extensions (N-AT) of varying length preceding the SPs [1]. A protein can be targeted to other pathways (e.g., the Tat pathway) despite the presence of a similar SP at the N-terminus [3]. Some unfolded proteins without N-terminal SPs have also been identified that are exported by the Sec pathway [8], [9] or by Sec-related non-classical pathways similar to the SecA2 pathway [10].

Algorithms and tools predicting proteins secreted through Sec pathway

More than a dozen software tools have been developed to predict SPs of proteins secreted through the Sec pathway (Table 2). SignalP is the most widely used program for identifying N-terminal SPs [11]. Since the first version of SignalP (SignalP 1.0) was proposed in 1997 [12], four new versions (SignalP 2.0 ~ 5.0) have been released [13], [14]. Despite the popularity of the SignalP tools and their great success in Sec substrate identification, other tools also have merits under certain circumstances. For example, versions of SignalP before 5.0 can only predict Sec substrates cleaved by SPase I (Sec/SPI) but not those cleaved by SPase II (Sec/SPII) [13], [14]. For the SignalP models of Gram-negative bacteria, the training datasets are enriched with sequences from E. coli and other γ-proteobacteria, so predictive performance could be compromised for other species [11]. Some secreted proteins contain uncleaved SPs, which SignalP cannot predict accurately [11].

Table 2

Representative software tools predicting Sec substrates in Gram-negative bacteria.

Tool	Algorithms	Target	URL or reference
SignalP4	Artificial Neural Network (ANN)	Sec/SPI; Cleavage site	https://services.healthtech.dtu.dk/service.php?SignalP-4.1; [13]
SignalP5	Deep Neural Network (DNN)	Sec/SPI; Sec/SPII; Tat/SPI; Cleavage site	https://services.healthtech.dtu.dk/service.php?SignalP-5.0; [14]
Signal-BLAST	BLASTP	Sec/SPI	http://sigpep.services.came.sbg.ac.at/signalblast.html; [15]
Signal-3L 2.0	Hierarchical Mixture Model	Sec/SPI; TMH; Cleavage site	http://www.csbio.sjtu.edu.cn/bioinf/Signal-3L; [16]
PrediSi	Position Weight Matrix (PWM)	Sec/SPI	http://www.predisi.de; [17]
Signal-CF	Pseudo Amino Acid Composition;K Nearest Neighbor Classifier	Sec/SPI; Cleavage site	http://www.csbio.sjtu.edu.cn/bioinf/Signal-CF; [18]
LipoP	HMM	Sec/SPI; Sec/SPII; TMH	https://services.healthtech.dtu.dk/service.php?LipoP; [19]
SPEPlip	ANN; Regular Expression Search	Sec/SPI; Lipoprotein; Cleavage site	http://gpcr.biocomp.unibo.it/cgi/predictors/spep/pred_spepcgi.cgi; [20]
Phobius/ PolyPhobius	Hidden Markov Model (HMM)	Sec/SPI; Full-protein TM topology	http://phobius.sbc.su.se; [21], [22]
Philius	Dynamic Bayesian Network (DBN)	Sec/SPI; Full-protein TM topology; Protein type	http://www.yeastrc.org/philius; [23]
TOPCONS	Consensus prediction	Sec/SPI; Full-protein TM topology; Protein type	http://topcons.net; [24]
SPOCTOPUS	ANN and HMM	Sec/SPI; TMH	http://octopus.cbr.su.se; [25]
MEMSAT3/ MEMSAT-SVM	ANN; Support Vector Machine (SVM)	Sec/SPI; TMH; Re-entrant helix; Protein type	http://bioinf.cs.ucl.ac.uk/psipred; [26], [27]
DeepSig	Deep Convolutional Neural Network(DCNN); Grammar-Restrained Hidden Conditional Random Field	Sec/SPI; Cleavage site	https://deepsig.biocomp.unibo.it/deepsig; [28]
SigUNet	Convolutional Neural Network (CNN)	Sec/SPI	https://github.com/mbilab/SigUNet; [295]
Signal-3L 3.0	Attention Deep Learning; Window-Based Scoring	Sec/SPI; Cleavage site	http://www.csbio.sjtu.edu.cn/bioinf/Signal-3L; [296]

With an independent benchmarking dataset, SignalP 5.0 showed the best performance in predicting both Sec/SPI and Sec/SPII SPs among the tools compared, with the exception of Signal-BLAST [14]. Signal-BLAST uses BLAST to find sequences homologous to known SPs, and therefore shows high accuracy [15], but its performance depends on the size of the curated databases and on the similarity between query proteins and database entries. Nevertheless, for a strain whose genome is newly sequenced, homology-based methods can be applied in parallel with or before SignalP 5.0, since they pick out the verified or most likely SP-containing proteins most precisely. Other tools can also predict Sec/SPII SPs, e.g., LipoP [19], but their performance is not comparable to that of SignalP 5.0 [14]. However, whereas SignalP 5.0 assigns a protein to Sec/SPI, Sec/SPII, Tat/SPI or other, LipoP classifies a protein as Sec/SPI, Sec/SPII, a protein with a TMH, or a cytoplasmic protein [19]. LipoP therefore has an additional use in distinguishing TMHs from SPs and from proteins without N-terminal TMHs or SPs. Other software tools can also specifically distinguish TMHs from Sec/SPI SPs, e.g., Signal-3L 2.0 [11], Phobius [21], Philius [23], TOPCONS [24], SPOCTOPUS [25], etc. These tools have further useful applications, such as full-protein TM topology prediction and protein classification (e.g., TM proteins with SP, TM proteins without SP, globular proteins with SP and globular proteins without SP), although they cannot predict Sec/SPII SPs and predict SP cleavage sites poorly or not at all. Savojardo et al. recently proposed a deep learning based SP prediction tool, DeepSig, which can both predict Sec/SPI SPs and locate the cleavage sites effectively [28]. DeepSig showed better performance than SignalP 4.1 and other tools in prediction of SP cleavage sites [28]. Although SignalP 5.0 was reported to outperform DeepSig in recognition of SPs, the cleavage site prediction accuracy of the two tools has not been compared, and DeepSig could therefore still have an advantage in cleavage site prediction [14]. Another deep learning tool, SigUNet, was recently developed, which can also predict Sec/SPI SPs of Gram-negative bacteria but did not show better performance than SignalP 4.0 or DeepSig [295]. Signal-3L 3.0, a model using a 3-layer hybrid method that integrates deep learning algorithms and window-based scoring, showed better performance in prediction of Sec/SPI SPs of Gram-negative bacteria but poorer performance in cleavage site prediction compared to SignalP 5.0 [296]. In summary, although many algorithms and tools have been developed to predict Sec substrates, SignalP 5.0 currently appears to perform best and could be the first choice. Other tools remain useful for specific purposes, e.g., cleavage site prediction, TMH / TM topology prediction, additional protein annotation, etc. Novel algorithms and tools are still required to further improve precision, to make taxon-specific predictions, and to distinguish uncleaved SPs and/or other types of SPs.

The twin arginine translocation (Tat) pathway

The Tat protein export system is also present in the inner membrane of many archaea, bacteria, chloroplasts, and plant mitochondria. Unlike the Sec pathway, the Tat pathway exports folded proteins of varied size [29]. A typical Tat system is composed of the subunits TatA, TatB and TatC (TatABC) or only TatA and TatC (TatAC) [30]. TatA, TatB and TatC are integral membrane proteins. TatA and TatB are homologous to each other; they evolved from a common ancestor but have diverged in function [29]. The TatA component of the TatAC system performs the functions of both TatA and TatB of the TatABC system [29]. In Gram-negative bacteria, TatABC is the only known Tat system. The mechanism of protein translocation through the Tat pathway remains largely unclear. In E. coli, TatB and TatC bind the twin arginine containing SPs of substrate proteins, followed by TatA recruitment, translocase channel formation and substrate translocation through the channel [31]. The Tat substrate export process is energized by the transmembrane proton motive force (PMF) [29]. After translocation, the SP is removed by a signal peptidase and the substrate is released into the periplasm. However, not all SPs of Tat substrates are cleaved. For example, bacterial Rieske iron–sulphur proteins have uncleaved SPs, which serve as signal anchors and are released laterally from the transporter into the membrane bilayer [29]. Some proteins with uncleaved Tat SPs can also be targeted to the bacterial outer membrane by an unknown mechanism [29]. Far fewer proteins are targeted to the Tat pathway than to the Sec pathway, and some bacteria have no Tat pathway at all [29]. Nevertheless, Tat substrates participate in various cellular processes, such as anaerobic metabolism, cell envelope biogenesis, metal acquisition and detoxification, and virulence [32]. In many important pathogens, the Tat pathway is essential and closely related to pathogenicity. Given its absence in mammals, the Tat pathway and its substrates are attractive targets for new anti-bacterial drugs [30]. It therefore appears promising to predict Tat substrates, explore the mechanisms of the Tat export pathway, and apply the findings in drug research and development.

Molecular features of the proteins secreted through Tat pathway

Similar to Sec substrates, most Tat substrates have N-terminal SPs that can be cleaved by SPase I or SPase II. The SPs of Tat substrates (Tat-SPs) show sequence features similar to those of Sec substrates, except that Tat-SPs often contain a conserved motif with two consecutive arginine residues (Fig. 3B). Tat-SPs are often longer than Sec-SPs, mainly due to their frequently longer N-regions [3]. Most computational models predict Tat substrates by specifically recognizing Tat-SPs.

Algorithms and tools predicting proteins secreted through Tat pathway

Unlike the Sec-SP predictors, only a handful of Tat-SP predictors have been developed, including TATFIND, TatP, PRED-TAT and SignalP 5.0 (Table 3). TATFIND uses a regular expression pattern matching approach combined with hydrophobicity analysis to identify Tat substrates [33], [34]. In the original TATFIND, a protein was predicted to be a Tat substrate if (1) there was a motif (X₋₁)R₀R₊₁(X₊₂)(X₊₃)(X₊₄) in the N-terminal 35 amino acids, where each X represents a permitted residue from a pre-defined set, and (2) there was an uncharged peptide fragment of no fewer than 13 amino acids downstream of R₀R₊₁ [30]. TATFIND version 1.2 expanded the rules, allowed methionine at X₋₁ and glutamine at X₊₄, and provided a full list of predicted Tat substrates in 84 microorganisms [34]. TATFIND can also distinguish Tat-SPs from Sec-SPs to some extent, but cannot predict the cleavage sites of Tat-SPs [34]. Bendtsen et al. proposed a new method, TatP, combining pattern matching for filtering and an ANN for classification, which can classify Tat-SPs, Sec-SPs and cytoplasmic proteins with similar motifs with high accuracy [35]. TatP can also predict the putative cleavage sites of Tat-SPs [35]. Comparison of TatP and TATFIND on different independent testing datasets demonstrated that TatP had a lower false positive rate but a higher false negative rate [35]. PRED-TAT is an HMM-based method that can classify Tat-SPs and Sec-SPs, predict the cleavage sites, and achieve higher accuracy than TATFIND and TatP [36]. Besides these tools, as mentioned above, the most recently developed SignalP 5.0 can also predict Tat/SPI SPs and shows the best prediction performance [14].
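
The two-part rule described for the original TATFIND can be sketched as follows. This is a simplified illustration, not a re-implementation of TATFIND. The permitted residue sets for the X positions are not given here, so any residue is accepted at those positions. The definition of charged residues (D, E, K, R), the `search_span` limit on how far downstream the uncharged stretch is sought, the function name and the toy sequences are all assumptions.

```python
import re

CHARGED = set("DEKR")  # assumption: His is treated as uncharged

# X(-1) R(0) R(+1) X(+2) X(+3) X(+4); any residue is allowed at X positions
# because the pre-defined permitted sets are not reproduced here.
RR_MOTIF = re.compile(r"(?=.RR...)")


def looks_like_tat_sp(seq, n_window=35, min_uncharged=13, search_span=22):
    """Return True if seq satisfies a simplified TATFIND-like rule.

    (1) a twin-arginine motif lies within the first n_window residues, and
    (2) an uncharged run of at least min_uncharged residues occurs within
        search_span residues after the second arginine (search_span is an
        assumed limit, not a value taken from the text above).
    """
    nterm = seq[:n_window]
    for m in RR_MOTIF.finditer(nterm):
        after_rr = m.start() + 3  # index of the residue following R(+1)
        run = 0
        for aa in seq[after_rr:after_rr + search_span]:
            run = 0 if aa in CHARGED else run + 1
            if run >= min_uncharged:
                return True
    return False


if __name__ == "__main__":
    # Toy sequences for illustration only.
    print(looks_like_tat_sp("MNNNDLFQASRRRFLAQLGGLTVAGMLGPSLLTPRRATAAQA"))
    print(looks_like_tat_sp("MKKTAIAIAVALAGFATVAQAAPKD"))
```

Because the X positions are unrestricted in this sketch, it is more permissive than the published rule and should be read only as an outline of how a motif check and a charge-based stretch check are combined.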

Table 3

Representative software tools predicting Tat substrates in Gram-negative bacteria.

Tool	Method	Target	URL or reference
TATFIND 1.4	Regular expression pattern; Hydrophobicity analysis	Tat/SPI; Sec/SPI	http://signalfind.org/tatfind.html; [33], [34]
TatP 1.0	Regular expression pattern; ANN	Tat/SPI; Sec/SPI; Cleavage site	https://services.healthtech.dtu.dk/service.php?TatP-1.0; [35]
PRED-TAT	HMM	Tat/SPI; Sec/SPI; Cleavage site	http://www.compgen.org/tools/PRED-TAT; [36]
SignalP5	DNN	Tat/SPI; Sec/SPI; Sec/SPII; Cleavage site	https://services.healthtech.dtu.dk/service.php?SignalP-5.0; [14]

Generally, the tools predicting Tat substrates are limited, and SignalP 5.0 is currently the first choice. However, it should be noted that none of the tools (including SignalP 5.0) can recognize Tat lipoprotein substrates cleaved by SPase II. Besides the proteins with SPs, there are also Tat substrates that do not contain any targeting sequence. These proteins use a hitchhiker mechanism to be exported by the Tat pathway: they form a complex with partner proteins containing Tat-SPs and are targeted with the assistance of the partners, sharing their SPs [37]. The E. coli hydrogenase 2 subunit HybC is an example of this type of Tat substrate [37]. However, the export mechanism is still unclear and the substrates remain largely unidentified, and consequently corresponding prediction tools are still lacking.

Non-classical exporters

The proteins that hitchhike across the inner membrane via the Tat pathway use a non-classical secretion mechanism. In Gram-negative bacteria, there are also other non-classical pathways by which proteins without putative Sec-SPs or Tat-SPs can enter the periplasm. SodA is a well-known example. SodA proteins in Helicobacter pylori, Aeromonas hydrophila, Rhizobium leguminosarum bv. viciae 3841, Rhodobacter sphaeroides and Paracoccus denitrificans all lack Sec or Tat signal peptides but are secreted into the periplasm [38], [39]. Secretion of these proteins across the inner membrane is Tat independent but requires SecA and N-terminal sequences [39]. No SecA2 pathway has been found in Gram-negative bacteria, so these proteins could be secreted through the Sec pathway in a non-classical manner, like the maltose-binding protein and alkaline phosphatase in E. coli [8], [9], or there could be other similar, not-yet-identified non-classical pathways for their secretion. Other proteins are secreted by such non-classical pathways as well, e.g., LuxS and TtsA in Salmonella [40], [41], ChiC in Serratia marcescens [42], etc. Currently, knowledge about these non-classical pathways and the properties of their substrates is quite limited, and no method has been developed to predict such secreted proteins.

Outer membrane and two-membrane spanning secretion systems

There are multiple secretion systems identified that span only the outer membrane (e.g., T5SSs, T7SSs and T8SSs) or both inner and outer membranes (e.g., T1SSs, T2SSs, T3SSs, T4SSs, T6SSs and T9SSs). Here, we will review the substrate proteins of each secretion system according to the naming order of the systems, which also reflects the time order for their first identification in Gram-negative bacteria.

T1SS

T1SSs have been reported in a large variety of Gram-negative bacteria, including plant and animal pathogens. They transport unfolded substrates out of the cell across the inner and outer membranes in one step [43]. A T1SS is composed of three elementary components: an ATP-binding cassette (ABC) transporter located in the inner membrane, an outer membrane factor (OMF), and a membrane fusion protein (MFP) connecting the ABC transporter and the OMF (Fig. 2) [43], [44]. Most OMFs belong to the multi-functional TolC family. T1SSs have a structure similar to that of the resistance-nodulation-division (RND) pumps in Gram-negative bacteria, which transport small molecules such as antibiotics [45]. The T1SS substrates, also called Type 1 Secreted Effectors (T1SEs), have various biological functions; examples include the virulence-related HlyA [46], Salmonella non-fimbrial giant adhesin SiiE [47], Legionella pneumophila RtxA [48] and Acinetobacter RTX-serralysin-like toxins [49], the biofilm formation related RTX adhesin LapA [50], [51], the digestive enzymes TliA and PrtA [52], etc. Engineering T1SSs for biomedical applications appears promising owing to the simple structure of the system and the frequent C-terminal signal sequences of T1SEs, which are convenient for genetic manipulation [53].

Molecular features of the T1SS substrates

The ABC transporters of T1SSs often show high specificity in binding their substrates. According to the ABC transporter type, typical T1SEs can be divided into 3 classes (Classes 1 ~ 3) (Fig. 4). Class 1 T1SEs are targeted to C39-containing ABC transporters with hydrolase activity. These T1SEs normally contain N-terminal leader peptides. The C-termini of the leader peptides contain a canonical double glycine (‘GG’) motif, which is recognized and cleaved by the C39 domains of the corresponding ABC transporters before the mature proteins are secreted through T1SSs (Fig. 4) [54]. The Class 1 T1SEs are the smallest known T1SS substrates and include the small bacteriocins or microcins.

Fig. 4

Sequence features of T1SS substrates. T1SSs can be divided into 4 groups. The substrates of Class 1 T1SSs typically contain N-terminal leader peptides (blue), while Classes 2–4 have secretion signal sequences in the C-termini (grey). Consensus sequence motifs are shown for the RTX repeats (light green and pink). RTX repeats are not necessarily present in Class 3 T1SEs. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Class 2 T1SEs are targeted to C39-like domain (CLD)-containing ABC transporters. These T1SEs have specific repeats-in-toxin (RTX) domains and are also called RTX proteins. The glycine-rich nonapeptide repeats in RTX domains show a ‘GGxGxDxUx’ consensus sequence motif, where ‘x’ is any amino acid and ‘U’ represents a large or hydrophobic amino acid. Unlike the Class 1 T1SEs, the RTX proteins have a large molecular mass. The CLDs of the corresponding ABC transporters have a structure similar to C39 peptidase domains but do not show any hydrolase activity. The RTX proteins also lack the N-terminal leader peptides and ‘GG’ motifs seen in Class 1 T1SEs [54]. The Class 2 RTX T1SEs have secretion signal sequences in the C-termini, but the signal patterns and functional mechanisms have not been clarified (Fig. 4) [54]. It is also unclear how the CLD-containing ABC transporters interact with the substrates. Class 3 T1SEs are targeted to a type of ABC transporter without any additional N-terminal domain. The substrates do not necessarily contain RTX repeat sequences, but have a C-terminal secretion signal sequence as in RTX proteins (Fig. 4). They do not contain N-terminal leader peptides either. The T1SEs of this class are normally much smaller than RTX proteins and have hydrolytic activity. Various proteases, lipases and the iron scavenger protein HasA belong to this group [54]. Recently, a fourth class of T1SEs has been reported, exemplified by RTX adhesins [55]. Unlike Classes 1–3, the RTX adhesins are transported from the cytoplasm to the extracellular space in two steps, and are therefore considered non-classical [55]. This class of T1SS machinery is often linked with a bacterial transglutaminase-like cysteine proteinase (BTLCP) [56]. The RTX adhesins have dialanine BTLCP cleavage sites in the N-terminal retention module, which are recognized and cleaved by the machinery-coupled BTLCP in the periplasm before transport across the outer membrane [57], [58], [59]. The currently known RTX adhesins also have specific repeats that are important for their function, as well as RTX repeats and signal sequences in the C-termini (Fig. 4).
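
The ‘GGxGxDxUx’ consensus lends itself to a simple pattern search, similar in spirit to the pattern-searching step used for RTX proteins but not taken from any published tool. In the sketch below, the residue set standing in for ‘U’ (L, I, V, F, Y, W, M) is an assumption, since only "large or hydrophobic" is specified above, and the function name and toy sequence are hypothetical.

```python
import re

# 'x' = any residue; 'U' = large or hydrophobic residue.
# The concrete set used for 'U' is an assumption made for this sketch.
U_RESIDUES = "LIVFYWM"
RTX_REPEAT = re.compile(r"(?=GG.G.D.[" + U_RESIDUES + r"].)")


def find_rtx_repeats(seq):
    """Return 0-based start positions of GGxGxDxUx nonapeptide matches.

    A lookahead is used so that overlapping candidates are also reported.
    """
    return [m.start() for m in RTX_REPEAT.finditer(seq)]


if __name__ == "__main__":
    # Toy sequence: three tandem copies of one nonapeptide, for illustration.
    toy = "MS" + "GGAGNDTLY" * 3 + "KK"
    positions = find_rtx_repeats(toy)
    print(positions, len(positions))
```

Counting matches per protein gives a crude indicator of an RTX domain; as noted above for Class 3 T1SEs, the absence of such repeats does not exclude a protein from being a T1SS substrate.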

Algorithms and tools predicting T1SS substrate proteins

Despite the functional importance and the large number of T1SEs, very few software tools have been developed to predict them (Table 4). Linhartova et al combined pattern searching, HMM profiles and RPS-BLAST to predict 1024 candidate RTX proteins from 840 bacterial genomes [60]. In 2015, Luo et al made the first attempt to design a machine-learning model to predict RTX proteins [61]. Luo’s method combined sequence-derived features with the random forest (RF) algorithm, and considered both the full-length T1SE sequences and the newly identified C-terminal signals to improve the prediction precision [61]. To date, no algorithms or tools have been reported for prediction of the other classes of T1SEs.

Table 4

Representative software tools predicting substrates of T1 ~ 9SSs.

Secretion System(s)	Tool	Method	URL or reference
T1SS	Linhartova's	Data mining	[60]
	Luo's	Random Forest (RF)	[61]
T3SS	SIEVE	SVM	http://cbb.pnnl.gov/portal/tools/sieve.html; [105]
	SSE-ACC	SVM	[113]
	BPBAac	SVM	http://biocomputer.bio.cuhk.edu.hk/softwares/BPBAac; [92]
	TEREE	Probability scoring	[97]
	T3SEpre	SVM	http://biocomputer.bio.cuhk.edu.hk/softwares/T3SEpre; [114]
	BEAN/BEAN 2.0	SVM	http://systbio.cau.edu.cn/bean/; [116]
	EffectiveT3	Naïve Bayes (NB)	http://www.chlamydiaedb.org; [91]
	Modlab	ANN and SVM	http://www.modlab.org
	T3_MM	Markov Model	http://biocomputer.bio.cuhk.edu.hk/softwares/T3_MM; [109]
	RF model	RF	http://cic.scu.edu.cn/bioinformatics/T3SPs.zip
	pEffect	PSI-BLAST and SVM	http://services.bromberglab.org/peffect; [117]
	GenSET	Voting algorithm	[111]
	DeepT3	DCNN	https://github.com/lje00006/DeepT3; [112]
	Bastion3	Two-layer ensemble model	http://bastion3.erc.monash.edu/; [120]
	Tbooster	Logistic Regression (LR), RF and SVM	http://tbooster.erc.monash.edu/index.jsp; [118]
	orgsissec	Phylogenetic profiles	http://www.iib.unsam.edu.ar/orgsissec/; [115]
	T3SEpp	Multiple features; ensemble models	http://www.szu-bioinf.org/T3SEpp; [121]
	EP3	Ensemble models	http://lab.malab.cn/~lijing/EP3.html; [297]
T4SS	S4TE	Motif searching	http://sate.cirad.fr/; [152]
	Burstein's	Voting algorithm	http://www.tau.ac.il/~talp/LegionellaMachineLearning; [141]
	Lifshitz's	Hidden Semi-Markov Model (HSMM)	[144]
	Chen's	Genetic Screening	[142]
	T4EffPred	SVM	http://bioinfo.tmmu.edu.cn/T4EffPred; [145]
	T4SEpre	SVM	http://biocomputer.bio.cuhk.edu.hk/softwares/T4SEpre/; [138]
	Wang's	SVM	https://github.com/LoopGan/Effective-prediction-of-bacterial-type-IV-secreted-effectors; [146]
	PredT4SE-Stack	Stacked generalization	http://xbioinfo.sjtu.edu.cn/PredT4SE_Stack/index.php; [147]
	Bastion4	Ensemble model	http://bastion4.erc.monash.edu/; [148]
	OPT4e	SVM	https://bitbucket.org/zhesna/opt4e/; [150]
	SecReT4	BLASTp	http://db-mml.sjtu.edu.cn/SecReT4/
	Tbooster	LR, RF and SVM	http://tbooster.erc.monash.edu/index.jsp; [118]
	CNN-T4SE	CNN; voting	https://idrblab.org/cnnt4se/; [298]
	T4SE-XGB	eXtreme gradient boosting (XGBoost) algorithm	https://github.com/CT001002/T4SE-XGB; [299]
	orgsissec	Phylogenetic profiles	http://www.iib.unsam.edu.ar/orgsissec/; [115]
T5SS	twin-HMM	HMM	[163]
	Zude's	Seeded guide trees and HMM	[164]
	Vo's	BLASTp	[165]
T6SS	Bastion6	SVM	http://bastion6.erc.monash.edu/; [190]
	PyPredT6	Consensus of MLP, SVM, KNN, NB, RF	http://projectphd.droppages.com/PyPredT6.html; [191]
	SecReT6	BLAST	http://db-mml.sjtu.edu.cn/SecReT6/; [173]
	Tbooster	LR, RF and SVM	http://tbooster.erc.monash.edu/index.jsp; [118]
	orgsissec	Phylogenetic profiles	http://www.iib.unsam.edu.ar/orgsissec/; [115]
T9SS	Veith's	HMM	[221]


T2SS

T2SSs are conserved in Gram-negative bacteria. They transport folded substrate proteins from the periplasm through the outer membrane. The substrates can either be anchored in the outer membrane or secreted completely into the extracellular milieu. The T2SS is a complicated apparatus comprised of 40–70 proteins belonging to 12–15 different families. The apparatus consists of four sub-assemblies (Fig. 2): an inner membrane platform, an outer membrane complex, a secretion ATPase and a pseudopilus located in the periplasm but connected to the inner membrane platform [62]. The secretion of T2SS substrates is a two-step process, in which the proteins must first be exported into the periplasm through the Sec or Tat pathway [62]. If exported through the Sec pathway, the protein must fold in the periplasm before T2SS secretion. Structural components of the T2SS apparatus cooperate to recruit the substrates and help them enter the secretin channel formed by the outer membrane complex. The inner membrane platform connects the sub-assemblies and coordinates substrate transport. The secretion process is energized by the ATPase located in the cytoplasm, while the pseudopilus pushes substrates through the channel in a piston-like manner. The pseudopilus is phylogenetically and structurally similar to type IV pili [63]. The inner membrane proteins, outer membrane proteins and ATPases of T2SSs also show homology to the type IV pili system (T4P) and the tight-adherence pili system (Tad). Therefore, both T4P and Tad have been classified as subtypes of T2SSs [2]. Consequently, T2SSs can be divided into 3 classes, i.e., T2aSSs (classical secretin-dependent T2SSs), T2bSSs (T4P) and T2cSSs (Tad) [2]. More details on the T4P and Tad systems are given in Section 3.7. T2SS substrates are mainly enzymes, including proteases, lipases, phosphatases and others, which help bacteria adapt to the environment and survive [62].
Some T2SS substrates can destroy host defenses, provide nutrients for bacteria, and facilitate bacterial colonization and diseases [64]. For example, the Acinetobacter lipases LipA and LipH as well as the protease CpaA exert important function in bacterial colonization and spread [65]. Enterohemorrhagic E. coli (EHEC) secretes YodA through T2SS to facilitate its adhesion and colonization [66]. Many pathogens also use T2SSs to secrete toxins and cause diseases. For example, the cholera toxin of Vibrio cholerae is secreted through T2SS and causes severe watery diarrhea [67]. The exotoxin A plays an important role in the lethal infection of Pseudomonas aeruginosa [68].

Molecular features, computational algorithms and tools of the T2SS substrate proteins

It remains an enigma how a T2SS recognizes and transports its widely distributed substrate proteins [62]. Structural studies indicated that T2SS substrate proteins contain relatively abundant β-strands, and yet no common secretion signal has been identified that could be specifically recognized by a T2SS [62], [69]. There could be spatial secretion motifs composed of residues from different regions of a protein and formed only after protein folding or assembly [62], [70]. Algorithms and tools to predict T2SS substrates are still lacking, mainly because of the difficulty in finding common features among the molecules. Structure determination, analysis and comparison of more T2SS substrate proteins may reveal such features and lay the foundation for accurate prediction of new, important T2SS substrates in the future.

T3SS

T3SS is a syringe-like apparatus spanning both the inner and outer membranes, with the tip of the needle piercing the membrane of host cells and mediating the translocation of substrate proteins from the bacterial cytoplasm into the host cytoplasm in one step [71]. Identified only in Gram-negative bacteria, including many important animal and plant pathogens, T3SSs play important roles in bacterial interaction with host cells and in pathogenicity [72], [73]. A T3SS apparatus is comprised of ~30 structural and accessory proteins, which form multiple sub-assemblies, including a cytosolic sorting platform (SctO/L/K, following the unified nomenclature for conserved components of T3SS) with an ATPase (SctN), a cytoplasmic ring (SctQ), an inner membrane export apparatus (SctR/S/T/U/V), an inner membrane ring (SctJ and SctD), an outer membrane ring (SctC), a needle assembly also called the inner rod, a needle or pilus, and a translocon tip complex that is located in the host cell membrane (Fig. 2) [74], [75], [76], [77], [78]. Bacterial flagella transport systems have high structural similarity to T3SSs and component proteins homologous to T3SS proteins, and the two are likely to share a most recent common ancestor [78], [79], [80]. Therefore, the flagellar protein export system has been considered a sub-class of T3SSs (T3bSS) [2]. The effector-translocating, non-flagellar T3SSs are correspondingly called T3aSSs. Unlike the T3aSS needles (or pili), which mainly serve as components of the protein translocation machine, their homologous counterparts in T3bSS, i.e., flagella, can participate in chemotaxis, adhesion, biofilm formation, effector secretion and immune system regulation [81]. A complete T3bSS is composed of around 30 unique structural proteins, each present in several to tens of thousands of copies [82].
A typical flagella export system contains three structural parts: the basal body which contains the reversible motor that anchors the structure to the membrane, the hook which extends out from the top of the basal body and acts as a universal joint, and the filament which extends many cell body lengths from the hook and, when rotated, forms the helical propeller [81]. Like the membrane rings and inner membrane export apparatus in T3aSSs, the basal body proteins in T3bSSs are exported through Sec pathway. Once the core T3SS is assembled, the subunit proteins of flagella (e.g., the flagellar hook FlgE and the hook-capping protein FlgD) and T3aSS needle pili are transported through respective conduit [80]. Some studies demonstrated that flagella and T3aSS pili proteins contain common secretion signals so that they can be secreted through the other injectosome [83], [84], [85].

Molecular features of the T3SS substrates

Like the fimbriae systems, in most cases, the flagella export systems only secrete the flagella subunit proteins, which can easily be recognized by homology searching. Therefore, here we focus on the substrates of non-flagella T3SSs. The non-flagella T3SSs deliver a list of substrates into host cells, which often exert important functions and facilitate bacterial colonization, invasion, infection and survival. These substrates are also called T3SS secreted effectors (T3SEs). A classical T3SE contains a secretion signal-bearing N-terminus, C-terminal effector domain(s), and a chaperone-binding domain (CBD) connecting the termini (Fig. 5) [86], [87]. Both the N-terminal signal sequence and the CBD are essential for T3SS recognition, recruitment and secretion [88], [89], [90]. The length of the signal sequences varies from ~5 to ~100 amino acids, and no common motif has been identified in a majority of the effectors, though atypical amino acid composition bias profiles (e.g., enrichment of serine/threonine/proline) were observed [91], [92]. Specific chaperones bind to T3SEs at the CBDs and unfold them, since T3SEs can only be translocated through the T3SS conduit in an unfolded state [90], [93]. The chaperones often pair with effectors; their genes co-localize in the genome and co-evolve [93], [94]. A common structural motif was identified in the CBDs of a list of T3SEs from a variety of bacteria [95]. The effector domains are diverse and generally not conserved among different species. However, there are still several large families identified from a broad diversity of bacteria, e.g., YopM, YopJ, etc. [73].

Fig. 5

Sequence features of the substrates of type 3/4/6 secretion systems. (A) A classical T3SE contains a secretion signal-bearing N-terminus, a C-terminal effector domain, and a CBD connecting the termini. (B) Classical T4SEs (T4aSE/T4bSE) show amino acid preference patterns in the C-terminal regions. Some of the effectors also contain essential translocation-guiding signals in the N-termini. Different from T4aSS and T4bSS effectors, T4xSS effectors contain a conserved C-terminal domain termed ‘XVIPCD’. (C) Some T6SEs contain a MIX (marker for type six effectors) motif in the N-termini, potentially as the T6S signal. There could also be other putative catalytic motifs, as shown in the example proteins (green). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6

T5SSs and the features of their substrates. (A) Substrate export of different types of T5SSs. (B) Sequence features of the substrates of different types of T5SSs. The pre-proteins of all these substrates are also Sec substrates and therefore contain SPs in the N-termini. However, autotransporters (T5aSSs), TpsA exoproteins of the two-partner systems (T5bSSs) and trimeric autotransporters (T5cSSs) specifically have extended signal peptide regions (top). A T5aSS contains a passenger domain and a β-barrel translocator domain. Cleavage occurs between the two asparagine residues located between the two domains (red arrow). A T5bSS is composed of two polypeptides, the substrate TpsA and the transporter TpsB. There is a conserved ‘NPNL’ motif in TpsA that is essential for its secretion. The TpsB and T5dSEs both contain POTRA (polypeptide transport-associated) motifs preceding the putative 16-stranded beta-barrel domains in the C-termini. T5cSS is composed of three polypeptides, while T5eSS is inverted, with an additional small periplasmic domain in the N-termini. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

There are still debates about the secretion signals of T3SEs, since for some effectors the signals appeared to be contained at the mRNA rather than the protein level [96], [97], [98]. At the DNA level, there could be conserved signatures buried in the promoter regions of effector genes or of the operons to which they belong [73]. The effector genes are scattered in the genome but are often transcriptionally co-regulated by the pivotal T3SS regulators. This co-regulation of expression requires the conservation of cis-acting elements. For many bacteria, especially plant pathogens, e.g., Pseudomonas syringae and Ralstonia spp., the motif features have been well defined in the regulon promoters of the key T3SS regulators, e.g., HrpL, HrpB, etc. [99], [100].

Algorithms and tools predicting T3SS substrate proteins

Computational prediction of bacterial T3SEs has attracted much research interest since the first T3SSs were identified. More than 20 algorithms and software tools have been developed (Table 4). Generally, the methods can be classified into 3 types: (1) methods based on sequence pattern recognition and homology searching, (2) machine-learning or statistical models mostly based on features buried in signal sequences, and (3) simple or ensemble models based on integrated features. Homology searching against known T3SEs was the main strategy to predict new effectors and was highly successful in the early stage [101], [102]. However, a major limitation soon surfaced, since the T3SEs show a large diversity in sequences and the number of experimentally verified novel effectors was very small. It was difficult to find more homologs based on a limited dataset of known effectors. As the number and diversity of validated effectors have increased, however, the pattern-based or homology-based strategy remains an important choice for identification of partially new T3SEs [103], [72], [73]. Using a homology-searching strategy and a list of 519 non-redundant, manually curated, verified effectors, Hu et al recently identified 8740 T3SEs from hundreds of bacterial genomes with T3SS(s) [73]. Besides sequence patterns or homology at the protein level, the promoter sequences were also studied and applied to predict the T3SE genes regulated by the key T3SS regulators [104], [99], [100]. However, features of this kind, and the effector prediction based on them, are frequently species- or genus-specific. Since the first two back-to-back reports published ten years ago [91], [105], more and more machine-learning algorithms have been introduced in T3SE prediction. Most algorithms focused on the sequence features in T3SE signal regions.
For example, EffectiveT3 mainly learned the sequential composition features of physicochemical property-binned amino acids and oligopeptides in the N-termini of known T3SEs using a Naïve Bayes (NB) model [91], while SIEVE also trained on sequential features in both the N-termini and the full length of effectors, as well as in gene sequences [105]. The position-specific amino acid composition (Aac) preference of T3SE signal sequences was learned for the first time by an ANN model [106], and was further observed, refined, and trained in a Bi-Profile Bayesian (BPB)-SVM model (BPBAac) [92]. Other sequence-derived features, such as codon usage and instability, constraint of neighbor Aac, etc., were also observed and learned in new models [107], [108], [109], [110]. To predict T3SEs more specifically and precisely, Hobbs et al suggested subgrouping the training datasets and developing species-specific models (GenSET) based on the N-terminal sequence features for better prediction accuracy [111]. New algorithms such as deep learning have also been applied to predict T3SEs. For example, DeepT3 is a deep convolutional neural network (CNN) model that was trained most recently to learn the sequential features of known T3SEs within the N-terminal 100 amino acids [112]. Besides the sequential features, Arnold et al, Yang et al and Wang et al also observed secondary structure (SSE) and solvent accessibility (ACC) properties in the N-termini of T3SEs [113], [91], [92]. Two models, SSE-ACC and T3SEpre, were trained to learn the SSE and ACC composition features and to predict new T3SEs [113], [114]. Common tertiary structures were also found in the signal regions of T3SEs [114]. Other T3SE features were also studied extensively and applied independently for effector prediction. For example, the chaperone-effector pairing features were explored for Bordetella T3SE identification [94], while phylogenetic profiles were also observed and used for T3SE prediction [115].
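Many of the models above start from composition features of the N-terminal signal region. As an illustration only, the sketch below computes the amino acid composition of the N-terminal 100 residues and a property-binned version of it. The four bins are hypothetical groupings chosen for this example and do not reproduce the alphabet of EffectiveT3 or any other published tool; the function names are likewise assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical physicochemical bins, for illustration only.
PROPERTY_BINS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}


def n_terminal_composition(seq, window=100):
    """Fraction of each standard amino acid within the N-terminal window."""
    region = seq.upper()[:window]
    counts = Counter(region)
    total = sum(counts[aa] for aa in AMINO_ACIDS)  # ignores non-standard symbols
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def binned_composition(seq, window=100):
    """Sum the per-residue fractions within each property bin."""
    comp = n_terminal_composition(seq, window)
    return {
        name: sum(comp[aa] for aa in members)
        for name, members in PROPERTY_BINS.items()
    }


if __name__ == "__main__":
    toy = "MSSTPSLNKD"  # hypothetical N-terminus
    print(binned_composition(toy))
```

Vectors like these would then be passed to a classifier such as NB or SVM; position-specific profiles of the kind used by BPBAac additionally record which residue occurs at each position, which this sketch does not attempt.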
Both single models and prediction models based on single types of features were found to be less effective when independent datasets were tested. The most intuitive examples are the homology-based methods and ab initio machine-learning models. The homology-based strategies can find many true ‘new’ effectors. However, these are not truly novel, because they show high sequence similarity to known effectors. Such approaches depend heavily on the number of validated effectors and cannot find truly novel effectors. The machine-learning models are able to predict novel effectors, but they often over-fit local features and produce false positive predictions. To overcome the drawbacks of each strategy, several T3SE predictors integrated them as different arms to make more accurate predictions. For example, BEAN2.0 provides a web-based T3SE prediction platform with both a homology-searching module and machine-learning models [116]. Similarly, pEffect combined homology-based prediction (PSI-BLAST) and ab initio SVM models to make comprehensive predictions of T3SEs [117]. Ensembles of individual machine-learning models were also found to achieve better performance in T3SE prediction [298], [118], [119]. A recently developed tool, Bastion3, which is an ensemble model integrating multiple types of T3SE features, was reported to achieve much better performance than commonly used methods [120]. An integrated prediction method, T3SEpp, was also published most recently; it takes into account multiple-aspect features, considers both homology-searching and machine-learning techniques, and forms a hierarchical ensembler to make more precise T3SE predictions with an apparently lower false positive rate [121].
Although different algorithms and methods have their own merits in prediction of T3SEs, the integrated strategies that combine homology searching and machine-learning prediction, such as pEffect, BEAN2.0 and T3SEpp, achieved better performance on average when evaluated with different benchmarking datasets. Methods that consider multiple-aspect features and hierarchically integrate multiple models, e.g., T3SEpp and Bastion3, are also recommended.

T4SS

T4SS is also a multi-component complex expressed by versatile bacterial species [122], [123]. It can mediate the transfer of DNA or protein substrates into a large range of eukaryotic and bacterial cells. Based on the substrate type, T4SSs can be divided into two major families, conjugation systems and effector translocators [122], [123]. The conjugation T4SSs are distributed in both Gram-positive and Gram-negative bacterial species, and mediate the transfer of mobile genetic elements (MGEs). The effector-translocation T4SSs are mainly found in Gram-negative bacteria, for which the substrates could be proteins, single-strand and double-strand DNA molecules [122]. There are also a few other T4SSs that can secrete DNA or protein substrates into extracellular milieu [122], [123]. Phylogenetic analysis based on the conserved T4SS components suggested that the conjugation T4SSs emerged in Gram-negative bacteria first and were expanded to Gram-positive species, followed by the most recent diversification into the dedicated effector translocation systems and others [124], [125], [126]. The effector-translocation T4SSs were classified into two broad phylogenetic groups designated as types IVA (T4aSS) and IVB (T4bSS) respectively. T4aSS is represented by the Agrobacterium tumefaciens VirB/VirD4 T4SS encoded by the R388 plasmid and the Helicobacter pylori Cag T4SS [127], [128]. The T4SSs are composed of 12 core subunits, VirB1-11 and VirD4, each with multiple copies, forming four structural sub-assemblies (Fig. 2): (1) cytoplasmic ATPases (VirB4, VirB11, VirD4), (2) inner membrane platform (VirB3, VirB6, VirB8), (3) outer membrane core complex (VirB7, VirB9, VirB10) and (4) pilus (VirB1 transglycosylase, VirB2 pilin, VirB5 pilus-tip protein). T4bSS also involves multiple (>25) proteins for assembly, among which a few are similar to VirB/VirD4 subunits and others (>20) are T4bSS specific. 
The Legionella pneumophila Dot/Icm T4SS is a typical example of T4bSS, which also contains the four major sub-assemblies similar to T4aSS [129], [130]. Recently, Xanthomonas citri T4SSs have been recognized as a new group, namely X-T4SSs (T4xSS), which is similar to T4aSS, but contains an uncharacteristically large VirB7 lipoprotein subunit whose C-terminal N0 domain decorates the periphery of the outer membrane layer of the core complex [131], [132], [133]. Another important feature of T4xSS is its ability to mediate the translocation of effectors into and kill competitor bacteria [134], [135].

Molecular features of the T4SS substrates

Like T3SSs, the protein-translocation T4SSs also show preference for the substrate effectors by specific recognition of the secretion signals (Fig. 5). Motifs or amino acid preference patterns were disclosed within the C-terminal regions of T4SS effectors (T4SEs), e.g., two positively charged amino acids separated by three or four amino acids, of which at least one is negatively charged [136], frequent tiny and polar amino acids [137], and significant enrichment of glutamic acid and serine [138]. More flexible secondary structure and higher hydrophilicity were also found for the C-terminal signal regions of T4SEs [138]. Some of the effectors also contain essential translocation-guiding signals in the N-termini [139], [140]. Different from T4aSS and T4bSS effectors, T4xSS effectors interact with the effector-coupling protein VirD4 through a conserved C-terminal domain termed XVIPCD (Xanthomonas VirD4-interacting protein-conserved domain) [133], [134], [135]. The sequence features described above have frequently been used for T4SE prediction (Table 4). There are also other atypical or species-specific sequence-based features, such as the GC content, gene regulatory patterns, eukaryote-like domains, etc., which have also been used for effector identification [141], [142], [143].
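The first C-terminal pattern mentioned above (two positively charged residues separated by three or four residues, at least one of which is negatively charged) can be expressed as a short scan. This is a hedged sketch rather than a published predictor: treating only K/R as positive and D/E as negative, and restricting the search to the C-terminal 100 residues, are assumptions made for the example, as is the function name.

```python
POSITIVE = set("KR")  # assumption: histidine is not counted as positive
NEGATIVE = set("DE")


def find_charge_motifs(seq, c_term_window=100):
    """Return (0-based start in seq, matched segment) for each motif hit
    found within the C-terminal window of the sequence."""
    seq = seq.upper()
    region = seq[-c_term_window:] if c_term_window > 0 else ""
    offset = len(seq) - len(region)
    hits = []
    for i, residue in enumerate(region):
        if residue not in POSITIVE:
            continue
        for gap in (3, 4):  # number of residues between the two positive ones
            j = i + gap + 1
            if j < len(region) and region[j] in POSITIVE:
                spacer = region[i + 1:j]
                if any(r in NEGATIVE for r in spacer):
                    hits.append((offset + i, region[i:j + 1]))
    return hits


if __name__ == "__main__":
    toy = "MAAKAEAKLL"  # hypothetical C-terminal fragment
    print(find_charge_motifs(toy))
```

A pattern this short will match many non-effector proteins by chance, so in practice it would serve as one feature among several rather than as a standalone classifier.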

Algorithms and tools predicting T4SS substrate proteins

The earliest T4SE prediction algorithms and tools were all species-specific, e.g., the ones predicting L. pneumophila and Coxiella burnetii protein substrates [141], [142]. Burstein et al applied machine-learning algorithms to the prediction of L. pneumophila T4SS effectors for the first time [141]. SVM, NN, NB and Bayesian network (BN) based models were trained to learn the genomic, evolutionary, regulatory and other specific features of L. pneumophila effectors. A voting-based strategy was adopted to combine the prediction results of the different models. Moreover, the model performance improved through an iterative process of model training, prediction, validation and inclusion of newly validated effectors [141]. Chen et al combined gene selection and bioinformatic sequence feature analysis, and proposed a method to infer the T4SEs in C. burnetii [142]. Although these methods performed well in predicting effectors from L. pneumophila or C. burnetii, their species-specific features, such as regulatory attributes, limited their general application to other species. After the L. pneumophila-specific models were developed [141], the same group also trained a hidden semi-Markov model (HSMM) to represent the common Aac in effector secretion signals of Legionella and Coxiella T4SSs [144]. The model could make cross-species effector predictions, but mainly for IVB T4SSs [144]. T4EffPred and T4SEpre were two early, truly general T4SE predictors [138], [145]. Both are SVM-based models and take protein sequences as input. T4EffPred uses the full-length protein sequences for feature analysis, and can classify the effectors of the two types of T4SSs, IVA and IVB [145]. However, because T3SEs and T4SEs may share common features, T3SEs could appear as false positives in the T4EffPred results [143].
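The voting-based combination of several classifiers described above can be sketched in a few lines. The sketch assumes each model outputs a probability-like score per protein and uses a strict-majority rule with a 0.5 cutoff; the model names, scores and thresholds are hypothetical, and the actual voting scheme of Burstein et al may differ.

```python
def majority_vote(predictions, min_votes=None):
    """predictions maps model name -> bool; return True when enough models agree.
    By default a strict majority of the models is required."""
    votes = sum(1 for call in predictions.values() if call)
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    return votes >= min_votes


def vote_on_candidates(scores, threshold=0.5):
    """scores maps protein id -> {model name: score}; return protein id -> bool."""
    calls = {}
    for protein_id, model_scores in scores.items():
        per_model = {name: s >= threshold for name, s in model_scores.items()}
        calls[protein_id] = majority_vote(per_model)
    return calls


if __name__ == "__main__":
    # Hypothetical scores from four models (SVM, NN, NB, BN).
    scores = {
        "protein_1": {"svm": 0.9, "nn": 0.7, "nb": 0.2, "bn": 0.6},
        "protein_2": {"svm": 0.4, "nn": 0.6, "nb": 0.1, "bn": 0.3},
    }
    print(vote_on_candidates(scores))
```

With an even number of models, a strict majority means a 2:2 split is called negative, which favours precision over recall; lowering min_votes would shift that balance.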
Wang et al manually annotated a complete list of experimentally validated T4SEs from different bacteria, observed the possible common motifs, sequential and position-specific Aac, secondary structure and solvent accessibility features in the C-termini of the effectors, and developed three models, i.e., T4SEpre_psAac, T4SEpre_bpbAac and T4SEpre_Joint [138]. Despite its good general and cross-species performance, T4SEpre has a main drawback: its features are restricted to the C-terminal 100 amino acids of the candidate proteins [138]. In fact, at least some T4SEs also contain secretion signals at the N-termini [139], [140]. Another model was thereafter trained to overcome this limitation, with features extracted from both the N-terminal 50 and C-terminal 100 amino acids of the subject proteins [146]. As for T3SE prediction, several hierarchical ensemble models with multi-aspect features, e.g., PredT4SE-Stack and Bastion4, have recently been trained to improve the prediction performance for T4SEs [147], [148]. CNN-T4SE integrated three convolutional neural network models trained on the amino acid composition, solvent accessibility and secondary structure of full-length T4SEs, achieving better performance than other tools and fewer false positive predictions [298]. Other groups adopted an alternative strategy, selecting the best optimized features and/or training and identifying the best machine-learning models, to improve the prediction performance [299], [149], [150], [151]. Some of the models have been successfully applied to the identification of T4SEs in L. pneumophila [151] and Anaplasma phagocytophilum (OPT4e; [150]). Besides the machine-learning based methods, homology searching was also used in T4SE prediction.
For example, S4TE integrated 13 sequence homology-based features, including homology to known effectors, homology to eukaryotic domains, presence of subcellular localization signals, secretion signals, etc., and developed a scoring scheme to predict T4SEs mainly from α- and γ-proteobacteria [152].

T4SSs and their substrates are the most complicated among the secretion systems. A T4SS can deliver proteins, double-strand DNAs or single-strand DNAs into the extracellular milieu, eukaryotic cytoplasm or competitor bacterial cytoplasm [122]. Currently, the accurate clustering, distribution and substrate recognition mechanisms of T4SSs remain unclear. None of the algorithms or tools can identify possible common features in the protein and DNA substrates of a single T4SS or of similar T4SSs. The machine-learning models are also unable to predict the effectors of T4SSs targeted to competitor bacterial cells. Generally, for the model species with a large number of validated T4SEs, such as L. pneumophila, C. burnetii, A. tumefaciens, and A. phagocytophilum, the species-specific models are suggested. For species phylogenetically close to these, homology-based screening strategies are recommended. For other species with a functional T4SS, as for T3SE prediction, the tools that combine homology searching with machine-learning ensemblers built on multi-aspect features appear to perform better and are therefore recommended.

T5SS

T5SS is a special group of protein secretion systems widely distributed in Gram-negative bacteria. A classical T5SS is composed of only a single protein, which transports itself and is therefore also called an autotransporter [153]. The protein contains a β-barrel domain, which inserts into the bacterial outer membrane, forms a translocation channel and mediates the transport of the remaining protein fragment (the passenger domain) [63]. The autotransporters are exported through the inner membrane via the Sec pathway before being integrated into the outer membrane. There are also two- or multi-component T5SSs, but their conduits only span the outer membrane, and the substrates first need to enter the periplasm through the Sec pathway in an unfolded state. Therefore, the pre-proteins of the T5SS proteins contain N-terminal Sec signal peptides, which are cleaved after export into the periplasm [63]. Despite the simplicity of their protein composition compared to other protein secretion systems, T5SSs show a large diversity of categories and functions [153]. At present, the known T5SSs can be divided into 5 classes, namely T5aSS through T5eSS (Fig. 6A-B) [154]. T5aSS represents the classical one-component autotransporter. T5bSS is also called the two-partner secretion (TPS) system, which is composed of two polypeptides: a secreted substrate collectively designated as TpsA and a transporter protein, TpsB, spanning the outer membrane [154], [155]. Both TpsA and TpsB pre-proteins contain an N-terminal signal peptide that is recognized by the Sec pathway. There is also a conserved TPS domain located at the N-terminus of TpsA, which is targeted to the outer membrane protein TpsB. The TpsB protein contains two periplasmic polypeptide transport-associated (POTRA) domains [154]. T5bSSs mainly transport some large toxins, such as the filamentous haemagglutinin of Bordetella pertussis [156], and the adhesins HMW1 and HMW2 of Haemophilus influenzae [157].
T5cSSs, also named trimeric autotransporters, could be the most complicated autotransporters [154]. The passenger domains of T5cSSs show a large diversity while the translocation domains are highly conserved [154]. Most T5cSS substrates are adhesins, and they are also called trimeric autotransporter adhesins (TAAs). TAAs are important virulence factors in Gram-negative bacteria, e.g., the YadA proteins of enteropathogenic Yersinia spp. [158], [159]. T5dSS is the fused two-partner system, which is also composed of a single protein and has a structure similar to a T5aSS, with a C-terminal translocation domain and an N-terminal passenger domain. The prototype of T5dSSs is PlpD, a patatin-like protein from Pseudomonas aeruginosa [160]. It appears that the passenger domain of PlpD is fused to the β-barrel domain through the POTRA domain. T5dSS also contains a periplasmic domain, which is homologous to the periplasmic domains of T5bSSs [154]. The proteins secreted through T5dSSs are mostly distributed in environmental, avirulent bacterial species [154]. T5eSSs are a group of inverted autotransporters, with domains organized in the opposite direction, i.e., passenger domains formed by the C-termini and transport channels formed by the N-termini [161]. The passenger domains of T5eSSs are mainly Ig-like and lectin-like domains not found in other groups of T5SSs [162]. In addition, there is a small periplasmic domain at the N-terminal region, which shows no homology to those of T5bSSs and T5dSSs [161].

Molecular features, computational algorithms and tools of the T5SS substrate proteins

Within the same class, the T5SS substrate proteins often show high sequence conservation in their local domains, and homology searching is the most frequently adopted approach to recognize these proteins [154], [158], [163]. BLAST (blastp) is the routine tool to find autotransporters in genome and protein databases. The more sensitive Position-Specific Iterated (PSI)-BLAST can also be used to find T5SS proteins with lower sequence similarity to known ones [153]. Celik et al. built HMM profiles to identify 1523 autotransporter proteins from numerous Chlamydiales and Fusobacteria species as well as all classes of Proteobacteria [163]. Analysis of these proteins disclosed a diversity of passenger domains besides the known proteases, adhesins and esterases [163]. Based on the conserved motifs within the β-barrel domains, the T5aSSs were clustered into 14 sequence families [163]. Zude et al. further identified new T5aSS substrates in 111 publicly available E. coli genomes with homology- and profile-based methods, and expanded the number of sequence families to 18 [164]. With the same strategies, Vo et al. identified 728 autotransporter proteins of the T5aSS AIDA-I group [165]. Most recently, Goh et al. used a similar sequence alignment-based strategy to identify four new inverse autotransporters (IATs, T5eSS substrates) from 126 finished genomes of E. coli [166].
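
The homology-screening step described above can be sketched as a filter over BLAST tabular output. The example below is a minimal sketch that assumes the standard 12-column tabular format (blastp with `-outfmt 6`); the function name `best_hits`, the E-value and identity cut-offs, and the identifiers in the usage note are illustrative assumptions, not values taken from the cited studies.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str       # protein from the genome being screened
    subject: str     # known autotransporter it matched
    identity: float  # percent identity
    evalue: float
    bitscore: float


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5,
              min_identity: float = 30.0) -> Dict[str, Hit]:
    """Keep the best-scoring hit per query that passes both cut-offs.

    The cut-offs are illustrative defaults (assumptions), not recommendations.
    """
    best: Dict[str, Hit] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column tabular row
        # Columns: qseqid sseqid pident ... evalue (11th) bitscore (12th)
        hit = Hit(fields[0], fields[1], float(fields[2]),
                  float(fields[10]), float(fields[11]))
        if hit.evalue > max_evalue or hit.identity < min_identity:
            continue
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best
```

Calling `best_hits(open("hits.tsv"))` on such a file (the file name is hypothetical) returns one candidate per query protein. Hits that pass would normally still be checked for the expected domain architecture before being annotated as T5SS substrates.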

T6SS

T6SS is also a multi-protein complex, with a phage tail-like structure oriented in the opposite direction to that used in phage infection [167]. A typical T6SS involves ~15 proteins, which assemble into a two-membrane spanning nanomachine that can translocate the substrate proteins into eukaryotic or competitor bacterial cells (Fig. 2) [167]. The T6SSs are associated with both bacterial pathogenicity and competition with non-self microorganisms [168], [169], [170], [171]. The known T6SSs are only distributed in Gram-negative bacteria, mostly Proteobacteria and Bacteroidetes [172], [173], [174]. Phylogenetic analysis of the T6SS core genes classified the T6SSs into three major classes: (1) group i (T6aSS), predominant in Proteobacteria, (2) group ii (T6bSS), represented by the Francisella Pathogenicity Island (FPI) T6SSs, and (3) group iii (T6cSS), comprising the Bacteroidetes T6SSs [173], [175], [176].

Molecular features, computational algorithms and tools of the T6SS substrate proteins

To date, only a few T6SS effectors (T6SEs) have been identified experimentally. Many of them are specialized effectors, which are VgrG and Hcp proteins fused with C-terminal effector domains [177], [178], [179]. Strategies based on sequence alignment or motif pattern searching identified a list of VgrG or Hcp C-terminal extended T6SEs, and also disclosed various effector domain-containing protein families, which are called ‘cargo’ effectors [180], [181], [182], [183], [184]. The cargo effectors can bind the inner surface of the Hcp tube or interact with the VgrG spike or PAAR repeat-containing proteins [185], [186]. Salomon et al. identified a conserved MIX (marker for type six effectors) motif in the N-termini of a group of independent, effector domain-containing T6SEs (Fig. 5) [187]. By searching for the motif, a number of new potential T6SEs were identified [187], [188]. However, experiments have not been performed to examine the function of the motif. There are also many non-MIX effectors [188], [189]. Bastion6 is the first machine-learning based T6SE predictor [190]. It extracted a large number of features from a very limited number of homology-filtered T6SEs, including sequence profile, evolutionary information and physicochemical properties, and trained a two-layer hierarchical model [190]. Bastion6 is also restricted to processing fewer than 500 sequences per job, each with a length between 50 and 5000 amino acids. To overcome this inconvenience, Sen et al. proposed a new tool, PyPredT6 [191]. PyPredT6 also used a broadened positive training dataset, considered both amino acid and nucleotide based sequence features, and adopted 5 different machine learning algorithms to find the consensus predictions [191]. There are also other comprehensive tools or algorithms, which can predict T3SEs, T4SEs and T6SEs simultaneously. 
For example, Tbooster contains three ensemble models integrating the different machine learning methods or algorithms developed by others to predict T3SEs, T4SEs and T6SEs respectively [118], while Orgsissec encodes and uses phylogenetic profile features to predict T3SEs, T4SEs and T6SEs [115]. Generally, we have very limited knowledge about the sequence features of T6SEs, the number of validated effectors is also limited, and there are not many software tools to choose from for T6SE prediction at present. Experiments, thorough feature analysis and new algorithms are all urgently required to facilitate identification of more T6SEs.
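
The submission limits mentioned above for Bastion6 (fewer than 500 sequences per job, 50 to 5000 amino acids each) can be handled by splitting a proteome into compliant jobs before upload. The following is a minimal sketch: the helper names `parse_fasta` and `make_batches` are hypothetical, and treating the length bounds as inclusive is an assumption rather than something the tool's documentation is quoted on here.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (identifier, amino acid sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (identifier, sequence) pairs."""
    records: List[Record] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def make_batches(records: List[Record], max_per_job: int = 499,
                 min_len: int = 50, max_len: int = 5000):
    """Split records into jobs of at most max_per_job sequences.

    Sequences outside [min_len, max_len] are returned separately so that
    they can be reported instead of being dropped silently.
    """
    accepted = [r for r in records if min_len <= len(r[1]) <= max_len]
    skipped = [r for r in records if not (min_len <= len(r[1]) <= max_len)]
    batches = [accepted[i:i + max_per_job]
               for i in range(0, len(accepted), max_per_job)]
    return batches, skipped
```

Each returned batch can then be written to its own FASTA file and submitted as a separate job.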

T7SS - Chaperone-Usher (CU) fimbriae, T8SS – curli, and other pili secretion systems

T7SSs have been widely recognized as the ESAT-6 secretion systems (ESXs) distributed in Gram-positive bacteria, especially Mycobacteria [192], [193], [194], [195]. However, because the numerical categorization was originally used for the protein secretion systems in Gram-negative bacteria, the Chaperone-Usher (CU) pathway, which was considered an independent protein secretion system, was suggested to be named T7SS [2], [196]. In this review, we continue to use the naming scheme suggested by Desvaux et al. Pili, also named fimbriae, are a family of extracellular polymers attached to the bacterial outer membrane as non-flagellar protein appendages. They have multiple functions such as adhesion, invasion, motility, biofilm formation and transmembrane transport of DNA and proteins [197], [198]. These appendages can be divided into 5 major classes according to their biosynthesis pathways: (1) Chaperone-usher (CU) pili (both the P and type 1 pili), and the alternative chaperone (AC) pili (such as the CS1 fimbriae and CFA/I fimbriae), (2) curli, (3) T4P, (4) the type III secretion needle pili (T3SP), and (5) type IV secretion pili (T4SP) including F-pili and T-pili [197]. Functionally, CU and AC pili, curli and T4P can help pathogens recognize, adhere to and even invade target cells, but seldom transport substrates other than the pilin proteins themselves, while the T3SP only serves as a component of the transporter device and the T4SP can function in both ways [197]. The CU pathway, also named T7SS, assembles a ubiquitous protein appendage attached to the bacterial cell surface [199]. The system has a simple structure, involving two assembly proteins: a specific periplasmic chaperone and an outer membrane assembly platform also called the usher (Fig. 2; Fig. 7) [198]. In its general sense, the CU pathway also includes the AC pili, for which the chaperone is not specific. 
All the structural proteins of the CU pathway contain typical N-terminal signal peptides, which are recognized by the Sec pathway for export across the inner membrane. The structure and protein transport mechanisms of the CU pathway show similarity to T5SS, but the substrates of the CU pathway fold in the periplasmic space before being transported through the outer membrane [200]. The CU system has 6 phylogenetic clades: α-, β-, γ- (subdivided into γ1, γ2, γ3 and γ4 sub-clades), κ-, π- and σ-fimbriae. The members of each phylogenetic clade share a common operon structure, which encodes fimbrial subunits with similar protein domains [201]. The α clade is exemplified by CS1 fimbriae and secreted through the AC pathway. The type P pili and type 1 pili belong to the π and γ1 clades, respectively. The κ clade is mainly represented by the K88 (F4) fimbriae, while the σ-fimbriae refer in particular to the spore coat protein U from Myxococcus xanthus [197], [202], [203]. The β-clade fimbriae remain conceptual and are inferred from sequences, as their expression or assembly has not been observed in any bacteria [197], [203].

Fig. 7

Sequence features and the transport of the T7SS substrates. Pilus subunits contain SPs in the N-termini. The proteins are taken up by their cognate chaperones within the periplasm, and a donor strand complementation (DSC) reaction occurs, by which a motif of four alternating hydrophobic residues (termed P1 to P4) on the chaperone G1 strand is inserted into a hydrophobic groove (known as the P1 to P4 pockets) of the pilus subunit so that correct folding of the pilus subunit is catalyzed. CU pilus subunits also contain a 10–20 residue-long N-terminal extension (Nte) peptide that is conserved in sequence. During CU pilus subunit polymerization, the complementing G1 strand donated by the chaperone is replaced by the Nte on the subunit of the incoming chaperone–subunit complex. The assembly reaction is termed donor strand exchange (DSE). After DSE, the P2 to P5 pockets of the subunit groove are occupied by the hydrophobic residues (termed P2–P5) of the incoming subunit Nte. The P4 Gly residue in Nte sequences is strictly conserved.

Curli, a kind of functional amyloid fiber, are the main protein component of the complex extracellular matrix of many enteric bacteria including E. coli and Salmonella species (Fig. 2) [204], [205]. As a type of secretion apparatus, the curli system is also called the extracellular nucleation-precipitation (ENP) pathway [206] or T8SS [2]. Curli fibers are linear, noncovalent polymers composed of the major and minor subunits CsgA and CsgB, respectively. These subunit proteins are transported through the ENP pathway in an unfolded state with the assistance of the accessory proteins CsgE, CsgF and CsgG [207], [208]. Curli are implicated in surface adhesion, cell aggregation, biofilm formation, infection and host inflammation [207]. Different from the CU pili and curli, the T4P system is independent of the Sec pathway but requires the assembly of a two-membrane spanning, ATP-powered transporter apparatus. 
The pilin contains an unusual N-methylated amino terminus, a conserved hydrophobic N-terminal region composed of 25 residues, and a C-terminal disulphide bond [209]. T4P are long (1–4 μm), strong, flexible and dynamic filaments, which are formed and disassembled quickly by polymerization and depolymerization of the pilin subunits, respectively [210]. T4P shows large structural similarity to T2SS [209], and was considered a subtype of T2SS, i.e., T2bSS [2]. T4P contains two subtypes, type IVa (T4aP) and type IVb (T4bP). T4aP is distributed in various Gram-negative bacteria, while T4bP has only been reported in human intestinal bacteria [209]. Recently, McCallum et al. introduced the ‘T4P-like system’ to describe the T4P and similar systems that share their structure and transport mechanisms, and classified them into 5 subtypes: T4aP, T2aSS, T4bP, Tad/Flp pili (T2cSS), Com pili, and archaeal flagellum (archaellum) [211]. T3SP is the core component of the T3SS apparatus. The pilus is a short, stiff filament (animal pathogens or symbionts) or a long, flexible pilus (plant pathogens or symbionts). The distal polar structure allows bacteria to reach the plasma membrane of target cells [212]. The pilin subunits of T3SP are also transported by T3SSs and have the sequence features of T3SEs [212], [213]. Similar to T3SP, T4SP is the core component of the T4SS apparatus. T4SP plays dual roles by serving as the conduit for substrates other than T4SP pilin proteins, and functioning in adhesion of bacteria to target cells [214], [215]. T4SP was divided into two subtypes: IncF-like pili (conjugative pili produced by Inc-F, -H, -T and -J systems) and IncP-like pili (conjugative pili produced by Inc-P, -N, -W systems) [197], [216]. Dependent on T4aSSs and composed of VirB-like components, the IncP-like pili are short (<1 μm) rigid rods 8–12 nm in diameter. 
The IncF-like pili, in contrast, are long (2–20 μm) and flexible appendages, which depend on T4bSSs and are composed by both VirB-like components and other proteins not present in IncP-like pili [197]. As protein secretion systems, except the T3SP and T4SP, the other fimbriae pathways have only been reported to transport the pilin subunit proteins themselves. Sequence alignment based strategies can identify these fimbriae systems and the pilin proteins.

T9SS – PorSS

T9SS, also known as PorSS, is a protein transport system specifically deployed by the Fibrobacteres–Chlorobi–Bacteroidetes (FCB) superphylum, which also serves as an important pathogenic factor in severe periodontal diseases [217], [218]. T9SS is a two-membrane spanning protein secretion system (Fig. 2). The protein-conducting translocon SprA (also named SOV) located in bacterial outer membrane is the core component of T9SS. SprA forms a large (36-strand) single polypeptide transmembrane β-barrel in bacterial outer membrane [218]. T9SS substrates are exported through bacterial plasma membrane via Sec pathway, folded in periplasm and then targeted to the T9SS translocon [219]. The substrates are large (100–650 kDa) multi-domain proteins, containing N-terminal Sec signal peptides and C-terminal folded domains (CTDs) composed by ~100 amino acids where T9SS targeting signals are located [218], [219], [220]. The CTDs have been proven to play essential roles in secretion, modification, and attachment of the substrates to cell surface [220]. The signal patterns of T9SS substrates have not been fully studied, and their prediction is mainly homology searching based [220]. By building HMM profiles for three conserved sequence motifs in CTDs and screening in 21 completely sequenced genomes of Bacteroidetes phylum, Veith et al predicted 663 CTD-containing proteins [221]. These proteins function as proteases, glycosidases, motility adhesins, hemagglutinins and internalins [221].

Non-classical and novel protein transporting systems

There are also some non-classical secreted proteins that do not have signal sequences and are not secreted through the known secretion systems [222]. Besides the non-classical pathways described before, by which proteins could be exported from the cytoplasm to the periplasm, there are also other transporting systems that can mediate the transport of proteins into the extracellular space. ClyA is an example of a protein secreted through non-classical pathways in Gram-negative bacteria [222]. ClyA is a pore-forming protein encoded by E. coli and other intestinal bacteria, which is toxic to mammalian cells. ClyA does not bear an N-terminal signal peptide but can be released from the outer membrane of E. coli. The protein is likely to be secreted to the extracellular milieu in outer-membrane vesicles (OMVs) [223]. Novel protein transporting systems remain to be disclosed [224]. Most recently, the extracellular contractile injection system (eCIS) was recognized as an independent protein translocation system [225], [226]. To be precise, the eCISs are not protein secretion systems, since they only mediate the translocation of secreted proteins into host cells. Structurally, eCISs resemble headless bacteriophages and share evolutionarily related proteins such as the tail tube, sheath, and baseplate complex [225]. Three sets of eCISs were independently identified previously, including the anti-feeding prophages (AFPs) in Serratia entomophila [227], the Photorhabdus virulence cassettes (PVCs) [228] and the metamorphosis-associated contractile (MAC) structure identified in Pseudoalteromonas luteoviolacea [229]. Using these three verified eCISs and the phage-like-protein translocation structures (PLTSs) screened by BLAST [230], Chen et al. built the core protein HMM profiles and updated them iteratively, and detected 631 eCIS-like loci from 11,699 publicly available complete bacterial genomes [226]. The eCISs are distributed among Gram-negative bacteria, Gram-positive bacteria and archaea. 
They are phylogenetically diverse and form six clusters [226]. Both eCISs and T6SSs are CISs, which encode proteins homologous to the phage contractile tails and deliver effectors to mediate bacterial-host interactions. However, eCISs clearly differ from T6SSs in the mode of action: the eCIS devices are released into the extracellular space, bind to target host cells and deliver substrates into them [225], [229], [231]. Only a few eCIS effectors have been identified, while the possible sequence signal features and their modes of action remain unknown [232], [233], [234]. Holin-like protein secretion systems (also named Type 10 Secretion Systems, TXSSs) were also reported in Gram-negative bacteria, mediating the transport of specific proteins from the periplasm into the extracellular milieu [42]. These systems are different from T2SSs, and the two types of secretion systems recognize different substrates. The mechanisms of substrate recognition and secretion of TXSSs remain unclear, and only a few substrates have been identified as being secreted through this pathway.

Prediction of transmembrane proteins

Transmembrane proteins (TMPs) participate in molecular recognition, signal transduction and transmembrane transport, playing roles in various diseases [235]. They are also important molecular targets for many commercial drugs [236], [237]. TMPs pass through cell membrane either with transmembrane α-helices (TMHs) exclusively or with β-barrels formed by transmembrane β-strands [235], [294]. TMH proteins constitute 20–30% of the proteome of most organisms [235], [238], [294]. β-barrel proteins are mainly distributed in gram-negative and acid-fast bacteria, chloroplasts and mitochondria [239]. In Gram-negative bacteria, TMH proteins are mainly located in the inner membrane while β-barrel proteins are distributed in outer membrane [295], [239], [240], [241].

Features and prediction of TMH proteins

TMH proteins show a segment-wise bias in hydrophobicity and electric charge [242]. Peptide fragments on the cytoplasmic side of the membrane are often enriched with positively charged residues [243]. A number of software tools predicting TMHs and the TM topology have been developed based on these two features [244], [245] (Table 5). Other features were also observed and applied, including the length of the helix and the grammar constraints of cytoplasmic and non-cytoplasmic loops [244]. The tools can be broadly classified into physicochemical property and statistics based methods and machine learning methods; the latter depend on a sufficiently large set of validated proteins and were therefore developed later than the former. Most of the tools can predict both TMHs (or TMH proteins) and the TM topology (Table 5).
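
The hydrophobicity feature can be illustrated with a sliding-window hydropathy scan. The sketch below uses the Kyte-Doolittle hydropathy index with a plain rectangular window; the window length of 19 residues, the threshold of 1.6 and the function names are assumptions chosen for illustration. It is a simplified stand-in and does not reproduce the algorithm of any specific tool listed in Table 5.

```python
from typing import List, Tuple

# Kyte-Doolittle hydropathy index per amino acid.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> List[float]:
    """Mean hydropathy of every full window; unknown residues score 0."""
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_tm_segments(seq: str, window: int = 19,
                          threshold: float = 1.6) -> List[Tuple[int, int]]:
    """Return 0-based, half-open spans of window centres above threshold.

    Adjacent centres are merged, so each span marks the core of one
    candidate hydrophobic segment.
    """
    half = window // 2
    segments: List[Tuple[int, int]] = []
    start = None
    profile = hydropathy_profile(seq, window)
    for i, mean in enumerate(profile):
        centre = i + half
        if mean >= threshold and start is None:
            start = centre
        elif mean < threshold and start is not None:
            segments.append((start, centre))
            start = None
    if start is not None:
        segments.append((start, len(profile) + half))
    return segments
```

Because the spans are built from window centres, they mark the core of a hydrophobic stretch and tend to be shorter than the true helix; dedicated predictors refine the boundaries with additional rules.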

Table 5

Representative software tools predicting TMHs.

Tool	Method	Target	URL or reference
TOPPred2	Physicochemical property and statistics based	TMH; TM topology	http://bioweb.pasteur.fr/seqanal/interfaces/toppred.html; [246]
SOSUI	Physicochemical property and statistics based	TMH	http://www.tuat.ac.jp/mitaku/sosui/; [247]
SCAMPI	Physicochemical property and statistics based	TMH; TM topology	http://topcons.cbr.su.se/; [248]
PHDhtm	ANN	TMH	[249]
MEMSAT3	ANN	TMH; TM topology; Sec/SPI	http://bioinf.cs.ucl.ac.uk/memsat/; [26]
SPOCTOPUS	NN + HMM	TMH; TM topology; Sec/SPI	http://octopus.cbr.su.se/; [25]
SOMPNN	PNN	TMH	http://www.csbio.sjtu.edu.cn/bioinf/SOMPNN/; [250]
TMSEG	NN + RF	TMH; TM topology	www.predictprotein.org; [251]
TMHMM 2.0	HMM	TMH; TM topology	https://services.healthtech.dtu.dk/service.php?TMHMM; [244]
HMMTOP2	HMM	TMH; TM topology	http://www.enzim.hu/hmmtop; [252]
Phobius/ PolyPhobius	HMM	TMH; TM topology; Sec/SPI	http://phobius.sbc.su.se/; [22]
Philius	DBN	TMH; TM topology; Sec/SPI	http://www.yeastrc.org/philius/; [23]
MEMSAT-SVM	SVM	TMH; TM topology; Sec/SPI; Re-entrant helix	http://bioinf.cs.ucl.ac.uk/psipred/; [27]
MemBrain	OET-KNN	TMH; Sec/SPI	http://www.csbio.sjtu.edu.cn/bioinf/MemBrain/; [253], [254]

TOPPred2, SOSUI and SCAMPI are representative physicochemical property based models predicting TMH proteins. TOPPred2 used a trapezoid sliding window and a hydrophobicity scale to predict TM fragments, followed by seeking the best topology according to the ‘positive-inside’ charge bias rule [246]. SOSUI improved the TMH prediction performance by introducing 4 physicochemical parameters - the hydropathy index of Kyte and Doolittle, the amphiphilicity index of polar side chains, the index of amino acid charges, and the length of each sequence - to classify TM and soluble proteins and to predict the topology of TMH proteins [247]. SCAMPI adopted a position-specific membrane-insertion propensity scale and the ‘positive-inside’ rule to predict the topology of TMHs, reaching a performance comparable to the best-performing machine learning tools at the time [248]. NN and HMM are the models most frequently used to train machine learning predictors of TMH fragments, TMH proteins or their topology. PHDhtm [249], SPOCTOPUS [25] and MEMSAT3 [26] are the representative NN models. PHDhtm used the phylogenetic information derived from multiple sequence alignments and amino acid composition features to predict the locations of TMH fragments in the TMPs [249]. Both SPOCTOPUS and MEMSAT3 integrated an SP prediction step to better distinguish TMH fragments [25], [26]. Besides TMH recognition, they can also predict the TM topology of TMPs as well as Sec/SPIs [25], [26]. There are also other novel NN models developed recently. Yu et al. proposed the SOMPNN model, combining a self-organizing map (SOM) with a probabilistic neural network (PNN) [250]. SOMPNN used the SOM to adaptively learn the helix distribution hidden in the training datasets, and predicted TMH fragments with the PNN. The model showed the advantages of high computational efficiency and few prior assumptions about parameters [250]. 
Another method, TMSEG, integrated multiple models, including NN, RF and empirical filters, to identify TMPs and predict TMH fragments and the topology accurately [251]. The HMM models are better suited to modelling the transitions between TM and non-TM states and are therefore also widely used for TM topology prediction. A list of HMM models have been developed, such as TMHMM [244], HMMTOP [252], Phobius [21] and PolyPhobius [22]. TMHMM trained models for each region of TMPs, including helix caps, middle of helix, regions close to the membrane and globular domains [244]. HMMTOP depends on the difference in amino acid distribution among structural components of proteins, and version 2 allows users to submit other location-related information on the fragments to improve the prediction accuracy [252]. Both TMHMM and HMMTOP have been widely used for TMH and TM topology prediction, yet neither of them distinguishes SPs. Phobius and PolyPhobius solved this problem and provide a module to classify TMH fragments and SPs, as was also described in Section 2.1 [21], [22]. Compared to Phobius, PolyPhobius incorporated information from homologs, and the prediction performance was improved substantially [22]. Other machine-learning algorithms have also been used to predict TMH proteins, such as the Dynamic Bayesian Network (DBN) based model Philius [23], the SVM model MEMSAT-SVM [27], and the optimized evidence-theoretic K-nearest neighbor (OET-KNN) model MemBrain [253], [254]. Philius, MEMSAT-SVM and MemBrain can also distinguish the TMHs from SPs [23], [27], [253], [254].
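
The ‘positive-inside’ rule used by several of these tools can be sketched as a comparison of the Lys/Arg content of alternating loops. The example below is a minimal sketch under stated assumptions: TM segments are supplied as 0-based, half-open, non-overlapping spans, every loop residue is counted with equal weight, and the function names are hypothetical. Published tools apply more refined scoring, for example restrictions on loop length.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 0-based, half-open


def loops(seq_len: int, tm_segments: List[Span]) -> List[Span]:
    """Spans of the N-terminal tail, inter-helix loops and C-terminal tail."""
    result: List[Span] = []
    previous_end = 0
    for start, end in sorted(tm_segments):
        result.append((previous_end, start))
        previous_end = end
    result.append((previous_end, seq_len))
    return result


def n_terminus_side(seq: str, tm_segments: List[Span]) -> str:
    """Apply the positive-inside rule to orient the N-terminus.

    Loops alternate sides of the membrane, so even-numbered loops lie on
    the N-terminal side and odd-numbered loops on the opposite side. The
    side richer in Lys/Arg is taken to be cytoplasmic ("in").
    """
    positives = [sum(seq[s:e].upper().count(aa) for aa in "KR")
                 for s, e in loops(len(seq), tm_segments)]
    same_side = sum(positives[0::2])
    other_side = sum(positives[1::2])
    if same_side > other_side:
        return "in"
    if other_side > same_side:
        return "out"
    return "undetermined"
```

For a protein with two helices and Lys/Arg-rich termini, the function returns "in" for the N-terminus, matching the expectation that both tails face the cytoplasm.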

Features and prediction of β-barrel proteins

It is difficult to identify β-barrel TMPs experimentally, while traditional TM prediction methods can also hardly predict β-barrel TMPs due to their smaller size of TM regions than TMHs [255], [256], [257]. However, there are still a number of software tools that have been developed to predict these proteins, using physicochemical property analysis, statistic measures or machine learning algorithms (Table 6).

Table 6

Representative software tools predicting β-barrel OMPs.

Tool	Method	Target	URL or reference
Neuwald’s	Motif searching	TMβ-strand	[261]
Gromiha’s	Physicochemical property, structure and statistics based	TM β-strand	[262]
BBF	Physicochemical property, structure and statistics based	β-barrel OMP	[263]
BOMP	Physicochemical property, structure and statistics based	β-barrel OMP	http://www.bioinfo.no/tools/bomp; [264]
transFold	Physicochemical property, structure and statistics based	TM β-barrel; residue side-chain orientations; inter-strand residue contact; strand inclination	http://bioinformatics.bc.edu/clotelab/transFold; [265]
HHomp	Sequence similarity searching	β-barrel OMP	http://toolkit.tuebingen.mpg.de/hhomp; [266]
Freeman-Wimley	Physicochemical property, structure and statistics based	TM β-barrel	http://www.tulane.edu/~biochem/WW/apps.html; [256]
OM-TOPO-PREDICT	NN	TM β-strand; OMP topology	http://strucbio.biologie.unikonstanz.de/-kay/om-topo-predict.html (Page not found); [267]
B2TMPRED	NN	TM β-strand; OMP topology	http://www.biocomp.unibo.it; [268]
TMBETA-NET	NN	TM β-strand	http://psfs.cbrc.jp/tmbeta-net/; [269]
TMBpro	NN	TM β-barrel; secondary structure; β-contacts; tertiary structure	http://www.igb.uci.edu/servers/psss.html; [270]
TBBPred	NN + SVM	TM β-barrel	http://www.imtech.res.in/raghava/tbbpred; [270]
TMbeta-SVM	SVM (sequential Aac + residue pairs)	β-barrel OMP	http://tmbeta-svm.cbrc.jp; [276]
PredβTM	SVM (position-specific Aac + residue pairs)	TM β-strand	http://transpred.ki.si/; [277]
BOCTOPUS/ BOCTOPUS2	SVM; HMM	OMP topology	http://boctopus.cbr.su.se/; [284], [285]
HMMB2TMR	HMM	OMP topology	[273]
PROFtmb	HMM (beta-hairpin motifs)	β-barrel OMP; non-β-barrel OMP	http://www.rostlab.org/services/PROFtmb/; [274]
PRED-TMBB	HMM	OMP; soluble protein	http://bioinformatics.biol.uoa.gr/PRED-TMBB; [275]
ConBBPRED	Consensus approaches	β-strand; OMP topology	http://bioinformatics.biol.uoa.gr/ConBBPRED; [286]
TMB-Hunt	k-NN	β-barrel TMP; non-β-barrel TMP	http://www.bioinformatics.leeds.ac.uk/betaBarrel; [278]
IDQD	Quadratic discriminant analysis	β-barrel TMP; TMH; global protein	[280]
TMBETADISC-RBF	Radial Basis Function (RBF) Networks	OMP	http://rbf.bioinfo.tw/~sachen/OMP.html; [281]
GRHCRF	Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs)	OMP	http://www.biocomp.unibo.it/~savojard/biocrf-0.9.tar.gz; [282]
BetAware	N-to-1 Extreme Learning	β-barrel TMP	http://betaware.biocomp.unibo.it/BetAware; [255]
BETAWARE	N-to-1 network encoding and ELM training algorithm	β-barrel TMP	http://www.biocomp.unibo.it/~savojard/betawarecl; [283]
MemBrain-TMB	Statistical machine learning	β-barrel TMP	www.csbio.sjtu.edu.cn/bioinf/MemBrain-TMB; [257]
Koehler's	NN	β-barrel TMP; TMH	[272]

The amphipathicity of TM strands, i.e., the alternating pattern of hydrophobic and hydrophilic residues, was first used for prediction of β-barrel TMPs [257], [258], [259], [260]. Other statistical features were also adopted. For example, Neuwald et al. proposed a new Gibbs-sampling algorithm to detect the repeated motif features buried in β-strands of bacterial outer membrane proteins (OMPs) [261]. Gromiha et al. predicted the TM β-strands of the bacterial porin family by statistical analysis of amino acid bias and prior knowledge of protein structural properties [262]. Zhai et al. used multiple statistics-based features including secondary structure, hydrophilicity, amphipathicity and N-terminal target sequence patterns to develop a program named BBF, with which β-barrel TMPs were detected from the E. coli genome [263]. BOMP is based on the C-terminal motif features of β-barrel TM proteins and the typical amino acid properties of TM β-strands, and predicts the β-barrel OMPs in Gram-negative bacterial genomes [264]. The transFold web server described all potential conformations based on multi-tape S-attribute grammars, and then used a dynamic programming algorithm to predict the structure and residue contacts of TM β-barrels [265]. HHomp is based on the finding that all TM β-barrels are homologous to each other, and therefore used a database of profile HMMs to identify new β-barrel OMPs through more sensitive profile-profile alignments [266]. Freeman and Wimley also proposed a method to predict genes encoding β-barrel TMPs from genome databases by analyzing the physicochemical properties of the proteins [256]. Machine-learning algorithms have also been widely applied in prediction of β-barrel TMPs (Table 6). NN and HMM are most frequently adopted, though other models such as SVM and k-NN are also used. As one of the earliest applications, Diederichs et al. trained an NN model to predict the topology of β-strand OMPs [267]. 
B2TMPRED [268], TMBETA-NET [269], TBBPred [270] and TMBpro [270] are also NN-based methods. B2TMPRED considered phylogenetic information [268], TMBETA-NET introduced the concept of “residue probability” [269], while TBBPred trained both NN and SVM models and combined them to predict the β-barrel regions in proteins [270]. The TMBpro suite includes three modules, TMBpro-SS, TMBpro-CON and TMBpro-3D, which predict the secondary structure, β-contacts and tertiary structure of β-barrel TMPs, respectively [271]. Koehler et al. proposed an ANN based method, which can predict TM β-strands and TMHs simultaneously [272]. The HMM models are represented by HMMB2TMR [273], PROFtmb [274] and PRED-TMBB [275]. There are also tools based on other algorithms, e.g., the SVM based TMbeta-SVM [276] and PredβTM [277], the k-NN based TMB-Hunt [278] and OMP-kNN [279], and others like IDQD [280], TMBETADISC-RBF [281], GRHCRFs [282], BetAware [255], BETAWARE [283] and MemBrain-TMB [257]. The links or references for these tools and brief descriptions are shown in Table 6. It is noteworthy that some tools combined multiple models to improve the prediction performance for β-barrel TM proteins, e.g., TMBETA-NET described above [270], BOCTOPUS [284], BOCTOPUS2 [285] and ConBBPRED [286]. BOCTOPUS and BOCTOPUS2 trained SVM models to predict the location of each residue and to detect the likely TM β-strands, followed by building HMM models to analyze the global topology of the β-barrel OMPs [284], [285]. ConBBPRED is a consensus prediction method integrating the results of 9 NN, HMM or SVM models, which can predict the β-strands and the topology of β-barrel OMPs with higher accuracy than individual models [286].
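
The idea behind consensus prediction can be shown with a per-residue majority vote over the outputs of several predictors. This is a minimal sketch of plain voting, not the ConBBPRED algorithm itself; the label alphabet (`M` for membrane strand, `i` and `o` for the two loop sides), the tie-breaking rule and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def consensus_topology(predictions: List[str]) -> str:
    """Majority vote per residue over equally long label strings.

    Ties are resolved in favour of the label given by the earliest
    predictor in the list, so listing predictors by trust matters.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        top = max(counts.values())
        # The first label in predictor order that reaches the top count wins.
        result.append(next(label for label in column if counts[label] == top))
    return "".join(result)
```

A plain vote can produce strings that are not valid topologies, for example a strand that is too short or two consecutive loops on opposite sides; a real consensus method would add a step that enforces a consistent topology.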

Integrated prediction pipelines and other applications

There are also some tools designed to predict protein subcellular localization, e.g., PSORTb, PSSM-S and FUEL-mLoc, which can also predict the extracellular (secreted) proteins and TMPs, but without specific secretion pathway information [287], [288], [293]. PREFFECTOR is another representative tool, which predicts secreted proteins without regard to the specific pathway [289]. It combined effector proteins secreted through T1-6SSs of Gram-negative bacteria and trained models to distinguish general effectors from non-effectors regardless of the secretion system [289]. Because specific predictors require prior knowledge about the specific secretion pathways or secreted proteins, PREFFECTOR would have an advantage in finding novel effectors secreted by unknown mechanisms. Other integrated prediction pipelines for secreted proteins include the ones predicting T3SEs, T4SEs and T6SEs (e.g., Tbooster and Orgsissec) and those predicting SPs/TMH fragments (e.g., Phobius, Philius and SPOCTOPUS) or TMHs/TMBs (e.g., Koehler’s method) simultaneously, as described before. Besides the secreted proteins, tools have also been developed to predict the secretion devices. For example, SSPred can recognize T1-4SSs and Sec pathways [290]. T346Hunter can find T3SS, T4SS and T6SS gene clusters in bacterial genomes [291]. TXSScan can predict T1-6SSs, T9SSs, flagellar T3SSs, T4P and Tad fimbriae systems [292]. There are also tools predicting individual secretion systems, which will not be discussed in this review.

Summary and perspectives

In this review, we summarized protein secretion systems in Gram-negative bacteria and the bioinformatic tools for predicting the proteins they secrete. Precise prediction and classification of the secreted proteins is important for both bacterial genome annotation and exploration of the molecular mechanisms of bacterial virulence, drug resistance and other important biological phenotypes. A large number of algorithms and tools have been developed. However, there is still a long way to go. First of all, there is often a gap between computational scientists and experimental biologists. Despite the high accuracy of software tools demonstrated by the developers, the non-homology based effector predictors (especially for T3SEs, T4SEs and T6SEs) have seldom been successfully applied by wet-lab researchers to identify novel effectors. More enthusiasm has been devoted to new algorithms than to the biological side, e.g., new features. Most effector prediction tools are general and consider no specific biological prior information, such as species, secretion system subtypes and regulation conduit specificity. As for the software tools themselves, few groups collected, annotated and filtered the training proteins manually and carefully. The size and distribution of the negative protein dataset were not optimized either. However, these aspects are important for the development of a practically useful prediction model. Secondly, our current knowledge of protein secretion systems and the secretion mechanisms remains limited. There are new protein secretion systems that remain to be identified. Except for a few pathways, the secretion mechanisms are not fully clear for the majority of the known protein secretion systems. Only a few secreted proteins have been experimentally validated for many newly disclosed secretion systems, which makes it very difficult to identify common features reliably.
There are still challenges and room for improvement for the current algorithms and tools. For example, too many tools have been developed for some protein secretion systems, e.g., T3SS and T4SS, which makes it difficult for experimental biologists to select the most appropriate tool. For other systems, e.g., T1SS and T2SS, there is still a lack of tools. An integrated pipeline is also urgently needed for comprehensive annotation of different types of secreted proteins. Moreover, most of the current software tools were designed for individual bacterial strains and individual genome-derived proteomes. New algorithms, databases and tools are desired to facilitate evaluation of the secretome of microbiota with metagenomic, metatranscriptomic or metaproteomic data [300], [301], [302].

All the authors certify that they have seen and approved the final version of the manuscript being submitted. They warrant that the article is the authors' original work, has not received prior publication and is not under consideration for publication elsewhere.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
