Georgia Orfanoudaki, Maria Markaki, Katerina Chatzi, Ioannis Tsamardinos, Anastassios Economou.
Abstract
More than a third of the cellular proteome is non-cytoplasmic. Most secretory proteins use the Sec system for export and are targeted to membranes using signal peptides and mature domains. To specifically analyze bacterial mature domain features, we developed MatureP, a classifier that predicts secretory sequences through features exclusively computed from their mature domains. MatureP was trained using Just Add Data Bio, an automated machine learning tool. Mature domains are predicted efficiently with ~92% success, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC). Predictions were validated using experimental datasets of mutated secretory proteins. The features selected by MatureP reveal prominent differences in amino acid content between secreted and cytoplasmic proteins. Amino-terminal mature domain sequences have enhanced disorder, more hydroxyl and polar residues and fewer hydrophobic residues. Cytoplasmic proteins have prominent amino-terminal hydrophobic stretches and charged regions downstream. Presumably, secretory mature domains comprise a distinct protein class. They balance properties that promote the flexibility required to maintain non-folded states during targeting and secretion with the ability to fold after secretion. These findings provide novel insight into protein trafficking, sorting and folding mechanisms and may benefit protein secretion biotechnology.
Year: 2017 PMID: 28607462 PMCID: PMC5468347 DOI: 10.1038/s41598-017-03557-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Bioinformatics pipeline of data analysis. Summary workflow of the machine learning process for the separation of secretory from cytoplasmic sequences. First, secretory and cytoplasmic proteins from the E. coli K-12 proteome were collected based on the subcellular annotation in STEPdb[1] (Table S1). In total, 2365 cytoplasmic and 505 secretory proteins from eight sub-classes of the cell envelope were defined in STEPdb[1] (Table S1a). 20% of the dataset (test set), along with 120 mutated preproteins collected from the literature, were left outside the training process and used later for validation. Raw data (sequences) were first processed and transformed into nine groups of training features (e.g. binary representation of amino acids, cPseAAC). The MatureP model was trained using all data and merging all training features. Data processing pipeline: the sample set is partitioned into K folds. For each configuration (a combination of algorithms and values of their hyper-parameters) and each excluded fold, a model is produced. The average performance of each configuration is then estimated and the optimal one is selected. Subsequently, the final model is trained on all the data using the best configuration. Next, the bias of the performance estimate of the final model is computed using a bootstrap-based procedure; the pipeline returns the bias-corrected performance and the final model.
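The data processing pipeline in the Figure 1 caption (K-fold evaluation of every configuration, selection of the best one, bootstrap correction of the resulting estimate) can be sketched as follows. This is a minimal illustration, not the Just Add Data Bio implementation: the toy 1-D data, the k-nearest-neighbour learner, the use of accuracy instead of AUC, and the specific out-of-bag correction scheme are all assumptions made for the example, and every function name is hypothetical.

```python
import random

def make_toy_data(n, seed=0):
    """Hypothetical 1-D data: class 1 is centred at 1.0, class 0 at 0.0."""
    rng = random.Random(seed)
    return [(rng.gauss(float(i % 2), 0.6), i % 2) for i in range(n)]

def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Toy learner: majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def cross_validate(data, configs, n_folds=5, seed=0):
    """Out-of-fold predictions preds[c][i] for each configuration c and sample i."""
    folds = k_folds(len(data), n_folds, random.Random(seed))
    preds = {c: [None] * len(data) for c in configs}
    for held_out in folds:
        held = set(held_out)
        train = [p for i, p in enumerate(data) if i not in held]
        for c in configs:  # here a "configuration" is just a value of k
            for i in held_out:
                preds[c][i] = knn_predict(train, data[i][0], c)
    return preds

def accuracy(preds, labels, idx):
    """Fraction of correct predictions over the sample indices in idx."""
    return sum(preds[i] == labels[i] for i in idx) / len(idx)

def select_and_correct(data, configs, n_boot=200, seed=0):
    """Pick the best configuration and estimate its performance two ways."""
    labels = [y for _, y in data]
    preds = cross_validate(data, configs, seed=seed)
    everyone = list(range(len(data)))
    best = max(configs, key=lambda c: accuracy(preds[c], labels, everyone))
    naive = accuracy(preds[best], labels, everyone)  # optimistic: chosen post hoc
    # Assumed correction: repeat the selection on bootstrap samples of the
    # out-of-fold predictions and score the winner on the left-out samples.
    rng = random.Random(seed + 1)
    oob_scores = []
    for _ in range(n_boot):
        in_bag = [rng.randrange(len(data)) for _ in everyone]
        out_bag = sorted(set(everyone) - set(in_bag))
        if not out_bag:
            continue
        chosen = max(configs, key=lambda c: accuracy(preds[c], labels, in_bag))
        oob_scores.append(accuracy(preds[chosen], labels, out_bag))
    corrected = sum(oob_scores) / len(oob_scores)
    return best, naive, corrected

if __name__ == "__main__":
    data = make_toy_data(100)
    best, naive, corrected = select_and_correct(data, configs=(1, 3, 5))
    print(f"best k={best}  naive={naive:.3f}  corrected={corrected:.3f}")
```

In this sketch the final model would simply be the learner refitted on all samples with the selected configuration; for the toy k-nearest-neighbour learner that amounts to keeping the full data set and the chosen k.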
Figure 2. Representation of the selected features. Logo-like representation of the amino acid features taken into account by each classifier: (a) the “preprotein”, (b) the “mature domain”, and (c) MatureP. Different features at various positions on the protein sequence can be selected. The features correspond to either individual aminoacyl residues or groups of aminoacyl residues and are represented by a unique letter or symbol (see below). In (a) and (b) the complete set of the selected features is depicted, whereas in (c) only the position-specific amino acid features are applicable. The coefficients of the two linear classifiers are the weights of the features, which are employed here to represent the classifiers in Logo-like format (see Methods). If more than one feature is selected at a position, a stack of symbols is drawn. The height of each stack is indicative of the significance of the position (see Methods). The weights have been normalized from −1 to 1 so that the classifiers are comparable (see Methods). Positively weighted features are selected for secretory proteins, negatively weighted ones for cytoplasmic proteins. In the “preprotein” classifier the most significant features are selected in the signal peptide region. However, there are also features selected in the mature domain region. When the signal peptide is removed (bottom), more features are selected in the mature domains. A cluster of hydrophobic residues or arginine is disfavoured in the early mature domain (positions 1 to 33). Symbols: @: (D,E); + : (K,R); sml: (V,G,A,P); sm: (A,G); h: (I,L,V,M); ph: (L,I,F); b: (Y,W,F); o- (T,S); x: (Y,T,S); pol: (N,Q,C); q: (N,Q,H).
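As a rough illustration of how linear-classifier weights could be prepared for a Logo-like plot, the sketch below rescales the weights into the range −1 to 1 and derives one stack height per position. The exact procedure is described in the Methods and is not reproduced here: dividing by the largest absolute weight, summing absolute weights per position, and the example weights themselves are all assumptions.

```python
def normalise_weights(weights):
    """Scale weights into [-1, 1] by the largest absolute value (assumed scheme).

    `weights` maps (position, symbol) to a linear-classifier coefficient.
    Signs are preserved, so positive still means "secretory".
    """
    largest = max(abs(w) for w in weights.values())
    if largest == 0:
        return dict(weights)
    return {key: w / largest for key, w in weights.items()}

def stack_heights(weights):
    """Height of the symbol stack at each position: sum of absolute weights
    of the features selected there (an assumed measure of significance)."""
    heights = {}
    for (position, _symbol), w in weights.items():
        heights[position] = heights.get(position, 0.0) + abs(w)
    return heights

if __name__ == "__main__":
    # Hypothetical coefficients; "h" and "+" are symbols from the legend above.
    raw = {(3, "h"): -2.0, (3, "+"): -1.0, (10, "x"): 0.5}
    scaled = normalise_weights(raw)
    print(scaled)
    print(stack_heights(scaled))
```

Scaling each classifier by its own largest coefficient keeps the relative importance of features within a classifier while putting different classifiers on a common axis, which is the stated purpose of the normalization.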
Comparison with other bioinformatics tools.
| Bioinformatics tool | Performance (%): train set | Performance (%): test set¹ | Performance (%): experimental data² |
|---|---|---|---|
| SignalP 4.0 | 99.61 | 99.27 | 51.64 |
| LipoP | 99.83 | 99.71 | 61.39 |
| Phobius | 98.78 | 98.72 | 72.08 |
| PRED-TAT | 99.66 | 99.61 | 62.07 |
| **Classifier** | | | |
| Preprotein (#P1) | 97.19 | 95.51 | 97.96 |
| MatureP (#M22) | 91.46 | 91.24 | 85.10 |
| Disorder (#M7) | 84.73 | 84.18 | 85.86 |
We measured the performance of four bioinformatics tools (SignalP 4.0, LipoP, Phobius and PRED-TAT) on the training, testing and experimental datasets. We used the AUC as a performance metric[35]. The AUC depicts relative trade-offs between true positives (benefits) and false positives (costs) and represents the performance of the average classifier (over different classifiers which assume different misclassification cost ratios).
¹Randomly selected samples (20% of the total sample set; Table S1) which remained unused during the training of the classifiers.
²Experimental data manually collected from the literature (Table S4).
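For readers unfamiliar with the metric, the AUC can be computed directly as the probability that a randomly chosen positive sample (here, a secretory protein) receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition; the function name and the example scores are hypothetical and unrelated to the values in the table.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` uses 1 for positives and 0 for negatives. The result equals the
    fraction of (positive, negative) pairs ranked correctly, ties counting 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Hypothetical classifier scores for two positives and two negatives.
    print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The quadratic loop is kept for clarity; for large datasets a rank-based computation gives the same value more efficiently.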
Figure 3. The selected folding features. Decomposition of the energy predictor matrix P into its eigenvectors[37]. Each element of the P matrix tells how the energy of a residue of type i affects the occurrences of residues of type j in the sequence[37]. We name the eigenvectors folding components (FCs) because they represent a balance between stabilizing and destabilizing interactions of aminoacyl residues. We represent each FC as a stack of letters (one-letter amino acid code) where the size of each letter is proportional to the corresponding coordinate of the respective eigenvector. The black line denotes the weights of the linear classifier trained using the total energy per folding component (see Methods), termed the “disorder classifier”. The FCs are rearranged according to the weights of the classifier and are numbered in decreasing order of the respective eigenvalues, indicated at the top. Positively weighted FCs are selected for the mature domain sequences. For example, mature domains are composed mostly of Gln, Thr and Lys, as the seventh FC dictates.
Prediction of Gram+ and Gram- secretory proteins from their mature domains.
| Classifier | AUC (%): Gram− | AUC (%): Gram+ |
|---|---|---|
| Preprotein (#P1) | 97.96 | 97.59 |
| MatureP (#M22) | 85.79 | 90.04 |
| Disorder (#M7) | 74.61 | 82.40 |
We measured the performance of the main classifiers (“preprotein” and “mature domain”) on a set of secretory proteins predicted in other bacterial proteomes. For this analysis we selected 25 Gram- and 10 Gram+ bacteria and collected 7120 and 1361 secretory proteins, respectively (see Methods; Table S5). The E. coli K-12 secretome was excluded since it was used for the initial training.