Hafida Bouziane, Abdallah Chouarfia.
Abstract
To date, many proteins generated by large-scale genome sequencing projects remain uncharacterized and are under intensive investigation by both experimental and computational means. Knowledge of protein subcellular localization (SCL) is of key importance for elucidating protein function. However, SCL prediction remains a challenging task, especially for multi-site proteins, which shuttle between cell compartments to perform their biological functions, and for proteins that have no significant homology to proteins of known subcellular location. Owing to their low cost and reasonable accuracy, machine learning-based methods have gained much attention in this context, helped by the availability of a plethora of biological databases and annotated proteins for analysis and benchmarking. Various predictive models have been proposed to tackle the SCL problem using different protein sequence features pertaining to subcellular localization; however, the overwhelming majority of them focus on single localization and cover a very limited number of cellular locations. Prediction was initially based on sorting signals, amino acid composition, and homology. To improve prediction quality, the focus is now on knowledge extracted from annotation databases, such as protein-protein interactions and Gene Ontology (GO) functional domain annotations, which have recently become widely adopted and essential information for learning systems. In the present study, we treated SCL prediction as a multi-label learning problem and sought to label both single-site and multi-site unannotated bacterial protein sequences by mining protein homology relationships using both the GO terms of protein homologs and PSI-BLAST profiles. Experiments using 5-fold cross-validation tests on the benchmark datasets showed a significant improvement in the results obtained by the proposed consensus multi-label prediction model, which discriminates six compartments for Gram-negative and five compartments for Gram-positive bacterial proteins.
Keywords: gene ontology terms; gram-negative bacteria; gram-positive bacteria; multi-label learning; profile alignment; subcellular localization prediction
Year: 2020 PMID: 32598314 PMCID: PMC8035964 DOI: 10.1515/jib-2019-0091
Source DB: PubMed Journal: J Integr Bioinform ISSN: 1613-4516
Figure 1: Flowchart of the proposed prediction model for the subcellular localization of Gram-positive and Gram-negative bacterial proteins. First, protein sequence datasets were collected from the published database. Second, they were filtered and preprocessed using different strategies to obtain a fixed-size feature vector representation that can be fed into the learning model. Third, the resulting encoded feature vectors were independently fed into the multi-label learning model, based on the Label Powerset (LP) transformation, to produce independent prediction scores using the Random Forest (RF) ensemble method as base classifier. Once optimum performance scores had been obtained using 5-fold cross-validation tests, the final prediction model was built.
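
The Label Powerset step named in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the usual definition of the transformation, in which every distinct set of locations becomes one class of an ordinary multi-class problem. The function name `fit_label_powerset` and the toy label sets are hypothetical and are not taken from the study.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def fit_label_powerset(
    label_sets: Iterable[Set[str]],
) -> Tuple[Dict[FrozenSet[str], int], Dict[int, FrozenSet[str]]]:
    """Map every distinct label set to one class id, and build the inverse map."""
    class_of: Dict[FrozenSet[str], int] = {}
    for labels in label_sets:
        key = frozenset(labels)
        if key not in class_of:
            class_of[key] = len(class_of)  # ids follow order of first appearance
    sets_of = {class_id: key for key, class_id in class_of.items()}
    return class_of, sets_of


# Toy multi-label targets using the location codes C, I, S (illustrative only).
y_multilabel: List[Set[str]] = [{"C"}, {"I"}, {"C", "I"}, {"S"}, {"C"}]

class_of, sets_of = fit_label_powerset(y_multilabel)

# Multi-class targets that a single base classifier could be trained on.
y_single = [class_of[frozenset(labels)] for labels in y_multilabel]

# Decoding a predicted class id recovers the full set of locations.
decoded = [sets_of[class_id] for class_id in y_single]
```

Any multi-class learner, such as the Random Forest named in the caption, could then be trained on `y_single`, with its predicted class ids mapped back to location sets through `sets_of`.
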
Prokaryotic benchmark dataset statistics. The Code column indicates how each subcellular location is represented in our predictive model. Gram-negative bacteria have five major subcellular localization sites, namely the cytoplasm, the periplasm, the inner membrane, the outer membrane, and the extracellular space, whereas Gram-positive bacteria do not have an outer cell membrane. In these benchmark datasets, however, the Gram-negative dataset contains no cell wall proteins and the Gram-positive dataset contains no periplasm proteins.
| No | Subcellular location | Code | Proteins count | |
|---|---|---|---|---|
| | | | Gram negative | Gram positive |
| 1 | Cytoplasm | C | 4,152 | 349 |
| 2 | Extracellular | S | 272 | 290 |
| 3 | Inner membrane | I | 1,415 | 1,779 |
| 4 | Outer membrane | O | 346 | – |
| 5 | Periplasm | P | 422 | – |
| 6 | Cell wall | W | – | 34 |
| 7 | Vacuole | V | 10 | 4 |
| | Multiple localizations | | 39 | 8 |
| | Total | | 6,578 | 2,448 |
Figure 4: Observed localization sites of proteins in the Gram-negative bacteria dataset.
Figure 5: Observed localization sites of proteins in the Gram-positive bacteria dataset.
Statistics of the extracted GO terms for the three components of the GO namespace, namely Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).
| Dataset | GO terms count | | |
|---|---|---|---|
| | MF | BP | CC |
| Gram negative | 1,140 | 1,316 | 223 |
| Gram positive | 570 | 667 | 116 |
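
A 0/1 GO-term representation, one of the feature sets evaluated in this study, can be sketched as a binary indicator vector over a fixed vocabulary of GO terms. The sketch below is an assumption about the encoding, not the authors' implementation. The helper names and the tiny vocabulary are illustrative, and the GO identifiers are borrowed from the prediction tables only as sample values.

```python
from typing import Dict, Iterable, List


def build_vocabulary(term_sets: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign a column index to every GO term seen in the training data."""
    terms = sorted({term for term_set in term_sets for term in term_set})
    return {term: index for index, term in enumerate(terms)}


def encode_binary(terms: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector: 1 where the protein carries the GO term, else 0."""
    vector = [0] * len(vocab)
    for term in terms:
        index = vocab.get(term)
        if index is not None:  # terms outside the vocabulary are ignored
            vector[index] = 1
    return vector


# Hypothetical training annotations (GO ids reused from the tables as examples).
train_terms = [
    {"GO:0043885", "GO:0006091", "GO:0005886"},
    {"GO:0008234", "GO:0006508", "GO:0005576"},
]
vocab = build_vocabulary(train_terms)

# A query protein with two known terms and one unseen, made-up term.
query = encode_binary({"GO:0005886", "GO:0043885", "GO:9999999"}, vocab)
```

Under this assumption, the vector length equals the number of distinct GO terms in the vocabulary, so the term counts in the table above would bound the dimensionality of each GO feature block.
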
Performance evaluation by 5-fold cross-validation tests on Gram-negative bacterial proteins using a consensus of PSSM and GO terms-based predictions. The multi-label confusion matrix shows the prediction performance for each location separately. The % columns give each count as a fraction of all 6,578 proteins.
| Subcellular locations | | Metrics | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proteins count | Location (code) | TP | FP | FN | TN | Correct | Wrong | % TP | % FP | % FN | % TN | % Correct |
| 4,152 | Cytoplasm (C) | 4,132 | 157 | 20 | 2,269 | 6,401 | 177 | 0.63 | 0.02 | 0 | 0.34 | 0.97 |
| 1,415 | Inner membrane (I) | 1,388 | 238 | 27 | 4,925 | 6,313 | 265 | 0.21 | 0.04 | 0 | 0.75 | 0.96 |
| 346 | Outer membrane (O) | 315 | 29 | 31 | 6,203 | 6,518 | 60 | 0.05 | 0 | 0 | 0.94 | 0.99 |
| 422 | Periplasm (P) | 393 | 72 | 29 | 6,084 | 6,477 | 101 | 0.06 | 0.01 | 0 | 0.92 | 0.98 |
| 272 | Extracellular (S) | 250 | 39 | 22 | 6,267 | 6,517 | 61 | 0.04 | 0.01 | 0 | 0.95 | 0.99 |
| 10 | Vacuole (V) | 8 | 0 | 2 | 6,568 | 6,576 | 2 | 0 | 0 | 0 | 1 | 1 |
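
The derived columns of the confusion matrix follow directly from the four counts: Correct is TP + TN, Wrong is FP + FN, and each % column is the count divided by the dataset size. A short sketch, using the Cytoplasm row above as input, reproduces them. The function name is hypothetical, and the precision and recall fields are standard additions that do not appear in the table.

```python
from typing import Dict


def per_label_summary(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Recompute the derived columns of one confusion-matrix row."""
    total = tp + fp + fn + tn          # every protein falls in exactly one cell
    correct = tp + tn                  # label assigned or withheld correctly
    wrong = fp + fn                    # label assigned or withheld wrongly
    return {
        "total": total,
        "correct": correct,
        "wrong": wrong,
        "frac_tp": round(tp / total, 2),
        "frac_fp": round(fp / total, 2),
        "frac_fn": round(fn / total, 2),
        "frac_tn": round(tn / total, 2),
        "frac_correct": round(correct / total, 2),
        # Not in the table: standard per-label precision and recall.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Cytoplasm (C) row of the Gram-negative table: TP, FP, FN, TN.
cytoplasm = per_label_summary(tp=4132, fp=157, fn=20, tn=2269)
```

Note that TP + FN equals the number of proteins annotated with the location (4,152 for the cytoplasm), which is the value shown in the first column of each row.
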
Performance evaluation by 5-fold cross-validation tests on Gram-positive bacterial proteins using a consensus of GO terms and PSSM profile predictions. The multi-label confusion matrix shows the prediction performance for each location separately. The % columns give each count as a fraction of all 2,448 proteins.
| Subcellular locations | | Metrics | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proteins count | Location (code) | TP | FP | FN | TN | Correct | Wrong | % TP | % FP | % FN | % TN | % Correct |
| 349 | Cytoplasm (C) | 323 | 30 | 26 | 2,069 | 2,392 | 56 | 0.13 | 0.01 | 0.01 | 0.85 | 0.98 |
| 1,779 | Inner membrane (I) | 1,768 | 61 | 11 | 608 | 2,376 | 72 | 0.72 | 0.02 | 0 | 0.25 | 0.97 |
| 290 | Extracellular (S) | 282 | 101 | 8 | 2,057 | 2,339 | 109 | 0.12 | 0.04 | 0 | 0.84 | 0.96 |
| 34 | Cell wall (W) | 13 | 3 | 21 | 2,411 | 2,424 | 24 | 0.01 | 0 | 0.01 | 0.98 | 0.99 |
| 4 | Vacuole (V) | 4 | 0 | 0 | 2,444 | 2,448 | 0 | 0 | 0 | 0 | 1 | 1 |
Predictions of the proposed SCL model for multi-site proteins in the Gram-positive bacterial dataset, compared with those of CELLO2GO, BUSCA, and UniLoc.
| Protein name | Predicted essential GO terms | | | Top ranked CC description | Predicted location(s) | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | MF | BP | CC | | PseAAC | PSSM profiles | GO terms | Consensus | CELLO2GO | BUSCA | UniLoc |
| COOS2_CARHZ | GO:0043885 | GO:0006091 | GO:0005886 | plasma membrane | I | I | C/I | C/I | C | M | C/M |
| COMGG_BACSU | | GO:0030420 | GO:0016021 | integral component of membrane | I | C | C | C | S/M | M | M/S |
| COOS1_CARHZ | GO:0043885 | GO:0006091 | GO:0005886 | plasma membrane | I | I | I | I | C | M | C/M |
| ECPA_STAEP | GO:0008234 | GO:0006508 | GO:0005576 | extracellular region | S | S | S | S | S | C | S/W |
| LYS_CLOAB | GO:0003796 | GO:0016998 | GO:0005576 | extracellular region | S | S | S | S | S | C | S |
| HYES_CORS2 | GO:0016787 | GO:0019439 | GO:0005886 | plasma membrane | I | I | C | C/I | M/C | C | M |
| CPLR_DESHA | GO:0050781 | GO:0046193 | GO:0009275 | cell wall | I | S | C | C/S | S | S | S/M/W |
| PLC_LISMO | GO:0008081 | GO:0006629 | GO:0005576 | extracellular region | I | S | S | S | S | M | S |
Figure 8: Effect of the presence of the MF GO term GO:0043885, the BP GO term GO:0006091, and the CC GO term GO:0005886 on the proposed model's predictions for the COOS2_CARHZ (C/I) and COOS1_CARHZ (C/I) proteins, for which it succeeded, whereas the other predictors predict them as cytoplasm (C) or membrane (M).
Performance evaluation results of cross-validation tests on Gram-negative bacteria dataset for different predictions: pseudo-amino acid composition (PseAAC), position-specific scoring matrix (PSSM) profiles, gene ontology (GO) terms 0/1-based representation, GO terms PPV-based representation, features fusion, a consensus prediction using both GO terms and PSSM profiles outputs, and a consensus of PseAAC, GO terms and PSSM profiles outputs. Italic values correspond to the best predictive model.
| Sequence features | Example-based metrics | | | | | | | Label-based metrics | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Accuracy | Precision | Recall | F1 score | Subset_accuracy | Hamming-loss | Rank-loss | Macro_precision | Macro_recall | Macro_F1 score | Micro_precision | Micro_recall | Micro_F1 score |
| PseAAC | 0.882 | 0.884 | 0.882 | 0.883 | 0.880 | 0.039 | 0.117 | 0.873 | 0.656 | 0.732 | 0.884 | 0.879 | 0.881 |
| PSSM profiles | 0.940 | 0.941 | 0.940 | 0.940 | 0.937 | 0.020 | 0.060 | 0.919 | 0.815 | 0.860 | 0.941 | 0.936 | 0.939 |
| GO terms 0/1 | 0.963 | 0.965 | 0.963 | 0.963 | 0.960 | 0.012 | 0.037 | 0.952 | 0.865 | 0.903 | 0.965 | 0.960 | 0.962 |
| GO terms ppv | 0.965 | 0.967 | 0.965 | 0.965 | 0.962 | 0.011 | 0.035 | 0.958 | 0.869 | 0.908 | 0.967 | 0.962 | 0.964 |
| PseAAC+PSSM+GO | 0.951 | 0.953 | 0.951 | 0.952 | 0.949 | 0.016 | 0.048 | 0.938 | 0.835 | 0.880 | 0.953 | 0.948 | 0.951 |
| PSSM+GO | 0.951 | 0.954 | 0.951 | 0.952 | 0.949 | 0.016 | 0.048 | 0.939 | 0.850 | 0.890 | 0.954 | 0.948 | 0.951 |
| Consensus PseAAC+PSSM+GO | 0.920 | 0.920 | 0.986 | 0.941 | 0.858 | 0.030 | 0.043 | 0.852 | 0.930 | 0.884 | 0.855 | 0.984 | 0.915 |
| Consensus PSSM+GO | | | | | | | | | | | | | |
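
The example-based columns of the table above can be made concrete with a small sketch. It assumes the standard multi-label definitions: Hamming loss as the fraction of wrong label decisions, subset accuracy as the exact-match rate, and accuracy, precision, and recall as per-protein set overlaps averaged over proteins. The source does not spell out its formulas, so these definitions, the function names, and the toy predictions are assumptions.

```python
from typing import List, Sequence, Set, Tuple


def hamming_loss(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]],
                 n_labels: int) -> float:
    """Fraction of (protein, label) decisions that are wrong."""
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))  # symmetric difference
    return wrong / (len(y_true) * n_labels)


def subset_accuracy(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]) -> float:
    """Fraction of proteins whose predicted label set matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def example_based(y_true: Sequence[Set[str]],
                  y_pred: Sequence[Set[str]]) -> Tuple[float, float, float]:
    """Per-protein accuracy (Jaccard), precision and recall, averaged."""
    acc = prec = rec = 0.0
    for t, p in zip(y_true, y_pred):
        overlap = len(t & p)
        acc += overlap / len(t | p) if (t | p) else 1.0
        prec += overlap / len(p) if p else 0.0
        rec += overlap / len(t) if t else 0.0
    n = len(y_true)
    return acc / n, prec / n, rec / n


# Toy data over the six Gram-negative codes C, I, O, P, S, V (illustrative only).
y_true: List[Set[str]] = [{"C"}, {"I"}, {"C", "I"}, {"S"}]
y_pred: List[Set[str]] = [{"C"}, {"I"}, {"C"}, {"S", "P"}]

hl = hamming_loss(y_true, y_pred, n_labels=6)
sa = subset_accuracy(y_true, y_pred)
accuracy, precision, recall = example_based(y_true, y_pred)
```

In the toy data, a missed second site lowers recall while an extra predicted site lowers precision. This is the trade-off visible in the consensus row of the table, where recall rises and precision and subset accuracy fall.
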
Performance evaluation results of cross-validation tests on Gram-positive bacteria dataset for different predictions: pseudo-amino acid composition (PseAAC), position-specific scoring matrix (PSSM) profiles, gene ontology (GO) terms 0/1-based representation, GO terms PPV-based representation, features fusion, a consensus prediction using both GO terms and PSSM profiles outputs, and a consensus of PseAAC, GO terms and PSSM profiles outputs. Italic values correspond to the best predictive model.
| Sequence features | Example-based metrics | | | | | | | Label-based metrics | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Accuracy | Precision | Recall | F1 score | Subset_accuracy | Hamming-loss | Rank-loss | Macro_precision | Macro_recall | Macro_F1 score | Micro_precision | Micro_recall | Micro_F1 score |
| PseAAC | 0.896 | 0.897 | 0.895 | 0.896 | 0.894 | 0.041 | 0.104 | 0.708 | 0.510 | 0.559 | 0.897 | 0.894 | 0.895 |
| PSSM profiles | 0.937 | 0.938 | 0.937 | 0.937 | 0.935 | 0.025 | 0.062 | 0.909 | 0.763 | 0.812 | 0.938 | 0.935 | 0.937 |
| GO terms 0/1 | 0.959 | 0.959 | 0.958 | 0.959 | 0.957 | 0.016 | 0.041 | 0.952 | 0.785 | 0.791 | 0.959 | 0.957 | 0.958 |
| GO terms ppv | 0.962 | 0.962 | 0.961 | 0.962 | 0.960 | 0.015 | 0.038 | 0.927 | 0.803 | 0.819 | 0.962 | 0.960 | 0.961 |
| PseAAC+PSSM+GO | 0.947 | 0.949 | 0.947 | 0.948 | 0.946 | 0.020 | 0.052 | 0.931 | 0.772 | 0.824 | 0.949 | 0.946 | 0.947 |
| PSSM+GO | 0.948 | 0.950 | 0.948 | 0.949 | 0.946 | 0.020 | 0.051 | 0.920 | 0.822 | 0.851 | 0.950 | 0.947 | 0.948 |
| Consensus PseAAC+PSSM+GO | 0.924 | 0.924 | 0.978 | 0.941 | 0.871 | 0.033 | 0.050 | 0.855 | 0.858 | 0.839 | 0.871 | 0.977 | 0.921 |
| Consensus PSSM+GO | | | | | | | | | | | | | |
Predictions of the proposed SCL model for some multi-site proteins in the Gram-negative bacterial dataset, compared with those of CELLO2GO, BUSCA, and UniLoc.
| Protein name | Predicted essential GO terms | | | Top ranked CC description | Predicted location(s) | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | MF | BP | CC | | PseAAC | PSSM profiles | GO terms | Consensus | CELLO2GO | BUSCA | UniLoc |
| CH60_NEIGO | GO:0051082 | GO:0042026 | GO:0005737 | cytoplasm | C | C | C | C | C | C | C |
| ENO_ECOLI | GO:0004634 | GO:0006096 | GO:0000015 | phosphopyruvate hydratase complex | C | C | C | C | C | C | C/S |
| SPIC_SALTY | | GO:0009405 | GO:0005576 | extracellular region | C | I | S | I/S | S | C | C/S |
| MXIG_SHIFL | | GO:0009405 | GO:0016021 | extracellular region | C | I | I | I | I/C | C | M |
| PGPB_ECOLI | GO:0008962 | GO:0016311 | GO:0016021 | integral component of membrane | I | I | I | I | I | M | M |
| LEPA_ECOLI | GO:0043022 | GO:0045727 | GO:0005886 | integral component of membrane | C | C | C | C | I/C | C | C/M |
| LEPA_SALTY | GO:0043022 | GO:0045727 | GO:0005886 | plasma membrane | I | S | C | C/S | I/C | C | C/M |
| TIBA_ECOLI | GO:0005509 | GO:0009405 | GO:0019867 | outer membrane | S | S | S | S | S/O | C | M |
| FRPC_NEIMC | GO:0005509 | GO:0009405 | GO:0005576 | extracellular region | O | S | S | S | S/C | C | S/C |
| YOPM_YERPE | GO:0004842 | GO:0016567 | GO:0045335 | phagocytic vesicle | C | C | C | C | C | S/I/C | S/C |
| FRPA_NEIMC | GO:0005509 | GO:0009405 | GO:0005576 | extracellular region | O | S | S | S | S | C | M/S |
| TCPS_VIBCH | | GO:0009297 | GO:0042597 | periplasmic space | C | I | S | I/S | P/C | S | P/S |
| PPBL_PSEAE | GO:0042301 | GO:0035435 | GO:0043190 | ATP-binding cassette (ABC) transporter complex | S | S | P | S/P | S | S | S/P/M |
| HLYE_ECOLI | GO:0090729 | GO:0044179 | GO:0020002 | host cell plasma membrane | C | C | S | S/C | S | C | P/M/S |
| PPBH_PSEAE | GO:0090729 | GO:0016311 | GO:0042597 | periplasmic space | C | P | P | P | P | S | P/S |
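
The source does not state how the Consensus column is derived. The rows of the two multi-site tables are, however, consistent with taking the union of the location sets predicted from PSSM profiles and from GO terms. The sketch below checks that reading on a few rows copied from the tables. The union rule and the function names are assumptions, not the authors' documented method.

```python
from typing import Dict, FrozenSet, Tuple


def parse_sites(cell: str) -> FrozenSet[str]:
    """Turn a table cell such as 'C/I' into a set of location codes."""
    return frozenset(code for code in cell.split("/") if code)


def union_consensus(*predictions: FrozenSet[str]) -> FrozenSet[str]:
    """Assumed consensus rule: keep every location proposed by any predictor."""
    merged = set()
    for predicted in predictions:
        merged |= predicted
    return frozenset(merged)


# (PSSM profiles, GO terms, Consensus) cells copied from the tables.
rows: Dict[str, Tuple[str, str, str]] = {
    "COOS2_CARHZ": ("I", "C/I", "C/I"),
    "HYES_CORS2": ("I", "C", "C/I"),
    "SPIC_SALTY": ("I", "S", "I/S"),
    "PPBL_PSEAE": ("S", "P", "S/P"),
    "PPBH_PSEAE": ("P", "P", "P"),
}

# True where the union of the PSSM and GO sets equals the reported consensus.
matches = {
    name: union_consensus(parse_sites(pssm), parse_sites(go)) == parse_sites(consensus)
    for name, (pssm, go, consensus) in rows.items()
}
```

A union rule can only add locations. That would explain why the consensus models in the performance tables gain recall at the cost of precision.
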