MuthuKrishnan Selvaraj, Munish Puri, Kanak L Dikshit, Christophe Lefevre.
Abstract
The recent upsurge in microbial genome data has revealed that hemoglobin-like (HbL) proteins may be widely distributed among bacteria and that some organisms may carry more than one HbL-encoding gene. However, HbL proteins have so far been identified in only a small number of bacteria. This study describes the prediction of HbL proteins and their domain classification using a machine learning approach. Support vector machine (SVM) models were developed for predicting HbL proteins based on amino acid composition (AC), dipeptide composition (DC), a hybrid method (AC + DC), and the position-specific scoring matrix (PSSM). In addition, we introduce for the first time a new prediction method based on max to min amino acid residue (MM) profiles. The average accuracy, standard deviation (SD), false positive rate (FPR), confusion matrix, and receiver operating characteristic (ROC) were analyzed. We also compared the performance of our proposed models against homology detection databases. The performance of the different approaches was estimated using fivefold cross-validation. Prediction accuracy was further investigated through confusion matrix and ROC curve analysis. All experimental results indicate that the proposed BacHbpred can be a prospective predictor for the determination of HbL-related proteins. BacHbpred, a web tool, has been developed for HbL prediction.
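The composition-based features named in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function names are hypothetical, and expressing each feature as a percentage and skipping non-standard residues are assumptions made for the example. AC gives a 20-element vector, DC a 400-element vector, and the hybrid vector concatenates the two.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return a 20-element vector of residue percentages (assumed scaling)."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    total = len(residues)
    return [100.0 * residues.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-element vector of overlapping dipeptide percentages."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    valid = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Pairs containing a non-standard residue are skipped (assumption).
        if pair in counts:
            counts[pair] += 1
            valid += 1
    if valid == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[dp] / valid for dp in DIPEPTIDES]


def hybrid_features(sequence):
    """Concatenate AC and DC into a single 420-element vector."""
    return amino_acid_composition(sequence) + dipeptide_composition(sequence)
```

Vectors of this kind would then be passed to an SVM trainer; the PSSM and MM profile features are not sketched here because the record does not describe how they are computed.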
Year: 2016 PMID: 27034664 PMCID: PMC4789356 DOI: 10.1155/2016/8150784
Source DB: PubMed Journal: Adv Bioinformatics ISSN: 1687-8027
Performance of the SVM modules for HbL protein prediction (HbL versus non-HbL) and HbL classification (single domain, two domains (flavoHbs), and truncated Hbs (trHb)), developed using amino acid composition (AC), dipeptide composition (DC), PSSM, MM profiles, and the hybrid (AC + DC) method.
| Classification | Method | ACC (%) | SN (%) | SP (%) | MCC | Parameters | |
|---|---|---|---|---|---|---|---|
| HbL versus non-HbL | AC | 86.14 | 96.18 | 76.11 | 0.82 | 25 | 400 |
| | DC | 83.02 | 94.78 | 71.27 | 0.78 | 1 | 375 |
| | PSSM | 90.20 | 97.76 | 82.64 | 0.89 | 1 | 300 |
| | MM | 86.28 | 96.08 | 76.49 | 0.83 | 25 | 450 |
| | Hybrid | 85.21 | 95.80 | 74.62 | 0.81 | 0.1 | 375 |
| Single domain (sHb) | AC | 94.96 | 100 | 94.56 | 0.97 | 15 | 9 |
| | DC | 83.23 | 100 | 82.15 | 0.91 | 0.2 | 250 |
| | PSSM | 95.05 | 100 | 94.66 | 0.97 | 5 | 7 |
| | MM | 94.87 | 100 | 94.46 | 0.97 | 1 | 150 |
| | Hybrid | 91.51 | 100 | 90.83 | 0.95 | 0.1 | 350 |
| Two domains (flavoHbs) | AC | 96.46 | 100 | 89.67 | 0.95 | 10 | 300 |
| | DC | 87.50 | 100 | 63.58 | 0.80 | 1 | 350 |
| | PSSM | 95.05 | 100 | 85.59 | 0.93 | 1 | 350 |
| | MM | 96.46 | 100 | 89.67 | 0.95 | 10 | 300 |
| | Hybrid | 90.29 | 100 | 71.73 | 0.84 | 1 | 150 |
| Truncated Hbs (trHb) | AC | 85.26 | 98.89 | 80.62 | 0.89 | 5 | 350 |
| | DC | 78.17 | 98.89 | 71.13 | 0.83 | 1 | 275 |
| | PSSM | 87.97 | 100 | 83.88 | 0.92 | 1 | 400 |
| | MM | 85.07 | 99.26 | 80.25 | 0.89 | 4 | 500 |
| | Hybrid | 80.03 | 100 | 73.25 | 0.85 | 1 | 150 |
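For readers unfamiliar with the column abbreviations, the sketch below computes accuracy (ACC), sensitivity (SN), specificity (SP), and the Matthews correlation coefficient (MCC) from confusion-matrix counts using the standard textbook definitions. The record does not state the formulas the authors used, so this is an assumption, and the function name and example counts are hypothetical rather than taken from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard ACC, SN, SP (as percentages) and MCC from confusion counts."""
    total = tp + fp + tn + fn
    acc = 100.0 * (tp + tn) / total if total else 0.0
    sn = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    sp = 100.0 * tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Hypothetical counts for illustration only (not from the study).
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
```

With these illustrative counts, SN is 90%, SP is 80%, and ACC is 85%; the false positive rate mentioned in the abstract is simply 100 minus SP.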
Performance of the SVM modules for HbL protein domain classification (flavoglobin, flavoglobin-cyto-FAD, flavoglobin-FAD, single domain, and trHb domain), developed using amino acid composition (AC), dipeptide composition (DC), PSSM, MM profiles, and the hybrid (AC + DC) method.
| HbL protein domain | Method | ACC (%) | SN (%) | SP (%) | MCC | Parameters | |
|---|---|---|---|---|---|---|---|
| Flavoglobin (FAD-insignificant) | AC | 91.88 | 65.62 | 92.69 | 0.74 | 50 | 200 |
| | DC | 78.09 | 56.25 | 78.75 | 0.52 | 2 | 250 |
| | PSSM | 84.05 | 78.13 | 84.23 | 0.77 | 2 | 400 |
| | MM | 93.28 | 100.00 | 93.08 | 0.96 | 25 | 450 |
| | Hybrid | 82.00 | 62.50 | 82.59 | 0.62 | 1 | 450 |
| Flavoglobin-cyto-FAD/NAD | AC | 89.65 | 99.51 | 75.89 | 0.86 | 10 | 200 |
| | DC | 87.78 | 98.55 | 72.77 | 0.83 | 5 | 200 |
| | PSSM | 91.04 | 100 | 78.57 | 0.89 | 2 | 500 |
| | MM | 89.17 | 96.96 | 78.34 | 0.84 | 25 | 400 |
| | Hybrid | 89.00 | 98.56 | 75.67 | 0.85 | 3 | 350 |
| Flavoglobin-FAD | AC | 83.68 | 50.00 | 84.71 | 0.53 | 10 | 275 |
| | DC | 77.05 | 59.37 | 77.60 | 0.54 | 1 | 275 |
| | PSSM | 82.74 | 18.75 | 84.71 | 0.09 | 1 | 500 |
| | MM | 89.74 | 50.00 | 90.96 | 0.60 | 15 | 500 |
| | Hybrid | 76.77 | 43.75 | 77.78 | 0.37 | 1 | 200 |
| Single bac domain (globin-like) | AC | 94.96 | 100 | 94.56 | 0.97 | 15 | 9 |
| | DC | 83.23 | 100 | 82.15 | 0.91 | 0.2 | 250 |
| | PSSM | 95.05 | 100 | 94.66 | 0.97 | 5 | 7 |
| | MM | 94.87 | 100 | 94.46 | 0.97 | 1 | 150 |
| | Hybrid | 91.51 | 100 | 90.83 | 0.95 | 0.1 | 350 |
| Truncated BacHb domain (globin_trunc_bac-like) | AC | 85.26 | 98.89 | 80.62 | 0.89 | 5 | 350 |
| | DC | 78.17 | 98.89 | 71.13 | 0.83 | 1 | 275 |
| | PSSM | 87.97 | 100 | 83.88 | 0.92 | 1 | 400 |
| | MM | 85.07 | 99.26 | 80.25 | 0.89 | 4 | 500 |
| | Hybrid | 80.03 | 100 | 73.25 | 0.85 | 1 | 150 |
Figure 1. Confusion matrix system of bacterial Hbs (single domain, two domains (flavoHbs), and trHb (truncated Hbs)).
Figure 2. ROC plots showing the performance of the HbL protein SVM models. C-1: HbL proteins, AUC 0.943, 0.969, 0.992, and 0.943 for the AC, DC, PSSM, and MM profile methods, respectively; C-2: flavoHbs, AUC 0.968, 0.994, 0.991, and 0.968; C-3: single domain (sHb), AUC 1.00, 0.99, 1.00, and 1.00; and C-4: trHb, AUC 0.950, 0.994, 0.993, and 0.949, in the same method order.
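As a reminder of what an AUC value summarizes, the following sketch computes the area under the ROC curve as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. This is the standard rank-based definition, not the authors' plotting code; the function name, labels, and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical decision scores for four sequences (1 = HbL, 0 = non-HbL).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 1.00 means every positive sequence scored above every negative one, while 0.5 corresponds to random ranking.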
HbL domain prediction for BLAST-searched sequences: Pfam compared with all BacHbpred models (AC, DC, PSSM, and MM).
| Domain | Total | Pfam | AC | DC | PSSM | MM |
|---|---|---|---|---|---|---|
| SHb | 499 | 162 | 140 | 140 | 103 | 140 |
| Flavoglobin | 749 | 4 | 31 | 27 | 7 | 48 |
| Flavoglobin-cyto-FAD/NAD | 749 | 673 | 667 | 605 | 578 | 631 |
| Flavoglobin-FAD | 749 | 30 | 30 | 20 | 0 | 34 |
| trHb | 1203 | 1130 | 1008 | 1081 | 1164 | 1011 |
Pfam predicts flavoglobin with FAD/NAD only; it does not show any signal for the cytochrome reductase domain.
Figure 3. (a) Sequence length histograms of HbL proteins based on domain organization: single domain (sHb), two domains (flavoHbs, i.e., globin-FAD, globin-FAD/NAD, and globin-cyto-FAD/NAD), and trHb (truncated Hbs); the x-axis shows the sequence length range and the y-axis the number of sequences. (b) Sequence similarity histograms of HbL proteins: single domain (sHb), two domains (flavoHbs), and trHb (truncated Hbs). (c) Domain architecture of HbL proteins based on the Pfam/InterPro web search tool: (A) flavoHb globin-FAD/NAD, (B) flavoHb globin-cyto-FAD/NAD, (C) flavoHb globin-insignificant-FAD domain, (D) sHb globin domain, and (E) trHb globin_FAM_2 domain.