Ankit Gupta, Rohan Kapil, Darshan B Dhakan, Vineet K Sharma.
Abstract
The identification of virulent proteins in any de novo sequenced genome is useful for estimating its pathogenic potential and understanding the mechanism of pathogenesis. Similarly, identifying such proteins is valuable when comparing the metagenomes of healthy and diseased individuals and estimating the proportion of pathogenic species. The common challenge in both tasks is the identification of virulent proteins, since a significant proportion of genomic and metagenomic proteins are novel and not yet annotated. The currently available tools for identifying virulent proteins provide limited accuracy and cannot be used on large datasets. We have therefore developed MP3, a standalone tool and web server for the prediction of pathogenic proteins in both genomic and metagenomic datasets. MP3 uses an integrated Support Vector Machine (SVM) and Hidden Markov Model (HMM) approach to carry out fast, sensitive and accurate prediction of pathogenic proteins. It displayed sensitivity, specificity, MCC and accuracy values of 92%, 100%, 0.92 and 96%, respectively, on a blind dataset constructed using complete proteins. On the two metagenomic blind datasets (Blind A: 51-100 amino acids; Blind B: 30-50 amino acids), it displayed sensitivity, specificity, MCC and accuracy values of 82.39%, 97.86%, 0.80 and 89.32% for Blind A and 71.60%, 94.48%, 0.67 and 81.86% for Blind B, respectively. In addition, the performance of MP3 was validated on selected bacterial genomic and real metagenomic datasets. To our knowledge, MP3 is the only program that specializes in fast and accurate identification of partial pathogenic proteins predicted from short (100-150 bp) metagenomic reads and also performs exceptionally well on complete protein sequences. MP3 is publicly available at http://metagenomics.iiserb.ac.in/mp3/index.php.
Year: 2014 PMID: 24736651 PMCID: PMC3988012 DOI: 10.1371/journal.pone.0093907
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Steps used by the HMM module for the prediction of pathogenic proteins.
Performance of SVM module on the main dataset.
| Threshold | Sensitivity | Specificity | Accuracy | MCC |
| −1 | 95.08 | 50.34 | 59.89 | 0.38 |
| −0.9 | 93.98 | 56.12 | 64.2 | 0.41 |
| −0.8 | 93.43 | 61.6 | 68.39 | 0.45 |
| −0.7 | 92.24 | 67.46 | 72.74 | 0.49 |
| −0.6 | 89.65 | 72.71 | 76.33 | 0.52 |
| −0.5 | 88.43 | 76.66 | 79.17 | 0.55 |
| −0.4 | 86.56 | 80.27 | 81.61 | 0.58 |
| −0.3 | 84.67 | 83.88 | 84.04 | 0.61 |
| −0.1 | 79.22 | 89.06 | 86.96 | 0.64 |
| 0 | 76.12 | 90.96 | 87.79 | 0.65 |
| 0.1 | 72.3 | 92.61 | 88.27 | 0.65 |
| 0.3 | 65.58 | 95.01 | 88.72 | 0.65 |
| 0.4 | 61.57 | 96.02 | 88.66 | 0.64 |
| 0.5 | 59.16 | 96.7 | 88.68 | 0.64 |
| 0.6 | 55.65 | 97.32 | 88.42 | 0.63 |
| 0.7 | 52.07 | 97.89 | 88.1 | 0.61 |
| 0.8 | 48.19 | 98.36 | 87.65 | 0.6 |
| 0.9 | 42.59 | 98.82 | 86.82 | 0.56 |
| 1 | 38.42 | 99.04 | 86.11 | 0.54 |
The point where sensitivity and specificity are roughly equal is highlighted in bold. The point of maximum MCC is highlighted in bold and italics.
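Accuracy is the prevalence-weighted average of sensitivity and specificity, so the rows above implicitly fix the share of pathogenic proteins in the main dataset. The sketch below solves acc = p·sens + (1 − p)·spec for p using a few rows copied from the table. The resulting share of about 21% is an inference from the rounded table values, not a figure stated in the source, and the function name is hypothetical. Rows close to the balance point are left out because the small difference between sensitivity and specificity amplifies rounding error.

```python
# (threshold, sensitivity %, specificity %, accuracy %) copied from the table
rows = [
    (-1.0, 95.08, 50.34, 59.89),
    (-0.5, 88.43, 76.66, 79.17),
    (0.0, 76.12, 90.96, 87.79),
    (0.5, 59.16, 96.70, 88.68),
    (1.0, 38.42, 99.04, 86.11),
]


def positive_fraction(sens, spec, acc):
    """Solve acc = p * sens + (1 - p) * spec for the positive fraction p."""
    return (acc - spec) / (sens - spec)


fractions = [positive_fraction(s, sp, a) for _, s, sp, a in rows]
for (threshold, *_), p in zip(rows, fractions):
    print(f"threshold {threshold:+.1f}: implied pathogenic fraction {p:.3f}")
```

Because nonpathogenic proteins dominate under this estimate, accuracy tracks specificity more closely than sensitivity, which is why accuracy keeps rising at thresholds where sensitivity has already dropped sharply.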
Performance of SVM, HMM and combined Modules on the Genomic Blind (Blind) and Metagenomic Blind datasets (BlindA and BlindB).
| Method | Dataset | Sensitivity | Specificity | Accuracy | MCC |
| SVM | Blind | 84 | 92 | 88 | 0.76 |
| SVM | BlindA | 72.86 | 94.34 | 82.49 | 0.68 |
| SVM | BlindB | 64.14 | 91.69 | 76.50 | 0.57 |
| HMM | Blind (86.50%) | 100 | 97.02 | 98.27 | 0.97 |
| HMM | BlindA (52.60%) | 99.47 | 98.11 | 98.68 | 0.97 |
| HMM | BlindB (35.64%) | 78.17 | 99.03 | 89.24 | 0.80 |
| Combined (MP3) | Blind | 92 | 100 | 96 | 0.92 |
| Combined (MP3) | BlindA | 82.39 | 97.86 | 89.32 | 0.80 |
| Combined (MP3) | BlindB | 71.60 | 94.48 | 81.86 | 0.67 |
*The values in brackets show the percentage of sequences for which the HMM module provided a prediction. The default threshold of −0.2 was used for the SVM module and the default e-value of 1e−5 was used for the HMM module.
Performance of SVM module on metagenomic datasets.
| Threshold | Sensitivity (A) | Sensitivity (B) | Specificity (A) | Specificity (B) | Accuracy (A) | Accuracy (B) | MCC (A) | MCC (B) |
| −1 | 99.32 | 98.41 | 35.87 | 35.47 | 52.30 | 51.58 | 0.35 | 0.33 |
| −0.9 | 98.81 | 97.61 | 48.67 | 47.32 | 61.65 | 60.19 | 0.43 | 0.41 |
| −0.8 | 98.10 | 96.53 | 60.94 | 58.58 | 70.56 | 68.29 | 0.52 | 0.48 |
| −0.7 | 97.05 | 95.23 | 71.56 | 68.78 | 78.16 | 75.55 | 0.60 | 0.56 |
| −0.6 | 95.89 | 93.37 | 79.90 | 77.20 | 84.04 | 81.34 | 0.68 | 0.63 |
| −0.5 | 94.05 | 90.98 | 86.26 | 83.89 | 88.28 | 85.70 | 0.74 | 0.68 |
| −0.4 | 91.83 | 88.19 | 91.00 | 88.85 | 91.22 | 88.69 | 0.79 | 0.73 |
| −0.3 | 89.35 | 84.91 | 94.37 | 92.59 | 93.07 | 90.62 | 0.82 | 0.76 |
| −0.1 | 81.85 | 77.57 | 97.92 | 96.71 | 93.76 | 91.81 | 0.83 | 0.78 |
| 0 | 77.49 | 73.09 | 98.75 | 97.80 | 93.25 | 91.48 | 0.82 | 0.77 |
| 0.1 | 72.72 | 68.52 | 99.18 | 98.53 | 92.33 | 90.85 | 0.80 | 0.75 |
| 0.2 | 67.60 | 63.46 | 99.45 | 99.06 | 91.20 | 89.95 | 0.77 | 0.73 |
| 0.3 | 62.30 | 58.51 | 99.59 | 99.34 | 89.94 | 88.89 | 0.73 | 0.70 |
| 0.4 | 57.26 | 52.82 | 99.73 | 99.54 | 88.74 | 87.59 | 0.70 | 0.66 |
| 0.5 | 51.58 | 47.25 | 99.82 | 99.69 | 87.33 | 86.28 | 0.66 | 0.62 |
| 0.6 | 45.38 | 41.17 | 99.85 | 99.78 | 85.75 | 84.78 | 0.61 | 0.58 |
| 0.7 | 39.59 | 35.47 | 99.88 | 99.84 | 84.27 | 83.37 | 0.57 | 0.53 |
| 0.8 | 32.97 | 29.28 | 99.91 | 99.89 | 82.58 | 81.82 | 0.51 | 0.48 |
| 0.9 | 27.32 | 23.74 | 99.94 | 99.92 | 81.14 | 80.43 | 0.46 | 0.43 |
| 1 | 21.99 | 17.65 | 99.96 | 99.94 | 79.77 | 78.88 | 0.41 | 0.37 |
‘A’ refers to metagenomic Set A, and ‘B’ refers to metagenomic Set B. The point where sensitivity and specificity are roughly equal and MCC is maximal is highlighted in bold and italics.
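The trade-off in the table, with sensitivity falling and specificity rising as the threshold increases, can be illustrated with a small threshold sweep. This is a sketch on toy data, not the authors' evaluation code. The scores and labels are invented for illustration, the function name is hypothetical, and the rule "call a protein pathogenic when its score is at or above the threshold" is an assumption about how the SVM score is used.

```python
def sweep(scores, labels, thresholds):
    """Return (threshold, sensitivity, specificity) for each cutoff.

    Assumption: a protein is called pathogenic when score >= threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    results = []
    for t in thresholds:
        sens = sum(s >= t for s in pos) / len(pos)
        spec = sum(s < t for s in neg) / len(neg)
        results.append((t, sens, spec))
    return results


# Toy SVM scores (invented): label 1 = pathogenic, 0 = nonpathogenic.
scores = [1.2, 0.8, 0.3, -0.1, -0.4, 0.1, -0.3, -0.6, -0.9, -1.3]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
thresholds = [round(-1.0 + 0.1 * i, 1) for i in range(21)]  # -1.0 ... 1.0

table = sweep(scores, labels, thresholds)
# First threshold at which sensitivity and specificity are closest.
balanced = min(table, key=lambda row: abs(row[1] - row[2]))
print(balanced)
```

The balance point found here is a property of the toy scores only and says nothing about the real datasets.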
Figure 2. Prediction of pathogenic or nonpathogenic proteins using MP3.
HS: predictions from both HMM and SVM are in consensus, H: prediction is based only on HMM module, S: prediction is based only on SVM module.
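The tag scheme can be sketched as a small decision function. The text here does not spell out how a disagreement between the two modules is resolved, so this sketch assumes that the HMM call takes precedence whenever an HMM hit exists, and that the SVM is used alone otherwise. The function name, argument names and the score convention (pathogenic when the score is at or above the threshold) are hypothetical; only the tags and the default SVM threshold of −0.2 come from the text.

```python
def combine(hmm_call, svm_score, svm_threshold=-0.2):
    """Combine HMM and SVM outputs into a (prediction, tag) pair.

    hmm_call: "Pathogenic", "Non-Pathogenic", or None when no HMM hit
              passes the e-value cutoff (filtering is assumed to happen
              upstream of this function).
    """
    svm_call = "Pathogenic" if svm_score >= svm_threshold else "Non-Pathogenic"
    if hmm_call is None:
        return svm_call, "S"    # only the SVM module gave a prediction
    if hmm_call == svm_call:
        return hmm_call, "HS"   # both modules agree
    return hmm_call, "H"        # assumption: the HMM call wins on conflict


print(combine(None, 0.5))
print(combine("Non-Pathogenic", -0.5))
print(combine("Non-Pathogenic", 0.3))
```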
Performance of MP3 on known bacterial genomes.
| Genome | Type | Number of Pathogenic Proteins (%) |
| | N | 249 (13.2) |
| | P | 1,052 (20.8) |
| | N | 778 (18.6) |
| | P | 1,291 (24.3) |
| | N | 885 (21.5) |
| | N | 261 (13.0) |
| | P | 938 (30.28) |
| | P | 328 (20.4) |
| | N | 485 (10.9) |
| | P | 973 (23.7) |
| | N | 310 (15.7) |
| | P | 330 (16.3) |
| | P | 1,527 (26.2) |
| | N | 1,153 (23.24) |
| | P | 759 (18.7) |
| | P | 474 (18.0) |
The threshold of 0.2 was used to achieve high specificity for the above analysis by SVM module.
Type indicates the pathogenicity of the bacterium: ‘N’ for nonpathogenic and ‘P’ for pathogenic.
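One way to read the table is to compare the average share of predicted pathogenic proteins between the two genome types. The sketch below does only that arithmetic on the percentages listed above. The averages are derived here and are not reported in the source, and they carry no statistical weight on their own.

```python
from statistics import mean

# (type, % of proteins predicted pathogenic), in the order listed in the table
genomes = [
    ("N", 13.2), ("P", 20.8), ("N", 18.6), ("P", 24.3),
    ("N", 21.5), ("N", 13.0), ("P", 30.28), ("P", 20.4),
    ("N", 10.9), ("P", 23.7), ("N", 15.7), ("P", 16.3),
    ("P", 26.2), ("N", 23.24), ("P", 18.7), ("P", 18.0),
]

by_type = {"N": [], "P": []}
for kind, pct in genomes:
    by_type[kind].append(pct)

# Number of genomes and mean percentage for each type.
summary = {kind: (len(values), round(mean(values), 2))
           for kind, values in by_type.items()}
print(summary)
```

The two groups overlap: the highest nonpathogenic value (23.24%) exceeds the lowest pathogenic value (16.3%), so the percentage alone does not separate every genome.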
Performance of MP3 on a second independent dataset consisting of proteins from the virulence plasmid of Shigella flexneri.
| Protein | Secreted | Translocated | Function | Inhibition | MP3 prediction | Tag |
| icsB | Y | Y | Inhibits autophagy | Complete | Pathogenic | S |
| IpgGB2 | N | N | G-protein mimic | Complete | Pathogenic | S |
| IpgD | Y | Y | Inositol phosphate phosphatase | Complete | Pathogenic | S |
| VirA | Y | Y | Microtubule-severing activity | Complete | Pathogenic | S |
| mvpT | N | N | Toxin- plasmid segregation | Complete | Non-Pathogenic | H |
| IpaJ | N | N | Unknown | Complete | Pathogenic | S |
| IpgGB1 | Y | Y | G-protein mimic | Intermediate | Pathogenic | S |
| OspC1 | Y | N | Unknown | Intermediate | Pathogenic | S |
| OspD3 | N | N | Unknown | Intermediate | Non-Pathogenic | S |
| OspF | Y | Y | MAPK phosphothreonine | Intermediate | Pathogenic | S |
| parA | N | N | Plasmid segregation | Intermediate | Non-Pathogenic | HS |
| OspB | Y | N | Unknown | Weak | Pathogenic | S |
Y: Yes; N: No. The complete results for all 38 proteins are provided in Table S8 in File S1.