Sudheer Gupta, Ashok K Sharma, Shubham K Jaiswal, Vineet K Sharma.
Abstract
Approximately 75% of microbial infections found in humans are caused by microbial biofilms. These biofilms are resistant to the host immune system and to most currently available antibiotics. Small peptides have been extensively studied as anti-microbial peptides; however, only a limited number of studies have shown their potential as inhibitors of biofilms. Therefore, to develop a computational method for predicting biofilm inhibiting peptides, the sequences of experimentally validated biofilm inhibiting peptides were used to extract sequence-based features and to identify unique sequence motifs. Biofilm inhibiting peptides were observed to be abundant in positively charged and aromatic amino acids, and also showed selective abundance of some dipeptides and sequence motifs. These individual sequence-based features were used to construct Support Vector Machine-based prediction models, and hybrid models were constructed by additionally including sequence motif information. Using 10-fold cross validation, the hybrid model displayed an accuracy of 97.83% and a Matthews Correlation Coefficient (MCC) of 0.87. On the validation dataset, the hybrid model showed an accuracy of 97.19% and an MCC of 0.84. The validated model and other tools developed for the prediction of biofilm inhibiting peptides are freely available as a web server at http://metagenomics.iiserb.ac.in/biofin/ and http://metabiosys.iiserb.ac.in/biofin/.
Keywords: biofilm; machine learning; peptides; random forest; support vector machine
Year: 2016 PMID: 27379078 PMCID: PMC4909740 DOI: 10.3389/fmicb.2016.00949
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Figure 1. Flowchart showing the steps involved in the development of the prediction model and web server.
Figure 2. Compositional analysis of biofilm inhibiting peptides (BIPs) and biofilm non-inhibiting peptides (non-BIPs). Positively charged amino acids (Lys and Arg) and aromatic amino acids (Trp, Tyr, and Phe) were found to be abundant in BIPs, whereas non-BIPs were rich in negatively charged amino acids, such as Glu and Asp.
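The amino acid composition (AAC) and dipeptide composition (DPC) features used as model inputs can be illustrated with a minimal sketch. It assumes the common definitions (percentage of each of the 20 residues, and percentage of each of the 400 ordered residue pairs); the function names `aac` and `dpc` are hypothetical, and the exact normalization used by the authors is not stated in this record. The example peptide is taken from the top-scoring peptide table.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: percentage of each residue (20 values).

    Assumes a non-empty sequence of standard residues.
    """
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS}


def dpc(seq):
    """Dipeptide composition: percentage of each ordered pair (400 values).

    Assumes the sequence has at least two residues.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return {
        a + b: 100.0 * counts[a + b] / len(pairs)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


peptide = "KKLFKVVKKRGI"  # one of the 12-mers listed in this record
aac_vec = aac(peptide)
dpc_vec = dpc(peptide)

# Share of positively charged residues (Lys + Arg), in percent
positive_share = aac_vec["K"] + aac_vec["R"]
print(f"Lys+Arg: {positive_share:.1f}%  KK dipeptide: {dpc_vec['KK']:.1f}%")
```

For this peptide, Lys and Arg together make up half of the residues, in line with the enrichment of positively charged amino acids described for BIPs.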
Performance (10-fold cross-validation) of SVM-based models on both balanced and realistic datasets using AAC and DPC features as input.
| Dataset | Feature | Kernel | Threshold | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | Parameters |
|---|---|---|---|---|---|---|---|---|---|
| Balanced dataset | AAC | t0 | 0.1 | 90.14 | 93.66 | 91.9 | 0.84 | 0.94 | t:0,c:80 |
| Balanced dataset | AAC | t1 | −0.1 | 86.62 | 96.48 | 91.55 | 0.84 | 0.96 | t:1,d:2 |
| Balanced dataset | AAC | t2 | 0.2 | 93.66 | 90.85 | 92.25 | 0.85 | 0.97 | g:0.001,c:0.5,j:4 |
| Balanced dataset | DPC | t0 | −0.1 | 90.14 | 89.44 | 89.79 | 0.8 | 0.96 | t:0,c:990 |
| Balanced dataset | DPC | t1 | −0.2 | 90.85 | 94.37 | 92.61 | 0.85 | 0.96 | t:1,d:1 |
| Balanced dataset | DPC | t2 | 0.1 | 84.51 | 95.77 | 90.14 | 0.81 | 0.96 | g:0.001,c:1,j:1 |
| Realistic dataset | AAC | t0 | 0.3 | 69.72 | 99.3 | 96.62 | 0.78 | 0.95 | t:0,c:5 |
| Realistic dataset | AAC | t1 | −0.6 | 88.03 | 98.38 | 97.45 | 0.85 | 0.98 | t:1,d:3 |
| Realistic dataset | AAC | t2 | −0.2 | 86.62 | 98.74 | 97.64 | 0.86 | 0.98 | g:0.001,c:10,j:1 |
| Realistic dataset | DPC | t0 | 0.5 | 78.87 | 97.89 | 96.17 | 0.77 | 0.95 | t:0,c:990 |
| Realistic dataset | DPC | t1 | −0.3 | 79.58 | 98.67 | 96.93 | 0.81 | 0.96 | t:0,d:1 |
| Realistic dataset | DPC | t2 | −0.3 | 83.8 | 98.88 | 97.51 | 0.85 | 0.96 | g:0.001,c:1,j:2 |
The models constructed using the balanced dataset displayed a highest MCC value of 0.85 for both AAC and DPC. The models constructed using the realistic dataset displayed highest MCC values of 0.86 and 0.85 with AAC and DPC as input features, respectively.
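To make the relationship between the reported metrics concrete, the following minimal sketch recomputes accuracy and MCC from sensitivity and specificity. It assumes that the two percentage columns preceding accuracy in the table are sensitivity and specificity, and that a balanced dataset has equal numbers of positive and negative peptides (only the class ratio matters here, so 1:1 is used). The helper names are hypothetical.

```python
import math


def confusion_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Expected confusion-matrix entries from rates given as fractions (0 to 1)."""
    tp = sensitivity * n_pos
    fn = n_pos - tp
    tn = specificity * n_neg
    fp = n_neg - tn
    return tp, fn, tn, fp


def accuracy(tp, fn, tn, fp):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + tn + fp)


def mcc(tp, fn, tn, fp):
    """Matthews Correlation Coefficient; returns 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Balanced dataset, AAC features, kernel t0: 90.14 and 93.66 in the table.
# Class sizes of 1 and 1 encode the assumed 1:1 ratio.
cells = confusion_from_rates(0.9014, 0.9366, n_pos=1, n_neg=1)
acc_pct = 100.0 * accuracy(*cells)
mcc_val = mcc(*cells)
print(f"accuracy = {acc_pct:.2f}%, MCC = {mcc_val:.2f}")
```

Under these assumptions the recomputed values agree with the 91.9% accuracy and 0.84 MCC listed for that row. The same check cannot be run for the realistic dataset without its class ratio, which is not given in this record.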
Performance (10-fold cross-validation) of RF models on both balanced and realistic datasets using AAC and DPC features as input at optimized parameters.
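| Dataset | Feature | Threshold | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | Parameters |
|---|---|---|---|---|---|---|---|---|
| Balanced dataset | AAC | 0.5 | 89.44 | 94.37 | 91.90 | 0.84 | 0.96 | mtry:8,ntree:500 |
| Balanced dataset | DPC | 0.5 | 90.85 | 84.51 | 87.68 | 0.76 | 0.95 | mtry:10,ntree:500 |
| Realistic dataset | AAC | 0.5 | 75.35 | 99.44 | 97.25 | 0.82 | 0.97 | mtry:8,ntree:500 |
| Realistic dataset | DPC | 0.5 | 58.45 | 99.79 | 96.04 | 0.73 | 0.95 | mtry:10,ntree:500 |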
The AAC-based RF models achieved a maximum MCC value of 0.84, with an accuracy of 91.9%, on the balanced dataset and performed better than the DPC-based RF models on both datasets.
Performance (10-fold cross-validation) of SVM-based models on both balanced and realistic datasets using the composition-motif hybrid features.
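| Dataset | Feature | Kernel | Threshold | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC | Parameters |
|---|---|---|---|---|---|---|---|---|---|
| Balanced dataset | AAC_Motif hybrid | t0 | 0.3 | 91.55 | 94.37 | 92.96 | 0.86 | 0.95 | t:0,c:60 |
| Balanced dataset | AAC_Motif hybrid | t1 | 0.2 | 90.14 | 97.18 | 93.66 | 0.88 | 0.95 | t:1,d:1 |
| Balanced dataset | AAC_Motif hybrid | t2 | 0.9 | 90.14 | 97.18 | 93.66 | 0.88 | 0.97 | g:0.001,c:0.05,j:4 |
| Balanced dataset | DPC_Motif hybrid | t0 | 0.4 | 87.32 | 95.07 | 91.2 | 0.83 | 0.97 | t:0,c:990 |
| Balanced dataset | DPC_Motif hybrid | t1 | −0.4 | 93.66 | 97.18 | 95.42 | 0.91 | 0.96 | t:1,d:2 |
| Balanced dataset | DPC_Motif hybrid | t2 | 0.1 | 92.25 | 95.77 | 94.01 | 0.88 | 0.97 | g:0.001,c:1,j:1 |
| Realistic dataset | AAC_Motif hybrid | t0 | 0.3 | 75.35 | 99.3 | 97.13 | 0.82 | 0.96 | t:0,c:5 |
| Realistic dataset | AAC_Motif hybrid | t1 | −0.6 | 89.44 | 98.38 | 97.57 | 0.86 | 0.98 | t:1,d:3 |
| Realistic dataset | AAC_Motif hybrid | t2 | −0.3 | 88.73 | 98.67 | 97.77 | 0.87 | 0.98 | g:0.001,c:4,j:1 |
| Realistic dataset | DPC_Motif hybrid | t0 | 0.5 | 78.87 | 97.89 | 96.17 | 0.77 | 0.95 | t:0,c:990 |
| Realistic dataset | DPC_Motif hybrid | t1 | −0.3 | 81.69 | 99.02 | 97.45 | 0.84 | 0.97 | t:1,d:2 |
| Realistic dataset | DPC_Motif hybrid | t2 | −0.3 | 85.92 | 99.02 | 97.83 | 0.87 | 0.97 | g:0.001,c:2,j:1 |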
The models constructed using the balanced dataset displayed highest MCC values of 0.88 and 0.91 for the AAC_Motif hybrid and DPC_Motif hybrid models, respectively. The models constructed using the realistic dataset displayed a highest MCC value of 0.87 for both the AAC_Motif hybrid and DPC_Motif hybrid models.
Performance of composition-motif hybrid models on the validation dataset.
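| Dataset | Feature | Kernel | Threshold | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Balanced dataset | AAC_Motif hybrid | t0 | 0.3 | 94.44 | 86.11 | 90.28 | 0.81 | 0.96 |
| Balanced dataset | AAC_Motif hybrid | t1 | 0.2 | 91.67 | 97.22 | 94.44 | 0.89 | 1 |
| Balanced dataset | AAC_Motif hybrid | t2 | 0.9 | 91.67 | 97.22 | 94.44 | 0.89 | 0.99 |
| Balanced dataset | DPC_Motif hybrid | t0 | 0.4 | 88.89 | 94.44 | 91.67 | 0.83 | 0.99 |
| Balanced dataset | DPC_Motif hybrid | t1 | −0.4 | 91.67 | 100 | 95.83 | 0.92 | 1 |
| Balanced dataset | DPC_Motif hybrid | t2 | 0.1 | 91.67 | 100 | 95.83 | 0.92 | 0.99 |
| Realistic dataset | AAC_Motif hybrid | t0 | 0.3 | 72.22 | 98.03 | 95.66 | 0.73 | 0.98 |
| Realistic dataset | AAC_Motif hybrid | t1 | −0.6 | 97.22 | 96.07 | 96.17 | 0.81 | 0.99 |
| Realistic dataset | AAC_Motif hybrid | t2 | −0.3 | 91.67 | 97.19 | 96.68 | 0.82 | 0.99 |
| Realistic dataset | DPC_Motif hybrid | t0 | 0.5 | 83.33 | 95.51 | 94.39 | 0.71 | 0.97 |
| Realistic dataset | DPC_Motif hybrid | t1 | −0.3 | 86.11 | 98.03 | 96.94 | 0.82 | 0.98 |
| Realistic dataset | DPC_Motif hybrid | t2 | −0.3 | 91.67 | 97.75 | 97.19 | 0.84 | 0.99 |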
On the validation set, the DPC_Motif hybrid models displayed the highest MCC values, 0.92 and 0.84 on the balanced and realistic datasets, respectively.
Top 20 scoring peptides (12-mers) found in protein sequences from different gut-associated species of Lactobacillus and Bifidobacterium.
| KKLFKVVKKRGI | Peptide chain release factor 3 | 4 | 0.63 | ||
| IKQVKKLFKKWG | Bacteriocin plantaricin-A | 1 | 0.51 | ||
| AIKQVKKLFKKW | Bacteriocin plantaricin-A | 1 | 0.49 | ||
| TKKLFKVVKKRG | Peptide chain release factor 3 | 4 | 0.47 | ||
| KKRIHELLRTLK | Putative ABC transporter ATP-binding protein BL0043 | 1 | 0.41 | ||
| DRIKKAAKKIQN | Glucose-6-phosphate isomerase | 1 | 0.38 | ||
| RIKKAAKKIQND | Glucose-6-phosphate isomerase | 1 | 0.38 | ||
| KQVKKLFKKWGW | Bacteriocin plantaricin-A | 1 | 0.37 | ||
| QTKKLFKVVKKR | Peptide chain release factor 3 | 4 | 0.36 | ||
| NRKKHVIRVCQD | tRNA(Ile)-lysidine synthase | 2 | 0.35 | ||
| RIGDRVIRAARV | Protein GrpE | 1 | 0.31 | ||
| RKKHVIRVCQDG | tRNA(Ile)-lysidine synthase | 5 | 0.30 | ||
| QAKKRIHELLRT | Putative ABC transporter ATP-binding protein BL0043 | 1 | 0.29 | ||
| PAAVLLKKAAKV | 50S ribosomal protein L11 | 1 | 0.29 | ||
| DKIVKKIFKKYS | Ribosomal RNA small subunit methyl transferase H | 2 | 0.26 | ||
| IKKAYRKLSKKY | Chaperone protein DnaJ | 10 | 0.25 | ||
| KIVKKIFKKYSE | Ribosomal RNA small subunit methyl transferase H | 4 | 0.24 | ||
| AQAKKRIHELLR | Putative ABC transporter ATP-binding protein BL0043 | 1 | 0.23 | ||
| AKKRIHELLRTL | Putative ABC transporter ATP-binding protein BL0043 | 1 | 0.22 | ||
| EDKIVKKIFKKY | Ribosomal RNA small subunit methyl transferase H | 2 | 0.19 |
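Several entries in the table above are overlapping fragments of the same protein, which is what a sliding-window scan produces. The sketch below shows one way to enumerate all 12-mers from a sequence so that each window can be scored by a prediction model. The function name `sliding_kmers` is hypothetical, and the 14-residue fragment is an assumption assembled here from three overlapping plantaricin-A entries in the table, not a sequence quoted from the source.

```python
def sliding_kmers(sequence, k=12):
    """Return every overlapping window of length k, moving one residue at a time.

    A sequence shorter than k yields an empty list.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Assumed fragment, reconstructed from overlapping 12-mers in the table
fragment = "AIKQVKKLFKKWGW"
windows = sliding_kmers(fragment)

for start, window in enumerate(windows):
    print(start, window)
```

A fragment of 14 residues gives three windows, and all three appear among the listed plantaricin-A peptides. In a full scan, each window would then be converted to features and scored, and the windows ranked by score.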