Carlo M Bergamini1, Nicoletta Bianchi2, Valerio Giaccone3, Paolo Catellani3, Leonardo Alberghini3, Alessandra Stella4, Stefano Biffani4, Sachithra Kalhari Yaddehige3, Tania Bobbo4,5, Cristian Taccioli3.
Abstract
Probiotic bacteria are microorganisms with beneficial effects on human health and are currently used in numerous food supplements. However, no selection process is able to effectively distinguish probiotics from non-probiotic organisms on the basis of their genomic characteristics. In the current study, four Machine Learning algorithms were employed to accurately identify probiotic bacteria based on their DNA characteristics. Although the prediction accuracies of all algorithms were excellent, the Neural Network returned the highest scores in all the evaluation metrics, managing to discriminate probiotics from non-probiotics with an accuracy greater than 90%. Interestingly, our analysis also highlighted the information content of the tRNA sequences as the most important feature in distinguishing the two groups of organisms, probably because tRNAs have regulatory functions and might have allowed probiotics to evolve faster in the human gut environment. Through the methodology presented here, it was also possible to identify seven promising new probiotics that have a higher information content in their tRNA sequences compared to non-probiotics. In conclusion, we prove for the first time that Machine Learning methods can discriminate human probiotic from non-probiotic organisms, underlining information within tRNA sequences as the most important genomic feature in distinguishing them.
Keywords: Chargaff’s Second Parity rule; Machine Learning; Shannon’s Entropy; probiotics; tRNA
Year: 2022 PMID: 36101405 PMCID: PMC9311688 DOI: 10.3390/biology11071024
Source DB: PubMed Journal: Biology (Basel) ISSN: 2079-7737
Figure 1. Results of the recursive feature elimination incorporating 1 to all investigated features. An RF analysis was conducted to predict the probiotic/non-probiotic status. The number of features included in the model and the accuracy of prediction are shown on the x-axis and on the y-axis, respectively.
Selected features (n = 16) identified using the most parsimonious and performant model (prediction accuracy = 90.9%).
| Id | Selected Features | Description |
|---|---|---|
| Genome | bp_genome_total | Genome size |
| | bp_genA | Total number of Adenines (within the genome) |
| | bp_genT | Total number of Thymines (within the genome) |
| | fr_genG | Frequency of Guanines (number of Guanines divided by DNA total length) within the genome |
| | genomic_shannon_score | Shannon’s Entropy of total genome sequence |
| CDS | n_cds_total | Total number of CDS elements (Coding DNA Sequences) |
| | bp_cds_total | Total number of CDS nucleotides |
| | bp_cdsA | Total number of CDS Adenines |
| | bp_cdsG | Total number of CDS Cytosines |
| | bp_cdsT | Total number of CDS Thymines |
| | cds_chargaff_score_ct | Chargaff’s Second Parity rule score of total CDS sequence (ct method) |
| | cds_chargaff_score_pf | Chargaff’s Second Parity rule score of total CDS sequence (pf method) |
| | cds_shannon_score | Shannon’s Entropy value of total CDS sequence |
| tRNA | tRNA_chargaff_score_ct | Chargaff’s Second Parity rule score of total tRNA sequence (ct method) |
| | tRNA_chargaff_score_pf | Chargaff’s Second Parity rule score of total tRNA sequence (pf method) |
| | tRNA_shannon_score | Shannon’s Entropy value of total tRNA sequence |
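Several of the features listed above (base counts, base frequencies, Shannon's Entropy) are simple functions of a nucleotide sequence. The following is a minimal sketch of how such values could be computed, not the authors' pipeline. It assumes entropy is taken over the single-nucleotide distribution in bits; the record does not state the alphabet or word size actually used, and the function names and the toy sequence are hypothetical.

```python
import math
from collections import Counter


def base_counts(seq):
    """Count A, C, G and T in a DNA sequence (case-insensitive).

    Any other symbol (e.g. N) is ignored in the returned counts.
    """
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def base_frequency(seq, base):
    """Occurrences of `base` divided by the total sequence length."""
    seq = seq.upper()
    return seq.count(base.upper()) / len(seq) if seq else 0.0


def shannon_entropy(seq):
    """Shannon entropy (bits) of the single-nucleotide distribution.

    Assumption: entropy over A/C/G/T frequencies only; the study may
    have used a different definition (e.g. k-mers).
    """
    counts = base_counts(seq)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        if n:
            p = n / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical toy sequence, for illustration only.
toy = "ACGTACGTAACC"
print(base_counts(toy))
print(base_frequency(toy, "G"))
print(shannon_entropy(toy))
```

Under this definition the entropy ranges from 0 bits (a single repeated base) to 2 bits (all four bases equally frequent), so a higher score indicates a more even base composition.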
Accuracy and Kappa value comparing the performance of the methods on the validation set. Prediction models were developed using four different machine learning methods: GLM, RF, SVM and NN.
| Method | Accuracy | Kappa Value |
|---|---|---|
| GLM | 0.936 | 0.869 |
| RF | 0.941 | 0.880 |
| SVM | 0.948 | 0.895 |
| NN | 0.951 | 0.900 |
Figure 2. Feature importance plot showing the ranking of the selected features for the prediction of the probiotic/non-probiotic status, using NN as predictive method. Detailed information of the genomic features is fully explained in File S1.
Metrics (accuracy and 95% Confidence Interval (CI), sensitivity (Se), specificity (Sp), precision, Kappa value, F1 score, MCC and area under the receiver operating characteristic curve (AUC)) comparing the performance of the methods on the test set. Prediction models were developed using four different ML methods: GLM, RF, SVM and NN.
| Method | Accuracy | 95% CI | Se | Sp | Precision | Kappa | F1 Score | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|
| GLM | 0.667 | 0.349–0.901 | 0.667 | 0.667 | 0.857 | 0.273 | 0.750 | 0.293 | 0.630 |
| RF | 0.750 | 0.423–0.945 | 0.778 | 0.667 | 0.875 | 0.400 | 0.823 | 0.408 | 0.704 |
| SVM | 0.833 | 0.516–0.979 | 0.778 | 1.000 | 1.000 | 0.636 | 0.875 | 0.683 | 0.815 |
| NN | 0.833 | 0.516–0.979 | 0.778 | 1.000 | 1.000 | 0.636 | 0.875 | 0.683 | 0.815 |
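The wide confidence intervals in this table follow from the small test set (12 organisms). As a sketch, the interval reported for SVM and NN (10 of 12 correct) is consistent with an exact binomial (Clopper-Pearson) interval. That this is the method the authors used is an assumption, since the record does not name it, and the helper names below are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial CI for k successes out of n trials."""

    def solve(f):
        # f is increasing in p on [0, 1]; bisect for the root f(p) = 0.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: smallest p with P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: largest p with P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


# 10 correct predictions out of 12 test organisms (accuracy 0.833).
low, high = clopper_pearson(10, 12)
print(f"{low:.3f} to {high:.3f}")
```

With only 12 test organisms, a single additional misclassification shifts the accuracy by about 8 percentage points, which is why the intervals of all four methods overlap heavily.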
Figure 3. False negative error (FN.ERR), false positive error (FP.ERR) and total error (TOT.ERR) in predicting the probiotic/non-probiotic status on the test set of four ML methods: GLM, RF, SVM and NN.
Microorganisms included in the test dataset.
| Species | Order | NN Classification |
|---|---|---|
| | Rickettsiales | Non-probiotic (referred to as a pathogen in the literature) |
| | Enterobacterales | Non-probiotic (referred to as a pathogen in the literature) |
| | Vibrionales | Non-probiotic (referred to as a pathogen in the literature) |
| | Bacteroidales | Non-probiotic (referred to as a possible probiotic in the literature) |
| | Bacteroidales | Non-probiotic (referred to as a possible probiotic in the literature) |
| | Lactobacillales | Probiotic (referred to as a possible probiotic in the literature) |
| | Verrucomicrobiales | Probiotic (referred to as a possible probiotic in the literature) |
| | Lactobacillales | Probiotic (referred to as a possible probiotic in the literature) |
| | Lactobacillales | Probiotic (referred to as a possible probiotic in the literature) |
| | Lactobacillales | Probiotic (referred to as a possible probiotic in the literature) |
| | Lactobacillales | Probiotic (referred to as a possible probiotic in the literature) |
| | Eubacteriales | Probiotic (referred to as a possible probiotic in the literature) |
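The NN classifications in this table can be tallied into a confusion matrix and checked against the NN row of the test-set metrics. The sketch below assumes that the literature label is the ground truth and that "possible probiotic" is the positive class, which gives 7 true positives, 2 false negatives (the two Bacteroidales), 3 true negatives (the pathogens) and 0 false positives. The function name is hypothetical, and AUC is omitted because it cannot be derived from these counts alone.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (true for the counts used here).
    """
    n = tp + fn + tn + fp
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "kappa": kappa,
        "f1": f1,
        "mcc": mcc,
    }


# Counts tallied from the test-set table (assumed ground truth: literature label).
nn = classification_metrics(tp=7, fn=2, tn=3, fp=0)
for name, value in nn.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these values match the accuracy, Se, Sp, precision, Kappa, F1 score and MCC reported for NN on the test set.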