| Literature DB >> 30453987 |
Diogo Manuel Carvalho Leite1,2, Xavier Brochet1,2, Grégory Resch3, Yok-Ai Que4, Aitana Neves2, Carlos Peña-Reyes5,6.
Abstract
BACKGROUND: Antibiotic resistance and its rapid dissemination around the world threaten the efficacy of currently used medical treatments and call for novel, innovative approaches to manage multi-drug-resistant infections. Phage therapy, i.e., the use of viruses (phages) that specifically infect and kill bacteria during their life cycle, is one of the most promising alternatives to antibiotics. It is based on the correct matching between a target pathogenic bacterium and the therapeutic phage. Nevertheless, correctly matching them is a major challenge: there is currently no systematic method to efficiently predict whether a phage-bacterium interaction exists, so candidate pairs must be tested empirically in the laboratory. Herein, we present our approach for developing a computational model able to predict whether a given phage-bacterium pair can interact based on their genomes.
Keywords: Health; Machine learning; Phage-therapy; Supervised learning
Year: 2018 PMID: 30453987 PMCID: PMC6245486 DOI: 10.1186/s12859-018-2388-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
DDI-score-based datasets
| Histogram’s bin generation | Normalized | Values | Abbreviation |
|---|---|---|---|
| Fixed number of bins | Yes | 5, 10, 15, 30, 50 | NBN sets |
| Fixed number of bins | No | 5, 10, 15, 30, 50 | NB sets |
| Fixed-size bins | Yes | 1, 5, 10, 15, 20 | SBN sets |
| Fixed-size bins | No | 1×10−6, 2.5×10−6, 5×10−6 | SB sets |
Different configurations were used to generate 18 datasets based on the frequency distribution of domain-domain interaction scores. The datasets fall into four types according to (1) whether the histogram’s bins are defined with a fixed size or a fixed number, and (2) whether the score frequencies are absolute or normalized values.
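As an illustration of how such histogram features could be derived (a minimal sketch of our reading of the table above, not the authors' code), each phage-bacterium pair is summarized by a histogram of its domain-domain interaction (DDI) scores, binned either with a fixed number of bins (NBN/NB sets) or with fixed-size bins (SBN/SB sets), with counts optionally normalized into frequencies:

```python
import numpy as np

def ddi_histogram(scores, mode="fixed_number", bins=10, bin_size=1e-6,
                  normalized=True):
    """Summarize a pair's DDI scores as a histogram feature vector.

    mode="fixed_number": `bins` equal-width bins over the score range.
    mode="fixed_size":   bins of width `bin_size` from 0 up to the max score.
    normalized=True divides counts by the total, yielding frequencies.
    """
    scores = np.asarray(scores, dtype=float)
    if mode == "fixed_number":
        counts, _ = np.histogram(scores, bins=bins)
    else:  # fixed-size bins
        edges = np.arange(0.0, scores.max() + bin_size, bin_size)
        counts, _ = np.histogram(scores, bins=edges)
    counts = counts.astype(float)
    if normalized:
        counts /= counts.sum()
    return counts

# Example: 10 fixed-number bins, normalized (an "NBN"-style feature vector)
features = ddi_histogram(np.random.rand(100) * 1e-5, mode="fixed_number", bins=10)
```

The function name and parameters are hypothetical; only the four binning configurations come from the table.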
Fig. 1 Heatmaps summarizing the F1 scores obtained during the exploratory phase. Each heatmap shows the results obtained by all configurations for one method: K-NN (top left), RF (bottom left), SVM (top right), and NN (bottom right). Rows correspond to the different datasets and columns to the different configurations. The vertical white lines indicate a change in one parameter value: the number of neurons for ANN and the penalty factor for SVM
Fig. 2 F1-score results obtained in the refinement phase. Each row represents a different dataset, while the columns correspond to the different combinations of parameter values detailed in Table 2. Changes in the number of neurons are indicated by the vertical white lines
Configurations used for each machine-learning algorithm in both the exploratory and refinement modeling phases
| Method | Parameter | Exploratory | Refinement |
|---|---|---|---|
| K-NN | K | {1, 2, 3, 4, 5, 6, 7, 8, 9} | |
| RF | N-trees | {10^2, 10^3, 10^4} | |
| RF | L-size | {2, 3, 4} | |
| SVM | Penalty | {10^4, 10^3, ..., 10^-2} | |
| SVM | Momentum | {10^-4, 10^-3, ..., 10^4} | |
| ANN | N-neurons | {2, 3, 4, 5, 6} | {7, 8, 9, 10} |
| ANN | Epochs | {10, 25, 50, 75, 100} | {10, 25, 50, 75, 100} |
| ANN | Momentum | {0.1, 0.4, 0.7} | {0.1, 0.4, 0.7} |
| Datasets | | All 19 sets | SB1E-6, NBN50, NB50, CH |
Summary of the results obtained by the selected modeling approach (i.e., ANN with 9 neurons in the hidden layer) under both validation and test conditions.
| Dataset | Accuracy (Val.) | Accuracy (Test) | F-score (Val.) | F-score (Test) | Sensitivity (Val.) | Sensitivity (Test) | Specificity (Val.) | Specificity (Test) |
|---|---|---|---|---|---|---|---|---|
| CH | 99.0% | 97.9% | 99.0% | 97.0% | 99.9% | 97.5% | 98.6% | 98.3% |
| SBN1E-6 | 90.4% | 85.7% | 90.6% | 86.2% | 90.5% | 85.4% | 90.9% | 86.3% |
| NB50 | 91.4% | 88.2% | 91.7% | 88.5% | 91.1% | 88.6% | 92.1% | 87.7% |
| NBN50 | 92.4% | 89.8% | 92.5% | 90.1% | 93.6% | 90.7% | 91.3% | 88.8% |
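The selected architecture (a single hidden layer of 9 neurons trained with momentum, per the configuration table) can be sketched with scikit-learn's `MLPClassifier`. This is an illustrative stand-in, not the authors' implementation: the toy dataset replaces the real histogram features, and the solver and momentum value are assumptions chosen to mirror the table.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# Toy stand-in for histogram feature vectors labelled interacting / non-interacting
X, y = make_classification(n_samples=400, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One hidden layer of 9 neurons, SGD with momentum, as in the refinement phase
clf = MLPClassifier(hidden_layer_sizes=(9,), solver="sgd", momentum=0.7,
                    max_iter=500, random_state=0)
clf.fit(X_train, y_train)
score = f1_score(y_test, clf.predict(X_test))
```

The F-score computed here plays the same role as the F-score column in the table above, evaluated on a held-out split.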