Nadejda Lupolova, Samantha J Lycett, David L Gally.
Abstract
With the ever-expanding number of available bacterial genome sequences, and the expectation that this data type will be the primary one generated by both diagnostic and research laboratories for the foreseeable future, there is both an opportunity and a need to evaluate how effectively computational approaches can be used within bacterial genomics to predict and understand complex phenotypes, such as pathogenic potential and host source. This article applies various quantitative methods, such as diversity indexes, pangenome-wide association studies (GWAS) and dimensionality reduction techniques, to better understand the data, and then compares how well unsupervised and supervised machine learning (ML) methods predict the source host of the isolates. The study uses the pangenomes of 1203 Salmonella enterica serovar Typhimurium isolates as an example, with 'host of isolation' as the phenotype to predict. The article is intended as a review of recent applications of ML in infection biology but, by working through this specific dataset, it also allows discussion of the advantages and drawbacks of the different techniques. As with all such sub-population studies, the biological relevance depends on the quality and diversity of the input data. Given this major caveat, we show that supervised ML has the potential to add real value to the interpretation of bacterial genomic data, as it can provide probabilistic outcomes for important phenotypes, something that is very difficult to achieve with the other methods.
Keywords: Salmonella; host attribution; host specificity; machine learning; whole-genome sequences
Year: 2019 PMID: 31778355 PMCID: PMC6939162 DOI: 10.1099/mgen.0.000317
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Basic description of data workflow

| Description | Purpose | Methods |
|---|---|---|
| Whole-genome sequences quality control, mapping to reference, | | |
| What describes the data? | Features extraction | SNPs, k-mers, proteins, … |
| What are the most important descriptors? | Features selection | Pan and core GWAS, chi-square, recursive feature elimination algorithms, … |
| Can descriptors be combined/transformed? | Feature transformation | Scaling and centring, PCA, MDS, t-SNE, auto-encoders, … |
| Group data by underlying similarities | Unsupervised ML | Phylogeny, k-means, hierarchical clustering, … |
| Find the hidden patterns in a defined class; classify unknown data | Supervised ML | Random forest, neural networks, SVM, k-nearest-neighbour, … |
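
The workflow table lists chi-square among the feature-selection methods. As a minimal sketch of that idea (not the authors' pipeline), the Pearson chi-square statistic for one presence/absence feature against one host label can be computed from a 2×2 contingency table and used to rank features. The function name `chi_square_2x2` and all counts below are hypothetical, chosen only for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table
        [[a, b],   # feature present: isolates from the host, from other hosts
         [c, d]]   # feature absent:  isolates from the host, from other hosts
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column: no association can be measured
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts (a, b, c, d) for three features across 20 isolates.
features = {
    "feature_1": (10, 0, 0, 10),  # perfectly associated with the host
    "feature_2": (5, 5, 5, 5),    # independent of the host
    "feature_3": (8, 3, 2, 7),    # partially associated
}

# Rank features from most to least associated with the host label.
ranked = sorted(features, key=lambda f: chi_square_2x2(*features[f]), reverse=True)
for name in ranked:
    print(name, round(chi_square_2x2(*features[name]), 2))
```

In practice a statistic like this would be converted to a p-value and corrected for multiple testing before any feature is retained; that step is omitted here.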
Summary of analysis methods with links to tutorials

| No. | Method | Tutorial |
|---|---|---|
| 1 | Principal component analysis (PCA) | PCA tutorial |
| | | |
| 1 | k-means | k-means tutorial |
| 2 | Agglomerative hierarchical clustering (AHC) | AHC tutorial |
| 3 | Divisive hierarchical clustering (DHC) | DHC tutorial |
| 4 | Latent Dirichlet allocation (LDA) | LDA tutorial |
| | | |
| 1 | Support vector machines (SVMs) | SVM tutorial |
| 2 | Random forest (RF) | RF tutorial |
| 3 | Neural network (NN) | NN tutorial |
Fig. 1. Salmonella enterica serovar Typhimurium pangenome exploration. Colours represent host: avian (yellow), bovine (red), human (blue), swine (pink). (a) Shannon index calculated for each host based on differential PVs. (b) Beta-dispersions show multivariate homogeneity of isolates in each host. Non-Euclidean distances between objects were reduced to principal coordinates (x-axis and y-axis). Ellipses indicate 1 SD from each host centroid, which is marked with a letter. (c) The 4041 PVs (x-axis) and their proportions of presence (y-axis) vary between the different hosts (colours as defined above). PVs that are significantly associated with each host (as calculated by pangenome GWAS, Scoary) are plotted in black. The bottom panel shows best-fit lines (Loess) for the distribution of differential PVs from all hosts. (d) Numbers of PVs significantly associated with host and overlap of differential PVs between the hosts. (e) Ordered dissimilarity matrix based on differential PVs. Heatmap colours: red (high) and blue (low) similarity. Labels are coloured by host.
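
Panel (a) of Fig. 1 reports a Shannon index per host. The sketch below shows one common definition of that index, H = −Σ pᵢ ln pᵢ, computed from a list of counts. How the counts were derived from the differential PVs is not stated in the caption, so the `toy_counts` values are hypothetical and the function is an illustration rather than the authors' calculation.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over categories with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical counts of four categories observed in two hosts.
toy_counts = {
    "avian": [30, 30, 30, 30],  # evenly spread: maximal diversity for 4 categories
    "human": [100, 10, 5, 5],   # dominated by one category: lower diversity
}

for host, counts in toy_counts.items():
    print(host, round(shannon_index(counts), 3))
```

An even distribution over four categories gives the maximum value ln 4, and any skew towards one category lowers the index.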
Fig. 2. Unsupervised ML. The colours represent host: avian (yellow), bovine (red), human (blue), swine (pink). The first column of the figure shows each cluster's relative size and composition by host. The second column shows the silhouette index cluster assessment, where each of the four clusters is coloured differently and each isolate is drawn as a bar with its allocated value between −1 and 1. The mean of all individual index values is given at the top of each silhouette plot and is also drawn as a red dotted line through each graph. The clusters are drawn in the same order as those in the first column. The third column illustrates cluster correlation with phylogeny (accessory genome tree), with the inner ring depicting the host and the outer ring the unsupervised ML clusters based on the four-group allocation.
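
The silhouette value drawn for each isolate in Fig. 2 can be sketched as s(i) = (b − a) / max(a, b), where a is the mean distance from isolate i to the other members of its own cluster and b is the smallest mean distance to any other cluster. The example below is a minimal pure-Python illustration on hypothetical presence/absence vectors, using Hamming distance as an assumed metric; it is not the implementation used for the figure.

```python
def hamming(u, v):
    """Number of positions at which two presence/absence vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def silhouette_values(points, labels, dist=hamming):
    """Return one silhouette value in [-1, 1] per point."""
    values = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # Collect distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(label, []).append(dist(p, q))
        own = by_cluster.pop(own_label, [])
        if not own or not by_cluster:
            values.append(0.0)  # singleton cluster or a single cluster: 0 by convention
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


# Hypothetical isolates described by six presence/absence features.
points = [
    (1, 1, 1, 0, 0, 0), (1, 1, 0, 0, 0, 0), (1, 1, 1, 1, 0, 0),
    (0, 0, 0, 1, 1, 1), (0, 0, 1, 1, 1, 1), (0, 0, 0, 0, 1, 1),
]
good = silhouette_values(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_values(points, [0, 0, 1, 1, 1, 1])  # third isolate mis-assigned

print("mean silhouette, sensible clusters:", round(sum(good) / len(good), 4))
print("silhouette of mis-assigned isolate:", round(bad[2], 4))
```

Values near 1 indicate an isolate that sits well inside its cluster, while a negative value, as for the deliberately mis-assigned isolate, indicates that it is closer on average to another cluster.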
Fig. 3. Optimal number of clusters as calculated by: (a) elbow method, (b) silhouette method, (c) gap statistic method. The methods disagreed on the optimal number of clusters in the serovar Typhimurium dataset: the recommended number ranged from 2 using the silhouette method to more than 10 using the gap statistic.
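
Of the three approaches in Fig. 3, the elbow method is the simplest to sketch: run a clustering algorithm for increasing k and look for the point where the within-cluster sum of squares (WCSS) stops falling sharply. The example below uses a small one-dimensional k-means with a deterministic initialisation on hypothetical data. The function `kmeans_1d`, the initialisation scheme and the `toy` values are assumptions made for illustration and are far simpler than a pangenome dataset.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, WCSS)."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: k evenly spaced order statistics of the sorted data.
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        # Move each centroid to the mean of its members (keep it if empty).
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    wcss = sum((x - centroids[i]) ** 2
               for i, members in enumerate(clusters) for x in members)
    return centroids, wcss


# Hypothetical measurements forming three well-separated groups.
toy = [0.8, 1.0, 1.3, 4.7, 5.0, 5.2, 8.9, 9.0, 9.4]
wcss = {k: kmeans_1d(toy, k)[1] for k in range(1, 5)}
for k in range(1, 5):
    print(k, round(wcss[k], 3))
```

On this toy data the WCSS falls steeply up to k = 3 and only marginally afterwards, which is the "elbow". Real data rarely show such a clean break, which is one reason the different criteria can disagree.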
Fig. 4. Supervised ML. The colours represent host: avian (yellow), bovine (red), human (blue), swine (pink). The first column of the figure shows each cluster's relative size and composition by host. The second column shows the silhouette index cluster assessment, where each of the four clusters is coloured differently and each isolate is drawn as a bar with its allocated value between −1 and 1. The mean of all individual index values is given at the top of each silhouette plot and is also drawn as a red dotted line through each graph. The clusters are drawn in the same order as those in the first column. The third column illustrates cluster correlation with phylogeny (accessory genome tree), with the inner ring depicting the host of isolation and the outer ring the supervised ML clusters based on the four-group allocation.
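
The abstract notes that supervised ML can provide probabilistic outcomes for a phenotype. One simple way to see this is with k-nearest-neighbour, which the workflow table lists among supervised methods: the fraction of the k closest training isolates from each host can be read as a crude probability. The sketch below is a toy illustration only. The function `knn_host_probabilities`, the Hamming distance and all vectors are hypothetical, and it is not one of the classifiers compared in the study.

```python
from collections import Counter


def hamming(u, v):
    """Number of positions at which two presence/absence vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def knn_host_probabilities(train, query, k=3):
    """train: list of (vector, host) pairs.

    Returns {host: fraction of the k nearest training isolates from that host}.
    Ties in distance are broken by training order, because sorted() is stable.
    """
    neighbours = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(host for _, host in neighbours)
    hosts = sorted({host for _, host in train})
    return {host: votes[host] / k for host in hosts}


# Hypothetical labelled isolates described by six presence/absence features.
train = [
    ((1, 1, 1, 0, 0, 0), "avian"),
    ((1, 1, 0, 0, 0, 0), "avian"),
    ((1, 1, 1, 1, 0, 0), "avian"),
    ((0, 0, 0, 1, 1, 1), "bovine"),
    ((0, 0, 1, 1, 1, 1), "bovine"),
    ((0, 1, 1, 1, 1, 1), "bovine"),
]

probs = knn_host_probabilities(train, (0, 1, 1, 1, 1, 0), k=3)
print(probs)
```

Here two of the three nearest neighbours are bovine and one is avian, so the query isolate is scored as two-thirds bovine and one-third avian rather than receiving a single hard label.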
Comparison of supervised ML methods by strain numbers assigned to each host

Row names A, B, H and S correspond to the actual host of isolation: avian, bovine, human and swine, respectively. Column names Ap, Bp, Hp and Sp correspond to the predictions for these hosts.

| | Ap | Bp | Hp | Sp | Total |
|---|---|---|---|---|---|
| A | 278 | 16 | 9 | 8 | 311 |
| B | 13 | 234 | 15 | 38 | 300 |
| H | 1 | 25 | 309 | 1 | 336 |
| S | 7 | 45 | 12 | 192 | 256 |
| Total | 299 | 320 | 345 | 239 | 1203 |

| | Ap | Bp | Hp | Sp | Total |
|---|---|---|---|---|---|
| A | 275 | 15 | 15 | 6 | 311 |
| B | 15 | 240 | 18 | 27 | 300 |
| H | 1 | 24 | 309 | 2 | 336 |
| S | 4 | 52 | 10 | 190 | 256 |
| Total | 295 | 331 | 352 | 225 | 1203 |

| | Ap | Bp | Hp | Sp | Total |
|---|---|---|---|---|---|
| A | 281 | 13 | 11 | 6 | 311 |
| B | 14 | 226 | 19 | 41 | 300 |
| H | 1 | 19 | 305 | 11 | 336 |
| S | 6 | 47 | 11 | 192 | 256 |
| Total | 302 | 305 | 346 | 250 | 1203 |
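
The three sub-tables above are confusion matrices, so overall accuracy and per-host recall follow directly from the counts. The sketch below copies the counts as given; the method behind each sub-table is not labelled in this copy, so the keys `sub-table 1` to `sub-table 3` are placeholders, and the helper names are assumptions made for illustration.

```python
HOSTS = ["A", "B", "H", "S"]  # avian, bovine, human, swine

# Rows: actual host of isolation; columns: predicted host (Ap, Bp, Hp, Sp).
# Keys are placeholders because the method names are not given in this copy.
matrices = {
    "sub-table 1": [[278, 16, 9, 8], [13, 234, 15, 38], [1, 25, 309, 1], [7, 45, 12, 192]],
    "sub-table 2": [[275, 15, 15, 6], [15, 240, 18, 27], [1, 24, 309, 2], [4, 52, 10, 190]],
    "sub-table 3": [[281, 13, 11, 6], [14, 226, 19, 41], [1, 19, 305, 11], [6, 47, 11, 192]],
}


def accuracy(matrix):
    """Fraction of isolates on the diagonal (predicted host == actual host)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(map(sum, matrix))


def recall_per_host(matrix):
    """For each actual host, the fraction of its isolates predicted correctly."""
    return {host: matrix[i][i] / sum(matrix[i]) for i, host in enumerate(HOSTS)}


for name, matrix in matrices.items():
    recalls = {h: round(r, 3) for h, r in recall_per_host(matrix).items()}
    print(name, "accuracy:", round(accuracy(matrix), 4), "recall:", recalls)
```

For the first sub-table, 1013 of the 1203 isolates lie on the diagonal, an overall accuracy of about 0.84. In all three sub-tables the bovine and swine rows have lower recall than the avian and human rows, with most of their errors falling in each other's columns.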