Sean J Buckley, Robert J Harvey.
Abstract
Group A Streptococcus (GAS) is a globally significant human pathogen. The extensive variability of the GAS genome, virulence phenotypes and clinical outcomes renders it an excellent candidate for genotype-phenotype association studies in the era of whole-genome sequencing. We have catalogued the distribution and diversity of the transcription regulators of GAS, and employed phylogenetics, concordance metrics and machine learning (ML) to test for associations. In this review, we communicate the lessons learnt in the context of recent bacterial genotype-phenotype association studies by others that have utilised both genome-wide association studies (GWAS) and ML. We envisage a promising future for the application of GWAS in bacterial genotype-phenotype association studies and foresee the increasing use of ML. However, progress in this field is hindered by several outstanding bottlenecks. These include the shortcomings observed when GWAS techniques that have been fine-tuned on human genomes are applied to bacterial genomes. Furthermore, there is a deficit of easy-to-use end-to-end workflows, and a lag in the collection of detailed phenotype and clinical genomic metadata. We propose a novel quality control protocol for the collection of high-quality GAS virulence phenotype data coupled to clinical outcome data. Finally, we incorporate this protocol into a workflow for testing genotype-phenotype associations using ML and 'linked' patient-microbe genome sets that better represent the infection event.
Keywords: Streptococcus pyogenes; machine learning; phenotype metadata; random forest; virulence
Year: 2021 PMID: 35004362 PMCID: PMC8739889 DOI: 10.3389/fcimb.2021.809560
Source DB: PubMed Journal: Front Cell Infect Microbiol ISSN: 2235-2988 Impact factor: 5.293
Compilation of the expected classifications of tissues sampled, clinical presentation, and human patient risk factors in group A Streptococcus infection for use in a quality control protocol for the collation of high-quality virulence phenotype metadata.
| Genomic metadata categories | Expected classifications |
|---|---|
| Tissue sampled1 | Epithelial swab, blood, sputum, urine, saliva, synovial fluid, soft tissue, cerebrospinal fluid |
| Clinical presentation1 | Throat carriage, scarlet fever, streptococcal toxic shock syndrome, type II necrotizing fasciitis, pharyngitis, superficial soft tissue infection, deep soft tissue infection, cellulitis, meningitis, pneumonia, bacteraemia, arthritis, puerperal sepsis, genital infection, iGAS, acute phlegmonous gastritis, rheumatic fever, rheumatic heart disease, post-streptococcal glomerulonephritis, paediatric autoimmune neuropsychiatric disorders associated with |
| Human patient risk factors2 | Blood antigen group ( |
1 Expected classifications are adapted from the Davies GAS atlas (Davies et al., 2019). 2 Compliance with human ethics standards is required.
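A quality control check of this kind could be sketched as follows. This is a minimal illustration, not the authors' protocol: the field names `tissue_sampled` and `clinical_presentation`, the function `check_record`, and the case-insensitive matching are all assumptions made for the example. Only the classifications that appear in full in the table are included; the entries that are cut off in the table, and the risk-factor row, are left out rather than guessed.

```python
# Expected classifications copied from the table; truncated entries are omitted.
# Field names and matching rules are hypothetical choices for this sketch.
EXPECTED = {
    "tissue_sampled": {
        "epithelial swab", "blood", "sputum", "urine", "saliva",
        "synovial fluid", "soft tissue", "cerebrospinal fluid",
    },
    "clinical_presentation": {
        "throat carriage", "scarlet fever",
        "streptococcal toxic shock syndrome",
        "type ii necrotizing fasciitis", "pharyngitis",
        "superficial soft tissue infection", "deep soft tissue infection",
        "cellulitis", "meningitis", "pneumonia", "bacteraemia", "arthritis",
        "puerperal sepsis", "genital infection", "igas",
        "acute phlegmonous gastritis", "rheumatic fever",
        "rheumatic heart disease", "post-streptococcal glomerulonephritis",
    },
}


def check_record(record):
    """Return a list of (field, problem) tuples for one isolate's metadata."""
    problems = []
    for field, allowed in EXPECTED.items():
        value = record.get(field)
        if value is None or not str(value).strip():
            # The category was not recorded at all.
            problems.append((field, "missing"))
        elif str(value).strip().lower() not in allowed:
            # The value is present but is not an expected classification.
            problems.append((field, f"unexpected value: {value!r}"))
    return problems


if __name__ == "__main__":
    print(check_record({"tissue_sampled": "Blood",
                        "clinical_presentation": "bacteraemia"}))
    print(check_record({"tissue_sampled": "hair"}))
```

Records that return an empty list pass the check; any other result names the category that needs curation before the isolate is used in an association study.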
Figure 1. Workflow for the application of machine learning in comparative genomics genotype-phenotype association studies. ‘Linked’ samples of microbe, host, and microbiome (optional) are simultaneously collected and sequenced. Recommended virulence phenotype metadata are also collated and assessed for quality. The DNA sequences of target genes (for example, transcription regulators) are extracted and the alleles are typed. Machine learning algorithms are applied to the allele types (predictor variables) in the prediction of response variables (for example, invasive virulence phenotype). The machine learning models are validated by comparison of observed and predicted phenotype data. The most important predictor variables are selected as the basis of dimension reduction, as required. Legend: WGS, whole-genome sequencing; mNGS, metagenomic next-generation sequencing.
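The prediction and validation steps of the workflow can be illustrated with a small, dependency-free sketch. It is not the method used in the review: a simple allele-wise vote count stands in for the machine learning model (a random forest or similar would take its place in practice), validation is done by leave-one-out comparison of observed and predicted phenotypes, and the allele numbers and phenotype labels are invented toy data. The function names `train`, `predict` and `leave_one_out_accuracy` are hypothetical.

```python
from collections import Counter, defaultdict


def train(samples):
    """Count the phenotype labels observed with each (gene, allele) pair."""
    counts = defaultdict(Counter)
    for alleles, phenotype in samples:
        for gene, allele in alleles.items():
            counts[(gene, allele)][phenotype] += 1
    return counts


def predict(counts, alleles, default):
    """Sum label counts over the isolate's alleles and return the top label."""
    votes = Counter()
    for gene, allele in alleles.items():
        votes.update(counts.get((gene, allele), {}))
    # Fall back to the default label when none of the alleles were seen.
    return votes.most_common(1)[0][0] if votes else default


def leave_one_out_accuracy(samples, default="non-invasive"):
    """Fraction of isolates whose held-out prediction matches the observation."""
    hits = 0
    for i, (alleles, observed) in enumerate(samples):
        model = train(samples[:i] + samples[i + 1:])
        hits += predict(model, alleles, default) == observed
    return hits / len(samples)


# Toy data: allele types (predictors) and phenotype (response) are made up.
SAMPLES = [
    ({"covS": 1, "ropB": 1}, "non-invasive"),
    ({"covS": 1, "ropB": 2}, "non-invasive"),
    ({"covS": 1, "ropB": 1}, "non-invasive"),
    ({"covS": 2, "ropB": 1}, "invasive"),
    ({"covS": 2, "ropB": 2}, "invasive"),
    ({"covS": 2, "ropB": 1}, "invasive"),
]

if __name__ == "__main__":
    print(leave_one_out_accuracy(SAMPLES))
```

Holding each isolate out in turn keeps the observed phenotype of the test isolate out of the model that predicts it, which is the point of the validation step in the figure.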