Sean J Buckley, Robert J Harvey, Zack Shan.
Abstract
Group A Streptococcus (GAS) is a globally significant bacterial pathogen. The GAS genotyping gold standard characterises the nucleotide variation of emm, which encodes a surface-exposed protein that is recombinogenic and under immune-based selection pressure. Within a supervised learning methodology, we tested three random forest (RF) algorithms (Guided, Ordinary, and Regularized) and 53 GAS response regulator (RR) allele types to infer six genomic traits (emm-type, emm-subtype, tissue and country of sample, clinical outcomes, and isolate invasiveness). The Guided, Ordinary, and Regularized RF classifiers inferred the emm-type with accuracies of 96.7%, 95.7%, and 95.2%, using ten, three, and four RR alleles in the feature set, respectively. Notably, we inferred the emm-type with 93.7% accuracy using only mga2 and lrp. We demonstrated a utility for inferring emm-subtype (89.9%), country (88.6%), and invasiveness (84.7%), but not clinical outcome (56.9%) or tissue (56.4%), which is consistent with the complexity of GAS pathophysiology. We identified a novel cell wall-spanning domain (SF5), and proposed evolutionary pathways depicting the 'contrariwise' and 'likewise' chimeric deletion-fusion of emm and enn. We identified an intermediate strain, which provides evidence of the time-dependent excision of mga regulon genes. Overall, our workflow advances the understanding of the GAS mga regulon and its plasticity.
Year: 2021 PMID: 34135390 PMCID: PMC8209152 DOI: 10.1038/s41598-021-91941-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of the highest accuracy with which the emm-type was inferred when the three tested random forest algorithms were applied to the optimal set of response regulator allele types of group A Streptococcus. Predictions were made using tenfold cross validation and 10 replicates.
| Random forest algorithm^a | Accuracy (%) | AUC^b (%) | F1^c (%) | Precision^c (%) | Recall^c (%) |
|---|---|---|---|---|---|
| Ordinary | 95.7 | 99.8 | 96.4 | 94.4 | 87.4 |
| Regularized | 95.2 | 99.4 | 97.0 | 94.7 | 91.8 |
| Guided | 96.7 | 99.9 | 97.6 | 97.0 | 92.3 |
^a The optimal sets for the Ordinary, Regularized, and Guided random forests were [mga2, lrp, and gntR_spy0715], [mga2, lrp, copY, and crgR], and [mga2, lrp, spy1934, gntR_spy0715, rivR, M28_spy1337, spy1325, gntR_spy1602, spy1817, and crgR], respectively.
^b AUC = multiclass classification area under the receiver operating characteristic curve.
^c Division-by-zero errors have been excluded from this average.
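The averaging described in the last footnote can be sketched as follows. This is a minimal illustration that assumes macro-averaging over emm-types, with any class whose precision, recall, or F1 has a zero denominator left out of that metric's mean; the source does not state the exact convention, and the function name `macro_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Macro-average precision, recall and F1 over classes.

    Assumption: a class is skipped for a given metric when that metric's
    denominator is zero, rather than being counted as 0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p_den = tp[c] + fp[c]
        r_den = tp[c] + fn[c]
        prec = tp[c] / p_den if p_den else None
        rec = tp[c] / r_den if r_den else None
        if prec is not None:
            precisions.append(prec)
        if rec is not None:
            recalls.append(rec)
        # F1 is undefined when either component is undefined or both are 0.
        if prec is not None and rec is not None and (prec + rec) > 0:
            f1s.append(2 * prec * rec / (prec + rec))

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(precisions), mean(recalls), mean(f1s)


# Toy example with hypothetical emm-type labels.
observed = ["1", "1", "12", "12", "89"]
inferred = ["1", "1", "12", "1", "12"]
precision, recall, f1 = macro_metrics(observed, inferred)
```

Because each metric is averaged over a different set of classes under this assumption, the macro F1 need not lie between the macro precision and the macro recall.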
Figure 1. Normalised importance scores of group A Streptococcus response regulator (RR) alleles displaying the highest accuracy in inferring the isolate emm-type for the three RF classifiers tested. The Guided (a), Ordinary (b), and Regularized (c) RF classifiers employed ten, three, and four RR alleles to attain accuracies of 96.7%, 95.7%, and 95.2%, respectively. The SPY locus numbers refer to the SF370 isolate, unless stated otherwise.
Importance value rankings of response regulator alleles (predictor features) in the optimal feature sets used to infer GAS emm-type for the random forest algorithms tested.
| Response regulator | Guided (96.7%)^a | Ordinary (95.7%)^a | Regularized (95.2%)^a |
|---|---|---|---|
| mga2 | 1 | 1 | 1 |
| lrp | 2 | 2 | 2 |
| spy1934 | 3 | | |
| gntR_spy0715 | 4 | 3 | |
| rivR | 5 | | |
| M28_spy1337 | 6 | | |
| spy1325 | 7 | | |
| gntR_spy1602 | 8 | | |
| spy1817 | 9 | | |
| crgR | 10 | | 4 |
| copY | | | 3 |
The optimal feature set is the set of features (from 53 response regulator alleles) selected in attaining the highest accuracy of inferring the emm-type for a particular random forest algorithm.
^a The percentage in brackets is the accuracy of inference.
Figure 2. Susceptibility tests. The accuracy of inferring the group A Streptococcus emm-type by applying a different number of predictor features (response regulator alleles) to each of the three tested random forest classifiers: (a) Guided, (b) Ordinary, and (c) Regularized.
Examples of inaccurately inferred GAS emm-type using the most accurate Guided random forest algorithm and the optimal set of response regulator (RR) allele types.
| Strain | Observed emm-type^a,b | Inferred emm-type^a | Putative explanation: non-discriminatory^c | Putative explanation: singleton^d | Putative explanation: chimeric^e |
|---|---|---|---|---|---|
| K17011 | 79 (E3) | 183 (E3) | Yes | ||
| K23685 | 79 (E3) | 183 (E3) | Yes | ||
| 33181V4T1 | 205 (E5) | 101 (D4) | Yes | ||
| K9612 | 99 (E6) | 182 (E6) | |||
| NGAS148 | New type (NT) | 5 (M5) | New type[ | ||
| K23182 | 63 (E6) | 4 (E1) | [ | ||
| K5690 | 81 (E6) | 82 (E3) | [ | ||
| NGAS473 | 82 (E3) | 74 (M74) | [ | ||
| 31140V1S1 | 98 (D4) | 9 (E3) | [ | ||
| 33181V1T1_01 | 137 (E5) | 39 (A-C4) | This study | ||
| K29655 | 53 (D4) | 52 (D4) | |||
| 33123V2S1 | 71 (D2) | 70 (D4) | |||
| K47020 | 80 (D4) | 81 (E6) | |||
| K20641 | 80 (D4) | 81 (E6) | |||
| K33951 | 80 (D4) | 81 (E6) | |||
| 20027V1I1 | 110 (E2) | 109 (E4) | |||
| K17074 | 218 (M218) | 119 (D4) | |||
| K9927 | 223 (D4) | 22 (E4) | |||
| K37741 | 239 (A-C3) | STG866 (NT) | |||
^a emm-cluster type in brackets.
^b The observed or published emm-type.
^c Prior to the random forest testing, it was known that the variation between the RR alleles in the feature set was not able to discriminate emm79 from emm183, or emm101 from emm205.
^d Singleton denotes where the dataset contained only one representative of this emm-type.
^e Chimeric emm-enn events have been observed in isolates of this emm-type.
^f emm-switching has also been inferred in this isolate.
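The non-discriminatory case in the footnotes above (two emm-types that the feature set cannot separate) can be checked before any classifier is trained. The sketch below is an illustration only, assuming each isolate is represented as a tuple of RR allele numbers; the function name `non_discriminatory_groups` and the allele values are hypothetical, not data from the study.

```python
from collections import defaultdict


def non_discriminatory_groups(profiles, labels):
    """Return groups of class labels that share an identical feature profile.

    profiles: one tuple of allele types per isolate (same feature order).
    labels:   the emm-type of each isolate.
    """
    by_profile = defaultdict(set)
    for profile, label in zip(profiles, labels):
        by_profile[tuple(profile)].add(label)
    # A profile seen under more than one label cannot separate those labels.
    return [sorted(group) for group in by_profile.values() if len(group) > 1]


# Hypothetical allele numbers for a two-feature set.
profiles = [(3, 7), (3, 7), (1, 2), (5, 7)]
labels = ["emm79", "emm183", "emm1", "emm12"]
groups = non_discriminatory_groups(profiles, labels)
```

Any group returned here sets a ceiling on attainable accuracy for those emm-types, whichever random forest variant is used.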
Figure 3. Novel cell wall-spanning domain of group A Streptococcus (GAS) emm, SF5, described by the chimerisation of SF3 and SF1[9,10]. SF3 and SF1 are typically encoded in the majority of enn and a subset of emm, respectively. SF5 was observed in emm39.4 GAS (31005V6S1) and emm137.0 GAS (33181V1T1_01).
Figure 4. Evolutionary pathways of two novel chimeric emm-enn events in the mga regulon of GAS. (a) ‘Contrariwise’ and (b) ‘Likewise’ events are depicted, where the mutated isolate changes its emm-subtype and retains its emm-subtype, respectively. The chimeric emm-enn is represented by a deletion-fusion event that culminates in a new gene containing the 5′ end of emm and the 3′ end of enn.
Figure 5. Phylogeny of group A Streptococcus E3 emm-cluster types. The tree has been labelled with the corresponding emm-type. The table summarises examples of recombination and mutation observed in the mga regulon of E3-type isolates. The tree is drawn to scale, with branch lengths in the same units (number of amino acid substitutions per site) as those of the evolutionary distances used for the phylogenetic tree. Approximate likelihood-ratio test values > 80% are indicated at the nodes. Adapted from Ref.[13]. Legend: ^A NGAS473, an emm82 isolate inferred to have resulted from an emm-switch event, has been previously described[18,19] and was observed in this study to be pgs-negative.
Figure 6. Evolutionary pathway explaining the major disruption to emm of group A Streptococcus 33087V1T1. This also represents a mechanism for the time-dependent excision of the genes of the mga regulon seen in chimeric emm-enn events. It is likely that the nucleotide deletions observed in 20059V1I1 cause disruption that drastically diminishes the function of emm, leading to its eventual deletion.
Figure 7. Accuracy of the random forest classifiers tested in inferring group A Streptococcus genomic traits from a selection of 53 response regulator allele types. The labels tested include emm-subtype, the tissue and country from which the isolate was sampled, clinical outcomes from the infection, and the propensity of the isolate to cause invasive disease.
Comparison of properties of the emm-based and response regulator-based typing systems of group A Streptococcus.
| Response regulator allele-based typing | |
|---|---|
| The RRs are a family of cytosolic proteins that share broadly similar functional domains, including control of the expression of traditional GAS typing proteins | |
| Many proteins with a range of recombinogenicity | |
Preferred (bold) and non-preferred (italics) properties of a molecular bacterial typing system.
Figure 8. Summary of (a) input data nomenclature and (b) process flow of this study.
Figure 9. Schematic representation of the normalised importance score plot for the selected predictor feature set. Subsets of features were selected based on arbitrary threshold normalised importance values. Steps 2 and 3 of the process flow were then applied to each of these subsets.
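The threshold-based subset selection described in the Figure 9 caption can be sketched as below. This is a minimal illustration under stated assumptions: importance scores are normalised by dividing by the largest score (the source does not specify the normalisation), and the raw scores, thresholds, and function names are hypothetical.

```python
def normalise(importances):
    """Scale raw importance scores so that the largest equals 1.0.

    Assumption: max-scaling; other normalisations (e.g. sum to one) are possible.
    """
    top = max(importances.values())
    return {name: score / top for name, score in importances.items()}


def subsets_by_threshold(normalised, thresholds):
    """For each threshold, keep the features whose normalised score meets it,
    ordered from most to least important."""
    ranked = sorted(normalised, key=normalised.get, reverse=True)
    return {t: [f for f in ranked if normalised[f] >= t] for t in thresholds}


# Hypothetical raw importance scores for four RR alleles.
raw = {"mga2": 40.0, "lrp": 30.0, "crgR": 10.0, "copY": 4.0}
scores = normalise(raw)
subsets = subsets_by_threshold(scores, thresholds=[0.5, 0.2])
```

Each resulting subset would then be passed back through the training and evaluation steps, and the subset with the highest accuracy kept as the optimal feature set.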