Literature DB >> 34402777

The roles of antimicrobial resistance, phage diversity, isolation source and selection in shaping the genomic architecture of Bacillus anthracis.

Spencer A Bruce¹, Yen-Hua Huang², Pauline L Kamath³, Henriette van Heerden⁴, Wendy C Turner⁵.

Abstract

Entities: Chemical

Keywords: AMR; Bacillus anthracis; bacterial genomics; natural selection

Mesh：

Substances：
Anti-Bacterial Agents

Year: 2021 PMID： 34402777 PMCID： PMC8549369 DOI： 10.1099/mgen.0.000616

Source DB: PubMed Journal: Microb Genom ISSN： 2057-5858

× No keyword cloud information.

All NCBI accession numbers related to sequence reads and bioproject data used in this study are listed in File S1. All supplementary material can be found at https://doi.org/10.6084/m9.figshare.14485140.v1 The R script used for the random forest classification and code used for identifying clustered mutations can be found at the following GitHub repository: https://github.com/spencer411/B_anthracis_adaptation. Understanding the drivers of pathogen genomic structure allows for targeted disease management based on factors contributing to virulence and host susceptibility. Despite the large range of published information on genetic structure, little work has been done to understand the factors shaping its global genetic constitution. The data presented here allow for the first large-scale accounting of antimicrobial resistance and phage sequence diversity for this species. These results suggest that antibiotic resistance genes and isolation source may be driving aspects of population structure and emphasize the importance of examining multiple factors dictating pathogen evolution and genotypic persistence.

Introduction

For pathogens, consideration of intraspecific variation is central to understanding the evolution of virulence and genotypic persistence [1-3]. Phenotypic and genetic variation in a population may influence ecological composition and function, leading to increased or decreased evolutionary capacity under altered habitat regimes [4, 5]. Incorporating intraspecific diversity into effective management strategies demands the identification of factors influencing ecological plasticity and reproductive success [6, 7]. A wide range of genomic analyses have revealed genetic anomalies supporting ecologically variable phenotypes, suggesting a consequential role for genomic architecture in driving intraspecific heterogeneity [8, 9]. For example, single nucelotide polymorphisms (SNPs) may facilitate the evolution of new phenotypes through the formation of novel proteins in regions that code for spore formation in or virulence factors in [10, 11]. By evaluating the relationship between whole genome architecture and ecologically relevant sequence variation, we can gain a detailed understanding of how biocomplexity drives genomic structure in pathogenic bacteria [12, 13]. Nevertheless, the relationship between genomic variation and adaptive relevance remains largely unknown for the vast majority of pathogenic species [14, 15]. has been extensively studied given its ability to cause anthrax, a disease that can be fatal to wildlife, livestock and humans [16]. Studies that have examined the integration of bacteriophage DNA into the genome have suggested that these sequences may influence gene expression, potentially driving increased sporulation and observable phenotypic differences [17, 18]. In addition, antimicrobial resistance (AMR) has recently garnered a great deal of attention given the wide range of antibiotics administered to both humans and livestock throughout the world, driving selective resistance in a myriad of bacteria including [19, 20]. Therefore, when examining genomic architecture in light of selection, the use of classification methodologies that incorporate potentially ecologically relevant differences in phage diversity and AMR may shed light on the drivers of modern population genomic structure in this species. This in turn will allow us to better forecast what genomic clusters or clades may pose the greatest risk of disease emergence and re-emergence in animals and humans [21]. Nevertheless, it should be noted that the detection of an AMR gene does not always translate to conferred resistance [22]. Individual and regional genetic diversity that differentiates populations by SNP architecture has been identified on a global scale [23-26]. Recent work has refined our understanding of population genomic structure for this species [27, 28]. Work by Sahl et al. sought to expand on the original classification system, and generated an SNP database used to characterize the branching structure of isolates based on 193 genomes [28]. More recent genomic analyses that comprise the largest global phylogeny of to date (356 genomes) has redefined population genomic structure, resulting in six primary clusters and 18 nested clades. This new classification system uses an intuitive, simplified naming system and allows for linkable, rapid classification [27]. Two of the major genotype clusters, cluster 1 (C Branch) and cluster 2 (B Branch), are vastly underrepresented in terms of prevalence and have been hypothesized to be less fit than the majority of specimens isolated and sequenced [26, 29, 30]. However, the link between genomic architecture, and the scarcity of these genotypes remains largely unexamined. In addition, some of the individual clades identified are geographically specific, whereas others seem to be widely distributed, raising numerous questions about what factors are driving evolutionary success in this species [26-28]. Understanding the relationships among spatial variation, population stability, and genomic architectural variation is particularly important for , as it is a major threat to wildlife, livestock and public health globally [31]. In this study, we explore genomic variation and selection in a global whole-genome dataset of isolates spanning 39 countries and six continents. In addition, we apply an ensemble machine learning method [random forest (RF)] to elucidate the ways in which isolation source, geography, phage diversity and AMR genes may be shaping genomic diversity and genotypic persistence. RF operates by constructing decision trees on various subsamples of the dataset, allowing for predictions regarding evolutionary potential at the population level.

Methods

Whole genome mapping and assembly

The population genomic dataset used in these analyses was previously developed and published by Bruce et al. [27] consisting of 356 whole genomes collected from the NCBI sequence read archive ([27], File S1, available in the online version of this article). Each read pair was mapped to the fully annotated Ames Ancestor genome (accession AE017334.2), using the RedDog pipeline (https://github.com/katholt/RedDog). Mapped reads were then subjected to extensive post-processing to remove calls (a) found in regions with large ‘inexact’ repeats, (b) within prophage regions of the reference genome, (c) from regions that were found to be invariable in all but the outgroup, (d) from regions potentially resulting from recombination and (e) potentially related to stutter. Full details relating to methods for mapping, SNP calling and determination of population genomic structure can be found in Bruce et al. [27]. For the purpose of this study, the same trimmed sequence reads were also subjected to de novo assemblies using SPAdes version 3.13.0, a genome assembly algorithm specifically developed for single cell and multi-cell bacterial isolates [32]. De novo assemblies allow for the identification of unique sequences in each isolate not identifiable using the mapping method described above.

Identification of AMR genes and phage sequence variation

We screened each assembly for AMR genes employing The Resistance Gene Identifier (RGI) tool provided by the Comprehensive Antibiotic Resistance Database (CARD) [33]. RGI can be used to predict resistomes from protein or nucleotide data based on homology and SNP models. In addition to identifying AMR genes, we identified prophage sequences within the contigs of each assembled genome using the Phage Search Tool Enhanced Release (PHASTER) [34]. PHASTER is a web-based application that is designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. The full phylogenetic tree of isolates from Bruce et al. [27] was then annotated using iTOL [35] with both phage sequence variation (scored as either intact, questionable or incomplete; see Table S1 for details), and presence (or absence) of AMR genes. AMR gene data were then plotted geographically to understand patterns of resistance on a global scale using Adobe Illustrator [36].

Classifying population genomic architecture using RF

To understand how various factors may be influencing the genomic architecture of we used an RF approach [37], incorporating AMR gene data, phage diversity data and isolate metadata (continent of isolation and source) accessed through the NCBI biosample database [38]. RF has gained increased attention over the past several decades given its ability to produce excellent classification results while also being computationally inexpensive [39, 40]. The RF classifier produces valid classifications using predictions derived from a group of decision trees and can also be used to select and rank those variables, allowing the user to successfully discriminate between the target classes [41]. RF was carried out using the R package randomForest [37] to construct a multitude of decision trees and determine the mean prediction of each individual tree pertaining to the six primary population clusters [27]. The R package SPM was then used to carry out a 5-fold cross-validation [42]. We first removed samples with missing values for independent variables, and additionally removed three variables (AMR gene mphL, and phage sequence Bacillus virus 1 and Bacillus phage PfEFR-5) which exhibited no variation across the dataset. The final dataset resulted in 20 independent variables (Table S2). We divided the dataset into a training dataset including 75 % of the samples and a validation dataset including the other 25 %. To determine Mtry and Ntree (number of variables and number of trees), we used a 5-fold cross-validation and grid search. To carry out 5-fold cross-validation, we randomly assigned each sample to one of five groups. For each pass of cross-validation, RF classifiers were trained with a test dataset of which one group was held out [43]. The model with the highest correct classification rate and Kappa index of the classification was selected for determining values of Mtry and Ntree. We used the best combination of the Mtry and Ntree for the final RF model. To assess the model fit of the RF we subjected the model to the validation dataset and estimated the accuracy. To determine the contribution of the variables to the classification in the model, the importance of variables was evaluated by the mean decrease in accuracy. The mean decrease in accuracy was computed with the difference between the out-of-bag (OOB) error (training observations not included in the bootstrap) from a dataset with the selected variable permuted and the OOB error from the original dataset [41].

Recombination, high-impact SNPs and candidate genes for selection

To determine how recombination may be influencing population genomic structure across our dataset, we first used the program Gubbins to iteratively identify loci containing elevated densities of base substitutions in the SNP dataset (prior to removal of recombinant sequences) [44]. To analyse selection in non-recombining regions, we analysed SNPs (post-removal of recombinant sequences) using the program SnpEff [45]. SnpEff annotates and predicts the effects of genetic variants on genes and proteins (such as amino acid changes). To assess ‘high-impact’ SNPs influencing population genomic structure, we compiled a list of SNPs that produce significant changes to protein structure in the chromosome and plasmids, specific to each primary cluster and groups of primary clusters, such as mutations that result in the gain of a stop codon, the loss of a start codon and splice region variants. We also looked for clustered SNPs across each of the aforementioned groups to identify genes that were possibly associated with selection using a modified version of the algorithm developed by Cui et al. [46], classifying genes that showed three or more mutations within a 2000 bp range, as well as genes that showed two or more SNPs within a 50 bp range. Clustered mutations have a low probability of occurring under a neutral substitution model, in which variations are assumed to be randomly distributed across the genome Zhou et al. [47]. Examining the ratio of non-synonymous to synonymous SNPs at the gene level was problematic given the clonal nature of and reduced variability at the level of the gene, and was therefore not included in our analysis. We then compiled a list of candidate genes for selection that were identified using both methods above. Finally, we examined differences in SNP variation across the virulence genes (in the plasmids), again using SNPeff to identify SNPs potentially leading to functional differences across the different population genomic clusters.

Results

Global variation in AMR and phage diversity

We identified a total of ten AMR genes across the global collection of 356 genomes analysed (Fig. 1, Table 1). Additional information regarding the AMR genes and their frequency is provided in Table S3. A key linking the classification framework shown here to the previously established branch labels outlined by Sahl et al. [28] are provided in Fig. S1. Five AMR genes (mphL, bla1, fosB, bla2 and vmlR) were found across the majority of isolates tested. AMR gene mphL was identified in every isolate examined. AMR gene bla1 was absent in two unrelated isolates, one collected in South Carolina, USA [clade 4.3 (Vollum), NCBI Sequence Read Archive: SRR5811007], and the other collected in Morioka, Japan [clade 5.2 (Sterne), NCBI Sequence Read Archive: DRR128181]. AMR gene fosB was absent from a single isolate collected in South Carolina [clade 4.2 (Vollum), NCBI Sequence Read Archive: SRR5811063], and all isolates that comprise primary cluster 1 (C Branch) from the USA (N=5). The gene bla2 was absent from a number of other disparate samples (N=11) and was completely absent from all isolates that comprise clade 3.1 (Ancient A; N=4). AMR gene vmlR was present in all isolates with the exception of a handful of closely related isolates collected in the USA between 1956 and 1978 from clade 4.1 (Vollum, N=3), as well as all isolates that comprise primary cluster 5 (V770, Ames, Sterne, Aust94; N=72). All other AMR genes were far rarer. AMR gene bcII was present in only six samples, including one isolate from Alabama [clade 4.2 (Vollum), NCBI Sequence Read Archive: SRR1739961], one isolate from Akita, Japan [clade 5.2 (Sterne), NCBI Sequence Read Archive: DRR128182], one isolate from Argentina [clade 5.2 (Sterne), NCBI Sequence Read Archive: SRR5810989], and three isolates from Albania [clade 6.1 (TEABr008/011), NCBI Sequence Read Archive: SRR2968139, SRR2968140 and SRR2968213]. AMR gene tem-116 was present in only three isolates, all collected in Zambia between 2012 and 2013 [clade 3.3 (Ancient A), NCBI Sequence Read Archive: DRR014736, DRR014737 and DRR125655]. AMR gene cfrC was present in a single isolate from Germany [clade 5.3 (Aust94), NCBI Sequence Read Archive: SRR2968155], dfrG was present in a single isolate from Zambia [clade 3.3 (Ancient A), NCBI Sequence Read Archive: DRR125655] and oxa-59 was present in a single isolate from Italy [clade 6.1 (TEABr008/011), NCBI Sequence Read Archive: SRR2968209].

Fig. 1.

Table 1.

AMR genes and their definitions from the Comprehensive Antibiotic Resistance Database (CARD)

Name	Resistance mechanism	Accession	Definition
mphL	Antibiotic inactivation	ARO:3003072	A chromosomally encoded macrolide phosphotransferase that inactivates macrolides such as erythromycin, clarithromycin, azithromycin
bla1	Antibiotic inactivation	ARO:3000090	A chromosomally encoded beta-lactamase that hydrolyses penicillins
fosB	Antibiotic inactivation	ARO:3000172	A thiol transferase that leads to fosfomycin resistance
bla2	Antibiotic inactivation	ARO:3004189	A chromosomally encoded beta-lactamase that has penicillin-, cephalosporin- and carbapenem-hydrolysing abilities
vmlR	Antibiotic target protection	ARO:3004476	An ABC-F ATPase ribosomal protection protein shown to confer resistance to lincomycin and streptogramin A virginiamycin
bcII	Antibiotic inactivation	ARO:3002878	A zinc metallo-beta-lactamase that hydrolyses a large number of penicillins and cephalosporins
tem-116	Antibiotic inactivation	ARO:3000979	A broad-spectrum beta-lactamase found in many species of bacteria
cfrC	Antibiotic target alteration	ARO:3004146	A cfr-like 23S rRNA methyltransferase shown to confer resistance to linezolid and phenicol antibiotics, including florfenicol and chloramphenicol
dfrG	Antibiotic target replacement	ARO:3002868	A plasmid-encoded dihydrofolate reductase
oxa-59	Antibiotic inactivation	ARO:3001772	A beta-lactamase

AMR genes identified in the whole‐chromosome tree of 356 global isolates (a). Primary clusters are divided into their numbered nested clades by grey lines. The key on the right lists the ten AMR genes identified. Outer rings reflect presence (colour) or absence (white) of each gene across all isolates in the phylogeny. A world map depicting the prevalence of AMR genes is depicted in (b). Each circle represents an AMR gene coloured according to the key and figure above. Percentages represent the total proportion of isolates from each continent where the respective AMR gene was identified (including North America, South America, Europe, Asia and Oceania). See Fig. S1 for a key to previously established classification schemes. AMR genes and their definitions from the Comprehensive Antibiotic Resistance Database (CARD) Name Resistance mechanism Accession Definition mphL Antibiotic inactivation ARO:3003072 A chromosomally encoded macrolide phosphotransferase that inactivates macrolides such as erythromycin, clarithromycin, azithromycin bla1 Antibiotic inactivation ARO:3000090 A chromosomally encoded beta-lactamase that hydrolyses penicillins fosB Antibiotic inactivation ARO:3000172 A thiol transferase that leads to fosfomycin resistance bla2 Antibiotic inactivation ARO:3004189 A chromosomally encoded beta-lactamase that has penicillin-, cephalosporin- and carbapenem-hydrolysing abilities vmlR Antibiotic target protection ARO:3004476 An ABC-F ATPase ribosomal protection protein shown to confer resistance to lincomycin and streptogramin A virginiamycin bcII Antibiotic inactivation ARO:3002878 A zinc metallo-beta-lactamase that hydrolyses a large number of penicillins and cephalosporins tem-116 Antibiotic inactivation ARO:3000979 A broad-spectrum beta-lactamase found in many species of bacteria cfrC Antibiotic target alteration ARO:3004146 A cfr-like 23S rRNA methyltransferase shown to confer resistance to linezolid and phenicol antibiotics, including florfenicol and chloramphenicol dfrG Antibiotic target replacement ARO:3002868 A plasmid-encoded dihydrofolate reductase oxa-59 Antibiotic inactivation ARO:3001772 A beta-lactamase In addition to AMR genes, we also identified 11 prophage sequences across our global dataset (Fig. 2, Table S4). Prophage sequences were scored as intact, questionable or incomplete. Criteria related to this categorization can be found in Table S1. Additional information regarding the phage sequences and their lineages is provided in Table S5. Bacillus virus 1, Bacillus phage PfEFR-5 and Staphylococcus phage vB_SepS_SEP9 sequences were detected across all of our samples. Bacillus virus 1 was determined to be intact in all isolates examined. Bacillus phage PfEFR-5 was determined to be questionable across most isolates, but incomplete for all isolates comprising primary cluster 1 (C Branch, N=5), while Staphylococcus phage vB_SepS_SEP9 was determined to be questionable across all isolates, but incomplete for all isolates comprising primary clusters 1 and 2 (C and B Branches; N=18). The eight remaining prophage sequences were scattered in comparatively minimal amounts across the global dataset, with the exception of Bacillus phage phBC6A52 which was intact in a large number of the isolates examined (N=91), with seemingly no link to relatedness or geography among isolates.

Fig. 2.

Prophage sequences identified in whole‐chromosome tree of 356 global isolates. Primary clusters are divided into their numbered nested clades by grey lines. The key on the right indicates the phage sequence present for each isolate, numbered according to their order from the inside of the ring to the outside. Colour indicates whether the phage sequence was determined to be intact, questionable or incomplete. Criteria related to this categorization can be found in Table S1. See Fig. S1 for a key to previously established classification schemes.

Explaining global genomic clusters with RF

The RF model was trained using 5-fold cross-validation with a training dataset. The best model parameter (where Ntree and Mtry equalled 400 and 9 respectively) produced a cross-correlation rate that showed a high value of 83.6, while kappa equalled 0.764. Variables examined include presence of AMR genes and phage sequences, as well as sample source (details provided in Table S2). With this combination of Ntree and Mtry, the OOB error based on the confusion matrix was 16.89 %. Applying the model to the validation dataset and comparing the observations and predictions, the overall accuracy was 0.861. The model always failed to predict primary cluster 1 (C Branch) for the validation dataset (N=1), and primary cluster 2 (B Branch) for both the training set (N=9) and the validation dataset (N=2), probably due to the reduced number of representatives comprising these clusters. The AMR gene vmlR, the isolation source (host, environment or industry) and the continent of isolation were the most important variables in explaining genomic clusters across the entire dataset (Fig. 3a). The variable importance based on the mean decrease in accuracy for each individual cluster is shown in Fig. 3b. The absence of AMR gene fosB was the strongest predictor for primary cluster 1 (C Branch), whereas the vmlR gene in primary cluster 2 (B Branch) acted as the strongest predictor. Nevertheless, both of these models exhibited negligible accuracy in the confusion matrix, suggesting more data are needed for accurate classification for these two clusters (Table S6). In cluster 3 (Ancient A) the continent of isolation was the strongest predictor by a large margin, as the vast majority of the samples that make up this population were isolated in Africa. For cluster 4 (Vollum) isolation source was the strongest predictor, followed by continent, as the majority of the isolates from this cluster were collected from industry (textile factories, animal processing plants, etc.) in North America. For cluster 5 (V770, Ames, Sterne, Aust94) the absence of the vmlR gene was the strongest, lone overall predictor. Finally, in cluster 6 (TEA), isolation source, presence of the vmlR gene and continent all showed comparatively strong power in classifying this cluster, with the majority of isolates from this cluster being isolated from animal hosts in North America and Europe.

Fig. 3.

Importance of the covariates in defining population genomic architecture for all primary clusters combined (a) and for each primary cluster on its own (b) by the RF classifier.

Role of selection in shaping the genome

The program Gubbins predicted two instances of recombination, the first in a single isolate from Thailand [clade 5.2 (Ames), NCBI Sequence Read Archive: SRR5811219], based on 26 SNPs, and the second encompassing 13 isolates [comprising all of primary cluster 2 (B Branch)], based on 24 SNPs. Both instances of predicted recombination were specific to the rrsA rRNA gene [positions 9335–10 841 in the Ames Ancestor reference genome (NCBI accession: AE017334.2)], which encodes the 16S rRNA, essential to translating messenger RNA into proteins. After removing SNPs associated with recombination, SNP calls were split by primary cluster designations using a hierarchical approach, grouping primary clusters based on their nested structure, while filtering at a minimum allele frequency of 0.10 to avoid the identification of relatively rare alleles that were not necessarily indicative of their respective population genomic cluster. High-impact SNPs were then identified using the program snpEff. A detailed account of all 62 high-impact SNPs identified (including their predicted effect) is presented in Table S7. We also looked at clustered mutations in the same hierarchical manner, leading to the identification of 122 candidate genes potentially influencing selection (Table S8). Comparing both methodologies, five genes that spanned clustered mutations also contained a high-impact SNP (Table 2). These include genes coding for a DNA-binding response regulator and stage 0 sporulation regulatory protein (both specific to primary cluster 1), a tetratricopeptide repeat (TPR) domain protein [specific to primary cluster 2 (B Branch)], as well as a chlorohydrolase family protein and a hypothetical protein [both specific to primary clusters 5 (V770, Ames, Sterne, Aust94) and 6 (TEA)].

Table 2.

Cluster	Position	Reference	Alternate	Comparative effect	Gene/product
1	1 260 604	C	T	Stop gained	DNA-binding response regulator
	1 292 469	C	A	Stop gained	Stage 0 sporulation regulatory protein
2	3 140 849	A	T	Stop gained	TPR domain protein
5 and 6	1 748 642	A	T	Stop gained	Chlorohydrolase family protein
	2 423 864	T	C	Start lost	Hypothetical protein

Information for SNPs that exhibit a potentially high-impact effect and fall within clustered mutations across the genome; positions are relative to the Ames Ancestor reference genome (NCBI accession: AE017334.2) Cluster Position Reference Alternate Comparative effect Gene/product 1 1 260 604 C T Stop gained DNA-binding response regulator 1 292 469 C A Stop gained Stage 0 sporulation regulatory protein 2 3 140 849 A T Stop gained TPR domain protein 5 and 6 1 748 642 A T Stop gained Chlorohydrolase family protein 2 423 864 T C Start lost Hypothetical protein Lastly, we looked at non-synonymous mutations across the virulence genes (in the pXO1 and pXO2 plasmids), again using a minimum allele frequency of 0.10 to avoid the identification of relatively rare alleles that were not necessarily indicative of a group. All of the non-synonymous SNPs identified were on the pXO1 plasmid and spanned two toxin genes: the cya (calmodulin-sensitive adenylate cyclase) and the pagA (protective antigen) genes (Table 3). Both of the mutations in the cya gene were specific to clade 6.3 (WNA/TEABr011) for which all isolates were collected in western North America. In the pagA gene, one missense mutation was specific to the genetically and geographically diverse primary cluster 4 (Vollum), and the other to cluster 5 (V770, Ames, Sterne, Aust94), for which most of the isolates were collected in Asia and Europe.

Table 3.

Information for non-synonymous SNPs in virulence genes across the plasmids; positions are relative to the Ames Ancestor reference genome (NCBI accession: AE017336)

Cluster.clade	Position	Plasmid	Reference	Alternate	Comparative effect	Gene/product
6.3	123 936	pXO1	A	T	Missense mutation	cya: calmodulin-sensitive adenylate cyclase
6.3	124 007	pXO1	A	G	Missense mutation	cya: calmodulin-sensitive adenylate cyclase
4	145 471	pXO1	C	T	Missense mutation	pagA: protective antigen
5	145 577	pXO1	C	T	Missense mutation	pagA: protective antigen

Information for non-synonymous SNPs in virulence genes across the plasmids; positions are relative to the Ames Ancestor reference genome (NCBI accession: AE017336) Cluster.clade Position Plasmid Reference Alternate Comparative effect Gene/product 6.3 123 936 pXO1 A T Missense mutation cya: calmodulin-sensitive adenylate cyclase 6.3 124 007 pXO1 A G Missense mutation cya: calmodulin-sensitive adenylate cyclase 4 145 471 pXO1 C T Missense mutation pagA: protective antigen 5 145 577 pXO1 C T Missense mutation pagA: protective antigen

Discussion

Understanding the drivers of population genomic structure in pathogens is essential for making informed decisions related to wildlife management, disease control and public health. The data presented in this study offer the first detailed, global accounting of AMR genes and phage diversity in . In addition, our findings suggest that the six primary clusters defining population genomic structure in this species are consistent with differences in both AMR genes, geography and the source from which they were isolated. We also demonstrate that a recombination event linked to protein translation may take part in determining the persistence of certain strains. Finally, we offer a wealth of information on genomic diversity potentially associated with functional differences driving selection, allowing for further investigations into persistence, biogeography and evolution. AMR has gained increased attention as a major threat to public health throughout the world [48, 49]. By documenting AMR genes on a global scale, we can gain a better understanding of how biogeography and persistence are transforming the genomic constitution of dangerous pathogens at both regional and wider scales [50, 51]. Based on our analysis of over 350 whole genomes, we have identified ten AMR genes present in isolates collected from over 35 countries, many consistent with the different population clusters examined in this study. Five of these genes are commonplace and can be found in the majority of isolates examined, whereas the other five are comparatively rare across the dataset. Given the application of commonly used antibiotic drugs, such as penicillin, doxycycline and ciprofloxacin, to treat infections, the regions where rare antibiotic-resistant gene isolates were sampled may benefit from monitoring, in order to document the persistence of these novel, resistant population clusters and modify antibiotic treatments for effectiveness [52, 53]. The resistance gene bcII for example, which was found in only six samples, is known to hydrolyse a large number of penicillins (Table 1). Rarer antibiotic-resistant gene strains such as these may be indicative of a larger problem with antibiotic resistance in other dangerous pathogens as well, especially if the overuse of certain antibiotics is driving resistance in those regions where novel resistance genes reside [54]. The influence of bacteriophage sequences on population genomic structure across the global dataset is less clear. As with AMR genes, several phage sequences were commonplace across isolates examined, while others were rarer or without pattern. Phage diversity was the least important factor in predicting population genomic structure based on the RF technique applied in this study. This is in contrast to studies of other pathogens, where phage sequence variation has been consistent with population genomic structure and therefore used for strain typing [55, 56]. Although previous studies have suggested some phage sequences may affect certain bacterial processes in , such as sporulation [17, 18], there was not an observable example of this leading to any advantage reflected in the form of genetically similar population clusters. Applying the RF model, population genomic structure was most readily described by a combination of AMR genes, isolation location and source. The strongest predictor of population genomic structure when examining the dataset in its entirety was the presence of the AMR gene vmlR, which was completely absent in primary cluster 5 [which was A.Br.001–A.Br.004 (Ames, Sterne, Aust94, V770) in the original classification system], the most genetically diverse population cluster examined in this study from which isolates were collected across Europe, Asia, Africa and the Americas. Interestingly, isolation source (host, environment or industry) was the second strongest predictor, suggesting that some strains of may be better suited to different environmental circumstances (or at least more readily cultured within them). Previous work that has examined population genomic structure has suggested that environmental growth outside of the host is possible [28]. Additionally, strains collected from industry may represent geographical consistencies in raw wool procural rather than a niche associated with this type of artificial environment [57, 58]. Nevertheless, long latent periods in the spore phase may be hindering our ability to detect environmental consistencies with population genomic structure. Not surprisingly, the continent of isolation was also a strong predictor in terms of population genomic structure, consistent with expected biogeographical patterns based on centuries of dispersal, complex trading patterns and global commerce. These findings are largely consistent with past work that has examined the population genetics and ubiquitous dissemination of this bacterium [26, 28, 58]. These combined forces – AMR genes, isolation source and biogeography – all seem to play a role in defining modern population structure in this bacterium. Using RF models to look at the factors influencing each primary cluster individually, we found that varying circumstances seem to act as predictors for each individual cluster. The most underrepresented group, primary cluster 1, previously referred to as the C Branch in the literature and viewed as a rarely occurring clade [26, 28], is largely defined by the absence of the AMR gene fosB, which is found universally across all other population clusters examined. The relatively rare primary cluster 2 (B Branch) was not easily defined by any of the variables examined. Nevertheless, classification performance for both primary cluster 1 and cluster 2 was equally poor when assessing the accuracy. Previous work that has specifically examined isolates belonging to cluster 2 from Kruger National Park found that they were prevalent in more alkaline calcium-rich soils than cluster 3 (Ancient A) isolates occurring in the same region [30]. Cluster 3 (Ancient A) was described primarily by its isolation from the continent of Africa (although there are several isolates from elsewhere as well), suggesting that isolates from this group may be uniquely suited to or may have originated in this region. Primary cluster 4 is primarily described by a combination of isolation source and continent. This group, formerly referred to as A.Br.007 or Vollum in the literature, was isolated almost exclusively in a manufacturing setting in North America. Metadata and historical records for some of these isolates which were originally sequenced by the Centers for Disease Control (CDC) suggest that these isolates may have originated in other areas, most notably Asia and the Middle East [58, 59]. Cluster 5 (A.Br 001–004) is most readily described by the complete absence of the AMR gene vmlR. Lastly, cluster 6 [previously the A.Br.008 and A.Br.009 lineages (TAE)] was primarily described by isolation source, as the majority of these isolates were collected from animal hosts throughout Europe and North America, although this group also contained isolates from Asia and South America in smaller numbers. When examining population genomic structure in the context of candidate genes for selection, we see that recombination specific to primary cluster 2 (previously known as B Branch) may be responsible for the comparatively extreme difference in population structure in this group when compared to groups 3 to 6 (A Branch). A study that specifically looked at this group suggests that there may be phenotypic differences leading to contrasting mechanisms of infection, making this group specifically well suited to bovine species [58]. Given that this recombination event is rooted in a gene responsible for protein translation, these results support the hypothesis that phenotypic and functional traits for this cluster may be substantially different from the others. We examined genes that were identified using two methods for pinpointing candidates for selection (high-impact SNPs and clustered mutations) and found that a range of functional differences may be driving population genomic structure. Primary cluster 1 (C Branch) exhibited premature stop codons in two genes, a DNA-binding response regulator and a stage 0 sporulation regulatory protein. If these premature stop codons are hindering this cluster’s ability to produce proteins and influencing the timing and magnitude of sporulation, then this may indeed be why they are so underrepresented in the global dataset, and comparatively rare. Primary cluster 2 (B Branch) exhibited a premature stop codon in the TPR domain protein. TPR proteins may act as scaffolds for the assembly of different multiprotein complexes [60]. A premature stop codon in this sequence may be similarly affecting primary cluster 2’s ability to persist and reproduce, leading to its similar rarity across the remainder of the global dataset (N=13/356). When primary clusters 5 and 6 are examined as a unit we see that the chlorohydrolase family protein exhibits a premature stop codon. Hydrolase proteins commonly perform as biochemical catalysts that use water to break a chemical bond, which typically results in dividing larger molecules into smaller molecules [61]. If this protein lacks the ability to perform this function, isolates specific to this group may be functionally different from the other population groupings. Overall these findings lay the groundwork for future studies into evolution, allowing for investigations into how protein structure drives functional and phenotypic differences across varied lineages. Lastly, we looked at the virulence genes and found that several missense mutations may be influencing protein structure in some population clusters relative to others. Primary clusters 4 (Vollum) and 5 (A.Br001-004), the second and third most common designations across all isolates examined, exhibited different missense mutations in the pagA gene. The pagA gene encodes the protective antigen (PA), which binds to a receptor in sensitive eukaryotic cells, thereby facilitating the translocation of the enzymatic toxin components, oedema factor and lethal factor, across the target cell membrane [62]. Past work on this gene found six different haplotypes, which translate into three different amino acid sequences. Amino acid changes were shown to be located in an area near a highly antigenic region critical to lethal factor binding [63]. These mutations may therefore explain these clusters’ comparatively robust prevalence compared to some others if this differentiated structure is more beneficial to genotypic persistence. We also found two mutations in the cya gene specific to clade 6.3 (WNA) entirely from North America. The cya gene codes for the calmodulin-sensitive adenylate cyclase that, when associated with PA, causes oedema. This protein product is not toxic in and of itself, although it is required for the survival of germinated spores within macrophages at the early stages of infection, provoking dramatic elevation of intracellular cAMP levels in the host [64]. When evaluating the population genomic structure of in light of biogeography, AMR, phage diversity and candidate genes for selection, we find varying explanations for differences in population genomic structure. Nevertheless, it should be noted that in a mined dataset such as this, inaccuracy in metadata and/or sequencing has the potential to produce unintentional errors. In addition, our dataset is highly biased towards developed countries where whole genome sequencing technology is readily available and government support for such work is more abundant. Given the complex dispersal history of this notorious pathogen and the competing factors that ultimately sculpt its global genomic architecture, no single factor alone can be attributed to its modern genomic constitution. Despite these limitations we were able to determine the most influential factors consistent with differences and similarities among lineages using modern bioinformatic techniques. The information provided in this study not only offers a detailed accounting of AMR genes and phage diversity in this species, but also allows for the groundwork upon which future studies into evolution can be built. This work has the potential to drive further discovery of functional differences in terms of virulence and genotypic persistence that may ultimately help to inform management strategies in the realm of public health and wildlife conservation. Click here for additional data file. Click here for additional data file.

54 in total

1. The comprehensive antibiotic resistance database.

Authors: Andrew G McArthur; Nicholas Waglechner; Fazmin Nizam; Austin Yan; Marisa A Azad; Alison J Baylay; Kirandeep Bhullar; Marc J Canova; Gianfranco De Pascale; Linda Ejim; Lindsay Kalan; Andrew M King; Kalinka Koteva; Mariya Morar; Michael R Mulvey; Jonathan S O'Brien; Andrew C Pawlowski; Laura J V Piddock; Peter Spanogiannopoulos; Arlene D Sutherland; Irene Tang; Patricia L Taylor; Maulik Thaker; Wenliang Wang; Marie Yan; Tennison Yu; Gerard D Wright
Journal: Antimicrob Agents Chemother Date: 2013-05-06 Impact factor: 5.191

2. The effect of a bacteriophage on diversification of the opportunistic bacterial pathogen, Pseudomonas aeruginosa.

Authors: Michael A Brockhurst; Angus Buckling; Paul B Rainey
Journal: Proc Biol Sci Date: 2005-07-07 Impact factor: 5.349

Review 3. Emerging mechanisms of antimicrobial resistance in bacteria and fungi: advances in the era of genomics.

Authors: John Osei Sekyere; Jonathan Asante
Journal: Future Microbiol Date: 2018-01-10 Impact factor: 3.165

4. TPR motifs: hallmarks of a new polysaccharide export scaffold.

Authors: Chris Whitfield; Iain L Mainprize
Journal: Structure Date: 2010-02-10 Impact factor: 5.006

5. Detailed genomic analysis of the Wbeta and gamma phages infecting Bacillus anthracis: implications for evolution of environmental fitness and antibiotic resistance.

Authors: Raymond Schuch; Vincent A Fischetti
Journal: J Bacteriol Date: 2006-04 Impact factor: 3.490

Review 6. Bacillus anthracis genetics and virulence gene regulation.

Authors: T M Koehler
Journal: Curr Top Microbiol Immunol Date: 2002 Impact factor: 4.291

Review 7. Pathogenicity, population genetics and dissemination of Bacillus anthracis.

Authors: Paola Pilo; Joachim Frey
Journal: Infect Genet Evol Date: 2018-06-20 Impact factor: 3.342

8. A classification framework for Bacillus anthracis defined by global genomic structure.

Authors: Spencer A Bruce; Nicholas J Schiraldi; Pauline L Kamath; W Ryan Easterday; Wendy C Turner
Journal: Evol Appl Date: 2020-01-23 Impact factor: 5.183

9. The secret life of the anthrax agent Bacillus anthracis: bacteriophage-mediated ecological adaptations.

Authors: Raymond Schuch; Vincent A Fischetti
Journal: PLoS One Date: 2009-08-12 Impact factor: 3.240

10. A Bacillus anthracis Genome Sequence from the Sverdlovsk 1979 Autopsy Specimens.

Authors: Jason W Sahl; Talima Pearson; Richard Okinaka; James M Schupp; John D Gillece; Hannah Heaton; Dawn Birdsell; Crystal Hepp; Viacheslav Fofanov; Ramón Noseda; Antonio Fasanella; Alex Hoffmaster; David M Wagner; Paul Keim
Journal: MBio Date: 2016-09-27 Impact factor: 7.867

1 in total

1. CRISPR-Cas9 Based Bacteriophage Genome Editing.

Authors: Xueli Zhang; Chaohui Zhang; Caijiao Liang; Bizhou Li; Fanmei Meng; Yuncan Ai
Journal: Microbiol Spectr Date: 2022-07-26

1 in total