Lindsey L Bohr1, Madison A Youngblom1, Vegard Eldholm2, Caitlin S Pepperell1,3. 1. Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, USA. 2. Norwegian Institute of Public Health, Oslo, Norway. 3. Department of Medicine, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, USA.
Abstract
Mycobacterium abscessus is a rapid growing, free-living species of bacterium that also causes lung infections in humans. Human infections are usually acquired from the environment; however, dominant circulating clones (DCCs) have emerged recently in both M. abscessus subsp. massiliense and subsp. abscessus that appear to be transmitted among humans and are now globally distributed. These recently emerged clones are potentially informative about the ecological and evolutionary mechanisms of pathogen emergence and host adaptation. The geographical distribution of DCCs has been reported, but the genomic processes underlying their transition from environmental bacterium to human pathogen are not well characterized. To address this knowledge gap, we delineated the structure of M. abscessus subspecies abscessus and massiliense using genomic data from 200 clinical isolates of M. abscessus from seven geographical regions. We identified differences in overall patterns of lateral gene transfer (LGT) and barriers to LGT between subspecies and between environmental and host-adapted bacteria. We further characterized genome reorganization that accompanied bacterial host adaptation, inferring selection pressures acting at both genic and intergenic loci. We found that both subspecies encode an expansive pangenome with many genes at rare frequencies. Recombination appears more frequent in M. abscessus subsp. massiliense than in subsp. abscessus, consistent with prior reports. We found evidence suggesting that phage are exchanged between subspecies, despite genetic barriers evident elsewhere throughout the genome. Patterns of LGT differed according to niche, with less LGT observed among host-adapted DCCs versus environmental bacteria. We also found evidence suggesting that DCCs are under distinct selection pressures at both genic and intergenic sites. Our results indicate that host adaptation of M. abscessus was accompanied by major changes in genome evolution, including shifts in the apparent frequency of LGT and impacts of selection. Differences were evident among the DCCs as well, which varied in the degree of gene content remodelling, suggesting they were placed differently along the evolutionary trajectory toward host adaptation. These results provide insight into the evolutionary forces that reshape bacterial genomes as they emerge into the pathogenic niche.
Mycobacterium abscessus is a rapid growing, free-living species of bacterium that also causes lung infections in humans. Human infections are usually acquired from the environment; however, dominant circulating clones (DCCs) have emerged recently in both M. abscessus subsp. massiliense and subsp. abscessus that appear to be transmitted among humans and are now globally distributed. These recently emerged clones are potentially informative about the ecological and evolutionary mechanisms of pathogen emergence and host adaptation. The geographical distribution of DCCs has been reported, but the genomic processes underlying their transition from environmental bacterium to human pathogen are not well characterized. To address this knowledge gap, we delineated the structure of M. abscessus subspecies abscessus and massiliense using genomic data from 200 clinical isolates of M. abscessus from seven geographical regions. We identified differences in overall patterns of lateral gene transfer (LGT) and barriers to LGT between subspecies and between environmental and host-adapted bacteria. We further characterized genome reorganization that accompanied bacterial host adaptation, inferring selection pressures acting at both genic and intergenic loci. We found that both subspecies encode an expansive pangenome with many genes at rare frequencies. Recombination appears more frequent in M. abscessus subsp. massiliense than in subsp. abscessus, consistent with prior reports. We found evidence suggesting that phage are exchanged between subspecies, despite genetic barriers evident elsewhere throughout the genome. Patterns of LGT differed according to niche, with less LGT observed among host-adapted DCCs versus environmental bacteria. We also found evidence suggesting that DCCs are under distinct selection pressures at both genic and intergenic sites. Our results indicate that host adaptation of M. abscessus was accompanied by major changes in genome evolution, including shifts in the apparent frequency of LGT and impacts of selection. Differences were evident among the DCCs as well, which varied in the degree of gene content remodelling, suggesting they were placed differently along the evolutionary trajectory toward host adaptation. These results provide insight into the evolutionary forces that reshape bacterial genomes as they emerge into the pathogenic niche.
Previously published sequencing data used in this study are available from the National Center for Biotechnology Information (NCBI) under the Sequence Read Archive (SRA) accession numbers found in Table S1 (available with the online version of this article).is an important pathogen that causes skin abscesses, and pulmonary and other infections, and is a common bacterial pathogen isolated from cystic fibrosis patients. Until recently,
was thought to be acquired exclusively from the environment; however, new evidence of person-to-person transmission has serious implications for how we monitor and treat these infections, particularly with antibiotic-resistant strains. Our study leverages powerful genomics tools to characterize the recent emergence of
into the human pathogenic niche. We find significant genomic changes associated with strains originating from person-to-person transmission including changes in patterns of lateral gene transfer, and evidence for relaxed purifying selection. Our work highlights the importance of genome remodelling in the process of pathogen emergence.
Introduction
is a rapid growing mycobacterial species found in a variety of environments [1, 2]. This free-living bacterium is an opportunistic pathogen capable of causing skin abscesses, and pulmonary and other infections [1, 3].
is the most common species of non-tuberculous mycobacterium (NTM) isolated from individuals with cystic fibrosis (CF) [4].
infections are hard to treat, as the bacteria are intrinsically resistant to many antibiotics and additional resistances are on the rise [5-7]. Until recently, infection with
was thought to be exclusively acquired from the environment; however, evidence of person-to-person transmission has been identified for recently emerged dominant circulating clones (DCCs) [4, 8, 9]. DCCs are genetically homogeneous groups of bacteria that are globally distributed, and account for a substantial proportion of infections in CF patients; the circumstances in which these organisms may be transmitted person to person are as yet unclear [10, 11].is distantly related to
and other pathogenic mycobacteria.
has been divided into three subspecies: abscessus, massiliense and bolletii [12]. Average nucleotide identity (ANI) values within and between subspecies of
are consistent as described in other species: within subspecies ANI values are >98 %, while between subspecies ANI values are ~96–97 % [12, 13]. It was recently reported that
subsp.
recombines more frequently than the other subspecies [13].
encodes a diverse array of plasmids and prophages [8, 14–16]; the extent to which distinct mechanisms of lateral gene transfer (LGT) shape diversity of
is unknown.The recent emergence of host-adapted lineages of
provides an opportunity to learn about the mechanisms underlying bacterial pathogen emergence, including the role of LGT. LGT can enable bacterial adaptation to new environments, including the pathogenic niche. Emergence of the major pathogen
was accompanied by acquisition of novel genetic material by LGT [17-19], followed by apparent loss of the capacity to engage in LGT [20-23]. Previous research points to a reduction in genome size, an increase in pseudogenes and a loss of metabolic capacity in association with bacterial pathogen emergence [24-27]. Our aim here was to use comparative genomics to investigate genome reorganization occurring during emergence of host-associated
. We found putatively host-adapted DCCs to be characterized by lower levels of LGT relative to environmentally acquired
. We did not observe large-scale pseudogenization across the DCCs; however, we found evidence of relaxation of purifying selection in the core and accessory genomes of DCCs, which may be an earlier indicator of host adaptation. Taken together, these results provide insight into genome reorganization that accompanies bacterial pathogen emergence.
Methods
Data set and subsampling
We compiled Illumina sequence read data from a collection of 554 clinical isolates including newly sequenced isolates from Norway (PRJEB9515), as well as publicly available isolates from Michigan (PRJNA315990), Maryland (PRJNA231221), and the sample published by Bryant et al. [8]. We mapped raw sequence reads to the
reference ATCC 19977 genome. Our pipeline can be found at https://github.com/pepperell-lab/RGAPepPipe (accessed 20 July 2021). Briefly, we used FastQC v0.11.8 [28] and TrimGalore v0.6.4 [29] for quality assessment and read adaptor trimming. We mapped the trimmed reads to the
ATCC 19977 genome using SAMtools v1.5 [30], and used Picard Tools v1.183 for sorting and format conversion (http://broadinstitute.github.io/picard/). Lastly, we called variants using Pilon v1.16 [31]. Once all isolates were aligned to the reference, we separated the
isolates by subsp. abscessus and massiliense, and used FastTree v2.1.9 [32] to reconstruct a maximum-likelihood phylogeny under the discrete gamma distribution model. Then, we used TreeGubbins v2.4.1 [33] to identify the DCCs:
subsp.
DCC1 and DCC2, and
subsp.
DCC1, following the nomenclature described by Bryant et al. [8]. We labelled the non-DCC isolates as environmentally acquired isolates (EAIs). We used Treemer v0.3 [34] to informatively subsample based on maintaining maximum phylogenetic diversity, and selected 40 DCCs and 40 EAIs for each group; this resulted in a final dataset of 120
subsp.
and 80 subsp. massiliense isolates. We included isolates from subspecies abscessus and massiliense. We did not include isolates from
subspecies
, because there were few isolates available and there have not been reports of host-associated DCC emergences within the subspecies [8]. The final data set spans seven geographical regions: Australia, Denmark, Ireland, Holland, Norway, the UK and the USA. Isolate metadata and accession numbers can be found in Table S1.
DNA extraction and sequencing
Bacterial isolates received at the Norwegian Institute of Public Health were grown on Lowenstein–Jensen media. Genomic DNA was extracted as described in Eldholm et al. [35]. Sequencing libraries were prepared using the Nextera XT kit (Illumina) as per the manufacturer’s protocol and sequenced on the Illumina MiSeq platform generating paired-end 250 bp reads.
De novo assembly, annotation, quality control and core-genome alignment
The
genomes were de novo assembled with SPAdes v. 3.12.0 [36] and annotated using Prokka v 1.14-dev [37]. We assessed the quality of the de novo assemblies using Quast v. 4.6.3 [38] and removed isolates with N50 <50 kbp. We identified orthologous genes in the core and accessory genomes using Roary v. 3.11.2 [39] with a amino acid identity of 95%, and we aligned the core genes with prank [40]. Additionally, for downstream intergenic region (IGR) analyses, we re-analysed the core genes using Roary without splitting paralogous genes.
Maximum-likelihood phylogenetic analysis
We performed maximum-likelihood phylogenetic inference on core-genome alignments using RAxML v 8.2.3 [41] with the GTR (general time reversible) model of nucleotide substitution and gamma distribution of rate variation. We performed bootstrapping using autoMR convergence. We used ggtree [42] for tree visualization.
Recombination detection in the core genome
Recombinant tracts were identified in the core genomes using Gubbins v 2.3.2 [43] and visualized with Phandango [44]. Briefly, Gubbins identifies recombination by using spatial scanning statistics to identify loci with high densities of SNPs. We calculated the proportion of sites affected by recombination per isolate and performed a Kruskal–Wallis test in R to determine the differences in proportion of recombinant sites by group [45]. We then performed pairwise Mann–Whitney–Wilcoxon tests in R with Bonferroni correction to identify which distribution pairs were significantly different [46, 47]. Additionally, we compared the fragment length distributions of recombinant tracts between
subspecies to inferred recombinant tracts in other mycobacterial species known to participate in distributive conjugal transfer (DCT): 'Mycobacterium canettii' [48, 49] and
[50, 51]. We used SplitsTree4 v. 4.14.4 [52] to reconstruct a phylogenetic network of the core-genome alignments for both subspecies and conducted a PHI (pairwise homoplasy index) test to detect recombination [53].
Admixture analyses
We performed bacterial admixture analyses using FineSTRUCTURE v. 2.1.3 [54] to investigate the extent of LGT among
subspecies. We constructed a core biallelic SNP alignment across all 200
.
isolates using SNP-sites [55] and VCFtools [56]. Next, we used FineSTRUCTURE to build a co-ancestry matrix that contains the number of recombination events from each donor to each recipient. Briefly, FineSTRUCTURE infers fragments transferred via recombination and reconstructs haplotypes on the chromosome of a recipient as a series of fragments from all other donor individuals within the sample [54]. Additionally, we performed these analyses using core biallelic SNP alignments of each subspecies separately to compare recombination events within and between the DCCs and EAIs.
Pangenome selection and diversity analyses
We calculated pangenome accumulation and rarefaction curves of
subsp.
and subsp. massiliense isolates, as well as DCCs and EAIs for both subspecies using random subsampling without replacement 100 times, plotting the median value for the core and pangenome values. Additionally, we calculated the gene frequencies of accessory genes from both subspecies. Using EggLib [57], we calculated the mean π per accessory gene within and between subspecies using a subset of genes found at intermediate frequencies (1–99 %) in both species. We performed a Kruskal–Wallis test in R to determine the differences in mean gene π values by group [45]. We then performed pairwise Mann–Whitney–Wilcoxon tests in R with Bonferroni correction to identify significant differences among distribution pairs [46, 47]. To investigate the effects of LGT and selection on diversity within subspecies, we calculated the distribution of pairwise πN/πS values across the accessory genes present in at least four isolates per group [57].
Core-genome and accessory-genome distances
Core- and accessory-genome tree distances were calculated for all pairs of isolates within a subspecies using the distTips function from the adephylo (https://cran.r-project.org/web/packages/adephylo/index.html) package in R. For core distances, the separate core-genome phylogenies for each subspecies were used (see the Maximum-likelihood phylogenetic analysis section in Methods). For accessory distances, we used the phylogenetic tree based on accessory gene presence/absence as output by Roary (see the De novo assembly, annotation, quality control and core-genome alignment section in Methods).In order to measure core-genome divergence for samples, as opposed to pairs of isolates, we calculated segregating sites for the separate core-genome alignments of
subsp.
and subsp. massiliense. The number of segregating sites (normalized to the overall size of the core-genome alignment) was measured for random subsamples of EAIs and DCC isolates (100 repetitions for each isolate type, N=10 sampled without replacement). From each subsample, the core-genome and pangenome sizes (as calculated by Roary) were plotted against the number of segregating sites calculated using EggLib [57]. A linear regression with a 95 % confidence interval was fit to the data from the EAIs using the stat_smooth() function in ggplot2 with the lm method. Regression equations, R
2 values and P values were plotted using the stat_poly_eq() and stat_fit_glance() methods from ggpmisc (https://cran.r-project.org/web/packages/ggpmisc/index.html).
Prophage identification
We used ProphET [58] to detect prophage in our collection of genomes. Briefly, ProphET performs a similarity search of annotated proteins from bacterial genomes against a database of known phage proteins to identify prophage within bacterial genomes. ProphET discards regions with a low density of phage-associated genes. After identifying the prophage regions, we calculated pairwise mash [59] distances, which are based on shared k-mer (sequences of length k) content between prophage nucleotide sequences. To visualize the relatedness between prophage regions, we plotted a dendrogram in R.
Pseudogene analysis
To identify and characterize possible pseudogenes, we looked for identity between intergenic sequences in our alignment and known protein sequences. Specifically, we constructed an alignment of all IGRs using Piggy [60] with a minimum per cent length identity and minimum per cent nucleotide identity of 90 %. Next, we created a custom protein blast [61] database of all publicly available
proteins and blasted the IGRs against the custom database. Pseudogenes were identified if the blast result had an E value cut-off of 10e−5 as described by Belda and colleagues [62]. We performed a Kruskal–Wallace test in R to determine differences in mean pseudogene values by group [45]. We then performed pairwise Mann–Whitney–Wilcoxon tests in R with Bonferroni correction to identify significant differences among distribution pairs [47, 49].
IGR selection and diversity analyses
To identify the extent to which selection is acting on IGRs in the DCCs of both
subsp.
and subsp. massiliense, we estimated dI/dS across intergenic alignments of
subsp.
DCC1, DCC2 and EAI, and
subsp.
DCC and EAIs, as described previously [63]. Briefly, we calculated dN/dS across the core-gene alignments using the yn00 implementation [64] in paml [65]. We calculated pairwise SNPs in the intergenic alignment and calculated dI by dividing the number of SNPs by the length of the alignment. The dS values calculated from the core-gene alignment were used to calculate dN/dS and dI/dS.
Results
Pangenome and phylogenetic analyses
We de novo assembled and analysed core-genome alignments of 200 clinical
(120 subsp. abscessus and 80 subsp. massiliense) isolates from seven geographical locations: Australia, Denmark, Dublin, Holland, Norway, the UK and the USA. Pangenome analysis of these 200
.
isolates defined a large accessory genome in both
subsp.
and subsp. massiliense, with many genes found at low frequencies. The core genome consisted of 3672 genes present in at least 99 % of isolates, whereas a total of 41 939 genes were found at lower frequencies (Table S2). The mean number of genes per isolate was comparable between
subsp.
and subsp. massiliense: 5182 and 5107, respectively.We found the core genomes of
subsp.
and subsp. massiliense to be clearly differentiated, as shown by the long branch separating them in the maximum-likelihood phylogeny and network (Figs 1 and S1). This was also true of the accessory genome, with each subspecies characterized by distinct accessory gene content (Fig. 2). Both subspecies are globally distributed, with little evidence of geographical structure in either the host-associated DCCs or EAIs (Figs 1 and S2). The sample of
subsp.
includes two DCC clades, whereas subsp. massiliense contains one DCC clade (Figs 1 and S2). The DCCs of both species have larger genomes than their EAI counterparts: DCCs had a median gene count of 250, 200 and 150 genes higher than EAIs for the
subsp.
DCC1, DCC2 and
subsp.
DCC, respectively.
Fig. 1.
subsp.
(MAA) and subsp. massiliense (MAM) core-genome maximum-likelihood phylogeny shows distinct subspecies structure. We inferred a maximum-likelihood phylogeny from a core-genome alignment of 120
subsp.
(top) and 80
subsp.
(bottom) isolates. DCCs as defined by Bryant and colleagues [8] are labelled in both subspecies. Non-DCC isolates are considered EAIs. Isolates from different geographical regions are interspersed in the phylogeny, consistent with efficient global dispersal of bacteria. The phylogeny is midpoint rooted, and nodes with bootstrap values lower than 70 are labelled in grey. Branch lengths are scaled by the number of substitutions per site.
Fig. 2.
Accessory gene content of
is structured by subspecies/clade. Gene homologues with 95 % amino acid similarity identified using Roary are plotted, excluding singleton genes. Order of species in the phylogenetic tree (left) corresponds to the species order in Fig. 1. Accessory content differs among
subspecies and niche specialists (DCCs). Plot created using Phandango [44]. MAA,
subsp.
; MAM,
subsp.
.
subsp.
(MAA) and subsp. massiliense (MAM) core-genome maximum-likelihood phylogeny shows distinct subspecies structure. We inferred a maximum-likelihood phylogeny from a core-genome alignment of 120
subsp.
(top) and 80
subsp.
(bottom) isolates. DCCs as defined by Bryant and colleagues [8] are labelled in both subspecies. Non-DCC isolates are considered EAIs. Isolates from different geographical regions are interspersed in the phylogeny, consistent with efficient global dispersal of bacteria. The phylogeny is midpoint rooted, and nodes with bootstrap values lower than 70 are labelled in grey. Branch lengths are scaled by the number of substitutions per site.Accessory gene content of
is structured by subspecies/clade. Gene homologues with 95 % amino acid similarity identified using Roary are plotted, excluding singleton genes. Order of species in the phylogenetic tree (left) corresponds to the species order in Fig. 1. Accessory content differs among
subspecies and niche specialists (DCCs). Plot created using Phandango [44]. MAA,
subsp.
; MAM,
subsp.
.
LGT in the core genome
In our analyses of LGT in the core genome of
, we found little evidence of admixture between the two subspecies. A previous study using multilocus sequence typing data and structure [66] hypothesized that
subspecies are admixed [67]; however, using FineSTRUCTURE [54], we found evidence for substantial genetic exchange within
subsp.
and subsp. massiliense, but limited admixture between them (Fig. 3).
Fig. 3.
Co-ancestry matrix depicting population structure and gene flow in the core genome of 120
.
isolates. The co-ancestry matrix is coloured by the estimated fractions of genome fragments that are imported from a donor genome (column) to a recipient genome (row). For example, colours higher on the relative scale (right) indicate a higher proportion of transferred fragments. The trees on the top and left show clustering assignments as determined by FineSTRUCTURE.
.
subsp.
(MAM) and subsp. abscessus (MAA) appear genetically isolated as indicated by the large red rectangles in the matrix. Both
.
subsp.
(upper left) and subsp. abscessus (lower right) appear to recombine more frequently within subspecies than between. Within subspecies recombination appears more frequent in
.
subsp.
than in subsp. abscessus, indicated by the lighter yellow colour of the upper left versus lower right box.
Co-ancestry matrix depicting population structure and gene flow in the core genome of 120
.
isolates. The co-ancestry matrix is coloured by the estimated fractions of genome fragments that are imported from a donor genome (column) to a recipient genome (row). For example, colours higher on the relative scale (right) indicate a higher proportion of transferred fragments. The trees on the top and left show clustering assignments as determined by FineSTRUCTURE.
.
subsp.
(MAM) and subsp. abscessus (MAA) appear genetically isolated as indicated by the large red rectangles in the matrix. Both
.
subsp.
(upper left) and subsp. abscessus (lower right) appear to recombine more frequently within subspecies than between. Within subspecies recombination appears more frequent in
.
subsp.
than in subsp. abscessus, indicated by the lighter yellow colour of the upper left versus lower right box.Based on these results, we characterized LGT in the core genomes of the two subspecies separately. We found more reticulations in the network of
.
subsp.
’s core genome than that of
.
subsp.
(Fig. S3), suggesting
.
subsp.
participates in LGT more than
.
subsp.
. Both subspecies did, however, show evidence of recombination (PHI (pairwise homoplasy index) test of recombination, P=0). Pangenomes of the two subspecies are of similar size (Fig. S4). Using Gubbins [43], we found that 99 % of sites in a core-genome alignment of
.
subsp.
has been affected by recombination, which compares to 91 % for
.
subsp.
(Fig. 4). In a pairwise comparison of EAIs, the mean proportion of recombinant sites was higher in
.
subsp.
than subsp. abscessus (Mann–Whitney–Wilcoxon test, W=435, P=0.0045) (Fig. 5). We found the mean size of recombinant tracts for
subsp.
and subsp. massiliense to be 3606 and 3252 bp, respectively. The distributions of recombinant fragment lengths for both
subsp.
and subsp. massiliense appear similar to that of Mycobacterium canettii, a species known to participate in DCT [48, 49] (Fig. S5). DCT is novel form of LGT that involves the transfer of multiple segments of chromosomal DNA from donor to recipient, and has thus far only been described in mycobacteria [68, 69].
Fig. 4.
Within subspecies recombination is frequent in
core genomes. Recombination in the core genomes of (a)
subsp.
and (b) subsp. massiliense. Each row corresponds to the core genome of an isolate in the phylogenetic tree to the left. Blue segments represent laterally transferred fragments unique to an individual isolate. Red segments indicate laterally transferred fragments that are shared across multiple isolates. The proportion of sites affected by recombination is 90.9 % for
subsp.
and 98.6 % for subsp. massiliense. The DCCs in both subspecies appear to have less recombination than EAIs.
Fig. 5.
subsp.
(MAM) core genomes have more evidence of recombination than subsp. abscessus (MAA). Boxplots show the proportion of each isolate’s core genome affected by recombination, as estimated with Gubbins. The box spans the interquartile range, the median is represented by the middle line and the whiskers extend to ±1.5 times the interquartile range. Data beyond the end of the whiskers are outliers and plotted individually. The mean proportion of recombinant sites differs significantly by group (Kruskal–Wallis test, H=76, P<0.0001). The distribution of affected core genomes in
subsp.
EAIs is higher than that of
subsp.
EAIs (Mann–Whitney–Wilcoxon test, W=435, P=0.0045). Additionally, the
subsp.
DCCs have a significantly lower proportion of their genomes affected by recombination when compared to the
subsp.
EAI (Mann–Whitney–Wilcoxon test, W=435, P=0.0045). Finally, the DCCs from both subspecies have narrower distributions of proportion of the core genome affected by recombination than the EAIs. This is consistent with a reduction in recombination. P value: **, <0.01.
Within subspecies recombination is frequent in
core genomes. Recombination in the core genomes of (a)
subsp.
and (b) subsp. massiliense. Each row corresponds to the core genome of an isolate in the phylogenetic tree to the left. Blue segments represent laterally transferred fragments unique to an individual isolate. Red segments indicate laterally transferred fragments that are shared across multiple isolates. The proportion of sites affected by recombination is 90.9 % for
subsp.
and 98.6 % for subsp. massiliense. The DCCs in both subspecies appear to have less recombination than EAIs.subsp.
(MAM) core genomes have more evidence of recombination than subsp. abscessus (MAA). Boxplots show the proportion of each isolate’s core genome affected by recombination, as estimated with Gubbins. The box spans the interquartile range, the median is represented by the middle line and the whiskers extend to ±1.5 times the interquartile range. Data beyond the end of the whiskers are outliers and plotted individually. The mean proportion of recombinant sites differs significantly by group (Kruskal–Wallis test, H=76, P<0.0001). The distribution of affected core genomes in
subsp.
EAIs is higher than that of
subsp.
EAIs (Mann–Whitney–Wilcoxon test, W=435, P=0.0045). Additionally, the
subsp.
DCCs have a significantly lower proportion of their genomes affected by recombination when compared to the
subsp.
EAI (Mann–Whitney–Wilcoxon test, W=435, P=0.0045). Finally, the DCCs from both subspecies have narrower distributions of proportion of the core genome affected by recombination than the EAIs. This is consistent with a reduction in recombination. P value: **, <0.01.In our analyses of LGT in the core genomes of
subsp.
and subsp. massiliense, we found differences in the patterns of LGT between the DCCs and EAIs. DCC isolates were more likely to have recombinant tracts shared across the entire clade (Figs. 4 and S6). Additionally, in the co-ancestry matrices of both subspecies, the DCCs are highly structured compared to EAIs, and appear less likely to recombine (Fig. S7). Across both subspecies, the proportion of recombinant sites is higher in EAIs than DCCs (Kruskal–Wallis test, H=76, P<0.01) (Fig. 5).
Patterns of LGT in the accessory genome of
When comparing the accessory genomes of
subsp.
and subsp. massiliense, we found several lines of evidence suggesting there are barriers to accessory gene flow between subspecies. First, shared accessory genes are maintained at different frequencies between the two subspecies (Fig. 6). In the setting of frequent exchange between subspecies, we would instead expect equalization of their accessory gene frequencies. If the two subspecies were readily exchanging their accessory genes, we would further expect intergroup measures of diversity (π) to be similar to measures within subspecies. Instead, we found genewise π values of shared accessory genes to be lower within subspecies than between subspecies (Kruskal–Wallace test, P<0.01; Mann–Whitney–Wilcoxon test, P<0.01) (Fig. S8). These results suggest accessory genes are transferred more frequently within than between subspecies.
Fig. 6.
Shared accessory genes are found at different frequencies in
subspecies. Heat map of gene frequencies in
subsp.
(MAA) and subsp. massiliense (MAM). Much of the accessory gene content is unique to each subspecies and is rare within subspecies. Shared accessory genes are maintained at distinct frequencies across the subspecies. This suggests there are barriers to gene flow as we would expect frequencies to equalize in the presence of free flow between subspecies.
Shared accessory genes are found at different frequencies in
subspecies. Heat map of gene frequencies in
subsp.
(MAA) and subsp. massiliense (MAM). Much of the accessory gene content is unique to each subspecies and is rare within subspecies. Shared accessory genes are maintained at distinct frequencies across the subspecies. This suggests there are barriers to gene flow as we would expect frequencies to equalize in the presence of free flow between subspecies.DCCs and EAIs differed with respect to their pangenomes. In both subspecies, the pangenome is smaller and the core genome is larger for DCCs than for EAIs (Fig. 7). This is consistent with lower rates of LGT among DCCs relative to EAIs. However, given that DCCs are recently emerged, it is also possible that underlying rates of LGT are similar between groups but the DCCs have had less time to acquire novel gene content. In order to further investigate these alternatives, we assayed the relationships between proxy measures of divergence time, accessory-genome differentiation and pangenome expansion. We used segregating sites for the subspecies core-genome alignment as a proxy for divergence time of a sample. There was no significant correlation between divergence time and the overall size of the pangenome in either
subsp.
or subsp. abscessus (R
2 0.03 and 0.04, respectively; P>0.05 for both) (Fig. S9), suggesting that the
pangenome does not expand linearly as a sample diversifies. There was a significant correlation between core-genome size and sample diversification for both subspecies (R
2 0.13 for
subsp.
, 0.15 for
subsp.
; P<0.001 for both) (Fig. 7). In the context of these trends, DCCs in both subspecies remained clear outliers with respect to the size of their core genomes (Fig. 7); that is, their core genomes appear larger than expected given the divergence time of the sample.
Fig. 7.
EAIs have a larger pangenome than DCCs. In both
subsp.
(MAA) (a) and subsp. massiliense (MAM) (b), rarefaction plots show that EAIs have a larger pangenome and a smaller core genome than DCCs, consistent with lower rates of LGT among EAIs relative to DCCs. However, it is also possible that pangenome size reflects the timescales over which samples diverged, with the DCCs simply having had less time to accumulate novel genes. To test for this, we randomly sampled the EAIs, measured segregating sites as a proxy for divergence times and calculated the core-genome and pangenome sizes. Pangenome size was not significantly correlated with sample divergence (Fig. S9), whereas core-genome size was modestly correlated (right panels). The DCCs were outliers with respect to core-genome size, suggesting that observed trends are not fully accounted for by differences in sample divergence times.
EAIs have a larger pangenome than DCCs. In both
subsp.
(MAA) (a) and subsp. massiliense (MAM) (b), rarefaction plots show that EAIs have a larger pangenome and a smaller core genome than DCCs, consistent with lower rates of LGT among EAIs relative to DCCs. However, it is also possible that pangenome size reflects the timescales over which samples diverged, with the DCCs simply having had less time to accumulate novel genes. To test for this, we randomly sampled the EAIs, measured segregating sites as a proxy for divergence times and calculated the core-genome and pangenome sizes. Pangenome size was not significantly correlated with sample divergence (Fig. S9), whereas core-genome size was modestly correlated (right panels). The DCCs were outliers with respect to core-genome size, suggesting that observed trends are not fully accounted for by differences in sample divergence times.We also examined the relationship between core- and accessory-genome diversification at the level of individual isolates (Fig. S10). Here, we compared tree distances from the core-genome phylogeny with accessory-genome distances inferred from a gene presence/absence matrix for pairs of EAIs. Examining pairs of isolates in this way, we observed a linear relationship between core- and accessory-genome distances for EAIs from both subspecies (R
2 0.24 for
ubsp.
, 0.21 for subsp. abscessus; P<0.01 for both) suggesting that accessory-genome differentiation occurs in parallel with differentiation of the core genome. Synthesizing this observation with findings at the sample level, we infer that rates of acquisition of novel genes parallel rates of core-genome diversification, but that the pool of novel genes for EAIs is very large and unstructured such that the overall size of the pangenome does not expand linearly as a sample diversifies. We also analysed EAI–DCC pairs (Fig. S10) and found that for a given core-genome distance, accessory gene content of DCCs is more differentiated from EAIs than EAIs are from each other. This is consistent with genetic isolation of DCCs. Taken together, these results suggest that the pangenome dynamics of the DCCs are distinct from those of the EAIs, likely due to the DCCs losing access to the large pool of accessory genes shared among EAIs.
Prophage analyses
Despite barriers to LGT evident in the accessory genome, mobile genetic elements appear to be readily exchanged between subspecies. Using ProphET [58], we identified 407 prophage in our dataset of 200
.
genomes. We then used multi-dimensional scaling to visualize pairwise mash distances of these prophage and found that prophage from different bacterial groupings (DCC vs EAI,
.
subsp.
vs subsp. massiliense) were intermingled (Fig. 8). This suggests that prophage are readily transferred among otherwise genetically distinct isolates of
.
Fig. 8.
prophage are readily exchanged between subspecies and strain types. We used multi-dimensional scaling of pairwise mash distances of prophage nucleotide sequences identified using ProphET to visualize prophage diversity and distribution across groups. Prophage do not cluster according to subspecies or strain type (DCC vs EAI), suggesting they are readily transferred between groups. Plot points jittered ±0.05 for greater visibility. MAA,
subsp.
; MAM,
subsp.
.
prophage are readily exchanged between subspecies and strain types. We used multi-dimensional scaling of pairwise mash distances of prophage nucleotide sequences identified using ProphET to visualize prophage diversity and distribution across groups. Prophage do not cluster according to subspecies or strain type (DCC vs EAI), suggesting they are readily transferred between groups. Plot points jittered ±0.05 for greater visibility. MAA,
subsp.
; MAM,
subsp.
.
Genomic signatures of pathogen emergence
In addition to differences in recombination, we identified other genomic differences between DCCs and EAIs.
.
subsp.
DCC1 shows evidence of remodelling of its core genome relative to EAIs, with both loss of what was previously core gene content and gain of unique core gene content.
.
subsp.
DCC2 shows evidence of gain only, whereas subsp. massiliense DCC exhibits neither loss nor gain (Fig. S11). This suggests that, relative to the
ubsp.
DCCs, the
.
subsp.
DCC may be further along an adaptive trajectory away from the EAIs and toward host association and specialization.The genomes of bacterial pathogens are often observed to be smaller relative to their free-living relatives and it has been posited that ‘genome downsizing’ is a feature of host adaptation [24, 25]. Pseudogenization may signal genome downsizing as newly unnecessary genes accumulate nonsense and other mutations that disrupt protein coding. We investigated the extent of pseudogenization among DCCs by identifying gene fragments using a blast method described previously [62]. The data did not show a consistent trend (Fig. 9) toward pseudogenization among the putatively host-adapted bacteria.
Fig. 9.
Levels of pseudogenization are low and similar between DCCs and EAIs. Counts of pseudogenes per isolate as a per cent of total intact genes per isolate are shown. We aligned all IGRs and blasted them against a custom
proteome database. Pseudogenes were identified based on a published E value cut-off of 10e−5 [62].
.
subsp.
(MAM) DCC and
.
subsp.
(MAA) DCC1 have slightly more pseudogenes than their respective EAIs, a difference that is not statistically significant.
Levels of pseudogenization are low and similar between DCCs and EAIs. Counts of pseudogenes per isolate as a per cent of total intact genes per isolate are shown. We aligned all IGRs and blasted them against a custom
proteome database. Pseudogenes were identified based on a published E value cut-off of 10e−5 [62].
.
subsp.
(MAM) DCC and
.
subsp.
(MAA) DCC1 have slightly more pseudogenes than their respective EAIs, a difference that is not statistically significant.In order to investigate potential differences in selection pressures acting on DCCs and EAIs, we compared the distributions of pairwise πN/πS values across their core and accessory genomes. Gene-wise πN/πS values were higher in DCCs than in EAIs, for both subspecies (
.
subsp.
DCC1 vs EAI: Mann–Whitney–Wilcoxon test, W=2.9e5, P=0.0003) (Fig. S12). This suggests DCCs could be evolving under relaxed purifying selection relative to EAIs.
Genetic diversity across IGRs
We hypothesized that the IGRs of the DCCs may be under distinct selection pressures relative to EAIs. To investigate this hypothesis, we calculated pairwise dI/dS and dN/dS values within and between groups for each subspecies. Distributions of dN/dS values were consistently lower than those of dI/dS, consistent with genic regions being under relatively strong selective constraint (Figs 10 and 11). Distributions of pairwise intergenic variants are closely approximated for DCC–DCC, DCC–EAI and EAI–EAI in
.
subsp.
and less so for
.
subsp.
(Fig. 11c). This is consistent with
.
subsp.
DCCs being further along in their emergence as reflected in the accumulation of intergenic variants, parallel to the observation of more extensive gene content remodelling in the core genome (Fig. S11).
Fig. 11.
High rates of intergenic variation in DCCs due to recent emergence. (a) We calculated pairwise dI/dS values along intergenic alignments for each species and plotted dI/dS values across binned values of dS. (b) The detail for the lowest values of dS. NS = not significant. (c) Histograms of the number of intergenic SNPs in each pairwise comparison. For
subsp.
(MAM), the three distributions are distinct and ordered DCC–DCC, DCC–EAI and finally EAI–EAI; by contrast the three distributions are approximated in
subsp.
(MAA) indicating accumulation of intergenic SNPs in
subsp.
DCCs, consistent with maturation of the clone and remodelling of its core genome.
DCCs exhibit limited diversity with a skew to non-synonymous variation. (a) To investigate whether high values of dN/dS among DCCs could be explained by their relatively short divergence times, we plotted dN/dS for different values of dS. (b) The detail for the lowest values of dS. As expected, dN/dS values decrease as dS increases. However, DCCs exhibit higher than expected values of dN/dS relative to EAIs at the same depth of divergence (Mann–Whitney–Wilcoxon test, **** = P <0.001), consistent with relaxation of purifying selection.High rates of intergenic variation in DCCs due to recent emergence. (a) We calculated pairwise dI/dS values along intergenic alignments for each species and plotted dI/dS values across binned values of dS. (b) The detail for the lowest values of dS. NS = not significant. (c) Histograms of the number of intergenic SNPs in each pairwise comparison. For
subsp.
(MAM), the three distributions are distinct and ordered DCC–DCC, DCC–EAI and finally EAI–EAI; by contrast the three distributions are approximated in
subsp.
(MAA) indicating accumulation of intergenic SNPs in
subsp.
DCCs, consistent with maturation of the clone and remodelling of its core genome.
Impact of divergence time on patterns of variation
Divergence time has been shown to affect overall patterns of genetic diversity, with a greater proportion of variation at non-synonymous sites observed among recently diverged bacteria [70]. We hypothesized that this phenomenon could potentially explain differences observed between DCCs and EAIs. Measures of nearly neutral variation, i.e. at synonymous sites (dS), are lower among DCCs than EAIs, consistent with recent emergence and shorter divergence times (Figs 10 and 11). As expected, we observed non-synonymous variation to decrease as dS increased among EAIs (Fig. 10a). Interestingly, we observed an even stronger trend at intergenic sites (Fig. 11a), suggesting that many intergenic mutations are also deleterious and are removed over time. Relatively short divergence times appear to completely explain the high rates of intergenic variation observed among DCCs (Fig. 11b). However, this was not the case for genic variants, where there appears to be relaxation of purifying selection among DCCs relative to EAIs (Fig. 10b).
Fig. 10.
DCCs exhibit limited diversity with a skew to non-synonymous variation. (a) To investigate whether high values of dN/dS among DCCs could be explained by their relatively short divergence times, we plotted dN/dS for different values of dS. (b) The detail for the lowest values of dS. As expected, dN/dS values decrease as dS increases. However, DCCs exhibit higher than expected values of dN/dS relative to EAIs at the same depth of divergence (Mann–Whitney–Wilcoxon test, **** = P <0.001), consistent with relaxation of purifying selection.
Discussion
M. abscessus: an emerging health threat
is a rapid growing, free-living mycobacterium that is capable of causing serious infections, particularly in people with CF and other lung diseases, as well as immunocompromised individuals [1, 3, 4]. The incidence of infection with non-tuberculous mycobacteria (NTMs) including
has increased substantially over the past decade [71-73]. Treatment of
infections is challenging due to both intrinsic and acquired resistance to antimicrobials [5-7]. The adaptability of this organism is further highlighted by the recent discovery of DCCs with evidence of person-to-person transmission [8-11].
Mechanisms of LGT in
Our analyses point to what is likely to be an important source of this plasticity, namely the large pangenome and frequent LGT evident among
(Figs 4, 5 and 7, Table S2). These findings are consistent with prior studies that predicted an open pangenome [74] and found evidence of LGT among
[13, 67]. LGT in
is likely to occur via multiple mechanisms. We identified abundant prophage in our sample, consistent with previous findings [16]. The distribution and relatedness of these prophage suggest that active transduction is occurring across subspecies (Fig. 8). We did not attempt to identify plasmids, as short-read sequencing data are unreliable for this purpose [75]. Plasmids are likely to be another mechanism of LGT, as previous studies have identified conjugative plasmids in
[8, 14–16].DCT is another mechanism of LGT that has so far been uniquely identified in mycobacteria [50, 51]. Genetic requirements for DCT in
include the type VII secretion system (T7SS) loci ESX-1 and ESX-4 [76-78].
encodes T7SS loci ESX-4 or ESX-4-bis/EVOL and ESX-3, as well as plasmids containing T7SS loci [14, 16, 79]. Our analyses suggest that
engages in DCT, as we found recombinant fragments in the core genome to be similar in length to those of M. canettii, which is known to engage in DCT (Fig. S6) [48, 49]. Long recombinant fragment lengths are a characteristic feature of DCT [48, 68]. Taken together, our results suggest that LGT occurs via multiple mechanisms in
.
Barriers to LGT between
subspecies
A previous study found evidence suggesting that
ubsp.
engages in LGT more frequently than does subsp. abscessus [13]. Our results support and extend these observations: we found the proportion of sites affected by recombination in the EAIs to be higher in
ubsp.
than subsp. abscessus (Figs. 4 and 5), and the pace of accessory-genome diversification to be higher in
ubsp.
relative to subsp. abscessus (Fig. S9). A previous report of whole-genome sequence data from three isolates mapped to different references interpreted mapping patterns as indicative of admixture between subspecies of
and hypothesized that this was the result of DCT between subspecies [67]. DCT is likely to require substantial sequence similarity between donor and recipient genomes [68]. We found the core genomes of
ubsp.
and subsp. abscessus to be genetically distinct, without evidence of substantial admixture (Fig. 3); we further found several lines of evidence indicating there are barriers to LGT between subspecies’ accessory genomes (Figs. 6 and S8). Prophage are an exception, as they appear to be readily shared across subspecies (Fig. 8). These findings indicate that while LGT is frequent among
, this exchange is organized such that bacterial population structure is maintained. We recently identified similar phenomena among the highly recombinogenic
species [80]. This bacterial population structure could reflect niche partitioning that maintains functional differences between sub-populations, and/or mechanical barriers to LGT, e.g. in transfer via DCT or plasmid conjugation.
Genomic signatures of pathogen emergence in
infection was previously thought to be exclusively acquired from the environment; however, multiple globally distributed DCCs were recently identified that appear to be transmitted person to person among people with CF [8] (Fig. 1). The DCCs are illustrative of the genomic changes that accompany host adaptation and bacterial pathogen emergence. Prior research has identified pseudogenization, genome downsizing and a loss of metabolic capacity in comparisons of host-adapted with free-living bacteria [24–27, 81]. A recently published study of
[82] found that some functional categories were enriched among accessory gene content acquired by DCCs. We found putatively host-adapted DCCs to be characterized by lower levels of LGT relative to environmentally acquired strains of
(Figs. 4, 5 and 7). This echoes observations of the highly pathogenic species
: while there is evidence of frequent LGT in its closest relative, M. canettii,
appears to evolve strictly clonally [48, 49].
is transmitted person to person, whereas M. canettii is thought to be acquired from the environment or another reservoir [83]. Similar patterns have been observed outside the mycobacteria, e.g. in comparisons of pathogenic and non-pathogenic strains of
s [84]. Free-living
encode a massive pangenome (Fig. 7), which we hypothesize is fuelled by LGT with diverse partners found in soil and similarly complex niches. We further hypothesize that the CF lung offers fewer opportunities to partner with diverse bacteria, such that the DCCs’ occupation of this niche is reflected in contraction of their pangenomes relative to the EAIs.
IGR remodelling during pathogen emergence
Although IGRs have traditionally been neglected as targets of selection, more recent research has uncovered a rich array of functions in association with these loci [85-87] and methods have been developed to characterize selection pressures at IGRs [60]. IGRs have been identified as a target of positive selection in
[63]. We hypothesized that regulatory elements could be involved in host adaptation of
and that this would be reflected in elevated dI/dS. DCCs were indeed characterized by relatively high rates of variation at intergenic sites (Fig. 11a). Extrapolating from the literature on adaptation at genic sites [70], we investigated whether intergenic variants appear to be pruned as bacteria diverge from each other. Our data are consistent with this phenomenon, as dI/dS decreased dramatically among the EAIs as dS increased (Fig. 11). This phenomenon appears to explain the high levels of intergenic variation observed among DCCs (Fig. 11b). Together, these data suggest that purifying selection is an important force shaping adaptation at intergenic sites in
.We found evidence of relaxation of purifying selection in the core and accessory genomes, reflected in elevated values of πN/πS relative to EAIs (Fig. S12). A recently published study [82] reported that the DCCs were under purifying selection based on dN/dS estimates <1 on the phylogenetic branch leading to the DCCs. It is difficult to interpret and contextualize their result as they used an unpublished method and did not compare DCCs with EAIs. Our findings are broadly consistent (πN/πS <1), but comparison with EAIs reveals a relative increase in non-synonymous variants among the DCCs. This likely reflects their recent emergence and divergence, and possibly linkage of mildly deleterious mutations to those positively selected during invasion of a new niche [70]. We did find that the increase in non-synonymous variation was out of keeping with divergence time, as measured by dS (Fig. 10b), suggesting that some protein-altering mutations have become newly tolerable during the shift to host adaptation. Interestingly, this pattern differs from observations at intergenic sites (Fig. 11b): we hypothesize that this could be due to greater pleiotropy of regulatory versus genic mutations. The distribution of dI/dS values among DCCs has a long tail (Fig. 11). High levels of intergenic mutations in some isolates could reflect cascades of mutations as regulatory mutations engender bursts of subsequent compensatory mutations that offset negative pleiotropy.As noted above, pseudogenization has previously been identified in association with the shift to a host-associated lifestyle. We did not observe large-scale pseudogenization across the DCCs (Fig. 9). This is consistent with recently published research that found patterns of diversity among pathogens to be distinct from what had previously been described for other host-associated bacteria [84]. Murray et al. found that unlike symbionts, bacterial pathogen genomes were not characterized by excess pseudogenes or a loss of metabolic coding capacity.
Conclusion
In conclusion,
populations are characterized by a complex structure, within which subspecies maintain genetic differentiation in their core and accessory genomes despite overall high levels of LGT mediated by multiple mechanisms. Phage are an exception in that they appear to be readily exchanged among subspecies, suggesting that barriers between subspecies are not maintained by specific pairings of host and phage. Evolving niche specialists, the DCCs, show evidence of genome remodelling and a dramatic shift in their mode of evolution with less evidence of LGT. This could reflect a loss of opportunity to partner with diverse bacteria and/or loss of the capacity to perform this function. The core genomes of DCCs exhibit varying levels of core-genome remodelling, consistent with successive emergence and slightly different placement along an evolutionary trajectory toward host adaptation. The DCCs appear to be under relaxed purifying selection relative to their free-living relatives, as has been observed in other pathogenic bacterial species. Consistent with recent research contrasting pathogens with other host-associated bacteria, we did not identify large-scale pseudogenization among the DCCs. Taken together, these results provide insight into adaptation of this important pathogen and the genomic remodelling events that occur during emergence into a new niche.Click here for additional data file.
Authors: Vegard Eldholm; Gunnstein Norheim; Bent von der Lippe; Wibeke Kinander; Ulf R Dahle; Dominique A Caugant; Turid Mannsåker; Anne Torunn Mengshoel; Anne Ma Dyrhol-Riise; Francois Balloux Journal: Genome Biol Date: 2014-11-07 Impact factor: 13.583
Authors: João L Reis-Cunha; Daniella C Bartholomeu; Abigail L Manson; Ashlee M Earl; Gustavo C Cerqueira Journal: PLoS One Date: 2019-10-02 Impact factor: 3.240
Authors: James Hadfield; Nicholas J Croucher; Richard J Goater; Khalil Abudahab; David M Aanensen; Simon R Harris Journal: Bioinformatics Date: 2018-01-15 Impact factor: 6.937