Combining genomics and epidemiology to analyse bi-directional transmission of Mycobacterium bovis in a multi-host system.

Joseph Crispell¹, Clare H Benton², Daniel Balaz³, Nicola De Maio⁴, Assel Ahkmetova⁵, Adrian Allen⁶, Roman Biek⁵, Eleanor L Presho⁶, James Dale⁷, Glyn Hewinson⁸, Samantha J Lycett³, Javier Nunez-Garcia⁹, Robin A Skuce⁶, Hannah Trewby¹⁰, Daniel J Wilson¹¹, Ruth N Zadoks⁵, Richard J Delahay², Rowland Raymond Kao³,¹².

Abstract

Quantifying pathogen transmission in multi-host systems is difficult, as exemplified in bovine tuberculosis (bTB) systems, but is crucial for control. The agent of bTB, Mycobacterium bovis, persists in cattle populations worldwide, often where potential wildlife reservoirs exist. However, the relative contribution of different host species to bTB persistence is generally unknown. In Britain, the role of badgers in infection persistence in cattle is highly contentious, despite decades of research and control efforts. We applied Bayesian phylogenetic and machine-learning approaches to bacterial genome data to quantify the roles of badgers and cattle in M. bovis infection dynamics in the presence of data biases. Our results suggest that transmission occurs more frequently from badgers to cattle than vice versa (10.4x in the most likely model) and that within-species transmission occurs at higher rates than between-species transmission for both. If representative, our results suggest that control operations should target both cattle and badgers.

Keywords: Mycobacterium bovis; badger; bovine tuberculosis; cattle; epidemiology; global health; infectious disease; microbiology; whole genome sequencing

Year: 2019 PMID: 31843054 PMCID: PMC6917503 DOI: 10.7554/eLife.45833

Source DB: PubMed Journal: eLife ISSN: 2050-084X Impact factor: 8.140

Introduction

Control of a pathogen in a system where it can infect multiple species requires an understanding of the role of each host species in the infection dynamics (Haydon et al., 2002). For example, when each host species is capable of maintaining infection independently, control operations in one species can be rendered ineffective as a result of spillover from another. Mycobacterium bovis infection in cattle populations (resulting in bovine tuberculosis - bTB) is a problem around the world (Ayele et al., 2004; Cousins and Roberts, 2001; de Kantor and Ritacco, 2006; Godfray et al., 2013; Reviriego Gordejo and Vermeersch, 2006; Schmitt et al., 2002), with many wildlife species implicated in its spread and persistence in different bTB systems (Delahay et al., 2002; Gortazar et al., 2003; Miller and Sweeney, 2013; Nugent, 2005; Nugent et al., 2015). On the islands of Britain and Ireland, the current evidence suggests that effective control of infection in cattle is hindered by transmission from an infected wildlife population – the European badger (Meles meles) (Godfray et al., 2013). Although a considerable amount of research demonstrates an association between M. bovis found in sympatric cattle and badger populations (Balseiro et al., 2013; Goodchild et al., 2012; Olea-Popelka et al., 2005; Vial et al., 2011; Woodroffe et al., 2005), quantification of the direction and extent of transmission remains elusive. Recent studies using whole genome sequences (WGS) have demonstrated a close genetic relationship among M. bovis isolates taken from sympatric cattle and wildlife populations (Biek et al., 2012; Glaser et al., 2016; Patané et al., 2017). However, the low genomic variability of M. bovis and imbalanced sampling across host species has limited the ability to identify the direction of transmission. 
Evidence to date suggests that, even with access to pathogen sequence data, obtaining directional estimates of transmission might only be possible at the population level and will require dense targeted sampling and fine-grained epidemiological metadata (Kao et al., 2016; Kao et al., 2014), as has previously been demonstrated in investigations of M. tuberculosis outbreaks in humans (Bryant et al., 2013; Gardy et al., 2011; Guthrie et al., 2018; Walker et al., 2012; Walker et al., 2018; Yang et al., 2017) and in tracing between cattle herds for outbreaks of M. bovis (Biek et al., 2012; Salvador et al., 2019). However, these approaches have yet to be applied to situations where dense multi-host pathogen data are available.

Since the 1970s, a high-density naturally infected badger population at Woodchester Park in southwest England has been the subject of detailed study (Delahay et al., 2013). Both the resident badgers and sympatric cattle herds are frequently infected with M. bovis, providing the potential for inter-species transmission of infection to occur in either direction (DEFRA, 2017; Delahay et al., 2013). The data and samples associated with bTB occurrence in and around Woodchester Park are uniquely detailed, with individual-level host life history data and archived M. bovis isolates available for both the cattle (Orton et al., 2018) and badger (Delahay et al., 2013) populations. By combining WGS of selected cattle and badger isolates with detailed local population data from this exceptionally in-depth study system, our work aimed to quantify the relative roles of the local badger and cattle populations in the spread and persistence of M. bovis in an endemic area.

Based on previous evidence of transmission between cattle and badgers, and the success of combining detailed tracing methods with WGS for M. tuberculosis, our hypothesis is that M. bovis circulation in our endemic setting is not limited to a single maintenance host and that it involves bi-directional transmission between the two host populations. Our research aimed to test this hypothesis and to quantify transmission patterns by analysing the Woodchester Park data using a series of statistical and observational analyses linking pathogen genome data with diagnostic testing and population movement and demographic data for both cattle and badgers.

Results

Selecting the isolates, generating and processing the sequencing data

Archived M. bovis isolates were available from 116 badgers and 189 cattle living in and around Woodchester Park. Multiple isolates were available from the sampled badgers, resulting in a total of 230 isolates sourced from badgers. These isolates were whole genome sequenced, and, after quality assessments, 193 badger-derived (from 98 individual badgers taken from 2000 to 2011) and 159 cattle-derived sequences (from 1988 to 2013) were retained for further analyses.

Evidence of epidemiological signatures in the genetic data

To investigate the presence of spatial, temporal, and network signatures associated with infection dynamics in the M. bovis genomic data, inter-sequence genetic distances were calculated between all the cattle- and badger-derived sequences and compared to population metrics. The metrics described the spatial-, temporal-, and network-based relationships that were expected to be associated with pathogen transmission. The genetic and epidemiological data were compared using Random Forest (Liaw and Wiener, 2002) and Boosted Regression (Elith et al., 2008) models in R (v3.4.3; R Development Core Team, 2016) to separately analyse badger–badger (n = 12483), cattle–cattle (n = 1927), and badger–cattle (n = 4838) comparisons. The Random Forest (and Boosted Regression) models were able to explain approximately 67% (62%), 60% (54%) and 75% (70%) of the variation observed in the inter-sequence genetic distance distributions associated with the badger–badger, cattle–cattle, and badger–cattle comparisons, respectively. For each of these models, metrics based on spatial and temporal distances were the most informative in explaining the variation in the genetic distances. Generally, as the temporal and spatial distances associated with the sampled animals decreased, the number of differences between the M. bovis genomes decreased (Appendix 1—figures 5, 6 and 7). There was substantial agreement in the variable rankings between the Random Forest and Boosted Regression models (Appendix 1—figures 2, 3 and 4). For the within-species comparisons, the network-based metrics were also highly informative. Generally, the number of differences between the genomes associated with a pair of animals of the same species decreased as the connectedness of their social groups (badgers) or herds (cattle) increased. 
The variation explained by the Random Forest models and the high ranking of spatial-, temporal-, and network-based metrics was robust to the presence of highly correlated or non-informative metrics and those with missing data (data not shown).
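As an illustration of what "variation explained" means for a regression model, the sketch below computes a pseudo-R² from observed and predicted genetic distances. It assumes the reported percentages follow the usual definition (one minus the mean squared error divided by the variance of the response); the function name and the toy values are hypothetical and are not taken from the study.

```python
def percent_variance_explained(observed, predicted):
    """Pseudo-R^2 as a percentage: 100 * (1 - MSE / Var(observed)).

    Assumption: the variance uses the n denominator, so a model that always
    predicts the mean of the observations scores exactly 0%.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    var = sum((o - mean_obs) ** 2 for o in observed) / n
    return 100.0 * (1.0 - mse / var)


# Hypothetical pairwise SNV distances and model predictions for four pairs.
observed = [2, 4, 6, 8]
predicted = [3, 4, 6, 7]
print(round(percent_variance_explained(observed, predicted), 1))
```

In practice the predictions would be out-of-sample (for example out-of-bag predictions for a Random Forest), so that the percentage is not inflated by overfitting.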

Appendix 1—figure 5.

Partial dependence plots estimating the average marginal effect of each epidemiological metric fitted in the Random Forest regression models on the inter-badger-sequence genetic distance distribution.

The Y axis in each sub-plot represents the genetic distance, measured as the number of differences between the M. bovis genomes. The X axis of each plot spans the range of the corresponding epidemiological metric. The red line represents the average marginal effect on the predicted genetic distance for each value of the epidemiological metric. Metrics with low importance in the Random Forest models were removed (% Mean Squared Error change of < 0.5%).

Appendix 1—figure 6.

Partial dependence plots estimating the average marginal effect of each epidemiological metric fitted in the Random Forest regression models on the inter-cattle-sequence genetic distance distribution.

Appendix 1—figure 7.

Partial dependence plots estimating the marginal effect of each epidemiological metric fitted in the Random Forest regression models on the badger-cattle-sequence genetic distance distribution.

Appendix 1—figure 2.

The importance of each epidemiological metric in explaining variation in the inter-badger-sequence genetic distance distribution.

Metrics are coloured according to whether they used temporal (gold), spatial (red), or network (blue) information. The correlation (Pearson’s correlation) of the variable importance from the Random Forest and Boosted Regression models is reported in the legend. Two random metrics, a sample from a uniform distribution and a sample from a Boolean distribution, were included in the regression models.

Appendix 1—figure 3.

The importance of each epidemiological metric in explaining variation in the inter-cattle-sequence genetic distance distribution.

Appendix 1—figure 4.

The importance of each epidemiological metric in explaining variation in the badger-cattle-sequence genetic distance distribution.

Metrics are coloured according to whether they used temporal (gold), spatial (red), or network (blue) information. The correlation (Pearson’s correlation) of the variable importance from the Random Forest and Boosted Regression models is reported in the legend. Two random metrics, a sample from a uniform distribution and a sample from a Boolean distribution, were included in the regression models.

Inter-species clades identified in the phylogeny

The relatedness of M. bovis genomes sampled from the cattle and badgers was evaluated by constructing a phylogenetic tree (Figure 1) using RAxML (v8.2.11; Stamatakis, 2014). Genetic diversity was observed between the cattle- and badger-derived M. bovis sequences, with the number of Single Nucleotide Variants (SNVs) between sequences ranging from 0 to 150 (median = 20). Five clades including cattle- and badger-derived sequences were identified (Figure 1 and Figure 1—figure supplement 1), using a 10 SNV threshold (informed by thresholds used for M. tuberculosis [Bryant et al., 2013; Jajou et al., 2018; Roetzer et al., 2013; Yang et al., 2017]).
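The SNV counts discussed above can be illustrated with a minimal sketch that counts differing sites between aligned sequences and lists badger–cattle pairs within a chosen threshold. This is not the authors' clade-definition procedure, which was applied to the phylogeny; the isolate names, the toy sequences, and the decision to ignore ambiguous 'N' calls are all assumptions made for illustration.

```python
from itertools import combinations


def snv_distance(seq_a, seq_b):
    """Count sites where two aligned sequences differ, skipping ambiguous 'N' calls."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y and "N" not in (x, y))


def close_interspecies_pairs(isolates, threshold):
    """Return (id1, id2, distance) for pairs from different hosts within the threshold."""
    pairs = []
    for (id1, (host1, seq1)), (id2, (host2, seq2)) in combinations(isolates.items(), 2):
        if host1 == host2:
            continue  # only between-species comparisons are of interest here
        distance = snv_distance(seq1, seq2)
        if distance <= threshold:
            pairs.append((id1, id2, distance))
    return pairs


# Toy alignment of variant sites only (hypothetical data).
isolates = {
    "badger_A": ("badger", "ACGTACGTAC"),
    "badger_B": ("badger", "ACGTACGTAA"),
    "cow_1": ("cattle", "ACGTACGTAC"),
    "cow_2": ("cattle", "TTTTACGGGG"),
}
print(close_interspecies_pairs(isolates, threshold=3))
# [('badger_A', 'cow_1', 0), ('badger_B', 'cow_1', 1)]
```

The threshold is a parameter, so the same function could be run with the 10 SNV value mentioned in the text.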

Figure 1.

A Maximum Likelihood phylogenetic tree constructed using RAxML (v8.2.11; Stamatakis, 2014) and rooted against the Mycobacterium bovis reference sequence, AF2122/97 (Malone et al., 2017).

Badger and cattle isolates are represented at the tips of the phylogeny by circles and triangles, respectively. Five clades, labelled 1–5, are highlighted with cyan, pink, green, purple, and brown branches, respectively. Cattle and badger isolates within the clades can be distinguished by their shape and colour. Each internal node in the phylogeny is shown as a grey to black shaded circle, with the intensity of the shading indicating the amount of support each node had across 100 bootstraps.

Figure 1—figure supplement 1.

Each of the clades from Figure 1 in the main manuscript is plotted separately.


These clades were extracted from the Maximum Likelihood phylogenetic tree constructed using RAxML (v8.2.11; Stamatakis, 2014) and rooted against the M. bovis reference sequence, AF2122/97 (Malone et al., 2017). Badger and cattle isolates are represented at the tips of the phylogeny by red circles and blue triangles, respectively.

Four of the five clades (1–4) identified contained highly similar (within three SNVs) badger- and cattle-derived M. bovis sequences. The badger-derived M. bovis sequence in clade 5 was six SNVs away from its closest cattle-derived sequence. The similarities between the cattle-derived and badger-derived M. bovis sequences in clades 1–4 indicate recent shared transmission histories (Meehan et al., 2018). Clade 4 (highlighted in purple in Figure 1) contained the majority (156/193) of the badger-derived M. bovis sequences and represents the main lineage circulating within the Woodchester Park badger population. In addition, the presence of 16 cattle-derived sequences in clade 4, 15 of which were distant (up to 12 SNVs) from the clade root, is consistent with multiple badger-to-cattle transmission events. In contrast, the presence of cattle-derived sequences close to the roots of clades 1–5 suggests that these lineages might have originated in cattle, although these patterns could also be explained by the cattle population being sampled up to 12 years prior to the badger population (cattle were sampled from 1988 to 2013 and badgers from 2000 to 2011). Although clades 1 and 5 contained highly similar sequences originating from cattle and badgers, each clade was associated with only eight animals, making meaningful inference of inter-species transmission patterns difficult. In addition to inter-species clades, several cattle-only clades were identified (Figure 1). Consistent with our hypothesis, the close proximity of M. bovis genomes sourced from cattle and badgers suggests that inter-species transmission occurred in the sampled system. 
In addition, the presence of clades dominated by a single species suggests that sustained within-species transmission has been occurring in both the cattle and badger populations. The life histories of the sampled cattle and badgers and in-contact animals associated with the inter-species clades (clades 1–5) identified in Figure 1 were interrogated. In this manuscript, a badger or cow is considered ‘sampled’, if one of the M. bovis genomes analysed here was sourced from it. In-contact animals were defined as those that lived in the same herd (for cattle) or social group (for badgers) at the same time as one or more of the sampled animals, according to the available data. From the interrogations of the life history data, further evidence indicative of inter-species transmission and disease maintenance in the Woodchester Park badger population was identified for the animals associated with clade 4 (Figure 2; equivalent figures for the remaining clades can be found in Figure 2—figure supplements 1, 2, 3, and 4). Infection was detected in the majority of the sampled badgers before it was detected in the majority of the sampled cattle. Sampled badgers were present in Woodchester Park at least from 1993 until 2011, based on the available capture and sampling data (Figure 2c). The sampled badgers were in contact with 575 captured badgers, 291 (51%) of which had tested positive for M. bovis infection at some point in their lives (Figure 2a). In contrast, the sampled cattle were in contact with 1760 cattle, of which only 312 (18%) tested positive for M. bovis (Figure 2b). In the animals associated with clade 4, infection was detected earlier in badgers, except in the case of one cow, despite the cattle population being sampled over a broader temporal and spatial window (see Materials and methods section: ‘Selecting the isolates’ for more details). In addition, the badgers were the most represented species in clade 4. 
These two observations suggest that the clade 4 lineage was being maintained in the badger population. The single cattle-derived sequence that was found closest to the root node of clade 4 (Figure 2c) was sourced from an animal sampled six years prior to any sequences derived from badgers being available. Across all inter-species clades investigated, the sampled cattle (n = 71) were in contact with approximately 11,732 animals, 1356 of which tested positive for M. bovis infection, whereas the sampled badgers (n = 97) were in contact with approximately 650 badgers, over half of which (329) tested positive.
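The definition of in-contact animals used above (same herd or social group at the same time as a sampled animal) amounts to an interval-overlap check on residency records. The sketch below shows one way to express it; the record layout, the identifiers, and the choice to exclude sampled animals from their own contact set are assumptions, not details taken from the study.

```python
from datetime import date


def periods_overlap(start_a, end_a, start_b, end_b):
    """True when two closed date intervals share at least one day."""
    return start_a <= end_b and start_b <= end_a


def in_contact_ids(sampled_ids, stays):
    """Find animals that shared a herd or social group with a sampled animal.

    stays: list of (animal_id, group_id, start, end) residency records.
    Assumption: sampled animals are not counted as their own contacts.
    """
    contacts = set()
    sampled_stays = [s for s in stays if s[0] in sampled_ids]
    for _, group, s_start, s_end in sampled_stays:
        for animal, other_group, o_start, o_end in stays:
            if animal in sampled_ids or other_group != group:
                continue
            if periods_overlap(s_start, s_end, o_start, o_end):
                contacts.add(animal)
    return contacts


# Hypothetical residency records for two herds.
stays = [
    ("cow_1", "herd_A", date(2005, 1, 1), date(2007, 6, 30)),  # sampled animal
    ("cow_2", "herd_A", date(2006, 3, 1), date(2009, 1, 15)),  # overlaps cow_1
    ("cow_3", "herd_A", date(2008, 1, 1), date(2010, 1, 1)),   # arrives after cow_1 left
    ("cow_4", "herd_B", date(2005, 1, 1), date(2010, 1, 1)),   # different herd
]
print(sorted(in_contact_ids({"cow_1"}, stays)))  # ['cow_2']

# Percentage of in-contact animals testing positive, from the clade 4 counts in the text.
print(round(100 * 291 / 575), round(100 * 312 / 1760))  # 51 18
```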

Figure 2.

Life history summaries of the sampled and in-contact cattle and badgers associated with clade 4 in Figure 1.

(a) The number of in-contact badgers associated with the sampled badgers (total in grey, number of animals that have tested positive in red). (b) The number of in-contact cattle associated with the sampled cattle (total in grey [right axis], number of animals that reacted inconclusively [red] or positively [blue] to routine skin test [left axis]). In-contact animals are those that lived in the same herd (cattle) or social group (badgers) at the same time as the sampled animals. (c) The recorded lifespans of the sampled cattle (black horizontal bars) and badgers (grey horizontal bars) associated with clade 4.

Figure 2—figure supplement 1.

Life history summaries of the sampled and in-contact cattle and badgers associated with clade 1 in Figure 1.

(a) The number of in-contact badgers associated with the sampled badgers (total in grey, number of animals that have tested positive in red). (b) The recorded lifespans of the sampled cattle (black horizontal bars) and badgers (grey horizontal bars) associated with clade 1.

Figure 2—figure supplement 2.

Life history summaries of the sampled and in-contact cattle and badgers associated with clade 2 in Figure 1.

(a) The number of in-contact badgers associated with the sampled badgers (total in grey, number of animals that have tested positive in red). (b) The number of in-contact cattle associated with the sampled cattle (total in grey [right axis], number of animals that reacted inconclusively [red] or positively [blue] to routine skin test [left axis]). In-contact animals are those that lived in the same herd (cattle) or social group (badgers) at the same time as the sampled animals. (c) The recorded lifespans of the sampled cattle (black horizontal bars) and badgers (grey horizontal bars) associated with clade 2.

Figure 2—figure supplement 3.

Life history summaries of the sampled and in-contact cattle and badgers associated with clade 3 in Figure 1.

(a) The number of in-contact badgers associated with the sampled badgers (total in grey, number of animals that have tested positive in red). (b) The number of in-contact cattle associated with the sampled cattle (total in grey [right axis], number of animals that reacted inconclusively [red] or positively [blue] to routine skin test [left axis]). In-contact animals are those that lived in the same herd (cattle) or social group (badgers) at the same time as the sampled animals. (c) The recorded lifespans of the sampled cattle (black horizontal bars) and badgers (grey horizontal bars) associated with clade 3.

Figure 2—figure supplement 4.

Life history summaries of the sampled and in-contact cattle and badgers associated with clade 5 in Figure 1.

(a) The number of in-contact badgers associated with the sampled badgers (total in grey, number of animals that have tested positive in red). (b) The number of in-contact cattle associated with the sampled cattle (total in grey [right axis], number of animals that reacted inconclusively [red] or positively [blue] to routine skin test [left axis]). In-contact animals are those that lived in the same herd (cattle) or social group (badgers) at the same time as the sampled animals. (c) The recorded lifespans of the sampled cattle (black horizontal bars) and badgers (grey horizontal bars) associated with clade 5.


Estimated inter-species transmission rates

Although the patterns observed in the phylogenetic and animal life history data were consistent with inter-species transmission in both directions, further analyses were required to quantify the inter-species transmission rates. These analyses needed to account for the temporal and spatial sampling biases resulting from the broader sampling window applied to the cattle population in time (1988 to 2013 versus 2000 to 2011) and space (cattle were sampled from up to 100 km away from the Woodchester Park area, whereas the badgers were only sampled from within Woodchester Park). A series of analyses were conducted using the Bayesian Structured coalescent Approximation, or BASTA, package (De Maio et al., 2018), available as part of the Bayesian evolutionary analysis platform BEAST2 (Bayesian Evolutionary Analysis by Sampling Trees; Bouckaert et al., 2014). These analyses aimed to estimate the M. bovis inter-species transmission rates between the sampled badger and cattle populations. BASTA is capable of estimating evolutionary dynamics in a structured population and accounting for sampling biases. Here, the sampled M. bovis population was structured because it was circulating largely separately in the sampled cattle and badger populations, as seen in Figure 1 and in the strong population-specific epidemiological signatures found by the Random Forest and Boosted Regression analyses. In addition, further structure exists within the cattle and badger populations as these were subdivided into herds and social groups, respectively. A series of increasingly spatially structured population models were defined to determine whether the inter-species transmission rates estimated using BASTA were affected by the spatial patterns evident from the Random Forest and Boosted Regression analyses. Structured population models were also chosen to address the spatial sampling biases, by introducing an increasingly structured unsampled badger population. 
Previous analyses have used BASTA in a similar fashion to estimate evolutionary dynamics in the presence of unsampled populations (De Maio et al., 2015). To further reduce the influence of the spatial and temporal biases and the computational load, the BASTA analyses used a subset of the cattle- (n = 83) and badger-derived (n = 97) M. bovis sequences obtained between 1999 and 2014 within 10 km of Woodchester Park. The AICM (Akaike’s Information Criterion Markov Chain Monte Carlo) score (Baele et al., 2013) was used to compare the BASTA analyses based on different structured populations (Figure 3a). The structured population with two demes (M. bovis populations in badgers and cattle) had the best (lowest) AICM score, although there was considerable overlap with the bootstrapped AICM score interval for one of the four-deme models (splitting the M. bovis populations in badgers and cattle into inner and outer populations based on being within or beyond 3.5 km from Woodchester Park [Figure 3a]). The estimated inter-species transition rates produced by each BASTA analysis demonstrated considerable variation, with some estimated cattle-to-badger transition rates bounding zero (Figure 3b). The estimated transition rates can be considered equivalent to the transmission rates, because the states (between which the transition rates were estimated) considered here represented different species. The estimates of the inter-species transition rates from the two-deme model with the best AICM score support the existence of both badger-to-cattle transmission (0.045 times per lineage per year, lower 2.5%: 0.028, upper 97.5%: 0.069) and cattle-to-badger transmission (0.0044 times per lineage per year, lower 2.5%: 0.00021, upper 97.5%: 0.017). 
Figure 3b shows the order of magnitude differences between the estimated inter-species transmission rates, with the highest supported two-deme model estimating that badger-to-cattle transmission events occurred on average 10.4 times more frequently than cattle-to-badger transmission events in the sample population. Figure 3c represents the lower bound on the number of times (according to the analyses based on the favoured two-deme model) that the sampled M. bovis population was transmitted from one animal to another (regardless of sub-population and, where possible, assuming the ancestral node and one of its daughter nodes represent infection in the same animal [Figure 3—figure supplement 1]). The estimated counts of these transmission events are consistent with the estimated inter-species transition rates and demonstrate that within-species transmission occurs at a higher rate. Specifically, badger-to-badger transmission was estimated to occur at least 2.7 times more frequently than badger-to-cattle transmission (lower 2.5%: 2.2, upper 97.5%: 3.8). In cattle, analyses estimated that at least 46 cattle-to-cattle transmission events occurred (lower 2.5%: 40, upper 97.5%: 56), whereas the estimated number of cattle-to-badger events bounded zero (lower 2.5%: 0, upper 97.5%: 4, with a median value of zero). The counts of events between individual animals outputted by BASTA represent the lower bound of the number of transmission events that occurred over the evolutionary history of the sampled M. bovis population because they are estimated on the transmission chains between the sampled and ancestral host animals and do not account for missing individuals in these chains.
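To make the model comparison above more concrete, the following sketch computes an AICM score from a posterior sample of log-likelihoods and a bootstrapped 2.5% to 97.5% interval. It assumes the commonly used form of AICM (twice the sample variance of the log-likelihoods minus twice their mean); the resampling scheme, the function names, and the simulated log-likelihood values are assumptions and do not reproduce the study's analysis.

```python
import random
from statistics import mean, quantiles, variance


def aicm(log_likelihoods):
    """AICM = 2 * var(logL) - 2 * mean(logL); lower scores indicate a better fit."""
    return 2.0 * variance(log_likelihoods) - 2.0 * mean(log_likelihoods)


def bootstrap_aicm_interval(log_likelihoods, n_boot=100, seed=1):
    """Resample the log-likelihoods with replacement and return the
    2.5% and 97.5% quantiles of the resulting AICM scores."""
    rng = random.Random(seed)
    n = len(log_likelihoods)
    scores = [aicm(rng.choices(log_likelihoods, k=n)) for _ in range(n_boot)]
    cuts = quantiles(scores, n=40)  # 39 cut points; first = 2.5%, last = 97.5%
    return cuts[0], cuts[-1]


# Simulated (hypothetical) post-burn-in log-likelihoods for two competing models.
rng = random.Random(42)
samples = {
    "2Deme": [rng.gauss(-1000.0, 3.0) for _ in range(500)],
    "4Deme": [rng.gauss(-1001.0, 3.5) for _ in range(500)],
}
for name, sample in samples.items():
    low, high = bootstrap_aicm_interval(sample)
    print(name, round(aicm(sample), 1), (round(low, 1), round(high, 1)))
```

Overlapping bootstrap intervals, as reported for the two-deme and one of the four-deme models, indicate that the scores alone do not clearly separate the models.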

Figure 3.

Comparison of likelihood scores and inter-species transition rate estimates from the BASTA analyses.

Model structure is described in Figure 6, and for each model the sizes of defined demes were held equal or allowed to vary. (a) The Akaike Information Criterion Markov Chain Monte Carlo (AICM; Baele et al., 2013) scores (lower is better) calculated for each of the representations of a structured population analysed in BASTA (Figure 6). The vertical lines show the lower and upper (2.5% and 97.5%, respectively) bounds of the AICM scores computed on 100 bootstrapped posterior likelihoods. (b) Estimated inter-species transition rates for each model. Where multiple badgers-to-cattle and cattle-to-badgers transition rates were estimated (see Figure 6), the values were summed. The values above each vertical line represent the posterior probability of each rate, either as a mean of probabilities associated with multiple estimated rates (for the 3Deme_outerIsBadgers, 4Deme, 6Deme, and 8Deme models) or a single probability (for the 2Deme, 3Deme_outerIsBoth, and 3Deme_outerIsCattle models). (c) The number of transitions between the known and estimated states counted on each phylogenetic tree in the posterior distribution produced by the ‘2Deme_equal’ structured population model analysed in BASTA (counting is illustrated in Figure 3—figure supplement 1). The vertical lines show the lower and upper (2.5% and 97.5%, respectively) bounds of the distributions.

Figure 3—figure supplement 1.

Diagrams illustrating how the transmission events were counted on each of the phylogenies in the posterior distributions produced by BASTA.

Figure 6.

Deme assignment diagrams illustrating the different demes (sub-populations) defined in a range of structured population analyses conducted using BASTA.

In each analysis, the Mycobacterium bovis sequences available were assigned to each deme based upon the sampled species and their sampling location. The grey doughnut in the badger demes represents an un-sampled population. These diagrams are based on the spatial associations of the badger and cattle-derived M. bovis sequences shown in Figure 5.

Diagrams illustrating how the transmission events were counted on each of the phylogenies in the posterior distributions produced by BASTA.

These counts are shown in panel c of Figure 3. Each diagram has a simple phylogeny with the estimated states (blue or red) of a parent and its two daughter nodes. The count of the number of transition events on each phylogeny is recorded in a matrix. Transitions are counted in the direction from parent to daughter. Each node has an ID to illustrate the situations when the parent node is assumed to represent one of its daughter nodes earlier in evolutionary time.

Taken together, the results from the BASTA analyses are consistent with the hypothesis that circulation of M. bovis in our study populations involved transmission within and between the badgers and cattle. In addition, the directional inter-species transmission rates indicate that transmission from badgers to cattle occurred more frequently than transmission from cattle to badgers and inter-species transmission rates were estimated to be considerably lower than intra-species transmission rates.
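The counting rule described for Figure 3—figure supplement 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the dictionary-based tree representation, and the tie-breaking choice (when both daughters share the parent's state, the first is treated as the same animal) are all assumptions made for the example.

```python
from collections import Counter


def count_transitions(children, state, root):
    """Count parent-to-daughter state transitions on a rooted binary tree.

    children: dict mapping each internal node ID to a (left, right) tuple
    state:    dict mapping every node ID to "badger" or "cattle"
    root:     ID of the root node
    """
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in children:
            continue  # tips have no daughters
        daughters = list(children[node])
        stack.extend(daughters)
        # Assumption: where a daughter shares the parent's state, one such
        # daughter is taken to be the same animal later in time, so that
        # branch is not counted as a transmission event.
        same_state = [d for d in daughters if state[d] == state[node]]
        if same_state:
            daughters.remove(same_state[0])
        for daughter in daughters:
            counts[(state[node], state[daughter])] += 1
    return counts


# Hypothetical four-tip example: n0 and n1 are internal nodes.
example_children = {"n0": ("n1", "t3"), "n1": ("t1", "t2")}
example_state = {"n0": "badger", "n1": "badger",
                 "t1": "badger", "t2": "badger", "t3": "cattle"}
example_counts = count_transitions(example_children, example_state, "n0")
```

Because unsampled animals along each branch are invisible to this count, the totals are a lower bound, as noted in the text.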

Discussion

We hypothesised that the sampled M. bovis population was circulating within and between the sampled cattle and badger populations. Testing our hypothesis across multiple analyses, we found that, while none of these analyses are definitive in their own right, our results are consistent with our hypothesis and suggest that there has been a long history of within- and between-species transmission in the Woodchester Park area, and an important role for badgers in disease persistence. Our choice of analytical methods was based in part on our awareness of underlying data biases. Ideally, sampling should be proportionate to prevalence in the host populations and matched over the same spatial and temporal ranges. Here, the combination of poor sensitivities of the standard tests for cattle (~50–80%; de la Rua-Domenech et al., 2006) and badgers (~50–70%; Chambers et al., 2009) and a reliance on historical archived isolates, meant data biases were unavoidable. Counterbalancing this weakness are the dense sampling of both host populations and the exceptionally detailed metadata. Random Forest and Boosted Regression models identified strong epidemiological signatures of M. bovis transmission within and between host populations. Within species, metrics capturing the spatial, temporal, and network dynamics were all highly informative, indicative of M. bovis circulation being dependent on these factors. Between species, the variation observed between M. bovis sourced from cattle and badgers was found to be well explained by where the animals resided and when they were infected. Changes in these relationships could be exploited to rapidly identify changes in the epidemiology, as might be caused by badger social perturbation under culling operations (Tuyttens et al., 2000; Woodroffe et al., 2006). The present study identified further evidence of within- and between-species transmission in the phylogenetic relationships between the M. bovis genomes (Figure 1). 
Five clades containing highly similar M. bovis genomes derived from infected cattle and badgers were identified, suggesting that substantial inter-species transmission had occurred. The presence of clades dominated by a single host species was also consistent with sustained within-species transmission. However, these phylogenetic relationships are particularly sensitive to sampling biases and should be interpreted with caution. For example, one interpretation of the basal location of the cattle-derived M. bovis genomes in the clades shown in Figure 1 is that they originated in cattle. Alternatively, this pattern could be the result of sampling the cattle population over a broader temporal range (from 1988 to 2013) than the badgers (2000 to 2011). Further interrogation of the cattle and badger life histories associated with clade 4 (Figure 1) revealed evidence of prolonged persistence of this lineage in the badger population (Figure 2). Despite the cattle population being sampled over a longer time period, the badgers associated with clade 4 were predominantly infected earlier than the cattle and that strain persisted in the badgers for over 10 years. The remaining clades examined suggested that cattle could have been infected before badgers; however, it was not possible to determine whether badgers outside of Woodchester Park could be driving these interactions. Our results do suggest that inter-badger transmission is likely to be dominated by short-range interactions, given that short spatial distances (all less than 3.7 km) were highly informative in describing the genetic relationships examined in the machine learning analyses. Therefore, badgers further away from Woodchester Park are unlikely to be directly driving the patterns observed in our sampled badger population, and the ‘invading’ clades observed here are more parsimoniously explained by introductions of M. bovis from cattle. 
An additional limitation of these analyses is that no other wildlife species were sampled. Previous research by Delahay et al. (2007) found other mammal species infected with M. bovis in the area, albeit at lower prevalence (7.2% in Fallow deer and 6.8% in Muntjac deer) than the sampled badger population (~30%; Delahay et al., 2013). Given considerable evidence in the present study for inter-species transmission of M. bovis, we next used BASTA, an analysis platform that can account for sampling biases (De Maio et al., 2018), to quantify these processes (Figure 3b). The BASTA analyses estimated transition rates between demes within a structured population. As the demes within the structured model were species-specific, the estimated between-species transition rates can be considered equivalent to transmission rates between populations of badgers and cattle. The most favoured two-deme model estimated badgers-to-cattle transmission rates were, on average, 10.4 times higher than cattle-to-badgers transmission rates (Figure 3a and b). However, the second most favoured four-deme model (which included a more complex population structure) estimated that inter-species transmission rates were close to equal. Although even structured coalescent models do not accurately reflect spatial contact patterns, that the simplest ‘two-deme’ model is favoured is encouraging (i.e. more spatially structured models do not perform better). However, the two-deme model may also have been favoured because of the limited genetic diversity available to estimate the evolutionary parameters and therefore further exploration with explicitly spatial approaches is an important next step. In the process of quantifying inter-species transmission rates, the BASTA analyses also provide counts of the number of transmission events within and between the sampled badgers and cattle (Figure 3c). 
These counts provide a conservative estimate of the minimum number of transitions between the sampled animals and their ancestors. Although it is not appropriate to directly compare the counts within- and between-species, they do demonstrate that, at a minimum, within-species transmission occurs at least twice as frequently as between-species transmission. The high degree of within-species transmission estimated here is consistent both with the results of other studies that highlight the importance of cattle-to-cattle transmission (Costello et al., 1998; Gilbert et al., 2005; Goodchild and Clifton-Hadley, 2001; Green et al., 2008; Menzies and Neill, 2000), and the persistent long-term infection observed in the Woodchester Park badger population (Delahay et al., 2013). The high-density badger population in Woodchester Park is likely to be similar to populations found in other parts of southwest England (Judge et al., 2017). However, broader representativeness should be confirmed by comparison to sympatric cattle and badger populations elsewhere in Britain and Ireland, particularly in areas with high bTB incidence. In addition, we selected only isolates of spoligotype SB0263, as this was the dominant type in the badger population. The selection of SB0263 could artificially inflate the badgers-to-cattle transition rates estimated here, as the high prevalence of this spoligotype in the badgers could be a reflection of host preference. However, though there are known phenotypic differences between spoligotypes, there is no evidence that these fundamentally change the epidemiology (Garbaccio et al., 2014; Wright et al., 2013). In addition, many different M. bovis spoligotypes have been observed in sympatric badger and cattle populations (Smith et al., 2003) and SB0263 is not only one of the most common spoligotypes in the UK (Smith et al., 2003), it is also highly prevalent in the cattle around Woodchester Park. 
If the transmission interactions estimated in our research are replicated elsewhere, this could help to explain the failure of efforts to address recurrent and persistent infection in cattle herds that co-exist with badger populations (Gallagher et al., 2013; Karolemeas et al., 2011). In addition, the bi-directional transmission of M. bovis between species has the potential to combine local persistence in badgers with the long-distance mobility of the cattle. In line with a recent evidence review (Godfray et al., 2018), our research also suggests that coordinated bTB control in both cattle and badgers may be necessary to control infection in cattle. More generally, our analyses illustrate the complex interplay that underpins multi-host pathogen problems and demonstrate that, despite this complexity, appropriately defined suites of methods can be used to overcome issues of data biases and identify important epidemiological properties of these systems.

Materials and methods

Analyses layout

Figure 4 describes the complete set of analyses conducted on the M. bovis whole genome sequences sourced from infected cattle and badgers living in and around Woodchester Park. These analyses are described in the sections that follow.

Figure 4.

Steps involved in the analysis of M. bovis whole genome sequences and epidemiological data.

Analyses are shown in blue and outputs and inputs in black. Red arrows represent the removal of data. The three main outputs are highlighted with grey boxes. SNV: Single Nucleotide Variant. BASTA: Bayesian Structured coalescent Approximation.

Selecting the isolates

Since 1976, the Woodchester Park badger population has been the subject of a capture-mark-recapture study whereby each badger social group is trapped four times a year (Delahay et al., 2013). Social group territories are delineated annually using bait-marking (Delahay et al., 2000). During trapping operations, each captured badger is given a unique tattoo and at each capture event a number of samples are obtained to determine M. bovis infection status (full details described in Delahay et al., 2013). From 1990 onwards, any M. bovis isolated from samples taken during trapping were spoligotyped (spacer-oligo typing) using conventional methods (Aranaz et al., 1996) and archived. Spoligotyping reports the presence or absence of 43 known spacer sequences within a single direct repeat region of the M. bovis genome. In total, 230 isolates were available from the archive, which originated from samples taken from 116 different badgers from 2000 to 2011. The cattle herds surrounding Woodchester Park undergo statutory annual testing for M. bovis infection as a part of routine surveillance, and results are stored in APHA’s cattle testing (SAM) database (Lawes et al., 2016). Test-positive cattle are slaughtered, selected tissues taken for culture and any M. bovis isolates are spoligotyped and archived. In addition, the movements of every cow in the UK are recorded in the Cattle Tracing System (CTS). For the present study 124 cattle-derived M. bovis isolates, each collected from an individual cow between 1988 and 2013, were selected from the archives. Cattle isolates were selected if they were of the same spoligotype as the badger isolates and were from herds within 10 km of Woodchester Park. More than 90% of the badger-derived isolates were spoligotype SB0263. 
More than 75% (1096/1442) of the isolates available from cattle within 10 km of Woodchester Park shared the same spoligotype and it is the second most common type found across England (Smith et al., 2003; Smith et al., 2006). To increase the chances of sequencing strains that were shared with the badgers in Woodchester Park, rather than circulating in the cattle population independently, only cattle-derived isolates of spoligotype SB0263 were selected. Additional spoligotype SB0263 isolates from cattle that lived in herds within 100 km of Woodchester Park (n = 65) were included to provide a broader spatio-temporal context, resulting in a total of 189 isolates.

Generating and processing the sequencing data

Badger-derived M. bovis isolates were prepared for sequencing by the Agri-Food and Biosciences Institute in Northern Ireland (AFBI-NI), and cattle-derived isolates by APHA. M. bovis isolates were selected from the frozen archives and re-cultured on Löwenstein-Jensen medium. Prior to DNA extraction the isolates were heat killed in a water bath at 80°C for a minimum of 30 min. DNA was extracted from these cultures using standard high salt and cationic detergent cetyl hexadecyl trimethyl ammonium bromide (CTAB) and solvent extraction protocols (Parish and Stoker, 2001; van Soolingen et al., 2001). Extracted DNA was sequenced at the Glasgow Polyomics facility using an Illumina MiSeq producing 2 × 300 bp paired end reads (badger-derived isolates) and at the APHA central sequencing unit in Weybridge using an Illumina MiSeq producing 2 × 150 bp paired end reads (cattle-derived isolates). The 65 additional cattle-derived isolates were sequenced at the APHA central sequencing unit in Weybridge using an Illumina NextSeq producing 2 × 150 bp paired end reads. Following quality assessments in FASTQC (v0.11.2; Andrews, 2010; RRID:SCR_014583), the raw WGS data were trimmed using PRINSEQ (v0.20.4; Schmieder and Edwards, 2011; RRID:SCR_005454) and adapters were removed using TRIMGALORE (v0.4.1; Krueger, 2015; RRID:SCR_016946). The trimmed data were aligned to the M. bovis reference genome (AF2122/97; Malone et al., 2017) using the Burrows-Wheeler aligner (BWA, v0.7.17; Li and Durbin, 2009; RRID:SCR_010910). Regions encoding proline-glutamate and proline-proline-glutamate surface proteins, or annotated repeat regions were excluded (Sampson, 2011). Mapping quality information on all the SNVs identified was retained for each isolate. The allele frequencies at each position in the aligned (against reference) sequence from each isolate were examined. 
For a haploid organism these frequencies are expected to be either 0 or 1, with some random variation expected from sequencing errors (Sobkowiak et al., 2018). A heterozygous site was defined as one where the allele frequencies were >0.05 and <0.95. Four cattle-derived sequences that had more than 150 heterozygous sites, and allele frequencies that were clustered and non-random (data not shown), were removed. In addition, 26 badger-derived and 16 cattle-derived M. bovis sequences were removed because of suspected errors in the metadata (Appendix 1: Investigating isolate metadata discrepancies). For the sequences from the remaining isolates (204 badger- and 169 cattle-derived isolates), alleles were called at each variant position if they had mapping quality ≥30, high-quality base depth ≥4 (applied to reverse and forward reads separately), read depth ≥30, and allele support ≥0.95. For any site that failed these criteria, if the allele called had been observed in a different isolate that had passed, a second round of filtering was conducted using a high-quality base depth of 5 (total across forward and reverse reads) and the same allele support. As recombination is thought to be extremely rare for mycobacteria (Namouchi et al., 2012), variants in close proximity could indicate a region that is difficult to sequence or under high selection. To avoid calling variants in these regions, variant positions within 10 bp of one another were removed. Following filtering, sequences from 11 badger and 10 cattle isolates that had insufficient coverage (<95%) of the variant positions were removed. Once the alignment was generated, sites with a consistency index less than 1, generally considered homoplasies (Farris, 1989), were removed (n = 4, of 14,991 sites) using HomoplasyFinder (v0.0.0.9; Crispell et al., 2019; RRID: SCR_017300). All the scripts necessary for the processing of the WGS data are freely available online.
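The filtering thresholds above can be expressed as a short sketch. The function names and the dictionary keys are hypothetical, and two points are assumptions rather than statements from the text: the second filtering round is taken to apply only the relaxed base depth and the allele support criteria, and "within 10 bp" is taken to mean a gap of 10 bp or less. This is not the authors' pipeline, which is available in their repositories.

```python
def is_heterozygous(allele_frequency):
    """A site is heterozygous if its allele frequency is >0.05 and <0.95."""
    return 0.05 < allele_frequency < 0.95


def call_allele(site, seen_in_passing_isolate=False):
    """Return True if the allele at a variant position can be called.

    site is a dict with hypothetical keys: "mq" (mapping quality),
    "hq_fwd" and "hq_rev" (high-quality base depth per strand),
    "depth" (read depth) and "support" (proportion of supporting reads).
    """
    first_round = (
        site["mq"] >= 30
        and site["hq_fwd"] >= 4
        and site["hq_rev"] >= 4
        and site["depth"] >= 30
        and site["support"] >= 0.95
    )
    if first_round:
        return True
    # Second round: only for alleles already observed in an isolate that
    # passed. Assumption: only these two criteria are re-applied.
    if seen_in_passing_isolate:
        total_hq = site["hq_fwd"] + site["hq_rev"]
        return total_hq >= 5 and site["support"] >= 0.95
    return False


def drop_close_variants(positions, min_gap=10):
    """Remove variant positions lying within min_gap bp of a neighbour."""
    ordered = sorted(positions)
    kept = []
    for i, pos in enumerate(ordered):
        near_prev = i > 0 and pos - ordered[i - 1] <= min_gap
        near_next = i < len(ordered) - 1 and ordered[i + 1] - pos <= min_gap
        if not (near_prev or near_next):
            kept.append(pos)
    return kept
```

Keeping the two rounds in one function makes it explicit that the relaxed criteria are a fallback and never override a first-round pass.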

Comparing genetic and epidemiological distances

Our research hypothesized that within- and between-species transmission was occurring in the study system. If bi-directional transmission was occurring, then there should be epidemiological signatures in the genomic data linked to these events. These signatures are likely to relate to the spatial, temporal, and network dynamics of the sampled badger and cattle populations, as these will determine their contact patterns. To investigate whether there were any epidemiological signatures of within- and between-species transmission of the sampled M. bovis isolates, the genetic distances between sequences were compared to epidemiological metrics describing the spatial, temporal, and network relationships between the animals associated with each sequence. Inter-sequence genetic distances were calculated, for every pair of sequences, by dividing the number of differences present between the pair of sequences by the total number of sites considered (n = 14,987). In addition, epidemiological metrics were calculated to identify any similarities among animals associated with a particular pair of isolates. Epidemiological metrics were calculated using the data, where available, on each animal obtained from its capture or movement and testing history (further details in Appendix 1: Defining the epidemiological metrics). Two additional dummy metrics, samples from a uniform distribution and a Boolean distribution, were included to determine a threshold of importance that distinguishes noise from signal. Inter-isolate genetic distances and associated epidemiological metrics were compared using Random Forest (RRID:SCR_015718; Liaw and Wiener, 2002) regression and Boosted Regression (RRID:SCR_017301; Elith et al., 2008) models in R (v3.4.3; R Development Core Team, 2016). These machine learning approaches were used to separately analyse badger–badger, badger–cattle, and cattle–cattle comparisons. 
For each set of comparisons, a training dataset was constructed using 50% of the data available and, following training using these data, the model was tested on the remaining 50% of the data. Genetic distances ≤ 15 SNVs were used for these analyses to avoid larger inter-sequence distances that were not likely to relate to the fine resolution epidemiological relationships of interest. Random Forest and Boosted Regression approaches were selected as these methods can deal with large datasets with many highly correlated variables whose relationship to the response variable (genetic distances) cannot readily be defined (Auret and Aldrich, 2012). A broad range of epidemiological metrics were defined as the Random Forest and Boosted Regression models are robust to non-informative and/or highly correlated variables (Auret and Aldrich, 2012; Elith et al., 2008; Liaw and Wiener, 2002). The two independent approaches were used to ensure that any patterns observed were robust. The influence of including highly correlated and non-informative predictor variables and variables with a large amount of missing data in the machine learning approaches was investigated using the Random Forest models. For highly correlated variables, clusters of correlated variables were defined and the least informative variable from each cluster was incrementally removed and the impact on the fitted Random Forest regression models was examined. A similar approach was used twice more to evaluate the influence of retaining non-informative predictor variables and of including predictor variables with large amounts of missing data in the models.
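As a minimal sketch of the data preparation described above, the following computes pairwise genetic distances as the proportion of differing sites, keeps pairs at most 15 SNVs apart, and splits the comparisons 50/50 into training and test sets. The function names, the fixed random seed, and the assumption that sequences are equal-length strings without ambiguous bases are illustrative only; the model fitting itself (Random Forest and Boosted Regression in R) is not reproduced here.

```python
import random
from itertools import combinations


def snv_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def pairwise_distances(sequences, max_snvs=15):
    """Return (id_a, id_b, distance) for every pair within max_snvs.

    sequences: dict mapping isolate ID to an aligned sequence of variant
    sites; distance is differences divided by the number of sites.
    """
    n_sites = len(next(iter(sequences.values())))
    rows = []
    for (id_a, seq_a), (id_b, seq_b) in combinations(sorted(sequences.items()), 2):
        differences = snv_differences(seq_a, seq_b)
        if differences <= max_snvs:
            rows.append((id_a, id_b, differences / n_sites))
    return rows


def split_half(rows, seed=1):
    """Shuffle the comparisons and split them into two halves."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

In the study the distances are divided by the 14,987 sites retained after filtering; here the divisor is simply the length of the toy sequences passed in.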

Building phylogeny and interrogating clusters

Following investigation of population level epidemiological signatures in the sequence data, a phylogenetic tree was constructed to describe the evolutionary relationships among our set of M. bovis genome sequences. If inter- and intra-species transmission events were occurring in the sampled system, there should be evolutionary signatures in the phylogenetic tree. For example, if M. bovis sequences sourced from cattle and badgers have a very close phylogenetic relationship, this suggests that inter-species transmission has occurred. The phylogeny was constructed with the maximum likelihood algorithm in RAxML (v8.2.11; Stamatakis, 2014; RRID:SCR_006086) using a GTR (generalized time reversible) substitution model with 100 bootstraps. The maximum likelihood algorithm was selected as a fast alternative to Bayesian approaches. Although Bayesian approaches will better explore the phylogenetic tree space, this space is expected to be small for phylogenies based on M. bovis data given its highly conserved genome. The GTR model was the most appropriate based on analyses using the modelTest() function in the R package PHANGORN (v2.3.1; Schliep, 2011; RRID:SCR_017302). Based on the range of SNV thresholds (3–12) used to define recent M. tuberculosis transmission (Bryant et al., 2013; Jajou et al., 2018; Roetzer et al., 2013; Yang et al., 2017), clades containing highly related (<10 SNVs apart) cattle-derived and badger-derived sequences (inter-species clades) were identified (Figure 1). The testing histories and recorded movements (for cattle), and capture information (for badgers) of the sampled and in-contact animals associated with each cluster were available. These data were investigated to determine whether they provided any additional evidence to support the phylogenetic relationships indicative of inter-species transmission. 
‘In-contact’ animals were defined as those badgers that resided in the same badger social group, or those cattle that lived in the same herd, at the same time as one or more of the sampled badgers or cattle (respectively) associated with a particular inter-species clade.
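The ‘in-contact’ definition amounts to an interval-overlap check within a shared group. The sketch below assumes a hypothetical record layout of (animal ID, group ID, start date, end date), where the group is a badger social group or a cattle herd, and treats residency periods that touch on a single day as overlapping; neither detail is specified in the text.

```python
from datetime import date


def in_contact(residencies, sampled_ids):
    """Return IDs of unsampled animals that shared a group with a sampled one.

    residencies: list of (animal_id, group_id, start, end) tuples
    sampled_ids: set of animal IDs associated with an inter-species clade
    """
    sampled_periods = [r for r in residencies if r[0] in sampled_ids]
    contacts = set()
    for animal, group, start, end in residencies:
        if animal in sampled_ids:
            continue
        for _, s_group, s_start, s_end in sampled_periods:
            # Same group and overlapping residency periods.
            if group == s_group and start <= s_end and s_start <= end:
                contacts.add(animal)
                break
    return contacts


# Hypothetical records: B1 is the sampled badger.
example_residencies = [
    ("B1", "G1", date(2005, 1, 1), date(2006, 1, 1)),
    ("B2", "G1", date(2005, 6, 1), date(2007, 1, 1)),  # overlaps B1 in G1
    ("B3", "G1", date(2008, 1, 1), date(2009, 1, 1)),  # same group, later
    ("B4", "G2", date(2005, 1, 1), date(2006, 1, 1)),  # same time, other group
]
example_contacts = in_contact(example_residencies, {"B1"})
```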

Estimating inter-species transmission rates

To further investigate patterns of inter- and intra-species transmission, additional evolutionary analyses were completed to estimate directional inter-species transmission rates and quantify their frequency relative to intra-species transmission events. A subset of the sequences available (from 97 badger- and 83 cattle-derived isolates) was selected to estimate the transmission rate of M. bovis between the sampled cattle and badger populations. The selected sequences were within the parent clade containing all the inter-species clades (shown in Figure 1) and were sampled from within 10 km of Woodchester Park between 1999 and 2014. The subset of sequences was split into ‘inner’ and ‘outer’ groups, based on a 3.5 km radius from Woodchester Park (Figure 5). The 3.5 km radius was selected to contain the sampling locations associated with all the badger-derived sequences and the closest cattle-derived sequences, based on the reported home-ranges of badgers in southern England being <1 km² (Garnett et al., 2005; Macdonald et al., 2008; Roper et al., 2003).
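The inner/outer split can be illustrated with a simple distance check. This sketch assumes sampling locations are given as projected coordinates in metres (for example eastings and northings) and uses a hypothetical centre point; the function names and the deme label format are invented for the example.

```python
import math


def assign_region(x, y, centre, inner_radius_m=3500.0, outer_radius_m=10000.0):
    """Classify a location as 'inner', 'outer', or None (beyond 10 km).

    x, y and centre are in metres on the same projected grid (assumption).
    """
    distance = math.hypot(x - centre[0], y - centre[1])
    if distance <= inner_radius_m:
        return "inner"
    if distance <= outer_radius_m:
        return "outer"
    return None


def deme_label(species, x, y, centre):
    """Combine host species and region into a label such as 'cattle_outer'."""
    region = assign_region(x, y, centre)
    return None if region is None else f"{species}_{region}"
```

Labels of this kind could then be merged or kept separate to define the alternative deme structures compared in the structured coalescent analyses.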

Figure 5.

Sampling locations of the 97 badgers and 83 cattle associated with the Mycobacterium bovis sequences selected for analysis in BEAST2.

Location represents the registered address of each sampled farm or the centroid of the estimated sampled badger social group’s territory boundary (indicated by the red polygons). The overlaid circles were used to split the cattle- and badger-derived M. bovis sequences into ‘inner’ and ‘outer’ populations; the distances refer to the radius of each circle. The ‘inner’ circle was defined such that it contained all the locations associated with the available badger-derived and closest (within the badger’s recorded home range of <1 km² [Gittleman and Harvey, 1982; Garnett et al., 2005; Macdonald et al., 2008; Roper et al., 2003]) surrounding cattle-derived M. bovis sequences.

The presence of a temporal signal among the selected M. bovis sequences was examined (Appendix 2: Testing the presence of a temporal signal). A temporal signal was supported by a positive trend, calculated within TEMPEST (v1.5; Rambaut et al., 2016; RRID:SCR_017304), between each sequence’s root-to-tip distance and its sampling time and the results of a tip-date randomisation procedure (Firth et al., 2010). The Bayesian Structured coalescent Approximation (BASTA v2.3.1; De Maio et al., 2015; RRID:SCR_017303) tool, available in BEAST2 (Bayesian Evolutionary Analysis by Sampling Trees – v2.4.4 (Bouckaert et al., 2014), RRID:SCR_017307), uses an approximation of the structured coalescent approach (Vaughan et al., 2014) to estimate migration rates within a structured population. The structured population in the current context is the M. bovis population, whose structure was likely to relate to host species and their spatial relationships. BASTA, in contrast to previously popular methods such as discrete trait analyses (Lemey et al., 2009; Pagel et al., 2004), can estimate the ancestral structure of the population in the presence of biased sampling (De Maio et al., 2015). There were two biases associated with the set of sequences available. First, the prevalence of M. 
bovis in the sampled cattle and badger populations was likely to be different as a result of the on-going control operations in the cattle, therefore the sampling proportions of these different populations relative to the prevalence of M. bovis were likely to be unequal. Second, although the badger population within Woodchester Park has been intensively monitored and sampled, the surrounding badger population is less well understood and unsampled, whereas cattle both within and outside the Woodchester Park area have been sampled. Based on the ‘inner’ and ‘outer’ populations of the sampled cattle and badgers (shown in Figure 5), a series of BASTA analyses, splitting the sampled M. bovis population into different demes, were designed to estimate the inter-species transition rates while accounting for the two sampling biases discussed (Figure 6). For each of the nine separate population structures, two separate analyses were conducted, one where the deme sizes were constrained to be equal and another where they were allowed to vary. Each of these 18 analyses was repeated three times and estimates were combined across replicates. The inter-species transition rates from each model were compared using the Akaike’s Information Criterion through Markov Chain Monte Carlo (AICM; Baele et al., 2013), for further details see Appendix 2: Structured coalescent analyses using BASTA.
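For readers unfamiliar with AICM, the sketch below shows one way such a score and its bootstrap bounds could be computed from a vector of posterior log-likelihoods. It uses the posterior-simulation estimator commonly associated with AICM (twice the sample variance minus twice the sample mean of the log-likelihoods); whether the study's scripts use exactly this form, and how their percentile bounds are interpolated, is an assumption. The function names and the seed are illustrative.

```python
import random
import statistics


def aicm(log_likelihoods):
    """AICM estimate from posterior log-likelihood samples (lower is better)."""
    mean = statistics.fmean(log_likelihoods)
    variance = statistics.variance(log_likelihoods)  # sample variance
    return 2.0 * variance - 2.0 * mean


def bootstrap_aicm(log_likelihoods, n_boot=100, seed=1):
    """Return (lower, upper) approximate 2.5% and 97.5% bootstrap bounds."""
    rng = random.Random(seed)
    n = len(log_likelihoods)
    scores = sorted(
        aicm(rng.choices(log_likelihoods, k=n)) for _ in range(n_boot)
    )
    # Nearest-rank style percentiles on the sorted bootstrap scores.
    lower = scores[int(0.025 * (n_boot - 1))]
    upper = scores[int(0.975 * (n_boot - 1))]
    return lower, upper
```

Resampling the log-likelihoods with replacement gives a rough sense of how stable the model ranking is, which is what the vertical lines in Figure 3a convey.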

Code availability

All the code generated for this manuscript is freely available on GitHub. General scripts are available within the ‘WoodchesterPark’ directory of the GeneralTools repository (https://github.com/JosephCrispell/GeneralTools; Crispell, 2019a; copy archived at https://github.com/elifesciences-publications/GeneralTools). The Java source code files can be found in a separate repository (https://github.com/JosephCrispell/Java; Crispell, 2019b; copy archived at https://github.com/elifesciences-publications/Java). These scripts are licenced under the General Public Licence v3.0.

Data availability

All WGS data used for these analyses have been uploaded to the National Centre for Biotechnology Information Short Read Archive (NCBI-SRA: PRJNA523164). Because of the sensitivity of the associated metadata, only the sampling date and species will be provided with these sequences.

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Thank you for submitting your article "Combining genomics and epidemiology to analyse bi-directional transmission of Mycobacterium bovis in a multi-host system" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Neil Ferguson as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Christian Gortazar (Reviewer #3). The reviewers have discussed the reviews with one another, and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

This is a population genomics study of the transmission of M. bovis between two nonhuman animal hosts, cattle and badgers, a matter of considerable ecological and practical interest.

Summary: This is an analysis of the multispecies badger-cattle transmission system using genomic and epidemiological data to characterize the transmission of M. bovis between these species, a matter of considerable interest. Strengths include the sizable and detailed data set, and the relatively clear exposition of what analyses were done.

Essential revisions: The major required revision is to place the work in the context of clear hypotheses, predictions of these hypotheses for the results of analyses, and interpretation of the analyses as they inform our judgment on those hypotheses. 
As currently written, like many pathogen genomics papers, this paper presents analyses and results, but leaves the rationale for the analyses and the interpretation of the results in terms of scientific hypotheses unclear. While this is not atypical for papers in the field, it makes it very hard for an interested non-specialist in the subject matter (ecology of M. bovis) to appreciate the paper. For a general interest journal like eLife, it is problematic because the reader is left with an unclear sense of what has and has not been shown. It seems that the major hypotheses being tested revolve around the extent to which badgers are a reservoir for cattle M. bovis infection. Put somewhat more precisely, finding that they are a reservoir would mean that badger-badger transmission sustains the infection, that badger-to-cattle transmission is frequent and is the source (immediate or ultimate) of most cattle infections. There are four main types of analysis in this paper. In a revised version we would expect each of these to be motivated explicitly by "we hypothesized x and tested it using this method and found that results were/were not consistent with x". This is the main missing item in the paper.

1) Random forest and boosted regressions. The reason for doing both is unclear, and the boosted regressions are not described much at all. The RF method seems perhaps to be testing the plausibility of the idea that genetic distance indicates likely transmission. This seems more or less borne out by the main results where spatial and social proximity have strong explanatory roles (the sign is not stated but I assume that epi distance and genetic distance are positively related). I'm not sure that I'd expect such a relationship to hold over the full scale of distances (beyond some distance I would think there would be no relationship anymore) but this is a detail. The finding that "same host" has no role in the RF is quite weird – usually isolates from the same host would be nearly identical. No explanation is given for why the many different network and spatial measures are used, or which one expects to be positive in such a complex model, especially assuming that this is multivariable, so they are all conditional on the others. Overall, the RF seems maybe to be consistent with the data being of high quality and with transmission being related to short genetic distance, but not to clearly refute or confirm any hypothesis. Please clarify why these analyses were done and what the results mean.

2) Phylogenetic reconstruction. Here most of the clades have high probability of ancestral nodes being in cattle, seemingly inconsistent with the badger-reservoir hypothesis. Please comment on how these results should be interpreted.

3) The epidemiological descriptive data, where badgers seemed to have it long before cattle. Seems consistent with the reservoir hypothesis, though badgers are also much more widely sampled. Please make this explicit (modified if it has been misunderstood).

4) The structured coalescent analyses, in which badger-to-cattle transmissions seem to be much more common than the reverse under nearly all models, including the best-supported ones. One aspect that confuses me is that presumably the different lifespans of the hosts lead to different durations of the infection, so I am not sure if number of transmissions per unit time is the best measure of transmission. But taken at face value this seems consistent with the reservoir hypothesis. Please clarify what you think the interpretation is, and in particular (two reviewers wondered) whether these can be interpreted as the ratio of rates (transmissions per unit time), of basic or effective reproductive numbers (transmissions per infection), or something else that has physical interpretation.

Assuming that the Reviewing Editor, a nonexpert in the substantive field, has understood the above correctly, please modify the discussion to give careful consideration of the contrasting observations (3 and 4 support the hypothesis, 2 argues against it) and how they can be reconciled. If the authors can convincingly do that and answer (even with uncertainty) a clear scientific question, this could become publishable.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Combining genomics and epidemiology to analyse bi-directional transmission of Mycobacterium bovis in a multi-host system" for further consideration at eLife. Your revised article has been favorably evaluated by Neil Ferguson (Senior Editor), a Reviewing Editor, and one reviewer. This paper has been extensively revised, and the scientific logic is now far clearer. There are some remaining issues that need to be addressed before acceptance, as outlined below:

1) The sampling selected for genetically similar isolates in the two host species, which will (I believe in all cases) increase the estimated transition rates above that which is typical for all strains. For example, any spoligotypes that are not transmitted between the species will not be counted. This is an important caveat to the conclusions about the frequency of interspecies transfer and needs to be explicit in the Discussion.

2) The phylogeny and BASTA analyses document interspecies transfer in both directions. The regression trees seemingly show the relevance of within-species transmission. Neither alone nor together do they answer the question of reservoir – is transmission in either species sufficient on its own for maintenance of the infection and continuing spillover into the other? 
The Discussion recommends integrated control, and intuitively this seems sensible, but on their own evidence of transmission within each species and between the two does not prove that essentially R_0{ii} >1 for either species, where ii represents transmission from species i to species i, and this is a condition for i to be a reservoir. I believe that the data are formally consistent with the possibility that eradication in either species would eradicate in the other (seems unlikely, as it requires a big role for interspecies transmission) or, more plausibly, that eradication in one species would eliminate it in the other because R0 within that species <1. If this reasoning is wrong, please refute. If it is right, please note this in the discussion and soften the call for integrated control. 3) Is there any way to quantify the ratio of within to between-species transmissions? This is hinted at frequently, but the numbers are never given. 4) The inclusion of a factor in the RF and BRT analyses does not guarantee that it is included in the expected direction. Can the authors report the direction of the effect for each included factor and explain any discrepancies from expectation, e.g. that overlapping lifespan = lower distance? 5) Can the authors explain, in Figure 3 a)what "mean posterior probability of each rate" means (I think it means posterior probability that it is positive) and b) why the ratio of transition counts and ratio of transition rates is so different? 6) No clear answer was given to essential revision 4, which asked in what if any sense these transition rate ratios can be interpreted as reproductive number ratios or something else epidemiological. Please comment on this in the Discussion. Also please edit carefully to use "transition" rather than "transmission" or explain why both these terms appear (as far as I can tell interchangeably) in the text. 
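The reservoir condition raised in point 2 of the decision letter can be illustrated with a two-species next-generation matrix. The following is a minimal sketch, not part of the paper's analysis: the function names and all numerical values are hypothetical, and `r_xy` is assumed to mean the expected number of new infections in species x caused by one infected individual of species y (b = badgers, c = cattle). Infection persists in the coupled system when the dominant eigenvalue of the matrix exceeds 1, and a species can maintain infection alone only when its own within-species term exceeds 1.

```python
import math


def dominant_eigenvalue(r_bb, r_bc, r_cb, r_cc):
    """Largest eigenvalue of the 2x2 matrix [[r_bb, r_bc], [r_cb, r_cc]]."""
    trace = r_bb + r_cc
    discriminant = (r_bb - r_cc) ** 2 + 4.0 * r_bc * r_cb
    return (trace + math.sqrt(discriminant)) / 2.0


def persistence_summary(r_bb, r_bc, r_cb, r_cc):
    """Report whether infection can persist jointly and in each species alone."""
    return {
        "both_species": dominant_eigenvalue(r_bb, r_bc, r_cb, r_cc) > 1.0,
        "badgers_alone": r_bb > 1.0,  # within-badger term on its own
        "cattle_alone": r_cc > 1.0,   # within-cattle term on its own
    }


# Hypothetical scenario A: neither species sustains infection alone, but
# strong between-species transmission keeps the coupled system above 1.
scenario_a = persistence_summary(r_bb=0.8, r_bc=0.5, r_cb=0.5, r_cc=0.8)

# Hypothetical scenario B: badgers sustain infection alone, cattle do not,
# so removing infection from badgers would also let it fade out in cattle.
scenario_b = persistence_summary(r_bb=1.2, r_bc=0.3, r_cb=0.3, r_cc=0.6)

print(scenario_a)
print(scenario_b)
```

Both hypothetical scenarios show within- and between-species transmission in both directions, which is why evidence of such transmission alone does not settle which species, if either, acts as a reservoir.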
Essential revisions: The major required revision is to place the work in the context of clear hypotheses, predictions of these hypotheses for the results of analyses, and interpretation of the analyses as they inform our judgment on those hypotheses. As currently written, like many pathogen genomics papers, this paper presents analyses and results, but leaves the rationale for the analyses and the interpretation of the results in terms of scientific hypotheses unclear. While this is not atypical for papers in the field, it makes it very hard for an interested non-specialist in the subject matter (ecology of M. bovis) to appreciate the paper. For a general interest journal like eLife, it is problematic because the reader is left with an unclear sense of what has and has not been shown. It seems that the major hypotheses being tested revolve around the extent to which badgers are a reservoir for cattle M. bovis infection. Put somewhat more precisely, finding that they are a reservoir would mean that badger-badger transmission sustains the infection, that badger-to-cattle transmission is frequent and is the source (immediate or ultimate) of most cattle infections. There are four main types of analysis in this paper. In a revised version we would expect each of these to be motivated explicitly by "we hypothesized x and tested it using this method and found that results were/were not consistent with x”. This is the main missing item in the paper. Thank you for highlighting this critical issue. The following changes were completed in response: – Added description of our hypothesis and three objectives to the end of the Introduction. – Added statements into the Results section to explicitly link each result to an objective and our hypothesis. – Added similar explicit linking statements into the Materials and methods section. 1) Random forest and boosted regressions. The reason for doing both is unclear, and the boosted regressions are not described much at all. 
The RF method seems perhaps to be testing the plausibility of the idea that genetic distance indicates likely transmission. This seems more or less borne out by the main results where spatial and social proximity have strong explanatory roles (the sign is not stated but I assume that epi distance and genetic distance are positively related). I'm not sure that I'd expect such a relationship to hold over the full scale of distances (beyond some distance I would think there would be no relationship anymore) but this is a detail. The finding that "same host" has no role in the RF is quite weird – usually isolates from the same host would be nearly identical. No explanation is given for why the many different network and spatial measures are used, or which one expects to be positive in such a complex model, especially assuming that this is multivariable, so they are all conditional on the others. Overall, the RF seems maybe to be consistent with the data being of high quality and with transmission being related to short genetic distance, but not to clearly refute or confirm any hypothesis. Please clarify why these analyses were done and what the results mean. Thank you for highlighting this. We have made the following changes: – Additional analyses using Boosted Regression models were completed to allow the two methods to be more directly compared (Appendix 1—figures 2, 3 and 4 were updated and changes were made in the Results section of the main manuscript and to Appendix 1: Metric importance in Random Forest and Boosted Regression Analysis). – Additional text was included to provide clarity regarding the use of these analyses in the Results section. – The reference to the “same host” epidemiological metric has been removed. This metric wasn’t informative in the machine learning analyses because there were too little data available (only 201 of the 12,483 badger-to-badger comparisons were between genomes sourced from the same animal). 
– The trends in the relationship between each predictor variable and the genetic distances were examined using partial dependence plots. The relationships were found to be non-linear and variable, and partial dependence plots can be misleading when highly correlated predictor variables are present in the model; therefore, we only added broad statements about direction in the Results section. – Additional analyses were described in the Results section, which investigated the influence of missing data and highly correlated predictor variables. 2) Phylogenetic reconstruction. Here most of the clades have high probability of ancestral nodes being in cattle, seemingly inconsistent with the badger-reservoir hypothesis. Please comment on how these results should be interpreted. We agree that these results were inconsistent with our hypothesis. Re-examining our analyses, it was clear that the ancestral character estimation method used was highly sensitive to sampling biases and therefore these analyses were removed. The BASTA analyses we describe address a similar question, but BASTA is considered more robust because it can account for the known sampling biases in our dataset. The following changes were made: – Removed text referring to ancestral character estimation throughout manuscript. – Figure 1 and its legend were updated to remove reference to the ancestral character estimation. – Provided additional text in the Discussion noting that clades 1, 2, 3, and 5, which appear to have a cattle origin, are likely to originate from outside of Woodchester Park where we have no badger isolates and therefore sampling biases may influence any observations of a cattle origin. – Added additional explanatory text in the Results section to highlight the influence of sampling biases. 3) The epidemiological descriptive data, where badgers seemed to have it long before cattle. 
Seems consistent with the reservoir hypothesis, though badgers are also much more widely sampled. Please make this explicit (modified if it has been misunderstood). The epidemiological description (Figure 2) is restricted to the animals associated with clade 4 in Figure 1. The observation that the badgers in this figure were sampled over a broader temporal window is a reflection of the clade 4 strain rather than our sampling. In fact, in our research the cattle population was sampled over a broader temporal window (1988-2013) as compared to the badger population (2000-2011). To clarify this the following changes were made: – Additional text was added to the legend of Figure 3. – Minor changes were made to Figure 2 and its legend. – Included four additional supplementary figures to Figure 1 (Figure 1—figure supplements 1, 2, 3, and 4) documenting the life histories of the animals associated with clades 1, 2, 3 and 5. These are referred to in the Results section. 4) The structured coalescent analyses, in which badger-to-cattle transmissions seem to be much more common than the reverse under nearly all models, including the best-supported ones. One aspect that confuses me is that presumably the different lifespans of the hosts lead to different durations of the infection, so I am not sure if number of transmissions per unit time is the best measure of transmission. But taken at face value this seems consistent with the reservoir hypothesis. Please clarify what you think the interpretation is, and in particular (two reviewers wondered) whether these can be interpreted as the ratio of rates (transmissions per unit time), of basic or effective reproductive numbers (transmissions per infection), or something else that has physical interpretation. You are correct that the different lifespans of the animals will lead to different durations of infection. 
However, the transmission rates estimated in BASTA (our structured coalescent approach) are done at the population level and for these analyses it is the time from infection to transmission that is important. In addition, the transmission rates are at the deme level rather than the individual level. Lastly, the average lifespan of dairy cattle (6.5 years) and badgers (5-8 years) are fairly similar. In the manuscript, we had incorrectly referred to the transmission rates (Figure 3B) as “cattle-to-badger” or “badger-to-cattle”, which should have used “badgers” – this has been corrected. Similarly, in reference to the estimation of the number of within and between species transition events (Figure 3D) we incorrectly used “badgers” and “cattle” – these have been corrected to “cow” and “badger”. To help with the interpretation of the results of the BASTA analyses, an additional panel was added to Figure 3C. Figure 3C presents the median ratio of the badgers-to-cattle transmission rate divided by the cattle-to-badgers transmission rate estimated by each model analysed in BASTA. Additional references in the text for Figure 3C were added. Assuming that the reviewing editor, a nonexpert in the substantive field, has understood the above correctly, please modify the Discussion to give careful consideration of the contrasting observations (3 and 4 support the hypothesis, 2 argues against it) and how they can be reconciled. If the authors can convincingly do that and answer (even with uncertainty) a clear scientific question, this could become publishable. Thank you for this recommendation. In the revised manuscript the Discussion has been re-written. The aim of this re-write was to produce a shorter and clearer discussion that describes how each analysis and its results should be interpreted in the context of our hypothesis and the broader literature. 
We note that, while we understand why it seemed like the observations contradict each other, by recognising the role that biases played in analysis 2, these are now consistent. [Editors' note: further revisions were requested prior to acceptance, as described below.] This paper has been extensively revised, and the scientific logic is now far clearer. There are some remaining issues that need to be addressed before acceptance, as outlined below: 1) The sampling selected for genetically similar isolates in the two host species, which will (I believe in all cases) increase the estimated transition rates above that which is typical for all strains. For example, any spoligotypes that are not transmitted between the species will not be counted. This is an important caveat to the conclusions about the frequency of interspecies transfer and needs to be explicit in the Discussion. Our analyses were limited to highly related genomes. There were very few examples of non-SB0263 isolates in the sampled badgers (more than 90% of the badger-derived isolates were SB0263), and therefore, for this population, we believe our results are a good representation of the epidemiological characteristics of the system. Further, given the small number of samples of differing spoligotypes and the large genetic distances between them, we would likely be unable to improve the accuracy of our estimates even if we added these samples. However, we do agree that the selection of SB0263 may artificially inflate the importance of badger-to-cattle transmission over cattle-to-cattle transmission. This selection bias is alluded to in the Discussion: “In addition, we have only considered spoligotype SB0263 and there are known phenotypic differences between spoligotypes, though such differences are unlikely to fundamentally change the epidemiology (Garbaccio et al., 2014; Wright et al., 2013).” In addition, more detail is provided in the Materials and methods section: “More than 90% of the badger-derived isolates were spoligotype SB0263. 
More than 75% (1096/1442) of the isolates available from cattle within 10km of Woodchester Park shared the same spoligotype and it is the second most common type found across England (Smith et al., 2003; Smith, Gordon, de la Rua-Domenech, Clifton-Hadley, and Hewinson, 2006).” The section in the Discussion has been edited and expanded: “In addition, we selected only isolates of spoligotype SB0263, since this was the dominant type in the badger population. […] In addition, many different M. bovis spoligotypes have been observed in sympatric badger and cattle populations (Smith et al., 2003) and SB0263 is not only one of the commonest spoligotypes in the UK (Smith et al., 2003), it is also highly prevalent in cattle around Woodchester Park.” 2) The phylogeny and BASTA analyses document interspecies transfer in both directions. The regression trees seemingly show the relevance of within-species transmission. Neither alone nor together do they answer the question of reservoir – is transmission in either species sufficient on its own for maintenance of the infection and continuing spillover into the other? The Discussion recommends integrated control, and intuitively this seems sensible, but on their own evidence of transmission within each species and between the two does not prove that essentially R_0{ii} >1 for either species, where ii represents transmission from species i to species i, and this is a condition for i to be a reservoir. I believe that the data are formally consistent with the possibility that eradication in either species would eradicate in the other (seems unlikely, as it requires a big role for interspecies transmission) or, more plausibly, that eradication in one species would eliminate it in the other because R0 within that species <1. If this reasoning is wrong, please refute. If it is right, please note this in the Discussion and soften the call for integrated control. 
Thank you for highlighting this issue and we agree that the language used needs to be tightened. We have replaced the use of the term ‘reservoir’ with respect to the badger population and have removed our statements suggesting the badgers are maintaining infection. Instead, we note that our evidence suggests that infection can persist in the badger population independently for over 10 years. We have softened our statement calling for coordinated control by changing “it will be necessary” to “it may be necessary” in the Discussion. Lastly, we have edited the final sentence of the abstract to state: “If representative, our results suggest that control operations should target both cattle and badgers.” 3) Is there any way to quantify the ratio of within to between-species transmissions? This is hinted at frequently, but the numbers are never given. It would be possible to quantify the ratio of the estimated number of within- and between-individual transmission events (from Figure 3C) but we don’t feel this is appropriate. These counts of the transmission events between individual animals can only be considered conservative estimates of the minimum number of events because they don’t account for multiple host transitions (badger or cow) on a single branch from a parent node to its child. In addition, we assumed that, where possible, the host animal represented the parent and one of the child nodes. In contrast, the more robust inter-species transmission rates are explicitly estimated at the population level and account for missed individuals on the transmission chains by allowing multiple host transitions on a single branch. We included additional sentences describing the transmission event counts in the discussion in the Results section: “The counts of events between individual animals outputted by BASTA represent the lower bound of the number of transmission events that occurred over the evolutionary history of the sampled M. 
bovis population because they are estimated on the transmission chains between the sampled and ancestral host animals and don’t account for missing individuals in these chains.” Also, additional lines were added in the Discussion section: “These counts provide a conservative estimate of the minimum number of transitions between the sampled animals and their ancestors. While it is not appropriate to directly compare the counts within- and between-species, they do demonstrate that, at a minimum, within-species transmission occurs at least twice as frequently as between-species transmission.” Lastly, we created a supplementary figure to Figure 3 (Figure 3—figure supplement 1) that illustrates how the estimated transmission events were counted on each phylogeny in the posterior distribution of trees estimated by the two deme model in BASTA. 4) The inclusion of a factor in the RF and BRT analyses does not guarantee that it is included in the expected direction. Can the authors report the direction of the effect for each included factor and explain any discrepancies from expectation, e.g. that overlapping lifespan = lower distance? Partial dependence plots were produced from the Random Forest analyses estimating the direction of the relationship between each epidemiological metric and genetic distances. These plots were added into Appendix 1—figures 5, 6, and 7. In addition, two new random metrics were included in the Random Forest and Boosted Regression analyses to provide an indication of the importance that could be attributed to a variable that had no relationship to the genetic distances (see updates to Appendix 1— figures 2, 3, and 4). 
Sentences describing the trends between the epidemiological metrics and genetic distances have been included in Appendix 1: “Partial dependence plots were used to estimate the direction of the effect between each of the epidemiological metrics (predictor variables) and the genetics distances (response variable) (Appendix 1—figure 5, Appendix 1—figure 6, and Appendix 1—figure 7). […] There was a lot of noise around these relationships, but these trends were in-line with our expectations that cattle and wildlife in close proximity in time and space are more likely to transmit infection to one another.” 5) Can the authors explain, in Figure 3 a) what "mean posterior probability of each rate" means (I think it means posterior probability that it is positive) and b) why the ratio of transition counts and ratio of transition rates is so different? a) We apologise for the confusion. Based on the number of inter-species rates estimated, this value is either the posterior probability directly (for the first three models) or the mean calculated across the different inter-species transmission rates that were estimated and summed, as seen in Author response table 1 (derived from Figure 5).

Author response table 1.

	2 demes	3 demes – outer is both	3 demes – outer is cattle	3 demes – outer is badgers	4 demes	6 demes – north and south	6 demes – east and west	8 demes – north and south	8 demes – east and west
CB	1	1	1	2	2	3	3	4	4
BC	1	1	1	2	2	3	3	4	4
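The summary value described in response 5a can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the dictionary copies the counts from Author response table 1, while the function name `summarise_rate_support` and the per-rate posterior probabilities passed to it are hypothetical.

```python
from statistics import mean

# Number of inter-species rates estimated per direction in each model,
# copied from Author response table 1 (the CB and BC rows are identical).
RATES_PER_DIRECTION = {
    "2 demes": 1,
    "3 demes - outer is both": 1,
    "3 demes - outer is cattle": 1,
    "3 demes - outer is badgers": 2,
    "4 demes": 2,
    "6 demes - north and south": 3,
    "6 demes - east and west": 3,
    "8 demes - north and south": 4,
    "8 demes - east and west": 4,
}


def summarise_rate_support(model, rate_probabilities):
    """Return the value reported for one direction of one model.

    rate_probabilities holds one posterior probability per estimated
    inter-species rate in that direction (hypothetical inputs).
    """
    expected = RATES_PER_DIRECTION[model]
    if len(rate_probabilities) != expected:
        raise ValueError(
            f"{model}: expected {expected} rate(s), got {len(rate_probabilities)}"
        )
    if expected == 1:
        # Single rate: the posterior probability is reported directly.
        return rate_probabilities[0]
    # Several rates: the mean of their posterior probabilities is reported.
    return mean(rate_probabilities)


# Hypothetical values, for illustration only.
print(summarise_rate_support("2 demes", [0.9]))
print(summarise_rate_support("4 demes", [0.8, 0.6]))
```

The length check simply guards against pairing a model with the wrong number of rates; it is a design choice of this sketch, not something described in the response.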

To improve clarity, text has been added to the legend of Figure 3, clarifying how the numbers representing the posterior probabilities were calculated: “The values above each vertical line represent the posterior probability of each rate, either as a mean of probabilities associated with multiple estimated rates (for the 3Deme_outerIsBadgers, 4Deme, 6Deme and 8Deme models) or a single probability (for the 2Deme, 3Deme_outerIsBoth, and 3Deme_outerIsCattle models).” b) As stated in our response to point 3, the transmission counts represent the lower bounds on the number of events between individual animals. In contrast, the inter-species transmission rates are estimated at the population level and can’t be directly compared to these counts. In addition, because the counts don’t account for missing individuals on the transmission chains from the ancestral individuals to the sampled animals, they are susceptible to sampling biases. To avoid confusion, panel C of Figure 3 was removed. In addition, the new panel C (previously D) has been edited to improve its clarity. The counts for within species and between species events have been separated and the number of badgers and cattle sampled at the tips of the phylogenies that the counts are derived from has been noted. 6) No clear answer was given to essential revision 4, which asked in what if any sense these transition rate ratios can be interpreted as reproductive number ratios or something else epidemiological. Please comment on this in the Discussion. Also please edit carefully to use "transition" rather than "transmission" or explain why both these terms appear (as far as I can tell interchangeably) in the text. a) The relative transmission rates are the most appropriate calculations given the evolutionary analysis methods that were used in this paper. It isn’t appropriate to discuss these in terms of the reproductive ratio because we are calculating population-level rather than individual transmission rates. 
We note that direct estimates of reproductive numbers are possible but would require extensive additional work using different methods. b) In our case the inter-population transition rates estimated here can be considered inter-species transmission rates because the populations we consider are species specific. As noted in 6a above, these are estimates at the population rather than individual animal level. Any instances of the misuse of these terms in the manuscript were corrected. Additional text was added to the Discussion to clearly define our use of transmission and transition: “The BASTA analyses estimated transition rates between demes within a structured population. Since the demes within the structured model were species-specific the estimated transition rates can be considered equivalent to transmission rates between populations of badgers and cattle.”
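The population-level rate ratio discussed in the responses above (the median badgers-to-cattle rate divided by the cattle-to-badgers rate) can be sketched as below. This is an illustration under assumptions, not the authors' pipeline: the function name, the input layout, and all numbers are hypothetical, and summing the deme-level rates within each direction before taking the ratio is an assumption made here for models with more than two demes.

```python
from statistics import median


def median_rate_ratio(posterior_samples):
    """Median badgers-to-cattle / cattle-to-badgers rate ratio.

    posterior_samples is a list of (badger_to_cattle_rates,
    cattle_to_badger_rates) pairs, one pair per posterior sample; each
    element is a list of deme-level rates for that direction.
    """
    ratios = []
    for badger_to_cattle, cattle_to_badger in posterior_samples:
        numerator = sum(badger_to_cattle)    # assumption: rates are summed
        denominator = sum(cattle_to_badger)  # across demes in a direction
        if denominator > 0:
            ratios.append(numerator / denominator)
    if not ratios:
        raise ValueError("no sample had a positive cattle-to-badgers rate")
    return median(ratios)


# Hypothetical posterior samples from a two-deme model (one rate per direction).
two_deme = [([2.0], [1.0]), ([3.0], [1.0]), ([8.0], [2.0])]
print(median_rate_ratio(two_deme))

# Hypothetical posterior samples from a four-deme model (two rates per direction).
four_deme = [([1.0, 2.0], [0.5, 0.5]), ([2.0, 2.0], [1.0, 1.0])]
print(median_rate_ratio(four_deme))
```

Because the inputs are deme-level rates, the result is a ratio of population-level rates and, as the response notes, should not be read as a ratio of reproductive numbers.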

Appendix 1—table 1.

Epidemiological metrics capturing the spatial, temporal, and network relationships between a pair of sampled animals.

Whether or not the metric was used in the badger–badger, cattle–cattle, and badger–cattle comparisons is indicated.

Epidemiological metrics	Badger-Badger	Cattle-Cattle	Badger-Cattle
Same main [herd/social group]?	YES	YES	NO
Same sampled [herd/social group]?	YES	YES	NO
Same infected [herd/social group]?	YES	NO	NO
Spatial distance between main [herd/social group]s	YES	YES	YES
Spatial distance between sampled [herd/social group]s	YES	YES	YES
Spatial distance between infected [herd/social group]s	YES	NO	NO
Distance from closest land parcel to main [herd/social group] using centroids	NO	NO	YES
Distance from closest land parcel to sampled [herd/social group] using centroids	NO	NO	YES
Number of days overlap between the recorded lifespans	YES	YES	YES
Number of days overlap between the infected lifespans	YES	NO	NO
Number of days spent in same [herd/social group]	YES	YES	NO
Number of days between infection detection dates	YES	NO	YES
Number of days between sampling dates	YES	YES	NO
Number of days between breakdown dates	NO	YES	NO
Number of recorded [cattle movements/dispersal events] between main [herd/social group]s	YES	YES	NO
Number of recorded [cattle movements/dispersal events] between sampled [herd/social group]s	YES	YES	NO
Number of recorded [cattle movements/dispersal events] between infected [herd/social group]s	YES	NO	NO
Shortest path length between main [herd/social group]s	YES	YES	NO
Mean number of [cattle/badgers] traversing edges of shortest path between main [herd/social group]s	YES	YES	NO
Shortest path length between sampled [herd/social group]s	YES	YES	NO
Mean number of [cattle/badgers] traversing edges of shortest path between sampled [herd/social group]s	YES	YES	NO
Shortest path length between infected [herd/social group]s	YES	NO	NO
Mean number of [cattle/badgers] traversing edges of shortest path between infected [herd/social group]s	YES	NO	NO
Number of [cattle/badgers] recorded in both main [herd/social group]s	YES	YES	NO
Number of [cattle/badgers] recorded in both sampled [herd/social group]s	YES	YES	NO
Number of [cattle/badgers] recorded in both infected [herd/social group]s	YES	NO	NO
Shortest path length between main [herd/social group]s (some [herd/social group]s excluded)	NO	YES	NO
Mean number of [cattle/badgers] traversing edges of shortest path between main [herd/social group]s (some [herd/social group]s excluded)	NO	YES	NO
Shortest path length between sampled [herd/social group]s (some [herd/social group]s excluded)	NO	YES	NO
Mean number of [cattle/badgers] traversing edges of shortest path between sampled [herd/social group]s (some [herd/social group]s excluded)	NO	YES	NO
Shortest path length between infected [herd/social group]s (some [herd/social group]s excluded)	NO	YES	NO
Mean number of [cattle/badgers] traversing edges of shortest path between infected [herd/social group]s (some [herd/social group]s excluded)	NO	YES	NO

Appendix 1—table 2.

The 15 M. bovis isolates whose inter-isolate genetic distances were poorly predicted (median difference between actual and predicted genetic distances outside 95% percentile) by the Random Forest and/or Boosted Regression models.

Those isolates whose spoligotypes did not match the phylogenetic patterns are also listed.

Isolate ID	Outlier - Random Forest	Outlier - Boosted Regression	Phylogenetic-Spoligotype mismatch
WB65	YES	YES	NO
WB15	YES	YES	NO
WB137	NO	YES	NO
WB70	YES	YES	NO
WB98	YES	YES	NO
WB99	YES	YES	NO
WB71	NO	YES	YES
WB105	YES	YES	YES
WB106	YES	YES	NO
WB74	YES	YES	NO
WB75	YES	YES	NO
WB107	NO	NO	YES
WB72	NO	NO	YES
WB96	YES	NO	NO
WB100	YES	NO	YES

