Ana C Reis1,2, Mónica V Cunha1,2. 1. Centre for Ecology, Evolution and Environmental Changes (cE3c), Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal. 2. Biosystems & Integrative Sciences Institute (BioISI), Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal.
The newly sequenced data included in this work are deposited under the following Biosample accession numbers: SAMN17004141–SAMN17004143, SAMN17004145– SAMN17004174, SAMN17004176–SAMN17004184 and under the Bioproject accession number PRJNA682618 at a public domain server in the National Centre for Biotechnology Information (NCBI) SRA database.All supporting data, code and protocols have been provided within the article or through supplementary data files which can be found at https://doi.org/10.6084/m9.figshare.15025965., the causative agent of animal tuberculosis (TB), is proposed to evolve as a conservative strictly clonal microorganism. In the current work, a dataset composed of 70
representing the entire clonal diversity was used to assess genome structure, composition and evolution. Plus, in a strain subset (n=42) circulating in an endemic multi-host TB scenario, the polymorphisms in virulence-related genes were scrutinized. The obtained results support an open and dynamic pan-genome architecture, with a small core-genome component when compared to the accessory-genome component. The diversity of the accessory-genome component might be of crucial importance for
adaptive capacity. The complementary ancestral reconstruction evidenced an overall gene loss supplanting gene gain, with the majority of gene gains occurring in terminal branches. Globally, we provide insights into the evolutionary history of
on a population level and clarify clade-specific virulence signatures of field isolates. Contrary to the paradigmatically view on conservative MTC genomes, this work evidences striking genomic diversity among
field strains.
Introduction
The
complex (MTC) encompasses 12 closely related mycobacteria that cause tuberculosis (TB) in a variety of mammalian hosts [1].
is the most well-known member, as it is a prominent human pathogen, accounting for approximately 10 million new cases and 1.4 million deaths in 2018 [2].
is the ecotype that infects the broadest range of host species. It causes animal TB primarily in cattle, but also in other livestock species [3]. Some wildlife species are pathogen reservoirs and infection amplifiers, helping to sustain TB across a variety of ecosystems in different geographic locations [4]. Well-known epidemiological scenarios referred to in the literature include the Iberian Peninsula, United Kingdom, Australia, New Zealand, South Africa and USA [5-7]. Furthermore,
has been associated with zoonotic infection in humans, especially in low-income countries and where the boundaries between humans and animals fade away [8].The MTC members present a high level of similarity at the nucleotide level, with recent comparative genomics works evidencing an average nucleotide identity >99% [9, 10], however they exhibit variable host tropisms, phenotypes and degrees of pathogenicity [11]. Over the years, genomic diversity analyses have been fundamental to the understanding of evolutionary processes involving MTC as a whole, starting with the study of specific genomic regions, such as the Regions of Difference (RD), proceeding to the phylogeny of specific members, leading to the identification of different
lineages and
clonal complexes [12-16]. Current knowledge supports that
is subdivided into five main clonal complexes: European 1 (Eu1), European 2 (Eu2), European 3 (Eu3), African 1 (Af1) and African 2 (Af2) [12-16]. The clonal complexes are identified by the absence of specific spacers in their spoligotyping profile, specific deletions and single nucleotide polymorphisms (SNPs) in precise genes. Their geographic distribution exhibits differences, with Af1 being predominant in West-Central Africa, Afr2 in East Africa, Eu2 in Iberian Peninsula and Brazil, Eu3 in France and Italy, while Eu1 evidence a worldwide distribution [16-18]. Nevertheless, genome composition and functional diversity, as well as recognition of core- and accessory-gene components, are still overlooked topics for the ecotypes infecting animals.The pan-genome consists of the entire gene repertoire of a given species, and its analysis is a very informative approach, enabling stratification into the core-genome, containing genes shared by all strains, and the accessory-genome, covering dispensable genes that are not present in all strains, as well as unique genes, also known as singletons, that are particular to specific strains [19]. The core-genome determines the backbone of all strains of a given species and contains genes necessary to the most basic intraspecific biological processes, being responsible for the major phenotype [19]. The accessory-genome adds genetic diversity to the species and may involve other metabolic pathways that crucially assure adaptation to different ecological niches, colonization of new hosts or other functions that are beneficial over other species [19, 20].The differentiation of genetic variants, grounded by multi-genome studies enabled by whole-genome sequencing, has become central in the field of disease epidemiology, contributing to gaining insights in host–pathogen interactions, virulence, transmission and resistance determinants [21, 22]. In MTC, the combined effect of deletion/insertion events together with the presence of SNPs across members contribute to differences in host ranges, pathogenicity, disease progression and outcome. In fact, previous works performing comparative genome, proteome or secretome analyses of
H37Rv,
and
variant BCG (bacillus Calmette-Guérin) revealed important differences among these ecotypes [20, 23, 24]. For instance, work by Pelayo and collaborators identified about 700 SNPs that could differentiate
BCG and
strains [25].Proteins or genes related to virulence are involved in a diversity of pathways and biological processes, including the interaction with host cells and interference with immune responses; survival inside the aggressive microenvironment of host macrophages; infection recalcitrance and recrudescence; response to environmental stress conditions; and auxiliary antibiotic resistance [26]. The genotypic and phenotypic differences revealed by the comparative works performed so far encourage further studies on
genomes, which may be insightful regarding the biological and virulence traits of a pathogen that is far from being understood or controlled. The identification of virulence-associated genes, their biological function and the relationship between virulence-specific and epidemiological-specific signatures are important to understand TB infection, persistence and improve control.In this study, we aimed to gain insights into the
genome structure, composition and evolution by means of a large comparative study at the genome level. For this purpose, 70
.
representing the entire clonal complex diversity were submitted to a comparative genome analysis through a pan-genomic approach, evaluating the genome composition dynamics and performing an ancestral reconstruction of gene gain and loss. Moreover, in a group of
field isolates circulating in an endemic multi-host TB scenario, the polymorphisms in virulence-related genes were ascertained and mutational rates inferred. Globally, we show that molecular approaches to the pan-genome are helpful to gain insights into the evolutionary history of
on a population level and to clarify clade-specific virulence signatures of
field isolates. We then combine this information with phenotypic advantage inferences that may shed light on the evolutionary forces exerted upon this pathogen at the interface with the host in particular epidemiological scenarios.
Methodology
M. bovis dataset
A globally diverse
dataset was collected and used in this study. Sixteen genomes were downloaded as complete/draft genome assemblies up to a maximum of ten scaffolds from NCBI (National Centre for Biotechnology Information); 12 were selected as representatives of the entire
clonal complex diversity [27] and recovered as Illumina fastq files deposited at SRA (Sequence Read Archive); 42
.
genomes were newly sequenced genomes (Table 1 and Table S1, available in the online version of the article, details below).
BCG (bacillus Calmette-Guérin) was excluded from the NCBI search and
AF2122/97, used as reference genome, was included in this dataset. A detailed account of the dataset, including accession code, sequencing strategy, status of completeness and assembly parameters is presented in Table S1.
Table 1.
Characteristics of
genomes used in this work
M. bovis ID
Clonal complex*
Country
Year
Host species
SNP clade†
Lesion type
Reference
Type of sequence
Mb0220
w/o CC
Portugal
2003
Cattle
E
na
[65]
Newly sequenced
Mb0261
Eu2
Portugal
2006
Red deer
D
Type I
[65]
Newly sequenced
Mb0601
Eu2
Portugal
2007
Cattle
B
na
[65]
Newly sequenced
Mb0769
Eu2
Portugal
2008
Cattle
A
Type II
[65]
Newly sequenced
Mb0783
Eu2
Portugal
2008
Wild boar
A
Type II
[65]
Newly sequenced
Mb0865
Eu2
Portugal
2008
Cattle
C
Type I
[65]
Newly sequenced
Mb0891
Eu2
Portugal
2009
Red deer
A
Type III
[65]
Newly sequenced
Mb0893
Eu2
Portugal
2008
Wild boar
B
Type II
[65]
Newly sequenced
Mb1317
Eu2
Portugal
2010
Cattle
D
Type II
[65]
Newly sequenced
Mb1339
Eu2
Portugal
2010
Cattle
A
Type II
[65]
Newly sequenced
Mb1458
w/o CC
Portugal
2010
Wild boar
E
Type II
[65]
Newly sequenced
Mb1480
w/o CC
Portugal
2010
Cattle
E
Type II
[65]
Newly sequenced
Mb1654
Eu2
Portugal
2011
Cattle
A
Type II
[65]
Newly sequenced
Mb1670
w/o CC
Portugal
2011
Red deer
E
Type III
[65]
Newly sequenced
Mb1711
Eu2
Portugal
2011
Red deer
A
Type III
[65]
Newly sequenced
Mb1712
Eu2
Portugal
2011
Red deer
C
Type III
[65]
Newly sequenced
Mb1714
Eu2
Portugal
2011
Cattle
D
Type II
[65]
Newly sequenced
Mb1744
w/o CC
Portugal
2012
Wild boar
E
Type II
[65]
Newly sequenced
Mb1746
Eu2
Portugal
2012
Red deer
B
Type I
[65]
Newly sequenced
Mb1758
Eu2
Portugal
2012
Cattle
A
Type II
[65]
Newly sequenced
Mb1769
Eu2
Portugal
2012
Wild boar
C
Type II
[65]
Newly sequenced
Mb1785
Eu2
Portugal
2012
Red deer
B
Type II
[65]
Newly sequenced
Mb1789
Eu2
Portugal
2012
Cattle
A
na
[65]
Newly sequenced
Mb1841
Eu2
Portugal
2012
Cattle
A
Type II
[65]
Newly sequenced
Mb1870
Eu2
Portugal
2012
Wild boar
A
Type II
[65]
Newly sequenced
Mb1915
Eu2
Portugal
2013
Red deer
D
Type III
[65]
Newly sequenced
Mb1948
w/o CC
Portugal
2013
Red deer
E
Type III
[65]
Newly sequenced
Mb1960
Eu2
Portugal
2013
Red deer
A
Type I
[65]
Newly sequenced
Mb2026
Eu2
Portugal
2013
Cattle
B
Type II
[65]
Newly sequenced
Mb2043
Eu2
Portugal
2013
Red deer
A
Type III
[65]
Newly sequenced
Mb2067
Eu2
Portugal
2013
Wild boar
B
Type II
[65]
Newly sequenced
Mb2206
Eu2
Portugal
2014
Cattle
B
Type III
[65]
Newly sequenced
Mb2235
w/o CC
Portugal
2014
Red deer
E
Type III
[65]
Newly sequenced
Mb2277
w/o CC
Portugal
2014
Red deer
E
Type II
[65]
Newly sequenced
Mb2300
Eu2
Portugal
2014
Wild boar
B
Type II
[65]
Newly sequenced
Mb2310
Eu2
Portugal
2015
Red deer
A
Type I
[65]
Newly sequenced
Mb2313
Eu2
Portugal
2015
Wild boar
D
Type II
[65]
Newly sequenced
Mb2325
Eu2
Portugal
2015
Red deer
D
Type III
[65]
Newly sequenced
Mb2328
Eu2
Portugal
2015
Red deer
A
Type III
[65]
Newly sequenced
Mb2347
w/o CC
Portugal
2015
Wild boar
E
Type II
[65]
Newly sequenced
Mb2395
Eu2
Portugal
2015
Wild boar
B
Type II
[65]
Newly sequenced
Mb2397
Eu2
Portugal
2015
Wild boar
B
Type III
[65]
Newly sequenced
Mb502499
Af1
Ghana
na
Human
na
na
[27, 66]
SRA deposited
Mb502526
Af1
Ghana
na
Human
na
na
[27, 66]
SRA deposited
Mb1203064
Af1
Ghana
na
Human
na
na
[27, 66]
SRA deposited
Mb4117155
Af2
France
na
Wild boar
na
na
[27, 67]
SRA deposited
Mb1791710
Af2
Tanzania
na
Chimpanzee
na
na
[27, 68]
SRA deposited
Mb1791712
Af2
Tanzania
na
Chimpanzee
na
na
[27, 68]
SRA deposited
Mb1792006
Eu1
USA
2006
Cattle
na
na
[68]
SRA deposited
Mb1792127
Eu1
USA
2008
Cattle
na
na
[68]
SRA deposited
Mb1792361
Eu1
USA
2013
Cattle
na
na
[68]
SRA deposited
Mb7240242
Eu1
USA
2016
Cattle
na
na
[68]
SRA deposited
Mb7240415
Eu1
USA
2014
Cattle
na
na
[68]
SRA deposited
Mb1791984
Eu1
USA
2005
Cattle
na
na
[68]
SRA deposited
MBE1
w/o CC
Egypt
2014
Cattle
na
na
na
assemble/draft genomes NCBI
MBE3
w/o CC
Egypt
2014
Cattle
na
na
na
assemble/draft genomes NCBI
MBE4
w/o CC
Egypt
2014
Cattle
na
na
na
assemble/draft genomes NCBI
MBE10
w/o CC
Egypt
2015
Cattle
na
na
na
assemble/draft genomes NCBI
Mb0077
w/o CC
Canada
2006
Elk
na
na
na
assemble/draft genomes NCBI
Mb0565
w/o CC
Canada
2011
Cattle
na
na
na
assemble/draft genomes NCBI
BMR25
w/o CC
Canada
1985
Bison
na
na
na
assemble/draft genomes NCBI
Mb3601
Eu3
France
2014
Cattle
na
na
[16]
assemble/draft genomes NCBI
Mb0476
Eu2
Canada
2002
Cattle
na
na
na
assemble/draft genomes NCBI
MbSP38
Eu2
Brazil
2010
Cattle
na
na
[69]
assemble/draft genomes NCBI
Mb1595
w/o CC
Korea
2012
Cattle
na
na
[70]
assemble/draft genomes NCBI
Mb0030
w/o CC
China
na
na
na
na
[71]
assemble/draft genomes NCBI
Mb0001
Eu2
Brazil
2015
Tapirus terrestris
na
na
na
assemble/draft genomes NCBI
Mb0003
w/o CC
India
1986
Cattle
na
na
na
assemble/draft genomes NCBI
Mb31150
Af2
Uganda
na
Chimpanzee
na
na
[27, 72]
assemble/draft genomes NCBI
MbAF2122/97
Eu1
UK
1997
Cattle
na
na
na
assemble/draft genomes NCBI
*Eu1: European 1, Eu2: European 2, Eu3: European 3, Af1: African 1, Af2: African 2, and w/o CC: without clonal complex.
The 42 newly sequenced
were entirely recovered from a well-characterized animal TB hotspot in Portugal and included isolates recovered from cattle (n=14), red deer (n=16) and wild boar (n=12) over a 12 year period (circa 70% were recovered over a 5 year period) (Table 1 and Table S1) [28]. The selected host species are described as the most relevant to the maintenance of
in a multi-host system in Portugal [7].Cultures used in this work were derived from frozen stocks, prepared after a single in vitro passage of original archived samples. Re-culture and DNA extraction procedures were performed at a biosecurity level three facility. The frozen culture stocks of selected
were regrown in Middlebrook 7H9 (Difco) medium supplemented with 5% sodium pyruvate and 10% ADS enrichment (50 g albumin, 20 g glucose, 8.5 g sodium chloride in 1 l water), at 37 °C, being monitored regularly for growth. Culture pellet was resuspended in 500 µl PBS, inactivated by heating at 99 °C during 30 min, and then stored at −20 °C until WGS.WGS was performed using the Illumina Genome Analyser, according to the manufacturer’s specifications with the paired-end module attachment. The MiSeq (2×250 pb) technology was implemented for 40 samples at United States Department of Agriculture (USDA, USA), while HiSeq (2×150 pb) technology was employed for the remaining two (Eurofins Genomics, Germany).Information concerning the microscopic characterization of TB-compatible lesions was available for 39 infected animals (Table 1). Lesions were histologically classified into three categories according with disease progression: type I – ‘caseous granulomatous lesions’; type II – ‘caseous granulomatous lesions with calcification’; and type III – ‘thick encapsulated caseous granulomatous lesions with calcification’.
Bioinformatics workflow
The bioinformatics workflow followed in this work is detailed in supplementary material (Fig. S1) and comprises de novo assembly and map to reference strategies.
Genome assembly and annotation
Fifty-four genomes (42 newly sequenced and 12 fastq files recovered from SRA) were de novo assembled with Unicycler pipeline, currently available at https://github.com/rrwick/Unicycler [29]. After the reads’ quality analysis (FastQC version 0.11.7; https://github.com/s-andrews/FastQC), and whenever necessary after clean-up with Trimmomatic version 0.36 (http://www.usadellab.org/cms/?page=trimmomatic), genomes were assembled with SPAdes optimizer [29], and subjected to post-assembly optimization with Pilon version 1.18 [30]. A conservative bridging mode was applied, since it avoids misassembly, and the k-mer size was searched and selected between 20–95% of read length. Considering the reads’ size, contigs of less than 300 bp were removed, and a 20-fold cut-off was stabilized following SPAdes guidelines [31].The quast pipeline (http://quast.sourceforge.net/quast.html), which promotes the remapping of contigs with
AF2122/97 reference genome (NCBI accession number LT708304.1), was used to address the quality of the de novo assemblies (Table S1).All the complete genomes/draft assemblies were annotated with the Rapid Annotation using Subsystem Technology (RAST) toolkit, available at webserver https://rast.nmpdr.org [32] using the taxon 1765 (‘
’) and the genetic code 11.
Genome mapping and SNP variant calling
The fastq files of the dataset from Portugal were further submitted to a map to sequence approach. Reads were aligned to the
AF2122/97 reference genome (LT708304.1), using BWA and Samtools [33, 34] through the vSNP pipeline, currently available at https://githubcom/USDA-VS/vSNP. Base quality-score recalibration, SNP and indel (insertion or deletion) discovery were applied across all isolates using standard filtering parameters or variant quality-score recalibration according to Genome Analysis Toolkit (GATK)’s Best Practices recommendations [35, 36]. Results were filtered using a minimum SAMtools quality score of 150 and AC=2.A variant was filtered out if: (i) it was supported by less than 20 reads, (ii) it was found in a frequency of less than 0.9, (iii) it was registered in at least one strain but also with a gap in at least another strain, in order to avoid mapping errors and false SNPs.Integrated Genomics Viewer (IGV) version 2.4.19 (http://software.broadinstitute.org/software/igv/) was employed to visually validate SNPs and positions with mapping issues or alignment problems. SNPs that fell within Proline-Glutamate (PE) and Proline-Proline Glutamate (PPE) genes were filtered from the analysis, as well as indels.The sequencing statistics details are provided in Table S2.The phylogenetic analysis was conducted with mega (Molecular Evolutionary Genetics Analysis) software version 7.0 (https://www.megasoftware.net/) using a concatenated sequence of 1816 bp validated and polymorphic SNPs by applying the maximum-likelihood method with 1000 bootstrap inferences and the GTR (General Time Reversible) model.
Pan-genome analyses
Get-homologues pipeline (https://github.com/eead-csic-compbio/get_homologues) [37] was used to compute core- and pan-genome from the input of complete genomes/draft de novo assemblies genome sequences.Briefly, the source GenBank-formatted files were passed to get_homologues.pl script and instructed to compute nucleotide and protein clusters, by running the Clusters of Orthologous Groups triangles (COGtriangles) and OrthoMCL (Markov Clustering of orthologs, OMCL) algorithms, as previously detailed [38]. PFAM-domain scanning was enabled (-D flag).The directories holding the results from the different clustering algorithms were then passed to the auxiliary script compare_clusters.pl to compute either the consensus core-genome or pan-genome.Since a large number of draft genomes was used in this work, a 75% of sequence alignment coverage and 95% of sequence identity was imposed and all clusters independent of size were included. Moreover, genes were only considered as taxon specific (clusters of size 1) if sequences had no blast hit with any other protein (E-value 1.0E-05) [39]. This strategy was implemented to minimize the inclusion of truncated genes found at the ends of contigs of open genomes. Truncated genes found at the ends of contigs of incomplete genomes are thus not included in any cluster because they have very weak or no reciprocal blast hit.The auxiliary script parse_pangenome_matrix.pl was employed to perform a pan-genome matrix and to classify genes in four categories: core, soft-core, cloud and shell.The pan- and core-genome curves and the new genes were depicted using the Pan-Genome Profile Analyse Tool (PanGP) software, using distance-guide (DG) algorithm and default parameters. According with Tettelin’s review on pan-genome research [40], Heaps' law model was employed to fit the pan-genome size of strains. The auxiliary script roary_plots.py of Roary pipeline (https://github.com/sanger-pathogens/Roary/blob/master/README.md) [41] was implemented to generate a graphical output, using the pan-genome phylogenetic tree and pan-genome matrix as input files. The scoary pipeline (https://github.com/AdmiralenOla/Scoary) [42] was implemented to infer gene enrichment. The pan-genome matrix, pan-genome phylogenetic tree and a trait file grouping genomes by clonal complex were used as input files. Default criteria were applied to run the pipeline script. The Benjamini–Hochberg adjusted P-value (P-value<0.05) was used to validate the genes most over- and under-represented in the specific groups.
Maximum-likelihood pan-genome phylogenetic tree from the pan-genome matrix
The auxiliary script, estimate_pangenome_phylogenies.sh, included in get-phylomarkers pipeline (https://github.com/vinuesa/get_phylomarkers) [43] was used to perform a customized maximum-likelihood (ML) phylogenetic analysis, based on the pan-genome matrix files, returned by compare_clusters.pl script (options ‘-t 0 –m’) from the get-homologues pipeline. Maximum-likelihood phylogenetic tree was constructed by IQ-tree with ultrafast bootstrap approximation (UFBoot) of 1000 replicates. The GTR (General Time Reversible)+FO (optimized base frequencies by maximum likelihood)+R3 (unconstrained base frequencies) model was pointed as the most suitable evolutionary model.
Genome evolution – gene gain and loss dynamics
To investigate genome evolution and dynamics across the genealogy of M. bovis, a ML birth-and-death model was implemented with the software Count version 10.04 (http://www.iro.umontreal.ca/~csuros/gene_content/count.html) [44], using as input files the ML pan-genome phylogeny and the pan-genome matrix of gene presence/absence.Briefly, first the model was optimized by maximizing the likelihood of the data using a gain–loss-duplication model with a Poisson distribution for gene family size at the root, and assuming a gamma-distributed rate variation across gene families, and a variable gain/loss ratio across branches. The remaining parameters were settled as default. The rate parameters were optimized after 100 rounds with a convergence threshold of 0.01.Then, after the optimization of the branch-specific parameters of the model, the ancestral reconstruction was performed using the dollo parsimony method, and gain and loss events were inferred. For each branch i, the gene turnover Ti was defined as Ti= Gi/Li with Gi denoting the branch rate of gene gains and Li, the rate of gene losses.
Functional annotation analysis
For functional annotation analysis, COG categories were associated to the predicted genome CDS via eggnog-mapper v2 webtool (http://eggnog-mapper.embl.de/) [45]. Each CDS was classified into one or more of the 26 upper COG categories. An auto- adjust of the taxonomic scope was enabled and an e-value cut-off of 1e-3 was settled for orthologue detection. The other parameters were settled as default.
Polymorphic analyses of virulence-related genes
To scrutinize
virulence traits, 421 genes described to be essential for virulence of MTC members, as reviewed by Forrellad and collaborators [26] and/or listed in Mycobrowser database (https://mycobrowser.epfl.ch/), were screened for the presence of SNPs (Table S3). Moreover, due to their biological function in signal transduction, gene regulation and fitness, all genes described to be part of the mce family, Lux-R family or to codify toxin and antitoxin proteins were included (Table S3).Globally, the genes with a demonstrated, or putative role, in mycobacteria virulence were grouped into 13 categories according to their function and molecular features: lipid and fatty acid metabolism (n=38), cell-envelope proteins (n=14), mce family (n=33), lipoproteins (n=6), secretion systems (n=54), defence mechanisms (n=33), protein kinases (n=3), proteases (n=7), metal-transported proteins (n=7), genes and metabolism regulation expression (n=23), Lux-R family (n=6), toxin-antitoxin family (n=120), and other virulent proteins (n=77) (Table S3).The SNPs were analysed according to two criteria: (1) number of identified polymorphisms per gene and (2) monomorphic polymorphisms specific to
phylogenetic clades.Phylogenetic analyses were performed with concatenated alignments of SNPs in the whole set of virulence genes (n=195) and concatenated alignments of non-synonymous (NS) SNPs harboured by virulence genes (n=119). The maximum-likelihood trees were conducted in mega (v 7.0) with 1000 bootstrap inferences and GTR model.
Results and discussion
Pan-genome analysis shows an
open pan-genome
Bacterial pan-genome estimation requires de novo assemblies into complete or partial genomes with large contigs, providing an unprecedented opportunity to reveal differences when comparing with reference-guided genome assemblies. A total of 54
.
isolates were de novo assembled using the same sequencing principles and techniques, while the remaining 16 were downloaded from NCBI as complete genome or draft assemblies (Table S1). The seven draft assemblies derived from the public domain were sequenced using the same sequencing technology and assembled through the implementation of two analytical methodologies (Table S1). Excluding the complete genomes (n=9), an average of 113 contigs per draft genome was obtained (Table S1).The 70 complete genome/draft assemblies ranged between 4.17 and 4.36 Mbp, revealing similar genome sizes, and presenting an average GC content of 65.5% (Table S1). The variation in genome size might be the foundation of morphological and physiological differences between strains, contributing to genomic diversity. The predicted number of protein-coding sequences (CDS) pool ranged from 4199 to 4435 (Table S1). The CDS were further clustered in homologue gene clusters and used to estimate the
core- and pan-genomes.The
genome components were defined as core genome (i.e. genes present in all genomes), soft-core genome [i.e. genes present in at least 95% of the genomes (n=66)], cloud genome (i.e. rare genes present only in one/two genomes) and shell genome (i.e. the remaining genes present in several genomes). Shell and cloud formed the accessory component of the genome that contributes to species diversity and may confer selective advantages, such as niche adaptation, antibiotic resistance or colonization of a new host [46]. In this
dataset, the core-genome is composed of 2736 gene clusters, the soft-core by 3708, the shell genome by 1341 and the cloud genome by 2656 (Figs 1 and S2). The accessory component includes 3997 gene clusters, 2656 of which are present only in one or two genomes. On average, each genome has 42 cloud gene clusters, accounting for an average of less than 1% of the total genome-coding capacity. The bimodal, asymmetrical U-shape distribution of gene clusters that is common to many bacterial species indicates that most genes are either rare or to be found in almost all genomes, leading to a smaller proportion of genes at intermediate frequencies (Fig. 2). Our results are in agreement with previous works performed with
,
and combined datasets that evidenced that the accessory component comprise a higher percentage of pan-genome, from circa 21% in a reported dataset with 13
.
to 75% in a combined
/
dataset [47-49]. Moreover, studies performed with mycobacterial genomes have also shown that mycobacteria have a small core-genome component when compared to the accessory-genome component [50, 51].
Fig. 1.
Soft-core and accessory-genome composition of 70
. A maximum-likelihood phylogeny built based on pan-genome presence/absence matrix is represented on the left, and a heat map showing gene presence (dark blue) or absence (light blue) is on the right.
Soft-core and accessory-genome composition of 70
. A maximum-likelihood phylogeny built based on pan-genome presence/absence matrix is represented on the left, and a heat map showing gene presence (dark blue) or absence (light blue) is on the right.The relative proportions of softcore and accessory components per genome is relatively constant, with each
genome presenting on average (mean and standard deviation) 87.5
0.9% of soft-core genes, 11.5
0.73% of shell genes, and 1.0
1.12% of cloud genes (Fig. S3). Our results are in line with the work of Yang and collaborators, which analysed 13
.
strains in which circa 91% of the genome was occupied by core genes [47]; and with the work of Bolotin and Hershberg that associated lower frequencies (circa 2–6%) of cloud genes to bacterial clonal species, including MTC [52].
Core- and accessory-genome size evolution
Core- and pan-genome evolution size estimates were analysed for exponential decay and with growth models (Fig. 2a). According with the obtained growth model,
seems to have an open pan-genome, since there is no distinctly sharp plateau, suggesting that pan-genome size could continue to increase if the number of genomes would also increase (Fig. 2a). Moreover, on average, 32 genes were added by each genome (Fig. 2b). These results are in agreement with the few published works on this topic regarding
[47, 48] and
[47], that also pointed to an open pan-genome. Open pan-genomes are associated with species that colonize multiple environments and that have multiple ways of exchanging genetic material [46].
is known, among MTC members, to have the widest host range, with livestock and wildlife reservoirs across different ecosystems in different countries. Plus, the capacity of escaping the host immune system is documented, as well as survival in several in vitro and in vivo environmental conditions. This high biological flexibility together with the evidence for an open genome suggests that the accessory-genome of
could play an essential role in the adaptive responses triggered by this pathogen.
Fig. 2.
Prediction of
pan-genome size. (a) Genome size evolution curves of the pan-genome (blue) and core-genome (green). The blue boxes denote the
pan-genome size for each genome for comparison, whereas the green boxes show the same comparison for core-genome size. The curve is the least-squares fit of the power law for the average values. (b) Curve (orange) for the number of new genes with an increase in the number of
genomes.
Prediction of
pan-genome size. (a) Genome size evolution curves of the pan-genome (blue) and core-genome (green). The blue boxes denote the
pan-genome size for each genome for comparison, whereas the green boxes show the same comparison for core-genome size. The curve is the least-squares fit of the power law for the average values. (b) Curve (orange) for the number of new genes with an increase in the number of
genomes.
Dynamics of genome composition
To reconstruct the ancestral dynamics of genome composition across
genealogy, a birth-and-death likelihood model was applied. The analysis revealed a higher loss rate (overall mean 0.011) when comparing with gain rate (overall mean 0.005), with terminal branches evidencing higher values. The gene turnover criteria exhibit a value inferior to one, with the exception of two terminal branches (Mb1595 and Mb0077) and two internal nodes, where the rates of gene gain and loss were estimated to be equal.The ancestral reconstruction evidenced an overall gene loss (overall mean 147) over gene gain (overall mean 25), with the majority of gene gains occurring in terminal branches (Fig. 3). The terminal branches presented higher mean values for both events, when comparing with internal nodes (mean gene gain/loss of 34/169 and 15/123, for terminal and internal nodes, respectively). Globally, longer terminal branches show more events, as is evident in the branches leading to Mb31150 or Mb0030 (Fig. 3). Branches leading to genomes included in the clonal complexes Eu1 and Af2 reveal high mean values of gene gain and gene loss, when comparing with the genomes of clonal complexes Eu2 and Af1. However, an expanded dataset with more representatives would be necessary to confirm this trend.
Fig. 3.
genome composition dynamics. Estimated events of genome content evolution are mapped onto the ML phylogeny (GTR) estimated with the pan-genome matrix of gene presence/absence. Numbers, after
labelling, at terminal nodes represent the estimated numbers of gene gain and loss events (Gains/Losses). The two horizontal bars are histograms that represent the relative proportions of soft core (blue), shell (green) and cloud (orange) genes involved in gain (left) and loss (right) events inferred for the terminal branches. The tree is rooted and drawn to scale with branch lengths measured as the number of substitutions per site.
genome composition dynamics. Estimated events of genome content evolution are mapped onto the ML phylogeny (GTR) estimated with the pan-genome matrix of gene presence/absence. Numbers, after
labelling, at terminal nodes represent the estimated numbers of gene gain and loss events (Gains/Losses). The two horizontal bars are histograms that represent the relative proportions of soft core (blue), shell (green) and cloud (orange) genes involved in gain (left) and loss (right) events inferred for the terminal branches. The tree is rooted and drawn to scale with branch lengths measured as the number of substitutions per site.The shell component is the most variable, being involved in both gene gain, in internal nodes, and loss events in both terminal and internal nodes, while gains of cloud genes and loss of soft core occur mainly at terminal nodes. We highlight that the loss of strain-specific genes at terminal branches cannot be detected, therefore the loss of cloud genes is underestimated with this methodology. The variance within the cloud component might be of crucial importance for
adaptive capacity. Discrete
subpopulations can thus be perceived based on both core- and accessory-genome components.
exhibits a puzzling discrepancy with high genome nucleotide identity, but high divergence in genome composition. The aforementioned results indicate that (i) diversity within genome composition arises mostly through gene loss, and (ii) recent events that are not shared by many strains need to be considered when investigating
diversity and evolution.The observed distributions of gain/loss events can either reflect the genealogical process, implying a minimal number of transition steps, or can result from an intricate series of gene gains and losses that likely reflect local selective pressures, such as a tipping point in the ecological niche.
Pan-genome functional annotation
The functions of the predicted CDS encoded in the genomes were inferred from the protein sequences annotated in the COG database. On average, the majority of CDS with COG (85.4%) were attributed to at least one category, while, in a minority of cases, two to five categories could be assigned (‘Several categories’ group) (Table S1 and Fig. S4). The most represented categories were ‘transcription’ (total average=258), ‘lipid metabolism and transport’ (n=242), ‘energy production and conversion’ (n=214) and ‘unknown function’ (n=876) (Fig. S4). The variability within the composition of each COG category was estimated to be between zero (‘RNA processing and modification’) and 24 (‘Energy production and conversion’), when comparing the composition of all
.A gene-enrichment analysis was implemented with the objective of identifying over- and under-represented genes when grouping genomes by clonal complex. After statistical support correction, a higher number of genes was identified in genomes of clonal complex Eu2 (number of genes=171), followed by genomes of Eu1 (n=97), Af1 (n=18) and finally Af2 (n=10). The genes suggested to be over- and under-represented in these clonal complexes were functionally annotated and the categories registering a higher number of occurrences were ‘Unknown function’ and ‘Cell motility’ (Fig. S5).The difference in genome composition among
strains might be driven by selective pressure exerted upon this pathogen, by the environment or the host immune system, however it would be necessary to have tested a larger sample size to further explore this hypothesis.
Pan-genome-based phylogenetic analysis
The phylogenetic analysis based on the presence/absence of pan-genome provided a complementary perspective on the evolutionary relationships among
strains (Fig. 4). Genomes included in the Af1 (n=3) cluster within the same branch, as well as members of Eu1 (n=6). These
were recovered from the same epidemiological context, Af1 from human hosts in Ghana and Eu1 from cattle in the USA. Furthermore, two of the four members of Af2 recovered from the same epidemiological scenario cluster together. A higher dataset would thus be necessary to disclose deeper associations.
Fig. 4.
Maximum-likelihood tree (GTR) of
based on a pan-genome matrix of gene presence/absence. The coloured squares represent the
clonal complexes: purple for European 1 (Eu1), red for European 2 (Eu2), blue for European 3 (Eu3), orange for African 1 (Af1) and green for African 2 (Af2). The tree is rooted and drawn to scale with branch lengths measured as the number of substitutions per site.
Maximum-likelihood tree (GTR) of
based on a pan-genome matrix of gene presence/absence. The coloured squares represent the
clonal complexes: purple for European 1 (Eu1), red for European 2 (Eu2), blue for European 3 (Eu3), orange for African 1 (Af1) and green for African 2 (Af2). The tree is rooted and drawn to scale with branch lengths measured as the number of substitutions per site.
Impact of the usage of draft genomes on the ability to properly identify CDSs
To further check how the usage of draft genomes might impact the ability to properly identify CDSs, a complementary analysis focused at contig, CDSs and gene gain and loss events was performed. Globally (n=70), the genomes with higher CDS prediction were Mb1714 (165 contigs), Mb2313 (119 contigs), Mb0565 (two contigs) and Mb0030 and Mb31150 (both complete genomes). If considering gene-gain events, the higher number of events (over 100 events) was registered by Mb0030, Mb0003, Mb33150 and MbAF2122/8/97 that are complete genomes; while when considering gene loss, the genomes with higher number of inferred events (over 400) were Mb4117155 (99 contigs) and Mb31150 and MbAF2122/97 (both complete genomes) (data not shown).When considering the group of draft assembly genomes (n=61), these presented on average (mean and standard deviation) 113±44 contigs, 4271±11 CDSs, 25±11 gain events and 159±76 loss events. The correlation between the number of contigs per genome and predicted CDSs was weak (r=0.29). Moreover, the correlation between the number of contigs per genome and gain and loss events was also weak, with values of 0.04 and 0.21, respectively (data not shown). Our analysis did not point to a positive or negative effect exerted by the number of contigs on the prediction of CDS and gain/loss events, thus supporting that the results described are due to intrinsic strain variability and that our methodological approach to infer core- and pan-genome is robust. The inclusion of truncated genes found at the ends of contigs of open genomes was avoided by adopting a 75% of sequence alignment coverage and 95% of sequence identity criteria, as well as the inclusion of all clusters independently of size, contributed to the robustness of the analyses described herein. Moreover, genes were only considered as taxon specific (clusters of size 1) if sequences had no blast hit with any other protein (E-value 1.0E-05) Finally, when considering SNP identification, a map to reference methodology was implemented, so concerns related to truncated genes did not apply.
Virulence-related genes analyses in
from a multi-host TB system in Portugal
The topology of the maximum-likelihood phylogenetic tree based on the SNP alignment containing 1816 polymorphic positions led to the definition of five
SNP clades (Table 1 and Fig. 5). The phylogenetic tree clearly separates SNP clades A–D (n=33), with genomes included in the Eu2 clonal complex, from clade E (n=9) that encompasses genomes not assigned to any of the described clonal complexes (Fig. 5). The majority of SNPs (87.1%) was located in coding regions and a prioritized analysis concerning genes involved, or putatively involved, in virulence pathways was performed, leading to the screening of 421 genes (Table S3).
Fig. 5.
Maximum-likelihood phylogenetic tree of 42
strains, from the dataset from Portugal, using as input an alignment containing all validated SNPs (n=1816). The tree is drawn to scale, with branch lengths measured as the number of substitutions per site and identified with different colours according with the SNP clade. The presence/absence SNP profile in virulence-related genes where polymorphisms were recorded (n=125) is represented. Virulence-associated genes are grouped into categories identified by the colour scheme (figure generated with iTOLv6, https://itol.embl.de/).
Maximum-likelihood phylogenetic tree of 42
strains, from the dataset from Portugal, using as input an alignment containing all validated SNPs (n=1816). The tree is drawn to scale, with branch lengths measured as the number of substitutions per site and identified with different colours according with the SNP clade. The presence/absence SNP profile in virulence-related genes where polymorphisms were recorded (n=125) is represented. Virulence-associated genes are grouped into categories identified by the colour scheme (figure generated with iTOLv6, https://itol.embl.de/).A total of 296 genes were conserved across all analysed strains, while the remaining 125 had at least one polymorphism in one strain, in a total of 194 SNPs. The majority (n=118; 60.8%) was non-synonymous (NS) (Figs. 5 and 6a and Table S4). With the exception of the ‘lipoproteins’ category, all the remaining 12 functional groups comprise polymorphic genes (Fig. 5). Ninety-three genes harbour NS SNPs, 55 had neutral polymorphisms and a group of 23 genes hold both types of SNPs. Considering the NS SNPs, the premature introduction of stop codons in four cases and the loss of a stop codon (Table S5) in one situation could be registered.(a) Circular representation of genome position of genes with SNPs (outer circle) and genes with NS SNPs (inner circle). (b) Circular representation of genome position of genes with monomorphic NS SNPs. From the inner circle to outer circle: clade C, clade D, clade A–D and clade E (figure generated with ShinyCircos, https://venyao.xyz/shinyCircos/).Globally, there were SNPs distributed randomly across
strains and SNPs common to strains clustered within the same clade (Table 2, Figs. 5 and 6b). Moreover, an analysis considering host species and geographic location as clustering criteria was performed, but it did not stratify any specific monomorphic SNP positions.
Table 2.
Identification of SNPs in virulence-associated genes, discriminated by SNP clade
*Polymorphic positions leading to non-synonymous alteration present only in the clade-members and common to all.
NS, non-synonymous.;
Identification of SNPs in virulence-associated genes, discriminated by SNP cladeSNP cladeTotal SNP sites in virulence genesTotal NS SNP sitesClade-monomorphic NS SNP sites*Genes with monomorphic NS SNPsA (n=14)6741––B (n=10)7245––C (n=3)34233Mb2612c (n=1) mesTb (n=1) lpqF (n=1)D (n=6)37212PhoY2 (n=1) treYa (n=1)A to D (n=33)142868esxL (n=1) pknE (n=1) mycp1 (n=1) mbtB (n=1)Mb1542c (n=1) treX (n=1)Mb3685 (n=1) fadA4 (n=1)E (n=9)583526pks2 (n=1) pks5 (n=2) pks12 (n=1) aceA (n=1), Mb3110 (n=1) Mb3114 (n=1) erp (n=1) dppA (n=1) Mb0182 (n=1) mce4C (n=1) mce4B (n=1) mce2B (n=1) lprL (n=1) esxK (n=1) esxM (n=1) esxN (n=2)Mb1974 (n=1)Mb3678c (n=1) mbtB (n=1) clpB (n=1) bpoC (n=1)Mb0393 (n=1) vapc31 (n=1)Mb2515c (n=1) eccC3 (n=1)*Polymorphic positions leading to non-synonymous alteration present only in the clade-members and common to all.NS, non-synonymous.;On average, each
strain harboured 35 SNPs in virulence-related genes, with the strains grouping in clade E presenting higher medium values (average SNP number, 53). Each
strain harboured on average 22 NS and 13 synonymous SNPs. The average value of SNPs per virulence-associated gene was greater than one (Fig. S6). The genes pks12 (Mb2074c), with 15 SNPs, pks5 (Mb1554c), with seven SNP, esxK (Mb1229), with six SNPs, and esxN (Mb1821) with five SNPs, were the genes that registered the highest number of SNPs (Fig. S6). After normalizing the number of polymorphisms per gene length, the virulence-related genes that exhibit the highest rate of mutation are all included in the esx family: esxK, esxN, esxM (Mb1820), esxL (Mb1230) and esxO (Mb2375c).Genomic data integration with disease severity was completed using lesion type information. The isolates from SNP clade E are associated to lesion types II and III only, corresponding to more severe disease, while for the remaining SNP clades the lesions types from I to III were registered. Therefore, the SNP clade with the highest number of SNPs in virulence genes is associated with more severe disease outcome. An increase in the whole dataset could contribute to refining this integrative analysis.Synonymous substitutions are described in the literature as ‘silent’, once they apparently do not lead to changes in protein sequence. However, some works have suggested that synonymous SNPs can have functional effects, such as decreased mRNA stability and translation [53, 54]. Accordingly, NS polymorphisms are expected to contribute to phenotypic variation due to the putative impacts in protein function, so a meticulous inspection of families with genes harbouring higher rates of mutation, as well as the analysis of clade monomorphic NS SNPs were performed (Table 2 and Fig. 6b). Detailed information regarding gene function and the consequences to mycobacteria of gene deletion are provided in the following sections.
Fig. 6.
(a) Circular representation of genome position of genes with SNPs (outer circle) and genes with NS SNPs (inner circle). (b) Circular representation of genome position of genes with monomorphic NS SNPs. From the inner circle to outer circle: clade C, clade D, clade A–D and clade E (figure generated with ShinyCircos, https://venyao.xyz/shinyCircos/).
The esx gene family exhibits the highest mutation rates
The esx genes are members of a specialized secretion system (ESX system) known to enable the transport of antigenic subtracts through the cell wall. This secretion system is associated with pathogenicity and host–pathogen interaction, helping mycobacteria to resist or evade the host immune response [55]. Currently, five ESX systems (ESX-1, ESX-2, ESX-3, ESX-4 and ESX-5) have been described in
and
, with ESX-1, ESX-3 and ESX-5 being required for virulence [55]. The ESX secretion apparatus is constituted by ESX-conserved components (encoded by the ecc and mycP genes), ESX-type-specific secretion-associated proteins (encoded by the esp genes), and secreted/exported proteins (encoded by the esx genes and/or PE and PPE proteins) [55, 56]. This
dataset harboured SNPs across the three systems (ESX-1, ESX-3 and ESX-5), as well as in genes included in the three components of the ESX secretion apparatus. Five esx genes (esxK, esxM, esxL, esxN and esxO) had NS SNPs, as did eight conserved components (eccA1, eccE1, eccA3, eccB3, eccD3, eccE3 and mycP1, mycP3), and five secreted associated components (espG3, espB, espJ, espK, espL) (Table S5). Some of these NS SNPs, in esx and conserved component genes were monomorphic to clade E and clades A to D (Table 2).Polymorphisms in the esxK, esxM and esxN genes, in a total of four positions, are monomorphic for clade E (Table 2 and Table S5). These genes have been proposed to be excreted via the ESX-5 system [55, 56]. Data published so far associate this system to nutrient uptake essential for mycobacteria viability, and to the secretion of PE and PPE proteins of diverse subtypes, with evidences for interference in the pathogenic potential of mycobacteria [55, 57]. Experimental infection works with
deletion mutants for five esx-5-encoded PE and PPE proteins showed attenuated virulence in mouse models of animal infection [58]. Further functional works would be needed to explore if the combined phenotype caused by the polymorphisms in these genes impacts
virulence.The mutations in esxL (n=1) and mycP1 (n=1) are monomorphic to the remaining 33 strains (clades A to D), while other polymorphisms are singletons or restricted to two strains (Table S3 and S5). The protein EsxL forms a complex with EsxK and their excretion via ESX-5 might be dependent on that association [56]. A reduced intracellular growth in human macrophages was reported in M. tuberculosis esxKL deletion mutant, when comparing with the wild-type strain, indicating the involvement in mycobacteria intracellular growth [59]). Since esxL is highly expressed during active infection in human lungs [59], polymorphisms might be prejudicial to the complex formation and consequently interfere with the normal excretion pattern.The gene mycP1 is part of ESX-1 complex, and previous reports suggest a dual role in substrate processing and the regulation of ESX-1 secretion [60]. The mutagenesis of its active site leads to an increase in secretion, while the
deletion mutant reveals the loss of ESX-1 secretion function [60]. The ESX-1 exported substrates [i.e. proteins ESAT-6, culture filtrate protein 10 (CFP-10) and EspA] are described as major virulence determinants of
during infection, playing a key role in the escape from host defence systems [61].
Multiple NS monomorphic SNPs were associated to SNP clades, in a total of three for clade C, two for clade D and 27 for clade E (Table 2 and Table S5). Moreover, if considering clades A to D as a unique group, it was also possible to identify eight monomorphic NS SNPs (Table 2 and STable S5). Detailed information concerning these genes affected by SNPs and non-SNPs is provided in the Supplementary Material (Text S1).
Final remarks and future work
The study of genome composition and structure of pathogen populations might provide important insights into their evolutionary dynamics and pathogenic potential. In recent years, many advances concerning
research have been achieved. However, the narrow WGS works available have been mainly applied for transmission inferences in livestock, livestock-wildlife and livestock-human systems. The present work complements previous studies [47, 48]. A dataset composed of 70 strains was used to get insights into the evolutionary forces exerted upon
, with pan-genome and functional classification approaches and detailed analyses concerning gene gain/loss and mutations in genes associated with virulence traits. Not only the genomic composition and diversity of
was detailed but also clade-specific virulence polymorphism signatures were identified in field strains from Portugal.Contrary to the paradigmatically view on conservative MTC genomes, this work evidences striking genomic diversity among
field strains, providing insights into the core- and accessory-genomes and presenting evidences for an open and dynamic pan-genome. Since
is exposed to dynamic hostile conditions inside the host and also moves across hosts, the accessory component of the genome, especially the cloud moiety that is almost exclusively involved in gene-gain events, could contribute to selective advantages.The fluctuations in genome composition, with gene gain and loss, together with specific polymorphisms, may act as evolutionary driving forces, facilitating diversification and adaptation. However, the hypothetical advantages conferred by the genome dynamics within some hosts or particular ecosystems (i.e. specific region) needs extra investigation.We further examined the genetic diversity of virulence-related genes, especially those expected to contribute to phenotypic variation, in the selected group of
from Portugal. Differential pathogenicity abilities might influence the maintenance of strains with specific traits within the ecosystem, due to their capacity to establish a persistent infection in the host or to increased transmissibility. About 30% of the virulence-associated genes that were screened presented a SNP. Plus, it was possible to identify strain-specific NS SNPs, as well as clade monomorphic NS SNPs for clades C, D, E and A–D. Moreover, an association between clade E, that exhibits the higher average number of SNPs in virulence-related genes and also the highest number of monomorphic NS SNPs, along with worse clinical outcome (using lesions of type II and III as proxy for disease severity), needs to be further comprehended by means of a larger dataset. Moreover, a larger and more complete dataset would also allow robust assessment of the relationship between identified SNP-based clades and various metadata, such as host species or geographic location, which would be quite important for understanding the bigger picture of
evolution and adaptation to different hosts. Mutation frequencies varied significantly along nucleotide sequences related with virulence, such that they often concentrated at certain positions. This is the case of the esx family that included the genes with the highest mutational rate. Special attention should also be drawn to the pks family and mce operon that together with the esx gene family hold the higher number of NS SNPs for clade E members. The relevance of these polymorphisms to
evolution and TB disease are likely to be discovered in the future as more post-genomic analyses develop.Dedicated polymorphism analysis of selected genome regions, such as those related with virulence, may inform not only on the intrinsic properties of the mutation process but might also reflect structural and functional features to be explored under several perspectives. This information combined with experimental functional works may link genotype to phenotype, clarifying the consequences of such SNPs and the underlying adaptive advantages for mycobacteria, potentially opening new avenues for control strategies in the long term. Most functional works focus on
, so it is crucial to investigate the functional consequences of
genotypic diversity associated with virulence factors, particularly some specific genes/gene families, and the subsequent clinical presentation of animal TB. This future research might help to explain the infecting success of certain strains in multi-host systems and might be crucial to redefine control actions potentially opening new possibilities, such as wildlife vaccination. Currently,
BCG remains the single licensed anti-tuberculosis vaccine against human TB. In the animal TB field, the implementation of wildlife vaccination has been explored through experimental trials in badger (Meles meles) and wild boar (Sus scrofa) in Europe, brushtail possum (Trichosurus vulpecula) in New Zealand, or white-tailed deer (Odocoileus virginianus) in the USA using
BCG or heat-inactivated
[62-64]. A deeper knowledge of the genetic diversity and peculiarities of
field strains, including evolutionary dynamics, virulence and pathogenicity, is crucial to customize effective
vaccines or other approaches for animal TB control in each ecosystem.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.Click here for additional data file.
Authors: Francesca Capon; Michael H Allen; Mahreen Ameen; A David Burden; David Tillman; Jonathan N Barker; Richard C Trembath Journal: Hum Mol Genet Date: 2004-08-27 Impact factor: 6.150
Authors: M Carmen Garcia Pelayo; Swapna Uplekar; Andrew Keniry; Pablo Mendoza Lopez; Thierry Garnier; Javier Nunez Garcia; Laura Boschiroli; Xiangmei Zhou; Julian Parkhill; Noel Smith; R Glyn Hewinson; Stewart T Cole; Stephen V Gordon Journal: Infect Immun Date: 2009-03-16 Impact factor: 3.441
Authors: Andrew J Page; Carla A Cummins; Martin Hunt; Vanessa K Wong; Sandra Reuter; Matthew T G Holden; Maria Fookes; Daniel Falush; Jacqueline A Keane; Julian Parkhill Journal: Bioinformatics Date: 2015-07-20 Impact factor: 6.937