Literature DB >> 34714230

The open pan-genome architecture and virulence landscape of Mycobacterium bovis.

Ana C Reis^1,2, Mónica V Cunha^1,2.

Abstract

Entities: Chemical

Keywords: Mycobacterium bovis; WGS; comparative genomics; core-genome; pan-genome; virulence

Mesh：

Year: 2021 PMID： 34714230 PMCID： PMC8627212 DOI： 10.1099/mgen.0.000664

Source DB: PubMed Journal: Microb Genom ISSN： 2057-5858

× No keyword cloud information.

Data Summary

The newly sequenced data included in this work are deposited under the following Biosample accession numbers: SAMN17004141–SAMN17004143, SAMN17004145– SAMN17004174, SAMN17004176–SAMN17004184 and under the Bioproject accession number PRJNA682618 at a public domain server in the National Centre for Biotechnology Information (NCBI) SRA database. All supporting data, code and protocols have been provided within the article or through supplementary data files which can be found at https://doi.org/10.6084/m9.figshare.15025965. , the causative agent of animal tuberculosis (TB), is proposed to evolve as a conservative strictly clonal microorganism. In the current work, a dataset composed of 70 representing the entire clonal diversity was used to assess genome structure, composition and evolution. Plus, in a strain subset (n=42) circulating in an endemic multi-host TB scenario, the polymorphisms in virulence-related genes were scrutinized. The obtained results support an open and dynamic pan-genome architecture, with a small core-genome component when compared to the accessory-genome component. The diversity of the accessory-genome component might be of crucial importance for adaptive capacity. The complementary ancestral reconstruction evidenced an overall gene loss supplanting gene gain, with the majority of gene gains occurring in terminal branches. Globally, we provide insights into the evolutionary history of on a population level and clarify clade-specific virulence signatures of field isolates. Contrary to the paradigmatically view on conservative MTC genomes, this work evidences striking genomic diversity among field strains.

Introduction

The complex (MTC) encompasses 12 closely related mycobacteria that cause tuberculosis (TB) in a variety of mammalian hosts [1]. is the most well-known member, as it is a prominent human pathogen, accounting for approximately 10 million new cases and 1.4 million deaths in 2018 [2]. is the ecotype that infects the broadest range of host species. It causes animal TB primarily in cattle, but also in other livestock species [3]. Some wildlife species are pathogen reservoirs and infection amplifiers, helping to sustain TB across a variety of ecosystems in different geographic locations [4]. Well-known epidemiological scenarios referred to in the literature include the Iberian Peninsula, United Kingdom, Australia, New Zealand, South Africa and USA [5-7]. Furthermore, has been associated with zoonotic infection in humans, especially in low-income countries and where the boundaries between humans and animals fade away [8]. The MTC members present a high level of similarity at the nucleotide level, with recent comparative genomics works evidencing an average nucleotide identity >99% [9, 10], however they exhibit variable host tropisms, phenotypes and degrees of pathogenicity [11]. Over the years, genomic diversity analyses have been fundamental to the understanding of evolutionary processes involving MTC as a whole, starting with the study of specific genomic regions, such as the Regions of Difference (RD), proceeding to the phylogeny of specific members, leading to the identification of different lineages and clonal complexes [12-16]. Current knowledge supports that is subdivided into five main clonal complexes: European 1 (Eu1), European 2 (Eu2), European 3 (Eu3), African 1 (Af1) and African 2 (Af2) [12-16]. The clonal complexes are identified by the absence of specific spacers in their spoligotyping profile, specific deletions and single nucleotide polymorphisms (SNPs) in precise genes. Their geographic distribution exhibits differences, with Af1 being predominant in West-Central Africa, Afr2 in East Africa, Eu2 in Iberian Peninsula and Brazil, Eu3 in France and Italy, while Eu1 evidence a worldwide distribution [16-18]. Nevertheless, genome composition and functional diversity, as well as recognition of core- and accessory-gene components, are still overlooked topics for the ecotypes infecting animals. The pan-genome consists of the entire gene repertoire of a given species, and its analysis is a very informative approach, enabling stratification into the core-genome, containing genes shared by all strains, and the accessory-genome, covering dispensable genes that are not present in all strains, as well as unique genes, also known as singletons, that are particular to specific strains [19]. The core-genome determines the backbone of all strains of a given species and contains genes necessary to the most basic intraspecific biological processes, being responsible for the major phenotype [19]. The accessory-genome adds genetic diversity to the species and may involve other metabolic pathways that crucially assure adaptation to different ecological niches, colonization of new hosts or other functions that are beneficial over other species [19, 20]. The differentiation of genetic variants, grounded by multi-genome studies enabled by whole-genome sequencing, has become central in the field of disease epidemiology, contributing to gaining insights in host–pathogen interactions, virulence, transmission and resistance determinants [21, 22]. In MTC, the combined effect of deletion/insertion events together with the presence of SNPs across members contribute to differences in host ranges, pathogenicity, disease progression and outcome. In fact, previous works performing comparative genome, proteome or secretome analyses of H37Rv, and variant BCG (bacillus Calmette-Guérin) revealed important differences among these ecotypes [20, 23, 24]. For instance, work by Pelayo and collaborators identified about 700 SNPs that could differentiate BCG and strains [25]. Proteins or genes related to virulence are involved in a diversity of pathways and biological processes, including the interaction with host cells and interference with immune responses; survival inside the aggressive microenvironment of host macrophages; infection recalcitrance and recrudescence; response to environmental stress conditions; and auxiliary antibiotic resistance [26]. The genotypic and phenotypic differences revealed by the comparative works performed so far encourage further studies on genomes, which may be insightful regarding the biological and virulence traits of a pathogen that is far from being understood or controlled. The identification of virulence-associated genes, their biological function and the relationship between virulence-specific and epidemiological-specific signatures are important to understand TB infection, persistence and improve control. In this study, we aimed to gain insights into the genome structure, composition and evolution by means of a large comparative study at the genome level. For this purpose, 70 . representing the entire clonal complex diversity were submitted to a comparative genome analysis through a pan-genomic approach, evaluating the genome composition dynamics and performing an ancestral reconstruction of gene gain and loss. Moreover, in a group of field isolates circulating in an endemic multi-host TB scenario, the polymorphisms in virulence-related genes were ascertained and mutational rates inferred. Globally, we show that molecular approaches to the pan-genome are helpful to gain insights into the evolutionary history of on a population level and to clarify clade-specific virulence signatures of field isolates. We then combine this information with phenotypic advantage inferences that may shed light on the evolutionary forces exerted upon this pathogen at the interface with the host in particular epidemiological scenarios.

Methodology

M. bovis dataset

A globally diverse dataset was collected and used in this study. Sixteen genomes were downloaded as complete/draft genome assemblies up to a maximum of ten scaffolds from NCBI (National Centre for Biotechnology Information); 12 were selected as representatives of the entire clonal complex diversity [27] and recovered as Illumina fastq files deposited at SRA (Sequence Read Archive); 42 . genomes were newly sequenced genomes (Table 1 and Table S1, available in the online version of the article, details below). BCG (bacillus Calmette-Guérin) was excluded from the NCBI search and AF2122/97, used as reference genome, was included in this dataset. A detailed account of the dataset, including accession code, sequencing strategy, status of completeness and assembly parameters is presented in Table S1.

Table 1.

Characteristics of genomes used in this work

M. bovis ID	Clonal complex*	Country	Year	Host species	SNP clade†	Lesion type	Reference	Type of sequence
Mb0220	w/o CC	Portugal	2003	Cattle	E	na	[65]	Newly sequenced
Mb0261	Eu2	Portugal	2006	Red deer	D	Type I	[65]	Newly sequenced
Mb0601	Eu2	Portugal	2007	Cattle	B	na	[65]	Newly sequenced
Mb0769	Eu2	Portugal	2008	Cattle	A	Type II	[65]	Newly sequenced
Mb0783	Eu2	Portugal	2008	Wild boar	A	Type II	[65]	Newly sequenced
Mb0865	Eu2	Portugal	2008	Cattle	C	Type I	[65]	Newly sequenced
Mb0891	Eu2	Portugal	2009	Red deer	A	Type III	[65]	Newly sequenced
Mb0893	Eu2	Portugal	2008	Wild boar	B	Type II	[65]	Newly sequenced
Mb1317	Eu2	Portugal	2010	Cattle	D	Type II	[65]	Newly sequenced
Mb1339	Eu2	Portugal	2010	Cattle	A	Type II	[65]	Newly sequenced
Mb1458	w/o CC	Portugal	2010	Wild boar	E	Type II	[65]	Newly sequenced
Mb1480	w/o CC	Portugal	2010	Cattle	E	Type II	[65]	Newly sequenced
Mb1654	Eu2	Portugal	2011	Cattle	A	Type II	[65]	Newly sequenced
Mb1670	w/o CC	Portugal	2011	Red deer	E	Type III	[65]	Newly sequenced
Mb1711	Eu2	Portugal	2011	Red deer	A	Type III	[65]	Newly sequenced
Mb1712	Eu2	Portugal	2011	Red deer	C	Type III	[65]	Newly sequenced
Mb1714	Eu2	Portugal	2011	Cattle	D	Type II	[65]	Newly sequenced
Mb1744	w/o CC	Portugal	2012	Wild boar	E	Type II	[65]	Newly sequenced
Mb1746	Eu2	Portugal	2012	Red deer	B	Type I	[65]	Newly sequenced
Mb1758	Eu2	Portugal	2012	Cattle	A	Type II	[65]	Newly sequenced
Mb1769	Eu2	Portugal	2012	Wild boar	C	Type II	[65]	Newly sequenced
Mb1785	Eu2	Portugal	2012	Red deer	B	Type II	[65]	Newly sequenced
Mb1789	Eu2	Portugal	2012	Cattle	A	na	[65]	Newly sequenced
Mb1841	Eu2	Portugal	2012	Cattle	A	Type II	[65]	Newly sequenced
Mb1870	Eu2	Portugal	2012	Wild boar	A	Type II	[65]	Newly sequenced
Mb1915	Eu2	Portugal	2013	Red deer	D	Type III	[65]	Newly sequenced
Mb1948	w/o CC	Portugal	2013	Red deer	E	Type III	[65]	Newly sequenced
Mb1960	Eu2	Portugal	2013	Red deer	A	Type I	[65]	Newly sequenced
Mb2026	Eu2	Portugal	2013	Cattle	B	Type II	[65]	Newly sequenced
Mb2043	Eu2	Portugal	2013	Red deer	A	Type III	[65]	Newly sequenced
Mb2067	Eu2	Portugal	2013	Wild boar	B	Type II	[65]	Newly sequenced
Mb2206	Eu2	Portugal	2014	Cattle	B	Type III	[65]	Newly sequenced
Mb2235	w/o CC	Portugal	2014	Red deer	E	Type III	[65]	Newly sequenced
Mb2277	w/o CC	Portugal	2014	Red deer	E	Type II	[65]	Newly sequenced
Mb2300	Eu2	Portugal	2014	Wild boar	B	Type II	[65]	Newly sequenced
Mb2310	Eu2	Portugal	2015	Red deer	A	Type I	[65]	Newly sequenced
Mb2313	Eu2	Portugal	2015	Wild boar	D	Type II	[65]	Newly sequenced
Mb2325	Eu2	Portugal	2015	Red deer	D	Type III	[65]	Newly sequenced
Mb2328	Eu2	Portugal	2015	Red deer	A	Type III	[65]	Newly sequenced
Mb2347	w/o CC	Portugal	2015	Wild boar	E	Type II	[65]	Newly sequenced
Mb2395	Eu2	Portugal	2015	Wild boar	B	Type II	[65]	Newly sequenced
Mb2397	Eu2	Portugal	2015	Wild boar	B	Type III	[65]	Newly sequenced
Mb502499	Af1	Ghana	na	Human	na	na	[27, 66]	SRA deposited
Mb502526	Af1	Ghana	na	Human	na	na	[27, 66]	SRA deposited
Mb1203064	Af1	Ghana	na	Human	na	na	[27, 66]	SRA deposited
Mb4117155	Af2	France	na	Wild boar	na	na	[27, 67]	SRA deposited
Mb1791710	Af2	Tanzania	na	Chimpanzee	na	na	[27, 68]	SRA deposited
Mb1791712	Af2	Tanzania	na	Chimpanzee	na	na	[27, 68]	SRA deposited
Mb1792006	Eu1	USA	2006	Cattle	na	na	[68]	SRA deposited
Mb1792127	Eu1	USA	2008	Cattle	na	na	[68]	SRA deposited
Mb1792361	Eu1	USA	2013	Cattle	na	na	[68]	SRA deposited
Mb7240242	Eu1	USA	2016	Cattle	na	na	[68]	SRA deposited
Mb7240415	Eu1	USA	2014	Cattle	na	na	[68]	SRA deposited
Mb1791984	Eu1	USA	2005	Cattle	na	na	[68]	SRA deposited
MBE1	w/o CC	Egypt	2014	Cattle	na	na	na	assemble/draft genomes NCBI
MBE3	w/o CC	Egypt	2014	Cattle	na	na	na	assemble/draft genomes NCBI
MBE4	w/o CC	Egypt	2014	Cattle	na	na	na	assemble/draft genomes NCBI
MBE10	w/o CC	Egypt	2015	Cattle	na	na	na	assemble/draft genomes NCBI
Mb0077	w/o CC	Canada	2006	Elk	na	na	na	assemble/draft genomes NCBI
Mb0565	w/o CC	Canada	2011	Cattle	na	na	na	assemble/draft genomes NCBI
BMR25	w/o CC	Canada	1985	Bison	na	na	na	assemble/draft genomes NCBI
Mb3601	Eu3	France	2014	Cattle	na	na	[16]	assemble/draft genomes NCBI
Mb0476	Eu2	Canada	2002	Cattle	na	na	na	assemble/draft genomes NCBI
MbSP38	Eu2	Brazil	2010	Cattle	na	na	[69]	assemble/draft genomes NCBI
Mb1595	w/o CC	Korea	2012	Cattle	na	na	[70]	assemble/draft genomes NCBI
Mb0030	w/o CC	China	na	na	na	na	[71]	assemble/draft genomes NCBI
Mb0001	Eu2	Brazil	2015	Tapirus terrestris	na	na	na	assemble/draft genomes NCBI
Mb0003	w/o CC	India	1986	Cattle	na	na	na	assemble/draft genomes NCBI
Mb31150	Af2	Uganda	na	Chimpanzee	na	na	[27, 72]	assemble/draft genomes NCBI
MbAF2122/97	Eu1	UK	1997	Cattle	na	na	na	assemble/draft genomes NCBI

*Eu1: European 1, Eu2: European 2, Eu3: European 3, Af1: African 1, Af2: African 2, and w/o CC: without clonal complex.

†SNP clade according with [66].

na, non-available information;

Characteristics of genomes used in this work ID Clonal complex* Country Year Host species SNP clade† Lesion type Reference Type of sequence Mb0220 w/o CC Portugal 2003 Cattle E na Newly sequenced Mb0261 Eu2 Portugal 2006 Red deer D Type I Newly sequenced Mb0601 Eu2 Portugal 2007 Cattle B na Newly sequenced Mb0769 Eu2 Portugal 2008 Cattle A Type II Newly sequenced Mb0783 Eu2 Portugal 2008 Wild boar A Type II Newly sequenced Mb0865 Eu2 Portugal 2008 Cattle C Type I Newly sequenced Mb0891 Eu2 Portugal 2009 Red deer A Type III Newly sequenced Mb0893 Eu2 Portugal 2008 Wild boar B Type II Newly sequenced Mb1317 Eu2 Portugal 2010 Cattle D Type II Newly sequenced Mb1339 Eu2 Portugal 2010 Cattle A Type II Newly sequenced Mb1458 w/o CC Portugal 2010 Wild boar E Type II Newly sequenced Mb1480 w/o CC Portugal 2010 Cattle E Type II Newly sequenced Mb1654 Eu2 Portugal 2011 Cattle A Type II Newly sequenced Mb1670 w/o CC Portugal 2011 Red deer E Type III Newly sequenced Mb1711 Eu2 Portugal 2011 Red deer A Type III Newly sequenced Mb1712 Eu2 Portugal 2011 Red deer C Type III Newly sequenced Mb1714 Eu2 Portugal 2011 Cattle D Type II Newly sequenced Mb1744 w/o CC Portugal 2012 Wild boar E Type II Newly sequenced Mb1746 Eu2 Portugal 2012 Red deer B Type I Newly sequenced Mb1758 Eu2 Portugal 2012 Cattle A Type II Newly sequenced Mb1769 Eu2 Portugal 2012 Wild boar C Type II Newly sequenced Mb1785 Eu2 Portugal 2012 Red deer B Type II Newly sequenced Mb1789 Eu2 Portugal 2012 Cattle A na Newly sequenced Mb1841 Eu2 Portugal 2012 Cattle A Type II Newly sequenced Mb1870 Eu2 Portugal 2012 Wild boar A Type II Newly sequenced Mb1915 Eu2 Portugal 2013 Red deer D Type III Newly sequenced Mb1948 w/o CC Portugal 2013 Red deer E Type III Newly sequenced Mb1960 Eu2 Portugal 2013 Red deer A Type I Newly sequenced Mb2026 Eu2 Portugal 2013 Cattle B Type II Newly sequenced Mb2043 Eu2 Portugal 2013 Red deer A Type III Newly sequenced Mb2067 Eu2 Portugal 2013 Wild boar B Type II Newly sequenced Mb2206 Eu2 Portugal 2014 Cattle B Type III Newly sequenced Mb2235 w/o CC Portugal 2014 Red deer E Type III Newly sequenced Mb2277 w/o CC Portugal 2014 Red deer E Type II Newly sequenced Mb2300 Eu2 Portugal 2014 Wild boar B Type II Newly sequenced Mb2310 Eu2 Portugal 2015 Red deer A Type I Newly sequenced Mb2313 Eu2 Portugal 2015 Wild boar D Type II Newly sequenced Mb2325 Eu2 Portugal 2015 Red deer D Type III Newly sequenced Mb2328 Eu2 Portugal 2015 Red deer A Type III Newly sequenced Mb2347 w/o CC Portugal 2015 Wild boar E Type II Newly sequenced Mb2395 Eu2 Portugal 2015 Wild boar B Type II Newly sequenced Mb2397 Eu2 Portugal 2015 Wild boar B Type III Newly sequenced Mb502499 Af1 Ghana na Human na na [27, 66] SRA deposited Mb502526 Af1 Ghana na Human na na [27, 66] SRA deposited Mb1203064 Af1 Ghana na Human na na [27, 66] SRA deposited Mb4117155 Af2 France na Wild boar na na [27, 67] SRA deposited Mb1791710 Af2 Tanzania na Chimpanzee na na [27, 68] SRA deposited Mb1791712 Af2 Tanzania na Chimpanzee na na [27, 68] SRA deposited Mb1792006 Eu1 USA 2006 Cattle na na SRA deposited Mb1792127 Eu1 USA 2008 Cattle na na SRA deposited Mb1792361 Eu1 USA 2013 Cattle na na SRA deposited Mb7240242 Eu1 USA 2016 Cattle na na SRA deposited Mb7240415 Eu1 USA 2014 Cattle na na SRA deposited Mb1791984 Eu1 USA 2005 Cattle na na SRA deposited MBE1 w/o CC Egypt 2014 Cattle na na na assemble/draft genomes NCBI MBE3 w/o CC Egypt 2014 Cattle na na na assemble/draft genomes NCBI MBE4 w/o CC Egypt 2014 Cattle na na na assemble/draft genomes NCBI MBE10 w/o CC Egypt 2015 Cattle na na na assemble/draft genomes NCBI Mb0077 w/o CC Canada 2006 Elk na na na assemble/draft genomes NCBI Mb0565 w/o CC Canada 2011 Cattle na na na assemble/draft genomes NCBI BMR25 w/o CC Canada 1985 Bison na na na assemble/draft genomes NCBI Mb3601 Eu3 France 2014 Cattle na na [16] assemble/draft genomes NCBI Mb0476 Eu2 Canada 2002 Cattle na na na assemble/draft genomes NCBI MbSP38 Eu2 Brazil 2010 Cattle na na [69] assemble/draft genomes NCBI Mb1595 w/o CC Korea 2012 Cattle na na [70] assemble/draft genomes NCBI Mb0030 w/o CC China na na na na [71] assemble/draft genomes NCBI Mb0001 Eu2 Brazil 2015 Tapirus terrestris na na na assemble/draft genomes NCBI Mb0003 w/o CC India 1986 Cattle na na na assemble/draft genomes NCBI Mb31150 Af2 Uganda na Chimpanzee na na [27, 72] assemble/draft genomes NCBI MbAF2122/97 Eu1 UK 1997 Cattle na na na assemble/draft genomes NCBI *Eu1: European 1, Eu2: European 2, Eu3: European 3, Af1: African 1, Af2: African 2, and w/o CC: without clonal complex. †SNP clade according with [66]. na, non-available information; Altogether, the entire dataset includes isolates recovered from eight host species distributed by 12 countries, collected between 1985 and 2016 and classified into the clonal complexes previously described [16, 17].

Newly sequenced genomes (dataset from Portugal)

The 42 newly sequenced were entirely recovered from a well-characterized animal TB hotspot in Portugal and included isolates recovered from cattle (n=14), red deer (n=16) and wild boar (n=12) over a 12 year period (circa 70% were recovered over a 5 year period) (Table 1 and Table S1) [28]. The selected host species are described as the most relevant to the maintenance of in a multi-host system in Portugal [7]. Cultures used in this work were derived from frozen stocks, prepared after a single in vitro passage of original archived samples. Re-culture and DNA extraction procedures were performed at a biosecurity level three facility. The frozen culture stocks of selected were regrown in Middlebrook 7H9 (Difco) medium supplemented with 5% sodium pyruvate and 10% ADS enrichment (50 g albumin, 20 g glucose, 8.5 g sodium chloride in 1 l water), at 37 °C, being monitored regularly for growth. Culture pellet was resuspended in 500 µl PBS, inactivated by heating at 99 °C during 30 min, and then stored at −20 °C until WGS. WGS was performed using the Illumina Genome Analyser, according to the manufacturer’s specifications with the paired-end module attachment. The MiSeq (2×250 pb) technology was implemented for 40 samples at United States Department of Agriculture (USDA, USA), while HiSeq (2×150 pb) technology was employed for the remaining two (Eurofins Genomics, Germany). Information concerning the microscopic characterization of TB-compatible lesions was available for 39 infected animals (Table 1). Lesions were histologically classified into three categories according with disease progression: type I – ‘caseous granulomatous lesions’; type II – ‘caseous granulomatous lesions with calcification’; and type III – ‘thick encapsulated caseous granulomatous lesions with calcification’.

Bioinformatics workflow

The bioinformatics workflow followed in this work is detailed in supplementary material (Fig. S1) and comprises de novo assembly and map to reference strategies.

Genome assembly and annotation

Fifty-four genomes (42 newly sequenced and 12 fastq files recovered from SRA) were de novo assembled with Unicycler pipeline, currently available at https://github.com/rrwick/Unicycler [29]. After the reads’ quality analysis (FastQC version 0.11.7; https://github.com/s-andrews/FastQC), and whenever necessary after clean-up with Trimmomatic version 0.36 (http://www.usadellab.org/cms/?page=trimmomatic), genomes were assembled with SPAdes optimizer [29], and subjected to post-assembly optimization with Pilon version 1.18 [30]. A conservative bridging mode was applied, since it avoids misassembly, and the k-mer size was searched and selected between 20–95% of read length. Considering the reads’ size, contigs of less than 300 bp were removed, and a 20-fold cut-off was stabilized following SPAdes guidelines [31]. The quast pipeline (http://quast.sourceforge.net/quast.html), which promotes the remapping of contigs with AF2122/97 reference genome (NCBI accession number LT708304.1), was used to address the quality of the de novo assemblies (Table S1). All the complete genomes/draft assemblies were annotated with the Rapid Annotation using Subsystem Technology (RAST) toolkit, available at webserver https://rast.nmpdr.org [32] using the taxon 1765 (‘ ’) and the genetic code 11.

Genome mapping and SNP variant calling

The fastq files of the dataset from Portugal were further submitted to a map to sequence approach. Reads were aligned to the AF2122/97 reference genome (LT708304.1), using BWA and Samtools [33, 34] through the vSNP pipeline, currently available at https://githubcom/USDA-VS/vSNP. Base quality-score recalibration, SNP and indel (insertion or deletion) discovery were applied across all isolates using standard filtering parameters or variant quality-score recalibration according to Genome Analysis Toolkit (GATK)’s Best Practices recommendations [35, 36]. Results were filtered using a minimum SAMtools quality score of 150 and AC=2. A variant was filtered out if: (i) it was supported by less than 20 reads, (ii) it was found in a frequency of less than 0.9, (iii) it was registered in at least one strain but also with a gap in at least another strain, in order to avoid mapping errors and false SNPs. Integrated Genomics Viewer (IGV) version 2.4.19 (http://software.broadinstitute.org/software/igv/) was employed to visually validate SNPs and positions with mapping issues or alignment problems. SNPs that fell within Proline-Glutamate (PE) and Proline-Proline Glutamate (PPE) genes were filtered from the analysis, as well as indels. The sequencing statistics details are provided in Table S2. The phylogenetic analysis was conducted with mega (Molecular Evolutionary Genetics Analysis) software version 7.0 (https://www.megasoftware.net/) using a concatenated sequence of 1816 bp validated and polymorphic SNPs by applying the maximum-likelihood method with 1000 bootstrap inferences and the GTR (General Time Reversible) model.

Pan-genome analyses

Get-homologues pipeline (https://github.com/eead-csic-compbio/get_homologues) [37] was used to compute core- and pan-genome from the input of complete genomes/draft de novo assemblies genome sequences. Briefly, the source GenBank-formatted files were passed to get_homologues.pl script and instructed to compute nucleotide and protein clusters, by running the Clusters of Orthologous Groups triangles (COGtriangles) and OrthoMCL (Markov Clustering of orthologs, OMCL) algorithms, as previously detailed [38]. PFAM-domain scanning was enabled (-D flag). The directories holding the results from the different clustering algorithms were then passed to the auxiliary script compare_clusters.pl to compute either the consensus core-genome or pan-genome. Since a large number of draft genomes was used in this work, a 75% of sequence alignment coverage and 95% of sequence identity was imposed and all clusters independent of size were included. Moreover, genes were only considered as taxon specific (clusters of size 1) if sequences had no blast hit with any other protein (E-value 1.0E-05) [39]. This strategy was implemented to minimize the inclusion of truncated genes found at the ends of contigs of open genomes. Truncated genes found at the ends of contigs of incomplete genomes are thus not included in any cluster because they have very weak or no reciprocal blast hit. The auxiliary script parse_pangenome_matrix.pl was employed to perform a pan-genome matrix and to classify genes in four categories: core, soft-core, cloud and shell. The pan- and core-genome curves and the new genes were depicted using the Pan-Genome Profile Analyse Tool (PanGP) software, using distance-guide (DG) algorithm and default parameters. According with Tettelin’s review on pan-genome research [40], Heaps' law model was employed to fit the pan-genome size of strains. The auxiliary script roary_plots.py of Roary pipeline (https://github.com/sanger-pathogens/Roary/blob/master/README.md) [41] was implemented to generate a graphical output, using the pan-genome phylogenetic tree and pan-genome matrix as input files. The scoary pipeline (https://github.com/AdmiralenOla/Scoary) [42] was implemented to infer gene enrichment. The pan-genome matrix, pan-genome phylogenetic tree and a trait file grouping genomes by clonal complex were used as input files. Default criteria were applied to run the pipeline script. The Benjamini–Hochberg adjusted P-value (P-value<0.05) was used to validate the genes most over- and under-represented in the specific groups.

Maximum-likelihood pan-genome phylogenetic tree from the pan-genome matrix

The auxiliary script, estimate_pangenome_phylogenies.sh, included in get-phylomarkers pipeline (https://github.com/vinuesa/get_phylomarkers) [43] was used to perform a customized maximum-likelihood (ML) phylogenetic analysis, based on the pan-genome matrix files, returned by compare_clusters.pl script (options ‘-t 0 –m’) from the get-homologues pipeline. Maximum-likelihood phylogenetic tree was constructed by IQ-tree with ultrafast bootstrap approximation (UFBoot) of 1000 replicates. The GTR (General Time Reversible)+FO (optimized base frequencies by maximum likelihood)+R3 (unconstrained base frequencies) model was pointed as the most suitable evolutionary model.

Genome evolution – gene gain and loss dynamics

To investigate genome evolution and dynamics across the genealogy of M. bovis, a ML birth-and-death model was implemented with the software Count version 10.04 (http://www.iro.umontreal.ca/~csuros/gene_content/count.html) [44], using as input files the ML pan-genome phylogeny and the pan-genome matrix of gene presence/absence. Briefly, first the model was optimized by maximizing the likelihood of the data using a gain–loss-duplication model with a Poisson distribution for gene family size at the root, and assuming a gamma-distributed rate variation across gene families, and a variable gain/loss ratio across branches. The remaining parameters were settled as default. The rate parameters were optimized after 100 rounds with a convergence threshold of 0.01. Then, after the optimization of the branch-specific parameters of the model, the ancestral reconstruction was performed using the dollo parsimony method, and gain and loss events were inferred. For each branch i, the gene turnover Ti was defined as Ti= Gi/Li with Gi denoting the branch rate of gene gains and Li, the rate of gene losses.

Functional annotation analysis

For functional annotation analysis, COG categories were associated to the predicted genome CDS via eggnog-mapper v2 webtool (http://eggnog-mapper.embl.de/) [45]. Each CDS was classified into one or more of the 26 upper COG categories. An auto- adjust of the taxonomic scope was enabled and an e-value cut-off of 1e-3 was settled for orthologue detection. The other parameters were settled as default.

Polymorphic analyses of virulence-related genes

To scrutinize virulence traits, 421 genes described to be essential for virulence of MTC members, as reviewed by Forrellad and collaborators [26] and/or listed in Mycobrowser database (https://mycobrowser.epfl.ch/), were screened for the presence of SNPs (Table S3). Moreover, due to their biological function in signal transduction, gene regulation and fitness, all genes described to be part of the mce family, Lux-R family or to codify toxin and antitoxin proteins were included (Table S3). Globally, the genes with a demonstrated, or putative role, in mycobacteria virulence were grouped into 13 categories according to their function and molecular features: lipid and fatty acid metabolism (n=38), cell-envelope proteins (n=14), mce family (n=33), lipoproteins (n=6), secretion systems (n=54), defence mechanisms (n=33), protein kinases (n=3), proteases (n=7), metal-transported proteins (n=7), genes and metabolism regulation expression (n=23), Lux-R family (n=6), toxin-antitoxin family (n=120), and other virulent proteins (n=77) (Table S3). The SNPs were analysed according to two criteria: (1) number of identified polymorphisms per gene and (2) monomorphic polymorphisms specific to phylogenetic clades. Phylogenetic analyses were performed with concatenated alignments of SNPs in the whole set of virulence genes (n=195) and concatenated alignments of non-synonymous (NS) SNPs harboured by virulence genes (n=119). The maximum-likelihood trees were conducted in mega (v 7.0) with 1000 bootstrap inferences and GTR model.

Results and discussion

Pan-genome analysis shows an open pan-genome

Bacterial pan-genome estimation requires de novo assemblies into complete or partial genomes with large contigs, providing an unprecedented opportunity to reveal differences when comparing with reference-guided genome assemblies. A total of 54 . isolates were de novo assembled using the same sequencing principles and techniques, while the remaining 16 were downloaded from NCBI as complete genome or draft assemblies (Table S1). The seven draft assemblies derived from the public domain were sequenced using the same sequencing technology and assembled through the implementation of two analytical methodologies (Table S1). Excluding the complete genomes (n=9), an average of 113 contigs per draft genome was obtained (Table S1). The 70 complete genome/draft assemblies ranged between 4.17 and 4.36 Mbp, revealing similar genome sizes, and presenting an average GC content of 65.5% (Table S1). The variation in genome size might be the foundation of morphological and physiological differences between strains, contributing to genomic diversity. The predicted number of protein-coding sequences (CDS) pool ranged from 4199 to 4435 (Table S1). The CDS were further clustered in homologue gene clusters and used to estimate the core- and pan-genomes. The genome components were defined as core genome (i.e. genes present in all genomes), soft-core genome [i.e. genes present in at least 95% of the genomes (n=66)], cloud genome (i.e. rare genes present only in one/two genomes) and shell genome (i.e. the remaining genes present in several genomes). Shell and cloud formed the accessory component of the genome that contributes to species diversity and may confer selective advantages, such as niche adaptation, antibiotic resistance or colonization of a new host [46]. In this dataset, the core-genome is composed of 2736 gene clusters, the soft-core by 3708, the shell genome by 1341 and the cloud genome by 2656 (Figs 1 and S2). The accessory component includes 3997 gene clusters, 2656 of which are present only in one or two genomes. On average, each genome has 42 cloud gene clusters, accounting for an average of less than 1% of the total genome-coding capacity. The bimodal, asymmetrical U-shape distribution of gene clusters that is common to many bacterial species indicates that most genes are either rare or to be found in almost all genomes, leading to a smaller proportion of genes at intermediate frequencies (Fig. 2). Our results are in agreement with previous works performed with , and combined datasets that evidenced that the accessory component comprise a higher percentage of pan-genome, from circa 21% in a reported dataset with 13 . to 75% in a combined / dataset [47-49]. Moreover, studies performed with mycobacterial genomes have also shown that mycobacteria have a small core-genome component when compared to the accessory-genome component [50, 51].

Fig. 1.

Soft-core and accessory-genome composition of 70 . A maximum-likelihood phylogeny built based on pan-genome presence/absence matrix is represented on the left, and a heat map showing gene presence (dark blue) or absence (light blue) is on the right. The relative proportions of softcore and accessory components per genome is relatively constant, with each genome presenting on average (mean and standard deviation) 87.5 0.9% of soft-core genes, 11.5 0.73% of shell genes, and 1.0 1.12% of cloud genes (Fig. S3). Our results are in line with the work of Yang and collaborators, which analysed 13 . strains in which circa 91% of the genome was occupied by core genes [47]; and with the work of Bolotin and Hershberg that associated lower frequencies (circa 2–6%) of cloud genes to bacterial clonal species, including MTC [52].

Core- and accessory-genome size evolution

Core- and pan-genome evolution size estimates were analysed for exponential decay and with growth models (Fig. 2a). According with the obtained growth model, seems to have an open pan-genome, since there is no distinctly sharp plateau, suggesting that pan-genome size could continue to increase if the number of genomes would also increase (Fig. 2a). Moreover, on average, 32 genes were added by each genome (Fig. 2b). These results are in agreement with the few published works on this topic regarding [47, 48] and [47], that also pointed to an open pan-genome. Open pan-genomes are associated with species that colonize multiple environments and that have multiple ways of exchanging genetic material [46]. is known, among MTC members, to have the widest host range, with livestock and wildlife reservoirs across different ecosystems in different countries. Plus, the capacity of escaping the host immune system is documented, as well as survival in several in vitro and in vivo environmental conditions. This high biological flexibility together with the evidence for an open genome suggests that the accessory-genome of could play an essential role in the adaptive responses triggered by this pathogen.

Fig. 2.

Prediction of pan-genome size. (a) Genome size evolution curves of the pan-genome (blue) and core-genome (green). The blue boxes denote the pan-genome size for each genome for comparison, whereas the green boxes show the same comparison for core-genome size. The curve is the least-squares fit of the power law for the average values. (b) Curve (orange) for the number of new genes with an increase in the number of genomes.

Dynamics of genome composition

To reconstruct the ancestral dynamics of genome composition across genealogy, a birth-and-death likelihood model was applied. The analysis revealed a higher loss rate (overall mean 0.011) when comparing with gain rate (overall mean 0.005), with terminal branches evidencing higher values. The gene turnover criteria exhibit a value inferior to one, with the exception of two terminal branches (Mb1595 and Mb0077) and two internal nodes, where the rates of gene gain and loss were estimated to be equal. The ancestral reconstruction evidenced an overall gene loss (overall mean 147) over gene gain (overall mean 25), with the majority of gene gains occurring in terminal branches (Fig. 3). The terminal branches presented higher mean values for both events, when comparing with internal nodes (mean gene gain/loss of 34/169 and 15/123, for terminal and internal nodes, respectively). Globally, longer terminal branches show more events, as is evident in the branches leading to Mb31150 or Mb0030 (Fig. 3). Branches leading to genomes included in the clonal complexes Eu1 and Af2 reveal high mean values of gene gain and gene loss, when comparing with the genomes of clonal complexes Eu2 and Af1. However, an expanded dataset with more representatives would be necessary to confirm this trend.

Fig. 3.

genome composition dynamics. Estimated events of genome content evolution are mapped onto the ML phylogeny (GTR) estimated with the pan-genome matrix of gene presence/absence. Numbers, after labelling, at terminal nodes represent the estimated numbers of gene gain and loss events (Gains/Losses). The two horizontal bars are histograms that represent the relative proportions of soft core (blue), shell (green) and cloud (orange) genes involved in gain (left) and loss (right) events inferred for the terminal branches. The tree is rooted and drawn to scale with branch lengths measured as the number of substitutions per site. The shell component is the most variable, being involved in both gene gain, in internal nodes, and loss events in both terminal and internal nodes, while gains of cloud genes and loss of soft core occur mainly at terminal nodes. We highlight that the loss of strain-specific genes at terminal branches cannot be detected, therefore the loss of cloud genes is underestimated with this methodology. The variance within the cloud component might be of crucial importance for adaptive capacity. Discrete subpopulations can thus be perceived based on both core- and accessory-genome components. exhibits a puzzling discrepancy with high genome nucleotide identity, but high divergence in genome composition. The aforementioned results indicate that (i) diversity within genome composition arises mostly through gene loss, and (ii) recent events that are not shared by many strains need to be considered when investigating diversity and evolution. The observed distributions of gain/loss events can either reflect the genealogical process, implying a minimal number of transition steps, or can result from an intricate series of gene gains and losses that likely reflect local selective pressures, such as a tipping point in the ecological niche.

Pan-genome functional annotation

The functions of the predicted CDS encoded in the genomes were inferred from the protein sequences annotated in the COG database. On average, the majority of CDS with COG (85.4%) were attributed to at least one category, while, in a minority of cases, two to five categories could be assigned (‘Several categories’ group) (Table S1 and Fig. S4). The most represented categories were ‘transcription’ (total average=258), ‘lipid metabolism and transport’ (n=242), ‘energy production and conversion’ (n=214) and ‘unknown function’ (n=876) (Fig. S4). The variability within the composition of each COG category was estimated to be between zero (‘RNA processing and modification’) and 24 (‘Energy production and conversion’), when comparing the composition of all . A gene-enrichment analysis was implemented with the objective of identifying over- and under-represented genes when grouping genomes by clonal complex. After statistical support correction, a higher number of genes was identified in genomes of clonal complex Eu2 (number of genes=171), followed by genomes of Eu1 (n=97), Af1 (n=18) and finally Af2 (n=10). The genes suggested to be over- and under-represented in these clonal complexes were functionally annotated and the categories registering a higher number of occurrences were ‘Unknown function’ and ‘Cell motility’ (Fig. S5). The difference in genome composition among strains might be driven by selective pressure exerted upon this pathogen, by the environment or the host immune system, however it would be necessary to have tested a larger sample size to further explore this hypothesis.

Pan-genome-based phylogenetic analysis

The phylogenetic analysis based on the presence/absence of pan-genome provided a complementary perspective on the evolutionary relationships among strains (Fig. 4). Genomes included in the Af1 (n=3) cluster within the same branch, as well as members of Eu1 (n=6). These were recovered from the same epidemiological context, Af1 from human hosts in Ghana and Eu1 from cattle in the USA. Furthermore, two of the four members of Af2 recovered from the same epidemiological scenario cluster together. A higher dataset would thus be necessary to disclose deeper associations.

Fig. 4.

Maximum-likelihood tree (GTR) of based on a pan-genome matrix of gene presence/absence. The coloured squares represent the clonal complexes: purple for European 1 (Eu1), red for European 2 (Eu2), blue for European 3 (Eu3), orange for African 1 (Af1) and green for African 2 (Af2). The tree is rooted and drawn to scale with branch lengths measured as the number of substitutions per site.

Impact of the usage of draft genomes on the ability to properly identify CDSs

To further check how the usage of draft genomes might impact the ability to properly identify CDSs, a complementary analysis focused at contig, CDSs and gene gain and loss events was performed. Globally (n=70), the genomes with higher CDS prediction were Mb1714 (165 contigs), Mb2313 (119 contigs), Mb0565 (two contigs) and Mb0030 and Mb31150 (both complete genomes). If considering gene-gain events, the higher number of events (over 100 events) was registered by Mb0030, Mb0003, Mb33150 and MbAF2122/8/97 that are complete genomes; while when considering gene loss, the genomes with higher number of inferred events (over 400) were Mb4117155 (99 contigs) and Mb31150 and MbAF2122/97 (both complete genomes) (data not shown). When considering the group of draft assembly genomes (n=61), these presented on average (mean and standard deviation) 113±44 contigs, 4271±11 CDSs, 25±11 gain events and 159±76 loss events. The correlation between the number of contigs per genome and predicted CDSs was weak (r=0.29). Moreover, the correlation between the number of contigs per genome and gain and loss events was also weak, with values of 0.04 and 0.21, respectively (data not shown). Our analysis did not point to a positive or negative effect exerted by the number of contigs on the prediction of CDS and gain/loss events, thus supporting that the results described are due to intrinsic strain variability and that our methodological approach to infer core- and pan-genome is robust. The inclusion of truncated genes found at the ends of contigs of open genomes was avoided by adopting a 75% of sequence alignment coverage and 95% of sequence identity criteria, as well as the inclusion of all clusters independently of size, contributed to the robustness of the analyses described herein. Moreover, genes were only considered as taxon specific (clusters of size 1) if sequences had no blast hit with any other protein (E-value 1.0E-05) Finally, when considering SNP identification, a map to reference methodology was implemented, so concerns related to truncated genes did not apply.

Virulence-related genes analyses in from a multi-host TB system in Portugal

The topology of the maximum-likelihood phylogenetic tree based on the SNP alignment containing 1816 polymorphic positions led to the definition of five SNP clades (Table 1 and Fig. 5). The phylogenetic tree clearly separates SNP clades A–D (n=33), with genomes included in the Eu2 clonal complex, from clade E (n=9) that encompasses genomes not assigned to any of the described clonal complexes (Fig. 5). The majority of SNPs (87.1%) was located in coding regions and a prioritized analysis concerning genes involved, or putatively involved, in virulence pathways was performed, leading to the screening of 421 genes (Table S3).

Fig. 5.

Maximum-likelihood phylogenetic tree of 42 strains, from the dataset from Portugal, using as input an alignment containing all validated SNPs (n=1816). The tree is drawn to scale, with branch lengths measured as the number of substitutions per site and identified with different colours according with the SNP clade. The presence/absence SNP profile in virulence-related genes where polymorphisms were recorded (n=125) is represented. Virulence-associated genes are grouped into categories identified by the colour scheme (figure generated with iTOLv6, https://itol.embl.de/). A total of 296 genes were conserved across all analysed strains, while the remaining 125 had at least one polymorphism in one strain, in a total of 194 SNPs. The majority (n=118; 60.8%) was non-synonymous (NS) (Figs. 5 and 6a and Table S4). With the exception of the ‘lipoproteins’ category, all the remaining 12 functional groups comprise polymorphic genes (Fig. 5). Ninety-three genes harbour NS SNPs, 55 had neutral polymorphisms and a group of 23 genes hold both types of SNPs. Considering the NS SNPs, the premature introduction of stop codons in four cases and the loss of a stop codon (Table S5) in one situation could be registered. (a) Circular representation of genome position of genes with SNPs (outer circle) and genes with NS SNPs (inner circle). (b) Circular representation of genome position of genes with monomorphic NS SNPs. From the inner circle to outer circle: clade C, clade D, clade A–D and clade E (figure generated with ShinyCircos, https://venyao.xyz/shinyCircos/). Globally, there were SNPs distributed randomly across strains and SNPs common to strains clustered within the same clade (Table 2, Figs. 5 and 6b). Moreover, an analysis considering host species and geographic location as clustering criteria was performed, but it did not stratify any specific monomorphic SNP positions.

Table 2.

Identification of SNPs in virulence-associated genes, discriminated by SNP clade

SNP clade	Total SNP sites in virulence genes	Total NS SNP sites	Clade-monomorphic NS SNP sites*	Genes with monomorphic NS SNPs
A (n=14)	67	41	–	–
B (n=10)	72	45	–	–
C (n=3)	34	23	3	Mb2612c (n=1) mesTb (n=1) lpqF (n=1)
D (n=6)	37	21	2	PhoY2 (n=1) treYa (n=1)
A to D (n=33)	142	86	8	esxL (n=1) pknE (n=1) mycp1 (n=1) mbtB (n=1) Mb1542c (n=1) treX (n=1) Mb3685 (n=1) fadA4 (n=1)
E (n=9)	58	35	26	pks2 (n=1) pks5 (n=2) pks12 (n=1) aceA (n=1), Mb3110 (n=1) Mb3114 (n=1) erp (n=1) dppA (n=1) Mb0182 (n=1) mce4C (n=1) mce4B (n=1) mce2B (n=1) lprL (n=1) esxK (n=1) esxM (n=1) esxN (n=2) Mb1974 (n=1) Mb3678c (n=1) mbtB (n=1) clpB (n=1) bpoC (n=1) Mb0393 (n=1) vapc31 (n=1) Mb2515c (n=1) eccC3 (n=1)

*Polymorphic positions leading to non-synonymous alteration present only in the clade-members and common to all.

NS, non-synonymous.;

Identification of SNPs in virulence-associated genes, discriminated by SNP clade SNP clade Total SNP sites in virulence genes Total NS SNP sites Clade-monomorphic NS SNP sites* Genes with monomorphic NS SNPs A (n=14) 67 41 – – B (n=10) 72 45 – – C (n=3) 34 23 3 Mb2612c (n=1) mesTb (n=1) lpqF (n=1) D (n=6) 37 21 2 PhoY2 (n=1) treYa (n=1) A to D (n=33) 142 86 8 esxL (n=1) pknE (n=1) mycp1 (n=1) mbtB (n=1) Mb1542c (n=1) treX (n=1) Mb3685 (n=1) fadA4 (n=1) E (n=9) 58 35 26 pks2 (n=1) pks5 (n=2) pks12 (n=1) aceA (n=1), Mb3110 (n=1) Mb3114 (n=1) erp (n=1) dppA (n=1) Mb0182 (n=1) mce4C (n=1) mce4B (n=1) mce2B (n=1) lprL (n=1) esxK (n=1) esxM (n=1) esxN (n=2) Mb1974 (n=1) Mb3678c (n=1) mbtB (n=1) clpB (n=1) bpoC (n=1) Mb0393 (n=1) vapc31 (n=1) Mb2515c (n=1) eccC3 (n=1) *Polymorphic positions leading to non-synonymous alteration present only in the clade-members and common to all. NS, non-synonymous.; On average, each strain harboured 35 SNPs in virulence-related genes, with the strains grouping in clade E presenting higher medium values (average SNP number, 53). Each strain harboured on average 22 NS and 13 synonymous SNPs. The average value of SNPs per virulence-associated gene was greater than one (Fig. S6). The genes pks12 (Mb2074c), with 15 SNPs, pks5 (Mb1554c), with seven SNP, esxK (Mb1229), with six SNPs, and esxN (Mb1821) with five SNPs, were the genes that registered the highest number of SNPs (Fig. S6). After normalizing the number of polymorphisms per gene length, the virulence-related genes that exhibit the highest rate of mutation are all included in the esx family: esxK, esxN, esxM (Mb1820), esxL (Mb1230) and esxO (Mb2375c). Genomic data integration with disease severity was completed using lesion type information. The isolates from SNP clade E are associated to lesion types II and III only, corresponding to more severe disease, while for the remaining SNP clades the lesions types from I to III were registered. Therefore, the SNP clade with the highest number of SNPs in virulence genes is associated with more severe disease outcome. An increase in the whole dataset could contribute to refining this integrative analysis. Synonymous substitutions are described in the literature as ‘silent’, once they apparently do not lead to changes in protein sequence. However, some works have suggested that synonymous SNPs can have functional effects, such as decreased mRNA stability and translation [53, 54]. Accordingly, NS polymorphisms are expected to contribute to phenotypic variation due to the putative impacts in protein function, so a meticulous inspection of families with genes harbouring higher rates of mutation, as well as the analysis of clade monomorphic NS SNPs were performed (Table 2 and Fig. 6b). Detailed information regarding gene function and the consequences to mycobacteria of gene deletion are provided in the following sections.

Fig. 6.

(a) Circular representation of genome position of genes with SNPs (outer circle) and genes with NS SNPs (inner circle). (b) Circular representation of genome position of genes with monomorphic NS SNPs. From the inner circle to outer circle: clade C, clade D, clade A–D and clade E (figure generated with ShinyCircos, https://venyao.xyz/shinyCircos/).

The esx gene family exhibits the highest mutation rates

The esx genes are members of a specialized secretion system (ESX system) known to enable the transport of antigenic subtracts through the cell wall. This secretion system is associated with pathogenicity and host–pathogen interaction, helping mycobacteria to resist or evade the host immune response [55]. Currently, five ESX systems (ESX-1, ESX-2, ESX-3, ESX-4 and ESX-5) have been described in and , with ESX-1, ESX-3 and ESX-5 being required for virulence [55]. The ESX secretion apparatus is constituted by ESX-conserved components (encoded by the ecc and mycP genes), ESX-type-specific secretion-associated proteins (encoded by the esp genes), and secreted/exported proteins (encoded by the esx genes and/or PE and PPE proteins) [55, 56]. This dataset harboured SNPs across the three systems (ESX-1, ESX-3 and ESX-5), as well as in genes included in the three components of the ESX secretion apparatus. Five esx genes (esxK, esxM, esxL, esxN and esxO) had NS SNPs, as did eight conserved components (eccA1, eccE1, eccA3, eccB3, eccD3, eccE3 and mycP1, mycP3), and five secreted associated components (espG3, espB, espJ, espK, espL) (Table S5). Some of these NS SNPs, in esx and conserved component genes were monomorphic to clade E and clades A to D (Table 2). Polymorphisms in the esxK, esxM and esxN genes, in a total of four positions, are monomorphic for clade E (Table 2 and Table S5). These genes have been proposed to be excreted via the ESX-5 system [55, 56]. Data published so far associate this system to nutrient uptake essential for mycobacteria viability, and to the secretion of PE and PPE proteins of diverse subtypes, with evidences for interference in the pathogenic potential of mycobacteria [55, 57]. Experimental infection works with deletion mutants for five esx-5-encoded PE and PPE proteins showed attenuated virulence in mouse models of animal infection [58]. Further functional works would be needed to explore if the combined phenotype caused by the polymorphisms in these genes impacts virulence. The mutations in esxL (n=1) and mycP1 (n=1) are monomorphic to the remaining 33 strains (clades A to D), while other polymorphisms are singletons or restricted to two strains (Table S3 and S5). The protein EsxL forms a complex with EsxK and their excretion via ESX-5 might be dependent on that association [56]. A reduced intracellular growth in human macrophages was reported in M. tuberculosis esxKL deletion mutant, when comparing with the wild-type strain, indicating the involvement in mycobacteria intracellular growth [59]). Since esxL is highly expressed during active infection in human lungs [59], polymorphisms might be prejudicial to the complex formation and consequently interfere with the normal excretion pattern. The gene mycP1 is part of ESX-1 complex, and previous reports suggest a dual role in substrate processing and the regulation of ESX-1 secretion [60]. The mutagenesis of its active site leads to an increase in secretion, while the deletion mutant reveals the loss of ESX-1 secretion function [60]. The ESX-1 exported substrates [i.e. proteins ESAT-6, culture filtrate protein 10 (CFP-10) and EspA] are described as major virulence determinants of during infection, playing a key role in the escape from host defence systems [61].

Clade-specific non-synonymous (NS) monomorphic SNPs

Multiple NS monomorphic SNPs were associated to SNP clades, in a total of three for clade C, two for clade D and 27 for clade E (Table 2 and Table S5). Moreover, if considering clades A to D as a unique group, it was also possible to identify eight monomorphic NS SNPs (Table 2 and STable S5). Detailed information concerning these genes affected by SNPs and non-SNPs is provided in the Supplementary Material (Text S1).

Final remarks and future work

The study of genome composition and structure of pathogen populations might provide important insights into their evolutionary dynamics and pathogenic potential. In recent years, many advances concerning research have been achieved. However, the narrow WGS works available have been mainly applied for transmission inferences in livestock, livestock-wildlife and livestock-human systems. The present work complements previous studies [47, 48]. A dataset composed of 70 strains was used to get insights into the evolutionary forces exerted upon , with pan-genome and functional classification approaches and detailed analyses concerning gene gain/loss and mutations in genes associated with virulence traits. Not only the genomic composition and diversity of was detailed but also clade-specific virulence polymorphism signatures were identified in field strains from Portugal. Contrary to the paradigmatically view on conservative MTC genomes, this work evidences striking genomic diversity among field strains, providing insights into the core- and accessory-genomes and presenting evidences for an open and dynamic pan-genome. Since is exposed to dynamic hostile conditions inside the host and also moves across hosts, the accessory component of the genome, especially the cloud moiety that is almost exclusively involved in gene-gain events, could contribute to selective advantages. The fluctuations in genome composition, with gene gain and loss, together with specific polymorphisms, may act as evolutionary driving forces, facilitating diversification and adaptation. However, the hypothetical advantages conferred by the genome dynamics within some hosts or particular ecosystems (i.e. specific region) needs extra investigation. We further examined the genetic diversity of virulence-related genes, especially those expected to contribute to phenotypic variation, in the selected group of from Portugal. Differential pathogenicity abilities might influence the maintenance of strains with specific traits within the ecosystem, due to their capacity to establish a persistent infection in the host or to increased transmissibility. About 30% of the virulence-associated genes that were screened presented a SNP. Plus, it was possible to identify strain-specific NS SNPs, as well as clade monomorphic NS SNPs for clades C, D, E and A–D. Moreover, an association between clade E, that exhibits the higher average number of SNPs in virulence-related genes and also the highest number of monomorphic NS SNPs, along with worse clinical outcome (using lesions of type II and III as proxy for disease severity), needs to be further comprehended by means of a larger dataset. Moreover, a larger and more complete dataset would also allow robust assessment of the relationship between identified SNP-based clades and various metadata, such as host species or geographic location, which would be quite important for understanding the bigger picture of evolution and adaptation to different hosts. Mutation frequencies varied significantly along nucleotide sequences related with virulence, such that they often concentrated at certain positions. This is the case of the esx family that included the genes with the highest mutational rate. Special attention should also be drawn to the pks family and mce operon that together with the esx gene family hold the higher number of NS SNPs for clade E members. The relevance of these polymorphisms to evolution and TB disease are likely to be discovered in the future as more post-genomic analyses develop. Dedicated polymorphism analysis of selected genome regions, such as those related with virulence, may inform not only on the intrinsic properties of the mutation process but might also reflect structural and functional features to be explored under several perspectives. This information combined with experimental functional works may link genotype to phenotype, clarifying the consequences of such SNPs and the underlying adaptive advantages for mycobacteria, potentially opening new avenues for control strategies in the long term. Most functional works focus on , so it is crucial to investigate the functional consequences of genotypic diversity associated with virulence factors, particularly some specific genes/gene families, and the subsequent clinical presentation of animal TB. This future research might help to explain the infecting success of certain strains in multi-host systems and might be crucial to redefine control actions potentially opening new possibilities, such as wildlife vaccination. Currently, BCG remains the single licensed anti-tuberculosis vaccine against human TB. In the animal TB field, the implementation of wildlife vaccination has been explored through experimental trials in badger (Meles meles) and wild boar (Sus scrofa) in Europe, brushtail possum (Trichosurus vulpecula) in New Zealand, or white-tailed deer (Odocoileus virginianus) in the USA using BCG or heat-inactivated [62-64]. A deeper knowledge of the genetic diversity and peculiarities of field strains, including evolutionary dynamics, virulence and pathogenicity, is crucial to customize effective vaccines or other approaches for animal TB control in each ecosystem. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.

71 in total

1. Disruption of the ESX-5 system of Mycobacterium tuberculosis causes loss of PPE protein secretion, reduction of cell wall integrity and strong attenuation.

Authors: Daria Bottai; Mariagrazia Di Luca; Laleh Majlessi; Wafa Frigui; Roxane Simeone; Fadel Sayes; Wilbert Bitter; Michael J Brennan; Claude Leclerc; Giovanna Batoni; Mario Campa; Roland Brosch; Semih Esin
Journal: Mol Microbiol Date: 2012-02-20 Impact factor: 3.501

2. A synonymous SNP of the corneodesmosin gene leads to increased mRNA stability and demonstrates association with psoriasis across diverse ethnic groups.

Authors: Francesca Capon; Michael H Allen; Mahreen Ameen; A David Burden; David Tillman; Jonathan N Barker; Richard C Trembath
Journal: Hum Mol Genet Date: 2004-08-27 Impact factor: 6.150

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

4. A comprehensive survey of single nucleotide polymorphisms (SNPs) across Mycobacterium bovis strains and M. bovis BCG vaccine strains refines the genealogy and defines a minimal set of SNPs that separate virulent M. bovis strains and M. bovis BCG strains.

Authors: M Carmen Garcia Pelayo; Swapna Uplekar; Andrew Keniry; Pablo Mendoza Lopez; Thierry Garnier; Javier Nunez Garcia; Laura Boschiroli; Xiangmei Zhou; Julian Parkhill; Noel Smith; R Glyn Hewinson; Stewart T Cole; Stephen V Gordon
Journal: Infect Immun Date: 2009-03-16 Impact factor: 3.441

5. Pan and Core Genome Analysis of 183 Mycobacterium tuberculosis Strains Revealed a High Inter-Species Diversity among the Human Adapted Strains.

Authors: Fathiah Zakham; Tarja Sironen; Olli Vapalahti; Ravi Kant
Journal: Antibiotics (Basel) Date: 2021-04-28

10. Computational genomics-proteomics and Phylogeny analysis of twenty one mycobacterial genomes (Tuberculosis & non Tuberculosis strains).

Authors: Fathiah Zakham; Othmane Aouane; David Ussery; Abdelaziz Benjouad; Moulay Mustapha Ennaji
Journal: Microb Inform Exp Date: 2012-08-28

1 in total

1. A critical evaluation of Mycobacterium bovis pangenomics, with reference to its utility in outbreak investigation.

Authors: Kristina M Ceres; Michael J Stanhope; Yrjö T Gröhn
Journal: Microb Genom Date: 2022-06