
The bacterial pangenome as a new tool for analysing pathogenic bacteria.

L Rouli¹, V Merhej¹, P-E Fournier¹, D Raoult¹.

Abstract

The bacterial pangenome was introduced in 2005 and, in recent years, has been the subject of many studies. Thanks to progress in next-generation sequencing methods, the pangenome can be divided into two parts, the core (common to the studied strains) and the accessory genome, offering a large panel of uses. In this review, we have presented the analysis methods, the pangenome composition and its application as a study of lifestyle. We have also shown that the pangenome may be used as a new tool for redefining the pathogenic species. We applied this to the Escherichia coli and Shigella species, which have been a subject of controversy regarding their taxonomic and pathogenic position.


Keywords: Bacteria; bioinformatics tools; comparative genomics; pangenome; pathogenic species

Year: 2015 PMID: 26442149 PMCID: PMC4552756 DOI: 10.1016/j.nmni.2015.06.005

Source DB: PubMed Journal: New Microbes New Infect ISSN: 2052-2975

Definitions

Introduction

The emergence and development of next-generation sequencing (NGS) technologies have made genome reconstruction much easier and more accessible than before [1]. For bacteria, it is now easy to obtain and study more than ten genomes from the same species, which provides enough data to perform comparisons [1]. Studies of pangenomes arose from these new possibilities and reflect the notion of bacterial species more accurately [2,3]. It is strongly recommended to include several genomes in such studies to better capture the diversity and composition of the global gene repertoire [1]. The term was coined in 2005 by Tettelin et al. [4], who gave a clear definition of the pangenome. The pangenome (or supragenome) [5,6] has been defined as the whole gene repertoire of a study group [1,2,7]. In this review, we study the notion of the bacterial pangenome, a rapidly growing field (Box 1). A pangenome can be defined as open or closed (infinite or finite [9]), according to the species' capacity to acquire exogenous DNA [2,10], to have the machinery to use it [10] and to possess a large amount of rRNA [10]. The open or closed nature of a pangenome is linked to the lifestyle of the studied bacterial species [2,7,10]. Allopatric species, which live isolated in a narrow niche, usually have a small genome and a closed pangenome because they are specialized [7,10]. Sympatric species, which live in a community, tend to have large genomes, an open pangenome, a high rate of horizontal gene transfer and several ribosomal operons [7,10]. These studies raise the question of how a bacterial species should be defined. In contrast to eukaryotes, where the term has been defined relative to fertility [11], the case of prokaryotes seems to be much more difficult [12].
Usually, bacterial species are defined on the basis of gene content, phenotypic characters, the nature of the ecological niche and the 16S ribosomal RNA sequences [11,12]. A bacterial species has been defined as ‘a group of isolates which are characterized by a certain degree of phenotypic resemblance, by a level of 70% DNA-DNA hybridization and by an identity of at least 97% between 16S rRNA sequences’ [13,14] or, more recently, 98.7% [3]. This definition can be applied globally to obligate pathogens that live in a very narrow ecological niche [11] (allopatry) [13]; there is no real reason for the different adaptation and diversification processes to result in rather coherent groups at the phenotypic and genetic level so that they can be designated as a species. Some authors have defined species based on genomic coherence [13], isolate proximity [12] and the ecological niche [11]. We believe that the pangenome represents a new approach to species definition. Indeed, pangenomic studies offer a rather wide range of possibilities, such as predicting the allopatric or sympatric nature of a bacterium and precisely determining the genomic content of a group. Based on such results, it is not unrealistic to consider defining a narrow, closed pangenome as a species. Moreover, as noted by Dagan and Martin [15], a tree based on only one gene or on the whole set of ribosomal protein-encoding genes is too simplistic and not representative of reality. In contrast, pangenome study with different tools may help to define species. Quantum physics is a break from classical physics and is known to be unintuitive. In quantum physics, there is no progressive state for an electron between two orbits, because it performs quantum leaps. It has also been shown that the atom does not act as a classical system, which can exchange energy continuously. These physical phenomena fit the species definition used here.
Indeed, when we studied the pangenome and calculated the core/pangenome ratio on genomes theoretically belonging to the same species, we did not always obtain a linear graph as expected; instead, we saw a break event. When the break is clear, we may conclude that we are faced with two different species. Here we present the various methods of analysis, the bioinformatics and experimental tools and the link between pangenome, lifestyle and taxonomy.

Tools

Choice of study subjects

Number of species

We selected 27 bacterial species and compared the core/pangenome ratio depending on the number of tested genomes (Fig. 1) to find the minimum number of genomes necessary for a comprehensive analysis. We noted that in the case of a very closed pangenome (core/pangenome ratio between 100% and 98%), two genomes may be sufficient, and for a closed pangenome six strains seemed sufficient. For an open pangenome, it is more difficult to determine this number of necessary strains. If the pangenome is large, precise analysis can be possible on the basis of ten strains, but in the case of an infinite open pangenome, it is not possible by definition to close it (Fig. 2). This questions the reality of a species such as Escherichia coli, for example.
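The way such a curve is built can be illustrated with a minimal sketch. It assumes that each genome has already been reduced to a set of orthologue-cluster identifiers; the function name `ratio_curve` and the toy data are hypothetical and are not taken from the analysis described here.

```python
def ratio_curve(genomes):
    """Return (n_genomes, core_size, pan_size, ratio_percent) after each added genome.

    `genomes` is a list of sets of orthologue-cluster identifiers (assumed input).
    """
    curve = []
    core, pan = None, set()
    for n, genes in enumerate(genomes, start=1):
        # The core shrinks (intersection) while the pangenome grows (union).
        core = set(genes) if core is None else core & genes
        pan |= genes
        ratio = 100.0 * len(core) / len(pan) if pan else 0.0
        curve.append((n, len(core), len(pan), ratio))
    return curve


# Toy data: three hypothetical genomes sharing most of their clusters.
toy = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"a", "b", "c", "d"}]
for n, core_size, pan_size, ratio in ratio_curve(toy):
    print(n, core_size, pan_size, round(ratio, 1))
```

In this sketch the ratio stops changing once new genomes bring no new clusters, which is the plateau expected for a closed pangenome. The order in which genomes are added affects the intermediate points, so in practice the curve may need to be averaged over several orderings.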

Fig. 1

Core/pangenome ratio as a function of the number of genomes added, for several bacterial species. A pangenome is considered closed when the curve reaches a plateau.

Fig. 2

(a) Shigella flexneri. (b) All Shigella. (c) Escherichia coli and E. coli + Shigella. Trend curves are in black, the core/pangenome ratio in blue, the pangenome in red and the core in green. Numbers on the left correspond to percentages and numbers on the right to numbers of genes.

Which strains?

Once the number of isolates has been defined, strains must be selected carefully. Several criteria may be considered. First, if the study involves a pathogen, it can be relevant to include the clinical aspects, as different strains of the same species can cause different diseases. This is the case for E. coli [16], where commensal and pathogenic isolates can be selected. Among the pathogenic strains, five different clinical groups [16] were selected. Strains also frequently have different geographical origins, as for Coxiella burnetii and Yersinia pestis. These ‘geotypes’ are usually related to genotypes. A genotype can be defined by different methods: pulsed field gel electrophoresis [17], multilocus sequence typing [18], multispacer sequence typing [19-21] or single nucleotide polymorphisms (SNPs) of the core genome [22]. For C. burnetii, every multispacer sequence type is defined by 10 ‘cox’ sequences. Finally, it is also possible to use the phenotype, including antibiotic resistance or behaviour under stress conditions. These four criteria (clinical presentation, geographical origin, genotype and phenotype) open a wide range of possibilities, and it is worthwhile to select a large panel of strains to describe the pangenome diversity.

Interest of new species analysis

Real-time genomics has been used during epidemics to discover why and how an isolate was able to cause such an event and, at best, to identify specific genetic markers. There are two recent examples of public health use of pangenome analysis: the pandemic in Haiti caused by Vibrio cholerae [23] and the German epidemic caused by E. coli [24]. Respectively, 23 and 40 genomes were used for these comparative genomics analyses. In these studies, the authors determined the gene content and placed the isolates of interest in a biotype [23] or an existing pathotype [24]. In the case of V. cholerae, all the Haitian clones were clearly related to Nepal [23]. The E. coli isolate from the German epidemic was an emerging clone clustering with an enteroaggregative E. coli pathotype [24].

Microarrays

Chip technology has entered the pangenomic era, and new tools for designing probes have been created. In 2007, ProDesign [25] was released. It is a free online tool (http://www.uhnresearch.ca/labs/tillier/ProDesign/ProDesign.html) that can be used to select probes in order to detect the members of gene families in environmental samples. This allows the detection of several gene families simultaneously and specifically in one or several genomes. Moreover, the length and temperature of the probes do not need to be predefined. This tool was used, for example, in a 2011 study on Dehalococcoides [26] to detect and characterize these bacteria in contaminated sites. A second tool was created in 2009: PanArray [27]. It is a probe selection algorithm that can target several complete genomes with a minimum number of probes. Although microarrays are built on the basis of gene family clusters, PanArray uses an approach centred on probes, independent of annotation, gene clustering and multiple alignments. This tool works for known as well as unknown isolates; it has been tested on 20 isolates of Listeria monocytogenes and also on C. burnetii [28,29]. Finally, data obtained with the microarray approach require specific analyses, and new genes cannot be found. For such analyses, PanCGH [30] was developed in 2009, followed by an associated Web application, PanCGHweb [31], in 2010. The use of microarrays is only valid for closed pangenomes.

Bioinformatics tools

Composition and annotation

In the first place, searching for orthologues is a crucial step because it allows an estimation of pangenome composition (number of core and secondary genes). To find them, the most commonly used methods consist of one or several sorts of BLAST [32] or OrthoMCL [33]. There are many possibilities available for the annotation step [9,34-36], although COG (Cluster of Orthologous Groups) [37], InterPro [38] and KEGG (Kyoto Encyclopaedia of Genes and Genomes) [39] are the most frequently used. These tools, in particular COG and KEGG, allow a more detailed study of the functional distribution within the core and within the accessory genome. It is possible to look at the difference of distribution in the COG categories [34,35] and at the metabolic pathways [34,35]. Study of the metabolic pathways is not sufficient, however. It is also important and informative to examine protein expression regulation and transcription factors. Moreover, their absence or their presence in one or several isolates can help to explain some isolate characteristics. The online tool P2RP (Predicted Prokaryotic Regulatory Proteins) [40], which became available in 2013, was specially developed to offer a method for simply, quickly and effectively searching for these kinds of proteins that is accessible to all and not only to bioinformatics specialists. The tool covers complete genomes as well as protein sequences and gives detailed and clear outputs.

Alignment and phylogeny

Turning to genome alignment, we can choose a global alignment with MAUVE [41] (or use it for comparison [35]), or perform a multiple alignment (using Clustal [36]) as a basis for phylogeny. Most of the time, for phylogeny, MEGA [42] or MAFFT [43] are recommended for tree reconstruction. We can use different algorithms: neighbour joining [36] or maximum parsimony. The search for SNPs in the core genome can be used to estimate the age of species of interest [44]. However, this kind of analysis requires genomes of very close species in order to produce a phylogenetic tree and study in detail the mutational events that led to the separation into two different species. This kind of work has been carried out on Y. pestis, in which a comparative analysis was conducted with Yersinia pseudotuberculosis and Yersinia enterocolitica [45].

Resistome and mobilome

To study the resistome, there are databases such as the ARDB (Antibiotic Resistance Genes DataBase) [46], which can be used to look specifically for resistance genes present in isolates of interest. This database was used, for instance, for Mycobacterium tuberculosis [47]. Finally, it is also important to study the mobilome [48]. This represents the set of all the mobile elements (and hence selfish genes) contained in the studied genomes. Generally, we look for clustered regularly interspaced short palindromic repeats (CRISPRs) with CRISPRFinder [49], phages with PHAST [50] or RAST [51] and insertion sequences with ISfinder [52].

Dedicated software

The increase in the number of pangenomic studies led to the development of automated tools, which are more or less specialized. The first one, PanOCT [53] (pangenome orthologue clustering tool), is completely dedicated to orthologue searches. There is no online version, but the source code is available at http://panoct.sourceforge.net/. It was tested on Acinetobacter baumannii isolates and compared with other orthologue detection tools. For paralogue detection, PanOCT comes first in terms of accuracy and absence of errors [53]. A second, less specialized tool, PGAP (pangenomes analysis pipeline) [54], offers the user five types of analysis: cluster analysis of functional genes, pangenome profile, genetic variation of functional genes, species evolution and function enrichment of gene clusters. This automatic pipeline, tested on Streptococcus pyogenes, is interesting because all the analyses are performed through a single command line; moreover, it is possible to adapt the parameters to one species. Finally, there is the Panseq tool (pangenome sequence analysis program) [55], an online tool (http://lfz.corefacility.ca/panseq/) that allows the user to carry out three sorts of analyses: a search for new regions, allowing the detection of unique zones; an analysis of the core and the pangenome, giving information about the SNPs in the core or the distribution of accessory genes; and a loci selector, which finds the loci that discriminate between selected genomes.

Pangenome Composition

A pangenome is usually divided into three parts [1,2,7]: the core genome, gathering all genes common to all strains of the study; the secondary genome, called the accessory genome [1,2], which contains genes present in between two and n–1 strains (n being the number of strains studied); and the unique genes, which are present in only a single strain. Inside the pangenome, we can study different features such as the resistome, the mobilome and the global metabolism.
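This three-way division can be sketched in a few lines, assuming that orthologue clustering has already been done and that each strain is represented by the set of cluster identifiers it carries. The function name `partition_pangenome` and the toy strains are hypothetical.

```python
from collections import Counter


def partition_pangenome(genomes):
    """Split gene clusters into core, accessory and unique sets.

    `genomes` maps a strain name to the set of cluster identifiers it carries.
    """
    n = len(genomes)
    # Number of strains in which each cluster is present.
    counts = Counter(g for genes in genomes.values() for g in genes)
    core = {g for g, c in counts.items() if c == n}
    unique = {g for g, c in counts.items() if c == 1 and n > 1}
    accessory = set(counts) - core - unique  # present in 2 .. n-1 strains
    return core, accessory, unique


# Toy example with three hypothetical strains.
strains = {"s1": {"a", "b", "c"}, "s2": {"a", "b", "d"}, "s3": {"a", "c", "e"}}
core, accessory, unique = partition_pangenome(strains)
print(sorted(core), sorted(accessory), sorted(unique))
```

With a single strain every cluster is counted as core, which is why the unique set is only filled when at least two strains are compared.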

Toxin/antitoxin systems

Toxin/antitoxin (TA) genes are small genetic elements that are divided into five groups [56], based on the nature of the antitoxin (small RNA or small protein) and on the interaction type. The type II TA module is the most studied. TA toxins target different cellular processes depending on their type: ATP synthesis, translation, replication (type II), cytoskeleton (type IV) and peptidoglycan synthesis (type II) [56]. TA modules have different functions, for instance plasmid stabilization and, in the chromosome, mediation of superintegron stabilization [56]. Superintegrons often encode proteins with an adaptive function, such as virulence or resistance, and often contain TA modules. They are also toxic for the host of the bacteria [57]. Comparison of the ‘bad bugs’ against control species showed that pathogenic capacity is not due to ‘virulence factors’ (which are very often more numerous in non-pathogenic bacteria [58]), but to a virulent gene repertoire caused by a reduced genome repertoire [59]. ‘Virulence factors’ is a misleading term, except for toxins, which may have a direct effect [59]. In 2011, for the first time [60], TA modules were correlated with the pathogenicity of some bacteria. Indeed, most of the bad bugs contained significantly more TA modules than their controls [60].

Non-virulence genes

Non-virulence genes are part of an emerging concept in which the expression of a gene decreases virulence in the ancestor and the gene is lost in pathogenic strains [61]. Their deletion is associated with increased virulence. Originally identified in Shigella [62] (lysine decarboxylase), non-virulence genes may help explain pathogen evolution. They were later described [62] in Salmonella, Y. pestis and Francisella tularensis. Non-virulence genes can have different roles and be involved, for instance, in metabolism and biofilm synthesis [62]. There are 12 well-known non-virulence genes. A detailed definition of what a non-virulence gene is and what it is not has been proposed [62]. Broadly, suppressors and genes that were non-functional in the ancestor do not qualify, whereas deleted, inactivated or differentially regulated genes may be candidate non-virulence genes. To identify putative non-virulence genes, a reference genome is needed. Then, a very detailed genomic analysis is required on all the sequenced strains [62].

Resistome

Resistome is the term used to indicate all the resistance mechanisms that can be found in an organism [47,63,64]. In a recent study [64], the resistome of 412 multi-resistant bacteria isolated from four cultivated soils, four urban soils and two pristine environments was characterized by testing 23 antibiotics, in view of the large number of resistant pathogenic isolates [63]. This kind of study was carried out for M. tuberculosis in 2013 [47]. The emergence of multidrug-resistant strains prompted the study, and 53 resistance genes were found; most of these genes (60%) code for acetyltransferases and share a common ancestral core.

Core and panmetabolism

By analogy with the definitions of the core and the pangenome, the panmetabolism includes all the metabolic reactions that are present in the group of studied organisms, whereas the core contains only the reactions common to all isolates. A complete study was performed on the core and the panmetabolism of E. coli [65], including 29 strains. The authors found a panmetabolism comprising 1545 reactions, including 885 (about 57%) that belong to the core. The authors noticed that the proportion of core genes and the nature of the pangenome (open or closed) did not reflect panmetabolism distribution. For E. coli, for example, known to have an ‘infinite’ pangenome, they found a large number of core reactions but, as expected, a low number of core genes. They concluded that diversity was higher at the gene level than at the metabolic level.

Panregulon

Another analogy to the pangenome is the panregulon [66]. Studies have centred either on the core regulon [67] or on the complete panregulon [66]. The panregulon includes all genes controlled by a particular transcription factor in the studied genomes [66]. In the first work [67], eight isolates of Listeria monocytogenes were tested; the core regulon consisted of 63 genes, with a panregulon of 425 genes. A second study [66], on Sinorhizobium meliloti, examined the pangenome and the panregulon at the same time. Based on three isolates, the authors described a core genome of 5124 genes and a pangenome of 7824 genes. The panregulon is extremely small compared with the pangenome.

Example of pangenome study: Legionella pneumophila

In 2010, five complete genomes of L. pneumophila [35] were sequenced using 454 technology. It is an intracellular bacterium, a human pathogen that lives in sympatry with other microorganisms within amoebae [68]. Legionella pneumophila has an open pangenome. The core and the accessory genome were determined from a BLAST-based study of orthologues. This was used to describe a core genome of 1979 genes, representing 66.9% of the total genome, and a dispensable genome of 978 genes (33.1% of the genome), for which COG categories were assigned. The genome annotation revealed a large number of hypothetical proteins. Most of the genes in the accessory genome belonged to genomic islands, divided into six categories: three different islands connected with drug resistance, one with secretion and transport of heavy metals, three islands with DNA transfer, two CRISPR systems, seven phage-related systems and 13 islands with no identified function. From these results, the authors concluded that the persistence and virulence of L. pneumophila are encoded by the core genome.

Pangenome for Taxonomy of Pathogenic Species: the Case of Escherichia and Shigella

Historical taxonomy

For historical reasons related to pathogenicity and to particular morphological and biochemical characters, Shigella species were classified in a separate genus from E. coli. Whereas E. coli are usually prototrophic, motile and ferment many carbohydrates with gas production, Shigella are auxotrophic and generally do not produce gas during glucose fermentation. Hence, Shigella spp. have many ‘negative’ characteristics compared with E. coli. They are not motile, never grow on the synthetic medium Simmons citrate, lack phenylalanine or tryptophan deaminase, urease and lysine decarboxylase activities, and do not produce H2S. The division of Shigella into four species was based on biochemical and antigenic characterization. These species are divided into serotypes based on a characteristic O factor. However, the distinction between Shigella and E. coli, especially enteroinvasive E. coli (EIEC), is somewhat specious. The O antigens of certain serotypes of Shigella are identical or highly related to those of E. coli. Like EIEC, Shigella causes the dysentery syndrome, which consists of fever and diarrhoea with blood, pus and mucus in the faeces. The mechanism of Shigella pathogenicity is identical to that of EIEC. They invade epithelial cells and reach the lamina propria, triggering a major local inflammatory reaction that can lead to abscess formation and ulceration in the colon. Shigella should therefore be included in the E. coli group; their separate status was maintained only for practical reasons of medical diagnosis.

Ancient criteria

‘Ancient criteria’ are, for example, pathovars and the phenotypic and biochemical criteria used to distinguish between E. coli and Shigella spp. before genomic criteria became available. A first genomic criterion was G+C content. Based on a comparison of GC% between strains, they can be classified as the same species or not [3]. Variation is lower than 2% within E. coli (50.4–51.2), including when Shigella is added (50.4–51.2). Variation is lower than 1% within Shigella (50.7–51.1). Shigella spp. are indistinguishable from E. coli by DNA/DNA hybridization [69]. In the 16S identity matrix comparing all strains, we noticed that the lowest identity was 98.83%. The minimal 16S identity within E. coli was 99.41%, between E. coli IHE3034 and E. coli UMNK88, whereas the minimal identity between E. coli and Shigella was 99.03%, between E. coli O26 H11 11368 and Shigella dysenteriae Sd197. The identity between E. coli and Shigella spp. exceeds the cut-offs used to classify bacterial isolates at the genus and species levels on the basis of 16S rRNA gene sequence identity values (95% and 97% or 98.7%, respectively). In general, Shigella and E. coli appear to belong to the same species, and some Shigella were closer to some E. coli than to some other Shigella.
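The two numerical criteria used above, GC% and the 16S identity cut-offs, can be illustrated with a minimal sketch. The function names `gc_percent` and `rank_from_16s` are hypothetical; the cut-off values (95% for genus, 98.7% for species) are those quoted in the text, and the returned labels only state which cut-off is exceeded, not a taxonomic decision.

```python
def gc_percent(seq):
    """GC content (%) of a nucleotide sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def rank_from_16s(identity_percent, species_cutoff=98.7, genus_cutoff=95.0):
    """Report which of the 16S identity cut-offs quoted in the text is exceeded."""
    if identity_percent >= species_cutoff:
        return "above species cut-off"
    if identity_percent >= genus_cutoff:
        return "above genus cut-off"
    return "below genus cut-off"


print(gc_percent("ATGCGC"))      # short illustrative sequence
print(rank_from_16s(99.03))      # minimal E. coli / Shigella identity reported above
```

Exceeding the 16S cut-off is compatible with two isolates belonging to the same species but does not prove it, which is why it is combined here with the other criteria.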

New pangenomic criteria

To use the pangenome for taxonomy, we clustered our genomes based on COG and KEGG data. In both cases, Shigella was included inside the E. coli cluster and did not constitute a separate group. Then, we looked at the phylogenetic tree based on the concatenation of the core gene SNPs (not shown); Shigella did not constitute a unique cluster but instead tended to be distributed among the different E. coli clusters. Then, we calculated the distance between genomes on the basis of nucleotide sequence identity, which revealed that some E. coli (26 out of 42 genomes) were closer to Shigella (with more than 90% similarity) than to some other E. coli (with around 80% similarity). The principal coordinate analysis, based on the nucleotide similarity between genomes, showed several different clusters, including two clusters containing a mix of Shigella and E. coli species.

Pangenome and taxonomy

Using USEARCH [70] for protein de-replication, followed by tBLASTn with an E-value cut-off of 10E-3, we determined the core/pangenome ratio, the pangenome and the core genome values after each added strain for E. coli, E. coli + Shigella, Shigella and Shigella flexneri (Fig. 2). For each curve (core, pangenome and ratio), we looked for the best R² (coefficient of determination) in order to determine the most accurate regression type. We also calculated the average gap between the core and the pangenome curves. In all cases, the pangenome curve is described by a linear function, whereas the core and ratio curves are described by power functions. For a single species, like S. flexneri (Fig. 2a), the core, pangenome and ratio curves matched their trend curves perfectly. In Fig. 2(b), the ratio and pangenome curves showed that there are different species, because at some points the curves did not follow the trend curve. Then, in Fig. 2(c), the addition of nine Shigella to the 42 E. coli samples creates a break in the pangenome and ratio curves. This correlates with the disappearance of 543 functions, 216 from E. coli and 327 from Shigella. Indeed, the standard deviation between the core curve and the pangenome curve varies by only 1% between the two conditions (with or without Shigella). Finally, with the addition of a second E. coli there is a large decrease (15%) in the ratio, whereas in a homogeneous species, like S. flexneri, this decrease is only 2%. In conclusion, these variations between the trend curves and the ratio (or pangenome) curve indicate that E. coli is not a homogeneous species, in contrast to S. flexneri, which is. There is also a breakpoint in the ratio and pangenome curves. Mathematically, this corresponds to the start of a new function.
Here, this points to the start of a new species, which may be explored further as a new criterion for defining species.
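The choice of regression type by best R² can be sketched as follows. This is an illustration only: the fitting procedure actually used is not described here, so the sketch assumes ordinary least squares for the linear model and a log-log least-squares fit for the power model y = k·x^p (which requires positive values). All function names are hypothetical, and breakpoint detection is not covered.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def r_squared(ys, preds):
    """Coefficient of determination of predictions against observations."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def best_regression(xs, ys):
    """Compare a linear and a power model; return (best_name, r2_linear, r2_power)."""
    a, b = linear_fit(xs, ys)
    r2_lin = r_squared(ys, [a * x + b for x in xs])
    # Power model fitted as a straight line in log-log space (needs x > 0 and y > 0).
    p, log_k = linear_fit([math.log(x) for x in xs], [math.log(y) for y in ys])
    k = math.exp(log_k)
    r2_pow = r_squared(ys, [k * x ** p for x in xs])
    return ("linear" if r2_lin >= r2_pow else "power"), r2_lin, r2_pow


# Toy curves: x is the number of genomes added.
xs = [1, 2, 3, 4, 5]
print(best_regression(xs, [2 * x + 1 for x in xs])[0])         # pangenome-like growth
print(best_regression(xs, [100 * x ** -0.5 for x in xs])[0])   # ratio-like decay
```

A curve that departs from its best-fitting function at a given point, as described for Fig. 2(b) and 2(c), would show up here as a drop in R² when the points beyond the break are included.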

Relations Between Pangenome and Lifestyle

Ratio

Based on the ‘backbone files’ [44] of MAUVE, we calculated the size of the core and the pangenome of 27 species (Table 3). After determining the core/pangenome ratio (Table 3), we noticed that the species with a closed pangenome had a ratio ≥89% and that they were all allopatric. At the other extreme, the species with the smallest ratio (5%) was a sympatric bacterium that lives in a marine environment. This ratio is based on both coding and intergenic regions. We also calculated a ratio using only the coding part, based on the core genes.

Table 3

Core/pangenome ratio of several bacterial species according to their lifestyle

Species	Genomes used	Lifestyle	Intracellular	Niche	Core/pangenome ratio (%)
Prochlorococcus marinus	12	Sympatric	no	Marine environment	5
Clostridium botulinum	14	Sympatric	no	Soil	11
Rhodopseudomonas palustris	7	Sympatric	no	Soil, marine environment	46
Sinorhizobium meliloti	6	Sympatric	no	Soil	49
Salmonella enterica	20	Sympatric	facultative	Animals	62
Acinetobacter baumannii	11	Sympatric	no	?	65
Legionella pneumophila	11	Sympatric	facultative	Amoeba	69
Escherichia coli	19	Sympatric	no	Animals	70
Bacillus cereus	12	Sympatric	no	Soil	74
Campylobacter jejuni	14	Sympatric	facultative	Human, chicken	76
Clostridium difficile	18	Sympatric	no	Human gut	77
Helicobacter pylori	10	Sympatric	facultative	Human	78
Haemophilus influenzae	9	Sympatric	facultative	Human	80
Streptococcus pneumoniae	10	Sympatric	no	Human	82
Pseudomonas aeruginosa	7	Sympatric	no	Water	84
Streptococcus agalactiae	5	Sympatric	no	Human	84
Listeria monocytogenes	20	Sympatric	facultative	Amoeba?	84
Francisella tularensis	13	Sympatric	facultative	Ticks	87
Yersinia pestis	12	Allopatric	facultative	Rodents	89
Coxiella burnetii	7	Allopatric	yes	Animals	90
Tropheryma whipplei	19	Allopatric	yes	Human	94
Mycobacterium tuberculosis	20	Allopatric	yes	Human	96
Buchnera aphidicola	8	Allopatric	yes	Aphid	98
Bacillus anthracis	9	Allopatric	no	Animals	99
Rickettsia rickettsii	8	Allopatric	yes	Ticks	99
Chlamydia trachomatis	20	Allopatric	yes	Human	99
Rickettsia prowazekii	8	Allopatric	yes	Human	100

% is the ratio core/pangenome.
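The observation that closed pangenomes have a ratio of at least 89% can be checked against a few rows of Table 3 with a minimal sketch. The 89% cut-off is the empirical value reported in the text for these 27 species and is not presented as a universal rule; the function name `pangenome_type` is hypothetical.

```python
CLOSED_RATIO_CUTOFF = 89.0  # empirical threshold reported in the text


def pangenome_type(ratio_percent, cutoff=CLOSED_RATIO_CUTOFF):
    """Label a pangenome as closed or open from its core/pangenome ratio (%)."""
    return "closed" if ratio_percent >= cutoff else "open"


# A few rows copied from Table 3: (species, ratio %, lifestyle).
rows = [
    ("Prochlorococcus marinus", 5, "Sympatric"),
    ("Escherichia coli", 70, "Sympatric"),
    ("Francisella tularensis", 87, "Sympatric"),
    ("Yersinia pestis", 89, "Allopatric"),
    ("Rickettsia prowazekii", 100, "Allopatric"),
]
for species, ratio, lifestyle in rows:
    print(species, pangenome_type(ratio), lifestyle)
```

For these rows the closed label coincides with the allopatric lifestyle, in line with the trend described for the full table.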

Pangenome size and lifestyle

The size of a pangenome is strongly related to the balance between gene gain and loss events (Fig. 3). When an ecosystem changes (Fig. 3), some functions can become useless and eventually be lost. In contrast, when the bacteria are in a very diverse environment with many partners, gain events are common (Fig. 3). The genome size is also strongly connected to the selfish genes, which are parasitic and constitute the mobilome (see above). Phages, integrases and transposases contribute to the increase in genome size and are the consequence of life in a community. Usually, the more partners there are, the greater the probability of acquiring parasitic DNA. A sympatric bacterium will then have a wide and open pangenome and will possess a sizeable mobilome as well as more defence mechanisms (CRISPRs) than intracellular and allopatric species, which will have a small and closed pangenome [44].

Fig. 3

Summary of the difference between closed and open pangenome.

Case of ‘bad bugs’

It is known that intracellular bacteria possess fewer genes for transcription [71] and fewer genes involved in metabolism [72]. In 2011, a study of ‘bad bugs’ (targeting the 12 most dangerous bacteria for human beings) [59] was conducted. Overall, the virulent isolates tended to have a reduced genome compared with their commensal counterparts and, above all, showed functional reductions. Indeed, of the 23 tested COG categories [59], a decrease in gene number was found in ten, specifically for transcription and amino acid metabolism. The genes lost from the ‘bad bugs’ mainly encode metabolism and transport functions.

Pangenome and Lifestyle Examples: Yersinia pestis and Bacillus anthracis

Yersinia pestis [73], the agent of plague, was studied in 2010. After sequencing 14 genomes, assembly was carried out using Celera assembler [74] and annotation using the MANATEE tool (http://manatee.sourceforge.net/). After global alignment of genomes using MUMmer [75], pangenome composition was predicted using WU-BLASTp and tBLASTn. The core genome consists of 3668 genes and, as for every closed pangenome, the addition of new isolates changed almost nothing. Although Y. pestis lives in the soil, it has a closed pangenome reflecting an allopatric lifestyle. The same is true of B. anthracis, which lives in a dormant form in the soil and multiplies in its host. Hence, the pangenome makes it possible to determine whether a bacterium is just resting and not multiplying in an environment with many other microorganisms (such as soil or water) or whether it is active. Take B. anthracis for instance, which lives in the soil in a dormant sporulated form. When it becomes active and multiplies in its host, it has few chances to exchange genes. Therefore the B. anthracis pangenome is closed, with a core/pangenome ratio of 99%.

Conclusion: a Quantic Perspective for Taxonomy of Pathogenic Species

Pangenome studies have become almost essential for bacterial genome comparisons. After carefully choosing the strains of interest, we can select either an experimental method, such as a microarray [26], or a bioinformatics-based method (Fig. 4). Bioinformatics offers both general-purpose [37] and dedicated [53,54] tools. Thanks to these analyses, the study of pangenomes can provide different kinds of data and increase our knowledge and understanding of a species.

Fig. 4

Strategy for analysis of the pangenome.

First, the size of a genome is directly correlated with its capacity to acquire exogenous DNA, with gene gain and loss events and with the presence of selfish genes. The pangenome size depends on all these parameters. Hence, depending on its size and type (open or closed), we can determine the lifestyle of the species (allopatry or sympatry) and estimate the number of genomes needed to obtain the best view of the real genomic content (Fig. 3). Pangenome studies also make it possible to identify the resistome [47], the non-virulence genes [62] and the mobilome [48] (to determine selfish genes) of a group of strains (Fig. 4). It is sometimes possible to extrapolate the age of clones by studying SNPs in the core genome. Moreover, by analogy with the pangenome concept, the panmetabolism can be described, giving a broad but detailed view of the metabolisms shared by, and/or differing between, the strains of interest. By combining all these genomic data with lifestyle information, it may be possible to redefine species and classify them according to their genomic content. Indeed, groups of strains with a core/pangenome ratio of 99% or 100%, a very reduced mobilome and an identical gene content may be considered to be one species. However, in the case of an infinite pangenome, such as that of E. coli, or of a very small ratio (5%), as in Prochlorococcus marinus, can we still talk about a species? Instead of a single species, do we have a complex of species? Definitions of species were often reached using old tools. Moreover, some species are, by nature, non-homogeneous (as is the case for sympatric species). Redefining species [76] using a combination of pangenomic data, phylogeny and phylogenomics (unpublished data) may therefore be an interesting perspective for the future. Besides redefining species, the second important contribution of pangenome studies is to reveal what is not visible at first glance. Take B. anthracis, for example, which lives in a niche that appears sympatric (the soil) [77] but remains dormant in spore form and has a very closed pangenome. Conversely, L. pneumophila is intracellular, but it is metabolically active in its niche (amoeba) [68,78] and has an open pangenome. The pangenome therefore also provides an alternative method for analysing lifestyle, one that goes beyond simply looking at the apparent predicted niche.
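The distinction between an open and a closed pangenome can be made concrete by tracking how the pangenome and core sizes change as genomes are added. The following is a minimal sketch under the assumption that each genome is available as a set of gene-family identifiers; the function name and the toy data are hypothetical, and no fitting of a growth model is attempted.

```python
from typing import List, Sequence, Set, Tuple


def accumulation_curve(gene_sets: Sequence[Set[str]]) -> List[Tuple[int, int, int]]:
    """For each genome added in the given order, return a tuple of
    (pangenome size, core size, number of new genes contributed)."""
    curve: List[Tuple[int, int, int]] = []
    pangenome: Set[str] = set()
    core: Set[str] = set()
    for index, genes in enumerate(gene_sets):
        new_genes = genes - pangenome  # genes not seen in any earlier genome
        pangenome |= genes
        # The core starts as the first genome and shrinks by intersection.
        core = set(genes) if index == 0 else core & genes
        curve.append((len(pangenome), len(core), len(new_genes)))
    return curve


# Toy data, invented for illustration only.
closed_like = [{"g1", "g2", "g3"}, {"g1", "g2", "g3"}, {"g1", "g2", "g3"}]
open_like = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g4"}]

closed_curve = accumulation_curve(closed_like)
open_curve = accumulation_curve(open_like)
print(closed_curve)
print(open_curve)
```

In the closed-like example, no new genes appear after the first genome, whereas in the open-like example each added genome still contributes a new gene. Because the result depends on the order in which genomes are added, such curves are usually averaged over many random orderings in practice.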

Future Perspectives

In terms of future perspectives, we can consider applying the pangenome to the reclassification of other pathogenic bacterial species or genera, such as Salmonella.

Conflict of interest

None declared.

Definitions

Term	Meaning
Accessory genome	Genes that are not unique to one genome but are not part of the core genome.
Allopatric	Here, means living alone in its ecological niche.
Bad bugs	Most dangerous pandemic bacteria for humans.
Closed pangenome	Finished pangenome in which there is no change when new genomes are added.
COG	Clusters of orthologous groups.
Core genome	The pool of genes common to all the studied genomes of a given species.
CRISPRs	Clustered regularly interspaced short palindromic repeats.
KEGG	Kyoto encyclopaedia of genes and genomes.
MLST	Multilocus sequence typing, which is used for the typing of multiple loci in molecular biology. It is based on individual phylogenetic analysis or concatenation analysis of multiple housekeeping genes.
Mobilome	All mobile genetic elements of a genome.
MST	Multispacer sequence typing; based on highly polymorphic non-coding sequences.
NGS	Next-generation sequencing.
Non-virulence genes	Genes associated with non-virulence, the deletion of which favours virulence.
Open pangenome	A pangenome that continues to increase in size when new genomes are added.
ORF	Open reading frame.
Pangenome	The repertoire of genes for a group of genomes.
Panmetabolism	The repertoire of metabolic reactions for a group of genomes.
Panregulon	The groups of co-regulated genes observed by transcriptomic analysis.
Resistome	Set of all genes encoding resistance to other bacteria.
SNP	Single nucleotide polymorphism. Variation of only one base.
Species	A homogeneous group of isolates characterized by a phenotypic and genetic resemblance.
Sympatric	Here, means living in a large community in its niche.
TA modules	Toxin/antitoxin modules.

Table 1

Summary of all pangenome studies of bacterial species

Species	References	Phylum	Class
Escherichia coli	[8,16,65]	Proteobacteria	Gammaproteobacteria
Streptococcus pneumoniae	[8,79]	Firmicutes	Bacilli
Salmonella enterica	[8,36]	Proteobacteria	Gammaproteobacteria
Staphylococcus aureus	[8,80]	Firmicutes	Bacilli
Helicobacter pylori	[8,81]	Proteobacteria	Epsilonproteobacteria
Vibrio cholerae	[82]	Proteobacteria	Gammaproteobacteria
Mycobacterium tuberculosis	[83]	Actinobacteria	Actinobacteria
Yersinia pestis	[8,73]	Proteobacteria	Gammaproteobacteria
Acinetobacter baumannii	[8,84]	Proteobacteria	Gammaproteobacteria
Chlamydia trachomatis	[34]	Chlamydiae	Chlamydiia
Bacillus cereus	[1,8]	Firmicutes	Bacilli
Streptococcus pyogenes	[8,54]	Firmicutes	Bacilli
Listeria monocytogenes	[85,86]	Firmicutes	Bacilli
Haemophilus influenzae	[5]	Proteobacteria	Gammaproteobacteria
Pseudomonas aeruginosa	[87]	Proteobacteria	Gammaproteobacteria
Enterococcus faecium	[88]	Firmicutes	Bacilli
Clostridium difficile	[89]	Firmicutes	Clostridia
Francisella tularensis	[8]	Proteobacteria	Gammaproteobacteria
Campylobacter jejuni	[8,90]	Proteobacteria	Epsilonproteobacteria
Bacillus anthracis	[4]	Firmicutes	Bacilli
Clostridium botulinum	[8]	Firmicutes	Clostridia
Buchnera aphidicola	[8,91]	Proteobacteria	Gammaproteobacteria
Actinobacillus pleuropneumoniae	[92]	Proteobacteria	Gammaproteobacteria
Legionella pneumophila	[35]	Proteobacteria	Gammaproteobacteria
Streptococcus agalactiae	[4,93]	Firmicutes	Bacilli
Streptococcus suis	[94]	Firmicutes	Bacilli
Sinorhizobium meliloti	[66]	Proteobacteria	Alphaproteobacteria
Aggregatibacter actinomycetemcomitans	[95]	Proteobacteria	Gammaproteobacteria
Bifidobacterium animalis	[96]	Actinobacteria	Actinobacteria
Prochlorococcus marinus	[8]	Cyanobacteria	Prochlorales
Ralstonia solanacearum	[97]	Proteobacteria	Betaproteobacteria
Rhodopseudomonas palustris	[8]	Proteobacteria	Alphaproteobacteria
Coxiella burnetii	[8]	Proteobacteria	Gammaproteobacteria
Erwinia amylovora	[98]	Proteobacteria	Gammaproteobacteria
Corynebacterium pseudotuberculosis	[99]	Actinobacteria	Actinobacteria
Lactobacillus casei	[100]	Firmicutes	Bacilli
Salmonella paratyphi	[101]	Proteobacteria	Gammaproteobacteria
Oenococcus oeni	[102]	Firmicutes	Bacilli
Staphylococcus epidermidis	[103]	Firmicutes	Bacilli
Corynebacterium diphtheriae	[104]	Actinobacteria	Actinobacteria
Tropheryma whipplei		Actinobacteria	Actinobacteria

Table 2

Summary of all pangenome studies of bacterial genera

Genus	References	Phylum	Class
Streptococcus	[93]	Firmicutes	Bacilli
Salmonella	[36]	Proteobacteria	Gammaproteobacteria
Vibrio	[82]	Proteobacteria	Gammaproteobacteria
Pseudomonas	[105]	Proteobacteria	Gammaproteobacteria
Burkholderia	[106,107]	Proteobacteria	Betaproteobacteria
Bifidobacterium	[108]	Actinobacteria	Actinobacteria
Chlamydiae	[34]	Chlamydiae	Chlamydiia
Campylobacter	[9]	Proteobacteria	Epsilonproteobacteria
Listeria	[48]	Firmicutes	Bacilli
Dehalococcoides	[109]	Chloroflexi	Dehalococcoidetes
Mycoplasma	[110]	Tenericutes	Mollicutes
Caldicellulosiruptor	[111]	Firmicutes	Clostridia

