EnteroBase: hierarchical clustering of 100 000s of bacterial genomes into species/subspecies and populations.

Mark Achtman¹, Zhemin Zhou¹, Jane Charlesworth¹, Laura Baxter¹.

Abstract

The definition of bacterial species is traditionally a taxonomic issue while bacterial populations are identified by population genetics. These assignments are species specific, and depend on the practitioner. Legacy multilocus sequence typing is commonly used to identify sequence types (STs) and clusters (ST Complexes). However, these approaches are not adequate for the millions of genomic sequences from bacterial pathogens that have been generated since 2012. EnteroBase (http://enterobase.warwick.ac.uk) automatically clusters core genome MLST allelic profiles into hierarchical clusters (HierCC) after assembling annotated draft genomes from short-read sequences. HierCC clusters span core sequence diversity from the species level down to individual transmission chains. Here we evaluate HierCC's ability to correctly assign 100 000s of genomes to the species/subspecies and population levels for Salmonella, Escherichia, Clostridioides, Yersinia, Vibrio and Streptococcus. HierCC assignments were more consistent with maximum-likelihood super-trees of core SNPs or presence/absence of accessory genes than classical taxonomic assignments or 95% ANI. However, neither HierCC nor ANI was uniformly consistent with classical taxonomy of Streptococcus. HierCC was also consistent with legacy eBGs/ST Complexes in Salmonella or Escherichia and with O serogroups in Salmonella. Thus, EnteroBase HierCC supports the automated identification of and assignment to species/subspecies and populations for multiple genera. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.

Keywords: EnteroBase; accessory genome; big data; cgMLST; genomic databases; hierarchical clustering

Year: 2022 PMID: 35989609 PMCID: PMC9393565 DOI: 10.1098/rstb.2021.0240

Source DB: PubMed Journal: Philos Trans R Soc Lond B Biol Sci ISSN: 0962-8436 Impact factor: 6.671

Introduction

Microbiologists need large databases to identify and communicate about clusters of related bacteria, … Such databases should contain the reconstructed genomes of bacterial isolates, … together with metadata describing their sources and phenotypic properties … and we are currently developing EnteroBase, a genome-based successor to the E. coli and S. enterica MLST databases.

The Linnean system of genus and species designations was applied to bacterial nomenclature in the late nineteenth century, soon after their ability to cause infectious diseases had been recognized. These taxonomic labels initially reflected disease specificity and phenotypic similarities. Over the following decades, phenotypic criteria for distinguishing bacterial taxa were extended to include serologically distinct groupings and even differential sensitivity to bacteriophages. A primary goal for taxonomic designations for bacterial pathogens was that they be useful for epidemiology and clinical diagnoses. As a result, serovars of Salmonella enterica, which differ by dominant epitopes on lipopolysaccharide and flagella, were each assigned a distinct species designation, often referring to the disease syndrome and host, e.g. Salmonella typhimurium [2]. Similarly, even though Yersinia pestis is a clone of Yersinia pseudotuberculosis [3,4], it was designated as a distinct species because Y. pestis causes plague whereas Y. pseudotuberculosis causes gastroenteritis. The use of species designations for Salmonella serovars is now disparaged, although it is still in common use [5], but Y. pestis has retained its species designation. Percentage DNA–DNA hybridization levels became the ‘Gold Standard’ metric for new taxonomic designations after 1987 [6]. DNA–DNA hybridization reflects genomic relationships but distinctive phenotypic differences have remained a requirement for the definition of a novel species until the present. Indeed, the international committee which controls taxonomic designations continues to reject the validity of taxonomic designations based solely on DNA similarities [7,8].

Average nucleotide identity and multilocus sequence typing

An alternative genomic approach for taxonomic designations was proposed in 2005 by Konstantinidis [9], namely the definition of species on the basis of average nucleotide identity (ANI). Pairs of genomes with at least 95% ANI usually belong to the same species whereas pairs with lower ANI values belong to different species. Computational methods based on k-mer searches, such as FastANI [10], can rapidly perform ANI-like calculations on large numbers of genomes, and these calculations are now routinely used by bioinformaticians for taxonomic assignments. An approach based on ANI is enticing because it could provide defined criteria for the definition of a species across all Bacteria, and allow species assignment based exclusively on genome sequences. However, the use of ANI for species assignments has been criticized because it does not completely correlate with DNA–DNA hybridization [11,12]. Furthermore, multiple taxa that are designated as single species each contain multiple 95% ANI groups [13,14]. For example, strains of Streptococcus mitis colonize the human oropharyngeal tract, have similar phenotypes and are considered to represent a single species. However, S. mitis encompasses myriad, genetically distinct strains [15-17], and at least 44 distinct 95% ANI clusters [17]. Other problematic genera include Pseudomonas [12,18] and Aeromonas [19]. Furthermore, large databases including 100 000s of genomes per genus would struggle to implement methods such as pairwise clustering by ANI because each new entry would require testing against all genomes. There is also no consensus on ANI criteria for recognizing lower taxonomic entities, such as subspecies and populations. Indeed, there is not even a consensus on the definitive properties of what constitutes a bacterial species [20].
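The scaling concern raised above for pairwise ANI clustering can be made concrete with a little arithmetic. The following is a minimal sketch, not part of any ANI tool: it only counts unordered genome pairs for an all-against-all comparison, using genome counts copied from table 1. The function name is hypothetical.

```python
def pairwise_comparisons(n_genomes: int) -> int:
    """Number of unordered genome pairs in an all-against-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


# Genome counts per genus, copied from the cgMLST column of table 1.
genome_counts = {
    "Salmonella": 312_196,
    "Escherichia/Shigella": 175_256,
    "Yersinia": 4_023,
}

for genus, n in genome_counts.items():
    # A naive update compares each new genome against all n existing genomes.
    print(f"{genus}: {pairwise_comparisons(n):,} pairs in total, "
          f"{n:,} comparisons per new entry")
```

The quadratic growth in the number of pairs is the reason why a naive all-against-all approach becomes impractical for databases of this size.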
Bottom-up population genetic approaches such as multilocus sequence typing (MLST) can provide an alternative to top-down taxonomy, and deal efficiently with large numbers of bacterial strains. Legacy MLST based on the sequence differences of several housekeeping genes was introduced in 1998 [21], and has now been applied to more than 100 bacterial species [22]. MLST defines sequence types (STs), consisting of unique integer designations for each unique sequence (allele) of each of the MLST loci. Some STs mark individual clones with special pathogenic properties and which seem to have arisen fairly recently, such as Escherichia coli ST131, which is a globally prominent cause of urinary tract infection (UTI) and invasive disease [23]. Similarly, Salmonella enterica subsp. enterica serovar Typhimurium ST313 is a common cause of extra-intestinal, invasive salmonellosis in Africa [24]. Higher order clusters of related STs are also well known. Such clusters can be recognized by eBurst analyses [25], and are referred to as ST Complexes in E. coli [26] and eBGs (eBurst groups) in S. enterica [27]. ST Complexes and eBGs seem to reflect natural populations, but their broad properties are still not well understood [28]. The principles of legacy MLST were extended to rMLST, which uses sequences of 53 universal bacterial genes encoding ribosomal proteins and provides a universal MLST scheme for all Bacteria [29]. For both Escherichia and Salmonella, rMLST offers a slight improvement in resolution over legacy MLST, and the identification of Salmonella eBGs is reasonably consistent across both approaches [30]. rMLST has been used to identify tractable sets of representative genomes from large collections for calculating phylogenetic trees [28] and pan-genomes [17]. MLST has also been extended to cgMLST, which encompasses all the genes in a soft core genome [29-31], and cgMLST nomenclatures have been implemented for multiple bacterial genera [28]. 
The large number of loci encompassed by cgMLST schemes results in enormous numbers of cgMLST STs (cgSTs), but these can be clustered into groups of bacterial genomes at multiple levels of genomic diversity (hierarchical clustering) [32]. cgMLST provides considerably higher resolution than legacy MLST or rMLST, and initial analyses indicated that it is ideal for investigating transmission chains within single source outbreaks or for identifying population structures up to the genus level [28]. Here we focus on the automated assignments of genome assemblies from six genera of important bacterial pathogens to taxonomic and population structures by hierarchical clustering of cgSTs with the EnteroBase HierCC pipeline (table 1) [32]. For five of those genera, HierCC is a full solution for taxonomic designations of species.
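The idea of clustering cgSTs at several levels of allelic difference can be illustrated with a toy example. The sketch below is an assumption-laden simplification and not the HierCC implementation: it performs plain single-linkage clustering of made-up allelic profiles, skips loci with missing calls, and does not reproduce the stable cluster designations used in EnteroBase. All names and profiles are hypothetical.

```python
from itertools import combinations


def allelic_distance(a, b):
    """Count loci at which two profiles carry different allele numbers.

    Loci with a missing call (None) in either profile are skipped,
    which is a simplification for this sketch.
    """
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def single_linkage_clusters(profiles, max_diff):
    """Group profiles so that each member is linked to at least one other
    member of its cluster by no more than max_diff allelic differences."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path compression.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if allelic_distance(profiles[a], profiles[b]) <= max_diff:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Made-up profiles over six loci (real cgMLST schemes have hundreds to thousands).
profiles = {
    "cgST1": [1, 1, 1, 1, 1, 1],
    "cgST2": [1, 1, 1, 1, 1, 2],  # 1 difference from cgST1
    "cgST3": [1, 1, 1, 1, 3, 2],  # 1 difference from cgST2, 2 from cgST1
    "cgST4": [4, 4, 4, 4, 4, 4],  # 6 differences from all others
}

for level in (0, 1, 6):
    print(f"HC{level}:", single_linkage_clusters(profiles, level))
```

In this toy data, cgST3 joins cgST1 at level 1 through the intermediate cgST2 even though the two differ at two loci, which is the chaining behaviour expected of single linkage.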

Table 1

Legacy data, newly assembled genomes and hierarchical cgST clusters in EnteroBase (09/2021). NOTE: no. genomic loci is numbers of coding sequences whose alleles are automatically called for the wgMLST and cgMLST schemes; HierCC level: maximum numbers of allelic differences within minimal spanning trees that define HC clusters of cgSTs. EnteroBase automatically calculates existing Legacy MLST STs according to legacy schemes for Escherichia/Shigella (Wirth et al. [26]), Salmonella (Achtman et al. [27]), and Clostridioides (Griffiths et al. [33]), but does not assign new STs nor does it maintain a database of legacy data from ABI sequencing. Additional public databases are presented by EnteroBase for Helicobacter (>3500 genomes) and Moraxella (>2350 genomes), but these lack a cgMLST scheme. EnteroBase also has a database for >80 000 Mycobacterium genomes, but this currently (March, 2022) lacks a cgMLST scheme.

genus	legacy MLST (no. strains)	cgMLST (no. genomes)	no. genomic loci (wgMLST)	no. genomic loci (cgMLST)	no. HierCC clusters (HierCC level): ST complexes	no. HierCC clusters (HierCC level): Lineages	no. HierCC clusters (HierCC level): species/subspecies
Salmonella	4930	312 196	21 065	3002	3648 (900)	2185 (2000)	15 (2850)
Escherichia/Shigella	9525	175 256	25 002	2513	1379 (1100)	343 (2000)	16 (2350)
Streptococcus		76 718	33 887	372	3681 (100)		132 (363)
Clostridioides		23 197	11 490	2556	299 (950)		12 (2500)
Vibrio		12 267	152 249	1128	2000 (800)		155 (1090)
Yersinia	4286	4023	19 591	1553	451 (600)		34 (1490)

EnteroBase and HierCC

EnteroBase (http://enterobase.warwick.ac.uk) includes tools for automatic downloading of short-read sequences and their metadata from the public domain, assembly into draft genomes and population genetic analyses of the core and accessory genomes (table 2) [28]. Draft genomes are annotated according to genus-wide pan-genome schemes created with PEPPAN [17]. In September 2021, EnteroBase contained over 600 000 draft genomes from the six genera in table 1, and likely provides a nearly comprehensive overview of their global diversity. Many of these samples reflect a focus on food-borne disease in the US and United Kingdom, but this bias is increasingly being reduced by the global sources of many genomes (see electronic supplementary material, text, Sample Bias).

Table 2

EnteroBase-related software tools that support the analysis of large numbers of bacterial genomes. NOTE: PEPPAN is a stand-alone program that was used to generate the wgMLST schemes used in EnteroBase. SPARSE is a second stand-alone program that has been used to extract taxon-specific reads from metagenomes of ancient DNA that were then used with EToKi to define pseudo-MAGs (metagenomic assembled genomes) that were uploaded to EnteroBase for phylogenetic comparisons. HierCC was used as a stand-alone program in development mode to define an initial set of clusters from a representative set of genomes. Subsequent assignments to existing clusters or to novel clusters were performed automatically within EnteroBase using pHierCC in production mode. BlastFrost is used by EnteroBase to determine the presence/absence of toxins and other pathovar characteristics of gastrointestinal pathogens within E. coli, and to assign gastrointestinal pathovar designations.

name	purpose	citation	URL for stand-alone version
GrapeTree	GUI for depicting and analysing minimal spanning and NJ trees of character data	[34]	https://github.com/achtman-lab/GrapeTree
PEPPAN	pan-genome calculation including pseudo-genes from numerous representatives of an entire genus	[17]	https://github.com/zheminzhou/PEPPAN
SPARSE	assignment of metagenomic reads to individual taxa	[35,36]	https://github.com/zheminzhou/SPARSE
EToKi	EnteroBase toolkit of useful functions and pipelines	[28]	https://github.com/zheminzhou/EToKi
BlastFrost	efficient k-mer based search for DNA sequences in large genomic datasets	[37]	https://github.com/nluhmann/BlastFrost
HierCC	automated hierarchical clustering of cgSTs to existing or novel clusters	[32]	https://github.com/zheminzhou/pHierCC

One of the primary goals for EnteroBase was a hierarchical overview of the population structure of the genera in table 1. We therefore developed HierCC [32] based on cgMLST assignments to support the rapid recognition and detailed investigation of differing levels of population structure. EnteroBase reports cluster assignments and designations at 10–13 levels of allelic differences for all six genera [28] (table 1). HC5–HC10 clusters with maximal internal pair-wise distances within minimal spanning trees of 5 or 10 alleles, respectively, have been used to identify short-term, single source outbreaks of S. enterica and E. coli/Shigella that extended to multiple European countries [38-42]. Pathogen species can also include higher level clusters, which can correspond to somewhat more distantly related bacterial populations that cause endemic or epidemic disease over longer time periods in one or more countries [43-46].
EnteroBase HierCC has even been used to classify all Shigella [47], which correspond to discrete lineages of E. coli [26,48,49]. Classical taxonomic approaches for assigning individual strains and genomes to species depend on human expertise and are not suitable for automated pipelines in real-time databases such as EnteroBase. This problem is acute because many short read sequences in the European Nucleotide Archive (ENA) have incorrect taxonomic assignments, or none at all, and phenotypic distinctions are inappropriate for databases containing 100 000s of genomes. Zhou et al. compared taxonomic designations with peak normalized mutual information and silhouette scores for HierCC clusters and identified HierCC levels that corresponded with well-defined species/subspecies in each genus [32]. For example, for Salmonella with a total of 3002 loci in the cgMLST scheme, HC2850 (94.9% of all cgMLST alleles) was chosen as the optimal HC level for identifying species and subspecies. The optimal HC levels for the other five genera with cgMLST schemes ranged from HC363 to HC2500 (93.6–97.8%) (table 1). Here we compare the consistency of those HierCC assignments with assignments based on classical taxonomic approaches and 95% ANI. The results illustrate problems with classical approaches and with 95% ANI, and we conclude that HierCC is preferable for automated assignment of genomes to species/subspecies for those genera. However, neither ANI nor HierCC is universally satisfactory for species/subspecies assignments within Streptococcus. This manuscript also provides an initial overview of the abilities of HierCC to assign genomes to populations and Lineages [50,51] within Salmonella and Escherichia/Shigella, and compares those assignments with the distributions of O antigens within lipopolysaccharide.
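The percentages quoted above appear to express each species/subspecies HierCC level as a fraction of the number of loci in the cgMLST scheme (2850 of 3002 loci for Salmonella). The following sketch recomputes these fractions from the values in table 1. The dictionary layout and function name are illustrative assumptions, and individual values may differ from the quoted range in the last digit depending on rounding.

```python
# (no. cgMLST loci, species/subspecies HierCC level), copied from table 1.
schemes = {
    "Salmonella": (3002, 2850),
    "Escherichia/Shigella": (2513, 2350),
    "Streptococcus": (372, 363),
    "Clostridioides": (2556, 2500),
    "Vibrio": (1128, 1090),
    "Yersinia": (1553, 1490),
}


def level_as_percentage(n_loci: int, hc_level: int) -> float:
    """Express a HierCC level as a percentage of the cgMLST loci."""
    return 100.0 * hc_level / n_loci


for genus, (n_loci, hc_level) in schemes.items():
    pct = level_as_percentage(n_loci, hc_level)
    print(f"{genus}: HC{hc_level} = {pct:.1f}% of {n_loci} loci")
```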

Results

Species and subspecies

In order to test the efficacy of HierCC at identifying species and subspecies, we extracted collections of representative genomes from all six genera in EnteroBase for which HierCC clusters had been implemented (table 3). We wished to calculate maximum-likelihood (ML) phylogenetic topologies of these genomes based on their core genome single nucleotide polymorphisms (SNPs) or the presence/absence of accessory genes from the pan-genome. Such ML trees are very slow to calculate with large datasets, but disjointed tree merging (DTM) within a divide-and-conquer approach allows large phylogenetic trees to be calculated in a reasonable time [52]. We therefore developed cgMLSA (see electronic supplementary material, Methods in Supplementary Text), a novel DTM approach based on ASTRID [53] and ASTRAL [54] that enabled the calculations of ML super-trees from up to 10 000 representative genomes within hours to days, and applied it to each of the datasets. The resulting ML super-trees were annotated with taxonomic designations and cluster designations according to a 95% ANI cutoff calculated with FastANI [10] (95% ANI clusters). We then compared those annotations with cluster assignments based on the species-specific HierCC levels in table 1.

Table 3

Parameters of datasets used for calculating ML super-trees. Each dataset consists of genomes representing the entire diversity of a genus as described in Methods (see electronic supplementary material, Supplemental Text).

genus	no. type strains	total no. genomes	no. strict core genes	no. accessory genes	no. soft core genes (cgMLST)	no. SNPs
Salmonella	7	10 002	1409	19 486	3002	1 410 331
Escherichia (EcoRPlus)	8	9479	134	20 294	2513	1 172 200
Escherichia reps	2	967	192	17 983	2513	971 516
Streptococcus	84	5937	90	31 630	372	263 080
Clostridioides	1	6725	515	10 975	2556	725 240
Vibrio	119	5032	124	146 041	1128	776 439
Yersinia	23	1847	744	18 799	1553	656 143

Salmonella

The topologies of ML super-trees based on 1 410 331 core SNPs from 10 002 representative genomes were in large part concordant with traditional taxonomic assignments, and with clustering according to 95% ANI (figure 1a) or HC2850 (figure 1b). The three sets of assignments were also largely concordant with the topology of an ML super-tree based on the presence or absence of accessory genes (https://enterobase.warwick.ac.uk/ms_tree/53258). HC2850 and 95% ANI distinguished S. enterica from S. bongori, and from former subspecies IIIa, which was recently designated S. arizonae by Pearce et al. [55]. HC2850 also identified another new Salmonella species, cluster HC2850_215890, and that identification was confirmed by 95% ANI. HC2850_215890 consists of five strains that have been isolated from humans since 2018 in the UK, and a gene for gene comparison indicated that roughly half of their core genes were more similar to S. enterica and the other half to S. bongori.

Figure 1

A comparison of species and subspecies assignments within Salmonella with HierCC and ANI. The figure shows an ML super-tree of 1 410 331 SNPs among 3002 core genes from 10 002 representative genomes of Salmonella (table 3). Former subspecies IIIa is designated S. arizonae in accordance with Pearce et al. 2021 [55]. (a) Partitions differentiated by ANI 95% clusters (legend) correspond to species S. enterica, S. bongori, S. arizonae and a new species, S. HC2850_215890 (five strains from the UK, 2018–2020), as indicated by arrows, and subspecies are not differentiated. (b) Partitions coloured by HC2850 clusters (legend). Arrows indicate HC2850_215890, and a new subspecies, HC2850_222931 (one strain from France, 2018). All other HC2850 clusters correspond to species (S. bongori and S. arizonae) or subspecies, except for HC2850_7171 (starred), which is subsp. enterica (I) according to the ML tree. An interactive version of this GrapeTree rendition can be found at https://enterobase.warwick.ac.uk/ms_tree?tree_id=53257. The corresponding presence/absence tree can be found at https://enterobase.warwick.ac.uk/ms_tree?tree_id=53258.

Ninety-five percent ANI did not detect any subspecies structure within S. enterica whereas HC2850 differentiated a total of 12 clusters that corresponded to distinct phylogenetic branches in the ML super-tree (figure 1b). Most of these have been designated as subspecies by traditional taxonomy [30,55], except for a singleton genome in HC2850_222931, which likely represents a novel subspecies. Despite the correct identification of so many subspecies, HC2850 did not support Pearce's definition of subspecies XI and did not distinguish those genomes from subsp. II. Similarly, HC2850 did not differentiate Pearce's subsp. VIII from subsp. IV. The prior definition of these subspecies was largely dependent on branch topologies within phylogenetic trees based on rMLST loci, and was also contradicted by population statistics [55].
HierCC also differentiated HC2850_7171, which consists of a short phylogenetic branch within subsp. I that would not have been assigned to a subspecies according to its ML topology. Genomic comparisons of multiple single genes from genomes within HC2850_7171 suggest that HC2850_7171 may be a hybrid between subsp. I and II because some gene sequences were most similar to the former and others were most similar to the latter. With this sole possible exception, HierCC seems to be suitable for the automated detection of new Salmonella species and subspecies, and for routinely assigning novel genomes to the appropriate taxa.

Escherichia

In addition to Escherichia coli, Escherichia includes the named species albertii [56-58], fergusonii [59,60], marmotae [61,62] and ruysiae [63]. And despite their apparently distinct genus and species names, Shigella boydii, Shigella dysenteriae, Shigella flexneri and Shigella sonnei, all common causes of dysentery, correspond to phylogenetic lineages within E. coli [48] rather than to discrete taxonomic units. Still other, unusual Escherichia strains from lake and ocean water are associated with long phylogenetic branches, and were designated as ‘cryptic clades’ I–VIII by population geneticists [64-69]. The branch leading to clade I is simply a long phylogenetic branch within E. coli [64]. However, clade V encompasses E. marmotae [61,62] and E. ruysiae consists of the union of clades III and IV [63]. Initial analyses with the EcoRPlus collection of 9479 genomes [28] (table 3) yielded results that were compatible with these interpretations. Almost all E. coli or Shigella genomes were within HierCC cluster HC2350_1, and genomes with other taxonomic or clade designations belonged to other HC2350 clusters. However, soon after the definition of the EcoRPlus collection, additional genomes of Escherichia which were related to clade II were described from inter-tidal marine and fresh-water sediments near Hong Kong [66,70]. Furthermore, due to its numerical predominance, E. coli overshadows other Escherichia species and subspecies within EcoRPlus. We therefore created Escherichia reps in January 2021, a novel set of 967 representative genomes which included one genome from each of the 160 most common HC1100 clusters in HC2350_1 and all 807 genomes from other HC2350 clusters which existed in EnteroBase at that time. Figure 2 shows that E. fergusonii and E. albertii form discrete 95% ANI clusters as do E. marmotae and other clade V genomes. As previously reported [63], E. ruysiae consists of the distinct clusters of clades III and IV.
Three distinct 95% ANI clusters were found within clades II, VI and VIII while the Hong Kong genomes represented multiple, related phylogenetic branches, one of which has previously been designated clade VII. All E. coli and Shigella genomes are in a common 95% ANI group, as is clade I. Comparable clustering results were also found in an ML tree of presence/absence of accessory genes, albeit with different topology of the deepest branches resulting in E. fergusonii and clade VIII being most closely related to E. coli (https://enterobase.warwick.ac.uk/ms_tree?tree_id=71125). These observations indicate that the taxonomy and population genetic designations for Escherichia are incomplete, and also partially inconsistent.
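The construction of the Escherichia reps set described above (one genome from each of the most common HC1100 clusters within HC2350_1, plus every genome from the other HC2350 clusters) can be sketched as follows. This is a hypothetical illustration only: the record layout, field names and the rule of taking the first genome seen per cluster are assumptions, and the source does not state how the single representative per cluster was chosen.

```python
from collections import Counter


def select_representatives(genomes, main_cluster, n_common):
    """Pick one genome per common HC1100 cluster inside main_cluster,
    and keep every genome that lies outside main_cluster.

    genomes: list of dicts with the assumed keys 'id', 'HC2350' and 'HC1100'.
    """
    in_main = [g for g in genomes if g["HC2350"] == main_cluster]
    others = [g for g in genomes if g["HC2350"] != main_cluster]

    # Rank HC1100 clusters within the main cluster by number of genomes.
    sizes = Counter(g["HC1100"] for g in in_main)
    common = [cluster for cluster, _ in sizes.most_common(n_common)]

    # Assumption: the first genome encountered represents its cluster.
    first_seen = {}
    for g in in_main:
        first_seen.setdefault(g["HC1100"], g)

    return [first_seen[c] for c in common] + others


# Made-up records; real cluster numbers are much larger.
genomes = [
    {"id": "a1", "HC2350": 1, "HC1100": 10},
    {"id": "a2", "HC2350": 1, "HC1100": 10},
    {"id": "a3", "HC2350": 1, "HC1100": 10},
    {"id": "b1", "HC2350": 1, "HC1100": 20},
    {"id": "b2", "HC2350": 1, "HC1100": 20},
    {"id": "c1", "HC2350": 1, "HC1100": 30},
    {"id": "d1", "HC2350": 89353, "HC1100": 40},
    {"id": "d2", "HC2350": 89353, "HC1100": 41},
]

reps = select_representatives(genomes, main_cluster=1, n_common=2)
print([g["id"] for g in reps])
```

Down-sampling the dominant cluster in this way keeps rare clusters fully represented while preventing the numerically predominant species from overshadowing them.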

Figure 2

Maximum-likelihood core SNP tree of 967 Escherichia genomes consisting of one genome from each of 161 HC1100 clusters containing E. coli or Shigella as well as all 806 other Escherichia genomes in EnteroBase as of November 2020. The tree is coloured by (a) pairwise FastANI values clustered at the 95% level and (b) HC2350 cluster designations. The key legends indicate taxonomic designations in the literature which best match the cluster groupings. In (b), HC2350 cluster designations were used to mark novel taxonomic groupings in HC2350 clusters 89353, 89356, 89359 and 137132. An interactive version of this GrapeTree rendition can be found at https://enterobase.warwick.ac.uk/ms_tree?tree_id=52101. The corresponding presence/absence tree can be found at https://enterobase.warwick.ac.uk/ms_tree?tree_id=71125.

HierCC clustering of the same data provided a consistent and uniform nomenclature. Each discrete phylogenetic cluster received its own HC2350 designation, including clades III and IV and four phylogenetic clusters among the Hong Kong isolates which were designated according to their HC cluster number: e.g. Escherichia HC2350_89353. E. coli, Shigella and clade I all clustered together in HC2350_1, and the only discordance between topological clustering and HierCC clustering among the 967 genomes in Escherichia reps was a single genome which belonged to E. coli by phylogenetic topology but was assigned to HC2350_11465 by HierCC. These results were so convincing that HierCC groupings were used in late 2021 to curate and update all species designations for Escherichia entries within EnteroBase.

Clostridioides

Clostridioides difficile clustered distinctly from C. mangenotii according to 95% ANI, and ANI also distinguished five other clusters among genomes that were designated C. difficile. These clusters seem to be novel species according to their phylogenetic topology in the ML SNP super-tree (figure 3). HC2500 identified the same clusters, and also separated out two additional distinct clusters which resemble novel subspecies (arrows: HC2500_15334, HC2500_15408). A subset of the genomes in these clusters (figure 3) were assigned to cryptic clades C-I, C-II and C-III by a recent publication [71], which also concluded that they represent novel species. Thus, HierCC is also suitable for the automated detection of new Clostridioides species and subspecies, and for routinely assigning novel genomes to the appropriate taxa.

Figure 3

Species and subspecies assignments within Clostridioides according to 95% ANI (a) and HierCC (b). ML super-tree of 725 240 SNPs among 2556 core genes from 6724 representative genomes of Clostridioides difficile and one genome of Clostridioides mangenotii (table 3). (a) ANI 95% clusters differentiate C. mangenotii (cluster 1010) and five other clusters (clusters 255, 1370, 373, 2147, 1011) from C. difficile (cluster 0). Four of the 95% ANI clusters correspond to cryptic clades C-I (clusters 255, 373), C-II (cluster 1370) and C-III (cluster 2147) in the designations by Knight et al. [71]. Arrows indicate two additional clusters that were distinguished by HierCC in (b). (b) Partitions coloured by HierCC assign the same genomes to HC2500 clusters as ANI, except that HierCC assigns HC2500_15334 and HC2500_15408 designations to one genome each. An interactive version of this GrapeTree rendition can be found at https://enterobase.warwick.ac.uk/ms_tree?tree_id=53253 and a presence/absence tree can be found at https://enterobase.warwick.ac.uk/ms_tree?tree_id=53254.

Yersinia and Vibrio

Detailed results with Yersinia and Vibrio can be found in the electronic supplementary material, text. The general take-home lessons from those analyses resemble those summarized above: 95% ANI clustering is largely congruent with existing taxonomic designations, and HierCC identifies both species and subspecies with better resolution than 95% ANI. However, multiple, unanticipated taxonomic problems were evident in both genera. Firstly, ENA uses the strain designation as a species designation when a species is not specified during the submission process. The resulting public metadata is bloated with inaccurate species names. Secondly, multiple distinct phylogenetic branches were found without any species identifier. Thirdly, in multiple cases the species identifier did not correspond with ANI and/or HierCC, and comparisons with both the SNP and presence/absence ML topologies confirmed that the species designations in ENA were incorrect. We addressed these problems within EnteroBase by radical manual curation of the metadata to ensure that the taxonomic species designations correspond to the phylogenies and population structures. We also encountered multiple problematic taxonomic designations, which are described in the next paragraphs.

Yersinia

Ninety-five percent ANI supports the taxonomic convention that Y. enterocolitica represents a single species, and all genomes designated as Y. enterocolitica were in a single 95% ANI cluster. However, in accord with the conclusions by Reuter et al. [72], HierCC clustering assigned the non-pathogenic biotype 1A genomes to HC1490_73 and HC1490_764, the highly pathogenic biotype 1B to HC1490_2, and pathogenic biotypes 2–5 to HC1490_10. We therefore renamed these four groups by adding the HierCC cluster to the species name, e.g. Y. enterocolitica HC1490_2. In contrast to Y. enterocolitica, the Yersinia pseudotuberculosis Complex corresponds to a single phylogenetic cluster according to both HC1490 and 95% ANI (electronic supplementary material, figure S1). Taxonomists have split these bacteria into Y. pseudotuberculosis, Y. pestis, Y. similis and Y. wautersii [73,74]. DNA–DNA hybridization [75] and gene sequences of several housekeeping genes [4] previously demonstrated that Y. pestis is a clade of Y. pseudotuberculosis, and the new observations demonstrate that a distinct species status is not consistent with the genomic data for Y. similis and Y. wautersii. We have therefore downgraded all three taxa within EnteroBase to the category of subspecies of Y. pseudotuberculosis, and extended the designation of Y. pseudotuberculosis to Y. pseudotuberculosis sensu stricto. These assignments are not reflected by distinct HC1490 clusters, and HC1490 clustering can only be used to automatically assign new genomes to the Y. pseudotuberculosis Complex. We also defined eight other novel species/subspecies designations within Yersinia and assigned unique designations based on HierCC clusters: e.g. Yersinia HC1490_419. These clusters were previously unnamed, or had been incorrectly designated with the names of other species which formed distinct ANI and HC1490 clusters (electronic supplementary material, text, figure S1).

Vibrio

Vibrio encompassed 152 HC1090 clusters (electronic supplementary material, figure S2, text), which corresponds to much greater taxonomic diversity than for the other genera dealt with above. The concordance between HC1090 and 95% ANI clusters was absolute for most clusters, including large numbers of genomes from V. cholerae, V. parahaemolyticus, V. vulnificus and V. anguillarum (electronic supplementary material, table S1), and did not support the existence of additional subspecies. However, eleven HC1090 clusters each encompassed between two and five ANI clusters (electronic supplementary material, table S2) and three ANI clusters each encompassed 2–3 HC1090 clusters (electronic supplementary material, table S3). In order to support the automated assignment of genomes to named species, we implemented taxonomic assignments according to HC1090 and renamed the species of all genomes that were contradictory to this principle. The resulting dataset contains 109 HC1090 clusters with a unique, classical species designation as well as 43 other species level clusters designated as Vibrio HC1090_xxxx. Seventeen species names were eliminated because they were contradictory to the phylogenetic topologies or were incoherently applied (electronic supplementary material, text, table S4). These taxonomic changes now permit future automated assignment of novel genomes to species designations, and the recognition of novel species, and have provided a clean and consistent basic taxonomy that can be progressively expanded. Prior work has assigned many Vibrio species into so-called higher order clades of species on the basis of MLSA (multilocus sequence analysis) [76]. These clades are also apparent in the ML super-tree of core SNPs, and electronic supplementary material, figure S2 indicates the three largest (Cholerae, Harveyi, Splendidus).
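The comparison of HC1090 and 95% ANI clusters described above amounts to cross-tabulating two partitions of the same genomes and looking for one-to-many relationships. A minimal sketch of such a cross-tabulation follows. It is not the procedure used by the authors, and the cluster labels in the example are invented for illustration.

```python
from collections import defaultdict


def compare_partitions(assignments):
    """Cross-tabulate two cluster assignments per genome.

    assignments: iterable of (hiercc_cluster, ani_cluster) pairs, one per genome.
    Returns (concordant, split_by_ani, split_by_hiercc).
    """
    ani_per_hc = defaultdict(set)
    hc_per_ani = defaultdict(set)
    for hc, ani in assignments:
        ani_per_hc[hc].add(ani)
        hc_per_ani[ani].add(hc)

    # HierCC clusters that encompass more than one ANI cluster.
    split_by_ani = {hc: sorted(a) for hc, a in ani_per_hc.items() if len(a) > 1}
    # ANI clusters that encompass more than one HierCC cluster.
    split_by_hiercc = {ani: sorted(h) for ani, h in hc_per_ani.items() if len(h) > 1}
    # One-to-one matches between the two methods.
    concordant = sorted(
        hc for hc, a in ani_per_hc.items()
        if len(a) == 1 and len(hc_per_ani[next(iter(a))]) == 1
    )
    return concordant, split_by_ani, split_by_hiercc


# Invented labels, one tuple per genome.
assignments = [
    ("HC1090_1", "ANI_a"), ("HC1090_1", "ANI_a"),
    ("HC1090_2", "ANI_b"), ("HC1090_2", "ANI_c"),
    ("HC1090_3", "ANI_d"), ("HC1090_4", "ANI_d"),
]

concordant, split_by_ani, split_by_hiercc = compare_partitions(assignments)
print("concordant:", concordant)
print("HierCC clusters split by ANI:", split_by_ani)
print("ANI clusters split by HierCC:", split_by_hiercc)
```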

Streptococcus

For all five genera summarized above, clustering according to HierCC was largely concordant with the ML super-trees based on core SNPs or presence/absence of accessory genes. Ninety-five percent ANI and taxonomic designations were also largely concordant with the phylogenetic trees, although to a lesser degree. Similar concordances with the trees based on core SNPs (figure 4) and presence/absence of accessory genes (electronic supplementary material, figure S3) were also found for a majority of the named species within Streptococcus. One hundred and two HC363 clusters were concordant with 95% ANI clusters, and each of those HierCC clusters was specific for a single species after eliminating S. milleri, S. periodonticum and S. ursoris (electronic supplementary material, table S5). These results also demonstrated strong agreement between the two methods with classical taxonomy. However, genomes from a large number of other species within Streptococcus were not clustered satisfactorily by either method.

Figure 4

Comparison of HC363 clusters with taxonomic designations in Streptococcus. ML super-tree of 263 080 SNPs among 372 core genes from 5937 representative Streptococcus genomes (table 3). Species names are indicated next to the phylogenetic clusters according to the locations of genomes from type strains and public metadata. Nodes were coloured by HC363 clusters, and exceptional assignments are indicated by asterisks next to S. pneumoniae and S. pseudopneumoniae, which were both HC363_99; multiple phylogenetic and HierCC clusters within S. suis; S. salivarius and S. vestibularis, which were both HC363_202; S. lutetiensis and S. equinus, which were both HC363_181; and S. dysgalactiae and S. pyogenes, which were both HC363_139. An interactive version of the GrapeTree rendition of the SNP tree can be found at https://enterobase.warwick.ac.uk/ms_tree?tree_id=53261 and the presence/absence tree at https://enterobase.warwick.ac.uk/ms_tree?tree_id=53262.

Earlier publications [15,16,77,78] had demonstrated that S. mitis and S. oralis each represent multiple related sequence clusters rather than two single species, and similar results were obtained with 95% ANI clustering [17]. These observations were confirmed by the current data. Of the 102 concordant HC363 and 95% ANI clusters, 23 clusters from S. cristatus, S. infantis, S. mitis, S. oralis, S. sanguinis or S. suis each corresponded to only a sub-set of the topological diversity within their species. Furthermore, single HC363 clusters from 18 species or groups of species were subdivided extensively by 95% ANI for a total of 155 discrepancies between the two methods (electronic supplementary material, table S6). These discrepant clusters were within the species S. suis, S. criceti, S. sanguinis, S. oralis, S. hyointestinalis, S. parasanguinis, S. australis, S. infantis and S. cristatus. Ninety-five percent ANI was not able to differentiate S. pneumoniae from S. pseudopneumoniae, S. pyogenes from S. dysgalactiae, S. salivarius from S. vestibularis, or between S. equinus, S. infantarius and S. lutetiensis. Contrariwise, three other 95% ANI clusters each encompassed two HC363 clusters (electronic supplementary material, table S7). HC363 could not distinguish S. pneumoniae or S. pseudopneumoniae from 57 95% ANI clusters within S. mitis. In addition to these problems, eight additional species should not have been given a species name because their type strains belonged to one of these chaotic clusters. We conclude that numerous species have been defined on taxonomic grounds that cannot be correctly identified by either 95% ANI or HierCC, and that multiple discrepancies exist between the two approaches within Streptococcus.

Populations, eBurst groups and Lineages

HierCC supports the identification of populations at multiple hierarchical levels ranging from the species/subspecies down to individual transmission chains. We also expected that intermediate HierCC cluster levels could reliably detect the natural populations defined by legacy MLST. Here we focus on examples of such populations from Salmonella, E. coli/Shigella and S. pneumoniae, taxa where they have been examined in greatest detail. eBurstGroups (eBGs) defined by legacy MLST in Salmonella generally correspond to HierCC HC900 clusters [32]. In some cases, genetically related groups of eBGs correspond to HC2000 clusters [50]; we refer to those as Lineages. By contrast, none of the HierCC levels corresponded consistently with other, prior intra-species subdivisions of S. enterica subsp. enterica into lineages [79], clades [80-83] or branches [84]. This failure may reflect a high frequency of extensive homologous recombination at these deeper branches, which can obscure topological relationships [51,84], whereas HC900 (and HC2000) clusters diverged more recently, and have not yet undergone extensive recombination [51]. Traditional subgrouping nomenclature of Salmonella below the subspecies level predates DNA sequencing and is predominantly based on serovar designations. Serovar designations consist of common names that have been assigned to a total of more than 2500 unique antigenic formulae based on epitopes within lipopolysaccharide (O antigen) and two alternately expressed flagellar subunits (H1, H2). These antigenic formulae are written in the form O epitopes: H1 epitopes: H2 epitopes in the revised Kauffmann-White scheme [85], and O epitopes are summarized as O serogroups with distinct numeric designations [86], e.g. O:4 for the O group of serovar Typhimurium [87]. Detailed metabolic maps for lipopolysaccharide (LPS) synthesis have been elucidated for 47 O serogroups [87,88].
Serogroups from natural isolates can be determined serologically by agglutination reactions with specific antisera or in silico from genomic sequences with the programs SeqSero [89] or SISTR [90]. In practice, each of these methods has an error rate of a few percent [27,91]. Several hundred legacy eBGs within subsp. enterica were relatively uniform for serovar [27], with multiple exceptions, and many serovars were associated with multiple, apparently unrelated eBGs, likely due to the extensive exchange of genes encoding LPS and flagellar epitopes between eBGs early in their evolutionary history. We proposed that legacy MLST was a better metric for identifying the natural populations than serotyping, and could completely replace this traditional method [27]. Our new data indicate that HC900 clusters are even more reliable than legacy eBGs for recognizing natural populations within Salmonella. Several phylogenetic comparisons indicated that HC900 clusters based on cgMLST are concordant with eBGs in Salmonella [32,50]. Here we test this concordance quantitatively over a total of 319 490 genomes (table 4). The concordance between eBG and HC900 clusters was 0.985 according to the Adjusted Mutual Information Score (AMI) (table 4), a metric that is suitable for samples with a heterogeneous size distribution [94], and almost as high with the Adjusted Rand Index (ARI: 0.982), which is less suitable for heterogeneous data. Comparisons of eBG or HC900 with HC2000 clustering yielded slightly lower AMI scores (table 4). We also tested whether HC900 clusters are uniform for serovar. In late 2019, we corrected false serological data in the metadata Serovar field in EnteroBase by manual curation of 790 HC900 clusters that contained at least five entries. Ninety-seven percent (770/790) of those HC900 clusters were uniform (greater than or equal to 95%) for serovar (electronic supplementary material, table S8), as were most HC2000 clusters (437/473 clusters; 92%). 
The data also indicated that the predominant O groups were uniform over the multiple HC900 clusters within 95% (382/403) of HC2000 clusters within S. enterica subsp. enterica and 79% (58/70) of HC2000 clusters in other Salmonella species and subspecies (figure 5; electronic supplementary material, table S9).
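
The uniformity test described above (clusters with at least five entries, counted as uniform when at least 95% of entries share one serovar) can be sketched in a few lines. This is an illustrative sketch only; the function name `uniform_clusters`, its parameter names and the toy cluster labels are assumptions and do not come from EnteroBase.

```python
from collections import Counter


def uniform_clusters(serovars_by_cluster, min_size=5, threshold=0.95):
    """Report the predominant serovar of each cluster and whether it is uniform.

    serovars_by_cluster maps a cluster name to the list of serovar calls of its
    genomes. Clusters with fewer than min_size entries are skipped. A cluster is
    treated as uniform when the most common serovar accounts for at least
    `threshold` of its entries.
    """
    result = {}
    for cluster, serovars in serovars_by_cluster.items():
        if len(serovars) < min_size:
            continue
        top_serovar, top_count = Counter(serovars).most_common(1)[0]
        result[cluster] = (top_serovar, top_count / len(serovars) >= threshold)
    return result


# Toy input with hypothetical cluster names.
toy_clusters = {
    "HC900_a": ["Typhimurium"] * 20,           # fully uniform
    "HC900_b": ["Agama"] * 9 + ["Nigeria"],    # 90% predominant, not uniform
    "HC900_c": ["Give"] * 3,                   # fewer than five entries, skipped
}
report = uniform_clusters(toy_clusters)
n_uniform = sum(1 for _, is_uniform in report.values() if is_uniform)
```

The proportion `n_uniform / len(report)` corresponds to the kind of summary quoted in the text (e.g. 770/790 HC900 clusters).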

Table 4

Salmonella
Test statistic	eBG-HC900	eBG-HC2000	HC900-HC2000
Adjusted Mutual Information Score (AMI)	0.985	0.948	0.942
Adjusted Rand Index (ARI)	0.982	0.856	0.866
Escherichia/Shigella
Test statistic	ST Cplx-HC1100	ST Cplx-HC2000	HC1100-HC2000
Adjusted Mutual Information Score (AMI)	0.940	0.688	0.692
Adjusted Rand Index (ARI)	0.871	0.299	0.337
Streptococcus pneumoniae
Test statistic	CC-HC100	GSPC-HC100	GSPC-HC160
Adjusted Mutual Information Score (AMI)	0.959	0.956	0.980
Adjusted Rand Index (ARI)	0.987	0.906	0.949
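
The ARI values in table 4 were calculated by the authors with adjusted_rand_score() from sklearn.metrics. To make explicit what that statistic measures, the sketch below re-implements the Adjusted Rand Index with the Python standard library only. It is an illustration of the standard pair-counting formula, not the authors' code, and the function name `adjusted_rand_index` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same genomes.

    labels_a[i] and labels_b[i] are the cluster labels of genome i under the
    two schemes (e.g. an eBG and an HC900 cluster).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must describe the same genomes")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two genomes: trivially identical partitions

    # Pairs of genomes placed together by both schemes, by A only, by B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Identical partitions with different names give 1.0; crossed partitions are negative.
same = adjusted_rand_index(["eBG1", "eBG1", "eBG2", "eBG2"],
                           ["HC_x", "HC_x", "HC_y", "HC_y"])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which genomes are grouped together, the cluster names used by the two schemes do not need to match.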

Figure 5

Hierarchical population structure of O serogroups in Salmonella. Hierarchical bubble plot of 310 901 Salmonella genomes in 790 HC900 clusters for which a consensus O serogroup could be deduced by metadata, or bioinformatic analyses with SeqSero V2 [89] or SISTR 1.1.1 [90]. Taxonomic level (HC level; colours): species/subspecies (HC2850; light grey circles), Lineages (HC2000; dark grey circles) and eBurst groups (HC900; O:group specific colours). Additional information is indicated by yellow text for selected HC2000 and HC900 circles which are specifically mentioned in the text. The diameters of HC900 circles are proportional to the numbers of genomes. An interactive version of this figure can be found at https://observablehq.com/@laurabaxter/salmonella-serovar-piechart from which the representation, raw data and D3 JavaScript code [95] for generating the plot can be downloaded.

Table 4. Quantitative concordance between eBG and HierCC clustering. NOTE: Salmonella: Calculations were performed on 319 490 entries in EnteroBase which had been assigned to eBGs, HC900 and HC2000 clusters by December, 2021. The dataset only represented 411 eBGs, 690 HC900 clusters and 312 HC2000 clusters, and consists of a subset of all genomes because new eBGs have not been created in recent years. Escherichia/Shigella: HC1100 and HC2000 assignments were from 143 520 genomes which had been assigned to ST Complexes as well as to HC1100 and HC2350 clusters in December, 2021. At that time 186 852 genomes had been assigned to HC1100 and HC2000 clusters, with AMI of 0.59 and ARI of 0.2. Streptococcus pneumoniae consisted of GSPC assignments for 18 147 genomes from electronic supplementary material, table S2, and CC assignments for 13 396 genomes from electronic supplementary material, table S1 by Gladstone et al. [92] which were imported into User defined fields in EnteroBase as described [28]. ARI and AMI were calculated with the functions adjusted_rand_score() and adjusted_mutual_info_score() in the Python 3 library sklearn.metrics v. 1.0.1 [93].

HC2000 Lineages with uniform O group included the Para C Lineage (HC2000_1272; serovars Paratyphi C, Typhisuis, Choleraesuis and Lomita [51]) which emerged about 3500 years ago (electronic supplementary information, Supplemental Text) [51,84]. The Para C Lineage is uniformly serogroup O:7, which is not particularly surprising because all its serovars have identical antigenic formulae, and are only distinguished by biotype. The Enteritidis Lineage (HC2000_12) consists of serovars Enteritidis (1,9,12:g,m:-), Gallinarum and its variant Pullorum (1,9,12:-:-), and Dublin (1,9,12:g,p:-:-) [50] (figure 5), and is uniform for O:9. The Typhimurium Lineage (HC2000_2) includes serovars Typhimurium, Heidelberg, Reading, Saintpaul, Haifa, Stanleyville and others [50]. This Lineage is also uniform for O group because all these serovars are O:4. The Mbandaka Lineage (HC2000_4) includes serovars Mbandaka (6,7,14: z10:e,n,z15) and Lubbock (6,7:g,m,s:e,n,z15), and is uniformly O:7. However, three Lineages were exceptional, and did contain more than one O serogroup. HC2000_299 is a mixture of O:4 (HC900_299; serovar Agama [28]) and O:7 (HC900_127074; serovar Nigeria). HC900 clusters within HC2000_54 are O:4 (HC900_54: Bredeney, Schwarzengrund; HC24937: Kimuenza) or O:9,46 (HC900_57: Give). Similarly, HC900 clusters within HC2000_106 are O:3,10; O:9; O:9,46; O:4; or O:8. These observations confirm and extend the prior conclusions about concordance between natural populations and serogroup within Salmonella.
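
The antigenic formulae quoted above follow the pattern O epitopes: H1 epitopes: H2 epitopes. The sketch below shows one way to split such a string into its three epitope lists. It is a simplified illustration: the function name `parse_antigenic_formula` is hypothetical, a lone "-" is assumed to mean that the antigen is absent, and bracketed or underlined epitopes of the full Kauffmann-White notation are not handled.

```python
def parse_antigenic_formula(formula):
    """Split 'O:H1:H2' into lists of epitopes for each of the three antigens.

    Assumption: exactly three colon-separated fields, with '-' denoting an
    absent antigen. Whitespace around fields and epitopes is ignored.
    """
    fields = [field.strip() for field in formula.split(":")]
    if len(fields) != 3:
        raise ValueError(f"expected three fields (O:H1:H2), got {len(fields)}")

    def epitopes(field):
        # '-' marks a missing antigen, represented here as an empty list.
        if field == "-":
            return []
        return [epitope.strip() for epitope in field.split(",")]

    o_antigen, h1, h2 = (epitopes(field) for field in fields)
    return {"O": o_antigen, "H1": h1, "H2": h2}


# Formulae quoted in the text for serovars Enteritidis and Mbandaka.
enteritidis = parse_antigenic_formula("1,9,12:g,m:-")
mbandaka = parse_antigenic_formula("6,7,14: z10:e,n,z15")
```

Grouping genomes by the parsed O epitopes, rather than by the full formula, is what allows serovars with different flagellar antigens to be compared at the level of O serogroup.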

Escherichia coli

Patterns of multilocus isoenzyme electrophoresis were used in the 1980s to provide an overview of the genetic diversity of E. coli (see overview by Chaudhuri and Henderson [96]). Those analyses yielded a representative collection of 72 isolates [97], the EcoR collection, whose deep phylogenetic branches were designated haplogroups A, B1, B2, C, D and E [98]. Several haplogroups have since been added [99,100]. The presence or absence of several accessory genes can be used for the assignment of genomes to haplogroups with the Clermont scheme [101,102], and the haplogroup can also be predicted in silico from genomic assemblies with ClermontTyping [103] or EZClermont [104], both of which are implemented within EnteroBase. However, the Clermont scheme ignores Shigella, which consists of E. coli clades despite its differing genus designation [26,47,48], and makes multiple discrepant assignments according to phylogenetic trees [28]. The Clermont scheme also does not properly handle the entire diversity of species and environmental clades/subspecies within the genus Escherichia [28,63,64,66]. EnteroBase HierCC automatically assigns genomes within the Escherichia/Shigella database to the cgMLST equivalents of ST Complexes (HC1100 clusters) and Lineages (HC2000 clusters) (table 1). It also perpetuates the ST Complexes that were initially defined for legacy MLST by Wirth et al. [26]. However, the numbers and composition of legacy ST Complexes have not been updated since 2009 because additional sequencing data defined intermediate, recombinant genotypes that were equidistant to multiple ST Complexes, and threatened to merge existing ST Complexes. By contrast, HC1100 clusters do not merge: intermediate genotypes are rare because cgMLST involves 2512 loci while legacy MLST was based on only seven. 
Secondly, new genotypes which are similar to and equidistant to multiple clusters do not trigger merging because HierCC arbitrarily assigns them to the oldest of the existing alternatives [32]. Finally, unlike ST Complex designations in legacy MLST, which have not been actively updated in Escherichia, HierCC clusters are automatically created as necessary for new genotypes. Frequent recombination in E. coli results in poor bootstrap support for the deep branches in phylogenetic trees of concatenated genes [26], and we were unable to identify a unique HierCC level which was largely concordant with haplogroups according to the Clermont scheme. However, similar to Salmonella, HC1100 clusters identify clear population groups, which are highly concordant with legacy ST Complexes (AMI = 0.94) (table 4). Unlike Salmonella, HC2000 clusters did not in general mark recognizable additional population structure beyond that which was provided by ST Complexes, and HC2000 clusters are only moderately concordant with either legacy ST Complexes (AMI = 0.69) or their cgMLST equivalent, HC1100 clusters (AMI = 0.69) (table 4). Multiple Shigella species are exceptions to this rule and their legacy ST Complexes [26] equate to cgMLST HC2000 clusters rather than HC1100 clusters according to phylogenetic trees. HC2000_305 replaces ST152 Cplx, and encompasses all S. sonnei (figure 6). HC2000_192 replaces ST245 Cplx, and contains many S. flexneri. However, S. flexneri O group F6 is in an HC1100 cluster within HC2000_1465 (ST243 Cplx), as are S. boydii B2 and B4. HC2000_1465 also includes a second HC1100 cluster with S. boydii O groups B1 and B18, and a third with S. dysenteriae D3, D9 and D13 (figure 6). HC2000_4118 replaces the combination of ST250 Complex and ST149 Complex, which have merged, and contains both S. dysenteriae as well as S. boydii.
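
The rule that a new genotype equidistant to several clusters joins the oldest one, instead of merging them, can be sketched as follows. This is a toy model under stated assumptions and not the HierCC implementation: allelic distance is taken as the number of differing loci with missing calls (None) ignored, clusters are joined by nearest-member distance, and a lower integer cluster ID is assumed to mean an older cluster. All function names are hypothetical.

```python
def allelic_distance(profile_a, profile_b):
    """Number of loci with different allele calls; loci missing in either profile are skipped."""
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )


def assign_cluster(new_profile, clusters, threshold):
    """Add new_profile to an existing cluster or open a new one, never merging clusters.

    clusters maps an integer cluster ID to its list of member profiles.
    Assumption: lower IDs were created earlier, so they count as older.
    """
    candidates = []
    for cluster_id, members in clusters.items():
        nearest = min(allelic_distance(new_profile, member) for member in members)
        if nearest <= threshold:
            candidates.append((nearest, cluster_id))

    if candidates:
        # Closest cluster wins; ties in distance go to the lowest (oldest) ID.
        chosen = min(candidates)[1]
    else:
        chosen = max(clusters, default=0) + 1
        clusters[chosen] = []
    clusters[chosen].append(new_profile)
    return chosen


# Two existing clusters and a genotype one allele away from each of them.
toy_clusters = {1: [[1, 1, 1, 1]], 2: [[2, 2, 1, 1]]}
tie_choice = assign_cluster([1, 2, 1, 1], toy_clusters, threshold=1)
far_choice = assign_cluster([9, 9, 9, 9], toy_clusters, threshold=1)
```

In the toy run the intermediate genotype joins cluster 1, clusters 1 and 2 remain separate, and the distant genotype opens a new cluster, which is the behaviour the text contrasts with the merging of legacy ST Complexes.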

Figure 6

Hierarchical population structure of O serogroups in Escherichia coli/Shigella. Hierarchical bubble plot for 167 312 genomes of Escherichia coli or Shigella in HC2350_1 (large light grey circle) that were available in EnteroBase in April, 2021. Seven HC2000 Lineages encompassing 15 HC1100/ST Complexes of Shigella are shown at the right. HC2350_1 also includes 15 other E. coli HC2000 Lineages that each contain at least 50 E. coli genomes and encompass 144 other HC1100 clusters. The remainder of the figure shows those HC1100 clusters and not the corresponding HC2000 clusters. Numbers of genomes assigned to individual O serogroups (legend) are indicated by pie chart wedges within the HC1100 circles. Selected HC1100 clusters are also depicted with indications of phenotype and nomenclature at a larger scale outside the main circle, connected to the original circles by lines. An interactive version of this figure can be found at https://observablehq.com/@laurabaxter/escherichia-serovar-piechart, from which the representation, raw data and D3 JavaScript code [95] for generating the plot can be downloaded.

Unlike Salmonella, O serogroups are remarkably variable within HC1100 clusters of E. coli (figure 6). HC1100_63 (ST11 Complex) combines classical EHEC strains of serovar O157:H7 as well as O55:H7 [105], and other, atypical ancestral EPEC isolates of serovar O55:H7 [106]. Multiple other HC1100 clusters that contain EHEC strains also encompass multiple O groups, including HC1100_2 (ST29 Cplx; O26, O103, O111) [107] and HC1100_3 (ST20 Cplx; O45, O103, O111, O128). Similarly, multiple HC1100 clusters of E. coli that cause extra-intestinal diseases also contain a variety of O groups, including HC1100_7 (ST131 Cplx; O16, O25) [23,108,109] and HC1100_44 (K1-encapsulated bacteria of the ST95 Cplx; O1, O2, O18, O25, O45) [26,110,111]. Indeed, the primary impression from figure 6 is that O group variation within an HC1100 cluster is nearly universal throughout E. coli, confirming prior conclusions about the high frequency of homologous recombination in this species [26,99].

Other genera

Correlations between intra-specific HierCC clusters and population groupings at multiple levels, including ST Complexes and ribotypes, have been described in Clostridioides difficile [112]. It remains to be tested whether HierCC can identify populations within Yersinia because bacterial populations in Y. pseudotuberculosis are largely obscured by recombination [113]. Similarly, we are not aware of extensive efforts to identify bacterial populations which could be compared with HierCC within Vibrio. However, recent work has described the assignment of large numbers of genomes of Streptococcus pneumoniae to legacy Clonal Complexes as well as GSPC (Global Pneumococcal Sequence Clusters) [92], initially with PopPUNK [114] and more recently with Mandrake [115]. Most of those genomes have been assembled within EnteroBase, and assigned to HierCC clusters. We therefore compared the HierCC and GSPC assignments for 18 147 genomes, and also compared them to Clonal Complexes (CCs) based on legacy MLST (13 396 genomes). HC160 clusters yielded the highest AMI and ARI scores with GSPC clusters while CC was most concordant with HC100 clusters (table 4). These conclusions were also supported by visual comparisons of the colour-coding of nodes within a Neighbor-Joining tree based on cgMLST distances according to the various clustering criteria (electronic supplementary material, figure S4). Thus, HC100 clusters seem to be concordant with Clonal Complexes within S. pneumoniae, and Mandrake/PopPUNK clustering equates roughly to HC160 clusters.

Discussion

In 2013, we anticipated that EnteroBase might eventually contain as many as 10 000 genomes of Salmonella and Escherichia. By the end of 2021 it hosted more than 600 000 assembled genomes of Salmonella, Escherichia/Shigella, Streptococcus, Vibrio, Clostridioides and Yersinia, and has become one of the primary global sources of their genomic data. Analysis of their genomic properties is facilitated by software tools (table 2) that support indexing of genetic diversity, searching for genomes with specific metadata or genetic relations and investigating stable population structures at multiple levels. Here we focused on the automated identification and designation of species/subspecies and populations on the basis of HierCC of core genome MLST genotypes. The data demonstrate that HierCC can completely replace classical taxonomic methods for genomes of Salmonella, Escherichia/Shigella, Vibrio, Clostridioides and Yersinia, with only a few minor exceptions. HierCC is also a complete replacement for legacy MLST regarding the assignments to populations within Salmonella, Escherichia/Shigella and Streptococcus pneumoniae.

Taxonomic assignments

The paired ML super-trees based on core SNPs and presence/absence of accessory genes from each genus were generally concordant in their clustering patterns, indicating that those genera contain well-defined and reproducible species/subspecies independent of the genetic criteria. Those phylogenetic clusterings were then used to resolve the most appropriate designations for genomes which were assigned contradictory designations by their taxonomic metadata versus clusters based on 95% ANI or HierCC. The frequency of agreement with current taxonomy was comparable for HierCC and 95% ANI at the species level within Salmonella, Escherichia/Shigella, Vibrio, Clostridioides and Yersinia but 95% ANI was generally unable to resolve subspecies whereas HierCC did not distinguish between subspecies and species clusters. Classical taxonomical designations exhibited multiple, glaring problems in each genus and we replaced those problematical designations in EnteroBase with labels based on the automated, HierCC-based species/subspecies taxonomy. We also defined novel species/subspecies, and labelled them with HierCC-based designations. Why were there so many obvious discrepancies between the different approaches? One obvious reason is that we have discarded the medical tradition of retaining discrete species names for pathogens that cause particular diseases, and recommend designating Y. pestis as a subspecies of Y. pseudotuberculosis. Other discrepancies may reflect technical errors in manipulating data or uploading information to public databases, or simple strain mix-ups [46,50]. Possibly the discrepancies were particularly obvious because our analyses were performed at an unprecedented scale over multiple species. Many insights can be attributed to the facile ability to investigate large databases within a graphic framework that is provided by EnteroBase, and to our optimization of HierCC levels for each genus. 
However, we failed in a similar attempt to optimize ANI levels for recognizing both species and subspecies. Our taxonomic changes in Yersinia are likely to be highly controversial. According to both ML trees and HierCC, the current designation ‘Y. enterocolitica’ encompasses four, previously unnamed species/subspecies. These taxa are not distinguished by 95% ANI and had not previously been assigned distinct names. We maintain continuity with the traditional designation of Y. enterocolitica, and simply add HierCC affiliations to the species name, e.g. Y. enterocolitica HC1490_2. Alternatively, these clusters of strains could be considered to represent subspecies and their name modified slightly, e.g. Y. enterocolitica subspecies HC1490_2, similar to our downgrading of the named species Y. pestis, Y. similis [74] and Y. wautersii [73] on phylogenetic grounds to subspecies of Y. pseudotuberculosis (electronic supplementary material, figure S1). Finally, we defined eight new species/subspecies within Yersinia without submitting traditional evidence for simple phenotypic differences, and named them after their HC cluster, e.g. Yersinia HC1490_4399. We also followed comparable strategies for the other genera analysed here. In Salmonella, HierCC and 95% ANI identified a new species, Salmonella HC2850_215890, and a new subspecies, S. enterica subsp. HC2850_222931 (figure 1). Five new species were identified within C. difficile by HierCC as well as ANI, and HierCC identified two additional new subspecies. In Escherichia we provide HierCC designations for multiple clusters of bacteria from environmental sources near Hong Kong that were previously unnamed. The ML trees presented here (electronic supplementary material, figure S2) support prior definitions of high order clades containing multiple species in Vibrio on the basis of MLSA [76]. We also observed a general concordance between ANI and HierCC for 109 named Vibrio species. 
However, 43 HierCC clusters of species rank had not previously been properly classified with species designations, and 17 other species names were superfluous. These changes are fully described in the electronic supplementary material (text), allowing closer scrutiny by the Vibrio taxonomic community for consistency with other properties. HierCC was so effective in clarifying the taxonomies of these five genera that we had hoped that it could also facilitate the taxonomic classification of Streptococcus. One hundred and two HC363 clusters were indeed concordant with 95% ANI and taxonomic designations, with only minor exceptions. However, extensive discrepancies existed between 95% ANI and HierCC for numerous other clusters, and both approaches showed multiple discrepancies with classical taxonomic designations. The existence of multiple 95% ANI clusters within S. mitis and S. oralis and other species of Streptococcus has previously been commented on by Kilian and his colleagues [15,16,77,78]. Our observations extend these taxonomic problems to multiple additional species in which HC363 does not distinguish between pairs of named species (figure 4). Neither 95% ANI nor cgMLST HierCC seems to provide a suitable general strategy for elucidating the taxonomy of all of Streptococcus, and we failed to find a general solution to this problem.

HierCC versus classical taxonomy

Classical microbiological species taxonomy involves identifying a type strain whose phenotype can be distinguished from all other type strains, identifying additional isolates with similar phenotypes, demonstrating distinct clustering from other known species in phylogenetic trees based on DNA or amino acid sequences, and publishing a report in one of a very limited number of acceptable journals. Such species definitions are then considered tentative until an international committee has approved them. Other forms of identifying species that include sole reliance on DNA sequence differences are not acceptable [7,8]. The metagenomics community and scientists working with uncultivated organisms from the environment have largely liberated themselves from such regulations, and tend to use operational taxonomic units (OTUs) as taxonomic entities. However, the taxonomic species structure of many microbes from environmental sources remains fuzzy, or does not clearly correspond to classical taxonomy (e.g. Prochlorococcus [116] or Synechococcus [117]). Furthermore, we are not aware of any other method that is able to assign 1000s of genomes per day to existing taxa and also reliably identify new taxa automatically as they appear. HierCC performs this task with bravura for the genera in table 1, with the notable exception of Streptococcus. We have taken the liberty of dropping all attempts to reconcile HierCC species/subspecies clusters with classical prescriptions for how to define a species. Instead we have adopted the practice of using HierCC cluster designations within EnteroBase for the nomenclatures for species/subspecies groupings that had not yet been identified by others and eliminated from EnteroBase multiple species designations of type strains in Vibrio and Streptococcus which did not match the ML tree topologies and HierCC clusters. 
These actions provide a uniform base for the future addition of further species and ensure that future genomes can be correctly assigned to uniform clusters of related strains. Scientists wishing to use these schemes to identify the species of their bacterial isolates can upload their sequenced genomes to EnteroBase. The HierCC assignments will be available within hours. We also welcome additional curators of these databases who are willing to test and improve the current taxonomic assignments we have implemented. But we reject the concept of designating our assignments as tentative until they are confirmed in several years by an international committee.

HierCC and populations

The first bacterial taxonomic designations were assigned over 100 years ago. Bacterial population genetics is much younger and has many fewer practitioners. Initial population genetic analyses in the early 1980s subdivided multiple bacterial species into intra-specific lineages based on multilocus enzyme electrophoresis [97]. This methodology was replaced in the late 1990s by legacy MLST based on several housekeeping genes [21], which is currently being replaced by cgMLST based on all genes in the soft core genome [22,29]. EnteroBase calculates genotypes for both types of MLST, as well as for rMLST [28]. Legacy STs differ from cgSTs due to different levels of resolution. However, the boundaries of ST Complexes/eBGs are highly concordant with HierCC clusters, with AMI indices of 0.985 for eBGs versus HC900 in Salmonella and 0.94 for ST Complexes versus HC1100 clusters in Escherichia/Shigella (table 4). We recommended previously that serovars should be replaced by eBGs in S. enterica [27] and the data presented here show that HC900 clusters are an even better replacement. Our data also indicate that Lineages in S. enterica correspond to HC2000 clusters, and that these tend to be highly uniform for O group. The data presented here also indicate that HC1100 groups are a good replacement for detecting populations within Escherichia as are HC100 clusters for CCs in S. pneumoniae [92] (table 4). We interpret these consistencies between legacy MLST and cgMLST as reflecting the existence of natural populations. We previously claimed that legacy ST Complexes in E. coli were unstable due to frequent recombination [26], and were therefore not surprised at their tendency to merge as additional isolates were sequenced. However, legacy MLST is based on only seven genes. The high resolution of cgMLST and the decision to assign new genotypes that are equidistant from multiple HierCC clusters to the oldest cluster have largely negated these problems, and E. coli HC1100 clusters correspond to bacterial groupings that have been independently identified by multiple phenotypic patterns.

In the early 1980s, MA began his academic research on bacterial pathogens with E. coli that expressed the K1 polysaccharide capsule [110]. He observed unusually uniform patterns of electrophoretic mobility of major outer membrane proteins across multiple isolates, and interpreted those bacteria as representing ‘clones’. These ‘clones’ correspond to HC1100 clusters, including HC1100_44 (ST95 Complex) which continues to cause invasive disease in humans and animals around the globe [111]. The 1980s analyses showed that these K1 bacteria were variably O1, O2 or O18, and that the serotype variants differed both in their invasiveness and in the hosts which they infected [118]. As indicated in figure 6, HC1100_44 includes O groups O25 and O45 in addition to O1, O2 and O18. Similarly, HC1100_7 (ST131) represents another major cause of extra-intestinal disease in humans [23,108,109]. EHEC bacteria are notorious for their association with hemolytic uremic syndrome (HUS). Many of them belong to distinctive HC1100 clusters, including O157:H7/O55:H7 (HC1100_63, ST11 Cplx [106,119]) and O26, O103 and O111 EHEC bacteria (HC1100_2, ST29 Cplx [107]) (figure 6). By contrast to these correlations with epidemiological groupings, we were unable to identify any other HierCC levels that were consistently concordant with other intra-specific phylogenetic subdivisions based on SNP trees, including haplogroups [98] or Clermont typing [101,102] in E. coli [28] and clades [80-83], Lineages [79] or branches [84] in S. enterica. Instead, we found that a subdivision we refer to as Lineages is marked by some HC2000 clusters in Salmonella and Escherichia/Shigella.

Lineages

In E. coli, HC2000 Lineages were particularly appropriate for defining the clustering level of seven groups of Shigella genomes (figure 6). Lineage designations did not add insights into other populations because they were almost entirely encompassed by HC1100 clusters or occasionally even lower level clusters, such as HC400. In S. enterica subspecies enterica, multiple HC2000 Lineages corresponded to genetically related combinations of multiple HC900 clusters, in some cases with distinctive serovar designations. However, all but three of the major Lineages were predominantly uniform in O group. The three exceptional HC2000 Lineages (HC2000_54, HC2000_106, HC2000_299) might be worth exploring in greater detail to reconstruct the recombination that has resulted in their differing LPS epitopes. HC160 clusters, the Lineage equivalent in S. pneumoniae, were strongly concordant with PopPUNK GSPC clustering. Otherwise, little is yet known about the general properties of Lineages in other genera, except that their properties are likely to vary with species or genus. For example, S. enterica HC2000 Lineages were largely uniform for O group whereas O serogroups were already heterogeneous within HC1100 clusters/ST Complexes in E. coli (figure 6).
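The statement that most Lineages are predominantly uniform in O group can be checked with a simple tally per cluster. The sketch below is illustrative only: the record layout, the function names, the toy cluster labels and the 0.9 cut-off for "predominantly uniform" are all assumptions, not values taken from the analyses reported here.

```python
from collections import Counter, defaultdict


def o_group_uniformity(records):
    """Map each cluster to (dominant O group, fraction of genomes carrying it).

    `records` is an iterable of (cluster_label, o_group) pairs.
    """
    by_cluster = defaultdict(Counter)
    for cluster, o_group in records:
        by_cluster[cluster][o_group] += 1
    summary = {}
    for cluster, counts in by_cluster.items():
        dominant, n_dominant = counts.most_common(1)[0]
        summary[cluster] = (dominant, n_dominant / sum(counts.values()))
    return summary


def mixed_lineages(records, threshold=0.9):
    """Clusters whose dominant O group falls below `threshold` (assumed cut-off)."""
    summary = o_group_uniformity(records)
    return sorted(c for c, (_, frac) in summary.items() if frac < threshold)


# Toy, hypothetical data: one uniform Lineage and one with mixed O groups.
toy = [
    ("HC2000_A", "O:4"), ("HC2000_A", "O:4"), ("HC2000_A", "O:4"),
    ("HC2000_B", "O:9"), ("HC2000_B", "O:7"), ("HC2000_B", "O:9"),
]
print(o_group_uniformity(toy))
print(mixed_lineages(toy))
```

Applied to real HC2000 assignments and predicted O groups, a tally of this kind would single out the few Lineages whose LPS epitopes vary, which are the candidates for reconstructing past recombination events.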

Future prospects

EnteroBase was conceived to satisfy a need that MA and ZZ perceived in 2014 [1]. We contend that it is now fit for purpose for investigations of multiple bacterial genera by scientists ranging from beginners through to experts in the areas of microbial epidemiology and population genetics. Its original creators have now all left this project, but EnteroBase is being maintained as a service for the global community by the University of Warwick. Maintenance of its technical functions and databases is thereby assured for the near future. Further functional developments will, however, depend on increased participation and perception of ownership by its users. We perceive a general trend to focus on insular solutions that can satisfy the demands of individual bioinformaticians and regional diagnostic laboratories. Such approaches can yield relatively rapid progress in solving short-term needs. However, a global overview of genomic diversity needs central databases to ensure definitive terminology. Pooling efforts on a central endeavour at the scale represented by EnteroBase would ensure that it continues to function over decades, is representative of all continents, and serves the global community even better. We therefore welcome additional curators and scientific experts as well as bioinformatics collaborations to help improve EnteroBase further.

106 in total

1. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens.

Authors: Zhemin Zhou; Nabil-Fareed Alikhan; Martin J Sergeant; Nina Luhmann; Cátia Vaz; Alexandre P Francisco; João André Carriço; Mark Achtman
Journal: Genome Res Date: 2018-07-26 Impact factor: 9.043

2. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species.

Authors: Chengwei Luo; Seth T Walk; David M Gordon; Michael Feldgarden; James M Tiedje; Konstantinos T Konstantinidis
Journal: Proc Natl Acad Sci U S A Date: 2011-04-11 Impact factor: 11.205

3. Supplement 2003-2007 (No. 47) to the White-Kauffmann-Le Minor scheme.

Authors: Martine Guibourdenche; Peter Roggentin; Matthew Mikoleit; Patricia I Fields; Jochen Bockemühl; Patrick A D Grimont; François-Xavier Weill
Journal: Res Microbiol Date: 2009-10-17 Impact factor: 3.992

4. The Clermont Escherichia coli phylo-typing method revisited: improvement of specificity and detection of new phylo-groups.

Authors: Olivier Clermont; Julia K Christenson; Erick Denamur; David M Gordon
Journal: Environ Microbiol Rep Date: 2012-12-24 Impact factor: 3.541

5. Global population structure and genotyping framework for genomic surveillance of the major dysentery pathogen, Shigella sonnei.

Authors: Jane Hawkey; Kalani Paranagama; Kate S Baker; Rebecca J Bengtsson; François-Xavier Weill; Nicholas R Thomson; Stephen Baker; Louise Cerdeira; Zamin Iqbal; Martin Hunt; Danielle J Ingle; Timothy J Dallman; Claire Jenkins; Deborah A Williamson; Kathryn E Holt
Journal: Nat Commun Date: 2021-05-11 Impact factor: 14.919

6. Strategies to avoid wrongly labelled genomes using as example the detected wrong taxonomic affiliation for Aeromonas genomes in the GenBank database.

Authors: Roxana Beaz-Hidalgo; Mohammad J Hossain; Mark R Liles; Maria-Jose Figueras
Journal: PLoS One Date: 2015-01-21 Impact factor: 3.240

7. Fine-Scale Structure Analysis Shows Epidemic Patterns of Clonal Complex 95, a Cosmopolitan Escherichia coli Lineage Responsible for Extraintestinal Infection.

Authors: David M Gordon; Sarah Geyik; Olivier Clermont; Claire L O'Brien; Shiwei Huang; Charmalie Abayasekara; Ashwin Rajesh; Karina Kennedy; Peter Collignon; Paul Pavli; Christophe Rodriguez; Brian D Johnston; James R Johnson; Jean-Winoc Decousser; Erick Denamur
Journal: mSphere Date: 2017-05-31 Impact factor: 4.389

8. Cluster-specific gene markers enhance Shigella and enteroinvasive Escherichia coli in silico serotyping.

Authors: Xiaomei Zhang; Michael Payne; Thanh Nguyen; Sandeep Kaur; Ruiting Lan
Journal: Microb Genom Date: 2021-12

9. Comprehensive assessment of the quality of Salmonella whole genome sequence data available in public sequence databases using the Salmonella in silico Typing Resource (SISTR).

Authors: James Robertson; Catherine Yoshida; Peter Kruczkiewicz; Celine Nadon; Anil Nichani; Eduardo N Taboada; John Howard Eagles Nash
Journal: Microb Genom Date: 2018-01-17

10. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications.

Authors: Keith A Jolley; James E Bray; Martin C J Maiden
Journal: Wellcome Open Res Date: 2018-09-24