Literature DB >> 31580794

Genus-wide Yersinia core-genome multilocus sequence typing for species identification and strain characterization.

Cyril Savin^1,2,3, Alexis Criscuolo⁴, Julien Guglielmini⁴, Anne-Sophie Le Guern^2,3,1, Elisabeth Carniel^2,1,3, Javier Pizarro-Cerdá^1,2,3, Sylvain Brisse⁵.

Abstract

The genus Yersinia comprises species that differ widely in their pathogenic potential and public-health significance. Yersinia pestis is responsible for plague, while Yersinia enterocolitica is a prominent enteropathogen. Strains within some species, including Y. enterocolitica, also vary in their pathogenic properties. Phenotypic identification of Yersinia species is time-consuming, labour-intensive and may lead to incorrect identifications. Here, we developed a method to automatically identify and subtype all Yersinia isolates from their genomic sequence. A phylogenetic analysis of Yersinia isolates based on a core subset of 500 shared genes clearly demarcated all existing Yersinia species and uncovered novel, yet undefined Yersinia taxa. An automated taxonomic assignment procedure was developed using species-specific thresholds based on core-genome multilocus sequence typing (cgMLST). The performance of this method was assessed on 1843 isolates prospectively collected by the French National Surveillance System and analysed in parallel using phenotypic reference methods, leading to nearly complete (1814; 98.4 %) agreement at species and infra-specific (biotype and serotype) levels. For 29 isolates, incorrect phenotypic assignments resulted from atypical biochemical characteristics or lack of phenotypic resolution. To provide an identification tool, a database of cgMLST profiles and reference taxonomic information has been made publicly accessible (https://bigsdb.pasteur.fr/yersinia). Genomic sequencing-based identification and subtyping of any Yersinia is a powerful and reliable novel approach to define the pathogenic potential of isolates of this medically important genus.

Entities: Chemical Disease Mutation Species

Keywords: Yersinia; core-genome multilocus sequence typing; genotyping; identification; phylogenetics; species

Year: 2019 PMID： 31580794 PMCID： PMC6861861 DOI： 10.1099/mgen.0.000301

Source DB: PubMed Journal: Microb Genom ISSN： 2057-5858

Data Summary

This whole-genome shotgun project was deposited at GenBank/ENA/DDBJ under the accession numbers CABHPT00000000 to CABIIH00000000, and CABIIP00000000 to CABIJR00000000 (BioProject number PRJEB33414). Core-genome multilocus sequence typing profiles have been made available through the BIGSdb – database at https://bigsdb.pasteur.fr/yersinia. High-quality genomic sequences of bacterial isolates can now be obtained rapidly and affordably, opening up new perspectives for isolate identification and epidemiological investigation. In our study, we developed a genomic sequence analysis method applicable to all members of the genus Yersinia for identification of clinical, veterinary, food or environmental isolates at species and strain levels. We have made this bioinformatics procedure accessible online to enable the reliable identification of Yersinia isolates and to evaluate their pathogenic potential from their assembled genome sequences. The sharing of reference genomic sequence data and using a unified genotyping scheme will advance the discovery of novel Yersinia species, as illustrated herein, and will facilitate the genetic relatedness studies that are required when there is suspicion of an outbreak.

Introduction

The genus , a member of the family , is currently composed of 19 species and includes 3 prominent human pathogens: the agent of plague, Yersinia pestis, and the enteropathogens Yersinia enterocolitica and Yersinia pseudotuberculosis [1]. Whereas the isolation of is rare, represents the third main cause of diarrhoea of bacterial origin in temperate and cold countries [2]. Other species are non-pathogenic for humans; Yersinia ruckeri is a fish pathogen [3] and Yersinia entomophaga is an insect pathogen [4]. Given the heterogeneous pathogenic potential of members, identification at species and sometimes infra-species levels is essential for patient follow-up, and to guide the deployment of public-health measures. In addition, the taxonomy of the genus is evolving dynamically, with eight novel species since 2005 [4-11]. Among these, Yersinia wautersii is the only one with a pathogenic potential for humans [11]. Therefore, a generally applicable identification strategy that would encompass all species, including yet undescribed ones, would be useful. Currently, the complete characterization of a isolate is performed in reference laboratories and relies on phenotypic tests, such as the ability to grow on particular carbon sources, sero-agglutination, enzymatic tests and motility [1, 12–14]. This characterization allows all isolates to be assigned to species and serotypes, and to biotypes within the species and Yersinia intermedia [15, 16]. The complete characterization of a isolate is essential because it distinguishes the highly pathogenic biotype 1B, the non-pathogenic biotype 1A, and the low-pathogenic biotypes 2, 3, 4 and 5. Phenotypic characterization is labour-intensive, requires multiple assays and is time-consuming (7 days). Moreover, misidentifications may occur given that the distinction between some taxa relies on only a single metabolic trait, and that some isolates exhibit atypical metabolic profiles. Although MALDI-TOF MS is increasingly used for species identification, this technique has limitations. For example, it does not distinguish the four species within the complex, including non-pathogenic Yersinia similis [11] and highly pathogenic [17-19]. Moreover, MALDI-TOF MS cannot discriminate the highly pathogenic biotype 1B of from the non-pathogenic biotype 1A [20]. Identification of isolates at the species and infra-specific levels has benefited from the development of sequence-based phylogenetic methods, including the highly reproducible and portable multilocus sequence typing (MLST) approach. A genotyping strategy based on five genes was first developed in 2005, but it did not have sufficient resolution to distinguish the various biotypes of [21]. Later on, a seven-gene MLST scheme was set up to differentiate the three human pathogenic species (, and ), but this scheme was not applicable at the genus level [22]. Finally, a genus-wide seven-gene MLST scheme was developed [23], and allowed both the identification of species and the differentiation of biotypes; thus, representing a relevant alternative to the reference phenotypic method, especially for metabolically atypical isolates. However, classical MLST records variation at approximately 1/1000 of the full genome sequence length, which limits its discriminatory power and restricts phylogenetic information. Whole-genome sequences can now be obtained readily using high-throughput sequencing technology and can be leveraged for core-genome MLST (cgMLST) genotyping [24], which provides much improved resolution and phylogenetic precision, as shown for some other bacterial pathogens [25-28]. In this work, we aimed to develop a genus-wide strain identification and characterization method based on cgMLST. A phylogenetic analysis using genomes from public sequence repositories combined with genomes of isolates received routinely at the French National Reference Laboratory for Plague and other Yersinioses was performed to delineate current species, as well as yet undefined taxa. Next, a genus-wide cgMLST scheme was developed to record nucleotide variation at 500 shared gene loci. Finally, an automated process for strain identification and characterization was set up, validated and implemented in the daily surveillance of isolates in France. To disseminate the method, a database of cgMLST profiles and their corresponding identification was made publicly accessible using the web-based tool BIGSdb (Bacterial Isolate Genome Sequence Database) [29, 30], allowing external users to identify, type and subtype their isolates using their own genomic sequences.

Methods

Isolate collection

All isolates came from the strain collection of the French Yersinia National Reference Laboratory (YNRL), which comprises more than 41 000 strains, including members of all known species (except ). Most strains (~26 000; 63 %) were isolated in France, whereas ~15 000 strains came from other countries throughout the world. Additionally, the Research Unit (Institut Pasteur, Paris) maintains a collection of ~1800 Y. pestis strains isolated worldwide.

Phenotypic characterization

isolates were identified at the YRNL using API20E and API50CH strips, tween-esterase activity [13], pyrazinamidase activity [12] and mannitol-mobility at 28 °C. strains were biotyped according to Wauters biogrouping scheme [15]. and strains were serotyped with a set of 47 and 5 O-antigen-specific rabbit antisera, respectively [31]. Identification of isolates included a specific phage lysis test [32].

Genomic sequencing

A total of 802 isolates from the YNRL were sequenced using an Illumina NextSeq 500 instrument. DNA extraction was performed using a PureLink genomic DNA mini kit (Invitrogen) following the manufacturer’s instructions, except that DNA was eluted in 150 µl H2O (Ambion). Sequencing libraries were prepared using a Nextera XT DNA library preparation kit (Illumina). Paired-end reads of 150 nt were obtained using the Mid Output or High Output kits (Illumina). Trimming and clipping were performed using AlienTrimmer v0.4.0 [33]. Redundant or over-represented reads were reduced using the khmer software package v1.3 [34]. Finally, sequencing errors were corrected using Musket v1.1 [35]. A de novo assembly was performed for each strain using SPAdes v3.12.0 [36] with the pre-processed reads. A minimum sequencing depth of 50× was obtained for each genome. Contigs with significantly low coverage compared to the others were discarded as putative contaminants. On average, genomes were assembled into 208 contigs (min=20; max=2051) with a total size of 4.6 Mb (min=3.6 Mb; max=5.4 Mb) and with an N50 value of 86 065 (min=18 825; max=490 559).

Constitution of a genome dataset

A total of 544 publicly available assembled genomes were downloaded in February 2016 (Tables 1 and S1, available with the online version of this article) from the GenBank repository, representing all species according to the recognized taxonomy at that time, except . In addition, the genome sequences of 802 isolates received at the YNRL (Tables 1 and S1), and previously characterized phenotypically, were determined and used in this work. Together, this constituted a reference dataset of 1346 assembled genomes (Tables 1 and S1). Finally, the cgMLST scheme was validated on 1843 additional isolates received at the YNRL between 2016 and mid-2017 that were sequenced and phenotyped in parallel.

Table 1.

Species and bioserotype composition of the genomic dataset

Species	Bioserotype	Origin		Total
		Public genomes	YNRL
Y. aldovae		6	14	20
Y. aleksiciae		2	12	14
Y. bercovieri		4	17	21
Y. enterocolitica	1A	35	42	77
	1B	13	3	16
	2/O:9	28	8	36
	2–3/O:5,27	15	7	22
	3/O:3	0	19	19
	4	26	115	141
	5	7	4	11
	Unknown	8	0	8
Y. entomophaga		0	2	2
Y. frederiksenii		22	19	41
Y. intermedia		16	7	23
Y. kristensenii		13	11	24
Y. massiliensis		2	5	7
Y. mollaretii		10	11	21
Y. nurmii		1	0	1
Y. pekkanenii		2	0	2
Y. pestis		270	20	290
Y. pseudotuberculosis		43	442	485
Y. rohdei		6	12	18
Y. ruckeri		8	11	19
Y. similis		5	13	18
Y. wautersii		2	5	7
Undefined		0	3	3
Total		544	802	1346

Species and bioserotype composition of the genomic dataset Species Bioserotype Origin Total Public genomes YNRL 6 14 20 2 12 14 4 17 21 1A 35 42 77 1B 13 3 16 2/O:9 28 8 36 2–3/O:5,27 15 7 22 3/O:3 0 19 19 4 26 115 141 5 7 4 11 Unknown 8 0 8 0 2 2 22 19 41 16 7 23 13 11 24 2 5 7 10 11 21 1 0 1 2 0 2 270 20 290 43 442 485 6 12 18 8 11 19 5 13 18 2 5 7 Undefined 0 3 3 Total 544 802 1346

Design of theYersinia cgMLST scheme

From the 1346 reference genomes, 200 genomes representative of genus diversity were selected in the following way. Genome assemblies made up of more than 500 contigs with N50 values below 10 000 nt were discarded. From the remaining genomes, we calculated pairwise genome distances using the program andi [37]. From the resulting matrix, we defined 200 clusters based on an UPGMA (unweighted pair group method with arithmetic mean) hierarchical clustering. For each cluster, one representative with the largest N50 value was selected, leading to a subset of 200 phylogenetically representative genomes (Table S1). A total of 1672 genes present in more than 95 % of these 200 representative genomes was defined as the core genes. For each of these core genes, a file containing all alleles (typically between 50 and 100) was generated. Each file was parsed, and genes were removed if (i) a character other than A, T, G or C was present in any of the sequences, (ii) the gene had a paralogue or (iii) there was a gap of more than 6 nt in the multiple sequence alignment. These filtering criteria led to the selection of 500 core genes deemed suitable for cgMLST analysis.

Phylogenetic analyses

In order to perform a phylogenetic analysis of the genus , a representative subset was built from the 3878 genomes available in our database. First, this set of isolates was partitioned into 30 clusters based on a UPGMA clustering from the pairwise dissimilarities between allelic profiles. Second, as these 30 clusters were of varying size n (n=1 to 1849 isolates), each was partitioned into 7 log10 n UPGMA sub-clusters, and the isolate with the smallest number of missing alleles was selected within each sub-cluster, leading to a final set of 236 isolates representative of the genus . For each of the 500 genes of the cgMLST scheme, the corresponding allele sequences were translated, and a multiple sequence alignment was performed using mafft v7.407 [38]. The 500 multiple sequence alignments were concatenated into a unique supermatrix of 139 050 amino acid characters that was used to infer a maximum-likelihood phylogenetic tree using iq-tree v1.6.3 [39] with the evolutionary model JTT+F+R5 selected by minimizing the BIC criterion [40]. Following a similar approach, datasets within and species led to two supermatrices of characters for 246 (+28 outgroup) and 294 (+20 outgroup) representative isolates, respectively, that were phylogenetically analysed with the evolutionary models JTT+F+R3 and JTT+F+R2, respectively.

Average nucleotide identity (ANI) estimation

Using the 236 representative genomes (see above), the ANI was estimated for each pair of isolates using fastANI v1.1 [41].

BIGSdb

A database was created for the genus in the Institut Pasteur MLST and whole-genome MLST resource (https://bigsdb.pasteur.fr), which uses the BIGSdb software tool, designed to store and analyse sequence data for bacterial isolates [29, 30]. All de novo assembled genomes were uploaded into the isolates database, and the reference alleles of the 500 cgMLST loci were defined in the linked database of reference sequences and profiles (‘seqdef’). Using BIGSdb functionalities, a scan of the genome sequence was performed for each isolate using parameters (min 80 % identity, min 80 % alignment, blastn word size of 20 nt) to check for the presence of each core gene and to determine its allele number. Allelic profiles identifiers (cgST) were defined for genomes with 50 or fewer missing alleles. The BIGSdb – database of cgMLST profiles is accessible at https://bigsdb.pasteur.fr/yersinia/.

Taxonomic assignment

To assign isolates to taxonomic categories based on their cgMLST allelic profiles, a set of allelic difference proportion cut-offs was determined at different levels of the phylogenetic structure. For this, a minimum spanning tree was built from the pairwise dissimilarities between allelic profiles containing fewer than 50 missing alleles. For a given group of isolates (e.g. species), the corresponding subtree was extracted from the minimum spanning tree and the length of its largest edge was defined as the cut-off associated with this group of isolates.

Validation of the cgMLST method

For 1843 isolates received prospectively between 2016 and mid-2017 at the YNRL, taxonomic assignment obtained using the cgMLST method was compared to the phenotypic characterization performed in parallel. The above-defined stringent thresholds used to define species or lineages were re-evaluated based on isolates with no taxonomic assignment (Fig. S1). For isolates presenting discrepancies between the two characterization methods, phenotyping was repeated.

Seven-gene MLST

Genomes were scanned for the seven loci of the genus-wide MLST scheme of McNally and colleagues [23]. From February 15th 2019, for each of the seven MLST loci, the alleles were downloaded from the PubMLST website (https://pubmlst.org/yersinia/), translated and aligned at the amino acid level using mafft v7.407 [38]. Each of the seven multiple amino acid sequence alignments was converted into a position-specific score matrix (PSSM) [42] using blast+ v2.6.0 [43]. For each of the 236 representative genomes (see above), the seven MLST alleles were searched with tblastn by using the corresponding PSSM as query, and extracted. The seven allele files were aligned using mafft, concatenated and phylogenetically analysed using iq-tree v1.6.3 [39] with the evolutionary model TIMe+I+G4 selected by minimizing the BIC criterion [40].

Time-to-results

The time-to-results of the cgMLST and phenotypic approaches were computed based on 1440 and 1113 isolates, respectively. cgMLST timing started with the date of receipt, and included strain isolation and growth, DNA extraction/purification, library preparation and sequencing on the sequencing core facility of Institut Pasteur, Paris (France), as well as demultiplexing, pre-processing, assembly and cgMLST allele calling. Phenotypic time-to-result was calculated from isolate receipt to taxonomic and bioserotype assignment as described above. To compare the time-to-results of both methods, a Student's t-test was performed.

Results

Phylogenetic structure of the genus

A phylogenetic analysis of 236 genomes representing the diversity of the genus was performed based on the concatenation of 500 multiple sequence alignments, revealing a neat structuration of into strongly demarcated clades (Fig. 1). Although most lineages fitted with phenotypic species assignments, a number of taxonomic inconsistencies and yet undescribed groups were uncovered. The potential taxonomic rank of each demarcated clade was, therefore, evaluated based on ANI (Table S2) in the light of the proposed bacterial species delineation cut-off of 95–96 % ANI [44, 45]. As a first step, the use of a stringent delineation cut-off value of 96 % ANI led to 26 groups being defined, labelled as follows: Yersinia aldovae; Yersinia aleksiciae; Yersinia bercovieri; ; ; Yersinia frederiksenii 1, 2 and 3; ; Yersinia kristensenii 1, 2 and 3; Yersinia massiliensis lineages 1 and 2; Yersinia mollaretii lineages 1 and 2; Yersinia nurmii; Yersinia pekkanenii; (also containing and isolates); Yersinia rohdei; ; ; and NEW1 to NEW4 (Fig. 1, Table S2). Second, taxa with current standing in taxonomy, but belonging to the same ANI group as other taxa ( and ), were individualized as distinct groups. Therefore, genomes of our dataset were classified into 28 groups in total. Note that this classification is not a taxonomic proposal, but is used instead as an operational classification. The correspondence between phenotypic and genome-based phylogenetic classifications is summarized in Table 2 and provided for individual isolates in Table S1. We hereafter summarize these observations.

Fig. 1.

Maximum-likelihood phylogenetic tree of the genus based on 500 concatenated multiple sequence alignments. Only bootstrap-based branch support values >70 % are shown. Bar, 0.01 amino acid substitutions per character.

Table 2.

Correspondence between genotypic and phenotypic characterization

Genotypic characterization		Phenotypic characterization
Species	Lineage	Species	Infra-specific category
Y. enterocolitica	1Aa	Y. enterocolitica	Biotype 1A
	1Ab
	1B	Y. enterocolitica	Biotype 1B
	2/3-9a	Y. enterocolitica	Biotype 2/O:9
	2/3-9b
	2/3-5a	Y. enterocolitica	Bioserotype 2–3/O5,27
	2/3-5b
	3-3a	Y. enterocolitica	Bioserotype 3/O:3
	3-3b
	3–3 c
	3-3d
	4	Y. enterocolitica	Bioserotype 4/O:3
	5	Y. enterocolitica	Biotype 5
Y. pseudotuberculosis	1 to 32	Y. pseudotuberculosis	21 O-serotypes
Y. pestis		Y. pestis
Y. mollaretii	1 (sublineage 1a and 1b)	Y. mollaretii
	2
Y. frederiksenii 1		Y. frederiksenii	Diverse serotypes
Y. frederiksenii 2
Y. frederiksenii 3
Y. kristensenii 1		Y. kristensenii	Diverse serotypes
Y. kristensenii 2
Y. kristensenii 3
Y. massiliensis	1	Y. massiliensis
	2	Y. frederiksenii
Y. aldovae		Y. aldovae
Y. aleksiciae		Y. aleksiciae
Y. bercovieri		Y. bercovieri
Y. entomophaga		Y. entomophaga
Y. intermedia		Y. intermedia
Y. nurmii		Y. nurmii
Y. pekkanenii		Y. pekkanenii
Y. rohdei		Y. rohdei
Y. ruckeri		Y. ruckeri
Y. similis		Y. similis
Y. wautersii		Y. wautersii
NEW 1 ( Y. hibernica )		Yersinia sp.
NEW 2		Yersinia sp.
NEW 3		Y. enterocolitica	Biotype 1A
NEW 4		Y. enterocolitica	Biotype 1A

Correspondence between genotypic and phenotypic characterization Genotypic characterization Phenotypic characterization Species Lineage Species Infra-specific category 1Aa Biotype 1A 1Ab 1B Biotype 1B 2/3-9a Biotype 2/O:9 2/3-9b 2/3-5a Bioserotype 2–3/O5,27 2/3-5b 3-3a Bioserotype 3/O:3 3-3b 3–3 c 3-3d 4 Bioserotype 4/O:3 5 Biotype 5 1 to 32 21 O-serotypes 1 (sublineage 1a and 1b) 2 1 Diverse serotypes 2 3 1 Diverse serotypes 2 3 1 2 NEW 1 () sp. NEW 2 sp. NEW 3 Biotype 1A NEW 4 Biotype 1A Maximum-likelihood phylogenetic tree of the genus based on 500 concatenated multiple sequence alignments. Only bootstrap-based branch support values >70 % are shown. Bar, 0.01 amino acid substitutions per character. Non-human-pathogenic species , , , , , , , and each corresponded to a single clade (Fig. 1). In general, they were represented by a few isolates or showed little phylogenetic substructure. ANI values obtained within each group were all >96.6 %. In contrast, the maximum ANI value observed between the members of these species and their closest relatives was 94.7 % (Table S1). These observations are concordant with the species status of these nine members. Isolates identified as based on phenotypic characterization fell into two main lineages, one of them being further subdivided into two sublineages (Fig. 1). We defined lineage 1 as the one containing the type strain ATCC43969T. ANI values between the three lineages or sublineages were 95.6–96.6 % on average, whereas values within each of them were >98 %. The results show that represents a single clade, which could possibly be subdivided into three species and/or subspecies [46]. Isolates identified as based on phenotypic characterization fell into four lineages (Fig. 1). lineage 1 was defined as comprising the type strain ATCC33641T. Whereas lineages 2 and 3 were closely related to lineage 1, lineage 4 was in fact associated with (Fig. 1). ANI values among lineages 1 to 3 were 89.1 % on average, whereas ANI was >98.4 % within each lineage (Table S2). Lineage 4 (=Y. massiliensis lineage 2 in Fig. 1) displayed ANI values of 95.8 % on average with isolates phenotypically characterized as , including the type strain CIP109351T, whereas each of these two groups was homogeneous (ANI >98.6 %). While isolates of lineage 4 are positive for rhamnose fermentation, isolates are negative. This result shows that lineage 4 may be defined as a rhamnose-positive subspecies of Y. massiliensis, which we therefore labelled as lineage 2 (Fig. 1). Altogether, isolates that are currently identified phenotypically as correspond to four separate clades that could be regarded as distinct species. Isolates that were identified phenotypically as were monophyletic and subdivided into three main lineages (Fig. 1). Lineage 1 contained the type strain ATCC33638T; ANI values (Table S2) were 93.5 % among lineages but >98.9 % within lineage, showing that isolates actually correspond to three separate clades derived from a single ancestor and that may be considered as three species. The phylogenetic analysis revealed the existence of additional novel lineages. First, two unique lineages, which we call new species 1 (NEW 1) and new species 2 (NEW 2; Fig. 1), corresponded to isolates that could not be identified based on phenotypic characterization due to atypical metabolic characteristics. Lineages NEW 1 and NEW 2 were phylogenetic sister groups (Fig. 1) separated by 90.1 % ANI (Table S2). They comprised only one and two isolates, respectively (ANI within lineage NEW 2 : 99.9%). Second, two additional lineages, which we call NEW 3 and NEW 4 (Fig. 1), were made up by isolates phenotypically characterized as biotype 1A. These lineages comprised one and ten isolates, respectively, and were closely related to . ANI values between the two lineages were 94.7 % on average (>99.3 % within NEW 4), and were 93.9 and 94.2 %, respectively, with (Table S2). Altogether, these results show that lineages NEW 1 to NEW 4 may represent four additional species. Of note, NEW 1 was recently described as the new species Yersinia hibernica [10]: the ANI between its type strain (accession numbers of its chromosome and plasmid are CP032487 and CP032488) and strain IP37048 (NEW 1) is 99.9 %.

Population structure of

isolates clustered in a single clade (Fig. 1), and ANI values among them (ranging from 95.7 to 100 %) (Table S2) confirmed that they may be regarded as a single species. To provide a detailed view of the population structure of this species, a phylogenetic analysis was performed with 246 isolates (Fig. 2). Non-pathogenic isolates identified as biotype 1A by phenotypic characterization fell into two separate clades, which we call 1Aa and 1Ab. These two lineages represented the earliest diverging branches of , consistent with the hypothesis of evolution of pathogenic members from a non-pathogenic ancestor [47]. isolates of biotypes 1B, 4 and 5 clustered into three different clades that were concordant with biotype assignments (Fig. 2). In contrast, other bioserotypes were in fact subdivided into two or more unrelated sublineages. First, isolates identified as bioserotype 2–3/O:9 fell into two unrelated sublineages for which we propose the names 2/3-9a and 2/3-9b. Likewise, isolates identified as bioserotype 2–3/O:5 fell into two different sublineages for which the names 2/3-5a and 2/3-5b are proposed; sublineage 2/3-5b comprised a single strain, but was clearly distinct from sublineage 2/3-5a isolates (ANI 99.5 to 99.6 %, data not shown). Finally, isolates identified as bioserotype 3/O:3 fell into four different and clearly distinct sublineages for which the names 3-3a, 3-3b, 3–3 c and 3-3d are proposed. Whereas 3–3 c and 3-3d were associated with the biotype 4 clade, sublineages 3-3a and 3-3b were associated with sublineages 2/3-9a and 2/3-9b, respectively.

Fig. 2.

Maximum-likelihood phylogenetic tree of species based on 500 concatenated multiple sequence alignments. The tree was rooted with isolates from the groups NEW 3 and NEW 4 (not shown). Only bootstrap-based branch support values >70 % are shown. Bar, 0.001 amino acid substitutions per character.

Population structure of the complex

The species complex comprises , , and [11]. These four taxa clustered in a single clade in the phylogenetic tree (Fig. 1). Non-pathogenic isolates fell into a lineage that emerged first, with the three other species being tightly associated, consistent with previous works [11, 48]. ANI values (Table S2) among isolates were >99.4 %, and they differed by 95.0 % on average with isolates from the three other species. These results support as a separate species [7]. To provide details on the phylogenetic relationships among , and isolates, an analysis was conducted with 294 strains from the complex (Fig. 3). isolates fell into an early branching lineage clearly separated from the Y. pseudotuberculosis+Y. pestis clade. ANI values (Table S2) between these two lineages were 97.5 % on average (>99.2 % within each lineage), illustrating that was defined at the species rank despite being more related to than the 95–96 % cut-off value. Isolates phenotypically characterized as or fell into 33 different sublineages, one of which comprised all isolates (Fig. 3). Considering its extreme pathogenicity and its peculiar ecological niche, the sublineage remains defined as a separate species, even though it emerged from within as recognized previously [47, 49, 50]. The 32 other sublineages making up the population structure were analysed with respect to the distribution of O-serotypes. Several individual O-serotypes were observed in different sublineages. The most conspicuous case is serotype O:1, which was widely distributed across sublineages; but this was also the case for serotypes O:2 to O:5. Besides, isolates of a single sublineage (e.g. 5, 8, 11 and 14) could differ in their O-serotypes, cgMLST scheme, definition of thresholds and taxonomic assignment.

Fig. 3.

Maximum-likelihood phylogenetic tree of , and based on 500 concatenated multiple sequence alignments. The tree was rooted with (not shown). Only bootstrap-based branch support values >70 % are shown. Bar, 0.001 amino acid substitutions per character. AAG, Auto-agglutinable; NAG, non-agglutinable; NT, non-typable as does not harbour the O-antigen; O, O antigen serotype; Unk, unknown serotype. To capture the phylogenetic structure of using cgMLST, an intuitive and highly reproducible bacterial strain genotyping method was implemented [24, 26], for which species- and sublineage-specific thresholds were defined (see Methods and Fig. S1). This set of cut-off values allowed all reference isolates to be grouped in agreement with their phylogenetic classification. To evaluate this method for strain identification, 1843 isolates received prospectively at the YNRL between 2016 and mid-2017 were analysed by the automatic cgMLST taxonomic assignment procedure. Details of the isolates used for this evaluation are presented in Tables 3 and 4. For 1814 (98.4 %) isolates, genotypic assignment was consistent with phenotypic characterization (Table 3). Taxonomic assignments showed that the validation dataset comprised 669 non-pathogenic isolates belonging to nine different species, 1116 pathogenic isolates and 27 isolates.

Table 3.

Validation of the genotypic characterization

Consistent characterization with both methods was found for 1814 strains (out of 1843).

Phenotypic characterization			Genotypic characterization
Species	Biotype or serotype	No.	Species	Lineage	No.
Y. aleksiciae		2	Y. aleksiciae		2
Y. bercovieri		10	Y. bercovieri		10
Y. frederiksenii		45	Y. frederiksenii 1		3
			Y. frederiksenii 2		34
			Y. frederiksenii 3		8
Y. intermedia		18	Y. intermedia		18
Y. kristensenii		7	Y. kristensenii 1		3
			Y. kristensenii 3		4
Y. massiliensis		1	Y. massiliensis	1	1
Y. mollaretii		4	Y. mollaretii		4
Y. rohdei		9	Y. rohdei		9
Y. enterocolitica	1A	573	Y. enterocolitica	1Aa	565
				1Ab	8
	2/O:5–27	16	Y. enterocolitica	2/3-5a	16
	2/O:9	147	Y. enterocolitica	2/3-9a	11
				2/3-9b	136
	3/O:3	3	Y. enterocolitica	3-3b	2
				3–3 c	1
	4/O:3	950	Y. enterocolitica	4	950
Y. pseudotuberculosis	O:3	1	Y. pseudotuberculosis	14	1
	O:1	26	Y. pseudotuberculosis	2	3
				4	2
				7	4
				10	5
				12	2
				15	6
				16	3
				17	1
Yersinia sp.		2	NEW 2		2
Total		1814			1814

Validation of the genotypic characterization Consistent characterization with both methods was found for 1814 strains (out of 1843). Phenotypic characterization Genotypic characterization Species Biotype or serotype No. Species Lineage No. 2 2 10 10 45 1 3 2 34 3 8 18 18 7 1 3 3 4 1 1 1 4 4 9 9 1A 573 1Aa 565 1Ab 8 2/O:5–27 16 2/3-5a 16 2/O:9 147 2/3-9a 11 2/3-9b 136 3/O:3 3 3-3b 2 3–3 c 1 4/O:3 950 4 950 O:3 1 14 1 O:1 26 2 3 4 2 7 4 10 5 12 2 15 6 16 3 17 1 sp. 2 NEW 2 2 Total 1814 1814 Correspondence between the genotypic and phenotypic characterizations for the 29 non-concordant strains (out of 1843) Phenotypic characterization Genotypic characterization Species Biotype or serotype No. Species Lineage No. 1A 21 NEW4 21 4 2 4 1A/O:5 1 2 1 2/O:27 1 3–3 c 1 2/O:5–27 1 1Ab 1 3/O:3 1 4 1 Total 29 29 Only 29 (1.6 %) isolates received a genotypic assignment that was discordant with phenotypic identification. Based on phenotypic characterization, these isolates belong mainly to two species: and (Table 4). First, 21 isolates were phenotypically characterized as biotype 1A, whereas they actually belong to the new species temporarily called NEW 4. Second, four isolates phenotypically characterized as actually belong to the above-described lineage 2. Third, one non-pathogenic 1A/O:5 was identified as a isolate by cgMLST. New phenotypic and genotypic characterizations were performed and confirmed the initial discrepancy. Fourth, one isolate (IP38164) identified as a pathogenic 2/O5-27 was actually characterized by cgMLST as a lineage 1Ab (non-pathogenic) isolate. This isolate exhibited a positive reaction for pyrazinamidase activity, which is a hallmark of the non-pathogenic strains. Thus, we can consider that genotypic characterization allowed a correct taxonomic assignment of a non-pathogenic isolate. Fifth, isolate IP38493 phenotypically characterized as 3/O:3 was classified by cgMLST as lineage 4 (corresponding to bioserotype 4/O:3 based on phenotypic characterization). After investigation, it turned out that this isolate belongs to the phage type VIII, which is restricted to biotype 4 strains. We can conclude that isolate IP38493 was phenotypically misidentified due to an atypical positive xylose reaction, which is the only difference between biotypes 3 and 4. Finally, the last inconsistency was represented by isolate IP38064 identified as 2/O:27 but belonging in fact to lineage 3–3 c (corresponding to bioserotype 3/O:3). This isolate presented an atypical phenotype for this lineage, as all biotype 2 strains are either of serotype O:5 or O:9. After further analysis, it turned out that this isolate also had a positive O:3 antiserum reaction, and had an atypical positive reaction for indole production, which is the only difference between biotypes 2 and 3. We concluded that IP38064 belongs to bioserotype 3/O:3, consistent with its lineage 3–3 c genotypic assignment.

Table 4.

Correspondence between the genotypic and phenotypic characterizations for the 29 non-concordant strains (out of 1843)

Phenotypic characterization			Genotypic characterization
Species	Biotype or serotype	No.	Species	Lineage	No.
Y. enterocolitica	1A	21	NEW4		21
Y. frederiksenii		4	Y. massiliensis	2	4
Y. enterocolitica	1A/O:5	1	Y. frederiksenii 2		1
Y. enterocolitica	2/O:27	1	Y. enterocolitica	3–3 c	1
	2/O:5–27	1	Y. enterocolitica	1Ab	1
	3/O:3	1	Y. enterocolitica	4	1
Total		29			29

The time-to-results values for the two approaches were compared based on 1113 isolates phenotypically characterized in 2017 and on 1440 isolates characterized by cgMLST in 2018. Based on our current sequencing pipeline organization, cgMLST was slower by 1.3 days compared to phenotypic characterization (Fig. 4): 8.8 versus 7.5 days, respectively (P<0.0001). Whereas the time-to-results of the phenotypic characterization cannot be optimized, it would be possible to shorten the cgMLST process, as currently only two sequencing runs per week are performed on the core facility.

Fig. 4.

Time-to-results comparison between cgMLST and phenotypic methods. Calculations are based on 1113 isolates received in 2017 at the YNRL for the phenotypic characterization (mean=7.5 days) and 1440 isolates received in 2018 for the cgMLST (mean=8.8 days). Statistical analysis was performed with Student's t-test. ****, Difference highly significant, with P value <0.0001.

Phylogenetic tree of the genus based on seven-gene MLST

Classical seven-gene MLST can be used without access to high-throughput sequencing. Therefore, we evaluated the genus-wide seven-gene MLST scheme developed by McNally and colleagues [23] on the 236 genomes used to construct the 500-gene-based phylogenetic tree. The results showed that most clades, including potential novel species (see above), are neatly distinguished. Of note, isolates did not belong to a single clade based on seven-gene MLST data, in contrast to cgMLST results (Fig. 1). Moreover, and NEW 1+NEW 2 clades were not positioned at the same locations in the two trees (Figs 1 and S2), with higher bootstrap support values observed based on the 500 genes. These results show that seven-gene MLST is a reliable approach for identification, although it is expected to be less reliable for phylogenetic classification given that a much lower number of genes are used.

Discussion

By analysing the largest set of whole-genome sequences representing the diversity of to date, we have provided an updated view of the phylogenetic structure of this important bacterial genus. Our phylogenetic analysis confirms the neat demarcation of into clearly distinct clades, most of which correspond to previously defined species. However, this work also uncovered a number of clades that may represent entirely novel species, or novel subdivisions within existing taxa. Overall, we identified eight novel clades that appear to represent new species based on the current ANI-based bacterial species definition: two novel clades provisionally labelled as NEW 1 and NEW 2, which are currently unidentified using classical approaches, as well as: (i) two clades currently identified as and labelled 2 and 3 ( 1 is considered as Y. frederiksenii sensu stricto, because this lineage includes the type strain ATCC33641T; (ii) two clades currently identified as Y. kristensenii, labelled 2 and 3 ( 1 is considered as Y. kristensenii sensu stricto, because this lineage includes the type strain ATCC33638T); and (iii) two clades currently identified as (NEW 3 and NEW 4). Furthermore, we uncovered clear subdivisions into separate lineages within the species , , and . The clinical or epidemiological significance of these subdivisions largely remains to be defined. The results of our large genomic survey underline the need for a taxonomic update of the genus . Although the ANI metric is a useful guide for deciding on the taxonomic status of phylogenetic lineages, the rigid application of a single threshold for a given taxonomic rank (e.g. species) would lead to inconsistencies with current taxonomy (e.g. Y. and ). Besides ANI values, it is highly relevant to consider the biology of organisms, their phenotypic characteristics including pathogenesis, and their phylogenetic breadth and internal structure. Some of the clades or sublineages distinguished by cgMLST concur with new subgroups previously recognized using other methods. For instance, Reuter et al. [47] described that strains were separated into two species clusters, SC8 and 9, which appear to correspond to lineage 3 and lineage 1, respectively (Fig. 1). Moreover, strains previously classified as species cluster 14 [47] and genomospecies 2 [51] correspond to our lineage 4/Y. massiliensis lineage 2. Finally, the recently described novel species [10] corresponds to clade NEW 1. Reference phenotyping approaches failed to recognize most novel clades and lineages. Occasional evolution of biochemical traits within species can result in some strains presenting atypical phenotypes, sometimes leading to misidentification, and isolates with the same phenotype may actually belong to different species on the genome level (Table S1). This clearly illustrates the limitations of current phenotypic characterizations of some of the species. As phenotypic characterization, which is currently the most widely used approach, is labour-intensive and time-consuming (full characterization requires 7 to 8 days), alternative reference identification methods are clearly needed, rather than developing novel discriminant biochemical traits. The phenotypic Vitek GN card or MS (MALDI-TOF) represent alternative identification methods, but are not fully reliable for species [52, 53]. In addition, phenotypic identification at the infra-specific level and, thus, prediction of a strain’s pathogenic potential, is not possible. For instance, neither species of the complex nor biotype 1A nor 1B strains of can be distinguished using MALDI-TOF MS [11, 17, 20]. Moreover, reference database updates would be needed for these methods to allow identification of novel taxa. We implemented a genus-wide cgMLST genotyping system based on 500 highly conserved protein-encoding genes that were assembled with no nucleotide ambiguities, had no paralogue and presented limited size variation. These stringent criteria were used to maximize genotyping reliability and, therefore, led to a subset of genes much smaller than the actual core genome of . Whereas cgMLST schemes with larger number of core genes are typically used to maximize discrimination for epidemiology and surveillance of single species [26, 54], the purpose of our cgMLST scheme is species-level and bioserotype-level identifications. Although most cases are sporadic, some outbreaks have been described and subtyping methods were operational in these cases to identify contamination sources [55-57]. Whether our 500-gene cgMLST scheme could also be a useful tool in epidemiological investigations of outbreaks remains to be evaluated. The cgMLST approach presents the advantages of automation, standardization, reproducibility and simplicity of interpretation, which are important criteria for the use of genome sequence data in clinical microbiology [24, 58]. Using cgMLST genotypes (allelic profiles), we defined species- and lineage-specific thresholds for identification of strains. While the initial definition of thresholds was based on the maximum cgMLST distance observed within groups of reference isolates, these thresholds were relaxed based on observed maximal distances (Table S3) when analysing prospective isolates. This allowed us to increase the identification rate and could be further adapted in the future if more distant genotypes are encountered within some lineages (Table S4). Given that is so strongly structured phylogenetically, it is possible to increase the identification thresholds with no negative impact on reliability. Our evaluation using a prospective set of 1843 genomes produced in the framework of a national surveillance programme demonstrated the power of this approach for identification at species- and infra-species bio-serotype levels. Furthermore, our cgMLST approach allows the clear distinction of all the species of the complex (, , and ). The correspondence established in our system between phenotypic and genotypic characterization (Tables 2 and S4) enables backward compatibility [58-60] of the cgMLST strategy with classical characterization, which is still extensively used worldwide. Furthermore, the cgMLST approach enables unambiguous classification of phenotypically atypical isolates. To make the cgMLST approach broadly available, the BIGSdb was set up and made available at https://bigsdb.pasteur.fr/yersinia/. Genomic sequences can be analysed directly from the BIGSdb interface (Fig. S3). Alternately, cgMLST profiles and their attached identification information can be downloaded locally and used internally for confidential characterization of genomic sequences. A recent survey revealed that national reference laboratories are increasingly using whole-genome sequence typing methods for surveillance of communicable diseases [61]. The method developed here has the potential to become a universally shared reference method for the identification of isolates worldwide.

Data bibliography

1. Savin C, Martin L, Bouchier C, Filali S, Chenau J et al. GenBank BioProjects PRJEB3982, PRJEB3984, PRJEB3985, PRJEB3986, PRJEB3987, PRJEB3988, PRJEB3989, PRJEB3990, PRJEB3991 and PRJEB3992 (2014). 2. Reuter S, Connor TR, Barquist L, Walker D, Feltwell T et al. ENA studies PRJEB2116 and PRJEB2117 (2014). 3. Nguyen SV, Muthappa DM, Hurley D, Donoghue O, McCabe E et al. GenBank accession numbers CP032487 and CP032488 (2019). Click here for additional data file. Click here for additional data file.

57 in total

1. Population structure of the Yersinia pseudotuberculosis complex according to multilocus sequence typing.

Authors: Riikka Laukkanen-Ninios; Xavier Didelot; Keith A Jolley; Giovanna Morelli; Vartul Sangal; Paula Kristo; Carina Brehony; Priscilla F M Imori; Hiroshi Fukushima; Anja Siitonen; Galina Tseneva; Ekaterina Voskressenskaya; Juliana P Falcao; Hannu Korkeala; Martin C J Maiden; Camila Mazzoni; Elisabeth Carniel; Mikael Skurnik; Mark Achtman
Journal: Environ Microbiol Date: 2011-09-27 Impact factor: 5.491

2. Towards a genome-based taxonomy for prokaryotes.

Authors: Konstantinos T Konstantinidis; James M Tiedje
Journal: J Bacteriol Date: 2005-09 Impact factor: 3.490

3. Identification of Yersinia species by the Vitek GNI card.

Authors: H J Linde; H Neubauer; H Meyer; S Aleksic; N Lehn
Journal: J Clin Microbiol Date: 1999-01 Impact factor: 5.948

4. Use of whole-genus genome sequence data to develop a multilocus sequence typing tool that accurately identifies Yersinia isolates to the species and subspecies levels.

Authors: Miquette Hall; Marie A Chattaway; Sandra Reuter; Cyril Savin; Eckhard Strauch; Elisabeth Carniel; Thomas Connor; Inge Van Damme; Lakshani Rajakaruna; Dunstan Rajendram; Claire Jenkins; Nicholas R Thomson; Alan McNally
Journal: J Clin Microbiol Date: 2014-10-22 Impact factor: 5.948

5. Yersinia entomophaga sp. nov., isolated from the New Zealand grass grub Costelytra zealandica.

Authors: Mark R H Hurst; S Anette Becher; Sandra D Young; Tracey L Nelson; Travis R Glare
Journal: Int J Syst Evol Microbiol Date: 2010-05-21 Impact factor: 2.747

6. Pyrazinamidase activity in Yersinia enterocolitica and related organisms.

Authors: K Kandolo; G Wauters
Journal: J Clin Microbiol Date: 1985-06 Impact factor: 5.948

Review 7. MLST revisited: the gene-by-gene approach to bacterial genomics.

Authors: Martin C J Maiden; Melissa J Jansen van Rensburg; James E Bray; Sarah G Earle; Suzanne A Ford; Keith A Jolley; Noel D McCarthy
Journal: Nat Rev Microbiol Date: 2013-09-02 Impact factor: 60.633

8. BIGSdb: Scalable analysis of bacterial genome variation at the population level.

Authors: Keith A Jolley; Martin C J Maiden
Journal: BMC Bioinformatics Date: 2010-12-10 Impact factor: 3.169

9. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries.

Authors: Chirag Jain; Luis M Rodriguez-R; Adam M Phillippy; Konstantinos T Konstantinidis; Srinivas Aluru
Journal: Nat Commun Date: 2018-11-30 Impact factor: 14.919

10. Identification of Yersinia enterocolitica isolates from humans, pigs and wild boars by MALDI TOF MS.

Authors: Katarzyna Morka; Jarosław Bystroń; Jacek Bania; Agnieszka Korzeniowska-Kowal; Kamila Korzekwa; Katarzyna Guz-Regner; Gabriela Bugla-Płoskońska
Journal: BMC Microbiol Date: 2018-08-17 Impact factor: 3.605

11 in total

1. Genetic diversity and spatial distribution of Burkholderia mallei by core genome-based multilocus sequence typing analysis.

Authors: Sandra Appelt; Anna-Maria Rohleder; Daniela Jacob; Heiner von Buttlar; Enrico Georgi; Katharina Mueller; Ulrich Wernery; Joerg Kinne; Marina Joseph; Shantymol V Jose; Holger C Scholz
Journal: PLoS One Date: 2022-07-06 Impact factor: 3.752

2. A Whole-Genome-Based Gene-by-Gene Typing System for Standardized High-Resolution Strain Typing of Bacillus anthracis.

Authors: Mostafa Y Abdel-Glil; Alexandra Chiaverini; Giuliano Garofolo; Antonio Fasanella; Antonio Parisi; Dag Harmsen; Keith A Jolley; Mandy C Elschner; Herbert Tomaso; Jörg Linde; Domenico Galante
Journal: J Clin Microbiol Date: 2021-06-18 Impact factor: 5.948

Review 3. What do we know about osmoadaptation of Yersinia pestis?

Authors: Sébastien Bontemps-Gallo; Jean-Marie Lacroix; Florent Sebbane
Journal: Arch Microbiol Date: 2021-12-08 Impact factor: 2.552

4. A Core Genome Multilocus Sequence Typing Scheme for Streptococcus mutans.

Authors: Shanshan Liu; Xiaoliang Li; Zhenfei Guo; Hongsheng Liu; Yu Sun; Yudong Liu; Qinglong Wang; Shengkai Liao; Kai Zhang
Journal: mSphere Date: 2020-07-08 Impact factor: 4.389

5. Comparative genomic insights into Yersinia hibernica - a commonly misidentified Yersinia enterocolitica-like organism.

Authors: Scott Van Nguyen; Dechamma Mundanda Muthappa; Athmanya K Eshwar; James F Buckley; Brenda P Murphy; Roger Stephan; Angelika Lehner; Séamus Fanning
Journal: Microb Genom Date: 2020-07-23

Review 6. Yersiniosis in New Zealand.

Authors: Lucia Rivas; Hugo Strydom; Shevaun Paine; Jing Wang; Jackie Wright
Journal: Pathogens Date: 2021-02-10

7. In vitro and in silico parameters for precise cgMLST typing of Listeria monocytogenes.

Authors: Federica Palma; Iolanda Mangone; Anna Janowicz; Alexandra Moura; Alexandra Chiaverini; Marina Torresi; Giuliano Garofolo; Alexis Criscuolo; Sylvain Brisse; Adriano Di Pasquale; Cesare Cammà; Nicolas Radomski
Journal: BMC Genomics Date: 2022-03-26 Impact factor: 3.969

8. Obtaining Specific Sequence Tags for Yersinia pestis and Visually Detecting Them Using the CRISPR-Cas12a System.

Authors: Gang Chen; Yufei Lyu; Dongshu Wang; Li Zhu; Shiyang Cao; Chao Pan; Erling Feng; Weicai Zhang; Xiankai Liu; Yujun Cui; Hengliang Wang
Journal: Pathogens Date: 2021-05-06

9. Yersinia artesiana sp. nov., Yersinia proxima sp. nov., Yersinia alsatica sp. nov., Yersina vastinensis sp. nov., Yersinia thracica sp. nov. and Yersinia occitanica sp. nov., isolated from humans and animals.

Authors: Anne-Sophie Le Guern; Cyril Savin; Hilde Angermeier; Sylvie Brémont; Dominique Clermont; Estelle Mühle; Petya Orozova; Hristo Najdenski; Javier Pizarro-Cerdá
Journal: Int J Syst Evol Microbiol Date: 2020-08-27 Impact factor: 2.747

Review 10. Emerging Evasion Mechanisms of Macrophage Defenses by Pathogenic Bacteria.

Authors: Clarisse Leseigneur; Pierre Lê-Bury; Javier Pizarro-Cerdá; Olivier Dussurget
Journal: Front Cell Infect Microbiol Date: 2020-09-25 Impact factor: 5.293