Xiaomei Zhang, Michael Payne, Thanh Nguyen, Sandeep Kaur, Ruiting Lan.
Abstract
Shigella and enteroinvasive Escherichia coli (EIEC) cause human bacillary dysentery with similar invasion mechanisms and share similar physiological, biochemical and genetic characteristics. Differentiation of Shigella from EIEC is important for clinical diagnostic and epidemiological investigations. However, phylogenetically, Shigella and EIEC strains are composed of multiple clusters and are different forms of E. coli, making it difficult to find genetic markers to discriminate between Shigella and EIEC. In this study, we identified 10 Shigella clusters, seven EIEC clusters and 53 sporadic types of EIEC by examining over 17000 publicly available Shigella and EIEC genomes. We compared Shigella and EIEC accessory genomes to identify cluster-specific gene markers for the 17 clusters and 53 sporadic types. The cluster-specific gene markers showed 99.64% accuracy and more than 97.02% specificity. In addition, we developed a freely available in silico serotyping pipeline named Shigella EIEC Cluster Enhanced Serotype Finder (ShigEiFinder) by incorporating the cluster-specific gene markers and established Shigella and EIEC serotype-specific O antigen genes and modification genes into typing. ShigEiFinder can process either paired-end Illumina sequencing reads or assembled genomes and almost perfectly differentiated Shigella from EIEC with 99.70 and 99.74% cluster assignment accuracy for the assembled genomes and read mapping respectively. ShigEiFinder was able to serotype over 59 Shigella serotypes and 22 EIEC serotypes and provided a high specificity of 99.40% for assembled genomes and 99.38% for read mapping for serotyping. The cluster-specific gene markers and our new serotyping tool, ShigEiFinder (installable package: https://github.com/LanLab/ShigEiFinder, online tool: https://mgtdb.unsw.edu.au/ShigEiFinder/), will be useful for epidemiological and diagnostic investigations.
Keywords: Shigella; cluster-specific gene markers; enteroinvasive E. coli; phylogenetic clusters; serotyping
Year: 2021 PMID: 34889728 PMCID: PMC8767346 DOI: 10.1099/mgen.0.000704
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Shigella and EIEC cluster identification phylogenetic tree. Representative isolates from the identification dataset were used to construct the phylogenetic tree using Quicktree v1.3 [65] to identify Shigella and EIEC (enteroinvasive E. coli) clusters and visualized using Grapetree. The dendrogram shows the phylogenetic relationships of 1879 Shigella and EIEC isolates represented in the identification dataset. Branch lengths are on a log scale for clarity. Bar, 0.2 substitutions per site. Shigella and EIEC clusters are coloured. Numbers in square brackets after the cluster name are the number of isolates for each identified cluster. CSP indicates sporadic EIEC lineages. ECOR is the E. coli reference collection. E. albertii is Escherichia albertii, which was included to show the location of 'typical' S. boydii serotype 13 strains. CSS, CSB12, CSB13, CSD1, CSD8 and CSD10 are the clusters of S. sonnei, S. boydii serotype 12, S. boydii serotype 13, S. dysenteriae serotype 1, S. dysenteriae serotype 8 and S. dysenteriae serotype 10 respectively.
Fig. 2. In silico serotyping pipeline workflow. Schematic of in silico serotyping of Shigella and EIEC (enteroinvasive E. coli) by cluster-specific genes combined with the ipaH gene, O antigen and modification genes and H antigen genes, implemented in ShigEiFinder. Both assembled genomes and raw reads are accepted as data input. The dotted arrows show the cutoff value applied for initial gene filtering. WGS, whole-genome sequencing; HK, housekeeping; SS, SF, SB and SD, S. sonnei, S. flexneri, S. boydii and S. dysenteriae respectively. The abbreviated 'species' name plus the serotype number is the designation of a serotype (e.g. S. dysenteriae serotype 1 is abbreviated as SD1). For SB11, there were two sequence types (STs), ST5475 and ST1765, located within clusters C1 and C2 respectively.
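The cluster-assignment step of this workflow can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the marker gene names are placeholders (the actual genes are not listed in this text), the rule that every gene of a set must be detected is assumed, and so is the use of ipaH only as a fallback when no cluster matches. It is not the ShigEiFinder implementation.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical marker sets: placeholder gene names, not the real markers.
CLUSTER_MARKERS: Dict[str, FrozenSet[str]] = {
    "C1": frozenset({"c1_gene_a", "c1_gene_b"}),
    "C2": frozenset({"c2_gene_a", "c2_gene_b"}),
    "C7": frozenset({"c7_gene"}),
}


def assign_cluster(detected_genes: Set[str], has_ipah: bool) -> str:
    """Assign an isolate to a cluster from the genes detected in it.

    Assumed rule: a cluster matches only if all of its marker genes
    are present among the detected genes.
    """
    hits = sorted(
        name for name, genes in CLUSTER_MARKERS.items()
        if genes <= detected_genes
    )
    if len(hits) == 1:
        return hits[0]
    if len(hits) > 1:
        return "multiple clusters: " + ", ".join(hits)
    # No cluster matched: fall back on ipaH (assumed behaviour).
    return "unclustered" if has_ipah else "not Shigella/EIEC"


if __name__ == "__main__":
    print(assign_cluster({"c7_gene", "unrelated_gene"}, has_ipah=True))
    print(assign_cluster({"c1_gene_a"}, has_ipah=True))
    print(assign_cluster(set(), has_ipah=False))
```

The four possible outcomes (one cluster, several clusters, no cluster with ipaH, no cluster without ipaH) mirror the assignment categories reported for the tool, but the decision order used here is a guess.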
Summary of identified Shigella and EIEC clusters and outliers in the identification dataset
| Clusters (no. of serotypes)* | No. of isolates | No. of STs | No. of rSTs | Serotypes |
|---|---|---|---|---|
| C1 (25) | 288 | 36 | 166 | SB1–4, SB6, SB8, SB10, SB14, SB18, SB11†, SB19–20†; SD3–7, SD9, SD11–13, SD14–15†, SD96-26b†; SF6 |
| C2 (9) | 101 | 19 | 56 | SB5, SB7, SB9, SB11, SB15, SB16, SB17; SD2, SD-E670-74†; SD2 |
| C3 (20) | 744 | 81 | 437 | SF1a, SF1b, SF1c (7a), SF2a, SF2b, SF3a, SF3b, SF4a, SF4av, SF4b, SF4bv, SF5a, SF5b, SF7b, SFX, SFXv (4c), SFY, SFYv, SF novel serotype; SB-E1621-54† |
| C4 (9) | 51 | 6 | 21 | O28ac:H-, O28ac:H7, O136:H7, O164:H-, O164:H7, O29:H4, O173:H7, O124:H7, O132:H7† |
| C5 (6) | 62 | 4 | 15 | O121:H30, O124:H30, O164:H30, O132:H21, O152:H30, O152:H- |
| C6 (3) | 20 | 2 | 6 | O143:H26, O167:H26, O112ac:H26† |
| C7 | 10 | 1 | 3 | O144:H25 |
| C8‡ | 12 | 2 | 1 | O96:H19 |
| C9‡ | 4 | 1 | 2 | O8:H19 |
| C10‡ | 2 | 1 | 1 | O135:H30 |
| CSS | 427 | 39 | 294 | S. sonnei |
| CSD1 | 70 | 8 | 56 | SD1 |
| CSD8 | 7 | 3 | 3 | SD8 |
| CSD10 | 2 | 2 | 1 | SD10 |
| CSB12‡ | 8 | 2 | 6 | SB12 |
| CSB13 | 7 | 3 | 3 | SB13 |
| CSB13-atypical‡ | 5 | 3 | 3 | SB13 |
| Sporadic EIEC lineages‡ | 59 | 49 | 53 | 53 antigen types |
*Numbers in parentheses are the number of serotypes within that cluster.
†Serotypes were inconsistent with previous analyses.
‡Clusters identified as new clusters in this study.
The sensitivity and specificity of cluster-specific genes in the identification dataset (1969 isolates)
| Cluster | Cluster-specific genes (single/sets) | No. of isolates | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|
| C1 | Set of 4 genes | 288 | 100 | 99.94* |
| C2 | Set of 3 genes | 101 | 100 | 100 |
| C3 | Set of 3 genes | 744 | 100 | 99.59* |
| C4 | Set of 2 genes | 51 | 100 | 100 |
| C5 | Set of 3 genes | 62 | 100 | 100 |
| C6 | Set of 2 genes | 20 | 100 | 100 |
| C7 | Single gene | 10 | 100 | 100 |
| C8 | Set of 2 genes | 12 | 100 | 100 |
| C9 | Set of 2 genes | 4 | 100 | 100 |
| C10 | Single gene | 2 | 100 | 100 |
| CSS | Set of 5 genes | 427 | 100 | 99.87* |
| CSD1 | Set of 2 genes | 70 | 100 | 100 |
| CSD8 | Single gene | 7 | 100 | 100 |
| CSD10 | Single gene | 2 | 100 | 100 |
| CSB12 | Single gene | 8 | 100 | 100 |
| CSB13 | Single gene | 7 | 100 | 100 |
| CSB13-atypical | Single gene | 5 | 100 | 100 |
| 53 Sporadic EIEC lineages | Single gene/lineage | 59 | 100 | 100 |
*A cluster-specific gene set specificity of less than 100% was due to at least one FP found in that set.
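The relationship between false positives (FPs) and the specificities below 100% can be sketched with the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The FP counts in this example are not reported in the text; they are back-calculated assumptions, and the sketch further assumes that every isolate of the 1969-isolate identification dataset outside a cluster counts as a negative for that cluster. The function names are illustrative.

```python
TOTAL_ISOLATES = 1969  # size of the identification dataset


def sensitivity(tp: int, fn: int) -> float:
    """Percentage of true cluster members that carry the marker set."""
    return 100.0 * tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Percentage of non-members that lack the marker set."""
    return 100.0 * tn / (tn + fp)


def cluster_specificity(cluster_size: int, false_positives: int) -> float:
    """Specificity for one cluster, treating all other isolates as negatives."""
    negatives = TOTAL_ISOLATES - cluster_size
    return specificity(negatives - false_positives, false_positives)


# cluster: (no. of isolates from the table, ASSUMED no. of FPs)
assumed = {"C1": (288, 1), "C3": (744, 5), "CSS": (427, 2)}

if __name__ == "__main__":
    for name, (size, fps) in assumed.items():
        print(name, round(cluster_specificity(size, fps), 2))
```

Under these assumptions, one, five and two FPs reproduce the tabulated 99.94%, 99.59% and 99.87% for C1, C3 and CSS respectively. This is consistent with the footnote but does not confirm the actual FP counts.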
The accuracy of ShigEiFinder with the identification dataset and validation dataset
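| ShigEiFinder assignments | Identification dataset (1969 isolates)*: assembled genome | Identification dataset (1969 isolates)*: read mapping | Validation dataset (15501 isolates): assembled genome | Validation dataset (15501 isolates): read mapping |
|---|---|---|---|---|
| Shigella/EIEC clusters | 1871 | 1848 | 15455 | 15471 |
| Multiple clusters | 9 | 6 | 33 | 7 |
| Shigella/EIEC unclustered | 0 | 8 | 13 | 23 |
| Not Shigella/EIEC | 89 | 89 | 0 | 0 |
| Accuracy† | 99.54% | 99.28% | 99.70% | 99.81% |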
*Reads were not available for 18 EIEC isolates downloaded from NCBI in the identification dataset. The identification dataset has 90 non-Shigella/EIEC isolates, including 72 ECOR isolates and 18 E. albertii isolates. One E. albertii isolate was assigned to SB13 by ShigaTyper and grouped into cluster SB13 on the phylogenetic tree.
†Accuracy was defined as the number of Shigella and EIEC isolates being correctly assigned to a cluster over the total number tested.
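As a rough check of this definition, the sketch below recomputes the validation-dataset accuracies from the counts in the table above. It assumes that every isolate in the first count row was assigned to the correct cluster and that the denominator is the sum of the first three count rows (the fourth row is 0 for the validation dataset). The function name is illustrative.

```python
def accuracy(correct: int, total: int) -> float:
    """Percentage of tested isolates that were correctly assigned."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Validation dataset counts, in table order:
# (single-cluster assignments, multiple-cluster assignments, remaining isolates)
validation = {
    "assembled genome": (15455, 33, 13),
    "read mapping": (15471, 7, 23),
}

if __name__ == "__main__":
    for mode, counts in validation.items():
        # Assumption: all single-cluster assignments are correct.
        print(mode, round(accuracy(counts[0], sum(counts)), 2))
```

Both column sums equal 15501, and the ratios round to the tabulated 99.70% and 99.81%.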
The assignments of 15501 validation isolates by ShigEiFinder and ShigaTyper
| ShigEiFinder assignment | ShigaTyper: agreement with ShigEiFinder | ShigaTyper discrepant: Shigella | ShigaTyper discrepant: EIEC | ShigaTyper discrepant: non-assignment* | Total |
|---|---|---|---|---|---|
| SS | 1515 | 0 | 7465 | 19 | 8999 |
| SF | 4644 | 0 | 117 | 71 | 4832 |
| C1 and C2 (SB and SD) | 1004 | 0 | 17 | 151 | 1172 |
| SB12 | 4 | 0 | 0 | 2 | 6 |
| SB13 | 1 | 0 | 0 | 0 | 1 |
| SB13-atypical | 2 | 0 | 0 | 0 | 2 |
| SD1 | 80 | 0 | 244 | 2 | 326 |
| SD8 | 2 | 0 | 1 | 0 | 3 |
| SD10 | 0 | 0 | 0 | 1 | 1 |
| EIEC | 101 | 1 | 0 | 0 | 102 |
| Sporadic EIEC lineages | 0 | 1 | 15 | 0 | 16 |
| Multiple clusters | 0 | 0 | 5 | 2 | 7 |
| Shigella/EIEC unclustered | 0 | 23 | 11 | 0 | 34 |
| Total | 7353 | 25 | 7875 | 248 | 15501 |
*Non-assignment: multiple wzx genes and non-prediction.
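The internal consistency of this comparison table can be checked with a short sketch that re-adds the rows and columns and derives the overall agreement rate. The numbers are copied from the table in row order; the variable names are illustrative, and the agreement rate is a derived figure, not one stated in the text.

```python
# Each row: (agreement, discrepant col 1, discrepant col 2, discrepant col 3, total),
# listed in the same order as the table rows, excluding the final "Total" row.
rows = [
    (1515, 0, 7465, 19, 8999),
    (4644, 0, 117, 71, 4832),
    (1004, 0, 17, 151, 1172),
    (4, 0, 0, 2, 6),
    (1, 0, 0, 0, 1),
    (2, 0, 0, 0, 2),
    (80, 0, 244, 2, 326),
    (2, 0, 1, 0, 3),
    (0, 0, 0, 1, 1),
    (101, 1, 0, 0, 102),
    (0, 1, 15, 0, 16),
    (0, 0, 5, 2, 7),
    (0, 23, 11, 0, 34),
]
reported_totals = (7353, 25, 7875, 248, 15501)

# Every row should add up to its own "Total" cell.
row_sums_ok = all(sum(r[:4]) == r[4] for r in rows)

# Every column should add up to the reported "Total" row.
column_sums = tuple(sum(col) for col in zip(*rows))

# Share of validation isolates on which the two tools agree.
agreement_rate = 100.0 * column_sums[0] / column_sums[4]

if __name__ == "__main__":
    print("rows consistent:", row_sums_ok)
    print("columns consistent:", column_sums == reported_totals)
    print("agreement (%):", round(agreement_rate, 2))
```

Rows and columns both add up to the reported totals, and the two tools agree on roughly 47.4% of the 15501 validation isolates. Most of the disagreement comes from the first row (SS).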