Kyrylo Bessonov, Chad Laing, James Robertson, Irene Yong, Kim Ziebell, Victor P J Gannon, Anil Nichani, Gitanjali Arya, John H E Nash, Sara Christianson.
Abstract
Escherichia coli is a priority foodborne pathogen of public health concern and phenotypic serotyping provides critical information for surveillance and outbreak detection activities. Public health and food safety laboratories are increasingly adopting whole-genome sequencing (WGS) for characterizing pathogens, but it is imperative to maintain serotype designations in order to minimize disruptions to existing public health workflows. Multiple in silico tools have been developed for predicting serotypes from WGS data, including SRST2, SerotypeFinder and EToKi EBEis, but these tools were not designed with the specific requirements of diagnostic laboratories, which include: speciation, input data flexibility (fasta/fastq), quality control information and easily interpretable results. To address these specific requirements, we developed ECTyper (https://github.com/phac-nml/ecoli_serotyping) for performing both speciation within Escherichia and Shigella, and in silico serotype prediction. We compared the serotype prediction performance of each tool on a newly sequenced panel of 185 isolates with confirmed phenotypic serotype information. We found that all tools were highly concordant, with 92-97 % for O-antigens and 98-100 % for H-antigens, and ECTyper having the highest rate of concordance. We extended the benchmarking to a large panel of 6954 publicly available E. coli genomes to assess the performance of the tools on a more diverse dataset. On the public data, there was a considerable drop in concordance, with 75-91 % for O-antigens and 62-90 % for H-antigens, and ECTyper and SerotypeFinder being the most concordant. This study highlights that in silico predictions show high concordance with phenotypic serotyping results, but there are notable differences in tool performance. ECTyper provides highly accurate and sensitive in silico serotype predictions, in addition to speciation, and is designed to be easily incorporated into bioinformatic workflows.
Keywords: E. coli; enteric pathogens; in silico serotyping; public health; serotyping
Year: 2021 PMID: 34860150 PMCID: PMC8767331 DOI: 10.1099/mgen.0.000728
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Flowchart outlining the major stages within ECTyper. Input can be either raw reads or assemblies. If the ‘--verify’ parameter is specified, species identification is performed using MASH to determine the closest representative genome in NCBI RefSeq. Antigen predictions only proceed if the species is E. coli. For raw-read input, a preprocessing stage aligns the reads against curated databases of the genes used to predict O- and H-antigens and produces a consensus sequence. After the preprocessing stage, reads and assemblies are processed in the same way. The best-matching alleles for each gene are identified using blastn based on both %identity and %coverage values. A final report is output in tab-delimited format with the summary QC values (Table 1). See the Methods section for further details.
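The allele-selection step described in the caption (ranking blastn hits by %identity and %coverage) can be illustrated with a minimal Python sketch. The `Hit` record, the `best_allele` helper, the allele names, the identity × coverage score and the threshold values are all assumptions made for illustration; none of them is taken from the ECTyper source code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Hit:
    """One blastn alignment of an antigen allele (hypothetical record)."""
    gene: str      # antigen gene, e.g. "wzx" or "fliC"
    allele: str    # allele identifier in the reference database (made up here)
    pident: float  # percent identity, 0-100
    pcov: float    # percent of the allele length covered, 0-100


def best_allele(hits: Iterable[Hit], gene: str,
                min_pident: float = 90.0,
                min_pcov: float = 90.0) -> Optional[Hit]:
    """Return the top-scoring hit for `gene`, or None if no hit passes.

    Both thresholds and the identity x coverage score are illustrative
    assumptions, not ECTyper's actual values.
    """
    candidates = [h for h in hits
                  if h.gene == gene
                  and h.pident >= min_pident
                  and h.pcov >= min_pcov]
    return max(candidates, key=lambda h: h.pident * h.pcov, default=None)


hits = [
    Hit("wzx", "O157-wzx-1", 99.8, 100.0),
    Hit("wzx", "O26-wzx-3", 91.0, 95.0),
    Hit("fliC", "H7-fliC-2", 99.5, 100.0),
    Hit("fliC", "H11-fliC-1", 85.0, 60.0),  # fails both thresholds
]

print(best_allele(hits, "wzx").allele)   # O157-wzx-1
print(best_allele(hits, "fliC").allele)  # H7-fliC-2
print(best_allele(hits, "wzy"))          # None (no hit for this gene)
```

Filtering on both thresholds before ranking means a short but near-identical fragment cannot outrank a full-length match, which is one reason to consider identity and coverage together.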
Table 1. Quality control values and their assignments based on the nine scenarios

| Value | Scenario |
|---|---|
| PASS (REPORTABLE) | Both O- and H-antigen alleles meet or exceed the minimum %identity and %coverage individual allele thresholds, and a single serogroup is predicted for both O and H |
| FAIL (-:- TYPING) | A sample is |
| WARNING (WRONG SPECIES) | A sample is non- |
| WARNING MIXED O-TYPE | A mixed O-antigen call is predicted, requiring further wet-lab confirmation (e.g. O17/O77/O73/O106) |
| WARNING (-:H TYPING) | A sample is |
| WARNING (O:- TYPING) | A sample is |
| WARNING (O NON-REPORT) | O-antigen alleles do not meet minimum %identity or %coverage thresholds |
| WARNING (H NON-REPORT) | H-antigen alleles do not meet minimum %identity or %coverage thresholds |
| WARNING (O and H NON-REPORT) | Both O- and H-antigen alleles do not meet individual minimum %identity or %coverage thresholds |

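
The QC assignments in the table above amount to a small decision procedure, which the following Python sketch makes explicit. The `Call` categories, the `qc_flag` function and the order in which the checks are applied are assumptions introduced here. Where a scenario description is incomplete, its meaning is inferred from the label alone (for example, "-:H" is read as "no O-antigen call, H-antigen called"), which is also an assumption rather than ECTyper's documented behaviour.

```python
from enum import Enum


class Call(Enum):
    """Outcome of typing one antigen (hypothetical categories)."""
    TYPED = "typed"            # single call, thresholds met
    MIXED = "mixed"            # several candidate O-antigen calls
    BELOW = "below_threshold"  # alleles found, thresholds not met
    ABSENT = "absent"          # no allele found


def qc_flag(is_ecoli: bool, o: Call, h: Call) -> str:
    """Map species and antigen outcomes to a QC label from the table.

    The precedence of the checks is an assumption of this sketch.
    """
    if h is Call.MIXED:
        raise ValueError("the table defines a mixed call only for the O-antigen")
    if not is_ecoli:
        return "WARNING (WRONG SPECIES)"
    if o is Call.ABSENT and h is Call.ABSENT:
        return "FAIL (-:- TYPING)"
    if o is Call.MIXED:
        return "WARNING MIXED O-TYPE"
    if o is Call.BELOW and h is Call.BELOW:
        return "WARNING (O and H NON-REPORT)"
    if o is Call.ABSENT:
        return "WARNING (-:H TYPING)"
    if h is Call.ABSENT:
        return "WARNING (O:- TYPING)"
    if o is Call.BELOW:
        return "WARNING (O NON-REPORT)"
    if h is Call.BELOW:
        return "WARNING (H NON-REPORT)"
    return "PASS (REPORTABLE)"


print(qc_flag(True, Call.TYPED, Call.TYPED))   # PASS (REPORTABLE)
print(qc_flag(True, Call.MIXED, Call.TYPED))   # WARNING MIXED O-TYPE
print(qc_flag(False, Call.TYPED, Call.TYPED))  # WARNING (WRONG SPECIES)
```

Checking the species first mirrors the flowchart, in which antigen prediction only proceeds for E. coli.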
In silico serotype prediction benchmarking on newly sequenced isolates. A total of 185 isolates with complete serotype information for both O- and H-antigens were used to benchmark the performance of four in silico prediction tools (Table S3). PM, perfect match; AM, ambiguous match; IP, incorrect prediction; NP, no prediction.

| Tool | O-antigen PM | O-antigen AM | O-antigen IP | O-antigen NP | H-antigen PM | H-antigen AM | H-antigen IP | H-antigen NP |
|---|---|---|---|---|---|---|---|---|
| ECTyper (assembly) | 163 | 16 | 0 | 6 | 185 | 0 | 0 | 0 |
| ECTyper (reads) | 163 | 14 | 1 | 7 | 185 | 0 | 0 | 0 |
| SerotypeFinder | 154 | 19 | 0 | 12 | 183 | 1 | 0 | 1 |
| EToKi EBEis | 145 | 30 | 2 | 8 | 182 | 0 | 0 | 3 |
| SRST2 | 169 | 0 | 14 | 2 | 181 | 0 | 1 | 3 |

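
The counts in the table above can be turned into per-tool concordance percentages with a few lines of Python. This excerpt does not state how concordance was computed, so the sketch assumes that both perfect and ambiguous matches count as concordant, i.e. (PM + AM) / total. The `BENCHMARK_185` dictionary and the `concordance` function are names introduced here for illustration.

```python
# Counts copied from the table above, ordered as (PM, AM, IP, NP).
BENCHMARK_185 = {
    "ECTyper (assembly)": {"O": (163, 16, 0, 6),  "H": (185, 0, 0, 0)},
    "ECTyper (reads)":    {"O": (163, 14, 1, 7),  "H": (185, 0, 0, 0)},
    "SerotypeFinder":     {"O": (154, 19, 0, 12), "H": (183, 1, 0, 1)},
    "EToKi EBEis":        {"O": (145, 30, 2, 8),  "H": (182, 0, 0, 3)},
    "SRST2":              {"O": (169, 0, 14, 2),  "H": (181, 0, 1, 3)},
}


def concordance(counts):
    """Percentage of samples with a perfect or ambiguous match.

    Treating AM as concordant is an assumption of this sketch.
    """
    pm, am, ip, no_pred = counts
    return 100.0 * (pm + am) / (pm + am + ip + no_pred)


for tool, antigens in BENCHMARK_185.items():
    o_pct = concordance(antigens["O"])
    h_pct = concordance(antigens["H"])
    print(f"{tool:20s} O: {o_pct:5.1f} %   H: {h_pct:5.1f} %")
```

Under this assumed definition the O-antigen values span roughly 91 to 97 % and the H-antigen values roughly 98 to 100 %, close to the ranges quoted in the abstract.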
In silico serotype prediction benchmarking on a large public dataset. A total of 6954 samples with complete or partial serotype information were used to benchmark four tools for in silico serotype prediction accuracy (Table S4). Since complete antigen information was not available for all samples, there were 6905 samples with O-antigen information and 3722 samples with designated H-antigens. PM, perfect match; AM, ambiguous match; IP, incorrect prediction; NP, no prediction.

| Tool | O-antigen PM | O-antigen AM | O-antigen IP | O-antigen NP | H-antigen PM | H-antigen AM | H-antigen IP | H-antigen NP |
|---|---|---|---|---|---|---|---|---|
| ECTyper (assembly) | 6071 | 176 | 595 | 63 | 3309 | 0 | 396 | 17 |
| ECTyper (reads) | 5756 | 149 | 588 | 412 | 3303 | 0 | 390 | 29 |
| SerotypeFinder | 5896 | 262 | 586 | 161 | 3308 | 25 | 386 | 3 |
| EToKi EBEis | 5288 | 572 | 555 | 490 | 3304 | 7 | 393 | 18 |
| SRST2 | 5125 | 45 | 559 | 1176 | 2269 | 12 | 265 | 1176 |
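
Because the public-dataset table has different denominators for the two antigens, it is worth checking that every row sums to the totals given in the caption (6905 for O, 3722 for H) before comparing tools. The Python sketch below does that and then reports the share of samples for which each tool made no prediction. `PUBLIC`, `EXPECTED_TOTALS`, `row_total_mismatches` and `no_prediction_rate` are names introduced here for illustration.

```python
# Counts copied from the table above, ordered as (PM, AM, IP, NP).
PUBLIC = {
    "ECTyper (assembly)": {"O": (6071, 176, 595, 63),   "H": (3309, 0, 396, 17)},
    "ECTyper (reads)":    {"O": (5756, 149, 588, 412),  "H": (3303, 0, 390, 29)},
    "SerotypeFinder":     {"O": (5896, 262, 586, 161),  "H": (3308, 25, 386, 3)},
    "EToKi EBEis":        {"O": (5288, 572, 555, 490),  "H": (3304, 7, 393, 18)},
    "SRST2":              {"O": (5125, 45, 559, 1176),  "H": (2269, 12, 265, 1176)},
}

# Sample totals stated in the table caption.
EXPECTED_TOTALS = {"O": 6905, "H": 3722}


def row_total_mismatches(table, expected):
    """List (tool, antigen, total) for rows that do not sum to the expected total."""
    return [(tool, antigen, sum(counts))
            for tool, antigens in table.items()
            for antigen, counts in antigens.items()
            if sum(counts) != expected[antigen]]


def no_prediction_rate(counts):
    """Percentage of samples for which the tool returned no prediction."""
    return 100.0 * counts[3] / sum(counts)


print(row_total_mismatches(PUBLIC, EXPECTED_TOTALS))  # [] when every row is consistent

for tool, antigens in PUBLIC.items():
    print(f"{tool:20s} O NP: {no_prediction_rate(antigens['O']):5.1f} %   "
          f"H NP: {no_prediction_rate(antigens['H']):5.1f} %")
```

Separating "no prediction" from "incorrect prediction" in this way shows whether a tool's lower concordance comes from missing calls or from wrong ones.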