James Robertson, Catherine Yoshida, Peter Kruczkiewicz, Celine Nadon, Anil Nichani, Eduardo N Taboada, John Howard Eagles Nash.
Abstract
Public health and food safety institutions around the world are adopting whole genome sequencing (WGS) to replace conventional methods of characterizing Salmonella for surveillance and outbreak response. Falling costs and increased throughput of WGS have resulted in an explosion of data, but questions remain as to the reliability and robustness of those data. Because serovar information is critically important to public health, reliable serovar assignments must be available for all Salmonella records. The current study systematically assessed and curated all Salmonella genomes in the Sequence Read Archive (SRA) to determine the state of the data and their utility. A total of 67 758 genomes were assembled de novo and quality-assessed for their assembly metrics as well as their species and serovar assignments. A total of 42 400 genomes passed all of the quality criteria, but 30.16 % of genomes were deposited without serotype information. These data were used to compare the concordance of reported and predicted serovars for two in silico prediction approaches, multi-locus sequence typing (MLST) and the Salmonella in silico Typing Resource (SISTR), whose predictions were fully concordant with the reported serovar for 87.51 and 91.91 % of the tested isolates, respectively. Concordance of the in silico predictions increased when serovar variants were grouped together, to 89.25 % for MLST and 94.98 % for SISTR. This study represents the first large-scale validation of serovar information in public genomes and provides a large validated set of genomes that can be used to benchmark new bioinformatics tools.
Keywords: Public Health; Salmonella; phenotype prediction; serotyping; surveillance; whole genome sequencing
Year: 2018 PMID: 29338812 PMCID: PMC5857378 DOI: 10.1099/mgen.0.000151
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Table 1. Draft genome assemblies were put through each quality control filter in parallel; the number of genomes failing each filter is listed below.
In total, 25 358 of the 67 758 input genomes failed one or more filters. Because the filters were applied in parallel, a genome can fail more than one of them, so the per-filter counts sum to more than this cumulative total. Genomes with more than 10 % of their assembled bases from non-Salmonella sources failed the contamination check. Assemblies with a size <4 Mb or >6 Mb failed the size criterion. Assemblies were also excluded if they met any of the following conditions: >500 contigs, a largest contig <100 kb, more than 1 % ambiguous bases, or an N50 <50 kb. Assemblies that did not have ≥300 cgMLST genes were excluded, as were genomes with a GC content <50 % or >54 %. Finally, genomes that were not associated with a serovar in the SRA were not analysed further in the comparison of the in silico tools. N50 is the contig length such that contigs of that length or longer make up 50 % of the assembly.
| Reason for exclusion | Count | Percentage of failing genomes |
|---|---|---|
| Contamination | 1267 | 5.00 |
| Genome size | 1937 | 7.64 |
| Number of contigs | 1094 | 4.31 |
| Largest contig | 674 | 2.66 |
| Ambiguous | 299 | 1.18 |
| Missing O-/H-antigen | 695 | 2.74 |
| cgMLST allele count | 1443 | 5.69 |
| N50 | 2087 | 8.23 |
| GC content | 1017 | 4.01 |
| Missing serovar assignment | 20 446 | 80.63 |
| Cumulative fail | 25 358 | |
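The filtering logic described in the caption above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the metric field names (`non_salmonella_pct`, `size_bp`, and so on) and the two example genomes are hypothetical, the thresholds are copied from the caption text, and the "Missing O-/H-antigen" filter is left out because the caption gives no criterion for it.

```python
from typing import Dict, List


def n50(contig_lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half the assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def failed_filters(m: Dict) -> List[str]:
    """Return the name of every filter the assembly fails (empty list means pass).

    All filters are evaluated independently, mirroring the 'in parallel' design,
    so one genome can appear under several reasons for exclusion.
    """
    checks = {
        "Contamination": m["non_salmonella_pct"] > 10,
        "Genome size": not (4_000_000 <= m["size_bp"] <= 6_000_000),
        "Number of contigs": m["n_contigs"] > 500,
        "Largest contig": m["largest_contig_bp"] < 100_000,
        "Ambiguous": m["ambiguous_pct"] > 1,
        "N50": m["n50_bp"] < 50_000,
        "cgMLST allele count": m["cgmlst_genes"] < 300,
        "GC content": not (50 <= m["gc_pct"] <= 54),
        "Missing serovar assignment": not m.get("reported_serovar"),
    }
    return [name for name, failed in checks.items() if failed]


# Hypothetical example metrics (invented values, for illustration only).
genomes = {
    "good": {
        "non_salmonella_pct": 0.5, "size_bp": 4_800_000, "n_contigs": 60,
        "largest_contig_bp": 450_000, "ambiguous_pct": 0.0, "n50_bp": 210_000,
        "cgmlst_genes": 330, "gc_pct": 52.1, "reported_serovar": "Enteritidis",
    },
    "fragmented": {
        "non_salmonella_pct": 0.2, "size_bp": 4_700_000, "n_contigs": 900,
        "largest_contig_bp": 40_000, "ambiguous_pct": 0.1, "n50_bp": 9_000,
        "cgmlst_genes": 310, "gc_pct": 52.0, "reported_serovar": None,
    },
}

failures = {gid: failed_filters(m) for gid, m in genomes.items()}
# A genome is counted once in the cumulative total, however many filters it fails.
cumulative_fail = sum(1 for reasons in failures.values() if reasons)
```

Here the `fragmented` genome fails four filters but contributes only one to `cumulative_fail`, which is why the per-filter counts in the table can exceed the cumulative figure.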
Fig. 1. Draft genome assemblies were run through SISTR to determine how many of the 330 cgMLST genes were present in each assembly. A histogram of the number of cgMLST genes found per genome was produced for the entire dataset (67 758 genomes) without any quality control. The analysis was then repeated on the genomes that passed all of the sequence quality filters except the requirement for >320 cgMLST genes.
Fig. 2. Bubble graph of the intra-serovar diversity of the filtered list of Salmonella genomes. The size of each bubble indicates the relative number of genomes reported to belong to that serovar. Each bubble is coloured according to the number of unique 330-gene cgMLST profiles found within that serovar; darker bubbles indicate a higher number of unique cgMLST profiles.
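The two quantities plotted in this kind of bubble graph (genomes per serovar and unique cgMLST profiles per serovar) could be tallied roughly as follows. This is a minimal sketch with invented toy records; the function name `serovar_diversity` is hypothetical, and the toy profiles use three loci rather than the 330 genes of the real scheme.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def serovar_diversity(
    records: Iterable[Tuple[str, Sequence[int]]]
) -> Dict[str, Tuple[int, int]]:
    """Map each serovar to (number of genomes, number of unique cgMLST profiles)."""
    genome_counts = defaultdict(int)
    profiles = defaultdict(set)
    for serovar, profile in records:
        genome_counts[serovar] += 1
        # Profiles are stored as tuples so identical allele calls collapse in the set.
        profiles[serovar].add(tuple(profile))
    return {s: (genome_counts[s], len(profiles[s])) for s in genome_counts}


# Toy records: (reported serovar, allele numbers at each cgMLST locus).
toy_records = [
    ("Enteritidis", [1, 4, 7]),
    ("Enteritidis", [1, 4, 7]),
    ("Enteritidis", [1, 4, 9]),
    ("Typhimurium", [2, 5, 8]),
]

diversity = serovar_diversity(toy_records)
```

In a plot, the first element of each tuple would drive the bubble size and the second the colour intensity.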
Fig. 3. Bubble graphs highlighting the bias of the SRA dataset towards the USA and UK, which results from their large surveillance programmes. The size of each bubble indicates the relative number of genomes listing that country as their collection source, and the colour of the bubble indicates the number of distinct serovars reported from that country.
Table 2. In total, 42 400 draft genomes that passed all of the quality criteria from Table 1 were examined for the concordance of in silico serovar predictions with the reported serovar.
Each genome was assigned to one of five categories according to the criteria established in the Methods.
| Category | SISTR | MLST |
|---|---|---|
| Type 0: Full match | 38 954 | 36 954 |
| Type 1: Incorrect reported serovar | 2115 | 1804 |
| Type 2: Serovar variant detected | 1305 | 891 |
| Type 3: Incorrect predicted serovar | 26 | 462 |
| Type 4: Untypeable | 0 | 2289 |
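Concordance rates can be derived from the category counts in the table above with simple arithmetic. The sketch below uses the counts exactly as listed in the table; treating "Type 0 + Type 2" as the concordance with serovar variants grouped together is an assumption about how the categories map onto the grouped figures, and the dictionary keys are hypothetical labels. Percentages computed this way depend on the table counts as given here and may differ slightly from rounded figures quoted elsewhere.

```python
from typing import Dict, Tuple

# Category counts copied from the table (keys are illustrative labels).
counts: Dict[str, Dict[str, int]] = {
    "SISTR": {
        "full_match": 38_954,
        "incorrect_reported": 2_115,
        "serovar_variant": 1_305,
        "incorrect_predicted": 26,
        "untypeable": 0,
    },
    "MLST": {
        "full_match": 36_954,
        "incorrect_reported": 1_804,
        "serovar_variant": 891,
        "incorrect_predicted": 462,
        "untypeable": 2_289,
    },
}


def concordance(tool_counts: Dict[str, int]) -> Tuple[float, float]:
    """Return (full-match %, full-match plus serovar-variant %) for one tool."""
    total = sum(tool_counts.values())
    full = 100 * tool_counts["full_match"] / total
    # Assumption: grouping variants means counting Type 2 calls as concordant.
    grouped = 100 * (tool_counts["full_match"] + tool_counts["serovar_variant"]) / total
    return full, grouped


for tool, tool_counts in counts.items():
    full, grouped = concordance(tool_counts)
    print(f"{tool}: full match {full:.2f} %, variants grouped {grouped:.2f} %")
```

Each column sums to the 42 400 genomes that passed quality control, so the same denominator applies to both tools.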
Fig. 4. Multiple sequence alignment of Oranienburg and Othmarschen FliC protein sequences with conserved ends removed. Amino acid variants are in bold and highlighted in yellow. Sequence labels consist of the reported serovar and the number of times that unique FliC protein was seen within that serovar.
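Identifying the variant positions that such an alignment figure highlights amounts to finding columns where the aligned sequences disagree. The following is a minimal sketch: the function name `variable_columns`, the labels, and the short toy sequences are all invented for illustration and are not real FliC sequences.

```python
from typing import Dict, List


def variable_columns(aligned: Dict[str, str]) -> List[int]:
    """Return 0-based alignment columns where at least two sequences differ."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) > 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*...) walks the alignment column by column.
    return [
        i for i, column in enumerate(zip(*aligned.values()))
        if len(set(column)) > 1
    ]


# Toy alignment (invented residues).
toy_alignment = {
    "serovar_A_variant_1": "AKLTQ",
    "serovar_A_variant_2": "AKMTQ",
    "serovar_B_variant_1": "ARLTQ",
}

variant_positions = variable_columns(toy_alignment)  # [1, 2]
```

Gap characters are treated like any other symbol here, so a column containing a gap in one sequence is also reported as variable.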
Fig. 5. Core genome SNP tree produced by Parsnp alignment of de novo assemblies of isolates that were reported to be Oranienburg or Othmarschen and for which SISTR confirmed that the antigenic formulas were accurate. Othmarschen isolates are highlighted in red.