
Retrieving Good-Quality Salmonella Genomes From the GenBank Database Using a Python Tool, SalmoDEST.

Emeline Cherchame1,2, Guy Ilango1,3, Sabrina Cadel-Six1.   

Abstract

With the advent of next-generation whole-genome sequencing (WGS), the need for good-quality and well-characterised Salmonella genomes has increased over the past years. Good-quality complete genomes are often required for assembly reference mapping or phylogenetic single nucleotide polymorphism (SNP) analysis. Complete genomes or contigs from specific sources or serovars are also sought for clustering analysis or source attribution studies. Therefore, new bioinformatics tools are needed for the extraction of good-quality and well-characterised genomes from public databases. Here, we developed SalmoDEST, an open-source Python tool capable of extracting Salmonella genomes with a coverage higher than 50x and a genome length over 4 Mb from the GenBank database in the form of complete genomes or contigs, with verification of the serovar to which they belong and identification of the corresponding multi-locus sequence type (MLST) profile. To validate the ability of SalmoDEST to screen for and retrieve genomes of good quality, we compared our results for S. Typhi complete genomes with those available in the literature and extracted Salmonella genomes of strains isolated from bovine sources worldwide. Finally, we provide in this study a list of 239 high-quality complete genomes covering 123 Salmonella serovars. SalmoDEST is a handy and easy-to-use open-source tool to extract complete genomes or contigs that can be routinely used in public health, food safety and research laboratories. SalmoDEST (SALMOnella Download gEnome Serotype sT) is available at https://github.com/I-Guy/SalmoDEST.
© The Author(s) 2022.

Keywords:  MLST profile determination; SalmoDEST; Salmonella; complete reference genomes; good-quality genomes; serovar prediction

Year:  2022        PMID: 35221678      PMCID: PMC8874161          DOI: 10.1177/11779322221080264

Source DB:  PubMed          Journal:  Bioinform Biol Insights        ISSN: 1177-9322


Introduction

The investigation of genetic markers or genome relationships between different pathogens and microorganisms requires good-quality genomes. A large panel of good-quality genomes makes it possible to study chromosome rearrangements in more detail, identify sequences of interest and improve the identification of genetic clustering. Among the most frequently consulted sequence databases for collecting genomes is the open-access GenBank database, hosted by the National Center for Biotechnology Information (NCBI). GenBank is an annotated collection of all publicly available nucleotide sequences generated by laboratories throughout the world from more than 100,000 distinct organisms. Release 242.0, produced in February 2021, contained over 12 trillion nucleotide bases in more than 2 billion sequences. To facilitate the retrieval of genomes of interest from the GenBank database, we designed a workflow (called SalmoDEST) to search and download genomes with a coverage greater than 50x. The options of this tool make it possible to download either complete genomes or contigs. Optionally, the user can also download protein fasta files and specify an output directory in which all the selected fasta files are kept.

The SalmoDEST tool was developed for Salmonella, a well-known and widely distributed foodborne pathogen. Salmonella enterica is regulated in the European Union (EU) and monitored in the United States (US) and many other countries. In the US, the economic burden due to salmonellosis is estimated to be US$3.66 billion per year. In 2016, the incidences of culture-confirmed cases of salmonellosis were 14.51 and 20.4 cases per 100,000 population in the US and the EU, respectively.[2,3] The economic, social and public health importance of diseases caused by Salmonella has led many developing and developed countries to supplement their monitoring systems with whole-genome sequencing (WGS) of the isolated strains and with clustering by single nucleotide polymorphism (SNP) core-genome analysis for outbreak and source attribution investigations. For countries that can carry out WGS, it is necessary to have access to Salmonella genomes from different regions of the world for which the serovar has been verified and the multi-locus sequence type (MLST) profile identified. For countries in which WGS is still not readily available, carrying out studies based on good-quality and well-identified open-access Salmonella genomes can prove to be an essential asset.

Materials and Methods

Workflow description

SalmoDEST is implemented as an open-source Python tool (https://github.com/I-Guy/SalmoDEST). It is based on a succession of two Python scripts and a Bash process (Figure 1).
Figure 1.

SALMOnella Download gEnome Serotype sT (SalmoDEST) pipeline.

ST, sequence type.

SalmoDEST is a workflow designed to search and download Salmonella genomes from the NCBI GenBank database using either the ncbi-acc-download tool for complete genomes or ncbi-genome-download for contigs. Using these tools, the first Python script ‘Get_HQ_Genome_1.py’ automatically downloads the genome fasta files of the strains whose accession numbers are listed in the input text file. Then, a Bash process runs SeqSero2, the MLSTseeman tool and Quast to carry out serovar prediction, MLST profile prediction and assembly quality assessment, respectively. The second Python script ‘Get_HQ_Genome2.py’ renames the downloaded fasta files, adding the accession number, the serovar and the MLST profile predictions as follows: antigenic formula or serovar name_ST_ID_Accession number (eg, Montevideo_81_42N_CP037893.1). The Python script ‘Get_HQ_Genome2.py’ also downloads the gff and gbk files and checks the quality of each genome. It retains only those with a coverage greater than 50x and a genome length longer than 4 Mb, and removes the others. Finally, this Python script compresses (zips) all files. Optionally, the user can download protein fasta files and choose an output directory in which all the selected fasta files are stored.
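The naming convention described above can be illustrated with a minimal sketch. The function name `build_genome_name` and its parameters are hypothetical and are not taken from the SalmoDEST source code; only the field order and the example values come from the text.

```python
def build_genome_name(serovar: str, st: str, strain_id: str, accession: str) -> str:
    """Join the four fields with underscores, in the order given in the text:
    antigenic formula or serovar name, ST, ID, accession number."""
    fields = [serovar, st, strain_id, accession]
    return "_".join(str(field).strip() for field in fields)


# Example values taken from the text (Montevideo, ST 81, ID 42N, CP037893.1)
print(build_genome_name("Montevideo", "81", "42N", "CP037893.1"))
```

With the example values from the text, this produces Montevideo_81_42N_CP037893.1. Antigenic formulas contain characters such as ':' and '[' that some file systems reject; how SalmoDEST handles them is not described here, so the sketch leaves them untouched.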

Get_HQ_Genome_1.py script

The input file of SalmoDEST and of the ‘Get_HQ_Genome_1.py’ script is a text file, obtained from an NCBI Nucleotide database query (https://www.ncbi.nlm.nih.gov/nuccore) or compiled by the user, listing the accession numbers of the complete genomes or contigs to download. If an NCBI Nucleotide database query is used, the ‘Complete Record’ must be exported to a destination ‘File’ in the ‘Accession List’ format sorted by ‘Default order’. In the ‘Get_HQ_Genome_1.py’ script, the function named ‘getFastafromNuccore’ downloads the fasta files and writes the accession numbers of the downloaded fasta files to a tsv file. The function named ‘Renamer’ renames every fasta file as ‘ID_Accession.fasta’ and creates a folder with the same name, to which it moves the fasta file. The function named ‘Filter1Genome’ runs only if the user chooses the ‘complete genome mode’, and the function named ‘Filter1Contig’ runs only if the user chooses the ‘contigs mode’. These two functions copy the accession numbers of the fasta files to a tsv file named ‘Genome_HQ.tsv’. Then, they count the number of contigs in every fasta file and report it in a second tsv file named ‘Genome_HQ_Filter1.tsv’. If the ‘complete genome mode’ is selected, all fasta files with more than one contig are discarded.
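The contig-counting step can be sketched as follows. This is an illustration under stated assumptions, not the code of ‘Filter1Genome’ or ‘Filter1Contig’: the function names are hypothetical, the mode letters 'g' and 'c' mirror the ‘-m g’ and ‘-m c’ options mentioned later in the text, and discarding empty files is an assumption.

```python
from typing import Iterable


def count_contigs(fasta_lines: Iterable[str]) -> int:
    """Count FASTA records; every record starts with a '>' header line."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def keep_after_filter1(n_contigs: int, mode: str) -> bool:
    """Decide whether a fasta file survives the first filter.

    mode 'g' (complete genome): files with more than one contig are discarded.
    mode 'c' (contigs): any number of contigs is accepted.
    Dropping files with zero records is an assumption of this sketch.
    """
    if mode == "g":
        return n_contigs == 1
    if mode == "c":
        return n_contigs >= 1
    raise ValueError(f"unknown mode: {mode!r}")


# Toy two-record FASTA content, for illustration only
example = [">contig_1\n", "ACGT\n", ">contig_2\n", "GGCC\n"]
n = count_contigs(example)
print(n, keep_after_filter1(n, "g"), keep_after_filter1(n, "c"))
```

Because `count_contigs` accepts any iterable of lines, an open file handle can be passed directly, for example `with open(path) as handle: count_contigs(handle)`.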

Get_HQ_Genome2.py

The ‘Get_HQ_Genome2.py’ script runs after the Bash process has run the SeqSero2, MLSTseeman and Quast tools. The function named ‘ReadSeqSero’ reads the results from the SeqSero2 tool and retrieves the accession numbers of the genomes and the serovar predictions, with the associated probabilities. Similarly, the function named ‘ReadMLST’ reads the results from the MLSTseeman tool and stores accession numbers and MLST profiles. The function named ‘ReadQuast’ reads the results from the Quast tool and retrieves the length, the N50 value and the number of contigs of each genome. The function named ‘MergeResult’ merges all the information from the previous functions (ie, serovar predictions, MLST profiles, number of contigs, length, N50 and genome size) along with the information from ‘Genome_HQ_Filter1.tsv’ (ie, the file produced by the ‘Get_HQ_Genome_1.py’ script) in a third tsv file named ‘TableMerge.tsv’. The function named ‘GetGBK’ downloads the gbk (GenBank) files associated with the fasta files. The function named ‘Renamer2’ moves the gbk files to the folder containing the fasta files and renames them according to the fasta file names. The function named ‘Filter2’ generates a fourth tsv file called ‘TableMergeFilter2.tsv’ with the keys (ie, accession numbers) of all genomes that have a coverage higher than 50x (> 50x), based on the gbk files, and a length longer than 4 Mb (> 4 Mb). It also adds information on the sequencing technology used to this tsv file. The function named ‘GetGFF’ downloads the gff files. The function named ‘RenamerGFF_FASTAprot’ renames the gff files and protein fasta files and moves them to the folder containing the fasta files. The function named ‘FinalRenamer’ renames every file and directory as described above (ie, antigenic formula or serovar name_ST_ID_Accession number). The ‘Renamer’ functions can easily be modified at the user’s convenience. The function named ‘zipfiles’ compresses (zips) all the folders containing the downloaded files.
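The second filter can be sketched as below. This is not the code of ‘Filter2’: the function names are hypothetical, and the sketch assumes that coverage and sequencing technology are read from the ‘Genome Coverage ::’ and ‘Sequencing Technology ::’ fields of the structured comment found in many GenBank records. How SalmoDEST treats records without a coverage field is not stated in the text, so that choice is exposed as an explicit parameter.

```python
import re
from typing import Optional

MIN_COVERAGE = 50.0      # threshold from the text: coverage > 50x
MIN_LENGTH = 4_000_000   # threshold from the text: length > 4 Mb

# Assumed field layout of the GenBank structured comment
_COVERAGE_RE = re.compile(r"Genome Coverage\s*::\s*(\d+(?:\.\d+)?)")
_TECHNOLOGY_RE = re.compile(r"Sequencing Technology\s*::\s*(.+)")


def parse_coverage(gbk_text: str) -> Optional[float]:
    """Return the numeric coverage, or None when the field is absent."""
    match = _COVERAGE_RE.search(gbk_text)
    return float(match.group(1)) if match else None


def parse_technology(gbk_text: str) -> Optional[str]:
    """Return the sequencing technology string, or None when absent."""
    match = _TECHNOLOGY_RE.search(gbk_text)
    return match.group(1).strip() if match else None


def passes_filter2(coverage: Optional[float], length: int,
                   keep_if_missing: bool = False) -> bool:
    """Apply both thresholds; keep_if_missing is an assumption of this sketch."""
    if length <= MIN_LENGTH:
        return False
    if coverage is None:
        return keep_if_missing
    return coverage > MIN_COVERAGE


# Toy comment block, for illustration only
comment = "Genome Coverage       :: 120.5x\nSequencing Technology :: PacBio; Illumina\n"
print(parse_coverage(comment), parse_technology(comment))
```

Keeping the parsing separate from the threshold test makes it easy to report missing metadata in the output table instead of silently dropping a genome.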

Workflow application

In this study, we report two application examples for SalmoDEST. In the first example, we evaluate the ability of the SalmoDEST tool to download complete Salmonella genomes from the NCBI GenBank database and, in the second, its ability to download Salmonella genome contigs for strains isolated from bovine sources.

Selection of complete genomes from a public database

Complete reference genomes are often required for assembly reference mapping or phylogenetic SNP analysis, both for the mapping step and for the calculation of pairwise distances between genomes. Nevertheless, it may be difficult for a single laboratory to hold a complete set of reference genomes, particularly considering that Salmonella enterica is divided into six subspecies and that the genus comprises over 2000 serovars. The SalmoDEST tool was tested for its ability to search, download and select all complete Salmonella reference genomes available in the GenBank database. SalmoDEST applies a coverage filter set to a minimum of 50x. A second, manual filter is based on serovar identification: the serovars listed in the metadata were compared with the serovars predicted by SeqSero2 in the TableMergeFilter2 tsv file. In this study, SalmoDEST was tested using the list of accession numbers obtained with the NCBI ‘All Databases’ query: ‘Salmonella[title] AND Genome[title] AND Salmonella enterica[title] AND Genome Assembly and Annotation report[title]’ (https://www.ncbi.nlm.nih.gov/genome/browse/#!/prokaryotes/152/) with the filter ‘Complete’ (on 24 June 2021). A list of 1648 accession numbers was retrieved and, after eliminating duplicates, 1048 unique accession numbers remained (Supplementary Table S1). The SalmoDEST option for complete genome mode ‘-m g’ was used. Finally, after serovar prediction and genome length verification, 1040 genomes were retained and downloaded. Four tsv output files were produced, including the final TableMergeFilter2 tsv file (Supplementary Table S2).
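The deduplication of the exported accession list can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study; the function name is hypothetical, and the input is assumed to hold one accession number per line, as in the ‘Accession List’ export format.

```python
from typing import Iterable, List


def unique_accessions(lines: Iterable[str]) -> List[str]:
    """Strip whitespace, skip blank lines and drop repeated accession numbers
    while keeping the order of first appearance."""
    cleaned = (line.strip() for line in lines)
    return list(dict.fromkeys(acc for acc in cleaned if acc))


# Accession numbers taken from Table 1, with one deliberate repeat
raw = ["CP037893.1\n", "AE006468.2\n", "CP037893.1\n", "\n"]
print(unique_accessions(raw))
```

Using `dict.fromkeys` rather than a `set` preserves the original export order, which keeps the input file easy to compare with the NCBI query result.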

Selection of contig genomes from public database

Microbiologists need access to Salmonella serovar genomes from specific sources for many types of analyses, such as clustering analyses, source attribution studies or screening for molecular markers.[10-13] Obtaining genomes from laboratories around the world is therefore a major advantage. Here, we tested the ability of the SalmoDEST tool to obtain Salmonella genomes from strains isolated from bovine sources worldwide. The SalmoDEST tool was tested using the list of assembly accession numbers obtained with the NCBI ‘All Databases’ query: ‘Salmonella[title] AND Genome[title] AND Salmonella enterica[title] AND Genome Assembly and Annotation report[title]’ (https://www.ncbi.nlm.nih.gov/genome/browse/#!/prokaryotes/152/) with the following filters: ‘Contig’ AND ‘Bovine’ AND ‘bovine’ (on 24 June 2021). In total, 89 unique accession numbers were found (Supplementary Table S3). The SalmoDEST option for contig genome mode ‘-m c’ was used and, after the filtering process, 88 genomes were downloaded. Four tsv output files were created, including the final TableMergeFilter2 tsv file (Supplementary Table S4).

Results and Discussion

The NCBI Nucleotide query carried out on 7 June 2021 resulted in 1648 accessions. After deduplication, 1048 unique accessions were included in the input txt file and downloaded by the SalmoDEST tool that we developed here. All these complete genomes were checked for 50x coverage, genome length and agreement between the listed and predicted serovars. Finally, 1040 complete genomes of good quality were downloaded and their MLST profiles were determined. From the initial list of 1048 complete genomes in the input txt file, SalmoDEST excluded one genome (CP060132.1) for incorrect serovar prediction and seven others (OU015718.1, OU015719.1, OU015720.1, OU015717.1, LR792437.1, LR792391.1 and LN868943.1) for insufficient genome length (< 4 Mb, ranging from 277,503 to 3,746,274 bases). We obtained 16 genomes of S. enterica subsp. salamae, 10 of S. enterica subsp. arizonae, 13 of S. enterica subsp. diarizonae, 10 of S. enterica subsp. houtenae and 991 of S. enterica subsp. enterica, representing 135 serovars with different antigenic formulas. No S. enterica subsp. indica genomes with a coverage higher than 50x were found. Four serovars were overrepresented (ie, more than 50 complete genomes) in the GenBank database and in our results: S. Typhi, responsible for human typhoid fever (124/1040 genomes), and S. Enteritidis, S. Typhimurium and S. 4,[5],12:i:-, with 114/1040, 141/1040 and 56/1040 genomes, respectively. The latter three serovars are the most frequently isolated non-typhoidal Salmonella serovars worldwide. These serovars were followed by S. Heidelberg (40/1040), S. Newport (38/1040), S. Anatum (32/1040), S. Bareilly (30/1040), S. Indiana (22/1040), S. Montevideo (21/1040) and S. Senftenberg (20/1040) (Figure 2). Our results are consistent with CDC and EFSA reports.[14-18] Since 2016, these 11 serovars have belonged to the top 30 most frequently isolated serovars in the EU and the US.[14-18]
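A serovar tally of this kind can be derived from a merged results table with a short script. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the column name ‘Predicted_serovar’ is borrowed from the header of Table 1 and may differ from the actual header of the TableMergeFilter2 tsv file.

```python
import csv
import io
from collections import Counter
from typing import List


def serovar_counts(tsv_text: str, column: str = "Predicted_serovar") -> Counter:
    """Count how many rows (genomes) carry each predicted serovar."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader if row.get(column))


def overrepresented(counts: Counter, threshold: int = 50) -> List[str]:
    """Serovars with more than `threshold` genomes (more than 50 in the text)."""
    return sorted(name for name, n in counts.items() if n > threshold)


# Three-row toy table, for illustration only
table = (
    "Accession\tPredicted_serovar\n"
    "CP003278.1\tTyphi\n"
    "AL513382.1\tTyphi\n"
    "AE006468.2\tTyphimurium\n"
)
counts = serovar_counts(table)
print(counts.most_common(), overrepresented(counts, threshold=1))
```

On the full table, the default threshold of 50 would correspond to the definition of overrepresented serovars used above.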
Figure 2.

Histogram of serovar diversity among the 1040 complete Salmonella genomes downloaded from the NCBI GenBank database using the SalmoDEST tool developed in this study. Only serovars with more than five complete genomes and complete antigenic formula are shown, with the exception of S. 4,[5],12: i:- and S. 1,3,19:g, s,t:-.

To validate the ability of SalmoDEST to screen for and retrieve complete genomes of good quality, we compared our results for S. Typhi with those available in the literature. As expected, in accordance with the study published by Yap and Thong in 2017, SalmoDEST was able to recover 124 S. Typhi genomes. The SalmoDEST tool developed in this study succeeded in screening for and downloading good-quality reference genomes for S. Typhi, confirming its ability to make good-quality genomes available quickly. Finally, given the need for complete genomes for sequence assembly and for SNP phylogenetic analyses (ie, for mapping analyses and to calculate the pairwise distances between genomes), we assembled a panel of complete reference genomes for Salmonella from the SalmoDEST output obtained in this study. We selected 239 complete genomes from the initial 1040 genomes, with 10 S. enterica subsp. salamae, 8 S. enterica subsp. arizonae, 7 S. enterica subsp. diarizonae, 8 S. enterica subsp. houtenae and 206 S. enterica subsp. enterica, representing 123 serovars and 185 MLST profiles (Table 1 and Supplementary Table S5). When possible, the sequencing technology used for complete genome assembly (ie, both short and long reads) and the coverage were taken into account for the selection of the final panel. This panel of complete genomes can be used by microbiologists in food poisoning and typhoid investigations involving Salmonella spp.
Table 1.

List of good-quality complete Salmonella genomes (ID, serovar and MLST profile predictions) downloaded from the NCBI GenBank database on 28 June 2021.

Predicted_serovarMLST ProfileAccession numberPredicted_serovarMLST ProfileAccession numberPredicted_serovarMLST ProfileAccession number
1,3,19:g, s,t:-217CP038604.1II 56: b:z65324CP029995.1Oranienburg3613CP033344.1
Abaetetuba2041CP007532.1II 56: z10:e, n,x,z152403CP029992.1Orion684CP030235.1
Aberdeen426LS483453.1II 58: d:z63379CP070222.1Oslo1370CP030231.1
Abony1483CP007534.1II 58:l, z13,z28:-1141LS483477.1Ouakam1610CP022116.1
Adjame3929CP049881.1IIIa -: z4, z23:-106CP053584.1Panama48CP012346.1
Adjame4023CP054827.1IIIa 40: z4, z23:-6216CP041011.1Paratyphi A85CP000026.1
Agona13CP025452.1IIIa 41: z4, z23:-2131CP000880.1Paratyphi A129CP009049.1
Albany or Duesseldorf292CP019177.1IIIa 48: z36:-3711LR134150.1Paratyphi B28CP020492.1
Albert19CP044188.1IIIa 53: z4, z23,z32:-2127CP022504.1Paratyphi B var. L(+) tartrate +307CP000886.1
Anatum64CP029800.1IIIa 53: z4, z23:-874LR133910.1Paratyphi C or Choleraesuis or Typhisuis66AE017220.1
Anatum2167CP014620.1IIIa 62: z36:-2402CP006693.1Paratyphi C or Choleraesuis or Typhisuis68CP007639.1
Antsalova4407CP019116.1IIIa 63:g, z51:-1425CP029991.1Paratyphi C or Choleraesuis or Typhisuis90CP043773.1
ApapaCP019403.1IIIb 47: k:z351195CP053583.1Paratyphi C or Choleraesuis or Typhisuis114CP000857.1
Bareilly203CP063684.2IIIb 48: i:z574CP029989.1Paratyphi C or Choleraesuis or Typhisuis139CP012344.2
Bareilly909CP006053.1IIIb 50: k:z430CP059886.1Paratyphi C or Choleraesuis or Typhisuis145CP051366.1
Bareilly5146CP034721.1IIIb 60: r:z3457CP011289.1Pomona451CP019186.1
Bergen1356CP019405.1IIIb 60: z52:z532830CP030180.1Poona308CP046279.1
Berta435CP030005.1IIIb 61: i:z57LS483474.1Poona447CP037891.1
Birkenhead424CP045958.1IIIb 65: c:z1260CP022135.1Poona812LS483489.1
Bispebjerg251CP043027.1Indiana17CP028131.1Poona964CP019189.1
Blockley52CP043662.1Infantis32CP047881.1Quebec4409CP022019.1
Blukwa367LR134148.1Inverness1384CP019181.1Reading1628CP051307.1
Bovismorbificans142CP060517.1IrumuLR134144.1Rissen469CP030190.1
Bovismorbificans1499CP069297.1Isangi216CP030225.1Rubislaw94CP019192.1
Braenderup22CP022490.1IV -: z4, z23:-963LS483478.1Saintpaul27CP017723.1
Brancaster2133CP036166.1IV -: z4, z23:-3942CP051368.1Saintpaul49CP053055.1
Brandenburg65CP025280.1IV [1],40:g, z51:-2265CP053582.1Saintpaul50CP045954.1
Bredeney241CP043222.1IV 16: z4, z32:-596CP045761.1Saintpaul95CP023512.1
Bredeney897CP007533.1IV 41: z52:-3924CP054715.1Saintpaul680CP022491.1
Butantan600CP046278.1IV 45:g, z51:-107CP030194.1Saintpaul3602CP023166.1
Carmel2123LS483455.1IV 50:g, z51:-2882LR134159.1Sanjuan785LR134142.1
Cerro367CP008925.1IV 50: z4, z23:-2053CP053579.1SchoenebergLR134153.1
ChesterCP019178.1Javiana24CP004027.1Schwarzengrund96CP045447.1
CoelnLR134190.1Johannesburg471CP019411.1Schwarzengrund322CP001127.1
Concord534CP044177.1Kentucky152CP022500.1Senftenberg14CP038591.1
Concord599CP028196.1Kentucky198CP043667.1Senftenberg185CP016837.1
Corvallis1541CP027677.1Kisarawe906CP030203.1Senftenberg210AP020332.1
Cubana286CP006055.1Kottbus212CP062220.1Senftenberg290CP034233.1
Dakar5734CP046280.1Kottbus808CP030211.1Sloterdijk3179CP012349.1
DaytonaLR133909.1Krefeld1799CP019413.1Stanley29CP036167.1
Derby40CP028900.1Litchfield214CP030202.1Stanley1027LS483434.1
Derby71CP026609.1Litchfield491CP019414.1Stanleyville97CP017727.1
Derby72CP022494.1Livingstone2247CP030233.1Stanleyville1986CP034716.1
DjakartaCP019409.1LlandoffCP060585.1Stanleyville4762CP034700.1
Dublin10CP032393.1London155CP061159.1Sundsvall5323LS483457.1
Dublin4406CP019179.1London504CP064709.1Taksony2204LR134146.1
Enteritidis11CP063700.1Lubbock413CP032814.1Telelkebir450CP030217.1
Enteritidis3175CP008928.1Macclesfield4976CP022117.1Tennessee319CP014994.1
Florida931LS483454.1Manhattan18CP019418.1Thompson26CP012514.1
Fresno649CP032444.1Mbandaka413CP022489.1Typhi1CP003278.1
Gallinarum or Enteritidis78CP019035.1Mbandaka3016CP019183.1Typhi2AL513382.1
Gallinarum or Enteritidis92CP022963.1MenstonLS483490.1Typhi8LT904887.1
Gallinarum or Enteritidis136CP018633.1Miami85CP023470.1Typhi2138LT905088.1
Gallinarum or Enteritidis331AM933173.1Miami129CP009559.1Typhi2209CP029918.1
Gallinarum or Enteritidis1972CP045955.1Miami140CP023468.1Typhimurium19AE006468.2
Gallinarum or Enteritidis3304CP045956.1Mikawasima5372CP034713.1Typhimurium34CP045952.1
Gaminara2439CP024165.1Milwaukee1245CP030175.1Typhimurium36CP036168.1
Gaminara2440CP030288.1Minnesota548CP060508.1Typhimurium99CP020922.1
Gateshead6131CP046291.1Montevideo4CP069518.1Typhimurium128HG326213.1
Give516CP046277.1Montevideo81CP037893.1Typhimurium213CP035547.1
Give654CP019174.1Montevideo138CP040380.1Typhimurium302CP014356.1
Goldcoast or Brikama358CP062223.1Montevideo316CP029336.1Typhimurium313CP060169.1
Goldcoast or Brikama2529LR134158.1Muenchen83CP016014.1Typhimurium328CP025736.1
Grumpensis751CP030223.1Muenchen112CP045056.1Typhimurium568CP064919.1
Hadar33CP022069.2Muenchen112CP045063.1Typhimurium568LR862421.1
Havana1237LR134187.1Muenster321CP019198.1Typhimurium2066CP009102.1
Heidelberg15CP005995.1MuensterCP045038.1Typhimurium2210CP040562.1
Hidalgo or CocodyCP022663.1Napoli2095CP063140.1Typhimurium3631CP039854.1
HillingdonCP019410.1Newport5CP015923.1Typhimurium5036CP029840.1
Hvittingfoss434CP045831.1Newport31CP007559.2Typhimurium5401CP033226.2
Hvittingfoss446CP022503.1Newport45CP012598.1Uganda684CP051398.1
I 4,[5],12: i:-2379CP039610.1Newport118CP015924.1Virchow16CP045945.1
I 9: g, m,q:-2912CP019406.1Newport132CP025232.1Wandsworth1498CP019417.1
I 9: g, p,s:-10CP030207.1Newport166CP012144.1Waycross2460CP034707.1
II -: z:e, n,x,z153706LS483495.1Newport350CP016010.1Weltevreden365CP014996.1
II 40: z4, z24: z394415LS483456.1Newport4157CP039436.1Weltevreden2384LN890524.1
II 42: r:-1208CP034717.1Newport4166CP039437.1Weslaco1088LR134143.1
II 47: b:e, n,x,z153910CP053585.1Ohio329CP030181.1Worthington592CP029041.1
II 50: z:e, n,x1110LS483475.1Onderstepoort3102CP022034.1Yoruba1316CP030209.1
II 55: z39:k1121CP022139.1Oranienburg23CP019197.1

Salmonella contig genomes from bovine sources

Among the recognised pathogens causing human disease, almost 60% are of animal origin, and cattle bred for meat and for milk are common reservoirs of Salmonella spp. Almost 40% of a herd can be infected, and the risk of infection increases with the size of the herd.[22,23] Salmonellosis in cattle exposes producers to direct economic losses associated with mortality or body weight loss, and also to indirect losses caused by reduced feed conversion or veterinary care costs. Genomes from strains isolated from cattle can be used in source attribution studies, as well as in searches for specific host marker sequences. Our test successfully downloaded Salmonella genomes of strains isolated from bovine animals. From the initial input list of 89 genomes, the SalmoDEST tool was able to download 88 contig genomes of Salmonella isolated from bovine sources with a coverage of > 50x, a length of > 4 Mb and a correct serovar prediction. One genome (GCA_004744895.1) was excluded due to a genome length of < 4 Mb (Supplementary Tables S3 and S4). Fifty-two entries in the TableMergeFilter2.tsv file lacked information on coverage and sequence type because it was missing from the gbk files of the corresponding genomes. Interestingly, among the 88 contig genomes downloaded, the most represented serovars were S. Typhimurium (28/88 contig genomes), S. Newport (14/88) and S. Dublin (11/88). These three serovars are well known for contaminating bovine animals in the EU and the US.[18,20,22]

50x coverage

The value of 50x was chosen for Salmonella in the SalmoDEST tool following the recommendations of the European Centre for Disease Prevention and Control (ECDC). The amount of data generated per Salmonella isolate by a DNA sequencer is substantial (ie, megabytes) and a trade-off must be struck between genome coverage (ie, quality) and the size of the files generated. For example, although a coverage of 30x is typically sufficient for routine surveillance of foodborne pathogens, the appropriate coverage threshold is platform-dependent and may also vary by organism. ECDC has set a coverage of 50x for Salmonella, considering this value reasonable with respect to the corresponding file size. Coverage is frequently considered the main quality metric in WGS. Furthermore, the quality of genome sequences also has an impact on successful in silico serovar prediction. Missing or incomplete MLST and cgMLST loci sequences largely contribute to errors in identification.[6,26] Similarly, partial or missing antigenic data in the rfb region (ie, the O-antigen flippase and polymerase genes) and in the fliC and fljB genes influence in silico serovar prediction. Good coverage prevents poor MLST, cgMLST and antigenic data and contributes to the correct listing of the serovar.[6,26,27]
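The trade-off between coverage and data volume follows from the usual definition of mean coverage, namely the total number of sequenced bases divided by the genome length. The sketch below illustrates that arithmetic only; the function names are hypothetical, and the genome length (5 Mb) and read length (150 bases) are illustrative values that do not come from this study.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_length: int) -> float:
    """Mean coverage = total sequenced bases / genome length."""
    return n_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


GENOME_LENGTH = 5_000_000  # illustrative value, not from the study
READ_LENGTH = 150          # illustrative value, not from the study

print(mean_coverage(1_000_000, READ_LENGTH, GENOME_LENGTH))  # one million reads
print(reads_needed(50, READ_LENGTH, GENOME_LENGTH))          # reads for 50x
```

Since the number of reads grows linearly with the target coverage, moving from 30x to 50x increases the raw data volume by about two thirds for the same genome and read length.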

Errors in serotyping

Salmonella genomes from GenBank have already been shown to contain errors in the serovar listed in their metadata. In 2016, Yoshida et al carried out in silico serovar prediction on 4,291 genomes extracted from GenBank, and revealed that 3.5% gave incorrect serovar predictions and that 1.8% had missing or ambiguous metadata, making it impossible to ascertain the listed phenotypic serovar. For this reason, we integrated the Bash process in the SalmoDEST tool to run the SeqSero2 and MLSTseeman tools. SeqSero is a Web-based tool developed by the Centers for Disease Control and Prevention (CDC) in Atlanta, GA (US) for determining Salmonella serotypes using the rfb region and the fliC and fljB alleles.[6,28] SeqSero2 was chosen for three reasons: it is the only tool that relies on characterising the genetic determinants of Salmonella serovars without consulting other markers, such as MLST types; it saves time because it can predict serovars directly from raw sequencing reads and not only from assemblies; and it is able to detect inter-serovar contamination. MLSTseeman is a tool developed by Torsten Seemann that scans contig files against traditional PubMLST typing schemes. PubMLST was conceived as part of the development of the first MLST scheme in 1998 and can now include all levels of sequence data, from single gene sequences up to and including complete, finished genomes. Information on serovar and MLST type was integrated in SalmoDEST to enable genome verification and because it is integral to surveillance and outbreak investigations.
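The comparison between the serovar listed in the metadata and the predicted serovar can be sketched as follows. In the study this comparison is described as a manual filter, so the code is only an illustration of the logic. The function names are hypothetical, and the handling of multi-candidate predictions joined by ‘ or ’ (eg, ‘Gallinarum or Enteritidis’, as labelled in Table 1) is an assumption about the prediction format.

```python
def normalise(name: str) -> str:
    """Lower-case a serovar name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def serovar_matches(listed: str, predicted: str) -> bool:
    """True when the listed serovar is one of the predicted candidates.

    Predictions that cannot be resolved to one serovar are assumed to be
    written as alternatives separated by ' or '.
    """
    candidates = [normalise(part) for part in predicted.split(" or ")]
    return normalise(listed) in candidates


print(serovar_matches("Enteritidis", "Gallinarum or Enteritidis"))
print(serovar_matches("Dublin", "Enteritidis"))
```

A mismatch flagged in this way would still call for manual review, since antigenic formulas and serovar names can be written in several equivalent ways.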

Conclusion

SalmoDEST is a handy and easy-to-use tool that can be routinely used in public health, food safety and research laboratories to extract complete Salmonella reference genomes of high quality from GenBank. It can also be used to download contig genomes from a list of assembly IDs. A coverage of 50x, as well as correct Salmonella genome size and serovar and MLST type prediction, are used as quality controls for both genome modes (ie, complete and contig genomes search and download). Moreover, SalmoDEST screens downloaded genomes for contamination by using the SeqSero2 tool for serovar prediction.

Supplemental material for Retrieving Good-Quality Salmonella Genomes From the GenBank Database Using a Python Tool, SalmoDEST by Emeline Cherchame, Guy Ilango and Sabrina Cadel-Six in Bioinformatics and Biology Insights: sj-jpg-1-bbi-10.1177_11779322221080264, sj-txt-1-bbi-10.1177_11779322221080264, sj-txt-2-bbi-10.1177_11779322221080264, sj-xls-1-bbi-10.1177_11779322221080264, sj-xlsx-1-bbi-10.1177_11779322221080264 and sj-xlsx-2-bbi-10.1177_11779322221080264.
