Within-species contamination of bacterial whole-genome sequence data has a greater influence on clustering analyses than between-species contamination.

Arthur W Pightling¹, James B Pettengill², Yu Wang², Hugh Rand², Errol Strain².

Abstract

Although it is assumed that contamination in bacterial whole-genome sequencing causes errors, the influences of contamination on clustering analyses, such as single-nucleotide polymorphism discovery, phylogenetics, and multi-locus sequence typing, have not been quantified. By developing and analyzing 720 Listeria monocytogenes, Salmonella enterica, and Escherichia coli short-read datasets, we demonstrate that within-species contamination causes errors that confound clustering analyses, while between-species contamination generally does not. Contaminant reads mapping to references or becoming incorporated into chimeric sequences during assembly are the sources of those errors. Contamination sufficient to influence clustering analyses is present in public sequence databases.

Entities: Species

Keywords: Clustering analyses; Comparative genomics; Contamination; Escherichia coli; Listeria monocytogenes; MLST; Multi-locus sequence typing; Phylogenetics; SNP; Salmonella enterica; Single-nucleotide polymorphism; Whole-genome sequencing

Year: 2019 PMID: 31849328 PMCID: PMC6918607 DOI: 10.1186/s13059-019-1914-x

Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583

Main text

Whole-genome sequence (WGS) analysis is valuable for studying bacteria in many disciplines, including genetics, evolutionary biology, ecology, clinical microbiology, and microbial forensics [1-5]. Researchers cluster genomes with phylogenetic analyses and by counting nucleotide or allele differences. Contamination of eukaryotic data can cause misleading results [6, 7]. For prokaryotes, it is assumed that contamination causes error [8], and tools are available to detect it [9-13], but evidence supporting this assumption is lacking.

To measure the influences of contamination on clustering analyses, we generated 720 sets of simulated Listeria monocytogenes, Salmonella enterica, and Escherichia coli Illumina MiSeq reads. These datasets include from 10 to 50% of within-species (at 0.05, 0.5, and 5% genomic distances) and between-species contamination. We also identified 24 sets of closely related bacteria (clusters) within which the contamination datasets can be analyzed. With these tools, we found that within-species contamination caused substantial errors in single-nucleotide polymorphism (SNP) and multi-locus sequence typing (MLST) pipelines, while between-species contamination resulted in fewer errors. Read mapping and assembly behavior explains this observation: reads from the same species are mapped to references or incorporated into the same contiguous sequences (contigs) as subject reads, while reads from different species usually are not.

We measured SNP and allele distances between subjects and closely related isolates (“nearest neighbors”) with the CFSAN SNP Pipeline and core-genome MLST (cgMLST) workflows [14-16] (Additional file 1: Table S1). We also performed phylogenetic analyses to provide bootstrap supports for the monophyly of subjects and their nearest neighbors. Importantly, only the subject data are simulated; all other data are real (Additional file 1: Figure S1). This approach yields datasets that are as realistic as possible, so that the results apply to real-world situations.

We observed increased SNP counts for all three species at 40 and 50% levels of contamination with 0.5 and 5% distant genomes (median 5–154) relative to controls (median 1–3; Fig. 1a–c, Additional file 1: Tables S2 and S3). For S. enterica and E. coli, there were smaller but significant increases at 50% contamination with 0.05% distant genomes (median 12–14) and for one of the two between-species contaminants (median 7–13). Bootstrap support at 40 and 50% levels of within-species contamination decreased for L. monocytogenes and E. coli (median 0.63–0.88 and 0.00–0.92, respectively) compared to controls (median 0.91–0.92 and 0.97), although not all decreases were significant (Fig. 1d–f). For S. enterica, we saw small decreases with 50% contamination by 0.05 (median 0.86) and 0.5% (median 0.96) distant genomes relative to controls (median 1.00 for each). For L. monocytogenes and S. enterica, between-species contamination caused no decreases in bootstrap support (median 0.92–0.93 and 1.00, respectively), and support only slightly decreased for E. coli (median 0.92–0.99).

With the MLST workflows, each type of contamination influenced allele counts. Still, the 0.5 and 5% distant genomes had the greatest influence (median 3–294 and 14–418) when compared to controls (median 2–5; Fig. 2a–c, Additional file 1: Tables S2 and S3). The numbers of missing and partial alleles were also greatest for the 0.5 and 5% contaminants (median 1–463) relative to controls (median 0–6; Fig. 2d–f). The appearance of errors at lower contamination levels in the MLST workflows is likely due to the absence of filtering steps commonly found in SNP pipelines.

Fig. 1

Results of SNP and phylogenetic analyses for contaminated datasets. We contaminated simulated Listeria monocytogenes (Lm), Salmonella enterica (Se), and Escherichia coli (Ec) MiSeq data with reads from themselves as controls (Self); genomes from the same species at 0.05, 0.5, and 5% genetic distances; and genomes from different species (e.g., we contaminated Lm with Se and Ec, and we contaminated Se with Lm and Ec) at 10–50% levels. For each contamination type at each level, results for 8 datasets are shown. Panels a-c show SNP distances, d-f bootstrap supports, and g-i percent reads mapped.

Fig. 2

Results of MLST analyses and assembly lengths for contaminated datasets. We contaminated simulated Listeria monocytogenes (Lm), Salmonella enterica (Se), and Escherichia coli (Ec) MiSeq data with reads from themselves as controls (Self); genomes from the same species at 0.05, 0.5, and 5% genetic distances; and genomes from different species (e.g., we contaminated Lm with Se and Ec, and we contaminated Se with Lm and Ec) at 10–50% levels. For each contamination type at each level, results for 8 datasets are shown. Panels a-c show allele counts, d-f numbers of missing and partial alleles, and g-i assembly lengths.

To gain insight into these results, we examined the percent of reads mapped to references. Median values were highest for 0.05 and 0.5% within-species contamination (median 96–100%) and lowest for between-species (median 50–91%), while 5% within-species contamination yielded intermediate results (median 76–98%; Fig. 1g–i, Additional file 1: Tables S2 and S3). For between-species contamination, there is an inverse relationship between contamination levels and the percent of reads mapped to references. For example, at 10% contamination, approximately 90% of reads mapped. It appears that the more distant mapped contaminant reads are, the higher the SNP counts.
Contaminant reads that are similar enough to the reference to be mapped but distant enough from the subject to introduce variation will generate errors. In turn, these errors may reduce bootstrap support. A similar relationship exists between allele distances and assembly lengths. Median assembly lengths for 0.05 and 0.5% within-species data are similar to controls (median 3.0–5.6 and 3.0–5.3 megabases [Mb], respectively), while between-species contaminants yielded larger assemblies (median 4.1–9.9 Mb) and the 5% within-species contamination dataset yielded intermediate assemblies (median 3.1–9.1 Mb; Fig. 2g–i).

To measure contamination in public sequence databases, we used ConFindr [13] to analyze 10,000 randomly selected fastq datasets for each of L. monocytogenes, S. enterica, and E. coli (Additional file 2: Table S4). We detected contamination in 8.92, 6.38, and 5.47% of the data, respectively (Additional file 1: Table S5). We detected between-species contamination (1.23, 0.29, and 0.15%) less often than within-species contamination (7.69, 6.09, and 5.33%), consistent with Low et al. [13]. We also analyzed the simulated data with ConFindr and used that information to estimate levels of contamination in the databases that may confound SNP and MLST workflows (Additional file 1: Figure S2 and Table S5). Approximately 1.48 (L. monocytogenes), 2.22 (S. enterica), and 0.87% (E. coli) of the data are contaminated at levels that are likely to influence SNP analyses. Roughly 2.26 (L. monocytogenes), 5.06 (S. enterica), and 1.26% (E. coli) of the data are contaminated at levels that may influence MLST analyses.

In summary, we show that within-species contamination (especially by 0.5 and 5% distant genomes) causes more errors in SNP counts, allele counts, and phylogenetic analyses of bacterial genomes [17] than between-species contamination. While other workflows may not yield the exact numbers measured here, the observation that contaminant reads are mapped to references and included in contigs of the same species, resulting in errors, is likely to hold. This study also shows that contamination that may cause errors in clustering analyses is present in public sequence databases. Therefore, it is important that studies include steps to detect within-species contamination.
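
The inverse relationship between between-species contamination level and percent reads mapped can be illustrated with a simple mixture model. The sketch below is not part of the study's analysis; the function name and the two mapping-rate parameters are assumptions introduced for illustration. It treats the mapped fraction as a weighted average of the subject and contaminant mapping rates.

```python
def expected_mapped_fraction(contamination, subject_rate=1.0, contaminant_rate=0.0):
    """Expected fraction of reads mapped to the subject reference.

    contamination: fraction of reads that come from the contaminant (0..1).
    subject_rate / contaminant_rate: hypothetical fractions of subject and
    contaminant reads that map; the defaults model a contaminant whose
    reads never map to the reference.
    """
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be between 0 and 1")
    return (1.0 - contamination) * subject_rate + contamination * contaminant_rate


if __name__ == "__main__":
    for level in (0.10, 0.20, 0.30, 0.40, 0.50):
        mapped = expected_mapped_fraction(level)
        print(f"{level:.0%} contamination -> {mapped:.0%} mapped")
```

Under these assumptions, 10% contamination by a contaminant that never maps leaves about 90% of reads mapped, which matches the example given in the text. A contaminant_rate close to 1 would mimic a closely related within-species contaminant, for which the mapped fraction stays high.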

Methods

We searched the National Center for Biotechnology Information’s (NCBI’s) database for closed Listeria monocytogenes, Salmonella enterica, and Escherichia coli genomes (e.g., “Listeria monocytogenes”[Organism] AND (“complete genome”[filter] AND all[filter] NOT anomalous[filter])) and downloaded all assemblies. We identified those that are 0–9 SNPs distant to other genomes (“nearest neighbors”) using the “min_dist_same” and “min_dist_opp” measurements in the NCBI metadata files [18-20]. We used the NCBI’s Isolates Browser [21] to identify closed genomes with closely related isolates that are part of NCBI SNP trees with at least 5 taxa [22].

We assembled 16,839 L. monocytogenes, 127,357 S. enterica, and 33,821 E. coli Illumina datasets with SPAdes v3.12.0 (spades.py --careful -1 forward.fastq -2 reverse.fastq) [23]. We removed contigs shorter than 500 nucleotides. We aligned closed and draft assemblies with NUCmer v3.1 (nucmer --prefix=ref_qry closed.fna draft.fna) and estimated SNP distances with show-snps (show-snps -Clr ref_qry.delta > ref_qry.snps) [24]. We selected closed genomes for further analyses that are approximately 0.05, 0.5, and 5% from draft genomes of the same species (based upon closed assembly length estimates calculated with QUAST v4.5 [25]).

For most subjects, within-species contamination represents (i) closely related genomes of the same serotype and clonal complex, with 0–2 locus differences (average 0.22; as measured with the program mlst; 0.05%) [26-28]; (ii) distantly related genomes of the same serotype but different clonal complex and 2–6 locus variants (average 4.1; 0.5%); and (iii) genomes of a different serotype and clonal complex with 7 locus variants (average 7; 5%; Additional file 1: Table S1). When unavailable, we predicted serotypes for S. enterica with SeqSero [29] and E. coli with SerotypeFinder [30].
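
The genomic distances used to pick contaminants (approximately 0.05, 0.5, and 5%) can be read as SNP counts divided by closed assembly length. A minimal sketch of that arithmetic follows; the function names, the tolerance, and the example numbers are hypothetical and are not taken from the authors' scripts.

```python
TARGET_DISTANCES = (0.05, 0.5, 5.0)  # percent, as stated in the Methods


def percent_distance(snp_count, assembly_length):
    """SNPs per 100 bases of the closed assembly."""
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    return 100.0 * snp_count / assembly_length


def nearest_target(distance, targets=TARGET_DISTANCES, rel_tol=0.5):
    """Return the target within rel_tol (relative) of distance, else None.

    rel_tol is an assumed tolerance; the paper only says "approximately".
    """
    best = min(targets, key=lambda t: abs(distance - t))
    return best if abs(distance - best) <= rel_tol * best else None


if __name__ == "__main__":
    # Hypothetical example: 15,000 SNPs against a 3.0 Mb closed genome.
    d = percent_distance(15_000, 3_000_000)
    print(d, nearest_target(d))
```

In practice the SNP count would come from the show-snps output and the assembly length from QUAST; both are passed in as plain numbers here to keep the sketch independent of those file formats.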
We generated simulated reads using closed subject assemblies, within-species draft contaminant assemblies, and between-species draft contaminant assemblies, with ART_Illumina v2.5.8 (art_illumina -ss MSv1 -i assembly.fasta -p -l 230 -f 20 -m 295 -s 10 -o paired_data) [31]; all assemblies were generated from real sequencing data. Contamination fastq files were made by randomly selecting subject and contaminant reads at indicated levels (in this case 10–50% contamination) and combining them into paired read files with 20-fold depth of coverage (github.com/apightling/contamination; e.g., select_reads.pl subject_1.fq subject_2.fq 10 contaminant_1.fq contaminant_2.fq output_prefix).

We identified SNP clusters that contain subject genome sequences with the NCBI’s Isolates Browser. If SNP clusters had more than 20 taxa, counting the subjects and their nearest neighbors, we randomly selected subsets for further analyses. We also ensured that the subjects and nearest neighbors formed monophyletic groups in phylogenetic trees. We generated SNP matrices with the CFSAN SNP Pipeline v1.0, using the subject assembly as a reference to minimize errors [32]. Alignments of SNPs that were detected by mapping reads to the reference were phylogenetically analyzed with GARLI v2.01.1067 [33] (100 replicates, K80 and HKY). We reported supports for monophyly of subjects and nearest neighbors; if they were no longer monophyletic, we recorded a support of 0.

We assembled simulated data with SPAdes v3.12.0 and measured assembly statistics with QUAST v4.5. We analyzed Listeria monocytogenes assemblies with the LmCGST core-genome multi-locus sequence typing (cgMLST) tool and Salmonella enterica assemblies with an S. enterica cgMLST tool described in Pettengill et al. [15]. We analyzed E. coli assemblies with a cgMLST developed using the same approach.
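
The authors implement the read-mixing step in select_reads.pl. The Python sketch below is not that script; it only illustrates one way to split a fixed read-pair budget between subject and contaminant at a given contamination level. The function names, the use of random.sample, and the depth formula (coverage times genome length divided by bases per pair) are assumptions.

```python
import random


def pairs_for_depth(genome_length, coverage=20, read_length=230):
    """Read pairs needed for the given depth, assuming 2 * read_length bases per pair."""
    return round(coverage * genome_length / (2 * read_length))


def mix_read_pairs(subject_pairs, contaminant_pairs, percent, total_pairs, seed=None):
    """Randomly draw a mixed set with `percent` % of pairs from the contaminant.

    Each pair is any object (e.g. a tuple of forward and reverse records);
    keeping mates together preserves pairing in the output.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    n_contaminant = round(total_pairs * percent / 100)
    n_subject = total_pairs - n_contaminant
    rng = random.Random(seed)
    # random.sample raises ValueError if a pool is smaller than requested.
    mixed = rng.sample(subject_pairs, n_subject) + rng.sample(contaminant_pairs, n_contaminant)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Placeholder read names stand in for real FASTQ records.
    subject = [("s%d/1" % i, "s%d/2" % i) for i in range(1000)]
    contaminant = [("c%d/1" % i, "c%d/2" % i) for i in range(1000)]
    mixed = mix_read_pairs(subject, contaminant, percent=10, total_pairs=500, seed=1)
    print(len(mixed), sum(p[0].startswith("c") for p in mixed))
```

Sampling whole pairs rather than individual reads keeps the forward and reverse files in step, which paired-end mappers and assemblers generally expect.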
Partial alleles are those loci whose lengths are less than 60% of the predicted lengths, and missing alleles are those loci that are less than 60% of predicted lengths and less than 80% identical to the reference.

Additional file 1:
Figure S1. Phylogenetic tree of 9 Listeria monocytogenes genomes with study subject and nearest neighbor labeled.
Figure S2. Results of ConFindr analysis of contamination datasets generated for this study.
Table S1. Contextual information for genome sequences used for this study.
Table S2. Results of SNP pipeline and core-genome multi locus sequence typing analyses.
Table S3. P-values for results of clustering analyses.
Table S5. Percent of contamination detected in data from NCBI.
Table S6. NCBI accession numbers for data generated during this study.

Additional file 2:
Table S4. ConFindr results from analysis of 10,000 Listeria monocytogenes, Salmonella enterica, and Escherichia coli fastq datasets. (XLS 7913 kb)

Additional file 3. Review history.
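
The partial- and missing-allele definitions given in the Methods can be written as a small classifier. This is a sketch of one reading of those thresholds, not the cgMLST tools' code; the function name, the argument names, and the "called" label for all remaining loci are assumptions.

```python
def classify_locus(observed_length, predicted_length, identity):
    """Classify a cgMLST locus as "missing", "partial", or "called".

    identity is the fraction (0..1) of positions identical to the reference
    allele. Thresholds follow the Methods as read here: a locus shorter than
    60% of its predicted length is partial, and one that is also less than
    80% identical is missing. "called" is an assumed label for the rest.
    """
    if predicted_length <= 0:
        raise ValueError("predicted_length must be positive")
    short = observed_length / predicted_length < 0.60
    if short and identity < 0.80:
        return "missing"
    if short:
        return "partial"
    return "called"


if __name__ == "__main__":
    print(classify_locus(500, 1000, 0.95))
    print(classify_locus(500, 1000, 0.50))
    print(classify_locus(900, 1000, 0.99))
```

The missing check runs first because, as the definitions are worded, every missing locus also meets the length criterion for a partial one.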

23 in total

1. Salmonella serotype determination utilizing high-throughput genome sequencing data.

Authors: Shaokang Zhang; Yanlong Yin; Marcus B Jones; Zhenzhen Zhang; Brooke L Deatherage Kaiser; Blake A Dinsmore; Collette Fitzgerald; Patricia I Fields; Xiangyu Deng
Journal: J Clin Microbiol Date: 2015-03-11 Impact factor: 5.948

2. ART: a next-generation sequencing read simulator.

Authors: Weichun Huang; Leping Li; Jason R Myers; Gabor T Marth
Journal: Bioinformatics Date: 2011-12-23 Impact factor: 6.937

3. Versatile and open software for comparing large genomes.

Authors: Stefan Kurtz; Adam Phillippy; Arthur L Delcher; Michael Smoot; Martin Shumway; Corina Antonescu; Steven L Salzberg
Journal: Genome Biol Date: 2004-01-30 Impact factor: 13.583

4. BIGSdb: Scalable analysis of bacterial genome variation at the population level.

Authors: Keith A Jolley; Martin C J Maiden
Journal: BMC Bioinformatics Date: 2010-12-10 Impact factor: 3.169

5. Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.

Authors: Arthur W Pightling; Nicholas Petronella; Franco Pagotto
Journal: PLoS One Date: 2014-08-21 Impact factor: 3.240

6. ProDeGe: a computational protocol for fully automated decontamination of genomes.

Authors: Kristin Tennessen; Evan Andersen; Scott Clingenpeel; Christian Rinke; Derek S Lundberg; James Han; Jeff L Dangl; Natalia Ivanova; Tanja Woyke; Nikos Kyrpides; Amrita Pati
Journal: ISME J Date: 2015-06-09 Impact factor: 10.302

7. Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples.

Authors: James B Pettengill; Arthur W Pightling; Joseph D Baugher; Hugh Rand; Errol Strain
Journal: PLoS One Date: 2016-11-10 Impact factor: 3.240

8. ConFindr: rapid detection of intraspecies and cross-species contamination in bacterial whole-genome sequence data.

Authors: Andrew J Low; Catherine D Carrillo; Adam G Koziol; Paul A Manninger; Burton Blais
Journal: PeerJ Date: 2019-05-31 Impact factor: 2.984

Review 9. Transforming clinical microbiology with bacterial genome sequencing.

Authors: Xavier Didelot; Rory Bowden; Daniel J Wilson; Tim E A Peto; Derrick W Crook
Journal: Nat Rev Genet Date: 2012-08-07 Impact factor: 53.242

10. Detection of bacterial contaminants and hybrid sequences in the genome of the kelp Saccharina japonica using Taxoblast.

Authors: Simon M Dittami; Erwan Corre
Journal: PeerJ Date: 2017-11-17 Impact factor: 2.984
