Arthur W Pightling, James B Pettengill, Yu Wang, Hugh Rand, Errol Strain.
Abstract
Although it is assumed that contamination in bacterial whole-genome sequencing causes errors, the influences of contamination on clustering analyses, such as single-nucleotide polymorphism discovery, phylogenetics, and multi-locus sequence typing, have not been quantified. By developing and analyzing 720 Listeria monocytogenes, Salmonella enterica, and Escherichia coli short-read datasets, we demonstrate that within-species contamination causes errors that confound clustering analyses, while between-species contamination generally does not. Contaminant reads mapping to references or becoming incorporated into chimeric sequences during assembly are the sources of those errors. Contamination sufficient to influence clustering analyses is present in public sequence databases.
Keywords: Clustering analyses; Comparative genomics; Contamination; Escherichia coli; Listeria monocytogenes; MLST; Multi-locus sequence typing; Phylogenetics; SNP; Salmonella enterica; Single-nucleotide polymorphism; Whole-genome sequencing
Year: 2019 PMID: 31849328 PMCID: PMC6918607 DOI: 10.1186/s13059-019-1914-x
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1. Results of SNP and phylogenetic analyses for contaminated datasets. We contaminated simulated Listeria monocytogenes (Lm), Salmonella enterica (Se), and Escherichia coli (Ec) MiSeq data with reads from themselves as controls (Self); genomes from the same species at 0.05, 0.5, and 5% genetic distances; and genomes from different species (e.g., we contaminated Lm with Se and Ec, and we contaminated Se with Lm and Ec) at 10–50% levels. For each contamination type at each level, results for 8 datasets are shown. Panels a–c show SNP distances, d–f bootstrap supports, and g–i percent reads mapped.
Fig. 2. Results of MLST analyses and assembly lengths for contaminated datasets. We contaminated simulated Listeria monocytogenes (Lm), Salmonella enterica (Se), and Escherichia coli (Ec) MiSeq data with reads from themselves as controls (Self); genomes from the same species at 0.05, 0.5, and 5% genetic distances; and genomes from different species (e.g., we contaminated Lm with Se and Ec, and we contaminated Se with Lm and Ec) at 10–50% levels. For each contamination type at each level, results for 8 datasets are shown. Panels a–c show allele counts, d–f numbers of missing and partial alleles, and g–i assembly lengths.
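The contamination levels named in the figure legends (10–50%) can be pictured as a read-mixing step. The following is a minimal Python sketch, not the authors' pipeline: this record does not say how the reads were combined, so the function name `contaminate`, the definition of "level" as the fraction of reads in the final dataset that come from the contaminant, and the choice to hold total read count constant are all assumptions made for illustration. Reads are treated as opaque items; paired-end data would need mates kept together.

```python
import random


def contaminate(target_reads, contaminant_reads, level, seed=0):
    """Return a mixed read list the same size as target_reads.

    `level` is ASSUMED to mean the fraction of reads in the output that
    come from the contaminant (e.g. 0.1 for 10%). Hypothetical helper,
    not taken from the paper.
    """
    if not 0.0 <= level < 1.0:
        raise ValueError("level must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    total = len(target_reads)
    n_contaminant = round(total * level)
    if n_contaminant > len(contaminant_reads):
        raise ValueError("not enough contaminant reads for this level")
    # Subsample both sources without replacement, then interleave randomly.
    kept = rng.sample(target_reads, total - n_contaminant)
    added = rng.sample(contaminant_reads, n_contaminant)
    mixed = kept + added
    rng.shuffle(mixed)
    return mixed


# Toy usage: placeholder read IDs stand in for real sequencing reads.
target = [f"T{i}" for i in range(1000)]
contaminant = [f"C{i}" for i in range(1000)]
mixed_10 = contaminate(target, contaminant, 0.10)
```

Holding the total read count fixed keeps overall depth comparable across levels, so any change in downstream results would reflect the mix rather than the amount of data; whether the study did this is not stated in this record.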