Large-scale contamination of microbial isolate genomes by Illumina PhiX control.

Supratim Mukherjee¹, Marcel Huntemann¹, Natalia Ivanova¹, Nikos C Kyrpides², Amrita Pati¹.

Abstract

With the rapid growth and development of sequencing technologies, genomes have become the new go-to for exploring solutions to some of the world's biggest challenges, such as searching for alternative energy sources and exploring genomic dark matter. However, progress in sequencing has been accompanied by its share of errors that can occur during template or library preparation, sequencing, imaging or data analysis. In this study we screened over 18,000 publicly available microbial isolate genome sequences in the Integrated Microbial Genomes database and identified more than 1000 genomes that are contaminated with PhiX, a control frequently used during Illumina sequencing runs. Approximately 10% of these genomes have been published in the literature, and 129 contaminated genomes were sequenced under the Human Microbiome Project. Raw sequence reads are prone to contamination from various sources, and such contaminants are usually eliminated during downstream quality control steps. Detection of PhiX contaminated genomes indicates a lapse in either the application or the effectiveness of proper quality control measures. The presence of PhiX contamination in several publicly available isolate genomes can result in additional errors when such data are used in comparative genomics analyses. Such contamination of public databases has far-reaching consequences in the form of erroneous data interpretation and analyses, and necessitates better measures to proofread raw sequences before releasing them to the broader scientific community.

Keywords: Comparative genomics; Contamination; Next-generation sequencing; PhiX

Year: 2015; PMID: 26203331; PMCID: PMC4511556; DOI: 10.1186/1944-3277-10-18

Source DB: PubMed; Journal: Stand Genomic Sci; ISSN: 1944-3277

Background

The ability to produce large numbers of high-quality, low-cost reads has revolutionized the field of microbiology [1-3]. Starting from a meager 1575 registered projects in September 2005, there has been a steady increase in the number of sequencing projects according to the Genomes OnLine Database (GOLD) [4]. As of November 17th 2014, there were 41,553 bacterial and archaeal isolate genome sequencing projects reported in GOLD [4,5]. This explosion of genome sequencing projects, especially during the last 5 years, has been largely catalyzed by the development of several next-generation sequencing (NGS) platforms offering rapid and accurate genome information at a low cost. Among the different NGS technologies available commercially, the sequencing by synthesis (SBS) technology [6] championed by Illumina [7] is the most widely used. Despite its high accuracy, the Illumina sequencing platform does come with its share of challenges [8] that need to be addressed by the users of this technology. One such challenge is the protocol in which PhiX is used as a quality and calibration control for sequencing runs.

PhiX is an icosahedral, nontailed bacteriophage with a single-stranded DNA genome. Its tiny genome of 5386 nucleotides was the first DNA genome to be sequenced, by Fred Sanger [9]. Due to its small, well-defined genome sequence, PhiX has been commonly used as a control for Illumina sequencing runs. For the majority of its library preparations Illumina recommends using PhiX at a low concentration of 1%, which can be raised up to 40% for low diversity samples. Depending on the concentration of PhiX used, it can be spiked into the same lane as the sample or run in a separate lane. Adding PhiX as a sequencing control necessitates subsequent quality control steps to remove its sequences so that they are not integrated into the target genome.

Here, we identify and catalog more than 1000 genomes in public databases (i.e. GenBank) that are contaminated with PhiX sequences, including the approximately 10% of these genomes that have been published in the literature. In an era where sequencing data is growing exponentially along with the need to rapidly churn out novel sequences, our report serves as a reminder that it is equally important to develop effective downstream screening and quality control measures to prevent large-scale contamination of public databases. Since preliminary analyses of initial draft sequences lead to the formulation of key scientific questions, contamination can result in misinterpretation of data and erroneous biological conclusions.

Methods

We screened the current list of isolate microbial genomes in the Integrated Microbial Genomes database (IMG v4.0) [10] against the PhiX genome. The nucleotide sequence of each query genome was compared against PhiX using NCBI-BLASTn [11], and hits with a percent identity above 90% and an e-value below 0.01 were retained. A contig was flagged as being contaminated with PhiX sequences if the total length of its hits was at least 80% of the length of the contig.
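The screening criteria above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes BLASTn was run with tabular output (`-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), that contig lengths are supplied separately, and that overlapping hits on a contig are merged before coverage is computed (the text does not say how the total hit length was calculated). The function names are hypothetical.

```python
from collections import defaultdict


def parse_hits(lines, min_identity=90.0, max_evalue=0.01):
    """Collect (start, end) query intervals per contig for hits passing both thresholds."""
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        contig = fields[0]
        pident = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        # Thresholds as read from the Methods: identity above 90%, e-value below 0.01.
        if pident > min_identity and evalue < max_evalue:
            intervals[contig].append((min(qstart, qend), max(qstart, qend)))
    return intervals


def covered_length(ivs):
    """Number of contig positions covered by at least one hit (1-based, inclusive coordinates)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(ivs):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def flag_contigs(intervals, contig_lengths, min_fraction=0.80):
    """Return {contig: covered fraction} for contigs at or above the coverage cutoff."""
    flagged = {}
    for contig, ivs in intervals.items():
        fraction = covered_length(ivs) / contig_lengths[contig]
        if fraction >= min_fraction:
            flagged[contig] = fraction
    return flagged
```

Merging overlapping intervals is a design choice made here so that a contig cannot exceed 100% coverage when several hits overlap; simply summing hit lengths would be an equally plausible reading of the text.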

Results

Among the isolate bacterial and archaeal genomes in IMG v4.0, 1230 scaffolds from 1041 genomes were contaminated with PhiX sequences, with 105 contaminated genomes published in the literature (Additional files 1 and 2). A summary of the affected genomes, their sequencing information and their sequence assembly method is given in Additional file 3. Sequences of these genomes were incorporated into IMG from the NCBI Reference Sequence Database. The majority of the contaminated scaffolds (1216 out of 1230) consist entirely (100%) of PhiX sequence, 11 scaffolds are 99% PhiX, 4 scaffolds are between 94% and 98% PhiX, and 1 scaffold is 83% PhiX (Additional file 1). Sixty-two genomes have multiple scaffolds (between 2 and 10 scaffolds each) that are contaminated with PhiX sequences. While the average length of contamination per scaffold in these genomes varies between 406 bp and 1878 bp, the total contamination per genome adds up to 4017–4777 bp (Table 1). Approximately 94% (979) of the genomes have a single contaminated scaffold each, with an average length of 5587 bp (Table 1).

Table 1

Summary of genomes and their corresponding scaffolds contaminated with PhiX sequences

Number of Genomes	Number of contaminated scaffolds/genome	Average contaminated sequence length (bp)/ scaffold	Average contaminated sequence length (bp)/ genome
2	10	406	4055
5	9	476	4282
6	8	502	4017
3	7	627	4389
46	2–6	1878	4777
979	1	5587	5587
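As a quick consistency check on Table 1, the per-genome averages can be compared with the product of scaffold count and per-scaffold average. The sketch below only re-computes values printed in the table; the row with 2 to 6 scaffolds per genome is left out of the product check because its scaffold count is a range, and small differences are expected from rounding of the averages. The helper name `check_row` is hypothetical.

```python
# (genomes, scaffolds per genome, average bp per scaffold, average bp per genome) from Table 1.
rows = [
    (2, 10, 406, 4055),
    (5, 9, 476, 4282),
    (6, 8, 502, 4017),
    (3, 7, 627, 4389),
    (979, 1, 5587, 5587),
]
GENOMES_WITH_2_TO_6_SCAFFOLDS = 46  # row omitted from the product check
PHIX_GENOME_BP = 5386               # PhiX genome size given in the Background


def check_row(scaffolds, per_scaffold, per_genome):
    """Return (scaffolds x per-scaffold average, absolute difference from the tabulated total)."""
    expected = scaffolds * per_scaffold
    return expected, abs(expected - per_genome)


for genomes, scaffolds, per_scaffold, per_genome in rows:
    expected, diff = check_row(scaffolds, per_scaffold, per_genome)
    print(f"{scaffolds:>2} scaffolds x {per_scaffold} bp = {expected} bp "
          f"(table: {per_genome} bp, difference {diff} bp, "
          f"{per_genome / PHIX_GENOME_BP:.0%} of the PhiX genome)")

total_genomes = sum(r[0] for r in rows) + GENOMES_WITH_2_TO_6_SCAFFOLDS
print("genomes in table:", total_genomes)
```

The products agree with the tabulated per-genome values to within 5 bp, and the genome counts sum to the 1041 contaminated genomes reported in the Results.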

The size of the genomes contaminated with PhiX varies from the tiny 1.05 Mb intracellular 10_881_SC42 [12-14] to the 12.2 Mb antifungal natural product synthesizing myxobacterium [15] (Figure 1). While the average length of contaminated sequence per genome is 5530 bp, close to the 5386 bp size of an entire PhiX genome, there is no direct correlation between the percentage of contamination and the size of the affected genome (Figure 1, inset). The source of contamination appears to be related to the sequencing center and its analysis and quality control pipeline. The PhiX contaminated genomes were sequenced by 54 different universities and sequencing centers, so the problem appears to be quite widespread among sequencing groups (Additional file 3). Genomes from the Human Microbiome Project account for a little over 12% of the contaminated genomes (Additional file 3).

Figure 1

Genome size and contaminated sequence length (inset) of PhiX contaminated taxa.

Conclusions

The presence of PhiX sequences within individual genomes first attracted our attention while manually curating a small number of isolate genomes. Although these scaffolds were initially thought to reflect an exciting biological phenomenon or the result of horizontal gene transfer, careful analysis showed them to be nothing but sequencing artifacts. Sequencing centers generate massive amounts of data, which calls for strict quality control measures. The sheer volume of data being generated on a daily basis necessitates well-defined, automatic quality control protocols at the source. Contaminated sequences, once released to public databases, typically propagate through thousands of analysis routes and can add to error propagation and incorrect hypotheses [16]. Thus, it is extremely important to detect contaminated sequences at the source and prevent them from affecting subsequent downstream analyses. Contamination and sequence artifacts can come from multiple sources including, but not limited to, sequencing controls such as PhiX, cloning vectors, adapters, PCR primers, nucleic acid impurities present in reagents required for sample isolation and preparation, and human error. Salter et al. [17] identified a wide range of contaminants from DNA extraction kits and other laboratory reagents affecting the outcome of culture-independent microbiota research, while Lusk [18] detected widespread contamination in four independent high throughput sequencing experiments. A study [19] scanning DNA sequences from the 1000 Genomes Project [20] identified significant contamination by Mycoplasma sequences. While DNA contamination has been a long-standing issue in research laboratories, its potential long-term implications were highlighted recently in light of developments in high throughput sequencing and human microbiome research. A recent commentary published in Nature [21] summarizes the problem well.
Several tools have been developed over the years for quality control of raw sequence reads, such as Phred [22] and Sequence Scanner [23] (specifically for first generation sequence data), and NCBI’s VecScreen and UniVec [24,25] to get rid of contaminants of vector origin. More recent programs have been designed for analyzing NGS data, such as TileQC [26], FastQC [27], PRINSEQ [28] and NGS-QC [29]; programs to detect contamination, such as DeconSeq [30]; and multi genome alignment (MGA) [31] and QC-Chain [32], which can provide both rapid QC and contamination filtering of NGS data. Such programs are meant to prevent the release of contaminated sequences. However, our results from scanning publicly available microbial isolate genome sequences for contamination show that a large number of errors can be detected in spite of the easy availability of multiple quality control measures. The sheer volume of PhiX contaminated genomes is alarming and calls for implementation of stricter quality control measures, especially at large genome centers with high rates of sequence turnover. Detection of PhiX contamination encouraged us to expand our search further; we performed additional analysis looking for other sources of contamination and have identified genomes in public databases that are:
(a) partial or complete mixtures of two or more strains;
(b) genomes contaminated with short fragments of two or more species;
(c) ‘isolate’ genomes where a complete genome is cloned inside another.
The list of such genomes is available in Additional file 4 and their nucleotide sequences are available on a JGI public ftp site [33]. The IMG database has already implemented a quality control step to identify and remove these artifacts during data submission, and the sequence data in the system is free of PhiX contamination. We are currently in the process of cleaning up additional contaminated genomes.
Most have already been removed from IMG completely or are being reinstated after cleaning up of contaminated scaffolds. At the same time, most of the PhiX contaminated genomes continue to exist in other public databases such as NCBI/RefSeq or GenBank and are easily accessible to researchers around the world. While we welcome the technological advances associated with NGS platforms and acknowledge their long-term benefits, we expect principal investigators (PIs) of large-scale sequencing projects to be aware of the possible pitfalls and take corrective measures as necessary. For the genomes contaminated with PhiX sequences, we recommend that individual PIs retract the corresponding sequences, remove contaminating scaffolds, and re-upload the clean sequences to public databases.
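The recommended cleanup step, removing contaminating scaffolds before re-uploading, could look roughly like the following sketch. It assumes the assembly is available as a multi-FASTA file and that the contaminated scaffold identifiers are already known (for example from a list such as Additional file 1); the function names and the example identifiers are hypothetical, and only the Python standard library is used.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def drop_scaffolds(records, contaminated_ids):
    """Yield only records whose ID (first word of the header) is not in contaminated_ids."""
    for header, seq in records:
        words = header.split()
        seq_id = words[0] if words else ""
        if seq_id not in contaminated_ids:
            yield header, seq


def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Small in-memory demonstration with made-up scaffold names and sequences.
example = io.StringIO(">scaffold_1 genome\nACGTACGT\n>scaffold_2 suspected PhiX\nGGGGCCCC\n")
cleaned = io.StringIO()
write_fasta(drop_scaffolds(read_fasta(example), {"scaffold_2"}), cleaned)
print(cleaned.getvalue(), end="")
```

Dropping whole scaffolds matches the case reported in the Results, where nearly all flagged scaffolds consist entirely of PhiX sequence; a scaffold that is only partly PhiX would need trimming rather than removal, which this sketch does not attempt.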

Abbreviations

IMG: Integrated Microbial Genomes; HMP: Human Microbiome Project; GOLD: Genomes OnLine Database; NGS: next-generation sequencing; SBS: sequencing by synthesis.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AP and NI initiated the project. SM, AP and MH performed all analysis tasks. NI, NCK and AP performed validation of analysis. SM and AP wrote the paper. All authors read and approved the final manuscript.

Additional file 1

Complete list of PhiX contaminated scaffolds, corresponding IMG Taxon IDs and their percentage of contamination.

Additional file 2

List of genomes contaminated with PhiX that have been published in the literature.

Additional file 3

Detailed sequencing information of PhiX contaminated genomes.

Additional file 4

List of non-PhiX contaminations that were detected and removed from the public IMG database.

23 in total

1. Whole-genome sequence annotation: 'Going wrong with confidence'.

Authors: N C Kyrpides; C A Ouzounis
Journal: Mol Microbiol Date: 1999-05 Impact factor: 3.501

2. Nucleotide sequence of bacteriophage phi X174 DNA.

Authors: F Sanger; G M Air; B G Barrell; N L Brown; A R Coulson; C A Fiddes; C A Hutchison; P M Slocombe; M Smith
Journal: Nature Date: 1977-02-24 Impact factor: 49.962

3. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea.

Authors: Dongying Wu; Philip Hugenholtz; Konstantinos Mavromatis; Rüdiger Pukall; Eileen Dalin; Natalia N Ivanova; Victor Kunin; Lynne Goodwin; Martin Wu; Brian J Tindall; Sean D Hooper; Amrita Pati; Athanasios Lykidis; Stefan Spring; Iain J Anderson; Patrik D'haeseleer; Adam Zemla; Mitchell Singer; Alla Lapidus; Matt Nolan; Alex Copeland; Cliff Han; Feng Chen; Jan-Fang Cheng; Susan Lucas; Cheryl Kerfeld; Elke Lang; Sabine Gronow; Patrick Chain; David Bruce; Edward M Rubin; Nikos C Kyrpides; Hans-Peter Klenk; Jonathan A Eisen
Journal: Nature Date: 2009-12-24 Impact factor: 49.962

4. Fast identification and removal of sequence contamination from genomic and metagenomic datasets.

Authors: Robert Schmieder; Robert Edwards
Journal: PLoS One Date: 2011-03-09 Impact factor: 3.240

5. Addressing challenges in the production and analysis of illumina sequencing data.

Authors: Martin Kircher; Patricia Heyn; Janet Kelso
Journal: BMC Genomics Date: 2011-07-29 Impact factor: 3.969

6. Multi-genome alignment for quality control and contamination screening of next-generation sequencing data.

Authors: James Hadfield; Matthew D Eldridge
Journal: Front Genet Date: 2014-02-20 Impact factor: 4.599

7. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses.

Authors: Susannah J Salter; Michael J Cox; Elena M Turek; Szymon T Calus; William O Cookson; Miriam F Moffatt; Paul Turner; Julian Parkhill; Nicholas J Loman; Alan W Walker
Journal: BMC Biol Date: 2014-11-12 Impact factor: 7.431

8. QC-Chain: fast and holistic quality control method for next-generation sequencing data.

Authors: Qian Zhou; Xiaoquan Su; Anhui Wang; Jian Xu; Kang Ning
Journal: PLoS One Date: 2013-04-02 Impact factor: 3.240

9. TileQC: a system for tile-based quality control of Solexa data.

Authors: Peter C Dolan; Dee R Denver
Journal: BMC Bioinformatics Date: 2008-05-28 Impact factor: 3.169

10. Mycoplasma contamination in the 1000 Genomes Project.

Authors: William B Langdon
Journal: BioData Min Date: 2014-04-29 Impact factor: 2.522
