ProDeGe: a computational protocol for fully automated decontamination of genomes.

Kristin Tennessen¹, Evan Andersen¹, Scott Clingenpeel¹, Christian Rinke¹, Derek S Lundberg², James Han¹, Jeff L Dangl³, Natalia Ivanova¹, Tanja Woyke¹, Nikos Kyrpides¹, Amrita Pati¹.

Abstract

Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes--clean and contaminant--using a combination of homology and feature-based methodologies. On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). The procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence.

Year: 2015 PMID: 26057843 PMCID: PMC4681846 DOI: 10.1038/ismej.2015.100

Source DB: PubMed Journal: ISME J ISSN: 1751-7362 Impact factor: 10.302

Recent technological advancements have enabled the large-scale sampling of genomes from uncultured microbial taxa, through the high-throughput sequencing of single amplified genomes (SAGs; Rinke et al.; Swan et al.) and the assembly and binning of genomes from metagenomes (GMGs; Cuvelier et al.; Sharon and Banfield, 2013). The importance of these products in assessing community structure and function has been established beyond doubt (Kalisky and Quake, 2011). Multiple displacement amplification (MDA) and sequencing of single cells have been immensely successful in capturing rare and novel phyla, generating valuable references for phylogenetic anchoring. However, efforts to conduct MDA and sequencing in a high-throughput manner have been heavily impaired by contamination from DNA introduced by the environmental sample, as well as DNA introduced during the MDA or sequencing process (Woyke et al.; Engel et al.; Field et al.). Similarly, metagenome binning and assembly often carry various errors and artifacts depending on the methods used (Nielsen et al.). Even cultured isolate genomes have been shown not to be immune to contamination with other species (Parks et al.; Mukherjee et al.). As sequencing of these genome product types rapidly increases, contaminant sequences are finding their way into public databases as reference sequences. It is therefore extremely important to define standardized and automated protocols for quality control and decontamination, which would go a long way towards establishing quality standards for all microbial genome product types.

Current procedures for decontamination and quality control of genome sequences in single cells and metagenome bins are heavily manual and can consume hours per megabase when performed by expert biologists. Supervised decontamination typically involves homology-based inspection of ribosomal RNA sequences and protein-coding genes, as well as visual analysis of k-mer frequency plots and guanine–cytosine content (Clingenpeel, 2015).
Manual decontamination is also possible through the software SmashCell (Harrington et al.), which contains a tool for visual identification of contaminants from a self-organizing map and corresponding U-matrix. Another existing software tool, DeconSeq (Schmieder and Edwards, 2011), automatically removes contaminant sequences; however, it requires contaminant databases as input. The former lacks automation, whereas the latter requires prior knowledge of contaminants, rendering both applications impractical for high-throughput decontamination.

Here, we introduce ProDeGe, the first fully automated computational protocol for decontamination of genomes. ProDeGe uses a combination of homology-based and sequence composition-based approaches to separate contaminant sequences from the target genome draft. It has been pre-calibrated to discard at least 84% of the contaminant sequence, which results in retention of a median 84% of the target sequence. The standalone software is freely available at http://prodege.jgi-psf.org//downloads/src and can be run on any system that has Perl, R (R Core Team, 2014), Prodigal (Hyatt et al.) and NCBI Blast (Camacho et al.) installed. A graphical viewer allowing further exploration of data sets and exporting of contigs accompanies the web application for ProDeGe at http://prodege.jgi-psf.org, which is open to the wider scientific community as a decontamination service (Supplementary Figure S1).

The assembly and corresponding NCBI taxonomy of the data set to be decontaminated are required inputs to ProDeGe (Figure 1a). Contigs are first annotated with genes, after which eukaryotic contamination is removed based on homology of genes at the nucleotide level, using the eukaryotic subset of NCBI's Nucleotide database as the reference. For detecting prokaryotic contamination, a curated database of reference contigs from the set of high-quality genomes within the Integrated Microbial Genomes (IMG; Markowitz et al.) system is used as the reference.
This ensures that errors in public reference databases due to poor quality of sequencing, assembly and annotation do not negatively impact the decontamination process. Contigs determined as belonging to the target organism based on nucleotide-level homology to sequences in the above database are defined as ‘Clean’, whereas those aligned to other organisms are defined as ‘Contaminant’. Contigs whose origin cannot be determined based on alignment are classified as ‘Undecided’. Classified clean and contaminant contigs are used to calibrate the separation in the subsequent 5-mer based binning module, which classifies undecided contigs as ‘Clean’ or ‘Contaminant’ using principal components analysis (PCA) of 5-mer frequencies. The separation cutoff can also be specified by the user. When data sets do not have taxonomy deeper than phylum level, or a single confident taxonomic bin cannot be detected using sequence alignment, solely 9-mer based binning is used because it gives more accurate overall classification. In the absence of a user-defined cutoff, a pre-calibrated cutoff for 80% or more specificity separates the clean contigs from contaminated sequences in the resulting PCA of the 9-mer frequency matrix. Details on ProDeGe's custom database, evaluation of the performance of the system and exploration of the parameter space to calibrate ProDeGe for a highly accurate classification rate are provided in the Supplementary Material.
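The composition-based step can be illustrated with a small Python sketch. It is not ProDeGe's implementation (the released software uses Perl and R, and it separates contigs in PCA space). As a simplified stand-in, the sketch computes normalized k-mer frequency profiles and labels each undecided contig by its Euclidean distance to the centroid of the contigs already classified as clean. The function names, the toy sequences and the cutoff value are all hypothetical.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=5):
    """Return normalized k-mer frequencies, ordered over all 4**k DNA k-mers."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers containing N etc. are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def centroid(profiles):
    """Component-wise mean of a list of equal-length profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def distance(a, b):
    """Euclidean distance between two profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_undecided(clean_seqs, undecided, cutoff, k=5):
    """Label undecided contigs by distance to the centroid of clean contigs.

    Assumption: a plain distance in raw frequency space replaces the
    PCA-based separation described in the text.
    """
    center = centroid([kmer_profile(s, k) for s in clean_seqs])
    labels = {}
    for name, seq in undecided.items():
        d = distance(kmer_profile(seq, k), center)
        labels[name] = "Clean" if d <= cutoff else "Contaminant"
    return labels


# Toy data: AT-rich target contigs, one similar and one GC-rich undecided contig.
clean = ["ATATATATATATATAT", "TATATATATATATATA"]
undecided = {"c1": "ATATATATTATATA", "c2": "GCGCGCGCGCGCGC"}
print(classify_undecided(clean, undecided, cutoff=0.5, k=2))
```

With real assemblies, k would be 5 or 9 as in the text, and the cutoff would come from calibration on contigs of known origin rather than from a fixed number.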

Figure 1

(a) Schematic overview of the ProDeGe engine. (b) Features of data sets used to validate ProDeGe: SAGs from the Arabidopsis endophyte sequencing project, MDM project, public data sets found in IMG but not sequenced at the JGI, as well as genomes from metagenomes. All the data and results can be found in Supplementary Table S3.

The performance of ProDeGe was evaluated using 182 manually screened SAGs (Figure 1b, Supplementary Table S1) from two studies whose data sets are publicly available within the IMG system: 107 SAGs from an Arabidopsis endophyte sequencing project and 75 SAGs from the Microbial Dark Matter (MDM) project (only 75 of the 201 SAGs from the MDM project had a 1:1 mapping between contigs in the unscreened and the manually screened versions, hence these were used; Rinke et al.). Manual curation of these SAGs demonstrated that the use of ProDeGe prevented 5311 potentially contaminated contigs in these data sets from entering public databases. Figure 2a shows the sensitivity vs specificity plot of ProDeGe results for the above data sets. Most of the data points in Figure 2a cluster in the top right of the plot, reflecting a median retention of 89% of the clean sequence (sensitivity) and a median rejection of 100% of the sequence of contaminant origin (specificity). In addition, on average, 84% of the bases of a data set are accurately classified.

ProDeGe performs best when the target organism has sequenced homologs at the class level or deeper in its high-quality prokaryotic nucleotide reference database. If the target organism's taxonomy is unknown or not deeper than domain level, or there are few contigs with taxonomic assignments, a target bin cannot be assessed and thus ProDeGe removes contaminant contigs using sequence composition only. The few samples in Figure 2a that show a higher rate of false positives (lower specificity) and/or reduced sensitivity typically occur when the data set contains few contaminant contigs or when ProDeGe incorrectly assumes that the largest bin is the target bin. Some data sets contain a higher proportion of contamination than target sequence, and ProDeGe's performance can suffer under this condition. However, under all other conditions, ProDeGe demonstrates high speed, specificity and sensitivity (Figure 2). In addition, ProDeGe demonstrates better performance in overall classification when nucleotides are considered than when contigs are considered, illustrating that longer contigs are more accurately classified (Supplementary Table S1).
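The base-level metrics used above can be made concrete with a short Python sketch. It assumes each contig comes with its length, its manually curated label and its predicted label. The function name and the toy contigs are hypothetical and are not taken from the paper's data.

```python
def base_weighted_metrics(contigs):
    """Compute base-weighted sensitivity, specificity and accuracy.

    contigs: iterable of (length_bp, truth, predicted), where both labels
    are 'Clean' or 'Contaminant'.
    Sensitivity = fraction of truly clean bases that are retained.
    Specificity = fraction of truly contaminant bases that are removed.
    """
    clean_kept = clean_total = contam_removed = contam_total = 0
    for length, truth, predicted in contigs:
        if truth == "Clean":
            clean_total += length
            if predicted == "Clean":
                clean_kept += length
        else:
            contam_total += length
            if predicted == "Contaminant":
                contam_removed += length
    total = clean_total + contam_total
    sensitivity = clean_kept / clean_total if clean_total else float("nan")
    specificity = contam_removed / contam_total if contam_total else float("nan")
    accuracy = (clean_kept + contam_removed) / total if total else float("nan")
    return sensitivity, specificity, accuracy


# Toy example: one long clean contig kept, one short clean contig wrongly
# removed, one contaminant contig correctly removed.
toy = [
    (9000, "Clean", "Clean"),
    (1000, "Clean", "Contaminant"),
    (2000, "Contaminant", "Contaminant"),
]
print(base_weighted_metrics(toy))

# Giving every contig a weight of 1 yields the contig-level view instead.
print(base_weighted_metrics([(1, t, p) for _, t, p in toy]))
```

In the toy data the only misclassified contig is a short one, so the base-weighted accuracy (11000 of 12000 bases) is higher than the contig-level accuracy (2 of 3 contigs). This is the same pattern the text attributes to longer contigs being classified more accurately.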

Figure 2

ProDeGe accuracy and performance scatterplots of 182 manually curated single amplified genomes (SAGs), where each symbol represents one SAG data set. (a) Accuracy shown by sensitivity (proportion of bases confirmed ‘Clean’) vs specificity (proportion of bases confirmed ‘Contaminant’) from the Endophyte and Microbial Dark Matter (MDM) data sets. Symbol size reflects input data set size in megabases. Most points cluster in the top right of the plot, showing ProDeGe's high accuracy. Median and average overall results are shown in Supplementary Table S1. (b) ProDeGe completion time in central processing unit (CPU) core hours for the 182 SAGs. ProDeGe operates successfully at an average rate of 0.30 CPU core hours per megabase of sequence. Principal components analysis (PCA) of a 9-mer frequency matrix costs more computationally than PCA of a 5-mer frequency matrix used with blast-binning. The lack of known taxonomy for the MDM data sets prevents blast-binning, thus showing longer finishing times than the endophyte data sets, which have known taxonomy for use in blast-binning.
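As a rough illustration of the reported throughput, the following Python sketch converts an assembly size into an estimated cost in CPU core hours. It assumes the average rate of 0.30 CPU core hours per megabase applies uniformly, although the caption notes that data sets without known taxonomy take longer. The function name and the 5 Mb example are hypothetical.

```python
def estimate_core_hours(assembly_size_bp, rate_per_mb=0.30):
    """Estimate CPU core hours from assembly size at a flat per-megabase rate."""
    return (assembly_size_bp / 1_000_000) * rate_per_mb


# A hypothetical 5 Mb draft assembly at the reported average rate.
print(estimate_core_hours(5_000_000))
```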

All SAGs used in the evaluation of ProDeGe were assembled using SPAdes (Bankevich et al.). In-house testing has shown that reads assembled with SPAdes from different strains, or even slightly divergent species of the same genus, may be combined into the same contig (Personal communications, KT and Robert Bowers). Ideally, the DNA in a well that gets sequenced belongs to a single cell. In the best case, contaminant sequences need to be at least from a different species to be recognized as such by the homology-based screening stage. In the absence of closely related sequenced organisms, contaminant sequences need to be at least from a different genus to be recognized as such by the composition-based screening stage (Supplementary Material). Thus, there is little risk of ProDeGe separating sequences from clonal populations or strains. We have found species- and genus-level contamination in MDA samples to be rare.

To evaluate the quality of publicly available uncultured genomes, ProDeGe was used to screen 185 SAGs and 14 GMGs (Figure 1b). Compared with CheckM (Parks et al.), a tool that estimates genome sequence contamination using marker genes, ProDeGe generally marks a higher proportion of sequence as ‘Contaminant’ (Supplementary Table S2). This is because ProDeGe has been calibrated to perform at high specificity levels. The command line version of ProDeGe allows users to conduct their own calibration and specify a user-defined distance cutoff. Further, CheckM only outputs the proportion of contamination, whereas ProDeGe labels each contig as ‘Clean’ or ‘Contaminant’ during the process of automated removal. The web application for ProDeGe allows users to export clean and contaminant contigs, examine contig gene calls with their corresponding taxonomies, and discover contig clusters in the first three components of their k-dimensional space.
Non-linear approaches for dimensionality reduction of k-mer vectors are gaining popularity (van der Maaten and Hinton, 2008), but we observed no systematic advantage of using t-Distributed Stochastic Neighbor Embedding over PCA (Supplementary Figure S2).

ProDeGe is the first step towards establishing a standard for quality control of genomes from both cultured and uncultured microorganisms. It is valuable for preventing the dissemination of contaminated sequence data into public databases and thereby avoiding the misleading analyses that would result. The fully automated nature of the pipeline relieves scientists of hours of manual screening, producing reliably clean data sets and enabling the high-throughput screening of data sets for the first time. ProDeGe, therefore, represents a critical component in our toolkit during an era of next-generation DNA sequencing and cultivation-independent microbial genomics.

17 in total

1. Microbiology. Genomes from metagenomics.

Authors: Itai Sharon; Jillian F Banfield
Journal: Science Date: 2013-11-29 Impact factor: 47.728

2. Prevalent genome streamlining and latitudinal divergence of planktonic bacteria in the surface ocean.

Authors: Brandon K Swan; Ben Tupper; Alexander Sczyrba; Federico M Lauro; Manuel Martinez-Garcia; José M González; Haiwei Luo; Jody J Wright; Zachary C Landry; Niels W Hanson; Brian P Thompson; Nicole J Poulton; Patrick Schwientek; Silvia G Acinas; Stephen J Giovannoni; Mary Ann Moran; Steven J Hallam; Ricardo Cavicchioli; Tanja Woyke; Ramunas Stepanauskas
Journal: Proc Natl Acad Sci U S A Date: 2013-06-25 Impact factor: 11.205

3. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes.

Authors: H Bjørn Nielsen; Mathieu Almeida; Agnieszka Sierakowska Juncker; Simon Rasmussen; Junhua Li; Shinichi Sunagawa; Damian R Plichta; Laurent Gautier; Anders G Pedersen; Emmanuelle Le Chatelier; Eric Pelletier; Ida Bonde; Trine Nielsen; Chaysavanh Manichanh; Manimozhiyan Arumugam; Jean-Michel Batto; Marcelo B Quintanilha Dos Santos; Nikolaj Blom; Natalia Borruel; Kristoffer S Burgdorf; Fouad Boumezbeur; Francesc Casellas; Joël Doré; Piotr Dworzynski; Francisco Guarner; Torben Hansen; Falk Hildebrand; Rolf S Kaas; Sean Kennedy; Karsten Kristiansen; Jens Roat Kultima; Pierre Léonard; Florence Levenez; Ole Lund; Bouziane Moumen; Denis Le Paslier; Nicolas Pons; Oluf Pedersen; Edi Prifti; Junjie Qin; Jeroen Raes; Søren Sørensen; Julien Tap; Sebastian Tims; David W Ussery; Takuji Yamada; Pierre Renault; Thomas Sicheritz-Ponten; Peer Bork; Jun Wang; Søren Brunak; S Dusko Ehrlich
Journal: Nat Biotechnol Date: 2014-07-06 Impact factor: 54.908

4. Genomic insights into the uncultivated marine Zetaproteobacteria at Loihi Seamount.

Authors: Erin K Field; Alexander Sczyrba; Audrey E Lyman; Christopher C Harris; Tanja Woyke; Ramunas Stepanauskas; David Emerson
Journal: ISME J Date: 2015-03-17 Impact factor: 10.302

5. BLAST+: architecture and applications.

Authors: Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal: BMC Bioinformatics Date: 2009-12-15 Impact factor: 3.169

6. Prodigal: prokaryotic gene recognition and translation initiation site identification.

Authors: Doug Hyatt; Gwo-Liang Chen; Philip F Locascio; Miriam L Land; Frank W Larimer; Loren J Hauser
Journal: BMC Bioinformatics Date: 2010-03-08 Impact factor: 3.169

7. SmashCell: a software framework for the analysis of single-cell amplified genome sequences.

Authors: Eoghan D Harrington; Manimozhiyan Arumugam; Jeroen Raes; Peer Bork; David A Relman
Journal: Bioinformatics Date: 2010-10-21 Impact factor: 6.937

8. Fast identification and removal of sequence contamination from genomic and metagenomic datasets.

Authors: Robert Schmieder; Robert Edwards
Journal: PLoS One Date: 2011-03-09 Impact factor: 3.240

9. Large-scale contamination of microbial isolate genomes by Illumina PhiX control.

Authors: Supratim Mukherjee; Marcel Huntemann; Natalia Ivanova; Nikos C Kyrpides; Amrita Pati
Journal: Stand Genomic Sci Date: 2015-03-30

10. IMG 4 version of the integrated microbial genomes comparative analysis system.

Authors: Victor M Markowitz; I-Min A Chen; Krishna Palaniappan; Ken Chu; Ernest Szeto; Manoj Pillay; Anna Ratner; Jinghua Huang; Tanja Woyke; Marcel Huntemann; Iain Anderson; Konstantinos Billis; Neha Varghese; Konstantinos Mavromatis; Amrita Pati; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2013-10-27 Impact factor: 16.971
