
GenCoF: a graphical user interface to rapidly remove human genome contaminants from metagenomic datasets.

Matthew D Czajkowski, Daniel P Vance, Steven A Frese, Giorgio Casaburi.

Abstract

SUMMARY: The removal of human genomic reads from shotgun metagenomic sequencing is a critical step in protecting subject privacy. Freely available tools addressing this issue require advanced programming knowledge or are limited by analytical time and data load due to their server-based nature. Here, we compared the most cited tools for host-DNA removal using synthetic and real metagenomic datasets. Then, we integrated the most efficient pipeline in a graphical user interface to make these tools available without command-line use. This interface, GenCoF, rapidly removes human genome contaminants from metagenomic datasets. Additionally, the tool offers quality-filtering, data reduction and interactive modification of any parameter in order to customize the analysis. GenCoF offers both quality and host-associated filtering in a non-commercial, freely available tool with a local, interactive and easy-to-use interface.
AVAILABILITY AND IMPLEMENTATION: GenCoF is freely available (under a GPL license) for Mac OS and Linux at https://github.com/MattCzajkowski/GenCoF.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press.

Year:  2019        PMID: 30475995      PMCID: PMC6596892          DOI: 10.1093/bioinformatics/bty963

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Improvements in sequencing technologies within the past decade have reduced the cost of shotgun metagenomic sequencing tremendously (Kunin et al.). Consequently, greater access and lower cost have resulted in rapid growth of metagenomic datasets from humans and other systems (Pagani et al.). After sequencing, the first step of analysis is the removal of host-derived reads (Kunin et al.). Residual human DNA in publicly deposited datasets is a major privacy concern and a roadblock to proper study privacy protections (Gurwitz et al.). Although several tools are currently available to perform this task, the majority require advanced programming knowledge or have limitations in terms of analysis time or computational requirements. In this study, we compared the most cited tools for host-removal filtering in metagenomics studies. Then, selected tools were integrated into the most efficient pipeline in a graphical user interface (GUI) for broad use. We present Genomic Contamination Filter (GenCoF), a GUI to rapidly remove human genome contaminants from metagenomic datasets. The application is available as an executable in Unix and Mac OS environments and does not require command-line interaction. In addition to host-derived contaminant removal, GenCoF offers the possibility to quality-filter sequencing reads and split datasets into smaller sizes to increase manageability of data. GenCoF allows for parameter customization of the analysis based on user needs. Further, a step-by-step tutorial of installation and sequence filtering is available. GenCoF is the first tool offering an interactive and easy-to-use interface to filter metagenomic sequencing reads for both quality and host-associated components. The application is freely available under GPL license at https://github.com/MattCzajkowski/GenCoF.

2 Features and methods

The software included in GenCoF has been coded with Python v3.6.1 and implements freely available packages (see Supplementary Methods).

2.1 Tool description

GenCoF bundles several programs, including Sickle (Joshi and Fass, 2018), Prinseq (Schmieder and Edwards, 2011a), Bowtie2 (Langmead and Salzberg, 2012) and Fastq and Fasta Splitter from the FASTX-Toolkit. Before read decontamination, samples can be quality-filtered. Read trimming in GenCoF uses either Sickle or Prinseq. Users can decide whether to use multi-threading options and build custom reference databases for sequence removal. Bowtie2 is then employed to perform the decontamination step with optional custom parameterization. Lastly, GenCoF offers the option to concatenate output files if they were initially split.
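
The workflow described above (optional trimming, reference indexing, then host-read removal) can be sketched as an ordered list of command lines. This is a minimal illustration under stated assumptions, not GenCoF's own code: the function names, file names and trimming thresholds are hypothetical, and the sketch only uses widely documented options of Sickle (se, -f, -t, -o, -q, -l) and Bowtie2 (-x, -U, --un, -S, -p). The commands are built but not executed.

```python
from typing import List, Optional


def sickle_se_cmd(fastq_in: str, fastq_out: str,
                  min_qual: int = 20, min_len: int = 20) -> List[str]:
    """Single-end quality trimming with Sickle (thresholds are illustrative)."""
    return ["sickle", "se", "-f", fastq_in, "-t", "sanger",
            "-o", fastq_out, "-q", str(min_qual), "-l", str(min_len)]


def bowtie2_build_cmd(reference_fasta: str, index_prefix: str) -> List[str]:
    """Build a Bowtie2 index from a host reference FASTA."""
    return ["bowtie2-build", reference_fasta, index_prefix]


def bowtie2_filter_cmd(index_prefix: str, fastq_in: str, fastq_out: str,
                       threads: int = 1,
                       extra: Optional[List[str]] = None) -> List[str]:
    """Align reads to the host index; --un keeps the reads that did NOT align."""
    cmd = ["bowtie2", "-x", index_prefix, "-U", fastq_in,
           "--un", fastq_out, "-S", "/dev/null", "-p", str(threads)]
    if extra:
        cmd.extend(extra)  # optional user-supplied Bowtie2 parameters
    return cmd


def decontamination_plan(reads: str, reference_fasta: str, index_prefix: str,
                         threads: int = 1, trim: bool = True) -> List[List[str]]:
    """Return the commands in the order they would be run."""
    steps = [bowtie2_build_cmd(reference_fasta, index_prefix)]
    current = reads
    if trim:
        trimmed = "trimmed.fastq"  # hypothetical intermediate file name
        steps.append(sickle_se_cmd(current, trimmed))
        current = trimmed
    steps.append(bowtie2_filter_cmd(index_prefix, current,
                                    "host_removed.fastq", threads))
    return steps


if __name__ == "__main__":
    for step in decontamination_plan("sample.fastq", "GRCh38.fa",
                                     "human_idx", threads=4):
        print(" ".join(step))
```

Each returned list could be passed to subprocess.run(step, check=True) on a system where the tools are installed. Keeping the commands as data makes it simple to display or adjust parameters before anything runs, which is the kind of customization a GUI front end would expose.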

2.2 Program performance methods

The programs Deconseq v0.4.3 (Schmieder and Edwards, 2011b), Bowtie2 v2.3.4, BBMap (specifically BBSplit) v37.80 (Bushnell, 2014) and BMTagger v1.1.0 (Rotmistrovsky and Agarwala, 2011) were compared for their speed, accuracy and size of reference files created (see Supplementary Results). Four synthetic datasets were created containing an average of 96 166 185 reads from viral, fungal, bacterial, archaeal and human genomes. The datasets contained ∼30, 50, 70 or 100% human reads, respectively. All tools were compared against the human genome (vGRCh38.p7, NCBI accession GCF_000001405.33) as reference. From the synthetic dataset test, the highest performing parameters were chosen for the individual tools, which were finally tested against a published metagenomic dataset (Casaburi et al.). BLASTn (Altschul et al.) was used as a baseline, with an E-value cut-off of ≤10⁻¹⁰, to determine whether the reads reported as positive hits by the programs were true positives (Haas et al.; Turnbaugh et al.). Only reads with BLASTn-positive hits were considered true human contaminants.
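
The BLASTn validation step can be illustrated with a short sketch. It assumes BLASTn was run with the standard 12-column tabular output (-outfmt 6), in which the query ID is the first column and the E-value the eleventh. The function names and the example read IDs are hypothetical, and this is not the authors' script.

```python
from typing import Iterable, Set, Tuple

EVALUE_CUTOFF = 1e-10  # cut-off stated in the text


def blast_confirmed_reads(tabular_lines: Iterable[str],
                          cutoff: float = EVALUE_CUTOFF) -> Set[str]:
    """Collect query IDs with at least one hit at or below the E-value cut-off.

    Assumes the default 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    confirmed: Set[str] = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= cutoff:
            confirmed.add(fields[0])
    return confirmed


def split_calls(flagged: Set[str],
                confirmed: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split reads a tool flagged as human into BLASTn-confirmed and unconfirmed."""
    return flagged & confirmed, flagged - confirmed


if __name__ == "__main__":
    rows = [
        "read1\tchr1\t100.0\t100\t0\t0\t1\t100\t500\t599\t1e-50\t185",
        "read2\tchr2\t85.0\t40\t6\t0\t1\t40\t900\t939\t0.001\t40.1",
    ]
    true_pos, unconfirmed = split_calls({"read1", "read2", "read3"},
                                        blast_confirmed_reads(rows))
    print(sorted(true_pos), sorted(unconfirmed))
```

Reads that a tool flags as human but that have no qualifying BLASTn hit land in the unconfirmed set, so under the criterion above they would not count as true human contaminants.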

3 Results

Synthetic analysis showed that BMTagger had the best CPU/hour (Fig. 1B). However, BMTagger is limited to single-CPU usage and needed the largest reference (∼32 GB). Conversely, Bowtie2 presented similar overall error rates (Fig. 1A), could run with multiple CPUs, and only required creation of a ∼3 GB reference file. In comparison to the other programs, BBSplit returned a higher rate of incorrectly mapped human reads (Fig. 1A), but outperformed all the other applications in correctly assigning microbial reads (Fig. 1C). Further, while BBSplit only generated a 3 GB reference file, it required a Java runtime environment to run and returned a high CPU/hour across all tested parameters (Fig. 1B). Finally, Deconseq returned a very high error rate (Fig. 1C) and CPU/hour (Fig. 1B), and was limited to a single CPU.
However, Deconseq used only a 4 GB reference database and required only Mac or Linux OS. From the published metagenomic dataset, Bowtie2 reported the lowest error rate, followed by Deconseq, BBSplit and BMTagger, respectively (Fig. 1D). Bowtie2 outperformed the other tools in terms of accuracy, while BBSplit had the worst accuracy in correctly assigning reads. Overall, Bowtie2 was chosen for GenCoF because of its low error rate, limited pre-requisites and low CPU/hour. It also performed comparably on both real and synthetic datasets. Although it mapped the fewest reads from the published dataset, it differed by only ∼3000 reads of the 1.5 million tested (0.2%).

Fig. 1. Analysis of program performance. (A) Average error rate of synthetic reads wrongly assigned as non-human. (B) Average CPU/Hour. (C) Average error rate of synthetic reads wrongly assigned as human. (D) Average error rate of real dataset. Error bars represent standard error.
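
The two synthetic-data error rates in Fig. 1 (panels A and C) can be made concrete with a small sketch. The denominators are an assumption, since the text does not state them: here the rate for panel A is taken relative to all true human reads and the rate for panel C relative to all true non-human reads. Function names and read IDs are hypothetical.

```python
from typing import Optional, Set, Tuple


def error_rates(truth_human: Set[str], all_reads: Set[str],
                flagged_human: Set[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (rate wrongly assigned as non-human, rate wrongly assigned as human).

    Assumed denominators: true human reads for the first rate, true
    non-human reads for the second. A rate is None when its denominator
    is empty (e.g. no non-human reads in a 100% human dataset).
    """
    non_human = all_reads - truth_human
    missed_human = truth_human - flagged_human   # human reads left in the data
    false_human = flagged_human & non_human      # microbial reads removed as human
    rate_missed = len(missed_human) / len(truth_human) if truth_human else None
    rate_false = len(false_human) / len(non_human) if non_human else None
    return rate_missed, rate_false


def percent_of(part: int, total: int) -> float:
    """Express part as a percentage of total."""
    return 100 * part / total


if __name__ == "__main__":
    human = {"h1", "h2", "h3", "h4"}
    reads = human | {"m1", "m2", "m3", "m4", "m5", "m6"}
    print(error_rates(human, reads, {"h1", "h2", "h3", "m1"}))
    # The difference quoted in the text: about 3000 reads out of 1.5 million.
    print(percent_of(3000, 1_500_000))
```

The last line reproduces the arithmetic behind the quoted 0.2%: 3000 divided by 1 500 000 is 0.002.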

Funding

This work was supported by Evolve Biosystems.

Conflict of Interest: Authors are employees of Evolve Biosystems. The software presented here is completely free and available under GPL v3 license.

References

1.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

2.  Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons.

Authors:  Brian J Haas; Dirk Gevers; Ashlee M Earl; Mike Feldgarden; Doyle V Ward; Georgia Giannoukos; Dawn Ciulla; Diana Tabbaa; Sarah K Highlander; Erica Sodergren; Barbara Methé; Todd Z DeSantis; Joseph F Petrosino; Rob Knight; Bruce W Birren
Journal:  Genome Res       Date:  2011-01-06       Impact factor: 9.043

3.  A bioinformatician's guide to metagenomics.

Authors:  Victor Kunin; Alex Copeland; Alla Lapidus; Konstantinos Mavromatis; Philip Hugenholtz
Journal:  Microbiol Mol Biol Rev       Date:  2008-12       Impact factor: 11.056

4.  Research ethics. Children and population biobanks.

Authors:  David Gurwitz; Isabel Fortier; Jeantine E Lunshof; Bartha Maria Knoppers
Journal:  Science       Date:  2009-08-14       Impact factor: 47.728

5.  Fast gapped-read alignment with Bowtie 2.

Authors:  Ben Langmead; Steven L Salzberg
Journal:  Nat Methods       Date:  2012-03-04       Impact factor: 28.547

6.  Quality control and preprocessing of metagenomic datasets.

Authors:  Robert Schmieder; Robert Edwards
Journal:  Bioinformatics       Date:  2011-01-28       Impact factor: 6.937

7.  Fast identification and removal of sequence contamination from genomic and metagenomic datasets.

Authors:  Robert Schmieder; Robert Edwards
Journal:  PLoS One       Date:  2011-03-09       Impact factor: 3.240

8.  The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata.

Authors:  Ioanna Pagani; Konstantinos Liolios; Jakob Jansson; I-Min A Chen; Tatyana Smirnova; Bahador Nosrat; Victor M Markowitz; Nikos C Kyrpides
Journal:  Nucleic Acids Res       Date:  2011-12-01       Impact factor: 16.971

9.  A core gut microbiome in obese and lean twins.

Authors:  Peter J Turnbaugh; Micah Hamady; Tanya Yatsunenko; Brandi L Cantarel; Alexis Duncan; Ruth E Ley; Mitchell L Sogin; William J Jones; Bruce A Roe; Jason P Affourtit; Michael Egholm; Bernard Henrissat; Andrew C Heath; Rob Knight; Jeffrey I Gordon
Journal:  Nature       Date:  2008-11-30       Impact factor: 49.962

