Metagenome Proteins and Database Contamination.

Irina R Arkhipova

Abstract

Continued influx of metagenome-derived proteins with misannotated taxonomy into conventional databases, including RefSeq, threatens to eliminate the value of taxonomy identifiers. To prevent this, urgent efforts should be undertaken by submitters of metagenomic data sets as well as by database managers.
Copyright © 2020 Arkhipova.

Keywords:  MAG; RefSeq; binning; classification; metagenomics; taxonomy; transposons

Year:  2020        PMID: 33148820      PMCID: PMC7643828          DOI: 10.1128/mSphere.00854-20

Source DB:  PubMed          Journal:  mSphere        ISSN: 2379-5042            Impact factor:   4.389


COMMENTARY

Explosive growth of metagenomic sequences in major databases reflects impressive progress in the development of hardware and software for large-scale shotgun metagenomic sequencing, assembly, and annotation. However, size filtration during microbial sample collection does not provide much protection from eukaryotic cells, such as metazoan sperm or fungal spores, and while bioinformatic tools can correctly assign core eukaryotic or prokaryotic genes to the proper domain of life, they fail to guard against less-conserved sequences that computational pipelines systematically misrecognize. This is especially applicable to poorly studied or rapidly evolving “optional” genes and, most prominently, to transposable elements, which often show deviations in GC content, k-mer composition, and read coverage. With the advent of metagenomics, eukaryotic transposons and understudied gene families are being uncontrollably assigned to prokaryotic genomes and are on track to outpace the well-known problem of eukaryotic genome contamination by associated microorganisms (1). In the era of fully automated analyses with little or no manual curation, huge batches of data, such as soil or ocean metagenomes, undergo taxonomic classification relying on loose criteria, i.e., genome binning based on k-mer composition and read coverage and taxonomy binning based on similarity to reference data sets (e.g., the curated RefSeq database [2]), neither of which can properly identify eukaryotic mobilome components. Such misassigned sequences, which may comprise over one-tenth of a metagenomic data set (3), could soon outnumber microbial contaminants in eukaryotic genomes, which can amount to a small percentage of a whole-genome-sequencing (WGS) data set but are easier to filter out due to the higher density of known genes in prokaryotes.
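The composition signals mentioned above can be made concrete with a small sketch. It is an illustration only, not the method of any particular binning tool: the function names, the choice of k = 4, and the Euclidean distance are assumptions made here, and real binners typically also merge reverse-complement k-mers and combine composition with read coverage.

```python
from collections import Counter
from itertools import product
from math import sqrt


def gc_content(seq: str) -> float:
    """Fraction of G+C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_profile(seq: str, k: int = 4) -> list[float]:
    """Relative frequency of every ACGT k-mer, in a fixed lexicographic order.

    Simplification (assumption): strands are not merged, and k-mers that
    contain ambiguous bases such as N are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def profile_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two k-mer profiles of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

A contig whose GC content or k-mer profile lies far from the rest of its bin is a candidate for closer inspection. The reverse does not hold: a transposon-bearing eukaryotic contig can match a bacterial bin by coincidence, which is the failure mode described in this commentary.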
Inclusion of metagenomes in conventional taxonomy-aware databases defies the very purpose of keeping species-specific database records, on which the scientific community has been relying for many years. Most researchers searching GenBank entries tend to trust the description of the nr protein database as definitively excluding environmental (ENV) samples from WGS projects (4), given that a separate env_nr database exists. They would not begin their analysis by filtering out misassigned entries based on scattered, inconsistent, and often parenthetical notes in the “keyword” or “source” fields when the “organism” field presents them with the (incorrect) taxonomic assignment; moreover, even these limited indicators are often absent. Table 1 exemplifies the problem, showing sequences from a Tad clade of eukaryotic non-long-terminal-repeat (non-LTR) retrotransposons, which together with adjacent fungal or animal genes were misassigned to various bacterial (BCT) taxa distributed between ENV and BCT subdivisions, many of which were assigned to RefSeq as “true” bacterial genes, further amplifying the errors. Numerous other transposon superfamilies are displaying similar patterns (not shown), and addition of adjacent genes on the same contigs multiplies the number of misassigned proteins entering RefSeq. With an ever-increasing burden on peer reviewers, screening for such false positives before publication is not too high on the journals’ priority list.
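To illustrate the kind of manual screening described above, the following sketch collects the scattered environmental indicators of a record in one place. It assumes a simplified, hypothetical record layout (a dict with "division", "keywords", and "source" entries) and hypothetical lists of hint terms; it is not a parser for real GenBank records, and, as noted above, the indicators are often absent altogether.

```python
# Hypothetical hint terms; actual records use inconsistent wording.
ENV_KEYWORDS = {"env", "mag", "metagenome assembled genome"}
SOURCE_PHRASES = ("metagenome", "environmental sample")


def environmental_hints(record: dict) -> list[str]:
    """Return every indication that a record derives from a metagenome.

    `record` is a simplified stand-in for a database entry (assumption):
    {"division": str, "keywords": list of str, "source": str}.
    An empty result means no indicator was found, not that the record
    is free of metagenomic origin.
    """
    hints = []
    if record.get("division", "").upper() == "ENV":
        hints.append("division=ENV")
    for keyword in record.get("keywords", []):
        # Whole-keyword comparison, so "MAG" does not match "Magnaporthe".
        if keyword.strip().lower() in ENV_KEYWORDS:
            hints.append(f"keyword={keyword.strip()}")
    source = record.get("source", "").lower()
    for phrase in SOURCE_PHRASES:
        if phrase in source:
            hints.append(f"source mentions '{phrase}'")
    return hints


# Made-up example record, for illustration only.
example = {
    "division": "ENV",
    "keywords": ["ENV", "Metagenome Assembled Genome", "MAG"],
    "source": "Example bacterium (soil metagenome)",
}
print(environmental_hints(example))
```

The design returns the list of matched indicators rather than a yes/no answer, so a reader can see how weak the evidence is for any given entry.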
TABLE 1

Top 50 misassigned bacterial metagenome entries from the nr protein database (accessed 15 October 2020) queried with a fungal non-LTR retrotransposon (MGR583) from Pyricularia (Magnaporthe) grisea (O13348) harboring the RT_nLTR_like (cd01650), R1_I_EN (cd09077), and Rnase_HI_RT_non_LTR (cd09276) conserved domains

| DB | SD | Accession no. | Description | Taxonomy | PMID or date |
|---|---|---|---|---|---|
| gb | BCT | PON18153.1 | Hypothetical protein C2W62_09385 | “Candidatus Entotheonella serta” | 29439203 |
| ref | BCT | WP_190751469.1 | Endonuclease/exonuclease/phosphatase | Tolypothrix sp. FACHB-123 | 29112715 |
| gb | ENV | TKW60303.1 | Hypothetical protein DI628_08940 | Blastochloris viridis | 29180750 |
| gb | ENV | PZR61209.1 | Hypothetical protein DI537_52575 | Pseudomonas stutzeri | 29180750 |
| gb | ENV | TMI85391.1 | Hypothetical protein E6H10_03430 | Bacteroidetes bacterium | 31110364 |
| gb | ENV | TMC24228.1 | Hypothetical protein E6J34_00855 | Chloroflexi bacterium | 31110364 |
| gb | ENV | MAT33547.1 | Hypothetical protein | Ponticaulis sp. | 29337314 |
| gb | ENV | TMI82166.1 | Hypothetical protein E6H10_10105 | Bacteroidetes bacterium | 31110364 |
| gb | ENV | MAD87981.1 | Hypothetical protein | Deltaproteobacteria bacterium | 29337314 |
| ref | BCT | WP_167367247.1 | Endonuclease/exonuclease/phosphatase | Solemya elarraichensis gill symbiont | 29112715 |
| gb | ENV | TMC17713.1 | Hypothetical protein E6J34_18185 | Chloroflexi bacterium | 31110364 |
| gb | ENV | TMI79299.1 | Hypothetical protein E6H10_15575 | Bacteroidetes bacterium | 31110364 |
| gb | ENV | RZL28817.1 | Hypothetical protein EOP64_03250 | Sphingomonas sp. | 30498029 |
| gb | ENV | MAT33548.1 | Hypothetical protein | Ponticaulis sp. | 29337314 |
| gb | ENV | MBM83139.1 | Hypothetical protein | Planctomycetaceae bacterium | 29337314 |
| ref | BCT | WP_034999417.1 | Hypothetical protein | Beijerinckia mobilis | 29112715 |
| gb | BCT | PON12600.1 | Hypothetical protein C2W62_38760 | “Candidatus Entotheonella serta” | 29439203 |
| gb | ENV | RYF04990.1 | Reverse transcriptase family protein | Oxalobacteraceae bacterium | 30498029 |
| emb | BCT | SHE22257.1 | Hypothetical protein BBROOKSOX_612 | Bathymodiolus brooksi thiotrophic gill symbiont | 31611646 |
| ref | BCT | WP_139476173.1 | Hypothetical protein | Bathymodiolus brooksi thiotrophic gill symbiont | 29112715 |
| gb | ENV | RUA04271.1 | Hypothetical protein DSY43_06760 | Gammaproteobacteria bacterium | 31073213 |
| gb | ENV | RYE04920.1 | Hypothetical protein EOP33_08145 | Rickettsiaceae bacterium | 30498029 |
| ref | BCT | WP_078456858.1 | Hypothetical protein | Solemya velum gill symbiont | 29112715 |
| ref | BCT | WP_143556149.1 | Hypothetical protein | Solemya velum gill symbiont | 29112715 |
| gb | ENV | NQZ78185.1 | Endonuclease/exonuclease/phosphatase | Ekhidna sp. | 2 June 2020 |
| gb | ENV | PYK09849.1 | Hypothetical protein DME65_11230 | Verrucomicrobia bacterium | 29899444 |
| gb | ENV | PYY19146.1 | Hypothetical protein DMG62_24900 | Acidobacteria bacterium | 29899444 |
| ref | BCT | WP_135568158.1 | Hypothetical protein, partial | Solemya elarraichensis gill symbiont | 29112715 |
| gb | ENV | RYE26883.1 | Hypothetical protein EOP45_02990 | Sphingobacteriaceae bacterium | 30498029 |
| ref | BCT | WP_190751467.1 | Hypothetical protein | Tolypothrix sp. FACHB-123 | 29112715 |
| ref | BCT | WP_180337602.1 | RNA-directed DNA polymerase | Bathymodiolus brooksi thiotrophic gill symbiont | 29112715 |
| ref | BCT | WP_180337587.1 | RNA-directed DNA polymerase | Bathymodiolus brooksi thiotrophic gill symbiont | 29112715 |
| ref | BCT | WP_101731609.1 | Endonuclease/exonuclease/phosphatase | Carnobacterium maltaromaticum | 29112715 |
| gb | ENV | RPJ78669.1 | Hypothetical protein EHM20_03370 | Alphaproteobacteria bacterium | 30086797 |
| ref | BCT | WP_139476202.1 | Reverse transcriptase family protein | Bathymodiolus brooksi thiotrophic gill symbiont | 29112715 |
| gb | ENV | PZO93396.1 | Hypothetical protein DI617_08800 | Streptococcus pyogenes | 29180750 |
| gb | ENV | MBD0342806.1 | Reverse transcriptase family protein | Microcoleus sp. Co-bin12 | 14 September 2020 |
| gb | ENV | PYS85964.1 | Hypothetical protein DMF62_17490 | Acidobacteria bacterium | 29899444 |
| gb | ENV | NQY31272.1 | RNA-directed DNA polymerase | Flavobacteriaceae bacterium | 2 June 2020 |
| tpg | ENV | HBI40457.1 | TPA: hypothetical protein | Tenacibaculum sp. | 30148503 |
| gb | ENV | PYS85881.1 | Hypothetical protein DMF62_17715 | Acidobacteria bacterium | 29899444 |
| tpg | ENV | HBI40644.1 | TPA: hypothetical protein | Tenacibaculum sp. | 30148503 |
| ref | BCT | WP_185167786.1 | Reverse transcriptase family protein | Proteus mirabilis | 29112715 |
| ref | BCT | WP_148116844.1 | Endonuclease/exonuclease/phosphatase | Anaplasma phagocytophilum | 29112715 |
| gb | ENV | TMI86169.1 | Hypothetical protein E6H10_00780 | Bacteroidetes bacterium | 31110364 |
| gb | ENV | NEO82313.1 | Hypothetical protein | Moorea sp. SIO4G3 | 16 February 2020 |
| gb | ENV | NQY31470.1 | Reverse transcriptase family protein | Flavobacteriaceae bacterium | 2 June 2020 |
| ref | BCT | WP_141663621.1 | Reverse transcriptase family protein | Bacterium 2013Ark19i | 29112715 |
| ref | BCT | WP_160869998.1 | Reverse transcriptase-like protein | Pantoea sp. Taur | 29112715 |
| ref | BCT | WP_155403559.1 | Hypothetical protein, partial | Piscirickettsia salmonis | 29112715 |

BLASTP search was limited to Bacteria (taxid:2); database, nr. Data represent all nonredundant GenBank coding DNA sequence (CDS) translations plus PDB plus Swiss-Prot plus PIR plus PRF, excluding environmental samples from WGS projects; hits with E values of >1e−30 are shown. DB, database (GenBank, EMBL, RefSeq, or third party); SD, database subdivision (ENV, environmental samples; BCT, bacterial samples); TPA, third-party annotation. All RefSeq entries are linked to reference 2 (PMID 29112715); unpublished entries unlinked to a PMID show the release date.

Database contamination from metagenomes is an acute problem which is poised to become overwhelming. Unless urgent measures are taken, the value of the taxonomy field in gene annotations will eventually be erased by false positives from misassigned metagenomic bins. Many symbiont WGS projects in fact represent host-associated metagenomes (Table 1 lists many examples of host contigs misassigned to symbiont “isolates”), and it would be prudent in these cases to employ taxonomy classifiers which add host genomes, if sequenced, to a reference data set (3). However, this option is rarely applicable to environmental metagenomes, where eukaryotic contigs can be binned into bacterial metagenome-assembled genomes (MAGs) through less-than-certain matches to a taxonomy reference and occasional coincidences in coverage and k-mer composition, features especially applicable to mobile elements, which may dominate the genomes of understudied eukaryotes.
Software tools recognizing eukaryotic sequences in metagenomes (5–7) should be upgraded to include recognition of mobilome coding sequences from curated databases such as Repbase (8), and the corresponding modules should be incorporated into prokaryotic genome annotation pipelines (2). However, while metagenome scientists are striving to improve their genomic and taxonomic binning strategies (3, 9, 10), immediate measures should be taken (i) to ensure separation of fragmented metagenomes from conventional databases, preventing misassigned fragments from entering RefSeq, and (ii) to keep the small contigs out of the rigid taxonomy frames by assigning them to “unclassified sequences” in the taxonomy field. Contigs shorter than 10 to 15 kb are most likely to get misassigned, especially if they lack any “marker genes.” These practices should be implemented before metagenome-derived proteins with random taxonomic assignments overrun species-specific databases to such an extent that the taxonomy field becomes useless.
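Measure (ii) can be written down as a simple triage rule. The sketch below is one possible reading of it, not a prescribed implementation: the Contig type, its field names, the function name, and the default cutoff of 15 kb (the upper end of the 10 to 15 kb range given above) are assumptions made for illustration, and combining "short" with "no marker genes" is a conservative interpretation of the text.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    """Hypothetical summary of one binned contig."""
    name: str
    length_bp: int
    marker_genes: int      # number of taxonomic marker genes found on the contig
    proposed_taxon: str    # taxon suggested by the binning pipeline


def taxonomy_label(contig: Contig, min_length_bp: int = 15_000) -> str:
    """Return the taxonomy label to submit for a contig.

    Short contigs without any marker gene are withheld from the proposed
    taxon and labeled "unclassified sequences"; all other contigs keep
    the label proposed by the pipeline.
    """
    if contig.length_bp < min_length_bp and contig.marker_genes == 0:
        return "unclassified sequences"
    return contig.proposed_taxon


# Made-up contigs, for illustration only.
short_unmarked = Contig("contig_1", 8_000, 0, "Bacteroidetes bacterium")
long_unmarked = Contig("contig_2", 250_000, 0, "Bacteroidetes bacterium")
print(taxonomy_label(short_unmarked))
print(taxonomy_label(long_unmarked))
```

A stricter policy would drop the marker-gene condition and label every contig below the cutoff as unclassified; the cutoff is a parameter so that either end of the 10 to 15 kb range can be applied.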
  9 in total

1.  Repbase Update, a database of repetitive elements in eukaryotic genomes.

Authors:  Weidong Bao; Kenji K Kojima; Oleksiy Kohany
Journal:  Mob DNA       Date:  2015-06-02

2.  Benchmarking Metagenomics Tools for Taxonomic Classification.

Authors:  Simon H Ye; Katherine J Siddle; Daniel J Park; Pardis C Sabeti
Journal:  Cell       Date:  2019-08-08       Impact factor: 41.582

3.  Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software.

Authors:  Alexander Sczyrba; Peter Hofmann; Peter Belmann; David Koslicki; Stefan Janssen; Johannes Dröge; Ivan Gregor; Stephan Majda; Jessika Fiedler; Eik Dahms; Andreas Bremges; Adrian Fritz; Ruben Garrido-Oter; Tue Sparholt Jørgensen; Nicole Shapiro; Philip D Blood; Alexey Gurevich; Yang Bai; Dmitrij Turaev; Matthew Z DeMaere; Rayan Chikhi; Niranjan Nagarajan; Christopher Quince; Fernando Meyer; Monika Balvočiūtė; Lars Hestbjerg Hansen; Søren J Sørensen; Burton K H Chia; Bertrand Denis; Jeff L Froula; Zhong Wang; Robert Egan; Dongwan Don Kang; Jeffrey J Cook; Charles Deltel; Michael Beckstette; Claire Lemaitre; Pierre Peterlongo; Guillaume Rizk; Dominique Lavenier; Yu-Wei Wu; Steven W Singer; Chirag Jain; Marc Strous; Heiner Klingenberg; Peter Meinicke; Michael D Barton; Thomas Lingner; Hsin-Hung Lin; Yu-Chieh Liao; Genivaldo Gueiros Z Silva; Daniel A Cuevas; Robert A Edwards; Surya Saha; Vitor C Piro; Bernhard Y Renard; Mihai Pop; Hans-Peter Klenk; Markus Göker; Nikos C Kyrpides; Tanja Woyke; Julia A Vorholt; Paul Schulze-Lefert; Edward M Rubin; Aaron E Darling; Thomas Rattei; Alice C McHardy
Journal:  Nat Methods       Date:  2017-10-02       Impact factor: 28.547

4.  Genome-reconstruction for eukaryotes from complex natural microbial communities.

Authors:  Patrick T West; Alexander J Probst; Igor V Grigoriev; Brian C Thomas; Jillian F Banfield
Journal:  Genome Res       Date:  2018-03-01       Impact factor: 9.043

5.  Identification of fungi in shotgun metagenomics datasets.

Authors:  Paul D Donovan; Gabriel Gonzalez; Desmond G Higgins; Geraldine Butler; Kimihito Ito
Journal:  PLoS One       Date:  2018-02-14       Impact factor: 3.240

6.  MiCoP: microbial community profiling method for detecting viral and fungal organisms in metagenomic samples.

Authors:  Nathan LaPierre; Serghei Mangul; Mohammed Alser; Igor Mandric; Nicholas C Wu; David Koslicki; Eleazar Eskin
Journal:  BMC Genomics       Date:  2019-06-06       Impact factor: 3.969

7.  Accurate and complete genomes from metagenomes.

Authors:  Lin-Xing Chen; Karthik Anantharaman; Alon Shaiber; A Murat Eren; Jillian F Banfield
Journal:  Genome Res       Date:  2020-03-18       Impact factor: 9.043

8.  RefSeq: an update on prokaryotic genome annotation and curation.

Authors:  Daniel H Haft; Michael DiCuccio; Azat Badretdin; Vyacheslav Brover; Vyacheslav Chetvernin; Kathleen O'Neill; Wenjun Li; Farideh Chitsaz; Myra K Derbyshire; Noreen R Gonzales; Marc Gwadz; Fu Lu; Gabriele H Marchler; James S Song; Narmada Thanki; Roxanne A Yamashita; Chanjuan Zheng; Françoise Thibaud-Nissen; Lewis Y Geer; Aron Marchler-Bauer; Kim D Pruitt
Journal:  Nucleic Acids Res       Date:  2018-01-04       Impact factor: 16.971

9.  Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank.

Authors:  Martin Steinegger; Steven L Salzberg
Journal:  Genome Biol       Date:  2020-05-12       Impact factor: 13.583
