Detecting and correcting misclassified sequences in the large-scale public databases.

Hamid Bagheri, Andrew J Severin, Hridesh Rajan.

Abstract

MOTIVATION: As the cost of sequencing decreases, the amount of data deposited into public repositories is increasing rapidly. Public databases rely on submitters to provide metadata for each submission, and this metadata is prone to user error. Most public databases, such as the non-redundant (NR) database, have no methods for identifying errors in the provided metadata, which creates the potential for error propagation. Previous research analyzed misclassification in a small subset of the NR database based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates the provenance and frequency of each annotation from manually and computationally created databases, together with clustering information at 95% similarity.
RESULTS: We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases.
AVAILABILITY AND IMPLEMENTATION: Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2020. Published by Oxford University Press.
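
The abstract describes combining the provenance and frequency of annotations within 95%-similarity clusters to find the most probable taxonomic assignment. A minimal sketch of that general idea follows, written as a provenance-weighted majority vote inside one cluster. It is not the authors' implementation (see the repository above for that): the weight values, the function names `consensus_taxon` and `flag_misclassified`, and the toy cluster are all hypothetical, and taxa are compared as flat labels rather than as positions in a taxonomic tree.

```python
from collections import defaultdict

# Hypothetical provenance weights: annotations from manually curated
# sources count more than computationally derived ones. Illustrative only.
PROVENANCE_WEIGHT = {"manual": 2.0, "computational": 1.0}


def consensus_taxon(annotations):
    """Return the taxon with the highest provenance-weighted vote.

    `annotations` is a list of (taxon, provenance) pairs collected from
    the members of one sequence cluster.
    """
    scores = defaultdict(float)
    for taxon, provenance in annotations:
        scores[taxon] += PROVENANCE_WEIGHT.get(provenance, 1.0)
    # Highest score wins; ties are broken alphabetically for determinism.
    return min(scores, key=lambda t: (-scores[t], t))


def flag_misclassified(cluster):
    """Return (consensus, sorted ids whose taxon disagrees with it).

    `cluster` maps a sequence id to its (taxon, provenance) annotation.
    """
    consensus = consensus_taxon(list(cluster.values()))
    flagged = sorted(
        sid for sid, (taxon, _) in cluster.items() if taxon != consensus
    )
    return consensus, flagged


# Toy cluster (made-up data): three members agree, one does not.
cluster = {
    "seq1": ("Escherichia coli", "manual"),
    "seq2": ("Escherichia coli", "computational"),
    "seq3": ("Escherichia coli", "computational"),
    "seq4": ("Homo sapiens", "computational"),
}
consensus, flagged = flag_misclassified(cluster)
print(consensus, flagged)
```

In this toy cluster the consensus is "Escherichia coli" and only "seq4" is flagged as potentially misclassified. A flagged sequence is a candidate for review, not a confirmed error.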

Year:  2020        PMID: 32579213      PMCID: PMC7821992          DOI: 10.1093/bioinformatics/btaa586

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


