Detecting and correcting misclassified sequences in the large-scale public databases.

Hamid Bagheri¹, Andrew J Severin², Hridesh Rajan¹.

Abstract

MOTIVATION: As the cost of sequencing decreases, the amount of data deposited in public repositories is increasing rapidly. Public databases rely on submitters to provide metadata for each submission, and that metadata is prone to user error. Most public databases, such as the non-redundant (NR) database, have no methods for identifying errors in the provided metadata, so errors can propagate. Previous research analyzed misclassification in a small subset of the NR database based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates the provenance and frequency of each annotation from manually and computationally created databases, together with clustering information at 95% similarity.
RESULTS: We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases.
AVAILABILITY AND IMPLEMENTATION: Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
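
The abstract describes the method only at a high level: annotations are weighed by provenance and frequency within clusters at 95% similarity, and the approach is evaluated by precision and recall. The sketch below is one possible reading of that idea, not the authors' implementation (which is in the repository linked above). The provenance weights, the `min_share` threshold, the flat taxon labels, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical weights: manually curated annotations count more than
# computationally derived ones. The values are illustrative only.
PROVENANCE_WEIGHT = {"manual": 2.0, "computational": 1.0}


def consensus_taxon(annotations):
    """Return (taxon, share) for the taxon with the highest
    provenance-weighted frequency among (taxon, provenance) pairs."""
    scores = defaultdict(float)
    for taxon, provenance in annotations:
        scores[taxon] += PROVENANCE_WEIGHT[provenance]
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def flag_misclassified(clusters, min_share=0.6):
    """clusters maps a cluster id to a list of (seq_id, taxon, provenance).

    Returns {seq_id: (assigned_taxon, proposed_taxon)} for sequences whose
    assignment disagrees with a sufficiently dominant cluster consensus.
    """
    flagged = {}
    for members in clusters.values():
        best, share = consensus_taxon([(t, p) for _, t, p in members])
        if share < min_share:
            continue  # no clear consensus in this cluster; flag nothing
        for seq_id, taxon, _ in members:
            if taxon != best:
                flagged[seq_id] = (taxon, best)
    return flagged


def precision_recall(flagged_ids, truly_misclassified):
    """Precision and recall of the flagged set against a known truth set."""
    true_positives = len(flagged_ids & truly_misclassified)
    precision = true_positives / len(flagged_ids) if flagged_ids else 0.0
    recall = (true_positives / len(truly_misclassified)
              if truly_misclassified else 0.0)
    return precision, recall


# Toy clusters (made-up identifiers, standing in for 95%-similarity clusters).
clusters = {
    "c1": [("p1", "Bacteria", "manual"),
           ("p2", "Bacteria", "computational"),
           ("p3", "Bacteria", "computational"),
           ("p4", "Eukaryota", "computational")],
    "c2": [("p5", "Archaea", "computational"),
           ("p6", "Bacteria", "computational")],
}
flagged = flag_misclassified(clusters)
print(flagged)
print(precision_recall(set(flagged), {"p4", "p6"}))
```

In this toy input, cluster `c1` has a dominant weighted consensus, so the one dissenting sequence is flagged, while `c2` is evenly split and is left alone. A real pipeline would compare full taxonomic lineages instead of flat labels and would simulate misclassifications to obtain the truth set.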

Year: 2020 PMID: 32579213 PMCID: PMC7821992 DOI: 10.1093/bioinformatics/btaa586

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
