Hamid Bagheri (1), Andrew J Severin (2), Hridesh Rajan (1). (1) Department of Computer Science, Ames, IA 50011, USA. (2) Genome Informatics Facility, Iowa State University, Ames, IA 50011, USA.
Abstract
MOTIVATION: As the cost of sequencing decreases, the amount of data deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission, and that metadata is prone to user error. Most public databases, such as the non-redundant (NR) database, have no methods for identifying errors in the provided metadata, so errors can propagate. Previous research analyzed misclassification in a small subset of the NR database based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates the provenance and frequency of each annotation from manually and computationally created databases, together with clustering information at 95% similarity.

RESULTS: We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases.

AVAILABILITY AND IMPLEMENTATION: Source code, dataset, documentation, Jupyter notebooks and a Docker container are available at https://github.com/boalang/nr.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
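To make the idea of frequency-based detection within a sequence cluster concrete, the following is a minimal sketch and not the authors' implementation. It assumes each cluster is given as a mapping from protein ID to a single taxon label, takes the most frequent label as the consensus, and flags members that disagree with it. It ignores annotation provenance entirely. The function names, the `min_share` threshold and the example data are hypothetical; the second function only illustrates how precision and recall would be computed on simulated data where the true misclassifications are known.

```python
from collections import Counter


def flag_outliers(cluster, min_share=0.5):
    """Return (consensus_taxon, flagged_ids) for one cluster.

    cluster   -- dict mapping protein ID -> taxon label (assumed input shape)
    min_share -- hypothetical threshold: the top label must cover more than
                 this fraction of the cluster, otherwise nothing is flagged
    """
    counts = Counter(cluster.values())
    consensus, n = counts.most_common(1)[0]
    if n / len(cluster) <= min_share:
        # No clear majority: treat the cluster as undecidable.
        return None, []
    flagged = sorted(pid for pid, taxon in cluster.items() if taxon != consensus)
    return consensus, flagged


def precision_recall(flagged, truly_misclassified):
    """Precision and recall of flagged IDs against a known ground truth."""
    flagged, truth = set(flagged), set(truly_misclassified)
    tp = len(flagged & truth)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up cluster: three proteins agree, one disagrees.
    example = {"p1": "Bacteria", "p2": "Bacteria", "p3": "Bacteria", "p4": "Eukaryota"}
    print(flag_outliers(example))
    print(precision_recall(["p4"], ["p4", "p9"]))
```

A real pipeline would compare assignments at specific taxonomic ranks and weight annotations by their source, as the abstract describes; this sketch only shows the majority-vote step.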