Gail L Rosen (1), Tze Yee Lim. (1) Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA. gailr@ece.drexel.edu.
Abstract
BACKGROUND: Classifying the fungal and viral content of a sample is an important component of analyzing microbial communities in environmental media. A method that can classify any fragment of these organisms' DNA is therefore needed. RESULTS: We update the naïve Bayes classification (NBC) tool to classify reads originating from viral and fungal organisms. NBC classifies a fungal dataset similarly to the Basic Local Alignment Search Tool (BLAST) and the Ribosomal Database Project (RDP) classifier. We also show NBC's similarities and differences to RDP on a fungal large subunit (LSU) ribosomal DNA dataset. For viruses in the training database, strain-level classification accuracy is 98%, while for reads originating from sequences not in the database, order-level accuracy is 78%, where order refers to the taxonomic rank in the tree of life. CONCLUSIONS: In addition to being competitive with other available classifiers, NBC has the potential to handle reads originating from any location in the genome. We recommend using the Bacteria/Archaea, Fungal, and Virus databases separately because of algorithmic biases towards long genomes. The tool is publicly available at: http://nbc.ece.drexel.edu.
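To illustrate the general idea of naïve Bayes read classification, the sketch below scores a read against per-genome k-mer frequency profiles and picks the highest-scoring genome. It is a minimal illustration under stated assumptions, not the NBC tool's implementation: the function names, the k-mer length of 3, add-one smoothing, and uniform priors are all choices made for this example only.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence (uppercased)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(genomes, k=3):
    """Build one k-mer count profile per labelled training sequence.

    genomes: dict mapping a label to a DNA string (hypothetical input format).
    """
    vocab_size = 4 ** k  # number of possible k-mers over A, C, G, T
    models = {}
    for label, seq in genomes.items():
        counts = Counter(kmers(seq, k))
        models[label] = (counts, sum(counts.values()), vocab_size)
    return models


def log_score(read, model, k=3):
    """Sum of log k-mer probabilities, with add-one smoothing (an assumption)."""
    counts, total, vocab_size = model
    return sum(
        math.log((counts[km] + 1) / (total + vocab_size))
        for km in kmers(read, k)
    )


def classify(read, models, k=3):
    """Return the label with the highest score, assuming uniform priors."""
    return max(models, key=lambda label: log_score(read, models[label], k))
```

Because each k-mer is scored independently, a read from any region of a genome can be scored in the same way, without requiring a marker gene.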