Illyoung Choi (1), Alise J Ponsero (2), Matthew Bomhoff (2), Ken Youens-Clark (2), John H Hartman (1), Bonnie L Hurwitz (2, 3)
(1) Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA.
(2) Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA.
(3) BIO5 Institute, University of Arizona, 1657 E. Helen Street, Tucson, Arizona, 85719, USA.
Abstract
Background: Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content.
Results: We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community.
Conclusions: A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
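The core idea in the Results, comparing samples by the cosine similarity of their k-mer abundance profiles, can be illustrated with a small in-memory sketch. This is not Libra's implementation: Libra runs on Hadoop and its exact k-mer weighting is not described in this abstract. The function names (`kmer_counts`, `cosine_similarity`, `cosine_distance`), the toy reads, and the choice of raw counts without canonical (reverse-complement) k-mers are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every k-mer across a list of reads (hypothetical helper).

    K-mers containing characters other than A, C, G, T are skipped.
    Reverse complements are NOT merged; that is a simplification.
    """
    counts = Counter()
    valid = set("ACGT")
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= valid:
                counts[kmer] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two k-mer abundance vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[kmer] * b[kmer] for kmer in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as sharing nothing
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance form used for clustering: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


if __name__ == "__main__":
    # Toy samples (made up): sample_b is sample_a sequenced "twice as deep".
    sample_a = ["ACGTACGT", "TTGACA"]
    sample_b = sample_a * 2
    sample_c = ["GGGGGG"]

    a = kmer_counts(sample_a, 3)
    b = kmer_counts(sample_b, 3)
    c = kmer_counts(sample_c, 3)

    print("a vs b distance:", cosine_distance(a, b))
    print("a vs c distance:", cosine_distance(a, c))
```

Because cosine similarity depends only on the direction of the abundance vector, doubling every count (as in `sample_b`) leaves the distance at essentially zero, which is the sense in which the metric normalizes for sequencing depth. A presence/absence metric would discard the abundance information entirely.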
Authors: Illyoung Choi; Alise J Ponsero; Matthew Bomhoff; Ken Youens-Clark; John H Hartman; Bonnie L Hurwitz Journal: Gigascience Date: 2019-02-01 Impact factor: 6.524