Philip Heller, H James Tripp, Kendra Turk-Kubo, Jonathan P Zehr. Department of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA; Department of Energy (DOE) Joint Genome Institute, Walnut Creek, CA 94598, USA; and Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA.
Abstract
MOTIVATION: Studies of the biochemical functions and activities of uncultivated microorganisms in the environment require analysis of DNA sequences, both for phylogenetic characterization and for the development of sequence-based assays that detect microorganisms. The number of sequences for genes that indicate environmentally important functions, such as nitrogen (N2) fixation, has grown rapidly over the past few decades. Obtaining these sequences from the National Center for Biotechnology Information's GenBank database is problematic because of annotation errors, nomenclature variation and paralogues; moreover, GenBank's structure and tools are not conducive to searching solely by function. For some genes, such as the nifH gene commonly used to assess community potential for N2 fixation, manual collection and curation are becoming intractable because of the large number of sequences in GenBank and the large number of highly similar paralogues. If analysis is to keep pace with sequence discovery, an automated retrieval and curation system is necessary.
RESULTS: ARBitrator uses a two-step process: a broad collection of potential homologues, followed by screening with a best-hit strategy against conserved domains. A total of 34 420 nifH sequences were identified in GenBank as of November 20, 2012. The false-positive rate is ∼0.033%. ARBitrator rapidly updates a public nifH sequence database, and we show that it can be adapted for other genes.
AVAILABILITY AND IMPLEMENTATION: Java source and executable code are freely available to non-commercial users at http://pmc.ucsc.edu/~wwwzehr/research/database/.
CONTACT: zehrj@ucsc.edu
SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.
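The two-step process named in RESULTS can be illustrated with a minimal Python sketch. It is not ARBitrator's implementation (which is in Java) and does not reproduce its actual searches, thresholds or scoring. The function names, the E-value cutoff, the domain labels "TARGET" and "PARALOGUE" and all scores are hypothetical placeholders. The sketch assumes that "best hit" means a candidate is accepted only when its highest-scoring conserved-domain match belongs to the target gene family.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    domain: str   # hypothetical conserved-domain label
    score: float  # hypothetical similarity score; higher is better


def collect_candidates(search_hits, max_evalue):
    """Step 1 (assumed form): keep every sequence whose similarity-search
    E-value passes a deliberately lenient cutoff, so few true homologues
    are missed at the cost of admitting paralogues."""
    return [seq_id for seq_id, evalue in search_hits.items() if evalue <= max_evalue]


def screen_by_best_hit(candidates, domain_hits, target_domains):
    """Step 2 (assumed form): accept a candidate only if its single
    best-scoring conserved-domain hit is one of the target domains."""
    accepted = []
    for seq_id in candidates:
        hits = domain_hits.get(seq_id, [])
        if not hits:
            continue  # no domain evidence at all: reject
        top = max(hits, key=lambda h: h.score)
        if top.domain in target_domains:
            accepted.append(seq_id)
    return accepted


# Toy data, invented purely for illustration.
search_hits = {"seqA": 1e-50, "seqB": 1e-5, "seqC": 5.0}
domain_hits = {
    "seqA": [DomainHit("TARGET", 200.0), DomainHit("PARALOGUE", 150.0)],
    "seqB": [DomainHit("TARGET", 90.0), DomainHit("PARALOGUE", 120.0)],
}

candidates = collect_candidates(search_hits, max_evalue=1.0)
accepted = screen_by_best_hit(candidates, domain_hits, {"TARGET"})
print(accepted)  # ['seqA']
```

In this toy run, seqC is dropped by the lenient first step, and seqB is dropped by the second step because its best domain hit is the paralogue, even though it also matches the target domain. This shows why a best-hit comparison can separate highly similar paralogues that a single similarity threshold would admit.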
Authors: Lauren F Messer; Claire Mahaffey; Charlotte M Robinson; Thomas C Jeffries; Kirralee G Baker; Jaime Bibiloni Isaksson; Martin Ostrowski; Martina A Doblin; Mark V Brown; Justin R Seymour Journal: ISME J Date: 2015-11-27 Impact factor: 10.302
Authors: Lauren F Messer; Mark V Brown; Miles J Furnas; Richard L Carney; A D McKinnon; Justin R Seymour Journal: Front Microbiol Date: 2017-06-07 Impact factor: 5.640
Authors: Mar Benavides; Pia H Moisander; Hugo Berthelot; Thorsten Dittmar; Olivier Grosso; Sophie Bonnet Journal: PLoS One Date: 2015-12-11 Impact factor: 3.240
Authors: Mar Fernández-Méndez; Kendra A Turk-Kubo; Pier L Buttigieg; Josephine Z Rapp; Thomas Krumpen; Jonathan P Zehr; Antje Boetius Journal: Front Microbiol Date: 2016-11-23 Impact factor: 5.640
Authors: Roey Angel; Maximilian Nepel; Christopher Panhölzl; Hannes Schmidt; Craig W Herbold; Stephanie A Eichorst; Dagmar Woebken Journal: Front Microbiol Date: 2018-04-30 Impact factor: 5.640