Marc G Chevrette (1,2), Fabian Aicheler (3), Oliver Kohlbacher (3,4), Cameron R Currie (2), Marnix H Medema (5).
1. Department of Genetics.
2. Department of Bacteriology and J. F. Crow Institute for the Study of Evolution, University of Wisconsin-Madison, Madison, WI 53706, USA.
3. Applied Bioinformatics, Department of Computer Science, Quantitative Biology Center and Center for Bioinformatics, University of Tübingen, 72076 Tübingen, Germany.
4. Biomolecular Interactions, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany.
5. Bioinformatics Group, Wageningen University, 6708PB Wageningen, The Netherlands.
Abstract
SUMMARY: Nonribosomally synthesized peptides (NRPs) are natural products with widespread applications in medicine and biotechnology. Many algorithms have been developed to predict the substrate specificities of nonribosomal peptide synthetase adenylation (A) domains from DNA sequences, which enables prioritization, dereplication, and integration with other data types in discovery efforts. However, insufficient training data and a lack of clarity regarding prediction quality have impeded optimal use. Here, we introduce prediCAT, a new phylogenetics-inspired algorithm, which quantitatively estimates the degree of predictability of each A-domain. We then systematically benchmarked all algorithms on a newly gathered, independent test set of 434 A-domain sequences, showing that active-site-motif-based algorithms outperform whole-domain-based methods. Subsequently, we developed SANDPUMA, an ensemble algorithm based on newly trained versions of all high-performing algorithms, which significantly outperforms individual methods. Finally, we deployed SANDPUMA in a systematic investigation of 7635 Actinobacteria genomes, suggesting that NRP chemical diversity is much higher than previously estimated. SANDPUMA has been integrated into the widely used antiSMASH biosynthetic gene cluster analysis pipeline and is also available as an open-source, standalone tool.
AVAILABILITY AND IMPLEMENTATION: SANDPUMA is freely available at https://bitbucket.org/chevrm/sandpuma and as a Docker image at https://hub.docker.com/r/chevrm/sandpuma/ under the GNU General Public License v3 (GPL3).
CONTACT: chevrette@wisc.edu or marnix.medema@wur.nl.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
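The summary describes combining several substrate-specificity predictors into an ensemble and scoring them against a labelled test set. The sketch below illustrates that general idea only, using a plain majority vote; it is not SANDPUMA's actual combination rule, which the abstract does not specify. The method names, domain identifiers, substrate labels, and the `min_agreement` threshold are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions, min_agreement=0.5):
    """Combine per-method substrate calls for one A-domain.

    `predictions` maps a (hypothetical) method name to its substrate call,
    or to None when that method abstains. Returns the most common call if
    it is shared by a strict majority of the methods that made a call,
    otherwise "no_call".
    """
    calls = [call for call in predictions.values() if call is not None]
    if not calls:
        return "no_call"
    substrate, count = Counter(calls).most_common(1)[0]
    return substrate if count / len(calls) > min_agreement else "no_call"


def accuracy(predicted, truth):
    """Fraction of test-set domains whose predicted substrate matches the label."""
    correct = sum(1 for domain, label in truth.items()
                  if predicted.get(domain) == label)
    return correct / len(truth)


# Toy data (hypothetical): three methods, two A-domains.
per_domain_calls = {
    "domain_1": {"motif_method_a": "ser", "motif_method_b": "ser", "whole_domain_method": "thr"},
    "domain_2": {"motif_method_a": "gly", "motif_method_b": "ala", "whole_domain_method": None},
}
truth = {"domain_1": "ser", "domain_2": "gly"}

ensemble_calls = {domain: majority_vote(calls)
                  for domain, calls in per_domain_calls.items()}
print(ensemble_calls)
print(accuracy(ensemble_calls, truth))
```

Returning "no_call" when the methods disagree trades coverage for precision; an ensemble that weights methods by their benchmarked performance would be a natural refinement of this sketch.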