Marnix H Medema, Yared Paalvast, Don D Nguyen, Alexey Melnik, Pieter C Dorrestein, Eriko Takano, Rainer Breitling.
Abstract
Nonribosomally and ribosomally synthesized bioactive peptides constitute a source of molecules of great biomedical importance, including antibiotics such as penicillin, immunosuppressants such as cyclosporine, and cytostatics such as bleomycin. Recently, an innovative mass-spectrometry-based strategy, peptidogenomics, has been pioneered to effectively mine microbial strains for novel peptidic metabolites. Even though mass-spectrometric peptide detection can be performed quite fast, true high-throughput natural product discovery approaches have still been limited by the inability to rapidly match the identified tandem mass spectra to the gene clusters responsible for the biosynthesis of the corresponding compounds. With Pep2Path, we introduce a software package to fully automate the peptidogenomics approach through the rapid Bayesian probabilistic matching of mass spectra to their corresponding biosynthetic gene clusters. Detailed benchmarking of the method shows that the approach is powerful enough to correctly identify gene clusters even in data sets that consist of hundreds of genomes, which also makes it possible to match compounds from unsequenced organisms to closely related biosynthetic gene clusters in other genomes. Applying Pep2Path to a data set of compounds without known biosynthesis routes, we were able to identify candidate gene clusters for the biosynthesis of five important compounds. Notably, one of these clusters was detected in a genome from a different subphylum of Proteobacteria than that in which the molecule had first been identified. All in all, our approach paves the way towards high-throughput discovery of novel peptidic natural products. Pep2Path is freely available from http://pep2path.sourceforge.net/, implemented in Python, licensed under the GNU General Public License v3 and supported on MS Windows, Linux and Mac OS X.
Year: 2014 PMID: 25188327 PMCID: PMC4154637 DOI: 10.1371/journal.pcbi.1003822
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Outline of the NRP2Path matching process.
The input for NRP2Path consists of mass shift sequences (or amino acid search tags) and genome sequences. The genome sequences are processed into databases by makedb, using antiSMASH and NRPSPredictor2. When a database is queried with a mass shift sequence or amino acid search tag, Pep2Path scores every possible match between the search tag and every possible assembly line configuration of each NRPS BGC in the database.
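The idea of testing a search tag against every assembly line configuration can be illustrated with a minimal Python sketch. It is not Pep2Path's implementation: the Bayesian probabilistic score is replaced here by a plain count of identical residues, and the function name `best_tag_score` and the split of modules over two genes are hypothetical. The only assumption about the biology is that the order of NRPS genes in the assembly line is unknown, while the order of modules within a gene is fixed.

```python
from itertools import permutations

def best_tag_score(tag, genes):
    """Return the best identity count of `tag` over all gene orders.

    tag   -- list of residue names, e.g. ["orn", "orn", "ser", "orn"]
    genes -- list of genes, each a list of predicted residues per module
             (hypothetical stand-in for NRPSPredictor2 output)
    """
    best = 0
    for order in permutations(genes):
        # Concatenate module predictions in this candidate gene order.
        line = [residue for gene in order for residue in gene]
        # Slide the tag along the candidate assembly line.
        for start in range(len(line) - len(tag) + 1):
            window = line[start:start + len(tag)]
            score = sum(1 for t, p in zip(tag, window) if t == p)
            best = max(best, score)
    return best

# Hypothetical two-gene cluster; the tag is the amphibactin B tag.
genes = [["ser", "orn"], ["orn", "orn"]]
tag = ["orn", "orn", "ser", "orn"]
score = best_tag_score(tag, genes)
```

In this toy case only the second gene order reproduces the tag in full, which is why all orders have to be enumerated before a cluster can be ranked.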
Figure 2. Quality of NRP2Path predictions with varying sequence tag lengths and NRPSPredictor2 prediction qualities.
The heat map shows the average number of correct BGC predictions for Pep2Path searches with the stendomycin sequence tag V-V-T(S)-T(S)-A-I(L)-V-G across the Streptomyces hygroscopicus ATCC 53653 genome (20 NRPS BGCs) or across all Streptomyces nucleotide entries (342 NRPS BGCs). The searches were done for all possible search subtags of 2–8 amino acids in length, and for all combinations of 0–8 simulated mispredictions for the corresponding NRPS modules. A misprediction is simulated by having Pep2Path assign a score of zero to sequence tags matching the affected domain.
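The search grid described in this caption can be sketched as follows. This is an illustrative enumeration only, under the assumption that "subtags" means contiguous stretches of the tag; the helper names `parse_tag`, `subtags` and `misprediction_sets` are hypothetical and not part of Pep2Path.

```python
from itertools import combinations

STENDOMYCIN_TAG = "V-V-T(S)-T(S)-A-I(L)-V-G"

def parse_tag(tag):
    """Turn 'T(S)' style tokens into sets of allowed residues per position."""
    positions = []
    for token in tag.split("-"):
        residues = {token[0]}
        if "(" in token:
            # Residues in parentheses are alternatives for the same position.
            residues.update(token[token.index("(") + 1:token.index(")")])
        positions.append(frozenset(residues))
    return positions

def subtags(positions, min_len=2, max_len=8):
    """All contiguous subtags with lengths from min_len to max_len."""
    upper = min(max_len, len(positions))
    return [positions[i:i + k]
            for k in range(min_len, upper + 1)
            for i in range(len(positions) - k + 1)]

def misprediction_sets(n_modules, n_wrong):
    """All ways to pick which modules are treated as mispredicted."""
    return list(combinations(range(n_modules), n_wrong))

positions = parse_tag(STENDOMYCIN_TAG)
all_subtags = subtags(positions)
```

Each subtag would then be searched once for every set of module indices returned by `misprediction_sets`, with the chosen modules contributing a score of zero.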
Benchmark of Pep2Path on 18 recently discovered NRPS BGCs.
| Tag size (AA) | BGC search space size: 5 | BGC search space size: 10 | BGC search space size: 25 | BGC search space size: 50 | BGC search space size: 100 |
| 2 ( | 75% | 64% | 47% | 36% | 26% |
| 3 ( | 78% | 70% | 54% | 44% | 37% |
| 4 ( | 83% | 78% | 65% | 56% | 45% |
| 5 ( | 90% | 89% | 79% | 72% | 61% |
| 6 ( | 96% | 96% | 87% | 81% | 74% |
| 7 ( | 99% | 99% | 96% | 91% | 88% |
| 8 ( | 100% | 100% | 100% | 100% | 100% |
For each tag size, all possible search tags of that size in the test set of peptides () were used as queries. For each BGC search space size, 50 search spaces were generated from randomly selected BGCs from the same (sub)phylum that the NRP originates from. The resulting percentages represent the average number of cases in which the correct BGC ended up as the (shared) best hit across all possible sequence tags and across all possible search space permutations. Shared best hits were included because of the frequent presence of orthologous BGCs encoding the same molecule in related genomes. The n in the left column signifies the number of test peptides large enough to be included in the analysis for this tag size; from each of these test peptides, all possible subtags were used in cases where the length of the tag is shorter than the length of the peptide.
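The benchmark procedure in this note can be approximated with a short Python sketch. It assumes that each random search space contains the correct BGC plus randomly drawn decoys, which the note does not state explicitly; `score_fn`, the BGC identifiers and both function names are hypothetical placeholders rather than Pep2Path functions.

```python
import random

def is_shared_best_hit(scores, correct):
    """True if the correct BGC has the top score, ties included."""
    return scores[correct] == max(scores.values())

def success_rate(score_fn, correct, decoys, space_size, n_spaces=50, seed=0):
    """Fraction of random search spaces in which `correct` is a best hit.

    score_fn -- callable mapping a BGC identifier to a score for one tag
    decoys   -- identifiers of other BGCs from the same (sub)phylum
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_spaces):
        # Assumption: the correct BGC is always part of the search space.
        space = rng.sample(decoys, space_size - 1) + [correct]
        scores = {bgc: score_fn(bgc) for bgc in space}
        hits += is_shared_best_hit(scores, correct)
    return hits / n_spaces

# Toy scoring function: the target always outscores every decoy.
decoys = ["bgc_%d" % i for i in range(20)]
rate = success_rate(lambda b: 5 if b == "target" else 1, "target", decoys, 5)
```

Counting ties as successes mirrors the "(shared) best hit" criterion, which the note motivates by the presence of orthologous BGCs in related genomes.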
Novel matches of NORINE-derived NRPs to BGCs detected in genome sequences.
| Compound | Reference | Species (accession nr.) | Locus tags | NRP search tag from NORINE | NRPSPredictor2 prediction | Pep2Path score (rank) |
| trichotoxin | (Irmscher et al. 1978) | | TRIVIDRAFT_69940 | ala-gly-ala-leu-ala-glu-ala-ala-ala-ala-ala-ala-pro-leu-ala-xxx-gln-vol | nrp-nrp-ala-nrp-nrp-gln-nrp-ala-nrp-ser-leu-nrp-pro-nrp-ala-ala-gln-vol | 6.25 (1) |
| ferintoic acid | (Williams et al. 1996) | | MICAK_4000004-MICAK_4000007 | trp-co-lys-val-hty-ala-phe | phe-nrp-lys-val-nrp-ala | 5.24 (1) |
| plusbacin | (Shoji et al. 1992) | | YSA_0461-YSA_0481 | asp-pro-ser-asp-arg-pro-ala-allothr | asp-ser-ser-asp-nrp-nrp-nrp-thr | 4.91 (1) |
| amphibactin B | (Martinez et al. 2003) | | VT1337_12727-VT1337_12732 | orn-orn-ser-orn | orn-orn-ser-orn | 2.73 (1) |
| tripropeptin A | (Hashizume et al. 2001) | | CFU_2182-CFU_2185 | thr-pro-pro-arg-asp-ser-pro-asp | thr-pro-pro-orn-asp-ser-pro-asp | 8.94 (1) |
Candidate BGCs for trichotoxin, ferintoic acid, plusbacin and amphibactin B were discovered by searching within the taxonomic range of the species in which the molecules were found. The candidate BGC for tripropeptin A was discovered by searching the entire Pep2Path database.
Matching of mass sequence tags to RiPP gene clusters using RiPP2Path.
| Peptide | Search tag | Genome | Matches in genome |
| SSV-2083 | I(L)GA(C)GTA(C)WI(L)A(C)V | | 1 |
| SGR-1832 | AVAQ(K)FVI(L)Q(K)GSTI(L) | | 1 |
| SCO-2138 | VHFVGWI(L) | | 1 |
| SLI-2138 | GI(L)VHFVGWI(L) | | 1 |
| SWA-2138 | I(L)AGI(L)VHFI(L)GWI(L) | | 1 |
| SRO15-2005 | YWSRRI(L)I(L) | | 1 |
| SRO15-2212 | VVI(L)S(C)T | | 47 |
| SRO15-3108 | AS(C)ATVTI(L) | | 1 |
| SAL-2242 | VTI(L)S(C)T | | 39 |
Seven out of the nine search tags resulted in unique matches in their corresponding Streptomyces genomes.
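How an ambiguous mass sequence tag might be counted against a genome can be sketched with a regular expression. This is an illustration and not RiPP2Path's code: it assumes the genome is already available as a list of translated protein sequences, and the function names and the two example sequences are invented for the demonstration. Positions such as I(L) are treated as "either residue", which reflects the fact that leucine and isoleucine have identical masses.

```python
import re

def tag_to_regex(tag):
    """Convert 'VTI(L)S(C)T' into the pattern 'VT[IL][SC]T'."""
    return re.sub(r"([A-Z])\(([A-Z])\)", r"[\1\2]", tag)

def count_matches(tag, proteins):
    """Count tag occurrences, overlapping ones included, in all sequences."""
    # A lookahead makes findall report every start position of a match.
    pattern = re.compile("(?=" + tag_to_regex(tag) + ")")
    return sum(len(pattern.findall(seq)) for seq in proteins)

# Invented sequences, for illustration only.
proteins = ["MVTLCTAAVTISTK", "MKKK"]
n_hits = count_matches("VTI(L)S(C)T", proteins)
```

Short tags with several ambiguous positions match many more sequences by chance, which is consistent with the two five-residue tags in the table being the ones with 47 and 39 matches.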