Hosein Mohimani, Alexey Gurevich, Alla Mikheenko, Neha Garg, Louis-Felix Nothias, Akihiro Ninomiya, Kentaro Takada, Pieter C Dorrestein, Pavel A Pevzner.
Abstract
Peptidic natural products (PNPs) are widely used compounds that include many antibiotics and a variety of other bioactive peptides. Although recent breakthroughs in PNP discovery raised the challenge of developing new algorithms for their analysis, identification of PNPs via database search of tandem mass spectra remains an open problem. To address this problem, natural product researchers use dereplication strategies that identify known PNPs and lead to the discovery of new ones, even when the reference spectra are not present in existing spectral libraries. DEREPLICATOR is a new dereplication algorithm that enables high-throughput PNP identification and that is compatible with large-scale mass-spectrometry-based screening platforms for natural product discovery. After searching nearly one hundred million tandem mass spectra in the Global Natural Products Social (GNPS) molecular networking infrastructure, DEREPLICATOR identified an order of magnitude more PNPs (and their new variants) than any previous dereplication effort.
Year: 2016 PMID: 27820803 PMCID: PMC5409158 DOI: 10.1038/nchembio.2219
Source DB: PubMed Journal: Nat Chem Biol ISSN: 1552-4450 Impact factor: 15.040
Figure 1. DEREPLICATOR pipeline. The pipeline includes the following steps: (i) generating a decoy database of PNPs, (ii) constructing theoretical spectra for all PNPs in the database, (iii) generating and scoring peptide-spectrum matches (PSMs), (iv) computing p-values of PSMs and generating the set of statistically significant PSMs, (v) computing the false discovery rate, and (vi) enlarging the set of found PSMs through variable dereplication via spectral networks. Steps related to the target and decoy databases are shown in green and red boxes, respectively. Six peptides identified in the target database and two peptides identified in the decoy database are shown in green and red, respectively.
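The false discovery rate step can be illustrated with a minimal sketch. It assumes the common target-decoy estimate, in which the false discovery rate at a p-value threshold is the number of decoy identifications divided by the number of target identifications below that threshold. The exact formula used by DEREPLICATOR is not stated in this record, and the function name and the example p-values are hypothetical.

```python
def estimate_fdr(target_pvalues, decoy_pvalues, threshold):
    """Target-decoy FDR estimate: decoy hits / target hits below the threshold."""
    n_target = sum(1 for p in target_pvalues if p < threshold)
    n_decoy = sum(1 for p in decoy_pvalues if p < threshold)
    if n_target == 0:
        return 0.0  # no identifications, so nothing can be false
    return n_decoy / n_target


# Hypothetical p-values mirroring the six target and two decoy peptides
# drawn in the pipeline figure (assumed values, not taken from the paper).
target = [2e-26, 3e-22, 6e-19, 1e-16, 4e-14, 7e-12]
decoy = [5e-12, 8e-12]

fdr = estimate_fdr(target, decoy, threshold=1e-11)
print(f"{fdr:.2f}")  # 2 decoy hits / 6 target hits
```

Lowering the threshold in this toy example to 1e-13 removes both decoy hits, which is the trade-off between sensitivity and error rate that the threshold controls.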
Figure 2. Number of PSMs and peptides identified by DEREPLICATOR. For each x (shown as a p-value along the x-axis), the plots show the number of identified PSMs or peptides with p-values below x. (Top) Number of PSMs (a) and peptides (b) for the target AntiMarin and decoy databases in the search of Spectra4. 1787 PSMs and 180 unique PNPs with p-values below 10⁻¹³ were dereplicated via spectral networks. (Bottom) Number of PSMs (c) and peptides (d) for the target AntiMarin and decoy databases in the search of Spectra. All searches were performed with a precursor mass tolerance of 0.05 Da.
The 37 PNPs (in increasing order of p-value) identified by DEREPLICATOR in the search of Spectra4 against the AntiMarin database at a p-value threshold of 10⁻¹¹. The precursor mass tolerance was set to 0.05 Da. The “organism” column refers to the species present in one of the four GNPS datasets contributing to Spectra4 (if known). GNPS datasets MSV000078552 (Bacillus and Pseudomonas cultures), MSV000078557 (Chinese marine strains), MSV000078577 (S. roseosporus), and MSV000078607 (Cubist strains) are referred to as datasets 78552, 78557, 78577, and 78607, respectively. The genomes of the producer organisms are known for the first two datasets but are not available for the last two. B., P., and S. stand for Bacillus, Pseudomonas, and Streptomyces, respectively. The remaining columns specify the PNP from AntiMarin, its structure (cyclic or branch-cyclic), its category (peptide or lipopeptide), the p-value, the SPC score, the number of peaks in the spectrum, the number of generalized peptide bonds, the number of PNP variants identified through analysis of the spectral network, and information about the GNPS spectral library search, which includes the cosine value and the instrument type (if the PNP is present in the spectral library). The final column provides a reference to a paper that contains an image of a spectrum from the PNP (if available) and information from this paper about the species producing this PNP (if available). For tolaasins and massetolide (rows 2, 3, and 26), the spectra in the Spectra4 dataset and in the GNPS spectral library were collected with different instruments (LTQ-FTICR and qTof, respectively), so we did not report their cosines. LTQ-FTICR and hybrid FT are abbreviated as LTQ and hFT, respectively. All spectra in Spectra4 were collected on a ThermoFinnigan LTQ instrument with ESI ionization, a linear ion trap analyzer, CID activation, and an electron multiplier detector.
| # | organism | GNPS | PNP | str | category | p-value | SPC score | # peaks | # bonds | # var | library search (instrument) | producer/reference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 78552 | Bacitracin A | bcyc | peptide | 2.0E-26 | 25 | 100 | 11 | 1 | n/a | |
| 2 | | 78552 | Tolaasin I | bcyc | lipo | 3.4E-22 | 21 | 76 | 18 | 1 | qTof | |
| 3 | | 78552 | Tolaasin B | bcyc | lipo | 2.5E-21 | 22 | 149 | 18 | 1 | qTof | |
| 4 | | 78577 | Daptomycin | bcyc | lipo | 6.3E-19 | 25 | 125 | 13 | 1 | 0.55 (LTQ) | |
| 5 | | 78552 | Surfactin B | cyc | lipo | 1.8E-18 | 18 | 70 | 7 | 3 | 0.77 (LTQ) | |
| 6 | | 78557 | Surfactin variant | cyc | lipo | 5.6E-18 | 18 | 149 | 7 | 1 | 0.75 (LTQ) | -/[ |
| 7 | | 78552 | Tolaasin C | bcyc | lipo | 1.9E-17 | 15 | 155 | 19 | 1 | n/a | |
| 8 | | 78552 | Mycosubtilin III | cyc | lipo | 1.4E-16 | 14 | 75 | 8 | 1 | n/a | |
| 9 | | 78577 | Stenothricin IV | bcyc | lipo | 1.7E-16 | 24 | 90 | 9 | 4 | 0.53 (LTQ) | |
| 10 | | 78552 | Surfactin variant | cyc | lipo | 3.4E-16 | 19 | 70 | 9 | 3 | 0.77 (LTQ) | |
| 11 | | 78552 | Plipastatin variant | bcyc | lipo | 3.9E-16 | 24 | 115 | 10 | 1 | n/a | |
| 12 | | 78557 | Glumamycin | bcyc | lipo | 1.2E-15 | 25 | 90 | 12 | 2 | n/a | -/- |
| 13 | | 78552 | Surfactin A1 | cyc | lipo | 4.5E-15 | 15 | 70 | 7 | 1 | 0.77 (LTQ) | |
| 14 | | 78557 | Valinomycin | cyc | peptide | 6.3E-15 | 6 | 75 | 6 | 12 | 0.71 (hFT) | -/[ |
| 15 | | 78552 | Plipastatin variant | bcyc | lipo | 1.2E-14 | 26 | 115 | 10 | 1 | n/a | |
| 16 | | 78552 | Surfactin D | cyc | lipo | 2.3E-14 | 17 | 75 | 7 | 3 | n/a | |
| 17 | | 78552 | Surfactin variant | cyc | lipo | 2.7E-14 | 16 | 70 | 7 | 3 | 0.60 (LTQ) | |
| 18 | | 78577 | A21978 C2 | bcyc | lipo | 2.8E-14 | 24 | 140 | 13 | 2 | 0.51 (LTQ) | |
| 19 | | 78577 | Stenothricin I | bcyc | lipo | 3.0E-14 | 21 | 90 | 9 | 4 | 0.43 (LTQ) | |
| 20 | unknown | 78607 | Kurstakin 2 | bcyc | lipo | 4.2E-14 | 7 | 60 | 7 | 7 | n/a | -/[ |
| 21 | | 78577 | A21978 C3 | bcyc | lipo | 4.3E-14 | 18 | 120 | 13 | 2 | 0.51 (LTQ) | |
| 22 | | 78552 | Surfactin variant | cyc | lipo | 5.2E-14 | 16 | 70 | 7 | 1 | 0.77 (LTQ) | |
| 23 | | 78577 | Stenothricin III | bcyc | lipo | 5.2E-14 | 23 | 90 | 9 | 1 | 0.64 (LTQ) | |
| 24 | | 78577 | A21978 C1 | cyc | lipo | 5.7E-14 | 30 | 135 | 13 | 2 | 0.54 (LTQ) | |
| 25 | | 78552 | Surfactin variant | bcyc | lipo | 1.3E-13 | 14 | 65 | 7 | 1 | 0.77 (LTQ) | |
| 26 | | 78552 | Massetolide F | bcyc | lipo | 1.8E-13 | 14 | 90 | 9 | 1 | qTof | |
| 27 | | 78552 | Bacitracin B3 | bcyc | peptide | 3.5E-13 | 21 | 115 | 11 | 1 | n/a | |
| 28 | | 78552 | Surfactin variant | cyc | lipo | 3.9E-13 | 14 | 70 | 7 | 3 | n/a | |
| 29 | unknown | 78607 | Kurstakin 1 | bcyc | lipo | 8.7E-13 | 7 | 60 | 7 | 7 | n/a | -/[ |
| 30 | | 78552 | Kurstakin 4 | bcyc | lipo | 1.6E-12 | 7 | 108 | 7 | 5 | n/a | |
| 31 | | 78557 | Lichenysin G5a | cyc | lipo | 1.9E-12 | 16 | 120 | 7 | 3 | n/a | -/[ |
| 32 | | 78552 | Surfactin variant | cyc | lipo | 2.7E-12 | 15 | 75 | 7 | 1 | n/a | |
| 33 | | 78552 | Plipastatin B2 | bcyc | lipo | 3.1E-12 | 25 | 122 | 10 | 1 | 0.80 (LTQ) | |
| 34 | | 78577 | Stenothricin II | bcyc | lipo | 3.4E-12 | 22 | 90 | 9 | 4 | 0.40 (LTQ) | |
| 35 | | 78552 | Plipastatin variant | bcyc | lipo | 3.8E-12 | 26 | 115 | 10 | 1 | n/a | |
| 36 | | 78552 | Plipastatin A2 | bcyc | lipo | 5.8E-12 | 24 | 120 | 10 | 1 | 0.75 (LTQ) | |
| 37 | | 78552 | Plipastatin A1 | bcyc | lipo | 6.8E-12 | 23 | 115 | 10 | 1 | n/a | |
Figure 3. Number of peptides identified by DEREPLICATOR in Spectra dataset. The number of unique peptides identified from Fungal/Actinomycetales/Pseudomonas/Cyanobacteria spectral datasets, coming from Fungal/Actinomycetales/Pseudomonas/Cyanobacteria sources. Since B. subtilis was added to the extracts from the samples Spectra and Spectra, 42 and 22 peptides from Bacillus sources identified in Spectra and Spectra represent contaminants. Because the Bacillus growth medium is similar to that of Actinomycetes and Pseudomonas, samples from Actinomycetes and Pseudomonas often carry small Bacillus contaminations that originate from pre-autoclaving growth in the medium.
Figure 4. Spectral networks illustrating the results of the SILAC experiment. (a) Spectral network of surugamides from S. albus J1074 when the strain is labeled with ¹³C₆ isoleucine. A path connecting five green nodes reveals surugamide A (911.621 Da, observed at m/z 912.610) and four SILAC incorporations into isoleucine with characteristic 6 Da mass shifts (surugamide A has four isoleucines, which are observed as additions of 6 Da, 12 Da, 18 Da, and 24 Da to the precursor ion). Blue nodes reveal incorporations in surugamide B with three isoleucines (897.605 Da, observed at m/z 898.611), and purple nodes reveal incorporations in a previously unknown surugamide variant with two isoleucines (m/z 884.589). (b) Spectral network of surugamides from S. albus J1074 when the strain is labeled with ¹³C₆ lysine. Green and blue nodes reveal SILAC incorporations into a single lysine in surugamides A and B. Sizes of the nodes reflect relative abundance based on the total intensity of the ion that was fragmented. Width of the edges connecting the nodes reflects the similarity (cosine score) between the corresponding spectra. Since we used a stringent cosine threshold of 0.7, some related spectra are not connected by edges. (c) Structure of surugamide A.
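The 6 Da ladder described in panel (a) can be reproduced with a short calculation. This is a sketch under stated assumptions: the ions are taken to be singly charged, each labeled residue replaces six ¹²C atoms with ¹³C, and the function name is hypothetical. The base m/z values are the observed values quoted in the caption.

```python
C13_MINUS_C12 = 1.0033548  # mass difference between 13C and 12C, in Da


def silac_ladder(base_mz, n_labeled_residues, carbons_per_label=6, charge=1):
    """Expected m/z values after 0..n incorporations of a 13C-labeled residue."""
    shift = carbons_per_label * C13_MINUS_C12 / charge  # about 6.02 Da at charge 1
    return [round(base_mz + k * shift, 3) for k in range(n_labeled_residues + 1)]


# Surugamide A has four isoleucines, surugamide B has three (per the caption).
surugamide_a = silac_ladder(912.610, 4)
surugamide_b = silac_ladder(898.611, 3)

for name, ladder in (("surugamide A", surugamide_a), ("surugamide B", surugamide_b)):
    print(name, ladder)
```

The unlabeled ion plus four incorporations gives five m/z values for surugamide A, matching the five green nodes on the path in the network.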
Figure 5. Generating theoretical spectra and computing p-values of PSMs formed by PNPs with various architectures. (a) Generating the theoretical spectrum of a branch-cyclic peptide (only 12 out of 90 peaks in the theoretical spectrum are shown). Nodes and edges in the PNP graph are shown as circles and lines. Bridges are shown as red edges. The intensities of all peaks in the theoretical spectrum are the same, since prediction of intensities remains an open problem. (b) MS-DPR [50] explores a large set of peptides (enriched for high-scoring peptides) to accurately estimate p-values. Each such set is illustrated as a collection of seven peptides, each with a different shuffled sequence of amino acids. (c) Constructing a decoy database of PNPs by randomly rearranging amino acids while preserving the architecture of a PNP.
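Panels (a) and (c) can be sketched for the simplest architecture. The code below is a simplified illustration, not the DEREPLICATOR implementation: it assumes a purely cyclic peptide, for which the theoretical spectrum is taken to be the masses of all contiguous fragments of the ring, with equal (implicit) intensities. The branch-cyclic case in the figure requires cutting bridges in the PNP graph and is not covered. Function names and the example peptide are hypothetical; the residue masses are standard monoisotopic values for Gly, Ala, Val, and Leu.

```python
import random


def cyclic_theoretical_spectrum(residue_masses):
    """Masses of all contiguous fragments of a cyclic peptide: n * (n - 1) peaks."""
    n = len(residue_masses)
    doubled = list(residue_masses) + list(residue_masses)  # unroll the ring
    peaks = []
    for start in range(n):
        for length in range(1, n):  # the full ring itself is excluded
            peaks.append(round(sum(doubled[start:start + length]), 4))
    return sorted(peaks)


def make_decoy(residue_masses, rng):
    """Shuffle residues; the cyclic architecture and total mass are unchanged."""
    decoy = list(residue_masses)
    rng.shuffle(decoy)
    return decoy


# Hypothetical cyclic tetrapeptide: Gly, Ala, Val, Leu (monoisotopic residue masses).
peptide = [57.02146, 71.03711, 99.06841, 113.08406]
spectrum = cyclic_theoretical_spectrum(peptide)

rng = random.Random(0)  # fixed seed so the decoy is reproducible
decoy = make_decoy(peptide, rng)
decoy_spectrum = cyclic_theoretical_spectrum(decoy)

print(len(spectrum), len(decoy_spectrum))
```

Because the decoy keeps the same residues and the same ring, it has the same precursor mass and the same number of theoretical peaks as the target, so target and decoy compete on equal terms during scoring.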