Emma Ricart, Valérie Leclère, Areski Flissi, Markus Mueller, Maude Pupin, Frédérique Lisacek.
Abstract
Proteinogenic and non-proteinogenic amino acids, fatty acids and glycans are some of the main building blocks of nonribosomal peptides (NRPs) and as such may give insight into the origin, biosynthesis and bioactivities of their constitutive peptides. Hence, the structural representation of NRPs using monomers provides a biologically interesting skeleton of these secondary metabolites. Databases dedicated to NRPs, such as Norine, already integrate monomer-based annotations in order to facilitate the development of structural analysis tools. In this paper, we present rBAN (retro-biosynthetic analysis of nonribosomal peptides), a new computational tool designed to predict the monomeric graph of NRPs from their atomic structure in SMILES format. This prediction is achieved through the "in silico" fragmentation of a chemical structure and the matching of the resulting fragments against the monomers of Norine for identification. Structures containing monomers not yet recorded in Norine are processed in a "discovery mode" that uses the RESTful service from PubChem to search for the unidentified substructures and suggest new monomers. rBAN was integrated in a pipeline for the curation of Norine data, in which it was used to check the correspondence between the monomeric graphs annotated in Norine and the SMILES-predicted graphs. The process concluded with the validation of 97.26% of the records in Norine, a two-fold extension of its SMILES data and the introduction of 11 new monomers suggested in the discovery mode. The accuracy, robustness and high performance of rBAN were demonstrated by benchmarking it against other tools with the same functionality: Smiles2Monomers and GRAPE.
Keywords: Curation; Fragmentation; Monomer; Natural product; Peptide; Retro-biosynthesis; Structure analysis; Substructure search
Year: 2019 PMID: 30737579 PMCID: PMC6689883 DOI: 10.1186/s13321-019-0335-x
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Example of Vancomycin processing. A First, the primary bond mapping searches the molecule for the most common bonds between NRP monomers. This process results in the mapping of two pairs of adjacent bonds that cannot be targeted simultaneously, since cutting both would isolate some atoms. To avoid this, all the possible combinations that include only one of the neighboring bonds are computed. B Then, rBAN retrieves the substructures resulting from each combination and matches them against the monomer database. A coverage score is given to each combination based on the number of atoms that could be annotated. C In this case, none of the results has full coverage, so the algorithm proceeds to the secondary bond search on the structure with the highest score. D The breakage of a carbon-carbon bond results in the full mapping of the peptide
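The scoring step in panels B and C can be illustrated with a minimal sketch. It assumes the coverage score is the fraction of atoms that fall in matched fragments (the caption only says the score is based on the number of annotated atoms), and it replaces real structure matching with a plain identifier lookup. The function names, the monomer set and the fragment data are all hypothetical.

```python
def coverage(fragments, monomer_db):
    """Return the fraction of atoms lying in fragments found in monomer_db."""
    total_atoms = sum(atom_count for _, atom_count in fragments)
    annotated = sum(atom_count for key, atom_count in fragments if key in monomer_db)
    return annotated / total_atoms if total_atoms else 0.0


def best_combination(candidates, monomer_db):
    """Return the name and score of the combination with the highest coverage."""
    scores = {name: coverage(frags, monomer_db) for name, frags in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical data: each fragment is (identifier, number of atoms).
monomer_db = {"Ala", "Gly"}
candidates = {
    "combination_1": [("Ala", 6), ("unknown_a", 14)],
    "combination_2": [("Ala", 6), ("Gly", 5), ("unknown_b", 9)],
}

best_name, best_score = best_combination(candidates, monomer_db)
# Coverage below 1.0 means some atoms are still unannotated, which is the
# situation in which the caption describes moving on to secondary bonds.
needs_secondary_search = best_score < 1.0
```

In this toy input no combination reaches full coverage, so the highest-scoring one would be passed on to the secondary bond search.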
Fig. 2 Software architecture workflow. This flowchart describes the series of steps for processing structures with rBAN
Fig. 3 Adjacent bond breakage. Our fragmentation algorithm avoids atom isolation, which prevents some adjacent bonds from being cut simultaneously and requires the computation of further combinations
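A minimal sketch of how such combinations could be enumerated is given here. It assumes the conflicting bonds come in disjoint pairs and that exactly one bond of each pair is cut in every combination. The bond labels and the function name are hypothetical, and chains of more than two mutually adjacent bonds would need a more general treatment.

```python
from itertools import product


def cut_combinations(independent_bonds, adjacent_pairs):
    """Yield sets of bonds to cut, taking exactly one bond from each adjacent pair."""
    base = frozenset(independent_bonds)
    for choice in product(*adjacent_pairs):
        yield base | frozenset(choice)


# Hypothetical bond labels: b1 and b2 can always be cut, while the bonds
# within each pair are adjacent and must not be cut together.
independent = ["b1", "b2"]
pairs = [("b3", "b4"), ("b5", "b6")]

combos = list(cut_combinations(independent, pairs))
for combo in combos:
    print(sorted(combo))
```

With two pairs of adjacent bonds, as in the Vancomycin example, this produces four candidate fragmentations.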
Fig. 4 Identification of monomers containing inner bonds. Bonds inside a monomer are sometimes broken by the algorithm. To handle these cases, when a small region cannot be identified, rBAN repeats the matching process after removing the cut at the bond linked to the unidentified substructure (example with Theonellapeptolide Ie)
Fig. 5 Norine curation. a The curation involves two main steps: (1) Automatic verification and correction of the SMILES in Norine. rBAN validated 249 (97.26%) SMILES and identified seven potentially erroneous SMILES. Retrieving the PubChem SMILES for the non-validated entries enabled the correction of the SMILES of Motuporin (NOR00825). The manual inspection of the remaining entries confirmed six wrong SMILES. (2) Automatic addition of SMILES retrieved from PubChem. Of the 403 SMILES retrieved from PubChem, 242 were validated using rBAN. The 161 that were not validated are likely to be false positives due to the ambiguity of the PubChem searches performed. b Enniatin F belongs to the set of non-validated peptides. rBAN failed to validate this peptide due to differences between the molecular and monomeric annotations. The monomeric graph is circular and contains N-Methyl-Isoleucine, while the SMILES encodes a linear structure with dehydro-N-Methyl-Isoleucine (1). Additionally, rBAN could not identify what is supposed to be an N-Methyl-Leucine because it lacks a hydroxyl group (2)
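The counts in this caption can be cross-checked with a few lines of arithmetic. The sketch assumes that the 249 validated SMILES and the seven flagged ones together make up the whole set checked in step (1), which the caption implies but does not state. The variable names are illustrative.

```python
# Step (1): SMILES already present in Norine.
validated = 249
flagged = 7
checked = validated + flagged              # assumed total: 256
validated_share = 100 * validated / checked
print(f"{validated_share:.4f}% validated")

# Step (2): SMILES retrieved from PubChem.
retrieved = 403
validated_from_pubchem = 242
not_validated = retrieved - validated_from_pubchem
print(not_validated, "not validated")
```

Under that assumption the share works out to 97.265625%, in line with the 97.26% quoted, and the step (2) remainder matches the 161 non-validated SMILES.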
Monomers correctly suggested by rBAN
| Norine code | PubChem ID | IUPAC name | Structure | Compounds | Reason for the missing monomer | Refs. |
|---|---|---|---|---|---|---|
| NFo-Lys | 12679627 | 6-amino-2-formamidohexanoic acid | | NOR00261, NOR00262, NOR00263, NOR00264, NOR00266, NOR00267, NOR00269, NOR00270, NOR00271, NOR00272, NOR00274, NOR00275, NOR00276, NOR00277, NOR00278, NOR00580 | “CO” monomer in graphs | [ |
| D-3OMe-Ala | 97963 | 2-amino-3-methoxypropanoic acid | | NOR00422, NOR00423, NOR00424, NOR00425, NOR00588 | Wrong SMILES of D-3OMe-Ala monomer | [ |
| C5:1(4)-OH(2) | 172026 | 2-hydroxypent-4-enoic acid | | NOR00064, NOR00066, NOR00068, NOR00071, NOR00073 | Wrong monomer in graphs: C4:1(3)-OH(2) -> C5:1(4)-OH(2) | [ |
| N-Suc | 12522 | 4-amino-4-oxobutanoic acid | | NOR00160, NOR00166, NOR00903 | Missing monomer in graphs | [ |
| C5:0-OH(2)-Ep(4) | 54305979 | 2-hydroxy-3-(oxiran-2-yl)propanoic acid | | NOR00086, NOR00087 | Wrong monomer in graphs: C4:0-OH(2)-Ep(3) -> C5:0-OH(2)-Ep(4) | [ |
| Gen | 3469 | 2,5-dihydroxybenzoic acid | | NOR00489, NOR00598 | Wrong monomer in graphs: 2,3-diOH-Bz -> Gen | [ |
| C10:0-OH(2)-NH2(3) | 57484230 | 3-amino-2-hydroxydecanoic acid | | NOR01134, NOR01135 | Wrong monomer in graphs: Adda -> C10:0-OH(2)-NH2(3) | [ |
| iC6:0-OH(2.4) | 55300467 | 2,4-dihydroxy-4-methylpentanoic acid | | NOR00078, NOR00077 | Wrong monomer in graphs: iC5:0-OH(2.3) -> iC6:0-OH(2.4) | [ |
| Isovaleric_acid | 10430 | 3-methylbutanoic acid | | NOR00477 | Wrong monomer in graph: Hiv -> Isovaleric_acid | [ |
| D-Cl-Trp | 65259 | 2-amino-3-(6-chloro-1H-indol-3-yl)propanoic acid | | NOR00554 | Wrong SMILES of D-Cl-Trp monomer | [ |
Among the suggested monomers, N-Formyl-Lysine is the most abundant. rBAN treats CO as a formylation and therefore suggests a new formylated monomer instead of using the “CO” monomer currently present in Norine. A second new entity, present in five compounds, is D-3OMe-Ala. In this case the monomer name is correct but the SMILES associated with it is not. Most of the other suggestions stem from monomers wrongly annotated in the graph that should be replaced with a new substructure. There is also one case (N-Suc) where the monomer was missing from the graph altogether. All these corrections were manually evaluated to confirm their agreement with the literature.
Comparison of rBAN versus s2m
| | rBAN | Smiles2Monomers |
|---|---|---|
| a) Monomers mapping | Based on molecule fragmentation through common monomer linking bonds | Based on mapping of monomers and selection of best tiling |
| b) Light matching | Positions of double/triple bonds are ignored | Implicit hydrogens and bond order are ignored |
| c) Heterocycles treatment | Accounts for NRP cyclisation patterns initiating oxazole and thiazole formation | Does not include any rule/pattern for heterocycles |
| d) Presence of new monomers | Unmatched regions left unannotated and potentially identified in discovery mode | Matches the most similar monomers in a given database and leaves out uncovered atoms |
| e) Graph serialization | Labelled edges with bond type, directed according to the functional groups on each side | Unlabelled edges |
a) To map the monomers, rBAN fragments the molecule and matches the resulting fragments against the monomer database, whereas s2m computes the combinations of monomers that fit in the molecule. b) To enable tautomer identification during the matching process, rBAN ignores the positions of the double bonds in the monomer but still takes the bonds themselves into account, which makes it more restrictive than the analogous mode in s2m, where neither the implicit hydrogens nor the bond order is taken into account. c) Characteristic NRP structural patterns such as heterocycles are specifically targeted in rBAN but not in s2m. d) When a region cannot be matched because the monomer is absent from the database, rBAN leaves the whole region unannotated (with the option of resorting to the discovery mode), while s2m tries to match the most similar monomer even if this is a wrong match and leaves some atoms unannotated. e) The monomer graph from rBAN has labelled edges that specify the type of bond and its direction, whereas s2m does not provide bond labels.
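The difference described in point e) can be pictured with a small data-structure sketch. It is not the serialization format of either tool: the class names, the "AMIDE" label and the example peptide are hypothetical and only show what labelled, directed edges add over unlabelled ones.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class MonomerEdge:
    source: int      # index of the monomer the edge starts from
    target: int      # index of the monomer the edge points to
    bond_type: str   # hypothetical label, e.g. "AMIDE"


@dataclass
class MonomerGraph:
    monomers: List[str]
    edges: List[MonomerEdge] = field(default_factory=list)

    def unlabelled_edges(self):
        """Drop labels and direction, leaving only which monomers are linked."""
        return {frozenset((e.source, e.target)) for e in self.edges}


# Hypothetical linear tripeptide with two labelled, directed edges.
graph = MonomerGraph(
    monomers=["Ala", "Gly", "Leu"],
    edges=[MonomerEdge(0, 1, "AMIDE"), MonomerEdge(1, 2, "AMIDE")],
)

for edge in graph.edges:
    print(graph.monomers[edge.source], f"-[{edge.bond_type}]->", graph.monomers[edge.target])
print(graph.unlabelled_edges())
```

The unlabelled view keeps the connectivity but loses the bond type and the orientation of each link.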
Fig. 6 Benchmarking rBAN versus s2m. a Both tools were used to validate the SMILES data by comparing the Norine monomer graphs with the SMILES-based predicted graphs. rBAN validated more peptides than s2m, and four of the entries uniquely validated by s2m turned out to be false positives of that tool. The manual examination of the entries uniquely validated by rBAN revealed a better capacity of the tool to annotate large structures and peptides containing heterocycles and tautomers. b The global distribution of the correctness does not show substantial differences between the two tools, but it shows that rBAN not only has more correct peptides but also fewer peptides with correctness values close to zero. c The monomer database was extended with new chemical entities to evaluate the effect on the peptide mapping. The results of rBAN remained unchanged, demonstrating its robustness, while the extension of the monomer database affected the mapping in s2m. d The computational performance was evaluated with different numbers of input peptides. In all cases rBAN outperformed s2m, being between four and five times faster
Fig. 7 Benchmarking rBAN versus GRAPE. The coverage of the annotations given by each tool was compared. The distribution shows that rBAN fully annotated more peptides than GRAPE