Shuangjia Zheng, Tao Zeng, Chengtao Li, Binghong Chen, Connor W Coley, Yuedong Yang, Ruibo Wu.
Abstract
The complete biosynthetic pathways are unknown for most natural products (NPs), so computer-aided bio-retrosynthesis prediction is valuable. Here, a navigable and user-friendly toolkit, BioNavi-NP, is developed to predict the biosynthetic pathways for both NPs and NP-like compounds. First, a single-step bio-retrosynthesis prediction model is trained on both general organic and biosynthetic reactions using end-to-end transformer neural networks. Based on this model, plausible biosynthetic pathways can be efficiently sampled from iterative multi-step bio-retrosynthetic routes through an AND-OR tree-based planning algorithm. Extensive evaluations reveal that BioNavi-NP can identify biosynthetic pathways for 90.2% of 368 test compounds and recover the reported building blocks for 72.8%, which is 1.7 times more accurate than existing conventional rule-based approaches. The model is further shown to identify biologically plausible pathways for complex NPs collected from the recent literature. The toolkit, the curated datasets, and the learned models are freely available to facilitate the elucidation and reconstruction of biosynthetic pathways for NPs.
Year: 2022 PMID: 35688826 PMCID: PMC9187661 DOI: 10.1038/s41467-022-30970-9
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Fig. 1 The motivation and overview of BioNavi-NP.
a The vast number of natural products and the few biosynthetic pathways reported to date. Natural products were collected from DNP[1] and visualized by TMAP[73] (left). Biosynthetic reactions were collected from MetaCyc[5], KEGG[6] and MetaNetX[7], and the network was visualized by Cytoscape[74] (right). Structures are represented by nodes, and similar structures converge. The edges and arrows in the biosynthetic network represent structural transformations. Fatty acids and others from the AA/MA pathway are colored yellow. Terpenoids and steroids from the MVA/MEP pathway are colored blue. Flavonoids and others from the CA/SA pathway are colored red. Alkaloids and others from the AAs pathway are colored green. Others, such as nucleic acids and some hybrid-origin compounds, are colored black. b The protocol of BioNavi-NP for exploring the biosynthetic pathways of a target natural product. The transformer neural networks were trained on a combination of biosynthetic and organic reactions, and four models trained with different hyperparameters form the ensemble model that is used to make the single-step prediction (see details in Methods, Supplementary Figs. 1 and 2).
Performance of single-step models by different training strategies.
| Training strategy | Top-1 accuracy (%) | Top-3 accuracy (%) | Top-5 accuracy (%) | Top-10 accuracy (%) |
|---|---|---|---|---|
| USPTO_NPL | 0 | 0 | 0 | 0 |
| BioChem (w/o chirality) | 7.6 | 11.1 | 13.9 | 16.3 |
| BioChem | 10.6 | 20.1 | 24.5 | 27.8 |
| BioChem + USPTO_NPL | 17.2 | 30.2 | 41.9 | 48.2 |
| BioChem + USPTO_NPL (ensemble) | 21.7 | 42.1 | 52.4 | 60.6 |
| BioChem + USPTO_NPL (seq2seq) | 10.9 | 21.3 | 30.8 | 37.1 |
| RetroPathRL | 20.6 | 30.5 | 36.8 | 42.1 |
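The top-N accuracy reported in the table above can be illustrated with a minimal Python sketch. It assumes that each test target has one reference precursor string and that a prediction counts as a hit when the reference appears verbatim among the first N ranked candidates; the function name, the exact-match criterion, and the toy strings are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def top_n_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    n: int,
) -> float:
    """Percentage of targets whose reference is among the first n predictions.

    Assumption: predictions and references are already normalized
    (for example, canonicalized) so that exact string comparison is meaningful.
    """
    if len(ranked_predictions) != len(references):
        raise ValueError("one ranked prediction list is needed per reference")
    if n < 1:
        raise ValueError("n must be at least 1")
    if not references:
        return 0.0
    hits = sum(
        1
        for candidates, reference in zip(ranked_predictions, references)
        if reference in candidates[:n]
    )
    return 100.0 * hits / len(references)


# Hypothetical toy data: four targets, each with a ranked candidate list.
toy_predictions = [["A.B", "C"], ["D", "E.F"], ["G"], ["X", "Y", "H"]]
toy_references = ["A.B", "E.F", "Z", "H"]

for n in (1, 3):
    print(f"top-{n}: {top_n_accuracy(toy_predictions, toy_references, n):.1f}%")
```

Because a hit within the first N candidates is also a hit within any larger N, the values in each table row can only grow from left to right.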
Comparison of performance among different models for the test set.
| Methods | Success rate | Hit rate of building blocks | Hit rate of pathways | Longest length | Avg. solution (a) | Time (h) (b) |
|---|---|---|---|---|---|---|
| BioNavi-NP (MCTS) | 34.8% | 16.3% | 1.9% | 3 | 1.0 | 92 |
| RetroPathRL | 52.7% | 4.8% | 3.8% | 3 | 2.8 | 2 |
| BioNavi-NP | 90.2% | 56.0% | 24.7% | 6 | 4.9 | 18 |
| RetroPathRL_UDB | 10.8% | 5.1% | 4.1% | 3 | 2.8 | 3 |
| BioNavi-NP_UDB | 74.7% | 72.8% | 26.1% | 6 | 4.9 | 28 |
UDB: user-defined building blocks.
(a) Denotes the average number of pathways found. The MCTS algorithm supports only the top-1 result, whereas RetroPathRL outputs all pathways it can find. The output option for Retro* is set to top-5 (the default is top-10).
(b) Outputting the top-10 takes about four times as long as outputting the top-5; that is, the time consumed by BioNavi-NP when only the top-3 is requested is comparable to that of RetroPathRL, which returns close to 3 pathways on average.
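As a rough illustration of how a success rate and a building-block hit rate could be computed, the sketch below makes two assumptions that the text here does not spell out: a target counts as a success when at least one pathway is returned for it, and as a building-block hit when any returned pathway ends in exactly the reported set of building blocks. Both rates are taken over all test targets. The function names and toy data are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Each pathway is reduced to the set of building blocks it terminates in.
Pathways = Dict[str, List[FrozenSet[str]]]


def success_rate(results: Pathways) -> float:
    """Percentage of targets with at least one returned pathway (assumed criterion)."""
    if not results:
        return 0.0
    solved = sum(1 for pathways in results.values() if pathways)
    return 100.0 * solved / len(results)


def building_block_hit_rate(
    results: Pathways, reported: Dict[str, FrozenSet[str]]
) -> float:
    """Percentage of targets for which some pathway ends in the reported blocks.

    Assumption: a hit requires an exact match of the building-block set.
    """
    if not results:
        return 0.0
    hits = sum(
        1
        for target, pathways in results.items()
        if any(blocks == reported[target] for blocks in pathways)
    )
    return 100.0 * hits / len(results)


# Hypothetical toy data for four targets.
toy_results: Pathways = {
    "t1": [frozenset({"a", "b"})],
    "t2": [],
    "t3": [frozenset({"c"}), frozenset({"d"})],
    "t4": [frozenset({"e"})],
}
toy_reported = {
    "t1": frozenset({"a", "b"}),
    "t2": frozenset({"x"}),
    "t3": frozenset({"d"}),
    "t4": frozenset({"z"}),
}

print(success_rate(toy_results), building_block_hit_rate(toy_results, toy_reported))
```

Under these assumed definitions the hit rate can never exceed the success rate, which matches the ordering of the two columns in every row of the table.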
Fig. 2 The distribution and performance of the internal test set.
a The chemical space of the internal cases in each NP category and of the building blocks. The chemical space was clustered and visualized with TMAP[73] using structural molecular fingerprints[75]. The nodes represent structures; similar structures converge and are clustered on the same branch. A comparison of the chemical space of the training set, internal cases and external cases is provided in Supplementary Fig. 4. b The performance of BioNavi-NP within each NP category. Source data are provided as a Source Data file.
Fig. 3 The interface and output of the BioNavi-NP webserver.
a The input interface of the BioNavi-NP webserver. b The selected pathways of two examples (sterhirsutin J and glutarate) predicted by BioNavi-NP. The outputs are redrawn here for clarity (the raw output is provided in Supplementary Figs. 27 and 33); some candidate pathways are shown with their rank attached at the end, and the top-1 pathway is colored red. The known intermediates and reactions are highlighted in yellow. There is also an option to select which pathway is highlighted or shown. The cost of each reaction step reflects the confidence score (a smaller prediction cost means a higher reaction probability, see Methods), and the total cost of the pathway is used to rank the network by default.
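The ranking by total pathway cost can be sketched in a few lines of Python. The caption states only that a smaller cost means a higher reaction probability; the negative log-probability form used below is one common choice and is an assumption here, as are the function names and the toy probabilities.

```python
import math
from typing import Dict, List, Sequence


def step_cost(probability: float) -> float:
    """Cost of one reaction step; assumed here to be -log(probability)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(probability)


def pathway_cost(step_probabilities: Sequence[float]) -> float:
    """Total cost of a pathway as the sum of its step costs."""
    return sum(step_cost(p) for p in step_probabilities)


def rank_pathways(pathways: Dict[str, Sequence[float]]) -> List[str]:
    """Return pathway names ordered from lowest to highest total cost."""
    return sorted(pathways, key=lambda name: pathway_cost(pathways[name]))


# Hypothetical step probabilities for three candidate pathways.
toy_pathways = {
    "route_a": [0.9, 0.8],
    "route_b": [0.5],
    "route_c": [0.99, 0.95, 0.9],
}

for rank, name in enumerate(rank_pathways(toy_pathways), start=1):
    print(rank, name, round(pathway_cost(toy_pathways[name]), 3))
```

With this assumed cost, summing step costs is equivalent to multiplying step probabilities, so a longer pathway of confident steps can still rank above a shorter pathway with one uncertain step.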