Natural product drug discovery in the artificial intelligence era.

F I Saldívar-González¹, V D Aldas-Bulos², J L Medina-Franco¹, F Plisson³.

Abstract

Natural products (NPs) are primarily recognized as privileged structures to interact with protein drug targets. Their unique characteristics and structural diversity continue to marvel scientists for developing NP-inspired medicines, even though the pharmaceutical industry has largely given up. High-performance computer hardware, extensive storage, accessible software and affordable online education have democratized the use of artificial intelligence (AI) in many sectors and research areas. The last decades have introduced natural language processing and machine learning algorithms, two subfields of AI, to tackle NP drug discovery challenges and open up opportunities. In this article, we review and discuss the rational applications of AI approaches developed to assist in discovering bioactive NPs and capturing the molecular "patterns" of these privileged structures for combinatorial design or target selectivity. This journal is © The Royal Society of Chemistry.

Entities: Chemical

Year: 2021 PMID: 35282622 PMCID: PMC8827052 DOI: 10.1039/d1sc04471k

Source DB: PubMed Journal: Chem Sci ISSN: 2041-6520 Impact factor: 9.825

Introduction

Artificial intelligence (AI) refers to the abilities demonstrated by computer machines (and human judgement) to ingest, process and recognise large and complex information patterns. AI has moved from theoretical studies to real-world applications, thanks to the revolution in high-performance computer hardware, extensive storage and accessible software. Machine learning (ML) is a subfield of AI that encompasses the mathematical formulas and advanced statistics that humans apply through algorithms to treat such problems. ML algorithms can be executed at very large scales in the cloud, easily and at affordable cost. The digitization of data types (imaging, textual information, soundwaves, biometrics) from sensors or wearables into online public and proprietary databases has inundated the Internet, a phenomenon often referred to as the “data deluge”.[1] Those databases and scattered online information have been crucial for building practical predictive applications such as recommendation systems. Open-source toolkits, massive online courses, and educational videos on social media platforms have democratized the use of AI applications in many sectors, including finance, law, cybersecurity, transportation, manufacturing, entertainment, robotics, education, health, and services.[2]

Machine learning algorithms have steadily gained traction within the pharmaceutical industry, where numerous supervised and unsupervised learning approaches are being applied to the different stages of the drug discovery pipeline. For example, clustering methods have segmented cell type imaging, predicted protein target druggability, and supported de novo molecular design. Supervised learning techniques, i.e., regressions and classifications, identified possible targets for Huntington's disease.
These techniques have also predicted biological activities and absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties for drug design, among many other applications.[3] Lastly, generative algorithms are now supporting the molecular design of new chemical entities in medicinal chemistry.[4-6] In 2019, the American company Insilico Medicine developed an AI system named GENTRL (for Generative Tensorial Reinforcement Learning) that successfully designed six inhibitors of discoidin domain receptor 1, a kinase linked to lung fibrosis, in just 46 days.[7]

Natural products (NPs) are primarily recognized as privileged structures to interact with therapeutically relevant protein targets. Their structural diversity and biological activities still inspire the development of small molecules[8] and macrocyclic drugs.[9] NPs dominated the sources of novel human therapeutics in the pharmaceutical drug pipeline in the mid-1970s. Between the 1980s and the 2010s, two-thirds of the drugs were unaltered NPs (5%), NP analogues (28%), or compounds containing NP pharmacophores (35%).[10] Despite being a proven source for modern small-molecule drug discovery, natural product research has declined at most major pharmaceutical companies. The main arguments are the time-consuming dereplication process, complex syntheses and extracts that are unfriendly to high-throughput screening.[11,12] Moreover, many NPs present ADME and physicochemical properties, e.g., high degrees of stereochemistry, fused ring systems or rotatable bonds, that are beyond the current drug-like chemical space.[13,14] The uniqueness of NPs continues to fascinate laboratory and computer scientists alike. Not surprisingly, researchers have developed and adopted several computational methods throughout the drug pipeline: (1) to assist in the discovery and structural elucidation of bioactive NPs[15,16] and (2) to capture the molecular patterns of these privileged structures for combinatorial design or target selectivity (Fig.
1).[17-19] Over the years, chemoinformatics, bioinformatics, and other informatics-related disciplines have contributed substantially to NP-based drug discovery. Their successful applications and limitations have recently been reviewed.[20-23] Computational strategies involving artificial intelligence and machine learning algorithms have slowly made their way into natural product research. In the early 2000s, for example, AI applications mostly included the digitization of organic molecules and dimensionality reduction techniques (e.g., principal component analysis, self-organizing maps) to map the NP chemical space. The following decade led to the development of ML binary classifiers to predict the biological functions of NPs. Recently, scientists have started to implement neural network architectures for genome mining and molecular design. This perspective article discusses the recent contributions of AI and ML algorithms to the discovery of bioactive NPs and the design of NP-inspired drugs, as well as their future development.

Fig. 1

Overview of ML/AI algorithms that are implemented across the different stages of the natural product drug discovery pipeline. The pipeline presents two sections: (1) computer-assisted discovery of NPs (data-mining into traditional medicines and peer-reviewed articles, genome mining & structural elucidation and dereplication) and (2) machine learning algorithms applied to NPs (encoding into molecular representations, molecular descriptors, likeness scores, chemical space, predicting biological functions, de-orphanizing and generating de novo NP-inspired compounds).

Computer-assisted discovery of natural products

Data-mining into traditional medicines and peer-reviewed articles

Scientific knowledge has long been documented in codices, dissertations, publications, patents, reports and laboratory notebooks. With an estimated 10 000 chemistry-related articles published every year, the volume of chemical information exceeds what humans can read, and many findings remain hidden. Machine-readable contents are critically needed. The recent transition from printed hard copies to digitized documents, and from restricted geographical locations to the World Wide Web, has kick-started data-mining technologies. In the field of chemistry alone, many text-mining approaches monitor the 20 000 new compounds published in medicinal and biological chemistry journals every year.[24] In 2020, Rajan and co-workers developed DECIMER, an Optical Chemical Entity Recognition software system that uses deep learning (DL) to recognize chemical structures in journal articles.[25] Deep learning refers to a subfield of machine learning using neural network architectures with three or more layers. The year before, Tshitoyan and co-workers reported the discovery of “forgotten” thermoelectric materials from peer-reviewed articles published between 1922 and 2018. The authors first curated a corpus of 1.5 million abstracts from which they established semantic relationships using Word2vec, a technique for natural language processing (NLP). They built an ML model with word embeddings (vector representations of words) to predict the thermoelectric property for 1820 known and 7663 candidate materials.[26] NLP is a branch of AI focused on the interactions between computers and human languages. NLP methodologies extract, categorize and analyse words or sentences to gain insights (e.g., knowledge graphs[27]) from unstructured documents.
Beyond textual information in natural languages (e.g., English), NLP algorithms must process the many molecular representations associated with biomedical research, a domain referred to as BioNLP.[28,29] To date, subfields of drug discovery such as protein docking,[30,31] protein–protein interactions[32,33] and protein-disease associations[34] have also benefitted from biomedical text mining. In contrast, NLP has shown limited applications to the discovery of bioactive NPs. So far, BioNLP methodologies have predominantly deciphered ancient texts from disappearing traditional medicines to identify bioactive plants.

Traditional medicine has stimulated the search for bioactive NPs from various sources and novel drugs throughout history. Early evidence of the medicinal use of plants was found on clay tablets written in cuneiform.[35] Traditional Chinese medicine (TCM) has gained attention worldwide due to its role in discovering treatments for malaria[36] and rheumatoid arthritis.[37] In 2014, May and co-workers developed an algorithm that could screen, extract, select, classify and score information from ancient texts.[38] Equipped with that technology, the authors monitored changes in the terminology used for specific diseases over time.
They identified common treatments and eased the discovery of NPs in both Zhong Hua Yi Dian (ZHYD; Encyclopaedia of Traditional Chinese Medicine), the most comprehensive encyclopedia of TCM books, and Zhong Yi Fang Ji Da Ci Dian (Great Compendium of Chinese Medical Formulae), the largest compendium of herbal formulas.[38] In 2015, Shergis and co-workers conducted a text-mining search within TCM codices for natural products as potential treatments for chronic cough, a symptom of cancer, obstructive pulmonary disease, tuberculosis, and asthma.[39] Employing the keywords jiǔ hāi, jiǔ sòu, and jiǔ késòu for chronic/prolonged cough and keeping their terminological authenticity, the authors identified 331 compounds in the ZHYD, including 250 from herbal sources, 47 from animals and 34 from minerals.

Despite their originality, these semantic approaches carry many flaws and remain rudimentary. One of the main drawbacks of mining ancient texts is the changing paradigm and concept of medicine over the centuries. For example, the diagnosis and treatment systems in TCM originated from ancient philosophies such as the Qi theory (or the yin-yang theory) and the five elements theory. TCM practitioners define a syndrome as a set of patient symptoms that may result from different diseases and be caused by different mechanisms, which hampers the identification of treatments.[40]

Like traditional medicine, ethnobotanical explorations have contributed to discovering countless NPs.[41] In 2014, Sharma and co-workers explored the Palau and Pohnpei Primary Health Care Manuals, traditional botanical accounts from Micronesian islands, aiming to pinpoint medicinal plants and their therapeutic usages.[35] The authors first digitized the manuals with the Biodiversity Heritage Library to establish individual plants and therapeutic annotations. The extracted information was cross-referenced with contemporary biomedical terminology using MetaMap[42] (https://metamap.nlm.nih.gov/).
This BioNLP tool employs computational linguistic techniques to identify equivalent terms in the Unified Medical Language System. The team discovered 129 unique plant species and over 700 treatment indications from the Primary Health Care Manuals. Of these plant species, 72 were also reported in the Biodiversity Heritage Library; ten were associated with comparable symptoms (e.g., diarrhea, pain, rash) resulting from venomous stings or pathogen infections.
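The word-embedding idea described above for Word2vec can be illustrated with a minimal sketch: once words are mapped to vectors, candidate terms are ranked by their cosine similarity to a query term. The three-dimensional vectors below are invented for illustration only and do not come from a trained model; the function names are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query, embeddings):
    """Rank every word except the query by similarity to the query embedding."""
    query_vec = embeddings[query]
    scores = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings (invented numbers, NOT the output of a trained model).
toy_embeddings = {
    "thermoelectric": [0.9, 0.1, 0.0],
    "Bi2Te3": [0.8, 0.2, 0.1],
    "PbTe": [0.7, 0.3, 0.0],
    "aspirin": [0.0, 0.1, 0.9],
}

ranking = rank_candidates("thermoelectric", toy_embeddings)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

In a real setting the vectors would be learned from a large corpus, and the same ranking step could in principle be applied to plant names and therapeutic indications extracted from digitized texts.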

Predicting chemical structures from microbial genomes

Rapid fractionations, hyphenated chromatography techniques, and bioassay screenings of natural sources such as plants, marine invertebrates or microbes have traditionally guided the discovery of bioactive NPs.[43] The recent advances in genome sequencing have revealed the biosynthetic logic and genetic basis behind NPs of microbial origin and beyond.[12,44,45] Enzymatic complexes such as polyketide synthases (PKSs)[46] and nonribosomal peptide synthetases (NRPSs),[47] or the ribosomally synthesized and post-translationally modified peptides (RiPPs)[48] are behind the production of these secondary/specialized metabolites. Microbial genomes encode these multi-domain pieces of machinery as biosynthetic gene clusters (BGCs). Over the last decades, considerable efforts in bioinformatics, grouped under the umbrella term “genome mining” and recently reviewed,[49-52] have enabled the discovery of cryptic BGCs within microbial genomes and the experimental characterization of novel NPs. ML algorithms and pattern-recognition approaches have contributed to genome-mining tools in two areas: (1) scrutinizing novel BGCs and (2) predicting chemical structures.

Biosynthetic gene clusters are traditionally discovered through a rule-based selection process, except for novel RiPPs. These peptides are typically identified from a limited set of known RiPP tailoring enzymes and precursor peptides (PPs). In 2017, Tietz and co-workers developed a multi-layer predictor enabling the discovery of lasso peptides.[53] The same year, Mohanty's group created RiPPMiner, a multi-label Support Vector Machine (SVM) classifier that distinguishes between a dozen PP classes.[54] Both approaches limited their training to specific RiPP classes.
In 2019, de los Santos complemented the ML-guided discovery of novel RiPPs with NeuRiPP, whose neural network architectures could identify known PPs and new PP-like sequences.[55] The following year, three additional methods advanced the automated discovery of new ribosomal peptides. First, Merwin and co-workers created DeepRiPP, a tripartite pipeline. Its first module, NLPPrecursor, employed natural language processing to capture a wider diversity of RiPPs independently of genomic context. The other components, Basic Alignment of Ribosomal Encoded Products Locally (BARLEY) and Computational Library for Analysis of Mass Spectra (CLAMS), linked biosynthetic loci to candidate RiPPs within a database of thousands of microbial extracts using genomic and metabolomic information.[56] The authors applied DeepRiPP to analyze 65 421 sequenced bacterial genomes and identify 19 498 unique unknown RiPPs. Later, Kloosterman and co-workers reported two new bioinformatics tools to support the discovery of novel RiPPs: DecRiPPter (Data-driven Exploratory Class-independent RiPP TrackER)[57] and RRE-finder (RRE stands for RiPP recognition element).[58] The former applied an SVM with a custom database of RiPP-specific BGCs and PPs to prioritize genomic regions. The latter focused on finding RREs, BGC elements that participate in RiPP biosynthesis and are encoded as discrete proteins or fused protein domains. Unlike DecRiPPter, RRE-finder capitalized on sequence similarity and protein homology; the tool either detected known RiPP classes using 35 custom profile Hidden Markov Models (pHMMs) (precision mode) or predicted novel RiPP classes with a modified version of the HHpred pipeline[59] (exploratory mode). Both tools confidently identified novel RiPPs; for instance, DecRiPPter discovered 42 new RiPP families, including the novel subclass V of lanthipeptides, from 1295 Streptomyces genomes.

Finding novel chemical entities early on is critical to the drug discovery process.
It could alleviate the costs and the experimental time associated with the dereplication of natural crude extracts (NCEs). Chemical novelty is also inherent to the intellectual property (i.e., patents) of pharmaceutical and biotechnology companies competing to develop new drugs in similar target/disease landscapes.[60,61] In 2019, Hannigan and co-workers created DeepBGC, a deep learning framework that used recurrent neural networks (RNNs) to identify novel BGC classes, followed by random forest (RF) classifiers to predict their biological activities (i.e., antibacterial, cytotoxic, inhibitor or antifungal). DeepBGC adequately identified many NP classes from predicted BGCs but predicted their biological activities poorly, owing to the lack of training examples.[62] The following year, Skinnider and co-workers[63] presented the fourth version of PRISM (PRediction Informatics for Secondary Metabolomes, available at http://prism.adapsyn.com), which could predict the chemical structures for 16 classes of NPs from bacterial BGCs, including aminoglycosides, nucleosides, β-lactams, alkaloids, and lincosamides. PRISM4 employed 1772 HMMs and 618 tailoring reactions, reaching a high degree of chemical similarity between predicted structures and the authentic products of known BGCs. With cryptic BGCs, the tool predicted structural features of known NPs. The authors further assembled a library of 1281 BGCs and trained moderately accurate SVM classifiers to predict the probability that a BGC displays one or more biological activities (i.e., antibacterial, antifungal, antiviral, antitumor, or immunomodulatory activity). The same year, Agrawal and Mohanty developed two RF classifiers that predicted macrocyclization patterns for PKs and NRPs.[64] The first model predicted whether or not a linear precursor would adopt a macrocyclic structure, based on a training dataset of 196 empirically known macrocyclic PK/NRP compounds and 162 linear chemical entities.
The second classifier identified the correct macrocyclic structure of a PK/NRP compound given its linear precursor, using the 196 macrocyclic compounds previously mentioned and all their theoretically possible macrocyclic structures. Finally, in 2021, Walker and Clardy revisited ML classifiers to predict antibacterial or antifungal activity based on BGC-derived features. Alongside the development of moderately accurate classifiers (i.e., 57–79% accuracy), which outperformed DeepBGC, the authors uncovered activity-associated BGC domains.[65]

Besides serving as a source of novel chemical entities, microbes have flourished as biofactories for the production of exogenous metabolites, peptides and proteins through recombinant DNA technology.[66] As the connections between microbial BGCs and NPs grow, we can foresee that the future steps in metabolite engineering would involve directly editing the genetic information of biosynthetic powerhouses like Streptomyces[67] to produce, in a more sustainable manner, novel and complex NPs that are hard to synthesize.

Automating natural product dereplication process

Early-stage discovery of NPs from organisms of all kingdoms is characterized by the repetitive extractions, subsequent chromatography/spectrometry-guided fractionations and purifications leading to single metabolites (or mixtures thereof) – the process is known as dereplication. One or more biological assays often guide the process to screen and prioritize the extracts, fractions and isolates containing the bioactive substances. Dereplication is lengthy, tedious and might face problems that would hinder the expected return on investments (time, equipment, human resources), such as discovering already known NPs, purifying supposedly novel structures in insufficient amounts, or screening natural crude extracts (NCEs) with high-throughput robotics.[11,12,43] Scientists have strategized different approaches to reduce redundant NCEs with the early chemical profiling of (un-)targeted NPs.[68,69] They have notably prioritized NCEs using state-of-the-art analytical chemistry techniques, i.e., gas/liquid chromatography (GC/LC), nuclear magnetic resonance (NMR) spectroscopy, mass spectrometry (MS), and combinations thereof. The increasing data digitization has enabled the implementation of mathematical and statistical methods. The field of chemometrics has leveraged the multivariate statistical analysis of data from the aforementioned studies and from the optical radiation (i.e., infrared, visible and ultraviolet) for the rapid identification of known and unknown bioactive NPs from NCEs. In 2019, Cornejo-Baez and co-workers documented the most common statistical techniques used to study NPs.[70] The list included both unsupervised (i.e., hierarchical cluster analysis, principal component analysis and discriminant analysis) and supervised ML algorithms (i.e., partial least squares, orthogonal projection to latent structures). Beyond NCEs, scientists have capitalized on ML algorithms to extract information from metabolomic data and generate new biological insights. 
In particular, supervised ML algorithms such as random forest, support vector machine (SVM), artificial neural network, and genetic algorithms have shown great potential in metabolomics research due to their ability to provide quantitative predictions.[71] The implementation of these algorithms has facilitated analytical data processing, integrated omics data, and stimulated biological applications. For example, ML algorithms are used to integrate chromatogram peaks,[72] predict retention time,[73-75] or impute missing data.[76] With the growing volume of MS data, several metabolomics platforms have arisen, such as the software MetaboAnalyst 5.0 (https://www.metaboanalyst.ca/).[77] In 2016, Wang and co-workers presented the Global Natural Products Social Molecular Networking (GNPS, http://gnps.ucsd.edu).[78] The platform organizes vast tandem MS datasets into visual molecular networks. Molecular networking (MN) uses nodes to display the high-resolution spectra, and edges to characterize the spectrum-to-spectrum alignments. The initiative is gaining traction in NP dereplication[79] as well as in other applications related to the study of NPs.[80] Overlapping or neighbouring nodes are synonymous with NCE replicates or NCEs with shared fragmentation ions. However, in the absence of reference spectra in molecular databases, tandem MS datasets cannot be aligned and molecules cannot be identified. Alternatively, scientists have developed tools such as CSI:FingerID,[81] MS2LDA[82] and SIRIUS 4 (ref. [83]) for small molecules, and VarQuest[84] for peptides, that couple tandem MS spectra to specialised molecular databases in order to identify NP substructures. Of the aforementioned bioinformatic tools, three utilised ML algorithms to match fragmentation ions with molecular substructures.
First, CSI:FingerID[81] (www.csi-fingerid.org) computed fragmentation trees from MS spectra and applied ML algorithms (multiple kernel learning, SVM) to predict the presence or absence of 1415 molecular fingerprints in unknown compounds. Each molecular fingerprint is scored and ranked using Platt probabilities. As a result, CSI:FingerID matched NPs and NP substructures from a molecular structure database such as PubChem to the submitted spectra or fragmentation trees from either Agilent (N = 2055) or GNPS (N = 3868). The platform SIRIUS 4 was derived from CSI:FingerID.[83] The discovery platform MS2LDA[82] (http://ms2lda.org/) implemented latent Dirichlet allocation (“-LDA”), an unsupervised method originally used for text mining, to decompose tandem MS data (“MS2-”) into sets of co-occurring fragments or neutral losses (called Mass2Motifs). Those motifs were matched to a set of biochemical features (i.e., amino acids, nucleotides, conjugated acids, polyamines, carbohydrates) to deduce molecular (sub)structures.

Tandem MS data alone remain insufficient to fully elucidate chemical structures. The last half-century has seen the elaboration of computer-assisted structural elucidation (CASE) expert systems. Two recent reviews provide a comprehensive and historical overview of these systems.[85,86] CASE expert systems support the identification of an unknown chemical compound by matching its spectral properties to those of a list of potential candidates. Such systems were primarily based on one-dimensional (1D) and two-dimensional (2D) NMR spectra to elucidate the structures of NPs and organic compounds. In 2020, Reher and co-workers[87] reported the first ML-driven tool called ‘Small Molecule Accurate Recognition Technology’ or SMART 2.0, for the rapid characterization of NPs from NMR spectra of NCEs.
At the core of SMART 2.0, the team trained a convolutional neural network on a set of 53 076 2D-NMR spectra (i.e., HSQC – Heteronuclear Single Quantum Coherence spectroscopy) from NPs of the JEOL database and ACD Labs Predictor reduced to a 180-dimensional embedding space. The authors validated their application by discovering and fully characterizing a novel cytotoxic swinholide NP named symplocolide A from the NMR-based SMART mixture analysis of the filamentous marine cyanobacterium Symploca sp. Besides NMR experiments, several non-spectroscopic techniques, i.e., atomic force microscopy,[88] “crystalline sponge” X-ray analysis,[89] and micro-electron diffraction[90] are starting to provide structural insights in combination with CASE expert systems.
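The molecular networking principle described earlier in this section (spectra as nodes, spectrum-to-spectrum alignments as edges) can be sketched in a few lines. The sketch below is a simplification under stated assumptions: it uses a plain cosine score on binned peaks rather than the scoring actually implemented in GNPS, and the bin width, threshold, spectrum names and peak lists are all hypothetical.

```python
import math
from itertools import combinations


def bin_spectrum(peaks, bin_width=0.01):
    """Map (m/z, intensity) peaks to integer bins, summing intensities per bin."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def spectral_cosine(peaks_a, peaks_b, bin_width=0.01):
    """Plain cosine similarity between two binned tandem MS spectra."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_network(spectra, threshold=0.7):
    """Return edges (name_a, name_b, score) for spectrum pairs above the threshold."""
    edges = []
    for (name_a, peaks_a), (name_b, peaks_b) in combinations(spectra.items(), 2):
        score = spectral_cosine(peaks_a, peaks_b)
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Hypothetical fragment spectra as (m/z, intensity) pairs; values are invented.
toy_spectra = {
    "A": [(85.03, 100.0), (127.04, 50.0), (145.05, 20.0)],
    "B": [(85.03, 90.0), (127.04, 60.0), (163.06, 10.0)],
    "C": [(70.07, 100.0), (116.07, 80.0)],
}

edges = build_network(toy_spectra)
for name_a, name_b, score in edges:
    print(f"{name_a} -- {name_b}: {score:.2f}")
```

Here spectra A and B share two fragment ions and become connected, while C stays isolated; in a dereplication setting, such an isolated node with no match to reference spectra is the case where identification stalls.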

Machine learning applied to natural products

Encoding natural products into molecular representations

Modelling and predicting the properties and bioactivities of NPs (or of any chemical structure) first requires their translation into computer-readable format(s), the so-called molecular representations (Fig. 2). Most representations encode chemical information for a specific use. Generic and IUPAC names retrieve chemical compounds that share nomenclatures. Matching chemical structures based on their bidimensional molecular graph depictions was computationally demanding. Early molecular representations were designed for lightweight storage of chemical information or efficient structural search. The simplified molecular-input line-entry system (SMILES),[91] SMILES arbitrary target specification (SMARTS, Daylight CIS and OpenEye Scientific Software) and international chemical identifier (InChI)[92] were created to store and retrieve molecular information as well as to identify shared molecular features or substructures from databases. Novel molecular representations like DeepSMILES[93] and SELFIES[94] recently arose for their practical use in ML algorithms.
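As an illustration of how a line notation becomes input for an ML algorithm, the following minimal sketch splits a SMILES string into tokens. The regular expression is a simplified assumption that covers bracket atoms, the common organic-subset atoms, bonds, branches and ring closures; it is not a complete implementation of the SMILES grammar, and the function name is hypothetical.

```python
import re

# Simplified token pattern (an assumption, not the full SMILES grammar):
# bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# two-digit ring closures, single-digit ring closures, bonds and branches.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:.~@*$]"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail loudly on unrecognised characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
print(tokenize_smiles(caffeine))
print(tokenize_smiles("C[C@@H](N)C(=O)O"))  # bracket atom kept as one token
```

Keeping multi-character atoms such as `Cl`, `Br` and bracket atoms as single tokens is the main design point; a character-by-character split would break them apart before they reach a model.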

Fig. 2

Molecular representations frequently used in NPs.

Chemical and biomolecular databases are central to many AI applications and are commonly found across informatics-related disciplines.[95] Chemical databases[96] have played an important role in improving the dereplication process of NPs using preassembled NP libraries and chemical fingerprinting (spectroscopic/chromatographic data, calculated physical properties). In the mid-1990s, public and private research entities initiated several commercial databases from curated literature reviews, including the generalist Chemical Abstracts Service by the American Chemical Society and the highly specialised MarinLit, created by the University of Canterbury, New Zealand, and now under the British Royal Society of Chemistry. However, the elevated costs and limited access to commercial databases have pushed numerous academic researchers to consider free and open-access options like ChemSpider. A few years ago, the number of NP databases in the public domain was very limited. But the renewed interest in NP research, together with rapid advances in informatics and data sharing, boosted the generation and publication of NP collections in the public domain. In 2020, Sorokina and Steinbeck reviewed 123 open-access and commercial databases, industrial catalogues, books and collections for NP information that were still cited in the scientific literature after 2000.[97] Many NP databases, commercial and open-access alike, are sporadically maintained by their creators. Less than half of these databases offered substructure searching using at least one of the aforementioned molecular representations, and many lacked stereochemical information. As a result, the authors built the largest collection of open-access natural products, named COCONUT (https://coconut.naturalproducts.net/), which compiled the structures and related information of over 400 000 non-redundant NPs.
The same research group has continued developing and curating COCONUT and has released an improved version called LOTUS.[98]

The need for efficient substructure searching in growing chemical databases and for reduced storage space has also led to the development of molecular fingerprints. Initial bitstring fingerprints denoted the presence (1) or the absence (0) of substructures as binary vectors.[99,100] Subsequent topological fingerprints based on atom pairs,[101] local circular neighborhoods such as Extended Connectivity fingerprints (ECFP),[102] and Molecular ACCess System (MACCS) keys,[103] to name a few, were specifically designed for bioactivity prediction and similarity analysis. Bidimensional fingerprints were implemented to compare the molecular similarity of NPs against synthetic chemical libraries. The similarity of any two molecules is measured using one of many similarity metrics such as the Tanimoto coefficient.[104-106] In their seminal 1999 paper, Henkel and co-workers[107] converted 78 318 entries from the Database of Natural Products, 29 432 entries from the Bioactive Natural Product Database, 182 822 chemical compounds from the Available Chemicals Directory, 14 596 drugs and some synthetic compounds from Bayer AG into MACCS and UNITY formats. NPs presented larger molecular weights and distinct heteroatom distributions, and over 40% of NP-derived pharmacophores were not represented in the synthetic libraries.
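For two binary fingerprints with a and b bits set, of which c are shared, the Tanimoto coefficient is T = c / (a + b - c). A minimal sketch is given below; the 8-bit strings are invented for illustration and do not correspond to any real fingerprint scheme, and the function names are hypothetical.

```python
def bits_from_string(bitstring):
    """Return the set of positions that are switched on in a '0'/'1' string."""
    return {i for i, bit in enumerate(bitstring) if bit == "1"}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Hypothetical 8-bit fingerprints (real fingerprints use hundreds to thousands of bits).
fp_natural_product = bits_from_string("11010010")
fp_synthetic = bits_from_string("10011010")

print(tanimoto(fp_natural_product, fp_synthetic))
```

In this toy case the two fingerprints share three of the five bits set in either one, giving a coefficient of 3/5.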
In the two decades that followed, pharmaceutical companies and academic institutions integrated bioactive NP scaffolds into their combinatorial drug design strategies and created diversity-oriented synthesis (DOS), diverted total synthesis (DTS), biology-oriented synthesis (BIOS) and function-oriented synthesis (FOS) of NP-like libraries, as summarized in many reviews.[108-115] In parallel, computational scientists have applied the aforementioned fingerprints to NPs and synthetic molecules alike to conduct successive structural similarity analyses[107,116-121] and build novel visual representations of the chemical space (see Mapping NPs in chemical space),[121-132] and they have generated new metrics, i.e., NP-likeness score or metabolite-likeness score (see Engineering likeness scores),[133-141] to monitor the success of that endeavour. In 2020, Seo and co-workers developed NC-MFP as the first natural product molecular fingerprint for NP drug discovery.[142] The same year, Capecchi and co-workers created the MinHashed atom-pair fingerprint reaching up to 4 bonds (or MAP4 for short), suitable for drugs, NPs and macromolecules alike.[143] With the latest MAP4, molecular representations depict small and large organic molecules, natural products and synthetic compounds alike. The goal is to embrace a universal fingerprint to describe and search chemical space. In contrast, we also see emerging novel fingerprints like NC-MFP that emphasize the singularity of chemical entities, possibly to classify structures or predict biological activity. Both types of fingerprints will likely co-exist for specific domain applications. In the early 2000s, researchers also created 3D fingerprints based on geometric distances[144,145] and methods such as the rapid overlay of chemical structures (ROCS)[146] to leverage spatial information and shape similarity.
Tridimensional fingerprints have predominantly been used to match ligands from virtual screening experiments, based on shape similarity.[147,148] Shape-matching has notably been used in scaffold-hopping, a process focused on discovering novel compounds by changing the core of known parent bioactive structures (see Generating de novo NP-inspired compounds).[149-151] In 2013, Riniker and Landrum benchmarked several fingerprints for ligand-based virtual screening, where the authors concluded that simpler topological (bidimensional) fingerprints could retrieve structural information, making scaffold-hopping potentially obsolete.[152] With respect to bioactive NPs, in 2017, Skinnider and co-workers presented an algorithm named LEMONS for the enumeration of hypothetical modular NPs. The authors leveraged their algorithm to compare different molecular similarity methods with the NP chemical space. Their results suggested that circular fingerprints with a retrosynthetic approach (GRAPE/GARLIC) would outperform the more conventional topological and structural fingerprints.[153] In 2020, Chen and co-workers applied ROCS to interrogate the potential macromolecular targets of NPs, a process coined de-orphanizing (see De-orphanizing). They successfully identified the targets for many small molecules, but they struggled to find those of NPs and macrocyclic ligands.[154] The overall advantages of 2D over 3D fingerprints remain to be demonstrated.[155] However, it is generally accepted that these advantages depend on the application (goal of the study) and the specific types of fingerprints to be compared. More recently, 3D fingerprints have been reported to predict and rank the biological activities from chemical structures, the so-called Quantitative Structure–Activity/Property Relationship (QSA/PR) models.[156]
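The Tanimoto coefficient mentioned earlier in this section can be illustrated on two bitstring fingerprints. The sketch below is a minimal illustration, not the implementation of any cited toolkit: the two 8-bit fingerprints are invented for the example, and returning 0.0 for two empty fingerprints is an assumed convention.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length binary fingerprints.

    Computed as (bits set in both) / (bits set in at least one).
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Assumed convention: two all-zero fingerprints have similarity 0.0.
    return both / either if either else 0.0


# Hypothetical 8-bit fingerprints (1 = substructure present, 0 = absent).
fp_natural_product = [1, 1, 0, 1, 0, 0, 1, 0]
fp_synthetic = [1, 0, 0, 1, 1, 0, 1, 0]

similarity = tanimoto(fp_natural_product, fp_synthetic)
print(f"Tanimoto similarity: {similarity:.2f}")
```

Here three bits are shared and five bits are set in at least one fingerprint, so the coefficient is 3/5. Real fingerprints such as ECFP or MACCS keys are much longer, but the calculation is the same.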

Vectorizing natural products with molecular descriptors

Besides fingerprints (frequently used by chemoinformaticians), computational chemists would use molecular representations to compute thousands of features (variables) known as molecular descriptors[157] through well-defined algorithms. These descriptors grasp specific molecular features (e.g., atomic properties, size, shape, flexibility, polarity, lipophilicity, pharmacophore) that researchers could easily interpret. Molecular descriptors have been central to the development of predictive QSA/PR modelling. They have been essential to describe the distributions of NPs and synthetic compounds in low-dimensional representations of chemical space(s). At the turn of the 21st century, Lipinski and co-workers devised a set of empirical rules or guidelines, known as rule-of-five (Ro5), based on key molecular descriptors to rapidly identify orally available small molecules from screening campaigns and combinatorial libraries.[158] Together with their structural similarity analyses, several computational studies used molecular descriptors to compare and describe the chemical spaces occupied by NPs, combinatorial chemical libraries, synthetic compounds and marketed drugs.[121-123,125-132] Natural products and macrocycles,[9] which were not initially considered to establish the Ro5 rules, have been found to violate one or more of these rules and yet, they exhibited oral bioavailability. 
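As an illustration of how such descriptor-based rules operate, the sketch below counts Ro5 violations from four precomputed descriptors, using the commonly stated thresholds (molecular weight above 500 Da, logP above 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors). The function name and the two sets of descriptor values are hypothetical and chosen only for the example; in practice the descriptors would come from a chemoinformatics toolkit.

```python
def ro5_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four rule-of-five criteria a molecule violates."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # calculated logP above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Hypothetical descriptor values, for illustration only.
small_molecule = dict(mol_weight=350.4, logp=2.1, h_donors=2, h_acceptors=5)
macrocycle_like = dict(mol_weight=1202.6, logp=3.6, h_donors=5, h_acceptors=12)

print(ro5_violations(**small_molecule))
print(ro5_violations(**macrocycle_like))
```

A molecule like the second one would be flagged by the rule even though, as noted above, some NPs and macrocycles that break these criteria remain orally bioavailable.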
In 2008, Quinn and co-workers attempted to establish a set of rules for NPs only.[159] Several research groups have followed suit, establishing a new set of empirical rules based on molecular descriptors, known as ‘beyond the Ro5’ (bRo5), to explain cell permeability and oral bioavailability of macrocycles.[160-168] In a recent chapter, Grisoni and co-workers reviewed the impact of molecular descriptors upon chemoinformatic applications.[169] The authors primarily introduced molecular representations beyond 3D leading to new features such as conformational flexibility, protonation states or orientations. Once the irrelevant features are removed (missing values, low variance threshold, multicollinearity) and the remaining descriptors are scaled, they are employed for similarity search or QSA/PR modelling. In the former application, the authors cautioned about choosing the correct molecular descriptors and the distance metrics to quantify (dis-)similarity between chemical entities. They advised that optimal molecular descriptors and distance measures could be selected following their quantification through the enrichment factor.[170] In the latter application, identifying the best molecular descriptors to predict the biological activity/property of interest relies mainly on the stability, performance and interpretability of the algorithm in use as well as the metrics (e.g., accuracy, root mean squared error) for model evaluation. Over the past decade, deep learning (DL) algorithms have been widely adopted in domains such as computer vision[171] and natural language processing.[172] Recently, DL models emerged for their applications to drug discovery and molecular informatics.[173-176] At their core, neural networks handle large datasets and capture the complex relationships between input features (e.g., 2D/3D fingerprints, molecular descriptors) and output decisions (e.g., biological activity, ADME/Tox). 
Despite their remarkable improvements over traditional ML algorithms, DL models are mostly built from a chosen set of features (i.e., molecular representations) rather than learning from “raw” chemical information. Existing convolutional neural networks (CNNs), used for image classification, are ill-suited for reading 2D graph depictions or 3D structures. Chemical entities such as NPs, synthetic small molecules or drugs could be depicted as molecular graphs of irregular sizes and shapes. Moreover, CNNs conventionally scan images in a specific order; the DL architectures must correctly read the atoms (i.e., nodes/vertices) and chemical bonds (i.e., edges) that molecular graphs are made of. Recent efforts have been achieved with the development of graph convolutional networks (GCNs), setting the state-of-the-art techniques to read the irregular and raw information coming from molecular graphs. So far, Sun and co-workers have reviewed their applications to four domains of the drug discovery pipeline; QSA/PR modelling, drug-target/drug–drug interaction, synthesis planning, and de novo molecular design.[177] Several studies were reported to use large and general chemical datasets such as NCI dataset from the National Cancer Institute (https://cactus.nci.nih.gov/download/nci/), three datasets from European Molecular Biology Laboratory (https://www.ebi.ac.uk – SIDER, STITCH, ChEMBL) or the University of California San Diego's BindingDB (https://www.bindingdb.org/bind/index.jsp) to develop GCNs with domain-specific applications. Close to NPs, Sanchez-Lengeling and co-workers reported the odor profiles (138 labels) for some 5030 molecules from GoodScents perfume materials and Leffingwell 2001 PMP databases in 2019.[178] On average, each molecule presented 1–15 odor labels. The authors could predict using GCNs all 138 descriptors for all molecules at once due to the observed strong correlations between structures and odor labels. 
GCNs remain to be explicitly applied to NP databases like COCONUT or LOTUS (vide supra).
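To make the idea of reading atoms (nodes) and bonds (edges) directly from a molecular graph more concrete, the sketch below performs a single, simplified graph-convolution step: each atom averages its own feature vector with those of its bonded neighbours, applies a weight matrix, and passes the result through a ReLU. This is a didactic simplification and an assumption of this example, not the architecture of any model cited above; the three-atom graph, the one-hot features and the identity weight matrix are all hypothetical.

```python
def graph_conv_step(node_features, edges, weights):
    """One simplified graph-convolution step with mean aggregation.

    node_features: list of per-atom feature vectors (lists of floats)
    edges: list of (i, j) index pairs, one per chemical bond (undirected)
    weights: matrix (list of rows) mapping input features to output features
    """
    n = len(node_features)
    # Each atom aggregates over itself (self-loop) and its bonded neighbours.
    neighbours = {i: {i} for i in range(n)}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    dim_in = len(node_features[0])
    dim_out = len(weights[0])
    updated = []
    for i in range(n):
        group = neighbours[i]
        mean = [sum(node_features[j][k] for j in group) / len(group)
                for k in range(dim_in)]
        projected = [sum(mean[k] * weights[k][o] for k in range(dim_in))
                     for o in range(dim_out)]
        updated.append([max(0.0, v) for v in projected])  # ReLU activation
    return updated


# Hypothetical heavy-atom graph C-C-O with one-hot features [is_carbon, is_oxygen].
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]

hidden = graph_conv_step(features, bonds, identity)
# Sum over atoms gives a fixed-length vector whatever the size of the molecule.
molecule_vector = [sum(col) for col in zip(*hidden)]
print(hidden)
print(molecule_vector)
```

Because the aggregation runs over neighbours rather than over a fixed pixel grid, the same step applies to graphs of any size and shape, which is the property that conventional image CNNs lack. In a trained network the weight matrix would be learned and several such steps would be stacked.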

Mapping natural products in chemical space

The chemical space is the geometric space defined by all the possible chemical compounds, their structural and functional properties. An NP chemical space refers to the space occupied by a set of known NPs. Visualizing this high-dimensional space through human-readable graphical representations (of one to three dimensions) has been critical to decision-making and advances in the drug discovery process.[179,180] One of the most common ways to generate visual representations of the chemical space is using coordinate-based representations that often require reducing the number of dimensions. Transforming high-dimensional data into a smaller set of dimensions to better understand and interpret results is known as dimensionality reduction, a technique familiar to many AI projects.[181] Over the last two decades, numerous research groups have implemented different dimensionality-reduction techniques to explore NP chemical space, extensively reviewed elsewhere.[128,182-184] Besides mapping chemical spaces, dimensionality-reduction techniques expose structure–activity/property relationships (SA/PRs) between compounds and compound datasets. The increasing amount of data stored in chemical datasets makes the visualization of SA/PRs a challenging endeavour.[185] Finally, dimensionality-reduction techniques help define the applicability domain of QSA/PR modelling, a specific region in the underlying chemical space where the model predictions are considered reliable. In that application, dimensionality-reduction techniques identify outliers and generate robust models. Overall, three techniques are commonly employed to either map the chemical space, define its limitations or exhibit SARs/SPRs; principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), and self-organizing map (SOM). 
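As a minimal illustration of coordinate-based dimensionality reduction, the sketch below extracts the first principal component of a small descriptor table using power iteration on the covariance matrix and projects each compound onto it. It uses only the Python standard library for transparency; the descriptor values are invented, descriptors would normally be scaled beforehand, and a production analysis would rely on an established numerical library rather than this hand-written routine.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    rows: list of compounds, each a list of descriptor values.
    The direction is found by power iteration on the sample covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centred = [[r[k] - means[k] for k in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]

    vec = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance in the data; nothing to extract
            break
        vec = [v / norm for v in nxt]

    scores = [sum(c[k] * vec[k] for k in range(d)) for c in centred]
    return vec, scores


# Hypothetical table: four compounds described by three descriptors.
descriptors = [
    [1.0, 2.0, 0.5],
    [2.0, 4.0, 0.5],
    [3.0, 6.0, 0.5],
    [4.0, 8.0, 0.5],
]
direction, coordinates = first_principal_component(descriptors)
print(direction)
print(coordinates)
```

In this toy table the second descriptor is exactly twice the first and the third is constant, so a single coordinate per compound captures all of the variance. Real NP datasets need two or three components, or non-linear methods such as t-SNE, to give a readable map.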
Early accounts to support the analysis, navigation and comparison of chemical space(s) include the works by Schneider and co-workers introducing SOMs to NPs and drugs with scaffold architecture and pharmacophores.[120,122] In 2003, Feher and Schmidt[117] compared the property distributions of drugs, NPs, combinatorial libraries using PCA. Larsson and co-workers developed ChemGPS-NP,[124] a PCA-like representation to compare NP libraries like WOMBAT[121] with their biological activities. Waldmann and co-workers have charted the NP chemical space with SCONP,[123] and ScaffoldHunter.[126] Both tools explore the relationships between the more and more complex scaffolds and their biological activities through the intuitive hierarchical organization of scaffold libraries arranged in tree-like maps.[125,127,129,130,132] These tree-like arrangements led the authors of this review to develop the ChemMaps.[186] In 2020, Reymond's group embedded chemical spaces with very large dimensions into two-dimensional trees named TMAPs[187] along with their recently reported fingerprint MAP4 (ref. [143]) to analyze the similarity between 25 523 NPs of bacterial or fungal origin.[188] Sánchez-Cruz and co-workers also implemented TMAPs to display NP databases, synthetic compound collections, as well as NP-based fragment libraries.[189] The same year, Chávez-Hernández and co-workers applied TMAPs to compare the chemical spaces of 382 248 NPs from the database COCONUT (vide supra), molecules from the ‘dark chemical matter’, and other datasets.[190] Similarly, the authors compared 52 630 molecular fragments generated from COCONUTs NPs with 14 001 fragments from dark chemical matter. 
They concluded that the NPs (complete compounds and fragments) largely delineated the chemical space.[191] Of note, fragment libraries of NPs can be very useful for the rational fragment-based design of the so-called “pseudo-NPs”.[192] With regards to graphically evaluating the predictive reliability of QSA/PR models, Majumdar and Basak used robust PCA to define a reliable predictive space of 508 chemical mutagens in 2016.[193] The same year, Aniceto and co-workers introduced reliability-density neighbourhoods.[194] Last year, Plisson and co-workers combined unsupervised multivariate outlier detection methods with a t-SNE manifold to delineate the limitations of their hemolytic QSA/PR models. The authors applied their methods to discover novel (non-)hemolytic antimicrobial peptides of natural origin.[195]

Engineering likeness scores

Natural products adopt numerous shapes and ring systems. They contain several oxygens but fewer nitrogen, sulfur, and halogen atoms than synthetic counterparts. Their structural complexity includes a high fraction of carbon sp3 atoms, stereogenic centres, and multiple hydrogen-bonding functional groups (donors and acceptors).[107,116,117,122,196-198] Small NPs are intrinsically rigid[198] whereas large NPs (more than 500 Daltons), in particular macrocycles,[9] present a higher degree of flexibility, which confer to both sized molecules optimal affinity and specificity to bind to proteins and protein–protein interactions. This optimization process has been strongly attributed to the coevolution or complementary chemical design between NPs and specific protein targets to benefit the producer's survival fitness.[199,200] Consequently, NPs and their derived physicochemical properties are regarded as privileged features supporting their total syntheses and the design of bioactive compound libraries. Computational studies have contributed to the design of focused compound libraries by creating new scoring measures to quantify how similar a compound is to the chemical space. In machine learning, generating novel relevant feature(s) is referred to as feature engineering. In 2008, Ertl and co-workers developed a Bayesian measure named NP-likeness score that quantified the similarity of a compound according to the characteristic structural fragments in NPs.[133] The authors compared NPs, synthetic molecules (SMs) and drugs from DrugBank based on the unidimensional distributions of their NP-likeness scores. They could also identify common building blocks to both NPs and drugs. 
Three years later, Jayaseelan, Ertl and co-workers implemented an open-source, open-data version of the scoring system.[201] And in 2019, Sorokina and Steinbeck created the NaPLeS web application (http://naples.naturalproducts.net) that computes the NP-likeness score for chemical libraries.[202] Alongside Ertl's original NP-likeness score, Yu reported an alternative approach to quantify NP-likeness, based on Extended Connectivity Fingerprints (ECFP).[203] Recently, Chen and co-workers developed and validated new ML models for the discrimination of NPs and SMs for the quantification of NP-likeness.[204] NP-scout is the web application derived from this work that computes the probability of a molecule being an NP based on its physicochemical properties, Morgan2 fingerprints, and MACCS keys. In addition, NP-scout allows visualizing atoms in molecules that make decisive contributions to the assignment of compounds to any class by integrating similarity maps. Similarly to the NP-likeness score, ML applications have contributed to the rationalization of alternative scoring systems to narrow large chemical compound libraries with metabolite-likeness,[135,137,138,141,205,206] lead-likeness,[207,208] or drug-likeness[209,210] profiles. Such concepts in more sophisticated abstract terms have allowed the elaboration of models that could be applied for the identification of drug-like NPs beyond the application of empirical rules. Recently, Marshall and co-workers introduced the first intrinsic measure of molecular complexity, called the molecular assembly index (MA).[211] The team developed the MA index to easily track complex molecules in abundance using mass spectrometry, and to support the existence of living producers within our cryptic terrestrial ecosystems or beyond, on exoplanets. 
To illustrate an application of the molecular complexity index, the authors retrieved 2.5 million small molecules (less than 600 Daltons) from Reaxys® database (https://www.elsevier.com/solutions/reaxys), and compared the MA distribution (1–25) between four libraries; NPs, industrial compounds, metabolites, and pharmaceuticals. Their results showed that the MA index is mostly constrained by mass, and all libraries exhibited a wide range of values. Moreover, the index estimated well the fragmentation complexity in MS/MS spectra from different biological samples. Beyond its application to identify life in outer space, one might divert the original purpose of the molecular complexity index as a fitness function to optimize the design of NP-inspired drugs.
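The fragment-based, Bayesian flavour of an NP-likeness score can be sketched as a sum of log-ratios: fragments seen more often in NPs than in synthetic molecules contribute positively, and vice versa. The code below is a simplified sketch in the spirit of such scores and should not be read as the published scoring function; the fragment names, the counts, the Laplace pseudocount and the normalization by atom count are assumptions made for the example.

```python
import math


def fragment_weight(count_np, count_sm, total_np, total_sm, pseudo=1.0):
    """Log-ratio of a fragment's smoothed frequency in NPs versus SMs."""
    freq_np = (count_np + pseudo) / (total_np + 2 * pseudo)
    freq_sm = (count_sm + pseudo) / (total_sm + 2 * pseudo)
    return math.log(freq_np / freq_sm)


def np_likeness(fragments, n_atoms, counts, total_np, total_sm):
    """Sum the fragment weights of a molecule and normalize by its atom count.

    Fragments absent from the reference counts contribute zero
    when the two reference sets have the same size.
    """
    total = 0.0
    for frag in fragments:
        count_np, count_sm = counts.get(frag, (0, 0))
        total += fragment_weight(count_np, count_sm, total_np, total_sm)
    return total / n_atoms


# Hypothetical reference statistics: fragment -> (occurrences in NPs, in SMs).
reference_counts = {
    "glycoside_like": (900, 100),
    "sulfonamide_like": (50, 950),
}
TOTAL_NP = TOTAL_SM = 10_000  # assumed sizes of the two reference sets

score_a = np_likeness(["glycoside_like"], 20, reference_counts, TOTAL_NP, TOTAL_SM)
score_b = np_likeness(["sulfonamide_like"], 20, reference_counts, TOTAL_NP, TOTAL_SM)
print(score_a, score_b)
```

A positive score points toward NP-like chemistry and a negative one toward synthetic-like chemistry. The same summation could in principle serve as one term of a fitness function when designing NP-inspired compounds.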

Predicting biological functions

Bioactive NPs are present in small amounts in natural crude extracts (NCEs, μg to a few mg), quantities that are sometimes insufficient to conduct biological evaluations in multiple and successive phenotypic or target-based assays. Traditional bioassay-guided fractionation looks at only a handful of biological responses at a time; the fractions that do not respond to the assay(s) are considered inactive, the NPs within are left out, and their biological profile(s) remain unresolved. Atanasov and co-workers recently discussed the roles of genome mining, metabolite engineering and cultivation systems to tackle that recurrent issue.[12] On the ground, research laboratories stock their purified NPs and NCEs in freezers (diluted at different concentrations or lyophilized) to maximize their future screening campaigns. Some countries have created national chemical repositories, such as France's Chimiothèque Nationale (https://chembiofrance.cn.cnrs.fr) and Compounds Australia (https://www.griffith.edu.au/griffith-sciences/compounds-australia), to facilitate the interactions between chemical and biological laboratories. In silico, structure-based approaches (i.e., docking and virtual screening) and ligand-based approaches (i.e., QSA/PR modelling) predict the biological or ADME/Tox profiles of untapped chemical structures. Quantitative Structure–Activity/Property Relationship (QSA/PR) models use ML algorithms, mainly regressors and classifiers, to link the biological activity or physicochemical property of interest with changes in chemical moieties. 
In the past half-century, QSA/PR modelling has risen from a niche area of computational/theoretical chemistry to one of the major strategies to monitor large chemical libraries with applications in drug design, quantum mechanics, materials, and nanomaterials science, regenerative medicine and environmental toxicity.[212] The application of ML algorithms to predict the biological activities of NPs is new; most models were developed in the last 5–10 years.[213,214] Binary classification models predominate the list of ML algorithms for NP biological activity prediction (as active or inactive). Early classifiers include the development of a linear discriminant analysis (LDA) model using topological descriptors for the search of new anti-inflammatory NPs from MicroSource,[215-217] and two RF classifiers using CDK descriptors[218] to discover antimicrobial and anticancer agents among 1194 marine and microbial NPs from AntiMarin database.[219,220] In 2016, Dai and co-workers predicted the anticancer properties for 5278 out of 21 334 plant-derived NPs from TCM.[221] The authors used their in-house web server CDRUG[222] based on a method coined relative-frequency weighted fingerprints (RFW_FP) and a hybrid score to compare molecular similarity. Between 2017 and 2020, Rayan and co-workers introduced the iterative stochastic elimination (ISE) optimization for the discovery of bioactive NPs; they reported NPs for anticancer,[223] antidiabetic,[224] anti-inflammatory,[225] antibacterial,[226] and antifungal[227] activities. In all examples mentioned above, the authors constructed binary classifiers using sets of approved drugs with the biological activity of interest as the active class, and 2892 NPs as the inactive class. The ISE algorithm scores iteratively the variables (i.e., physicochemical descriptors) and combinations thereof with the biological activity. The common NPs were not inactive per se; rather, their biological activity was either ignored or unknown. 
Rayan and co-workers assumed these false negatives in the training set might have a minor effect.[223] Noteworthy, the authors did not consider the clear physicochemical differences between NPs and drugs as detrimental factors for their good model performances. In 2018, Egieyeh and co-workers trained several binary classifiers from a dataset of NPs with in vitro antimalarial activity and applied their best models (RF and Sequential Minimization Optimization (SMO) with 82.8% and 85.9% accuracy, respectively) against 450 NPs from InterBioScreen chemical library.[228] The same year, Onguéné and co-workers profiled in silico toxicity of 806 African plant-derived NPs from three databases curated for their antimalarial and anti-HIV properties (p-ANAPL, AfroMalariaDb, and Afro-HIV).[229] The team implemented the knowledge-based predictive software Derek[230] and the Cambridge University small-molecule pharmacokinetics prediction (pkCSM) web server.[231] The former detected toxic sub-structures in small molecules, and the latter assessed the ADME/Tox profiles from graph-based structural signatures. The pkCSM predictions used RF and Logistic Regression algorithms for classification tasks, GP and Model Tree for regression tasks.[231] The following year, Dias and co-workers illustrated the emergence of computational multi-target drug design[232] with the development of two QSA/PR models to discover novel antibiotics against methicillin-resistant Staphylococcus aureus (MRSA).[233] The first model was a modest regression model trained from 6645 anti-MRSA compounds with molecular descriptors to predict the negative decimal logarithm of minimum inhibitory concentration (pMIC) value against MRSA. The second model predicted the antibacterial activity using the 1D-NMR data (1H and 13C) from marine samples (crude extracts, fractions and pure compounds) with 77% accuracy. 
In 2020, Yoo and co-workers developed the first DL-driven multi-classification algorithm to identify the medicinal uses of NPs for 15 diseases.[234] The authors trained their model from a heterogeneous dataset of 4507 NPs and 2882 drugs, and 686 variables derived from PubMed text mining, molecular interactions features and physicochemical descriptors. The algorithm predicted 31 NPs and their possible uses in 15 phenotypes, including neurological disorders (Alzheimer's disease, Parkinson's disease, pain, stroke), heart problems (failure, myocardial infarction), infectious diseases (bacterial, urinary tract, skin), and autoimmune diseases (rheumatoid arthritis). Finally, in 2021, Liu and co-workers reported a DL algorithm called pretrained self-attentive message passing neural network (P-SAMPNN) to discover novel and potent anti-osteoclastogenic NPs.[235] Most QSA/PR models employed (binary) classification algorithms to predict phenotypic responses such as anticancer or anti-inflammatory activity. Dias and co-workers created the first regression model to predict a phenotypic response, the pMIC value of a compound with antibacterial activity against MRSA.[233] Very few research groups have ventured into developing models, regressors and classifiers, to predict the biological activity against specific protein targets, which is primarily due to the heterogeneity of biological information (e.g., MIC, IC50, EC50, Ki, % inhibition) in databases, and the variety of biological assay conditions these results came from. All the following models were developed to prioritize the virtual screening of sizable chemical libraries. The earliest target-based QSA/PR model could be attributed to Rupp and co-workers for discovering NP-derived Peroxisome Proliferator-Activating Receptor γ (PPARγ) activators for type 2 diabetes mellitus.[236] The authors trained Gaussian process (GP) regression models with 144 PPARγ synthetic ligands and their pKd values. 
The compounds were encoded using molecular descriptors (i.e., 2D properties, topological pharmacophores, fragment counts), structure graphs (i.e., bond types, pharmacophores types) and combinations thereof, leading to 16 different GP models. Besides classical performance metrics, the authors evaluated their models using the so-called fraction of inactive among top-20-ranked compounds or FI20. They selected the 30 top-ranked compounds from the Asinex and Platinum collections (http://www.asinex.com), including ten compounds from the model with the best FI20 score. A total of eight displayed moderate-high and selective agonistic activity towards PPARα or PPARγ activation assays. One of their hits, the moderate yet selective PPARγ agonist compound 8 is a derivative of truxillic acid.[236] In 2016, Sun and co-workers constructed five inductive logic programming (ILP) models to predict NPs that inhibit Sirtuin 1 (SIRT1), a promising target to treat type 2 diabetes and cancer.[237] The first “inhibitor” model used 179 SIRT1/2 inhibitors (IC50 < 50 μM), whereas the “activator” model was made from 51 activators with EC50 < 2.15 μM. A third “differential” model compared SIRT1/2 inhibitors and activators. Two additional models estimated inhibitor binding energy and inhibitor affinity to SIRT1. The inhibitor models prioritised the virtual screening of 1.4M TCM compounds, leading to twelve candidates for AutoDock Vina software. In 2018, Pang and co-workers developed two classifiers (Naive Bayesian – NB, Recursive Partitioning – RP) to identify NPs with agonistic activity against estrogen receptor α (ERα), a protein target for breast cancer.[238] The authors employed 9075 ERα agonists (IC50 < 10 μM) from the BindingDB and DUD-E databases and a suite of 2–3D physicochemical descriptors. The original dataset was divided into a 60:40 train/test split (6556:2519) with an imbalanced representation of active (2075) and inactive (7000) compounds. 
Both NB and RP classifiers showed good performances with a clear bias towards the inactive class in both training and testing sets. The authors applied their two best models to a library of 13 166 NPs, yielding 393 NB-derived candidates and 193 RP-derived candidates, with 162 NPs common to both sets. All candidates were docked against ERα (PDB ID: 3ERT) using the docking programs LibDock and CDOCKER, leading to the discovery and biological evaluations of eight NPs with antiestrogenic effects. Despite their cost-effectiveness and time economy, many ML models have incorporated drugs and synthetic compounds into their training sets, which affects the performances and the applicability domains of NP bioactivity predictors. For example, in 2017, Zhang and co-workers developed blood–brain barrier (BBB) permeability models that were applied to Traditional Chinese Medicine (TCM).[239] Their preliminary models, based on a synthetic chemical library, performed poorly. Their subsequent models (SVM, RF, Naïve Bayes, and probabilistic neural network), integrating an NP dataset, showed an overall 90% accuracy. Additional in vitro evaluation further validated the BBB permeability predictions for 25 out of 32 TCM molecules. Alternatively, in 2019, Plisson and Piggott created good BBB permeability models based on ensemble classifiers built from 448 disclosed small molecules,[240] that they applied to 471 marine NPs exhibiting kinase inhibition.[241] The authors further implemented univariate Mahalanobis distance measures to define the applicability domain of their models, leading to the discovery of 13 marine-derived kinase inhibitors with appropriate physicochemical characteristics for BBB permeability. These strategies are mainly due to the lack of experimental data on the activity of NPs and the difficulty of representing compounds of natural origin in linear notations. 
Inadequate molecular representations could impact the performances of ML models as well.[169,214] To date, the same representations define any organic molecule, drug and NP alike. Current proposals to address these issues include integrating experimental NP data to training sets, harmonizing NP bioassay results in public databases, applying ML algorithms that deal with small and imbalanced datasets, and creating NP-specific molecular representations.[213] Finally, the democratization of AI/ML algorithms without proper training and expertise has also led to a surge of malpractices in ML modelling and chemoinformatics, with numerous non-reproducible QSA/PR models. Computational experts remind the scientific community about the best practices to adopt in QSA/PR and ML modelling.[242-244]
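The FI20 metric mentioned earlier in this section, i.e. the fraction of inactive compounds among the 20 top-ranked predictions, is simple to compute once a model has scored a library. The sketch below assumes that a higher score means a more promising compound and uses an invented ranking of 40 compounds; the function name and the data are hypothetical.

```python
def fraction_inactive_top_k(scores, is_active, k=20):
    """Fraction of inactive compounds among the k highest-scoring ones.

    scores: predicted values, where higher means more promising (assumption)
    is_active: experimental labels (True = active), in the same order
    """
    if not scores or len(scores) != len(is_active):
        raise ValueError("scores and labels must be non-empty and aligned")
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    inactive = sum(1 for _, active in top if not active)
    return inactive / len(top)


# Hypothetical library: 40 compounds, of which the 15 best-scored are active.
predicted = list(range(40, 0, -1))
labels = [index < 15 for index in range(40)]

fi20 = fraction_inactive_top_k(predicted, labels, k=20)
print(f"FI20 = {fi20:.2f}")
```

Lower values are better: an FI20 of 0.25 means that five of the twenty compounds sent for testing would be expected to be inactive. Unlike global accuracy, this metric focuses on the small part of the ranking that is actually purchased or synthesized.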

De-orphanizing

We rarely know the native binding targets of NPs. Moreover, most bioactive NPs are discovered through phenotypic assays, where their (protein) drug targets remain elusive. Recent advances in screening technologies[245] and novel laboratory strategies help to identify their plausible modes of action,[246-248] a process known as “target fishing.” In parallel, growing computational efforts for ligand-based target fishing include developing ML models and web servers to analyze the medicinal potential of the many NPs annotated in public chemical databases.[249] These tools represent an opportunity to “de-orphanize” NPs by predicting their macromolecular (protein) targets. De-orphanizing predictors of NP drug targets typically employ supervised or semi-supervised ML algorithms trained with combinations of labelled and unlabelled features such as structural representations and types of interactions important for NP pharmacological effects.[250] Several examples of web servers recently developed with ML methods for ligand-based target fishing are described in Table 1. The majority of these servers rely on chemical similarity searches. The PASS (Prediction of Activity Spectra for Substances) software is one of the earliest attempts to predict thousands of biological activities from two-dimensional chemical structures using molecular fragment descriptors.[251] Applications of the PASS web server in the study of NPs have been extensively described.[252] Recently, PASS was implemented in the SistematX Web Portal to profile a large NP database of Brazil.[253] In a similar context, the SEA (similarity ensemble approach) server compares the structural similarity of a query molecule to a group of compounds of a potential target and has evaluated the statistical significance of the resulting similarity score.[254] This server has been validated with NPs such as miconidine acetate (main metabolite of the Brazilian plant Eugenia hiemalis),[255] and the physalins A, D, F and G.[256]
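Since most of these servers rely on chemical similarity searches, the basic idea can be sketched as a nearest-neighbour lookup: each candidate target is scored by the highest Tanimoto similarity between the query NP and any ligand known for that target. This is a deliberately simplified sketch and not the algorithm of PASS, SEA or any other server discussed here (SEA, for instance, additionally assesses statistical significance); the fingerprints, stored as sets of "on" bit positions, and the target names are hypothetical.

```python
def tanimoto_sets(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints stored as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def rank_targets(query_fp, ligands_by_target):
    """Rank targets by the best similarity between the query and their ligands."""
    ranking = []
    for target, ligand_fps in ligands_by_target.items():
        best = max((tanimoto_sets(query_fp, fp) for fp in ligand_fps), default=0.0)
        ranking.append((target, best))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Hypothetical annotated ligands: target name -> list of ligand fingerprints.
known_ligands = {
    "target_A": [{1, 2, 3, 5}, {7, 8}],
    "target_B": [{4, 9, 10}],
}
orphan_np = {1, 2, 3, 4}  # hypothetical fingerprint of an orphan NP

for target, score in rank_targets(orphan_np, known_ligands):
    print(f"{target}: {score:.2f}")
```

The ranking is only as good as the overlap between the query and the annotated ligands, which is one reason why structurally intricate NPs are harder to de-orphanize with tools trained mostly on synthetic compounds.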

Table 1. ML algorithms/tools used to predict molecular targets of NPs (abbreviations are defined below the table)

Tool	Algorithm(s)	Application(s)	Ref
PASS (prediction of activity spectra for substances)	NB	It predicts over 3500 pharmacotherapeutic effects, mechanisms of action, interaction with the metabolic system, and specific toxicity for drug-like molecules on the basis of their structural formula	251
SEA (similarity ensemble approach)	Kruskal algorithm of MST	It relates proteins based on the set-wise chemical similarity among their ligands	254
SPiDER (self-organizing map-based prediction of drug equivalence relationships)	SOMs	Useful to identify innovative compounds in chemical biology, and help investigate the potential side effects of drugs and their repurposing options	257
TIGER (Target Inference GEneratoR)	Multiple SOMs	It performs qualitative predictions of up to 331 targets	258
DEcRyPT (drug–target relationship predictor)	RF	It deconvolves phenotypic hit targets and accurately predicts affinities	258
STarFish	kNN, RF, MLP and LoR	It considers small molecule binding to 1907 targets and its performance on natural products target prediction is explicitly considered	259

kNN: k-nearest neighbors; LoR: logistic regression; MLP: multilayer perceptron; MST: minimum spanning tree; NB: naive Bayes; RF: random forest; SOM: self-organizing map.

However, these tools might not predict the biological targets of structurally intricate NPs because of their somewhat different molecular constitution compared with that of synthetic drugs. In 2014, Reker and co-workers presented SPiDER (self-organizing map-based prediction of drug equivalence relationships) to tackle this problem. This approach relies on self-organizing maps (SOMs), a clustering approach that uses pharmacophore correlations and physicochemical properties to map the relationships between chemical compounds.[257] SPiDER was adapted to predict the targets of natural products with more complex and challenging structures, such as the macrocyclic archazolid A (ArcA),[258] resiniferatoxin,[259] (−)-englerin A[260] and doliculide.[261] The authors deconvoluted the macrocyclic structures into fragments, assuming that the bioactivity fingerprint could be partly stored in those fragments, and used them as surrogate structures for processing with SPiDER. They demonstrated that natural product-derived fragments (NPDFs) may help predict the macromolecular targets of their corresponding parent NPs. In 2016, Keum and co-workers created good ML classifiers to predict the interactions between orphan herbal compounds and several protein targets (i.e., GPCRs, ion channels, transporters, receptors, enzymes).[262] In 2017, in a subsequent development, Schneider and Schneider presented TIGER (Target Inference GEneratoR), a chemocentric computational method for target prediction that leverages a consensus of two SOMs with slightly modified descriptors.[263] Unlike previous approaches, TIGER scores each target, with higher score values suggesting greater confidence in the prediction. 
TIGER was validated for the target prediction of resveratrol,[263] (±)-marinopyrrole A[264] and (−)-galantamine.[265] In 2018, Rodrigues and co-workers developed an orthogonal ML workflow called DEcRyPT (Drug–Target Relationship Predictor), based on RF regression, to deconvolve the targets of phenotypic hits and accurately predict affinities. DEcRyPT was used successfully to identify β-lapachone as an allosteric modulator of 5-lipoxygenase.[266] The following year, the team reported a revised method, DEcRyPT 2.0, which includes y-randomization[267] and predicted the biological target(s) of celastrol more robustly.[268] In 2019, Cockroft and co-workers developed the online target prediction tool STarFish, which was trained on a synthetic composite dataset of 107 190 compound–target pairs (88 728 unique compounds and 1907 unique targets) and tested on an NP dataset containing 5589 compound–target pairs.[269] STarFish uses a stacking approach, in which logistic regression serves as a meta-classifier that combines the predictions of individual models and can produce better predictions than any of them alone. Furthermore, a multilabel classification approach is taken so that polypharmacology is considered during training. Beyond fishing the biological targets of NPs, de-orphanizing ML approaches provide new opportunities for drug repurposing/repositioning.[270,271]
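The stacking idea mentioned for STarFish can be illustrated with a minimal sketch in plain Python. This is not the STarFish implementation: the two-descriptor compounds, the single-descriptor base models and the hyperparameters are all hypothetical, and the multilabel aspect is left out. It only shows the general pattern, in which level-0 models are fitted on one split and a logistic-regression meta-classifier is fitted on their predictions for a held-out split.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=3000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical compounds with two scaled descriptors; label 1 = binds the target.
base_X = [(0.10, 0.30), (0.30, 0.10), (0.20, 0.20),
          (0.70, 0.90), (0.90, 0.70), (0.80, 0.80)]
base_y = [0, 0, 0, 1, 1, 1]
meta_X = [(0.15, 0.25), (0.25, 0.15), (0.30, 0.30),
          (0.75, 0.85), (0.85, 0.75), (0.70, 0.70)]
meta_y = [0, 0, 0, 1, 1, 1]

# Level 0: two deliberately simple base models, each seeing one descriptor only.
base_models = [fit_logistic([(x[i],) for x in base_X], base_y) for i in range(2)]

def base_predictions(x):
    return tuple(predict_proba(m, (x[i],)) for i, m in enumerate(base_models))

# Level 1: the meta-classifier is trained on base-model outputs for held-out data,
# so it learns how much to trust each base model rather than re-learning the data.
meta_model = fit_logistic([base_predictions(x) for x in meta_X], meta_y)

def stacked_proba(x):
    return predict_proba(meta_model, base_predictions(x))

likely_binder = stacked_proba((0.85, 0.80))
likely_non_binder = stacked_proba((0.15, 0.20))
```

In practice the held-out predictions are usually obtained by cross-validation rather than a single split, and the base models would be different learners trained on molecular fingerprints.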

Generating de novo natural product-inspired compounds

Natural products contain privileged features for interacting with (protein) drug targets, which has supported their use as starting, intermediate or final products in the design of synthetic compound libraries. Despite these advantages, most NPs do not fulfil the drug discovery paradigm in terms of toxicity, selectivity, lipophilicity and bioavailability, and require medicinal chemistry interventions (e.g., 92% of NP-inspired drugs were altered between 1980 and 2014 (ref. [10])). Their complex structures (i.e., stereogenic centres, heteroatom-containing functional groups, fused rings) have often handicapped the synthetic routes to analogues and their structure–activity/property relationship (SA/PR) studies. Moreover, patenting bioactive NPs in their original form might not be authorised where the compounds were discovered.[272] Consequently, multiple synthetic strategies have been developed to design lead structures that preserve the NP privileges.[192,273,274] The first strategy is biology-oriented synthesis (BIOS), where NPs are taken as templates to generate synthetically accessible derivatives and mimetics.[275,276] Diversity-oriented or diverted total synthesis (DOS/DTS) focuses on populating underexplored chemical space by creating new chemical structures with NP-like pharmacophores.[277-279] The complexity-to-diversity (CtD) strategy synthetically mimics enzymatic processes by chemically functionalizing and distorting NPs into structurally diverse compound collections.[280,281] Finally, function-oriented synthesis (FOS) refines the BIOS concept by recapitulating or fine-tuning the function of a biologically active lead structure to obtain simpler scaffolds, increase their ease of synthesis, and achieve synthetic innovation.[282,283] Waldmann and co-workers have recently introduced a set of principles to guide the generation of “pseudo-NPs”, small-molecule compound libraries that combine two or more NP-derived fragments (NPDFs), leading to 
unprecedented scaffolds. Their chemoinformatic analyses suggested that pseudo-NPs shared more characteristics (sizes, shapes, lipophilicity) with drugs than other libraries such as BIOS and ChEMBL NPs.[192,284,285] Computer-aided de novo design tools[286,287] have appeared alongside the synthetic strategies over the last 20 years and have recently started to generate NP-like compounds. One FOS-inspired approach automatically morphs NPs into synthetically accessible and isofunctional compounds. First, several chemical candidates are produced in silico using a generative algorithm. The compound generation is steered by optimizing the topological similarity between the candidates and an NP template. Subsequently, the biological targets of the candidates are predicted computationally. In 2016, Friedrich and co-workers reported the computational de novo design of mimetics of the natural anticancer agent (−)-englerin A; the NP and its mimetics were all identified as potent TRPM8 agonists (TRPM8 stands for transient receptor potential calcium channel subfamily M (melastatin) member 8).[288] In 2018, Merk and co-workers successfully applied two generative algorithms to design fatty acid mimetics as new modulators of retinoid X receptor (RXR) and peroxisome proliferator-activated receptor (PPAR). 
The authors first used their in-house de novo design algorithm named DOGS[289] (Design Of Genuine Structures) to generate and test NP mimetics from dehydroabietic acid, isopimaric acid and valerenic acid, three known RXR agonists.[290] In 2021, Friedrich and co-workers revisited DOGS-based computational de novo design by generating (±)-marinopyrrole A mimetics as moderate to potent inhibitors of the cyclooxygenases COX-1 and COX-2.[291] In the last five years, scientists have started to design de novo organic chemical entities for materials science and drug discovery applications using generative AI.[4-6,292,293] Early applications of DL algorithms to produce new molecules include the use of recurrent neural networks (RNNs) with long short-term memory (LSTM),[294] autoencoders,[295,296] generative adversarial networks[297] and reinforcement learning.[298] In 2018, Merk and co-workers implemented an LSTM-RNN to produce new RXR and PPAR agonists inspired by 25 fatty acid mimetics.[299,300] The same year, Müller and co-workers adapted LSTM-RNNs to generate novel peptide sequences inspired by natural antimicrobial peptides, free of repetitive cysteine and proline residues.[301] In 2019, Zheng and co-workers developed a de novo molecular generator of quasi-biogenic compounds named QBMG.[302] The authors used an RNN with gated recurrent units (GRUs) and trained the generator with 153 733 biogenic compounds from the ZINC15 library.[303] In 2021, Bung and co-workers applied transfer learning and reinforcement learning to create novel small-molecule inhibitors of the 3CL protease of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The team pre-trained the deep neural network architecture with 1.6 million drug-like small molecules from ChEMBL and generated 42 484 molecules. 
They filtered the generated dataset based on physicochemical descriptors, rule-based criteria, and virtual screening scores, resulting in 33 new chemical entities, including two aurantiamide-like compounds.[304] Scaffold-hopping is an alternative strategy to computer-aided de novo design. This computational process, widely used in medicinal chemistry, aims at identifying chemical compounds with different molecular backbones that share a similar activity/property space.[149-151] Applied to NPs, scaffold-hopping means finding simpler NP mimetics, but the structural differences between NPs and synthetic compounds could hamper the computational process. In 2018, Grisoni and co-workers introduced a molecular similarity approach that hopped from complex NP scaffolds to simpler isofunctional synthetic mimetics while retaining their biological functions. The authors hopped from structure to structure using their in-house descriptors, called weighted holistic atom localization and entity shape (WHALES), for the computational search. They exemplified their strategy using four natural cannabinoids as queries, leading to seven novel biologically active compounds, three of which were cannabinoid receptor modulators.[19] The following year, the team employed WHALES descriptors in conjunction with the SPiDER and TIGER tools in a multitarget ligand design approach. Grisoni and co-workers identified eight small molecules inspired by the natural product (−)-galantamine that exhibited multiple target activity profiles against enzymes and protein receptors related to Alzheimer's disease.[265] In addition to the generation of simpler NP-inspired compounds, the total syntheses of NPs[305] and NP analogues[306] remain the livelihood of many chemists and continue to captivate the field of organic chemistry, driven by synthetic efficiency, elegance and quality. 
Training algorithms to support the autonomous synthetic planning of complex NPs has also evolved over half a century since LHASA,[307] giving rise to multiple software tools.[308] Artificial intelligence has been integrated into computer-aided synthetic planning (CASP) tools such as Chematica/Synthia, a hybrid human-AI system.[309,310] Artificial intelligence has also informed fully automated platforms for the synthesis of NPs.[311,312] The next paradigm shift aims to combine CASP with ML-driven models for predicting biological activities, biological targets and ADME/Tox properties to automate the discovery of biologically active NPs and the de novo design of NP-inspired drugs.[180]
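Two recurring steps in this section, steering or ranking candidates by their similarity to an NP template and filtering generated molecules with rule-based criteria, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited study: the candidates, their descriptor values and their fingerprint bits are invented, the Tanimoto coefficient stands in for whichever topological similarity a given tool uses, and a strict form of Lipinski's rule of five stands in for the unspecified rule-based criteria.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    fingerprint: frozenset  # hypothetical set of "on" bits of a structural fingerprint
    mol_weight: float       # Da
    logp: float
    h_donors: int
    h_acceptors: int

def tanimoto(a, b):
    """Tanimoto coefficient of two bit sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def passes_rule_of_five(c):
    """Strict variant of Lipinski's rule of five (no violation tolerated)."""
    return (c.mol_weight <= 500 and c.logp <= 5
            and c.h_donors <= 5 and c.h_acceptors <= 10)

def triage(candidates, template_fp, min_similarity=0.3):
    """Drop rule-breaking candidates, then rank the rest by template similarity."""
    scored = [(tanimoto(c.fingerprint, template_fp), c)
              for c in candidates if passes_rule_of_five(c)]
    kept = [(score, c) for score, c in scored if score >= min_similarity]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

# Hypothetical NP template and generated candidates (all values invented).
template_fp = frozenset({1, 2, 3, 4, 5, 6, 7, 8})
candidates = [
    Candidate("cand_a", frozenset({1, 2, 3, 4, 5, 6}), 350.0, 2.1, 2, 5),
    Candidate("cand_b", frozenset({1, 2, 9, 10}), 280.0, 1.5, 1, 4),
    Candidate("cand_c", frozenset({1, 2, 3, 4, 5, 6, 7, 8}), 720.0, 4.0, 3, 9),
    Candidate("cand_d", frozenset({3, 4, 5, 6, 11, 12}), 410.0, 3.3, 2, 6),
]

ranked = triage(candidates, template_fp)
for score, cand in ranked:
    print(f"{cand.name}: similarity {score:.2f}")
```

Here `cand_c` is identical to the template by fingerprint but is removed for its molecular weight, and `cand_b` passes the rules but is too dissimilar. A real workflow would compute fingerprints and descriptors with a cheminformatics toolkit and add further filters such as virtual screening scores.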

Conclusions

Natural products have given rise to multiple drug discovery success stories, yet the many challenges associated with their discovery or their design (minute amounts, unfriendly extracts, unknown biological functions, missing biological targets, difficult chemical syntheses, complex SA/PR studies, unfavourable ADME/Tox properties) led to the decline of NP drug discovery programmes. However, laboratory and computer scientists alike continue to marvel at NPs for their unique ability to bind biological drug targets specifically and for their therapeutic potential. Artificial intelligence and machine learning algorithms have slowly been integrated into different stages of NP drug discovery (1) to assist in discovering and elucidating bioactive structures and (2) to capture the molecular patterns of these privileged structures for molecular design and target selectivity. Regarding the early discovery of bioactive NPs, natural language processing and text-mining tools have barely begun to decipher the many bioactive compounds hidden or forgotten in codices of traditional medicines and peer-reviewed articles. In contrast, ML-fuelled applications in genome mining and dereplication processes have reduced the screening of redundant producers or natural crude extracts and accelerated the discovery of novel natural chemical entities such as the subclass V lanthipeptides. For reported NPs, the adoption of AI technologies started at the turn of the 21st century with the encoding of their structures into computer-readable formats (i.e., 1–3D molecular representations, molecular descriptors) and the development of chemical space visualization methods to manage and interpret the many naturally occurring compounds present in publicly available databases. 
The successive application of dimensionality reduction techniques such as PCA, t-SNE, SOM and lately TMAP has provided the means to compare NP privileged features (i.e., physicochemical properties, fragments, likeness scores) with those of drugs and synthetic libraries. In the 2010s, the development of ML models, i.e., regressions and classifications, to predict the biological activity/property of NPs pushed candidates towards more advanced stages of drug development. It is worth noting that many predictions might inadvertently discard bioactive NPs because they differ strikingly, in physicochemical and structural terms, from the model training sets (i.e., drugs). The boundaries within which these predictive models are reliable, known as their applicability domains, are not systematically identified. Future improvements of predictive ML models should include an understanding of the scope and limitations of the available data. Besides biological activities, predictive algorithms and derived web servers have de-orphanized NPs to identify therapeutically relevant protein partners, expanding the realm of applications beyond their natural functions. Finally, deep generative models are re-routing NP-inspired de novo design through the autonomous generation of new drug candidates with simplified structures and biological activities inherited from NPs. Likewise, combining de novo design with de-orphanizing models produces novel isofunctional chemotypes (i.e., NP mimetics) that populate NP-exclusive and uncharted regions of the chemical space. These strategies are improving the synthetic accessibility, potency, and drug-likeness of NP-inspired molecules.
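The applicability-domain concern raised above can be made concrete with a small sketch. It uses one common distance-based heuristic, which is an assumption of this example rather than a method taken from the reviewed studies: a query compound counts as inside the domain if its mean distance to its k nearest training compounds does not exceed a threshold derived from the training set itself. The two-dimensional descriptors, `k` and the multiplier `z` are hypothetical choices.

```python
import math
import statistics

def mean_knn_distance(point, references, k):
    """Mean Euclidean distance from `point` to its k nearest reference points."""
    distances = sorted(math.dist(point, ref) for ref in references)
    return sum(distances[:k]) / k

def fit_domain(training, k=2, z=1.0):
    """Threshold = mean + z * std of each training point's mean k-NN distance."""
    scores = []
    for i, point in enumerate(training):
        others = training[:i] + training[i + 1:]  # leave the point itself out
        scores.append(mean_knn_distance(point, others, k))
    return statistics.mean(scores) + z * statistics.pstdev(scores)

def in_domain(query, training, threshold, k=2):
    return mean_knn_distance(query, training, k) <= threshold

# Hypothetical scaled descriptors of a drug-like training set.
training = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.9), (1.0, 0.8)]
threshold = fit_domain(training)

drug_like_query = (1.05, 1.0)  # sits inside the training cluster
np_like_query = (3.0, 3.0)     # far from anything the model was trained on

print(in_domain(drug_like_query, training, threshold))
print(in_domain(np_like_query, training, threshold))
```

A prediction for the second query would be an extrapolation. Flagging it as such, instead of silently scoring it, is one way to avoid discarding structurally unusual NPs on the basis of unreliable predictions.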

Author contributions

FISG: visualization, investigation, writing – original draft, review & editing. VDAB: investigation, writing – original draft & review. JLMF: writing – review & editing. FP: conceptualization, visualization, investigation, writing – original draft, review & editing. All authors read and approved the final manuscript.

Conflicts of interest

There are no conflicts to declare.
