Literature DB >> 35211724

cropCSM: designing safe and potent herbicides with graph-based signatures.

Douglas E V Pires¹, Keith A Stubbs², Joshua S Mylne³, David B Ascher⁴.

Abstract

Herbicides have revolutionised weed management, increased crop yields and improved profitability allowing for an increase in worldwide food security. Their widespread use, however, has also led to a rise in resistance and concerns about their environmental impact. Despite the need for potent and safe herbicidal molecules, no herbicide with a new mode of action has reached the market in 30 years. Although development of computational approaches has proven invaluable to guide rational drug discovery pipelines, leading to higher hit rates and lower attrition due to poor toxicity, little has been done in contrast for herbicide design. To fill this gap, we have developed cropCSM, a computational platform to help identify new, potent, nontoxic and environmentally safe herbicides. By using a knowledge-based approach, we identified physicochemical properties and substructures enriched in safe herbicides. By representing the small molecules as a graph, we leveraged these insights to guide the development of predictive models trained and tested on the largest collected data set of molecules with experimentally characterised herbicidal profiles to date (over 4500 compounds). In addition, we developed six new environmental and human toxicity predictors, spanning five different species to assist in molecule prioritisation. cropCSM was able to correctly identify 97% of herbicides currently available commercially, while predicting toxicity profiles with accuracies of up to 92%. We believe cropCSM will be an essential tool for the enrichment of screening libraries and to guide the development of potent and safe herbicides. We have made the method freely available through a user-friendly webserver at http://biosig.unimelb.edu.au/crop_csm.

Entities: Chemical

Keywords: graph-based signatures; herbicides; machine learning; toxicity

Mesh：

Substances：
Herbicides

Year: 2022 PMID： 35211724 PMCID： PMC9155605 DOI： 10.1093/bib/bbac042

Source DB: PubMed Journal: Brief Bioinform ISSN： 1467-5463 Impact factor: 13.994

Introduction

Herbicides are widely used chemical agents capable of killing or inhibiting growth of unwanted plants, including weeds and grass types, that might compromise crop yields. Over the years, their adoption has revolutionised weed management, increased crop yields and improved profitability allowing for an increase in worldwide food security. Their widespread use, however, has also led to a rise in resistance [1]. Without appropriate measures to bring herbicides with new modes of action to the market, combined with a concerted global effort at product stewardship and regulation, herbicide resistance could reduce world food production in the coming years by 20–40%, leading to a global food security crisis [2]. Although agricultural practices have improved, decades of success by glyphosate and spiralling costs have stymied herbicide development. Designing potent herbicides is particularly challenging considering toxicity concerns, and currently available and widely used weed control compounds have shown to be potentially harmful for the environment, livestock and human health. Exposure to agrochemicals has been, for instance, shown to be one of the potential drivers to the decline of pollinators worldwide [3]. Despite the need for novel potent and safe herbicidal molecules, no herbicide with a new mode of action has reached the market in 30 years [4]. Developing herbicides, much like pharmaceuticals, involves a careful balance between efficacy and safety. In the pharmaceutical industry, drug development pipelines have tackled these challenges by modelling and optimising these important parameters early in the development process, which has been assisted by the implementation of computational pipelines. This has led, in general, to increased hit rates and decreased attrition due to poor toxicity profiles and, in the process, reduced development time, costs and animal testing [5-10]. Although many computer-aided approaches have proven invaluable for drug development, in contrast little has been done to aid the development of safe and potent agrochemicals. A relevant recent development has been the work of Oršolić et al. to establish structure–activity relationships linking herbicides with modes of action and weed selectivity [11]. To fill this gap, here we describe the design of the first computational method to support and guide rational development of safe and potent herbicides, cropCSM, using the largest curated small molecule database of experimentally characterised herbicidal activity to date. Our method uses the concept of graph-based signatures, represented as a cutoff scanning matrix (CSM) (Figure 1), to model small molecule physicochemical properties coupled with machine learning to train and validate predictive tools capable of accurately identifying molecules with potent herbicidal activity, as well as characterise their environmental and human toxicity profiles.

Figure 1

Modelling small molecule activity using graph-based signatures. Small molecules are modelled as unweighted, undirected graphs where nodes represent atoms and edges represent chemical bonds, with atoms labelled via pharmacophore modelling (left panel). All-pairs shortest paths distances (d) are calculated between different label types (middle panel) to represent molecules, their geometry and physicochemical properties as cumulative distributions (right panel, considering different cutoff/distance values).

Materials and methods

cropCSM development was divided into four main steps, including (i) data acquisition, which involved obtaining data relating small molecules to their herbicidal activity as well as human and environmental toxicity profiles; (ii) feature engineering, which involved calculating molecular properties and performing graph-based modelling to extracting relevant information used as evidence in the next steps to (iii) identify what makes up potent and safe herbicides, identifying enriched substructures and qualitatively assessing molecular properties and to (iv) train and test predictive model using supervised learning algorithms.

Data for herbicidal activity

A dataset of 4513 experimentally characterized, structurally diverse small molecules and their herbicidal activity profiles was obtained from Sukhoverkov et al. [12]. These were labelled either as active (997 molecules) or inactive (3516 molecules). They had an average molecular weight of 380 Da and logP of 2.4 (Figure S1). A database of 356 commercial herbicides was also used to evaluate cropCSM [12].

Data for environmental and human toxicity

We have developed new predictors based on six environmental and human toxicity data sets with experimentally characterised molecules. Environmental toxicity data sets included (a) honey bee (Apis mellifera) toxicity, which was composed of 247 toxic and 353 atoxic molecules [13]; (b) avian toxicity, composed of 461 small molecules and their effects on mallard duck (66 toxic and 395 atoxic) [14] and (c) flathead minnow toxicity, with lethal concentration values (LC50) for a diverse set of 554 molecules [15]. Human toxicity data sets included (1) AMES toxicity, with compounds labelled based on their carcinogenic potential (4632 carcinogenic and 3470 not carcinogenic) [16]; (2) oral acute toxicity in rats, denoted as lethal dose (LD50) values for 10 145 compounds [17] and (3) oral chronic toxicity in rats values for 567 compounds [18]. cropCSM: predicting safe and potent herbicides. Using chemical substructure mining, we identified common enriched substructures in compounds with herbicidal activity (A-left). Active compounds presented similar molecular properties of approved drugs (A-right). Performance of herbicide and environmental-toxicity predictors is shown in (B). Our herbicide predictor was able to accurately identify active compounds with AUC > 0.85 on cross-validation and blind test. Three environmental toxicity models have been developed and were capable of successfully measuring minnow toxicity (as a regression task, centre graph) as well as identifying potentially harmful compounds for Bees and Mallard (right-hand side graph). Sensitivity refers to the true positive rate, whereas specificity to the true negative rate.

Graph-based signatures and feature engineering

Graph modelling is an invaluable tool to model biological entities, including small molecules [19-22]. Over the years we have proposed and developed the concept of graph-based signatures (based on CSM concept [23]) to represent physicochemical and geometrical properties of a range of macromolecules [24-29] and their interactions [30-37]. These have also been successfully adapted to represent small molecules pharmacokinetics, toxicity and bioactivity [38-41]. Figure 1 depicts the main steps involved in feature engineering with graph-based signatures. Small molecules are modelled as unweighted, undirected graphs where nodes represent atoms and edges represent covalent bonds. Via pharmacophore modelling [38], atoms/nodes are labelled based on their physicochemical properties. This atomic graph representation of small molecules accounts for both their shape and composition, from which information is extracted as distance patterns. To do so, all-pairs shortest paths distances are calculated, and molecules are then represented simply as cumulative distribution functions of atom distances. These distances are further categorised based on their respective physicochemical properties (pharmacophores) and converted as a feature vector used as evidence to train and test predictive methods. Complementary physicochemical properties are calculated and included using the RDKit cheminformatics library [42] and included in the feature vector.

Model selection and validation

Different supervised learning algorithms available on the scikit-learn Python library [43] were assessed with best performing models (Random Forest, n_estimators = 300 for herbicide activity). Information on learning algorithms and hyperparameters for remaining models is available as Table S1 and selected based on accuracy, Matthew’s correlation coefficient (MCC) and the area under the Receiver operating characteristic curve (ROC curve) (AUC) for classification tasks and Pearson’s correlation [44] and root mean squared error (RMSE) for regression tasks. Hyperparameter tuning was performed with a grid search method implemented in scikit-learn; however, no significant improvement in performance was observed. Performance was assessed under stratified 10-fold cross-validation (obtained using scikit-learn Python library) [29, 30, 45] as well as using nonredundant blind tests (blind tests were composed of held out sets accounting for 10% of molecules selected at random and used as external validation). A feature selection step was used to reduce dimensionality and improve performance via a stepwise forward greedy selection approach [24, 28–30, 46–48]. Performance of cropCSM on identifying molecules with herbicidal activity. Performance of cropCSM on commercially available herbicides. Our method was able to correctly classify 97% of commercial herbicides (344 out of 356, top-left graph). The figure also shows the proportion of compounds predicted to be environmental or human toxic. Molecules were more frequently predicted as AMES toxic (17%, 60 out of 356) and minnow toxic (20%, 70 out of 356).

Substructure mining

To identify what makes up a herbicide, molecular substructure mining was employed to identify substructures that were enriched in the herbicidal class and depleted in the nonherbicidal class. For this purpose, the molecular substructure miner (MoSS) [49] tool was used, and different minimum support/frequency cutoffs tested. We considered 5- and 10-fold differences between support of positive and negative classes (1%, 5%, 1–10% and 2–10%).

Data visualization

t-distributed stochastic neighbor embedding (t-SNE) plots were generated using the R package ‘tsne’ with default parameters.

Results

What makes an herbicide? Correlating molecular properties with herbicidal activity

Using experimental information on the herbicidal activity for a collected data set of over 4500 small molecule compounds (22% with herbicidal activity), we investigated what physicochemical properties of the compounds translate to herbicidal activity. Herbicidal molecules were enriched in saturated carbon chains and benzene substructures, compared with the inactive molecules (Figure 2). The majority (90%) of the active compounds tended to be < 517 Da, have up to 9 acceptors and 4 donors, fewer than 9 rotatable bonds and a logP between −1.7 and 6.1 (Figure S1) (95% <700 Da, 11 rotatable bonds, 11 acceptors, 6 donors and logP between −3.0 to 6.1). This is similar, although slightly more lenient, than the widely used Lipinski’s Rule of Five for orally bioavailable drugs. Interestingly, but consistently, there was no significant distinction in physicochemical properties between herbicides and approved drugs, as illustrated in the t-SNE plot (Figure S2). Compared with all U.S. food and drug administration (FDA)-approved drugs, however, herbicides were enriched in substructures involving chlorine.

Figure 2

cropCSM: predicting safe and potent herbicides. Using chemical substructure mining, we identified common enriched substructures in compounds with herbicidal activity (A-left). Active compounds presented similar molecular properties of approved drugs (A-right). Performance of herbicide and environmental-toxicity predictors is shown in (B). Our herbicide predictor was able to accurately identify active compounds with AUC > 0.85 on cross-validation and blind test. Three environmental toxicity models have been developed and were capable of successfully measuring minnow toxicity (as a regression task, centre graph) as well as identifying potentially harmful compounds for Bees and Mallard (right-hand side graph). Sensitivity refers to the true positive rate, whereas specificity to the true negative rate.

Herbicides have been previously compared with pharmaceuticals [10] with our analysis being consistent with previous results across smaller datasets [50-54], which have shown that physicochemical properties of herbicides are similar to orally delivered drugs, although the former tend to be smaller, with fewer proton donors, lower partition coefficient [53, 54]. Performance on cross-validation of human-toxicity predictors. Three models have been developed and were capable of accurately measuring acute and chronic rat toxicity (as regression tasks, left and centre graphs) as well as identifying potentially carcinogenic compounds (Ames toxicity as a classification task, right-hand side graph). For regression models, two Pearson’s correlation coefficients are shown: one obtained for the whole data set (in black) and one on 90% of the data, after 10% of outliers or poorly predicted points are removed (in red) for analysis purposes to assess the potential effects of outliers on performance (depicted as red dots). Performance of cropCSM on identifying environmentally toxic compounds as a regression task *Denotes a statistically significant performance difference obtained via a Fisher r − to−z transformation, by calculating the z value, using a threshold of P ≤ 0.05 for significance. The table presents performance assessed on cross-validation. Performance of cropCSM on identifying environmentally toxic compounds as a classification task The table presents performance assessed on cross-validation. NA: Not available. These insights were used as a platform to build a supervised machine learning predictive model to identify herbicidal compounds, where the small molecule structure was represented as a graph-based signature, termed CSM, in which the atoms are represented as nodes, and covalent interactions between them as edges [38, 45]. Under stratified 10-fold cross-validation, we were able to correctly identify 82% of the active molecules with an overall accuracy of 87%, area under the ROC curve (AUC) of 0.85 and MCC of 0.60 (Figure 2 and Table 1). For compounds over 500 Da, cropCSM achieved an AUC of 0.81, illustrating the robust performance of the approach on larger compounds. A precision-recall curve (PR curve) was also calculated (Figure S3) to investigate the compromises between type I and type II errors. cropCSM achieved a PR AUC of 0.93, demonstrating a well-tuned model, with good performance and balance between type error I and II. When the model was evaluated against a blind test set of 106 active and 345 inactive molecules, we achieved comparable performance (87% accuracy, AUC of 0.87 and MCC of 0.59). This provided confidence that the approach can be generalized and used with unknown sets of putative herbicidal molecules active against a target of interest.

Table 1

Performance of cropCSM on identifying molecules with herbicidal activity.

Data set	Performance metrics
Data set	Accuracy	AUC	MCC
Cross-validation	87%	0.85	0.60
Blind test	87%	0.87	0.59

To further validate the cropCSM models and demonstrate their real-world applicability, they were applied to a set of 356 commercial herbicides [12]. Over 97% were correctly identified as herbicidal (Figure 3). Of those that were not, they included the natural fatty acid oleic acid and fragment-like molecules such as dazomet and pentachlorophenol.

Figure 3

Performance of cropCSM on commercially available herbicides. Our method was able to correctly classify 97% of commercial herbicides (344 out of 356, top-left graph). The figure also shows the proportion of compounds predicted to be environmental or human toxic. Molecules were more frequently predicted as AMES toxic (17%, 60 out of 356) and minnow toxic (20%, 70 out of 356).

cropCSM web server. (A) Depicts the landing page for the resource. By clicking on ‘Prediction’ at the top menu, users are directed to the job submission page (B). There users have the options to either provide a set of molecules as a SMILES file (1) or individual molecules as a SMILES string (2). After clicking on ‘Run prediction’, and once calculations are complete, users are redirected to a results page (C) where predictions for herbicidal activity as well as toxicity profiles are presented in tabular format (3). Users have the options to download the results (4).

Predicting environmental toxicity

Agrochemicals have been linked to a range of unwanted negative effects on both health and the environment, including glyphosate-free herbicides such as benzalkonium chloride [55]. To help identify safe herbicides, complementary models were developed to capture the impact of a small molecule on honey bee (A. mellifera), mallard (Anas platyrhynchos) and flathead minnow (Pimephales promelas) toxicity, in addition to measures of human health, including AMES toxicity, rat LD50 and oral chronic toxicity. Although assessing molecular substructures enriched in toxic compounds (based on these datasets), (Figure S4), we identified a prevalence of complex ring structures. Of note, structures rich in chlorine, while enriched in herbicides, were also enriched in compounds that were toxic for mallard and minnow, highlighting a potential inherent difficulty in optimising potency and safety when designing herbicides. Tables 2 and 3 depict the performance of cropCSM models for identifying toxic molecules as regression and classification tasks, respectively, assessed during cross-validation. Overall, we were also able to identify toxic molecules as classification and regression tasks with accuracies of up to 92% and Pearson’s correlations of up to 0.86, outperforming previous predictive approaches [13, 14, 38, 56], including pkCSM, admetSAR and the works from Wang et al. and Zhang et al. Regarding environmental toxicity models, the minnow toxicity (LC50) model achieved a Pearson’s correlation of 0.86 (RMSE of 0.74), a performance that increases to 0.93 on 90% of the data (after removing 10% outliers – these are only removed for analysis purposes to assess the potential effects of outliers on performance), whereas honeybee and avian toxicity classifiers achieved AUCs of 0.81 and 0.83 (Figure 2). For human health toxicity predictors, cropCSM achieve Pearson’s correlation coefficients of 0.79 and 0.75 for oral rat acute and chronic toxicity, respectively, and an AUC of 0.94 while predicting AMES compounds (Figure 4).

Table 2

Performance of cropCSM on identifying environmentally toxic compounds as a regression task

Environmental Toxicity	Cross-validation
Environmental Toxicity	Pearson’s (r)	RMSE
Minnow toxicity (LC50)
cropCSM	0.86	0.74
pkCSM	0.74^*	0.84
admetSAR	0.57^*	0.67
Oral rat chronic toxicity
cropCSM	0.75	0.56
pkCSM	0.68^*	0.74
admetSAR	0.50^*	0.73
Oral rat acute toxicity (LD50)
cropCSM	0.79	0.62
pkCSM	0.66^*	0.68
admetSAR	0.61^*	0.32

*Denotes a statistically significant performance difference obtained via a Fisher r − to−z transformation, by calculating the z value, using a threshold of P ≤ 0.05 for significance. The table presents performance assessed on cross-validation.

Table 3

Performance of cropCSM on identifying environmentally toxic compounds as a classification task

Environmental toxicity	Cross-validation
Environmental toxicity	Accuracy	AUC	MCC
Honey bee toxicity
cropCSM	87%	0.81	0.65
Wang et al.	84%	0.84	0.60
Avian toxicity
cropCSM	92%	0.83	0.65
Zhang et al.	91%	0.91	NA
AMES toxicity
cropCSM	87%	0.94	0.74
pkCSM	84%	0.91	NA
admetSAR	85%	0.91	NA

The table presents performance assessed on cross-validation. NA: Not available.

Figure 4

Performance on cross-validation of human-toxicity predictors. Three models have been developed and were capable of accurately measuring acute and chronic rat toxicity (as regression tasks, left and centre graphs) as well as identifying potentially carcinogenic compounds (Ames toxicity as a classification task, right-hand side graph). For regression models, two Pearson’s correlation coefficients are shown: one obtained for the whole data set (in black) and one on 90% of the data, after 10% of outliers or poorly predicted points are removed (in red) for analysis purposes to assess the potential effects of outliers on performance (depicted as red dots).

Using these validated methods, we set out to assess the toxicity profiles of currently commercially available herbicides. The list of 356 commercial herbicides was submitted to the predictors as an independent test. Figure 3 depicts a large proportion of compounds predicted to be environmental or human toxic, with molecules more frequently predicted as either AMES toxic (17%, 60 out of 356) or minnow toxic (20%, 70 out of 356). Interestingly, widely used herbicides glyphosate and glufosinate were predicted as nontoxic to humans. This is interesting, as while glyphosates have been proposed to be carcinogenic, there is growing support that it is the inert products that are responsible for damage to human beings and the environment [57], which would be consistent with the predictions from cropCSM. When grouping commercial herbicides based on their mode of action, a significant proportion of minnow toxic compounds belonged to the class of inhibitors of acetyl CoA carboxylase (14/24, 58% versus 20% overall), AMES toxic compounds were enriched in the class of Inhibitors of cell division (10/28, 36% versus 17% overall) and compounds involved in Rat Chronic toxicity were more frequent in the class of inhibitors of lipid synthesis (7/19, 37% versus 14% overall). These results add credence to the tool to rapidly identify potentially hazardous molecules early in the development process, which has the potential to significantly reduce costs and failure rates.

The cropCSM web server

The backend of the cropCSM web server was developed using the Python Flask framework version 0.12.3 and the front end using Bootstrap framework version 3.3.7. The system is hosted by a Linux server running Apache. Figure 5 depicts the cropCSM web interface. Users can submit molecules to the server either individually as SMILES strings or in batch as SMILES files (Figure 5). Prediction results are displayed in tabular format (Figure 5), which are also made available to download as a comma-separated file.

Figure 5

Conclusions

Here we described cropCSM, the first free and easy-to-use in silico platform dedicated to assist the development of herbicides that are potent, but also nontoxic and environmentally safe. cropCSM is optimally designed to evaluate the herbicide potential and toxicity of small molecular compounds up to 2 kDa, consistent with the training and validation data sets. We employed graph-based signatures to model small molecule physicochemistry of the largest collected data set of molecules with experimentally characterised herbicidal profiles to date and demonstrated their efficacy. We anticipate future iterations of cropCSM that will draw upon larger datasets and as a result will have a higher predictor capability, allowing for a greater increase in accuracy and correlation. We believe cropCSM will be an invaluable tool to assist rational design of new potent herbicidal molecules that are also safe for humans and the environment. The tool has been made available as a user-friendly web interface at https://biosig.unimelb.edu.au/crop_csm. Although very little has been done to assist the designing of potent and safe herbicides computationally, cropCSM fills this gap. cropCSM can accurately identify small molecules with potential herbicidal activity as well as characterise their toxicity and safety profiles. A web server conveniently provides cropCSM’s models in a free and easy-to-use interface. Click here for additional data file.

46 in total

1. Diagnosis of Chronic Kidney Disease Based on Support Vector Machine by Feature Selection Methods.

Authors: Huseyin Polat; Homay Danaei Mehr; Aydin Cetin
Journal: J Med Syst Date: 2017-02-27 Impact factor: 4.460

Review 2. Are herbicides a once in a century method of weed control?

Authors: Adam S Davis; George B Frisvold
Journal: Pest Manag Sci Date: 2017-08-17 Impact factor: 4.845

3. A Comparative Study on the Constitutive Properties of Marketed Pesticides.

Authors: Gefei Hao; Qingjian Dong; Guangfu Yang
Journal: Mol Inform Date: 2011-07-05 Impact factor: 3.353

4. mCSM-AB2: guiding rational antibody design using graph-based signatures.

Authors: Yoochan Myung; Carlos H M Rodrigues; David B Ascher; Douglas E V Pires
Journal: Bioinformatics Date: 2020-03-01 Impact factor: 6.937

5. mmCSM-AB: guiding rational antibody engineering through multiple point mutations.

Authors: Yoochan Myung; Douglas E V Pires; David B Ascher
Journal: Nucleic Acids Res Date: 2020-07-02 Impact factor: 16.971

6. Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure.

Authors: Hao Zhu; Todd M Martin; Lin Ye; Alexander Sedykh; Douglas M Young; Alexander Tropsha
Journal: Chem Res Toxicol Date: 2009-12 Impact factor: 3.739

7. Cutoff Scanning Matrix (CSM): structural classification and function prediction by protein inter-residue distance patterns.

Authors: Douglas E V Pires; Raquel C de Melo-Minardi; Marcos A dos Santos; Carlos H da Silveira; Marcelo M Santoro; Wagner Meira
Journal: BMC Genomics Date: 2011-12-22 Impact factor: 3.969