Literature DB >> 33748586

Predictive Global Models of Cruzain Inhibitors with Large Chemical Coverage.

Jose Guadalupe Rosas-Jimenez^1,2, Marco A Garcia-Revilla¹, Abraham Madariaga-Mazon², Karina Martinez-Mayorga².

Abstract

Chagas disease affects 8-11 million people worldwide, most of them living in Latin America. Moreover, migratory phenomena have spread the infection beyond endemic areas. Efforts for the development of new pharmacological therapies are paramount as the pharmacological profile of the two marketed drugs currently available, nifurtimox and benznidazole, needs to be improved. Cruzain, a parasitic cysteine protease, is one of the most attractive biological targets due to its roles in parasite survival and immune evasion. In this work, we compiled and curated a database of diverse cruzain inhibitors previously reported in the literature. From this data set, quantitative structure-activity relationship (QSAR) models for the prediction of their pIC50 values were generated using k-nearest neighbors and random forest algorithms. Local and global models were calculated and compared. The statistical parameters for internal and external validation indicate a significant predictability, with q loo 2 values around 0.66 and 0.61 and external R 2 coefficients of 0.725 and 0.766. The applicability domain is quantitatively defined, according to QSAR good practices, using the leverage and similarity methods. The models described in this work are readily available in a Python script for the discovery of novel cruzain inhibitors.

Entities: Chemical Disease Species

Year: 2021 PMID： 33748586 PMCID： PMC7970485 DOI： 10.1021/acsomega.0c05645

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Chagas disease affects 8–11 million people in 21 Latin American countries; there is an estimation of 70–150 million people at risk of infection.[1,2] Migration phenomena have contributed to the spread of the parasite into nonendemic areas such as the United States, Europe, New Zealand, and Australia.[1] Chagas disease is a vector-borne parasitic infection caused by Trypanosoma cruzi and it is transmitted by the three main genera of triatomine bug, Triatoma, Rhodnius, and Panstrongylus. World Health Organization has recognized this infection as a neglected tropical disease (NTD) because of its persistence in developing countries, being a major economic and social problem in these regions, and one of the main causes of premature death for heart failure.[2−4] It was previously reported that this disease causes an estimated loss of 752,000 working days in southern American countries,[4] which implies an economic burden of about US$1.2 billion in productivity.[4] Globally, this parasitic infection has an estimated annual cost of $627.46 million, and 10% of this affects nonendemic countries.[4] Currently, there are only two approved drugs for the treatment of Chagas disease: nifurtimox (NFX) and benznidazole (BZ). Both NFX and BZ have similar efficacy during the acute phase of infection, with 88–100% of negative parasite detection after treatment with NFX and up to 80% for BZ.[5] However, in the chronic phase, the rate of negative tests for the disease after treatment falls to 7–8%,[5] and there are significant side effects, including anorexia, weight loss, paresthesia, nausea, and vomiting, among others.[3,5] Recent therapeutic research is focused on specific biological targets, including cysteine proteases, enzymes in trypanothione metabolism, enzymes in ergosterol biosynthesis, and the kinetoplastid proteasome.[5] Cruzain is a cathepsin L-like cysteine protease present in all stages of the parasite life cycle. It plays significant roles in the trypanosomal growth, survival, and evasion from the host immune response. Plasma membrane-anchored cruzain degrades the Fc fraction of antibodies, overcoming the classic path of complement activation.[3,6] In the amastigotic intracellular stage, this cysteine protease degrades transcription factors, such as NFkB, and thus prevents the activation of macrophages.[3] Cruzain generates the bloodstream pro-inflammatory peptide Lys-bradykinin, which activates host immune cells, promoting the parasite uptake and spread by phagocytosis.[6] The use of cruzain inhibitors in animal models has shown to be effective in clearing the parasite burden, even in the chronic phase. The vinyl-sulfonic compound known as K777 was one the first proof-of-concept studies about the antitripanosomal activity of cruzain inhibitors in animal models.[7−9] Parasite death induced by cruzain inhibitors is attributed to the accumulation of a peptide precursor in the Golgi complex. Therefore, this in vitro and in vivo evidence has validated cruzain as a potential biological target for Chagas disease.[3,6] A variety of chemotypes for cruzain inhibition have been explored through structure–activity relationship (SAR) analysis, high-throughput screening, and docking methods. The most potent molecules belong to the vinyl-sulfone derivatives, oxadiazoles, nitrile-containing peptidomimetics, and thiosemicarbazones, with a broad range of biological activities among chemical families.[2,10,11] These molecules should be further optimized by increasing their selectivity toward parasite vs human cathepsins, and they should be neutral at physiological pH to avoid concentration in lysosomes and off-target effects.[2] Quantitative structure–activity relationship (QSAR) models mathematically correlate structural properties of molecules with their biological activity. There are two distinctive goals in the practice of QSAR modeling: the use of mathematical tools to describe the trends in the data, providing interpretations that could be useful in the understanding of an inherent mechanism, and the use of these methods to achieve predictions with high accuracy, irrespective of the interpretability of the generated models.[12,13] These mutually complementary approaches are often called “descriptive QSAR” and “predictive QSAR.”[13] The advances in QSAR modeling led to its acceptance as a prediction tool of toxicity endpoints for the risk assessment of new chemical entities[14] or as a preliminary step in drug development to identify compounds with potential toxic or mutagenic profiles.[15] The Organization for Economic Cooperation and Development (OECD) established guidelines for the use of QSAR in regulatory settings, and these principles became gold standards in the general QSAR practice.[14] The OECD principles for the validation of QSAR models for regulatory purposes are (1) a defined endpoint, (2) an unambiguous algorithm, (3) a defined domain of applicability, (4) appropriate measures of goodness of fit, robustness, and predictability, and (5) a mechanistic interpretation, if possible. For regulatory purposes, the prediction of a toxicity endpoint must have a high degree of reliability. Importantly, the external validation criteria must be strict regarding the numerical precision of the model since this is a crucial step in decision-making about risk assessment of new compounds.[12,16] In drug discovery, QSAR modeling is a valuable tool for the prioritization of possible hits for experimental validation and to explain the relationships between structural modification and biological activity from a mechanistic basis.[17] Interestingly, QSAR models have often been used to guide the synthesis of new molecules;[18] thus, the descriptive approach is predominant. More recent studies focus on the use of QSAR for virtual screening (VS), and this application has been successful in finding novel chemotypes against important drug targets in diseases such as malaria,[18,19] schistosomiasis,[18,20] tuberculosis,[18,21] cancer,[22,23] and inflammation, among others.[22] Notably, despite the exponential growth in the development of deep learning (DL) algorithms and their applications in many areas such as image and voice recognition, most of the successful QSAR case studies still use classical machine learning algorithms like multiple linear regression, partial least squares, k-nearest neighbors, support vector machines, random forest, and even shallow neural networks. It is a matter of debate if, in the field of QSAR modeling, the advanced DL algorithms offer a better performance over the classical approaches. Several published reviews find that DL models do not have significant improvements over simpler models.[17,24,25] One of the main reasons for this behavior is that DL algorithms require high amounts of data,[24,25] which is feasible in many areas where these methods have been applied, but in drug research, the data is often “limited, expensive, and resource-intensive”.[17] However, advanced DL algorithms still offer advantages for a variety of purposes like modeling multiple endpoints, for the generation of novel chemical features or in inverse QSAR, where structures can be generated from the model.[17] QSAR modeling has also been used in the study and design of cruzain inhibitors. A summary of some recently published models is presented in Table . Most of these studies are built upon data sets with a single chemical family, producing local models and following the descriptive approach. CoMFA and CoMSIA analyses are widely used in these studies, mainly because the results are fully interpretable, since the interaction maps can easily show which fragments of the molecules are correlated with the biological activity. However, this method requires the 3D coordinates of the molecules and, thus, it is dependent on their conformation and alignment. The procedures required to generate conformations and alignments may not be suitable if the model is intended to be used in virtual screening. In turn, 2D QSAR only requires the 2D structure of the molecules. The generation of 2D descriptors is usually fast and easy; nonetheless, their interpretation is often difficult. The main purpose of this work is to generate QSAR predictive models of structurally diverse cruzain inhibitors. Models calculated by means of machine learning algorithms describe the behavior of biological activity in terms of the molecular descriptor space. From the trends identified, the effects of structural modifications on cruzain inhibition become predictable, making QSAR models a useful tool in the search and rational design of cruzain inhibitors.

Table 1

Summary of QSAR Models of Cruzain Inhibitorsa

data set	algorithm	validation summary	reference
27 benzimidazoles	HQSAR (PLS)	q² = 0.77, R² = 0.65	(26)
	CoMFA	q² = 0.71, R² = 0.94
	CoMSIA	q² = 0.75, R² = 0.82
41 peptides	HQSAR (PLS)	q² = 0.77, R² = 0.88	(27)
55 thiosemicarbazones and semicarbazones	CoMFA	q² = 0.78, R² = 0.81	(28)
55 thiosemicarbazones and semicarbazones	CoMSIA	q² = 0.73, R² = 0.79	(28)
57 dipeptidyl nitriles	HQSAR (PLS)	q² = 0.70, R² = 0.62	(29)
61 semicarbazones	MLR	q² = 0.801, R² = 0.906	(30)
32 triazine nitriles	CoMFA	q² = 0.736, R² = 0.762	(31)
32 triazine nitriles	CoMSIA	q² = 0.627, R² = 0.806	(31)
41 ketones	HQSAR (PLS)	q² = 0.794, R² = 0.954	(32)
55 thiosemicarbazones and semicarbazones	HQSAR (PLS)	q² = 0.75, R² = 0.95	(33)
55 thiosemicarbazones and semicarbazones	2D QSAR (PLS)	q² = 0.72, R² = 0.83	(33)
46 ketones	BRANN	q² = 0.749	(34)

q2, coefficient of determination for leave-one-out cross-validation; R2, coefficient of determination of external set; PLS, partial least squares; MLR, multiple linear regression; BRANN, Bayesian regularized artificial neural networks.

Computational Methods

Data Compilation and Curation

Cruzain inhibitors were collected from the ChEMBL database, searching by the molecular target using the keyword cruzain. Molecules annotated with IC50 values were retrieved as the initial data set. The uncurated database has a total of 840 inhibitors. From this set, 118 molecules with activity missing values and 211 that could not be determined (reported with a relation of “>”) were excluded. Since the modeling approach was 2D QSAR, the presence of molecules with chiral atoms was verified in the original sources. In most of the cases, the activity data reported for those compounds does not specify the activity for each enantiomer. The structures and biological activities were kept as stated in the original publications, and a single case where both enantiomers were reported with a significant difference in activity was excluded from the database. For the removal of duplicates, the original papers were consulted and the selection was based on the experimental protocol. A strong criterion for selection was the inclusion of detergent in the assay because colloidal aggregation is one of the main causes of false positives in exploratory and high-throughput screens.[2,35] Moreover, after reviewing the original publications, molecules reported in refs (36) and (37) were excluded because they are classified as aggregators. The main focus of these studies is the search of aggregation, autofluorescence, and reactivity artifacts in screening assays using cruzain as the biological target. After this filter, the final data set for modeling consisted of 344 inhibitors. The IC50 values were converted to pIC50, and this was used as a dependent variable. Lastly, the structures and activities were compared to those in the original sources[36−72] and discrepancies were fixed. The molecular structures, originally retrieved as SMILES strings, were converted to their 2D representation, and tautomers and protonation states at pH 7.0 were assigned using ChemAxon. The curated database is available as cruzain_dataset.xlsx in the Supporting Information. The molecules in the final database were classified based on chemical families. The set was divided by the assigned molecular types, and those with at least 20 compounds were used to build local models. The structural fragments of the molecules in the local sets are presented in Figure . The total data set was also used to build global models, and their performances were compared.

Figure 1

Functional groups used for the classification of the molecules into chemical families for the development of local models. The amide group includes peptidic and nonpeptidic inhibitors with a central amide group.

QSAR Modeling

Descriptor Calculation and Feature Selection

Molecules in the database were loaded and standardized using the RDKit package in Python.[73] 2D molecular descriptors were calculated in the Mordred package for Python.[74] This library contains 1613 1D and 2D descriptors, including atomic counters, topological indices, adjacency matrix-derived values, autocorrelation weighted by atomic properties, subdivided van der Waals surface areas, and physicochemical properties including logP and polarizabilities.[74] For the local models, the data was used separately by family type. Each of these sets was randomly split into training, validation, and test sets. Training and validation sets were used together in feature selection and hyperparameter optimization of the machine learning algorithms, whereas test sets were reserved for the final evaluation. Table shows the number of molecules in the partition for each group.

Table 2

Number of Molecules in Each Family Type for the Generation of Local Models

group	training set	validation set	test set
amides	68	17	22
N-acylhydrazones	16	5	6
thiazolylhydrazones	20	6	7
thiosemicarbazones	49	13	16
triazine nitriles	22	6	7

The initial set with all the molecules was used for the generation of global models. The training sets for the local models were merged to build the global training set, and the same was done for the test sets. Molecules not used in the local models (those with less than 20 molecules per family) were included in the generation of global models. The final partition contains a training set with 223 molecules, a validation set of 53 molecules, and an external test set of 70 inhibitors. Descriptors were scaled to the [0,1] range using eq , where X′ is the scaled descriptor, X is the minimum value, and X is the maximum value. Molecular descriptors in the test and validation groups were scaled according to the training set. Feature selection was first performed by discarding those descriptors with zero variance and with less than three different values (semiconstant variables). After excluding highly correlated descriptors, i.e., those with a correlation coefficient higher than 0.9, the final set contained 558 variables. The selection of the best subset of features was made using a combination of genetic algorithm (GA) and three different machine learning algorithms: multiple linear regression (MLR), k-nearest neighbors (KNN), and random forest (RF). A simple genetic algorithm was built using DEAP in Python,[75] with a crossover probability of 0.6, an individual probability of mutation of 0.3, and a gene mutation probability of 0.02,[76,77] where an individual refers to a candidate model and each gene refers the presence or absence of a certain descriptor in the model. A population of 150 individuals was evolved during 10,000 generations and the best 20 observed individuals were saved for further selection. The multi-objective function was the 10-fold cross-validated q2 together with the R2 value for the validation set, with a penalization for models with more than 10 variables. The GA was performed five times for each regression algorithm, generating a total of 300 models. The machine learning suite used for modeling was the scikit-learn library for Python.[78] In the GA runs, the hyperparameters for the KNN algorithm were the default values,[79] with a weighting scheme based on the inverse of the distance: closer neighbors have a greater influence than distant neighbors.[79] The number of estimators of RF was reduced to 10 for performance reasons, and the values for the remaining hyperparameters were left as defaults.[80] The number of nearest neighbors for KNN and the number of estimators in the RF were further optimized using cross-validation with the GridSearchCV tool of scikit-learn.[81]

Model Validation

After feature selection, the internal performance and stability of the model were assessed with leave-one-out cross-validation, and the coefficient of determination q2 was reported. The y-scrambling method was used to verify the absence of chance correlation. The order of the pIC50 values was randomized 100 times and the models were recalculated for each new independent variable. The q2 coefficients of the randomized models were evaluated. Since a high value of qloo2 is not related to a good predictability,[82] the external evaluation was performed using the selected models to predict the pIC50 of the molecules in the test set, which were never used in model generation. The coefficient of determination for the test set, Rext2, is often used as a measure of external predictability. However, this coefficient is a measure of the fit for the experimental and predicted values to a straight line and this may not be the ideal identity relation.[82] One of the best practices is to compute several statistics for the external validation, compare the diagnostics between them, and in this way, achieve more confident conclusions.[83,84] Thus, we evaluated the root-mean-squared error (RMSE) for the external prediction, the concordance correlation coefficient (CCC), and the QF2 family of parameters.[83,84]Equations –7 were used to calculate the external validation criteria. In the above equations, y is the experimental pIC50 of molecule i, ŷ is the predicted activity for molecule i, y̅TR and y̅EXT are the means of pIC50 values for the training and test sets, respectively, and nTR and nEXT are the numbers of molecules in the training and external test sets, respectively. In the CCC formula, x refers to the experimental pIC50 value for the i molecule, y is the predicted activity for molecule i, and x and y̅ are the mean values for the respective experimental and predicted activities.

Applicability Domain

The predictability of a QSAR model is framed by the nature of the molecules in the training set. The applicability domain is the quantitative delimitation of the descriptor and activity space where predictions are reliable. In this work, the applicability domain was defined using the leverage method.[85] Leverage values, h, are computed using eq , where X is the descriptor matrix of the training set and x is the descriptor vector for a query molecule. Basically, leverage values are proportional to the distance of the molecule from the centroid of the training set. Thus, compounds above a threshold are far from the explored descriptor–activity space and, therefore, their predicted biological activity will be unreliable. Typically, the threshold, hmax, is computed with eq , where p is the number of features and n is the number of molecules in the training set. Leverage and limit values were computed in Python, and the results are presented in a Williams plot. In this representation, molecules with high leverages or large residuals can be easily detected for further examination. Along with the leverage method, molecular similarity was used as a criterion for belonging to applicability domain. The molecular fingerprints using the public MACCSKeys implementation in RDKit were calculated for every molecule in the training and test sets. Then, a similarity matrix of the query test molecules against the training set was computed using the Tanimoto index. The highest similarity value of each molecule is presented together with the corresponding leverage score. The methodology for QSAR modeling is summarized in Figure .

Figure 2

Flowchart with the summary of the methods for QSAR model generation. MLR, multiple linear regression; KNN, k-nearest neighbor regression; RF, random forest regression; GA, genetic algorithm.

Results and Discussion

This work consists of two main parts, the development of global and local models. For the global models, we prepared and analyzed a diverse set of cruzain inhibitors annotated with pIC50 values. The distributions of the biological activity values of the training and test sets are shown in Figure . The pIC50 values range from 3.48 to 10.0 units, from nanomolar to micromolar scale. It is noteworthy that the biological activity values of the test set lie within those of the training set as shown in the histogram of Figure , there are no gaps within bins, and activity outliers are not present, following the general recommendations in QSAR modeling.[12,22] The functional groups of the subsets divided by the chemical family resemble those in the previously reported local models shown in Table . Moreover, current efforts in the search for cruzain inhibitors focus on the design and optimization of particular chemotypes, including imidazoles and benzimidazoles,[9,86,87]N-acyldhydrazones,[88] imides,[89] vinyl peptidomimetics,[2,11,90] oxadiazoles,[2,11,91] and triazoles,[10] among others. Molecules that belong to these chemical groups are part of our curated modeling database. Figure shows selected compounds from our data set. Therefore, the chemical space defined by these molecules resembles the current knowledge about cruzain inhibitors.

Figure 3

Figure 4

Selected molecules from the modeling data set.

Distribution of pIC50 values for cruzain inhibitors. Molecules in the training set are shown in dark gray, and molecules in the test set are shown in light gray. The inhibitory potency of the test set falls within the interval of pIC50 values of the training set. Selected molecules from the modeling data set. The performance of the global models is summarized in Table . This table shows the top 5 models obtained after the evolution of the GA for each of the machine learning approaches. The number of variables selected for most of the models is 9; thus, the ratio of molecules per descriptor is around 24. This ratio is critical for the correct generalization of trends between the chemical structure and biological activity and to avoid overfitting. Except for the MLR algorithm, the selected feature subsets have determination coefficients between experimental and predicted values for the molecules in the test set above 0.7, which is a generally accepted value for external predictability. The best models were obtained using the KNN and RF algorithms, and they will be discussed further in this paper.

Table 3

Results of the Top 5 Models after GA Feature Selection for Every Machine Learning Algorithma

algorithm	n_d	q_loo²	R_ext²	RMSE	CCC	Q_F1²	Q_F2²	Q_F3²
MLR	9	0.604	0.506	0.906	0.701	0.506	0.506	0.534
MLR	9	0.604	0.496	0.924	0.688	0.496	0.496	0.525
MLR	9	0.608	0.495	0.925	0.689	0.496	0.495	0.524
MLR	9	0.601	0.491	0.934	0.685	0.491	0.491	0.520
MLR	9	0.601	0.490	0.935	0.683	0.490	0.490	0.519
KNN	9	0.664	0.725	0.504	0.842	0.725	0.725	0.741
KNN	9	0.649	0.705	0.542	0.826	0.705	0.705	0.722
KNN	9	0.665	0.704	0.542	0.821	0.704	0.704	0.721
KNN	9	0.674	0.703	0.544	0.830	0.703	0.703	0.720
KNN	9	0.661	0.701	0.549	0.822	0.701	0.701	0.718
RF	9	0.608	0.766	0.445	0.841	0.757	0.757	0.771
RF	9	0.687	0.741	0.474	0.842	0.741	0.741	0.756
RF	9	0.604	0.730	0.494	0.821	0.730	0.730	0.745
RF	9	0.656	0.698	0.554	0.821	0.698	0.698	0.715
RF	8	0.597	0.716	0.520	0.813	0.716	0.716	0.733

MLR, multiple linear regression; KNN, k-nearest neighbor regression; RF, random forest regression; nd, number of descriptors in the model; qloo2, coefficient of determination for the leave-one-out cross-validation; Rext2, coefficient of determination for the external data set; RMSE, root-mean-squared error for the external set; CCC, concordance correlation coefficient; QF2, Q2 family of external validation parameters (see Section ). The results of the best models using the local data sets for MLR and KNN algorithms are presented in Tables and 5, respectively. The external coefficients of determination are higher for some of the groups in comparison with the global models, and the best values are around the same magnitude of the previously reported models. The amide and N-acylhydrazone groups are satisfactorily modeled even with the MLR approach. According to the similarity-property principle, on the basis of the classical QSAR analysis, gradual changes in structure lead to gradual changes in activity.[17] In a database of congeneric compounds, the variation of chemical structures tends to be moderate, generating a continuous structure–activity space, which is often capable of being modeled with linear methods. Increasing the molecular diversity in a set of molecules also increases the complexity of the modeled property. Changes in structure are more abrupt and multiple mechanisms involved in activity tend to coexist, making the linearity between the structural modifications and activity not hold.[13,17] This is clearly shown in the comparison between Tables , 4, and 5, where the MLR algorithm was able to predict the activity values in the external set for some of the local models, but in the global set, this was not the case.

Table 4

Results for the Best MLR Models for Each Chemical Groupa

group	n_d	q_loo²	R_ext²	RMSE	CCC	Q_F1²	Q_F2²	Q_F3²
amides	4	0.784	0.805	0.371	0.885	0.805	0.805	0.855
N-acylhydrazones	4	0.617	0.780	0.307	0.850	0.831	0.780	0.682
thiazolylhydrazones	4	0.588	0.555	0.293	0.755	0.622	0.555	0.729
thiosemicarbazones	4	0.532	0.760	0.231	0.854	0.767	0.760	0.764
triazine nitriles	4	0.593	0.571	0.776	0.685	0.583	0.571	0.363

Table 5

Results for the Best KNN Models for Each Chemical Groupa

group	n_d	q_loo²	R_ext²	RMSE	CCC	Q_F1²	Q_F2²	Q_F3²
amides	4	0.822	0.809	0.363	0.892	0.809	0.809	0.858
N-acylhydrazines	4	0.378	0.834	0.231	0.891	0.873	0.834	0.761
thiazolylhydrazines	4	0.547	0.918	0.054	0.961	0.931	0.918	0.950
thiosemicarbazones	4	0.653	0.725	0.265	0.814	0.732	0.725	0.730
triazine nitriles	2	0.724	0.879	0.220	0.924	0.882	0.879	0.820

nd, number of descriptors in the model; qloo2, coefficient of determination for the leave-one-out cross-validation; Rext2, coefficient of determination for the external data set; RMSE, root-mean-squared error for the external set; CCC, concordance correlation coefficient; QF2, Q2 family of external validation parameters (see Section ). nd, number of descriptors in the model; qloo2, coefficient of determination for the leave-one-out cross-validation; Rext2, coefficient of determination for the external data set; RMSE, root-mean-squared error for the external set; CCC, concordance correlation coefficient; QF2, Q2 family of external validation parameters (see Section ). Consensus diversity plots (CDPs),[92] cyclic system retrieval curves (CSRs),[92] and pairwise similarity matrices were calculated to compare chemical diversity between databases. Results of similarity analysis are depicted in Figure . As expected, the complete database has the lowest mean similarity value, and quartile distributions show that 75% of these values are not higher than 0.44. Among local groups, amides have the largest diversity with a mean similarity value of 0.48. The blocks in the matrix suggest that this group could be further divided into subsets, indicating that side chains have a strong influence on the group diversity. However, this set has also the best performances using MLR and KNN algorithms. These results show that sudden changes in structure are accompanied by proportional changes in activity, maintaining linearity within the group. N-Acylhydrazones, thiazolylhydrazones, and thiosemicarbazones have comparable similarity distributions and their model performances are also similar. Lastly, triazine nitriles share the highest similarity between other molecules in their group, with a mean value of 0.64. It is notable that this model, using the KNN algorithm, has a good performance, comparable to the amide model, with only two descriptors.

Figure 5

Pairwise Tanimoto similarity matrices for compounds in each global and local data set based on MACCSKeys fingerprints.

Pairwise Tanimoto similarity matrices for compounds in each global and local data set based on MACCSKeys fingerprints. Figure shows the CDP and CSR results for the global and local sets. It is interesting that the area under the CSR curve for the complete set is higher than the area for amides and N-acylhydrazones. The CSR curve is calculated from a scaffold decomposition of the molecules in the database. Then, the fraction of scaffolds is plotted on the X-axis and the fraction of compounds that contains those scaffolds is plotted on the Y-axis.[92] This result implies that there are more molecules in the complete database that share the same scaffold than those in the amide and N-acylhydrazone sets. In other words, there is an overlap in the scaffolds between local databases. However, this analysis focuses on the core structure of the molecules, ignoring side chains or substituents. In the CDP depicted in Figure , the complete database lies in an area with the highest fingerprint-based diversity but with middle scaffold-based diversity. Therefore, the main differences between the molecules in the complete database come from the side chains.

Figure 6

(A) Cyclic system retrieval curve for the global and local models. (B) Consensus diversity plot comparing the global (complete) and local data sets. Marker size is proportional to the number of molecules in each database. Applicability domain is a concept as important as the external validation in the QSAR modeling practice. The reliability of a prediction will also depend on the extent a new molecule is near the set used for the generation of the model. In this context, local models tend to have more restrictive applicability domain and their predictability will be framed by the same mechanism of action of those molecules in the training set.[12] The complete data includes cruzain inhibitors comprising several of the chemical groups reported in the literature to afford a more general model. Global models are more appropriate than local ones in virtual screening because their wider applicability domain may exert a larger coverage of the chemical space in a diverse database.[12] However, compared with larger databases, the chemical space of the model can still be very narrow. Since there is no QSAR model that can be applied universally, any QSAR-based virtual screening round must have a quantitative measure to identify which molecules are within the applicability domain. The trends in structure–activity relationships are only valid in the region of the chemical space covered by the molecules in the training set. The KNN and RF global models with the best statistical parameters will be discussed in more detail, including the limitations imposed by their applicability domain. Molecular descriptors codify the structural information relevant to the activity. Although some descriptor definitions are rather complex and their interpretation is difficult, a closer look into these features may provide insights into the qualitative structure–activity relationships. Molecular descriptor definitions for KNN and RF models are presented in Tables and 7, respectively. Scatter plots showing the intercorrelation between descriptors and with the biological activity are presented in Figures S1 and S2 of the Supporting Information. Most of the molecular descriptors used by the KNN models, depicted in Figure S1, are related to atomic partial charges and their topological distribution. These properties are relevant in the formation of intermolecular interactions that could describe the binding of inhibitors in the cruzain active site. In turn, descriptors in the random forest regression include counters of atom types along with partial charges and logP contributions. The matrix of scatter plots in Figure S2 shows that counters are able to divide the molecules into groups with different ranges of activity values. For example, high counts of acidic groups correlate with low values of pIC50 (but the opposite is not true), and a high number of double-bonded nitrogen atoms gives a narrow range of activity in the middle potency region. It is also interesting to note that the descriptor BCUTd-1l separates the set in two groups of high and low potency inhibitors. A closer look into the results of this descriptor reveals that molecules with the lowest values belong to the class of peptidyl nitriles, which are also one of the most potent chemical families. The eigenvalues of the Burden matrix are recognized as a molecular descriptor with high discrimination power,[93] as shown in this observation.

Table 6

Definition of Molecular Descriptors Used by the KNN Modela

descriptor name	definition
MATS5z	Moran coefficient of lag 5 weighted by atomic number
GATS3c	Geary coefficient of lag 3 weighted by Gasteiger charge
GATS8s	Geary coefficient of lag 8 weighted by intrinsic state
BCUTc-1h	first highest eigenvalue of Burden matrix weighted by Gasteiger charge
NsssCH	number of sssCH atoms
CIC0	0-ordered complementary information content
PEOE_VSA4	sum of VSA for atoms with Gasteiger charge in [−0.20,–0.15)
JGI4	4-ordered mean topological charge
JGI8	8-ordered mean topological charge

sssCH, carbon atoms single-bonded to three heavy atoms; VSA, van der Waals surface area.

Table 7

Definition of Molecular Descriptors Used by the RF Modela

descriptor name	definition
nAcid	acidic group count
nF	number of F atoms
AATSC1c	averaged and centered Moreau–Broto autocorrelation of lag 1 weighted by Gasteiger charge
AATSC1dv	averaged and centered Moreau–Broto autocorrelation of lag 1 weighted by valence electrons
BCUTd-1l	first lowest eigenvalue of Burden matrix weighted by sigma electrons
NdsN	number of dsN atoms
NdO	number of dO atoms
PEOE_VSA5	sum of VSA for atoms with Gasteiger charge in [−0.15,–0.10)
SlogP_VSA2	sum of VSA for atoms with SlogP atomic contribution in [−0.20,–0.15)

dsN, nitrogen atoms with a single bond and a double bond to other heavy atoms; dO, oxygen atoms with a double bond; SlogP refers to the atomic contribution to calculate the logP using the Wildman and Crippen algorithm.[94]

sssCH, carbon atoms single-bonded to three heavy atoms; VSA, van der Waals surface area. dsN, nitrogen atoms with a single bond and a double bond to other heavy atoms; dO, oxygen atoms with a double bond; SlogP refers to the atomic contribution to calculate the logP using the Wildman and Crippen algorithm.[94] The regression plots of experimental versus predicted values are depicted in Figure for both models. From this figure, it is clear that the models are correctly describing the trends in activity of cruzain inhibitors, even in the test set. The external validation parameters for these models, presented in Table , are above the generally accepted thresholds, except for the concordance correlation coefficient. The selected models have CCC values of 0.842 and 0.845, slightly below the proposed limit of 0.85.[16] Notably, these models are able to capture the hierarchical relationships of the inhibitors regarding their biological activity. In descriptive QSAR modeling, identification of trends in the data is useful for the prioritization or identification of molecules with desirable properties, even if the predictive value is not very accurate.[84] Although previously reported models have similar or better performances, our data set comprises a wider chemical diversity.

Figure 7

Regression plots of the best generated models. The first plot shows the predicted values by the KNN model for the training and test sets, and the plot below shows the results for the RF forest model. Qloo2 and Rext2 are presented inside the graphs. The analysis of residuals is presented along with the discussion of applicability domain. A new molecule can be reliably predicted only if its structural features resemble those of the molecules used to calculate the model. This is because, outside the explored chemical space, the structure–activity landscape may be unpredictable. This must be taken into consideration if the model is going to be used for the search of new molecules from databases. The residual histograms for the discussed models are presented in Figure . The external test residuals are distributed in a range similar to those of the training set. For both models, residuals follow a nearly normal distribution. Although, for nonlinear and nonparametric methods, the normal distribution of residuals is not mandatory, it is useful to make inferences about the prediction error and to identify biases that could indicate the presence of systematic errors. In this case, all the residuals are approximately centered around 0, which is the expected value for the prediction error, and distribute almost symmetrically above and below this value.

Figure 8

Histograms of the distribution of model residuals. The upper plots show the residuals for the KNN model, and the lower histograms plot these results for the RF model.

Histograms of the distribution of model residuals. The upper plots show the residuals for the KNN model, and the lower histograms plot these results for the RF model. The leverage values are proportional to the Mahalanobis distance of the molecules to the center of the group.[85] This distance metric is used in multidimensional spaces of random variables to detect the presence of outliers, and it is particularly useful in spaces where the feature space is not orthogonal. High leverage values are related to molecules with structural features far from the general trends in the data. The Williams plots for the considered models are shown in Figure . This graph represents the leverage values versus the standardized residual; thus, it gives a visual representation of both the structural and activity domains. For both models, most of the molecules in the test set are distributed along the space defined by the training molecules. Vertical lines in the plots represent the calculated limit according to eq . The KNN group has four molecules in the training set slightly after the limit, and the RF group has one molecule with a very high leverage in comparison with the others. This molecule is identified as CHEMBL409024 and its structure is depicted in Figure . It has relatively extreme values of the AATSC1c, AATSC1dv, and BCUTd-1l descriptors in comparison with the general trends of the rest of the compounds in the set. This inhibitor is a nonstandard amino acid with two aromatic rings and a phosphate group. The presence of the phosphate group is a unique feature of this compound and may be the cause of the difference with the rest of the molecules. However, deletion of this molecule did not modify significantly the performance of the model.

Figure 9

Williams plots for the calculated models. In the upper graph, leverages and residuals are shown for the KNN model, whereas the lower graph shows the results for the RF model.

Figure 10

Structure of CHEMBL409024. This molecule has a high leverage in the RF group.

Williams plots for the calculated models. In the upper graph, leverages and residuals are shown for the KNN model, whereas the lower graph shows the results for the RF model. Structure of CHEMBL409024. This molecule has a high leverage in the RF group. The leverage method for applicability domain definition is based on the values of descriptors. However, these features may not capture some effects related to out-of-target phenomena, like the potential of a molecule to cause aggregation. For this reason, an analysis based on molecular similarity was also assessed. Fingerprints, using MACCSKeys implementation of RDKit, were used to compute Tanimoto similarity values between the molecules of the test set and those in the training set. The maximum similarity values for each compound in the test set are plotted against leverage values in Figure . The graphs show that most molecules have at least one congener whose similarity value is greater than 0.7. In general, molecules with low leverage values tend to have higher similarities with the training set. Thus, predictions made by the models are considered reliable if the predicted molecules have leverages and maximum similarity values inside the box depicted in Figure .

Figure 11

Maximum Tanimoto similarity values for each molecule in the test set. MACCSKeys were used to generate molecular fingerprints. Similarities are plotted against leverage values.

Maximum Tanimoto similarity values for each molecule in the test set. MACCSKeys were used to generate molecular fingerprints. Similarities are plotted against leverage values. The analysis of the possibility of chance correlation was tested using the Y-randomization method. The results of the q2 values for 100 randomized models are presented in Figure . The scatter plot shows the q2 values for each iteration followed by the value of the nonrandomized model. Histograms represent the distribution of the calculated q2 coefficients together with the values of the true model. This figure reveals that for the randomized models, the determination coefficient falls to unacceptable levels. Moreover, the distribution of the randomized coefficients does not overlap with the values for the true models. This behavior indicates that the probability of chance correlation is very low. Although the predictability of the models is not outstanding, the relationships of molecular features captured by the descriptors with the biological activity are significant since the randomization of pIC50 values alters considerably the stability and performance of the models.

Figure 12

Results of the q2 values for randomized models in comparison with the q2 of the true model.

Results of the q2 values for randomized models in comparison with the q2 of the true model. The statistical parameters of internal and external validation for the selected models indicate a satisfactory performance. The models can be used to predict the pIC50 on cruzain in the search or prioritization of molecules to be tested as antitripanosomal agents. For this purpose, a Python script is provided in the file cruzain_qsar_models.zip of the Supporting Information, together with the trained models. A detailed explanation on the use of the script is provided in section S3 of the Supporting Information and in the ACS LiveSlides. Briefly, the program takes as input an sdf file with the 2D structures of the query molecules, internally calculates the molecular descriptors, and predicts the pIC50 values using the RF and KNN models. It also computes the leverage and Tanimoto similarity values with respect to the training set to test if the molecules are within the applicability domain of the models. Results are saved in a csv file, and details on its contents and interpretation can be found in Table S1 of the Supporting Information. The Python script provided in this work is intended for scientists working on research and development of anti-Chagas agents. The availability of the training set and descriptors used, the use of open-source software, and the easiness for nonexperts make this tool readily used for a broad scientific audience. As a result, the models will be useful in virtual screening campaigns that, in combination with molecular modeling studies, such as docking, will help with the design and prioritization of experimental studies.

Conclusions

Quantitative structure–activity relationship models were developed for the calculation of pIC50 values of cruzain inhibitors using multiple machine learning algorithms. The statistical parameters describing the performance of the best selected models agree with the general recommendations for QSAR modeling. The data set used in the model includes several of the chemotypes already explored in the literature as well as those not previously modeled, comprising a wider chemical space than local models. External validation results show that the models are able to reproduce the trends and hierarchical relations of the experimental pIC50 values. The reliability of predictions is framed by the applicability domain of the model, which is quantitatively defined by descriptor and molecular similarity methods. Therefore, the OECD principles for QSAR practices and applications are fulfilled. The generated models reproduce the biological activity values, indicating that structural trends are well captured by nonlinear correlations in terms of molecular descriptors. Using these relationships, a molecule can be predicted as a cruzain inhibitor based solely on its chemical structure. This is useful for the prioritization of molecules to be tested experimentally from a database. The models can also be used to ascertain structural modifications likely to improve or decrease cruzain inhibitory activity, if the change in pIC50 is higher than the expected error of prediction. The calculated models are made publicly available and its use could guide the search, development, and rational design of cruzain inhibitors as possible pharmacological treatment of Chagas disease.

2 in total

1. Chemoinformatics Studies on a Series of Imidazoles as Cruzain Inhibitors.

Authors: Alex R Medeiros; Leonardo L G Ferreira; Mariana L de Souza; Celso de Oliveira Rezende Junior; Rocío Marisol Espinoza-Chávez; Luiz Carlos Dias; Adriano D Andricopulo
Journal: Biomolecules Date: 2021-04-15

Review 2. Computational approaches towards the discovery and optimisation of cruzain inhibitors.

Authors: Viviane Corrêa Santos; Rafaela Salgado Ferreira
Journal: Mem Inst Oswaldo Cruz Date: 2022-03-16 Impact factor: 2.743

2 in total