Distinguishing compounds with anticancer activity by ANN using inductive QSAR descriptors.

Abstract

This article describes a method for classifying drugs as anticancer or non-anticancer using an artificial neural network (ANN). The ANN used in this study is a feed-forward neural network with a standard back-propagation training algorithm. Using 30 'inductive' QSAR descriptors alone, we achieved 84.28% accuracy in separating compounds with and without anticancer activity. For the complete set of 30 inductive QSAR descriptors, the ANN-based method yields a superior model (accuracy = 84.28%, Q(pred) = 74.28%, sensitivity = 0.9285, specificity = 0.7857, Matthews correlation coefficient (MCC) = 0.6998). The method was trained and tested on a non-redundant data set of 380 drugs (122 anticancer and 258 non-anticancer). The resulting ANN-based QSAR model has been extensively validated and has confidently assigned anticancer character to a number of trial anticancer drugs from the literature.

Keywords: anticancer drugs; artificial neural network; inductive QSAR descriptors; non-anticancer drugs

Year: 2008 PMID: 18841240 PMCID: PMC2561164 DOI: 10.6026/97320630002441

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

A number of natural and synthetic products have been found to exhibit anticancer activity against tumor cell lines [1,3]. As a result, the number of candidate anticancer drugs is growing rapidly, and discriminating between anticancer and non-anticancer drugs has become a major challenge in current cancer research. The worldwide pharmaceutical industry is investing in technologies for high-throughput screening (HTS) of such compounds, so in silico techniques for anticancer drug screening are in demand in today's anticancer drug discovery. Using computational tools to discriminate anticancer drugs among lead molecules prior to their chemical synthesis will accelerate drug discovery in the pharmaceutical industry [3]. Early-phase virtual screening and compound library design often employ filtering routines that are based on binary classifiers and are meant to eliminate potentially unwanted molecules from a compound library [4,5]. Currently, two classifier systems are most often used in these applications: PLS-based classifiers [6,7] and various types of artificial neural networks [8,9]. Quantitative structure-activity relationship (QSAR) science uses a broad range of atomic and molecular properties, ranging from purely empirical to those computed ‘ab initio’. The most commonly used QSAR-based methods can include up to thousands of descriptors that are readily computable for extensive molecular datasets. This variety of available descriptors, in combination with numerous powerful statistical and machine learning techniques such as Artificial Neural Networks (ANN), allows biologically active substances to be distinguished from non-active ones [10,11]. Various sets of molecular descriptors are currently available [12], and thus, for anticancer drug/non-drug classification of compounds, the molecules can typically be represented by n-dimensional vectors [10,11].
In the current work, we focused on the ‘inductive’ QSAR descriptors [13] for anticancer/non-anticancer drug classification. These include various local parameters calculated for certain kinds of bound atoms (for instance, the most positively or negatively charged atoms), for groups of atoms (for instance, the substituent with the largest or smallest inductive or steric effect within a molecule), or for the entire molecule. All these descriptors (except the total formal charge) depend on the actual spatial structure of the molecules. These inductive descriptors have found broad application in quantifying the antibacterial activity of synthetic cationic polypeptides [13]. The demand for computational screening methodology is clear in all areas of human therapeutics. However, the field of anticancer drugs has a particular need for computational solutions that enable rapid identification of novel therapeutic leads. QSAR-based classification of anticancer compounds against non-anticancer agents is therefore an important and valuable task for modern QSAR research. The main objective of this study was to develop a scheme for encoding relevant information from molecular structure into a format suitable for use in an ANN, and to develop a QSAR model for the binary classification of anticancer/non-anticancer drugs with predictive capabilities, which so far has been unattainable.

Methodology

Dataset

To investigate whether the inductive QSAR descriptors can be used to build an effective model for discriminating between anticancer and non-anticancer drugs, we considered a dataset of 380 structurally heterogeneous compounds, comprising 122 non-redundant anticancer drugs [14,15] and 258 non-redundant non-anticancer drugs. All 122 anticancer drugs were taken from the NCI anti-cancer agent mechanism database [16] and have well-established mechanisms of action (Table 1 under supplementary material), whereas all 258 non-anticancer drugs were taken from DrugBank [17].

Descriptor calculation and selection

A set of 50 inductive descriptors was initially calculated for all 380 drugs. During the calculation, hydrogen atoms were suppressed and only heavy atoms were taken into account. The inductive QSAR descriptors were calculated from values of atomic electronegativities and radii using custom SVL scripts downloaded from the SVL exchanger [18] and implemented within the MOE package (Chemical Computing Group Inc 2005). To avoid cross-correlation among the independent variables, we computed the pair-wise correlation among all 50 QSAR parameters and removed those inductive descriptors that showed a linear dependence with R ≥ 0.9. As a result of this procedure, only 30 inductive QSAR descriptors were selected (Table 2, see supplementary material). The normalized values (on a scale of 0-1) of these 30 parameters were used to generate the QSAR models.
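The filtering and normalization steps described above can be illustrated with a minimal sketch. It is not the SVL/MOE workflow used in the study: the function names are hypothetical, the greedy keep-first order and the use of the absolute value of R are assumptions (the text only states that descriptors with R ≥ 0.9 were removed), and min-max scaling is assumed as the way to reach the 0-1 range.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient R between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.9):
    """Greedy filter (assumed strategy): keep a descriptor only if its |R|
    with every descriptor kept so far is below the threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def min_max_scale(values):
    """Rescale one descriptor column to the 0-1 range (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Toy data with hypothetical descriptor names; "b" is almost a multiple of "a".
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [4, 1, 3, 2]}
selected = drop_correlated(toy)
scaled = {name: min_max_scale(toy[name]) for name in selected}
```

With a greedy filter the surviving set depends on the order in which descriptors are visited, so a different ordering of the 50 descriptors could retain a different subset.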

Composition of the training and testing sets

For effective training of the network (primarily to avoid over-fitting), we used training sets of 342 compounds (including 100 anticancer drugs) randomly drawn from the 380 molecules. This random sampling was performed 20 times, and 20 independent QSAR models were created. In each training run, the remaining 10 percent of the compounds were used as the testing set in order to evaluate the average predictive ability of the method. The given performance measures have been averaged over five QSAR models.
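The repeated random sampling can be sketched as follows. This is an illustration only: the function name, the fixed seed, and the plain shuffle are assumptions, and the sketch does not enforce the stated count of 100 anticancer drugs per training set, which would require sampling each class separately.

```python
import random

def repeated_splits(n_items, n_repeats=20, test_fraction=0.1, seed=0):
    """Return n_repeats (train_indices, test_indices) pairs, each built by
    shuffling the item indices and holding out test_fraction of them."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    n_test = round(n_items * test_fraction)
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        test = sorted(indices[:n_test])
        train = sorted(indices[n_test:])
        splits.append((train, test))
    return splits

# 380 compounds -> 342 for training and 38 for testing in each of 20 runs.
splits = repeated_splits(380)
```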

ANN model for classification of anticancer/non-anticancer drugs

To relate the inductive descriptors to the anticancer activity of the studied molecules, we employed a standard back-propagation ANN using the Stuttgart Neural Network Simulator package [19]. The ANN used in this study consists of 30 input nodes, representing the 30 inductive QSAR descriptors, and 1 output node. The number of nodes in the hidden layer was varied from 2 to 40 in order to find the optimal network, that is, the one allowing the most accurate separation of anticancer and non-anticancer drugs in the training sets. During the learning phase, a value of 1 was assigned to the anticancer drugs and 0 to the others. For each configuration of the ANN (with 2, 4, 6, 8, 10, 12, 14, 20 and 40 hidden nodes, respectively), 20 independent training runs were performed to evaluate the average predictive power of the network. The corresponding counts of true positive, true negative, false positive and false negative predictions were estimated using cut-off values of 0.4 and 0.6 for non-anticancer and anticancer drugs, respectively. Thus, an anticancer drug from the testing set was considered correctly classified by the ANN only when its output value ranged from 0.6 to 1.0. Similarly, a non-anticancer drug from the testing set was considered correctly classified only if the ANN output lay between 0 and 0.4. All network output values between 0.4 and 0.6 were therefore counted as incorrect predictions (rather than as undetermined or non-defined).
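The cut-off rule can be made concrete with a short sketch that tallies the four prediction counts from network outputs. It covers only the scoring step, not the training in the Stuttgart Neural Network Simulator. The function name is hypothetical, and two details are assumptions: the boundaries 0.4 and 0.6 are treated as inclusive, and an output in the 0.4-0.6 band is booked as a false negative for an anticancer drug and as a false positive for a non-anticancer drug, which is one way to count it as incorrect.

```python
def tally_predictions(outputs, labels, low=0.4, high=0.6):
    """Count TP/TN/FP/FN from ANN outputs in [0, 1].

    labels: 1 for anticancer, 0 for non-anticancer.
    An anticancer drug is correct only if output >= high;
    a non-anticancer drug is correct only if output <= low.
    Everything else, including the low-high band, counts as an error.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for out, label in zip(outputs, labels):
        if label == 1:
            counts["TP" if out >= high else "FN"] += 1
        else:
            counts["TN" if out <= low else "FP"] += 1
    return counts

# Toy outputs: two anticancer drugs followed by three non-anticancer drugs.
example = tally_predictions([0.9, 0.5, 0.1, 0.5, 0.7], [1, 1, 0, 0, 0])
```

Counting the middle band as errors makes the reported accuracy conservative compared with a single 0.5 threshold, since ambiguous outputs never count in the model's favour.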

Performance measures

The prediction results from the neural network model were evaluated using the following statistical measures: accuracy, Matthews correlation coefficient (MCC), sensitivity (Qsens), specificity (Qspec), and probability of correct prediction (Qpred), computed with the equations given under supplementary material.
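Since the equations themselves are only given in the supplementary material, the sketch below uses the standard textbook definitions of these measures. Treating Qpred as TP / (TP + FP) is an assumption, and the function name and the toy counts are hypothetical rather than taken from the study.

```python
import math

def _ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def performance(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),   # Qsens
        "specificity": _ratio(tn, tn + fp),   # Qspec
        "q_pred": _ratio(tp, tp + fp),        # assumed definition of Qpred
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Toy counts, not results from the paper.
scores = performance(tp=8, tn=6, fp=2, fn=4)
```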

Results and discussion

The accuracy with which the artificial neural networks built upon the ‘inductive’ descriptors distinguish anticancer compounds demonstrates the adequacy and good predictive power of the developed QSAR model. There is strong evidence that the introduced inductive descriptors adequately reflect the structural properties of chemicals that are relevant to their anticancer activity. This observation is not surprising, considering that the calculated inductive QSAR descriptors should cover a very broad range of properties of bound atoms and molecules related to their size, polarizability, electronegativity, compactness, mutual inductive and steric influence, distribution of electronic density, etc. The average values for the two classes were well separated on the graph, and the selected 30 inductive descriptors should allow an effective QSAR model for binary classification to be built. Considering that the most important implication of the “anticancer-likeness” model is its potential use for identifying novel anticancer drug candidates from electronic databases, we calculated the positive predictive values (PPV) for the networks while varying the number of hidden nodes. Taking into account the PPV values for the networks with varying numbers of hidden nodes, along with the corresponding values of sensitivity, specificity and general accuracy, we selected the neural network with six hidden nodes as the most efficient among the studied ANNs (Table 2 in supplementary material). The ANN with 30 input, 6 hidden and 1 output nodes allowed the recognition of 84% of anticancer and 84% of non-anticancer compounds on average.
The output from this 30-6-1 network also demonstrated very good separation of positive (anticancer) and negative (non-anticancer) predictions, which revealed a superior model (accuracy = 84.28%, Qpred = 74.28%, sensitivity = 0.9285, specificity = 0.7857, Matthews correlation coefficient (MCC) = 0.6998) (Table 2 in supplementary material). The vast majority of the predictions for the testing sets, consisting of ⅓ anticancer and ⅔ non-anticancer compounds, fell within 0.0 - 0.4 for non-anticancer and 0.6 - 1.0 for anticancer drugs, which also illustrates that the 0.4 and 0.6 cut-off values provide adequate separation of the two bioactive classes (Tables 3 and 4 (see supplementary material) feature the output values from the 30-6-1 ANN for the training and testing sets, respectively). Presumably, the accuracy of the approach based on the inductive descriptors can be improved even further by expanding the set of QSAR descriptors or by applying a more powerful classification technique such as the Support Vector Machine. Using purely statistical techniques in conjunction with the inductive QSAR descriptors would also be beneficial, as they allow the contributions of individual descriptors to molecular “anticancer-likeness” to be interpreted. Nonetheless, despite certain drawbacks, the developed ANN-based QSAR model based on the inductive descriptors has demonstrated high accuracy and can be used for mining electronic collections of chemical structures for novel anticancer candidates.

An application of the model

The developed QSAR model for distinguishing anticancer drugs was validated further on anticancer compounds published in the journal ‘Nature Reviews Drug Discovery’, July 2004, supplement HOT DRUGS 2004, and in ‘Current Pharmaceutical Design’, 2000. The “experimental” anticancer drugs cited by Nature Reviews Drug Discovery include Gefitinib (a tyrosine kinase inhibitor) and Abarelix (which inhibits the production of androgens involved in prostate cancer). The drugs Etoposide and Teniposide and their involvement in cancer treatment are described in Current Pharmaceutical Design [20]. The corresponding structural formulas and their prediction results as anticancer drugs are presented in Table 6 under supplementary material. The predicted output for all 12 drugs was above 0.60, the threshold value used by the model for predicting a compound as an anticancer drug. These results demonstrate that the ANN-based binary classifier of anticancer/non-anticancer drugs is adequate and can be considered an effective tool for ‘in silico’ anticancer drug screening. The results also demonstrate that the inductive parameters, which are readily accessible from atomic electronegativities, covalent radii and interatomic distances, can produce a variety of useful QSAR descriptors for use in ‘in silico’ chemical research.

Conclusion

The results of the present work demonstrate that a variety of atomic, substituent and molecular properties, which can be computed within the framework of inductive and steric effects, inductive electronegativity and molecular capacitance, represent a powerful arsenal of 3D QSAR descriptors for modern ‘in silico’ drug research. Using only 30 inductive descriptors with no additional independent parameters, we achieved 84.28% accuracy in distinguishing compounds with and without anticancer activity. The selected set of inductive descriptors has a number of important merits: the descriptors are 3D and stereo-sensitive, can be easily computed from fundamental properties of bound atoms and molecules, and have a well-defined physical meaning. This ANN-based model for anticancer drug prediction can be used as a powerful QSAR tool for filtering lead molecules to discover novel anticancer drugs.

Review 1. Optimization of chemical libraries by neural networks.

Authors: J Sadowski
Journal: Curr Opin Chem Biol Date: 2000-06 Impact factor: 8.822

Review 2. Neural networks are useful tools for drug design.

Authors: G Schneider
Journal: Neural Netw Date: 2000-01

3. Soft quaternary anticholinergics: comprehensive quantitative structure-activity relationship (QSAR) with a linearized biexponential (LinBiExp) model.

Authors: Peter Buchwald; Nicholas Bodor
Journal: J Med Chem Date: 2006-02-09 Impact factor: 7.446

Review 4. Chemometrics, why, what and where to next?

5. Can we learn to distinguish between "drug-like" and "nondrug-like" molecules?

Review 6. Multivariate calibration: applications to pharmaceutical analysis.

Review 7. Antitumor properties of podophyllotoxin and related compounds.

8. Use of the Kohonen self-organizing map to study the mechanisms of action of chemotherapeutic agents.

9. Neural computing in cancer drug development: predicting mechanism of action.

10. DrugBank: a comprehensive resource for in silico drug discovery and exploration.