Deepak R Bharti, Andrew M Lynn.
Abstract
Malaria is a major infectious disease with a global footprint, and it is especially severe in developing countries in Africa. In recent years, drug-resistant malaria has become an alarming problem, so new and improved drugs are needed more than ever. One promising source of antimalarial drug targets is the apicoplast, as this organelle does not occur in humans. The apicoplast is associated with many unique and essential pathways in many Apicomplexan pathogens, including Plasmodium. Machine learning methods are now widely available through open source programs. In the present work, we describe a standard protocol for developing molecular-descriptor-based predictive models (QSAR models), which can then be used to screen large chemical libraries. The protocol is used to build models from training data sourced from apicoplast-specific bioassays. Multiple model building methods are used, including Generalized Linear Models (GLM), Random Forest (RF), the C5.0 implementation of a decision tree, Support Vector Machines (SVM), K-Nearest Neighbour (KNN) and Naive Bayes. Methods for evaluating the accuracy of each model building method are included in the protocol. For the given dataset, C5.0, SVM and RF perform better than the other methods, with comparable accuracy on the test data.
Keywords: Malaria; R statistical package; apicoplast; predictive model building
Year: 2017 PMID: 28690382 PMCID: PMC5498782 DOI: 10.6026/97320630013154
Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063
Antimalarial drugs with targets
| Pathway/Process | Targets | Drugs | Source(s) |
| --- | --- | --- | --- |
| Replication | GyrA, GyrB | Fluoroquinolone, Ciprofloxacin, Clindamycin, Doxycycline, Novobiocin, Coumermycin, Chloroquine | [ |
| Transcription | RpoB, RpoC1, RpoC2 | Rifampin, Thiostrepton, Doxycycline, Tetracycline, Clindamycin | [ |
| Translation | Pf1F-1, 23S rRNA, GTPase, Aminoacyl-tRNA synthetase, PTC | Macrolides, Thiostrepton, Chloramphenicol, Lincosamides, Micrococcin, Mupirocin, Indolmycin | [ |
| Fatty acid biosynthesis | FASII, FabH, FabI, β-ketoacyl-ACP synthetase I and II | Thiolactomycin, Cerulenin, Triclosan | [ |
| Isoprenoid synthesis | DOXP reductoisomerase | Fosmidomycin | [ |
| Heme synthesis | Dehydratases | Herbicides | [ |
Figure 1. Workflow adopted for the current study. The initial dataset is in SDF format. Descriptors are calculated, and preprocessing-I is applied regardless of the data and the Machine Learning (ML) method used. The preprocessed data are subjected to Recursive Feature Elimination (RFE) to obtain the best feature subset for model building. The input data are prepared according to the selected feature set, and preprocessing-II is applied, which depends solely on the best practices suggested by the caret package for the underlying ML method. The model building step includes hyperparameter optimisation, cross-validation and best model selection. The output is a model file that can be used to predict unlabelled compound libraries. The preprocessing and model building steps are carried out using R and the caret package.
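The feature selection step in the workflow can be illustrated with a minimal Python sketch of recursive feature elimination. This is not the caret implementation used in the study (which runs in R and resamples at every subset size); it only shows the idea of repeatedly dropping the least important feature and keeping the best-scoring subset. The descriptor names, the weights, and the `rank` and `score` callables are hypothetical stand-ins for a real importance measure and a real cross-validated performance estimate.

```python
from typing import Callable, List, Sequence, Tuple


def recursive_feature_elimination(
    features: Sequence[str],
    rank: Callable[[List[str]], List[str]],
    score: Callable[[List[str]], float],
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Backward elimination: drop the least important feature each round.

    rank(subset)  -> the subset ordered from most to least important
    score(subset) -> performance estimate, higher is better
    """
    current = list(features)
    best_subset, best_score = list(current), score(current)
    while len(current) > min_features:
        ranked = rank(current)
        current = ranked[:-1]  # remove the lowest-ranked feature
        s = score(current)
        if s >= best_score:  # prefer the smaller subset on ties
            best_subset, best_score = list(current), s
    return best_subset, best_score


# Hypothetical toy problem: two informative descriptors, two noise columns.
weights = {"logP": 0.5, "MW": 0.3, "noise1": 0.0, "noise2": 0.0}


def toy_rank(subset: List[str]) -> List[str]:
    # Stand-in for a model-based variable importance ranking.
    return sorted(subset, key=lambda f: -weights[f])


def toy_score(subset: List[str]) -> float:
    # Stand-in for a resampled performance estimate, with a small
    # penalty per feature so that useless descriptors hurt the score.
    return sum(weights[f] for f in subset) - 0.05 * len(subset)


selected, selected_score = recursive_feature_elimination(
    list(weights), toy_rank, toy_score
)
print(selected, round(selected_score, 2))
```

In a real run, `score` would be replaced by a resampled estimate such as cross-validated accuracy or ROC, so that the chosen subset is not tuned to a single split of the training data.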
Figure 2. ROC plots for the different classifiers, with AUC values. A higher AUC value indicates better predictive power of the corresponding machine learning method.
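To make the AUC values concrete, the sketch below computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one, with ties counted as one half. It is a plain-Python illustration, not the routine used to produce the figure, and the labels and scores are made-up values.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    labels: 1 for active, 0 for inactive
    scores: classifier score, higher means "more likely active"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Made-up example: one active (score 0.4) is ranked below one inactive (0.5),
# so 3 of the 4 active/inactive pairs are ordered correctly.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(example_auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from inactives.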
Model performance on previously unseen data. The bioassays under study were first cross-checked for compounds in common with those used for model building. Only previously unseen compounds are used to assess model performance.
| Method | AID-488745 Predicted Active | AID-488745 Predicted Inactive | AID-488752 Predicted Active | AID-488752 Predicted Inactive | AID-504848 Predicted Active | AID-504848 Predicted Inactive |
| --- | --- | --- | --- | --- | --- | --- |
| GLM | 114/154 | 613/800 | 106/134 | 684/883 | 547/966 | 188/223 |
| RF | 129/154 | 621/800 | 118/134 | 684/883 | 564/966 | 187/223 |
| C5.0 | 126/154 | 608/800 | 117/134 | 669/883 | 599/966 | 182/223 |
| KNN | 95/154 | 412/800 | 82/134 | 448/883 | 534/966 | 149/223 |
| SVM | 139/154 | 516/800 | 123/134 | 558/883 | 593/966 | 185/223 |
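The fractions in this table are easier to compare as rates. The sketch below converts the AID-488745 counts into per-class rates, assuming that each entry "x/n" means that x of the n compounds with that true label were predicted with that label (the source does not state this explicitly). The function and variable names are illustrative only.

```python
# Counts for assay AID-488745, copied from the table above.
# Assumption: "x/n" = x of the n compounds with that true label were
# predicted with that label (actives first, then inactives).
AID_488745 = {
    "GLM": (114, 154, 613, 800),
    "RF": (129, 154, 621, 800),
    "C5.0": (126, 154, 608, 800),
    "KNN": (95, 154, 412, 800),
    "SVM": (139, 154, 516, 800),
}


def per_class_rates(active_hits, n_active, inactive_hits, n_inactive):
    """Return sensitivity, specificity, accuracy and balanced accuracy."""
    sensitivity = active_hits / n_active        # actives recovered
    specificity = inactive_hits / n_inactive    # inactives recovered
    accuracy = (active_hits + inactive_hits) / (n_active + n_inactive)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


rates = {method: per_class_rates(*counts) for method, counts in AID_488745.items()}

for method, r in rates.items():
    print(
        f"{method:5s} sens={r['sensitivity']:.3f} "
        f"spec={r['specificity']:.3f} bal_acc={r['balanced_accuracy']:.3f}"
    )
```

Under that reading, SVM recovers the largest share of actives in this assay (about 0.90) but a smaller share of inactives (about 0.65) than RF (about 0.84 and 0.78), which is why balanced accuracy is reported alongside the raw rates.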
Performance of the various models on the train and test data sets (boot632 resampling; 10-fold cross-validation repeated 10 times). Values are reported to two significant figures.
| Method | ROC (Train) | ROC (Test) | Accuracy (Train) | Accuracy (Test) | Sensitivity (Train) | Sensitivity (Test) | Specificity (Train) | Specificity (Test) | Precision (Test) | F1-score (Test) | MCC (Test) | Kappa (Test) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GLM | 0.82 | 0.82 | 0.75 | 0.75 | 0.74 | 0.74 | 0.76 | 0.76 | 0.76 | 0.75 | 0.5 | 0.5 |
| RF | 0.92 | 0.88 | 0.87 | 0.8 | 0.86 | 0.79 | 0.88 | 0.82 | 0.82 | 0.8 | 0.61 | 0.61 |
| C5.0 | 0.92 | 0.88 | 0.87 | 0.8 | 0.86 | 0.78 | 0.88 | 0.83 | 0.82 | 0.8 | 0.61 | 0.61 |
| SVM | 0.9 | 0.88 | 0.83 | 0.81 | 0.82 | 0.8 | 0.84 | 0.82 | 0.82 | 0.81 | 0.63 | 0.63 |
| KNN | 0.86 | 0.85 | 0.79 | 0.78 | 0.79 | 0.77 | 0.79 | 0.78 | 0.78 | 0.77 | 0.55 | 0.55 |
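All of the threshold-based metrics in this table can be derived from a binary confusion matrix. The sketch below uses the standard definitions of accuracy, sensitivity, specificity, precision, F1-score, Matthews correlation coefficient (MCC) and Cohen's kappa. The counts are hypothetical, chosen only to resemble a roughly balanced test set; they are not taken from the study.

```python
import math


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    n = tp + fn + tn + fp
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts: 100 actives and 100 inactives in the test set.
metrics = classification_metrics(tp=80, fn=20, tn=82, fp=18)
for name, value in metrics.items():
    print(f"{name:12s} {value:.2f}")
```

For these hypothetical, nearly balanced counts MCC and kappa both come out at about 0.62. The two statistics tend to agree closely when the classes are balanced and diverge as the class imbalance grows.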