
Ensemble learning method for the prediction of new bioactive molecules.

Lateefat Temitope Afolabi, Faisal Saeed, Haslinda Hashim, Olutomilayo Olayemi Petinrin.

Abstract

Pharmacologically active molecules can provide remedies for a range of different illnesses and infections. Therefore, the search for such bioactive molecules has been an enduring mission. As such, there is a need to employ a more suitable, reliable, and robust classification method for enhancing the prediction of the existence of new bioactive molecules. In this paper, we adopt a recently developed combination of different boosting methods (Adaboost) for the prediction of new bioactive molecules. We conducted the research experiments utilizing the widely used MDL Drug Data Report (MDDR) database. The proposed boosting method generated better results than other machine learning methods. This finding suggests that the method is suitable for inclusion among the in silico tools for use in cheminformatics, computational chemistry and molecular biology.

Year:  2018        PMID: 29329334      PMCID: PMC5766097          DOI: 10.1371/journal.pone.0189538

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Background

Virtual screening, which has its roots in cheminformatics, computational chemistry and structural biology [1], is the computation of the similarity between the target (reference structure) and each molecule in a database [2]. It is an established method for the discovery of new biologically active molecules [3]. It is a process whereby, through molecular modeling, each chemical agent in a database is docked into the binding region of each macromolecular target [4]. Docking is the process whereby the best fit for each agent in the binding region of the macromolecular target is calculated [4]. Schneider and Bohm [5] provided a survey of fast automated docking methods, and Feinstein and Brylinski [6] carried out a detailed study on the calculation of an optimal box size for molecular docking against predicted binding pockets. Wang et al. [7] extensively reviewed graphene-based glucose sensors over the period from 2008 to 2015. Huang et al. [8] worked on Drosophila, where Piwi-piRNA was the guiding epigenetic mechanism to target sites. Their work provided insight into the process involved in the recruitment of epigenetic factors to their target sites. Meanwhile, Marinov et al. [9] re-examined the work of Huang et al. and found that their genome-wide result was not supported by their dataset. Lin et al. [10] confirmed the finding of Marinov et al. that the genomic site was not discovered, and reaffirmed that the genomic distribution of RNA polymerase II is influenced by Piwi. Watanabe and Lin [11] reviewed piRNA in detail with respect to several biological processes. The science of processing bioactive molecules in important fields, such as lead discovery and compound optimization, has evolved in recent years [12]. The literature has extensively discussed different virtual screening techniques [13-16] and activity prediction approaches [17].
For example, Burden and Winkler [18] introduced the Quantitative Structure-Activity Relationship (QSAR) method as a solution to large datasets and then proposed back propagation (BP) after comparing this method with Multiple Linear Regression (MLR), Principal Component Regression (PCR) and Partial Least Squares (PLS) methods. They applied QSAR to massive data sets derived from combinatorial chemistry and High Throughput Screening (HTS). QSAR involves the prediction of the biological activity of a compound from a vector representation of molecular structure [19]. QSAR has been successfully utilized in many drug and agrochemical design problems. Burden and Winkler's study [18] outlined further challenges of QSAR, and Rogers and Hopfinger [20] solved the problem of building QSAR and Quantitative Structure-Property Relationship (QSPR) models using Genetic Function Approximation (GFA). In their work, they disclosed that the secret of the GFA lies in the creation and use of multiple models, rather than the utilization of a single model. Additionally, the previously unclear QSAR between plant-derived flavones and their inhibitory effects on aurora B kinase (aurB) has been established [21]. In the relevant literature, several similarity search methods have been proposed [22]. Sheridan and Kearsley [22] justified the need for many chemical similarity search methods in the early discovery of leads in a drug discovery project. Detailed reviews of chemical similarity searching and virtual screening can be found in Schneider and Bohm [5] and Willett, Barnard and Downs [23].
In this modern era of computational technological advancement, the adoption of machine learning algorithms for the prediction of molecules has been explored. Willett et al. [24] applied the Binary Kernel Discrimination (BKD) approach for the determination of ion channel activity. BKD was introduced and compared with merged similarity search by Harper [25]. Liu et al. [26] developed a model based on the Support Vector Machine, which can be used to automatically produce predictors. This model combines four functions (extracting features, selecting parameters, training models, and cross-validation) and improves the prediction rate. A recent survey of the successes to date and the possible opportunities for machine learning in ligand-based virtual screening was performed by Lavecchia [27]. The successes include the development of a large-scale machine learning data protocol, in the work of George et al. [28]; machine learning algorithms in multidimensional analysis of classification performance of compounds, Kurczab and Bojarski [29]; the Naive Bayesian classifier, Kurczab, Smusz and Bojarski [15], Bender et al. [30], and Glick et al. [31]; the Bayesian belief network, Abdo et al. [17], Nidhi et al. [32], and Xia et al. [33]; Support vector machines, Bruce et al. [19] and Buchwald, Ritter and Kramer [34]; Binary kernel discrimination, Willett et al. [24] and Reynolds and Sternberg [25]; the C5 (decision tree), Cao et al. [35]; and Investigational Novel Drug Discovery by Example (INDDEX™), Reynolds and Sternberg [16]. Krasowski and Ekins [36] addressed the challenges faced in correctly detecting a molecule and assigning it to a class. They utilized cheminformatics to determine the cross reactivity of designer drugs to their available immunoassays (procedures for detecting or measuring specific proteins or other substances through their properties as antigens) [36]. Stumpfe and Bajorath's study [37] focuses on the practical applications, calculation, and appropriate domain of ligand-based virtual screening. Sherhod et al. [38] generated structural fragment descriptors by applying a contrast pattern tree mining algorithm. The patterns form hierarchical clusters of compounds that represent different classes of chemicals. This method was able to identify common toxic features and their classes.
Takigawa and Mamitsuka [39] further elaborated on this idea and on the procedures for mining frequent subgraphs from the molecular graphs of chemical compounds. Smusz et al. [40] adapted virtual screening for their work on the discovery of two structurally new 5-HT6R ligands, and Métivier et al. [41] worked on the discovery of structural alerts. In recent research, clustering algorithms have also been used in cheminformatics to discover drugs. A detailed study [42] compares popular clustering techniques, namely, k-means, bisecting k-means and Ward clustering. The applications of clustering include QSAR analysis, High Throughput Screening (HTS), and Absorption, Distribution, Metabolism, Elimination and Toxicity (ADMET) prediction [42]. Meanwhile, Pires et al. [43] proposed a novel technique, called pkCSM, to develop predictive models for toxicity properties and small-molecule pharmacokinetics using graph-based signatures. Ensembles have proven to be suitable for improving the performance of a prediction model since they utilize the ability of more than one classifier. They have been used to identify DNA-binding proteins [44] and Piwi-interacting RNAs [45]. The purpose of our research is to enhance the prediction of bioactive molecules using the boosting algorithm ensemble AdaboostM1 in conjunction with Bagging, Jrip, PART, Random Forest, REPTree and J48 as nominal classifiers. We also compared the performance of the boosting algorithm with that of a support vector machine classifier called LibSVM (LSVM) [17, 46].

Materials and methods

Data sets

Bioactive molecules from both natural products and synthetic compounds are precious sources that provide us with the necessary tools to create new drugs to cure diseases [17]. Molecular fingerprints are representations of chemical structures initially designed to support chemical database substructure searching. Subsequently, they have been used for analysis tasks such as similarity searching, clustering, and classification. Extended connectivity fingerprints (ECFPs) are a recently developed fingerprint methodology specifically designed to identify molecular features significant to molecular activity [47]. Three datasets based on the standard ECFP_4 molecular descriptors, which were used in previous studies, were used for this study. These datasets were retrieved from the MDDR database. The datasets consist of 8294, 5083, and 8568 instances for DS1, DS2, and DS3, respectively, as shown in Tables 1–3. The quality of prediction was based on these datasets and the validation of the classification of molecules was based on the structure-activity relationship.
Table 1

Activity class for dataset DS1.

| Activity Index | Activity Class | Active Molecules | Pairwise Similarity (Mean) |
|---|---|---|---|
| 31420 | Renin inhibitors | 1130 | 0.573 |
| 71523 | HIV protease inhibitors | 750 | 0.446 |
| 37110 | Thrombin inhibitors | 803 | 0.419 |
| 31432 | Angiotensin II AT1 antagonists | 943 | 0.403 |
| 42731 | Substance P antagonists | 1246 | 0.339 |
| 06233 | 5HT3 antagonists | 752 | 0.351 |
| 06245 | 5HT reuptake inhibitors | 359 | 0.345 |
| 07701 | D2 antagonists | 395 | 0.345 |
| 06235 | 5HT1A agonists | 827 | 0.343 |
| 78374 | Protein kinase C inhibitors | 453 | 0.323 |
| 78331 | Cyclooxygenase inhibitors | 636 | 0.268 |
Table 3

Activity class for dataset DS3.

| Activity Index | Activity Class | Active Molecules | Pairwise Similarity (Mean) |
|---|---|---|---|
| 09249 | Muscarinic (M1) agonists | 900 | 0.257 |
| 12455 | NMDA receptor antagonists | 1400 | 0.311 |
| 12464 | Nitric oxide synthase inhibitors | 505 | 0.237 |
| 31281 | Dopamine β-hydroxylase inhibitors | 106 | 0.324 |
| 43210 | Aldose reductase inhibitors | 957 | 0.37 |
| 71522 | Reverse transcriptase inhibitors | 700 | 0.311 |
| 75721 | Aromatase inhibitors | 636 | 0.318 |
| 78331 | Cyclooxygenase inhibitors | 636 | 0.382 |
| 78348 | Phospholipase A2 inhibitors | 617 | 0.291 |
| 78351 | Lipoxygenase inhibitors | 2111 | 0.365 |
The three datasets were pre-processed in the WEKA workbench using the filter found under filters > unsupervised > attribute > NumericToNominal. DS1 contains eleven normal activity classes, DS2 contains ten homogeneous activity classes, and DS3 contains ten heterogeneous activity classes. Tables 1–3 show the activity index, activity class, number of active molecules and mean pairwise similarity for each class. The active molecules column gives the number of molecules or peptides belonging to the class. The diversity of the class is computed as the mean pairwise Tanimoto similarity score calculated across all pairs of molecules/peptides in the class using ECFP_4; a lower mean similarity indicates a more diverse class.
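The diversity score described above can be illustrated with a minimal sketch. It assumes each fingerprint is represented as a Python set of "on" feature identifiers; the function names and the toy fingerprints are hypothetical and are not taken from the MDDR data or from the authors' tooling.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of 'on' features."""
    union = len(a | b)
    # Two empty fingerprints are treated as identical (an assumption of this sketch).
    return len(a & b) / union if union else 1.0


def mean_pairwise_similarity(fingerprints: list) -> float:
    """Mean Tanimoto similarity over all unordered pairs of fingerprints in one class."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy fingerprints (sets of feature identifiers), not MDDR data.
toy_class = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 7, 8}]
score = mean_pairwise_similarity(toy_class)
print(round(score, 3))
```

A class whose members share few features yields a low score, which is how the heterogeneous DS3 classes differ from the homogeneous DS2 classes in Tables 1–3.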

Ensemble learning technique

The employment of AdaboostM1 has been discussed in the literature, see for instance [48, 1]. It is a boosting machine learning algorithm [49] that works with another classifier (called the nominal classifier). It works successfully when the nominal classifier in question (also referred to as the weak learner) can achieve at least 50% accuracy on its own [49]. AdaboostM1 is an ensemble learning technique and the most well-known of the boosting family of algorithms. The algorithm sequentially trains models, with a new model trained at each round. At the end of each round, misclassified examples are identified and their emphasis is increased in a new training set, which is then fed into the next round and processed to train a new model [50]. The Waikato Environment for Knowledge Analysis (WEKA) software, which is cross-platform software with various machine learning algorithms written in Java, was used to carry out the study. AdaboostM1 is shown in Algorithm 1.
Algorithm 1: AdaboostM1
Input: a sequence of m examples $\langle (x_1, y_1), \ldots, (x_m, y_m) \rangle$ with labels $y_i \in Y = \{1, \ldots, k\}$; a weak learning algorithm weakLearn; an integer T specifying the number of iterations.
Initialize $D_1(i) = 1/m$ for all i.
Do for t = 1, 2, …, T:
1. Call weakLearn, providing it with the distribution $D_t$.
2. Get back a hypothesis $h_t: X \to Y$.
3. Calculate the error of $h_t$: $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$. If $\epsilon_t > 1/2$, then set T = t − 1 and abort the loop.
4. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
5. Update the distribution: $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \beta_t$ if $h_t(x_i) = y_i$, and $D_{t+1}(i) = \frac{D_t(i)}{Z_t}$ otherwise, where $Z_t$ is a normalisation constant (chosen so that $D_{t+1}$ will be a distribution).
Output: the final hypothesis $h_{fin}(x) = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log(1/\beta_t)$.
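The boosting loop of AdaboostM1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the WEKA implementation used in the study: the weak learner here is a hypothetical single-bit decision stump (`weak_learn`), the inputs are binary feature vectors, and the toy task (majority of three bits) is invented for the example.

```python
import math
from collections import defaultdict


def weak_learn(X, y, D):
    """Hypothetical weak learner: the best single-bit decision stump under weights D."""
    fallback = max(set(y), key=y.count)  # used when a bit value never occurs
    best = None
    for j in range(len(X[0])):
        # Weighted class votes among examples with bit j off (0) and on (1).
        votes = {0: defaultdict(float), 1: defaultdict(float)}
        for xi, yi, di in zip(X, y, D):
            votes[xi[j]][yi] += di
        rule = {b: (max(v, key=v.get) if v else fallback) for b, v in votes.items()}
        err = sum(di for xi, yi, di in zip(X, y, D) if rule[xi[j]] != yi)
        if best is None or err < best[0]:
            best = (err, j, rule)
    _, j, rule = best
    return lambda x: rule[x[j]]


def adaboost_m1(X, y, T):
    """Train up to T rounds and return a predict(x) function (weighted vote)."""
    m = len(X)
    D = [1.0 / m] * m                       # uniform initial distribution
    ensemble = []                           # pairs (hypothesis, log(1 / beta_t))
    for _ in range(T):
        h = weak_learn(X, y, D)
        eps = sum(d for xi, yi, d in zip(X, y, D) if h(xi) != yi)
        if eps > 0.5:                       # weak learner too weak: stop early
            break
        eps = max(eps, 1e-10)               # sketch-only guard against eps == 0
        beta = eps / (1.0 - eps)
        ensemble.append((h, math.log(1.0 / beta)))
        # Shrink the weight of correctly classified examples, then renormalise.
        D = [d * (beta if h(xi) == yi else 1.0) for xi, yi, d in zip(X, y, D)]
        Z = sum(D)
        D = [d / Z for d in D]

    def predict(x):
        scores = defaultdict(float)
        for h, w in ensemble:
            scores[h(x)] += w
        return max(scores, key=scores.get)

    return predict


# Toy task: label is 1 when at least two of the three bits are on.
X_toy = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y_toy = [1 if sum(x) >= 2 else 0 for x in X_toy]
model = adaboost_m1(X_toy, y_toy, T=3)
single = adaboost_m1(X_toy, y_toy, T=1)
```

On this toy task no single stump classifies every example correctly, while three boosted stumps combined by the weighted vote do, which mirrors the idea that the ensemble can outperform its weak learner.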

Experimental design

The need to have a known drug that is classifiable to a specific biological molecular structure is a central part of computational chemistry [51]. In this experiment, we used the extended-connectivity fingerprints (ECFP4) developed by SciTegic [32]. The ECFP4 of MDDR (MDL Drug Data Report) [52] implementation in the test cases is used in this study. Discovering the optimal parameters for a classifier was a time-consuming task. WEKA-Workbench offers the possibility of automatically finding the best possible setup for the LSVM classifier. The values of 1.0, 0.1, and 0.001 were given to the Cost, Gamma and Epsilon parameters, respectively, while the default values available in WEKA-Workbench were used for the other parameters. In this study, six AdaBoost ensemble classifiers were applied, including AdaBoostM1+Bagging (Ada_Bag), AdaBoostM1+Jrip (Ada_Jrip), AdaBoostM1+J48 (Ada_J48), AdaBoostM1+PART (Ada_PART), AdaBoostM1+RandomForest (Ada_RF), and AdaBoostM1+REPTree (Ada_RT). Subsequently, a ten-fold cross-validation was carried out, and the results were evaluated using sensitivity, specificity, and area under the curve (AUC) measurements. All experiments were conducted using a personal computer with an Intel® Core ™ i7-4790 CPU 3.60 GHz processor, with 16 GB RAM, and a 64-bit operating system. There are some required settings in the configuration of WEKA to increase the heap size of the memory in the “RunWeka.ini” file under the parameter named “maxheap” with the value of “4096M”. This action supports the processing of the large amount of MDDR datasets being used (the original value was “1024M”). To validate the performance of each classifier, we used the confusion matrix of the classification results as a measure to compute all the evaluation parameters. The percentage of correctly classified instances from the 10-fold cross validation was used as the measure for the model. In cross validation, the parameter value of 10 was used as the default value. 
This means that the data set is divided into 10 folds; in each round, one fold is used for testing and the rest are used for training. This process is repeated 10 times so that every fold is used as the test fold once. The error rate is calculated as the average of the errors over the 10 folds. The area under the receiver operating characteristic curve (AUC), specificity, sensitivity and accuracy were used as the machine learning evaluation measures. These measures are widely used as quality criteria to quantify performance. They are defined as follows:
$\text{Sensitivity} = \frac{TP}{TP + FN}$
$\text{Specificity} = \frac{TN}{TN + FP}$
$\text{AUC} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
where TP = True Positive, FN = False Negative, TN = True Negative, and FP = False Positive.
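As a small illustration of how these measures follow from a confusion matrix, the sketch below computes them for one activity class treated as the positive class. The function name and the counts are hypothetical. The AUC line uses the single-operating-point form (the mean of sensitivity and specificity), which is an assumption of this sketch, although it appears consistent with the values reported in Tables 4–12 (for example, 0.978 and 0.995 for LSVM on the first DS1 class average to about 0.987).

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Evaluation measures for one class from its confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Assumed single-threshold AUC: mean of sensitivity and specificity.
    auc = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "auc": auc,
    }


# Hypothetical counts for one class, not taken from the study.
m = binary_metrics(tp=90, fn=10, tn=880, fp=20)
print({name: round(value, 3) for name, value in m.items()})
```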

Results and discussion

Tables 4–6 display the sensitivity measures (the true positive rates). On DS1 (Table 4), 2 of the 6 AdaBoost ensemble classifiers (Ada_Bag and Ada_RF) achieved a higher mean sensitivity than the existing best classifier (LSVM).
Table 4

Sensitivity measure for the prediction of new bioactive molecules with DS1 (normal dataset).

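| Class of DS1 | Activity Index | LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|
| 1 | 31420 | 0.978 | 0.983 | 0.979 | 0.980 | 0.978 | 0.985 | 0.977 |
| 2 | 71523 | 0.933 | 0.953 | 0.953 | 0.945 | 0.941 | 0.951 | 0.953 |
| 3 | 37110 | 0.980 | 0.981 | 0.978 | 0.978 | 0.980 | 0.976 | 0.971 |
| 4 | 31432 | 0.990 | 0.995 | 0.990 | 0.986 | 0.992 | 0.996 | 0.989 |
| 5 | 42731 | 0.986 | 0.980 | 0.970 | 0.970 | 0.971 | 0.990 | 0.968 |
| 6 | 06233 | 0.973 | 0.979 | 0.964 | 0.961 | 0.951 | 0.983 | 0.969 |
| 7 | 06245 | 0.905 | 0.916 | 0.872 | 0.855 | 0.861 | 0.905 | 0.886 |
| 8 | 07701 | 0.851 | 0.873 | 0.830 | 0.823 | 0.810 | 0.843 | 0.813 |
| 9 | 06235 | 0.941 | 0.949 | 0.935 | 0.906 | 0.900 | 0.953 | 0.933 |
| 10 | 78374 | 0.945 | 0.943 | 0.960 | 0.932 | 0.943 | 0.951 | 0.916 |
| 11 | 78331 | 0.970 | 0.973 | 0.973 | 0.947 | 0.951 | 0.980 | 0.961 |
| Mean | | 0.950 | 0.957 | 0.946 | 0.935 | 0.934 | 0.956 | 0.940 |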
Table 6

Sensitivity measure for the prediction of new bioactive molecules with DS3 (heterogeneous).

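| Class of DS3 | Activity Index | LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|
| 1 | 09249 | 0.980 | 0.972 | 0.979 | 0.970 | 0.968 | 0.982 | 0.974 |
| 2 | 12455 | 0.955 | 0.966 | 0.942 | 0.942 | 0.946 | 0.966 | 0.949 |
| 3 | 12464 | 0.909 | 0.899 | 0.911 | 0.907 | 0.911 | 0.909 | 0.893 |
| 4 | 31281 | 0.972 | 0.953 | 0.934 | 0.868 | 0.887 | 0.915 | 0.896 |
| 5 | 43210 | 0.950 | 0.956 | 0.934 | 0.947 | 0.943 | 0.956 | 0.937 |
| 6 | 71522 | 0.914 | 0.919 | 0.916 | 0.913 | 0.897 | 0.909 | 0.880 |
| 7 | 75721 | 0.980 | 0.976 | 0.961 | 0.945 | 0.951 | 0.970 | 0.956 |
| 8 | 78331 | 0.838 | 0.857 | 0.796 | 0.808 | 0.832 | 0.841 | 0.838 |
| 9 | 78348 | 0.898 | 0.912 | 0.878 | 0.901 | 0.890 | 0.891 | 0.867 |
| 10 | 78351 | 0.943 | 0.962 | 0.958 | 0.942 | 0.945 | 0.971 | 0.949 |
| Mean | | 0.934 | 0.937 | 0.921 | 0.914 | 0.917 | 0.931 | 0.914 |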
Table 5 (with DS2) shows that 3 (Ada_Bag, Ada_J48, and Ada_RT) out of 6 AdaBoost classifiers surpassed the LSVM classifier. However, Table 6 (with DS3) illustrates that only 1 (Ada_Bag) out of 6 AdaBoost classifiers surpassed the LSVM classifier.
Table 5

Sensitivity measure for the prediction of new bioactive molecules with DS2 (homogeneous).

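| Class of DS2 | Activity Index | LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|
| 1 | 07707 | 0.966 | 0.961 | 0.966 | 0.956 | 0.956 | 0.971 | 0.966 |
| 2 | 07708 | 0.968 | 0.968 | 0.962 | 0.974 | 0.949 | 0.987 | 0.949 |
| 3 | 31420 | 0.995 | 0.993 | 0.995 | 0.992 | 0.996 | 0.996 | 0.996 |
| 4 | 42710 | 0.982 | 0.973 | 0.973 | 0.982 | 0.973 | 0.964 | 0.991 |
| 5 | 64100 | 0.972 | 0.978 | 0.981 | 0.977 | 0.975 | 0.982 | 0.977 |
| 6 | 64200 | 0.734 | 0.772 | 0.715 | 0.810 | 0.759 | 0.722 | 0.759 |
| 7 | 64220 | 0.997 | 0.996 | 0.993 | 0.995 | 0.995 | 0.995 | 0.996 |
| 8 | 64300 | 0.968 | 0.952 | 0.952 | 0.976 | 0.968 | 0.968 | 0.976 |
| 9 | 65000 | 0.997 | 0.997 | 0.997 | 0.995 | 0.995 | 0.995 | 0.997 |
| 10 | 75755 | 0.996 | 0.996 | 0.993 | 0.996 | 0.996 | 0.996 | 0.993 |
| Mean | | 0.958 | 0.959 | 0.953 | 0.965 | 0.956 | 0.958 | 0.960 |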
Tables 7–9 show the specificity measures (the true negative rates). On DS1 (Table 7), 2 of the 6 AdaBoost classifiers (Ada_Bag and Ada_RF) again achieved a higher mean specificity than the existing best classifier (LSVM).
Table 7

Specificity measure for the prediction of new bioactive molecules with DS1 (normal dataset).

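| Class of DS1 | Activity Index | LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|
| 1 | 31420 | 0.995 | 0.997 | 0.997 | 0.996 | 0.978 | 0.996 | 0.997 |
| 2 | 71523 | 0.997 | 0.997 | 0.997 | 0.997 | 0.941 | 0.998 | 0.996 |
| 3 | 37110 | 0.998 | 0.998 | 0.998 | 0.996 | 0.980 | 0.998 | 0.997 |
| 4 | 31432 | 0.999 | 0.999 | 0.999 | 0.998 | 0.992 | 0.998 | 0.999 |
| 5 | 42731 | 0.995 | 0.996 | 0.994 | 0.992 | 0.971 | 0.995 | 0.993 |
| 6 | 06233 | 0.998 | 0.997 | 0.997 | 0.994 | 0.951 | 0.997 | 0.994 |
| 7 | 06245 | 0.996 | 0.997 | 0.997 | 0.996 | 0.861 | 0.997 | 0.995 |
| 8 | 07701 | 0.994 | 0.996 | 0.993 | 0.993 | 0.810 | 0.997 | 0.995 |
| 9 | 06235 | 0.991 | 0.993 | 0.988 | 0.990 | 0.900 | 0.991 | 0.990 |
| 10 | 78374 | 0.998 | 0.998 | 0.998 | 0.997 | 0.943 | 0.999 | 0.997 |
| 11 | 78331 | 0.997 | 0.996 | 0.997 | 0.994 | 0.951 | 0.997 | 0.995 |
| Mean | | 0.996 | 0.997 | 0.996 | 0.995 | 0.934 | 0.997 | 0.995 |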
Table 9

Specificity measure for the prediction of new bioactive molecules with DS3 (heterogeneous).

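| Class of DS3 | Activity Index | LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|
| 1 | 09249 | 0.997 | 0.996 | 0.996 | 0.995 | 0.996 | 0.996 | 0.994 |
| 2 | 12455 | 0.991 | 0.989 | 0.992 | 0.987 | 0.988 | 0.989 | 0.985 |
| 3 | 12464 | 0.996 | 0.998 | 0.996 | 0.995 | 0.994 | 0.999 | 0.996 |
| 4 | 31281 | 0.999 | 1.000 | 0.999 | 0.999 | 0.999 | 1.000 | 1.000 |
| 5 | 43210 | 0.995 | 0.997 | 0.996 | 0.995 | 0.994 | 0.996 | 0.994 |
| 6 | 71522 | 0.993 | 0.997 | 0.998 | 0.994 | 0.994 | 0.999 | 0.995 |
| 7 | 75721 | 0.997 | 0.998 | 0.997 | 0.996 | 0.996 | 0.998 | 0.997 |
| 8 | 78331 | 0.990 | 0.993 | 0.991 | 0.989 | 0.989 | 0.996 | 0.989 |
| 9 | 78348 | 0.992 | 0.995 | 0.995 | 0.994 | 0.992 | 0.996 | 0.993 |
| 10 | 78351 | 0.976 | 0.974 | 0.956 | 0.971 | 0.974 | 0.965 | 0.971 |
| Mean | | 0.993 | 0.994 | 0.992 | 0.992 | 0.992 | 0.993 | 0.991 |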
Moreover, Table 8 (with DS2) illustrates that 5 (Ada_Bag, Ada_J48, Ada_PART, Ada_RF and Ada_RT) out of 6 AdaBoost classifiers outperformed the LSVM classifier. Table 9 (with DS3) illustrates that only 1 (Ada_Bag) out of 6 AdaBoost classifiers surpassed the LSVM classifier in these specificity measures.
Table 8

Specificity measure for the prediction of new bioactive molecules with DS2 (homogeneous).

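| Class of DS2 | Activity Index | LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|
| 1 | 07707 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| 2 | 07708 | 0.999 | 0.998 | 0.999 | 0.999 | 0.998 | 0.998 | 0.998 |
| 3 | 31420 | 0.997 | 0.998 | 0.998 | 0.997 | 0.997 | 0.998 | 0.998 |
| 4 | 42710 | 1.000 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| 5 | 64100 | 0.989 | 0.990 | 0.987 | 0.992 | 0.990 | 0.989 | 0.990 |
| 6 | 64200 | 0.993 | 0.995 | 0.995 | 0.995 | 0.994 | 0.996 | 0.995 |
| 7 | 64220 | 0.998 | 0.999 | 0.999 | 0.998 | 0.999 | 0.998 | 0.999 |
| 8 | 64300 | 0.999 | 1.000 | 0.999 | 1.000 | 1.000 | 0.999 | 1.000 |
| 9 | 65000 | 1.000 | 0.999 | 0.999 | 0.999 | 0.999 | 1.000 | 1.000 |
| 10 | 75755 | 1.000 | 1.000 | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 |
| Mean | | 0.997 | 0.998 | 0.997 | 0.998 | 0.998 | 0.998 | 0.998 |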
Tables 10–12 display the AUC measures. On DS1 (Table 10), 2 of the 6 AdaBoost classifiers (Ada_Bag and Ada_RF) also achieved a higher mean AUC than the existing best classifier (LSVM).
Table 10

AUC measure for the prediction of new bioactive molecules with DS1 (normal dataset).

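| Class of DS1 | Activity Index | LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|
| 1 | 31420 | 0.987 | 0.990 | 0.988 | 0.988 | 0.987 | 0.991 | 0.987 |
| 2 | 71523 | 0.965 | 0.975 | 0.975 | 0.971 | 0.970 | 0.975 | 0.975 |
| 3 | 37110 | 0.989 | 0.990 | 0.988 | 0.987 | 0.988 | 0.987 | 0.984 |
| 4 | 31432 | 0.995 | 0.997 | 0.995 | 0.992 | 0.995 | 0.997 | 0.994 |
| 5 | 42731 | 0.991 | 0.988 | 0.982 | 0.981 | 0.981 | 0.993 | 0.981 |
| 6 | 06233 | 0.986 | 0.988 | 0.981 | 0.978 | 0.973 | 0.990 | 0.982 |
| 7 | 06245 | 0.951 | 0.957 | 0.935 | 0.926 | 0.929 | 0.951 | 0.941 |
| 8 | 07701 | 0.923 | 0.935 | 0.912 | 0.908 | 0.902 | 0.920 | 0.904 |
| 9 | 06235 | 0.966 | 0.971 | 0.962 | 0.948 | 0.945 | 0.972 | 0.962 |
| 10 | 78374 | 0.972 | 0.971 | 0.979 | 0.965 | 0.971 | 0.975 | 0.957 |
| 11 | 78331 | 0.984 | 0.985 | 0.985 | 0.971 | 0.973 | 0.989 | 0.978 |
| Mean | | 0.973 | 0.977 | 0.971 | 0.965 | 0.965 | 0.976 | 0.967 |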
Table 12

AUC measure for the prediction of new bioactive molecules with DS3 (heterogeneous).

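| Class of DS3 | Activity Index | LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|
| 1 | 09249 | 0.989 | 0.984 | 0.988 | 0.983 | 0.982 | 0.989 | 0.984 |
| 2 | 12455 | 0.973 | 0.978 | 0.967 | 0.965 | 0.967 | 0.978 | 0.967 |
| 3 | 12464 | 0.953 | 0.949 | 0.954 | 0.951 | 0.953 | 0.954 | 0.945 |
| 4 | 31281 | 0.986 | 0.977 | 0.967 | 0.934 | 0.943 | 0.958 | 0.948 |
| 5 | 43210 | 0.973 | 0.977 | 0.965 | 0.971 | 0.969 | 0.976 | 0.966 |
| 6 | 71522 | 0.954 | 0.958 | 0.957 | 0.954 | 0.946 | 0.954 | 0.938 |
| 7 | 75721 | 0.989 | 0.987 | 0.979 | 0.971 | 0.974 | 0.984 | 0.977 |
| 8 | 78331 | 0.914 | 0.925 | 0.894 | 0.899 | 0.911 | 0.919 | 0.914 |
| 9 | 78348 | 0.945 | 0.954 | 0.937 | 0.948 | 0.941 | 0.944 | 0.930 |
| 10 | 78351 | 0.960 | 0.968 | 0.957 | 0.957 | 0.960 | 0.968 | 0.960 |
| Mean | | 0.963 | 0.965 | 0.956 | 0.953 | 0.954 | 0.962 | 0.953 |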
Furthermore, Table 11 (with DS2) illustrates that 4 (Ada_Bag, Ada_J48, Ada_RF and Ada_RT) out of 6 AdaBoost classifiers outperformed the LSVM classifier. Table 12 (with DS3) illustrates that there was 1 (Ada_Bag) out of 6 AdaBoost classifiers that surpassed the LSVM classifier for AUC measurements.
Table 11

AUC measure for the prediction of new bioactive molecules with DS2 (homogeneous).

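| Class of DS2 | Activity Index | LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|
| 1 | 07707 | 0.983 | 0.980 | 0.983 | 0.978 | 0.978 | 0.985 | 0.983 |
| 2 | 07708 | 0.984 | 0.983 | 0.981 | 0.987 | 0.974 | 0.993 | 0.974 |
| 3 | 31420 | 0.996 | 0.996 | 0.997 | 0.995 | 0.997 | 0.997 | 0.997 |
| 4 | 42710 | 0.991 | 0.986 | 0.986 | 0.991 | 0.986 | 0.982 | 0.995 |
| 5 | 64100 | 0.981 | 0.984 | 0.984 | 0.985 | 0.983 | 0.986 | 0.984 |
| 6 | 64200 | 0.864 | 0.884 | 0.855 | 0.903 | 0.877 | 0.859 | 0.877 |
| 7 | 64220 | 0.998 | 0.998 | 0.996 | 0.997 | 0.997 | 0.997 | 0.998 |
| 8 | 64300 | 0.984 | 0.976 | 0.976 | 0.988 | 0.984 | 0.984 | 0.988 |
| 9 | 65000 | 0.999 | 0.998 | 0.998 | 0.997 | 0.997 | 0.998 | 0.999 |
| 10 | 75755 | 0.998 | 0.998 | 0.997 | 0.998 | 0.998 | 0.998 | 0.997 |
| Mean | | 0.977 | 0.978 | 0.975 | 0.982 | 0.977 | 0.978 | 0.979 |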
From the results illustrated in Tables 4–12, for all three measures (sensitivity, specificity and AUC), it can be seen that in most cases the AdaBoost ensemble classifiers provided better outcomes when compared with LSVM. These ensemble methods build a sequence of base models, where each model is constructed based on the performance of the previous model on the training set. In other words, by suitably combining the results of a set of base classifiers, the performance obtained can be better than that of any single base classifier.
Kendall's W test of concordance was used to assess whether the differences between the classifiers were statistically significant. This study used a cut-off value of 0.05 for the significance level (p-value): a result was considered significant, and capable of providing an overall ranking, if p < 0.05. The critical value of chi-square (χ²) at p = 0.05 for 6 degrees of freedom is 12.59. The degrees of freedom are equal to the total number of algorithms minus 1. In this study, 7 algorithms were applied (LSVM plus six AdaBoost ensemble classifiers), leading to 6 degrees of freedom. The results of Kendall's W tests are presented in Tables 13–15.
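The quantities reported for Kendall's W test (W, χ² and the mean ranks) can be reproduced in outline with the sketch below. It assumes the textbook formulas $W = 12S / (m^2 (k^3 - k))$ and $\chi^2 = m(k-1)W$ for m activity classes and k methods, with tied scores given their average rank and no tie correction; whether the authors' statistics package applied a tie correction is not stated, so small differences from the tabulated values are possible. The function names and the toy score matrices are hypothetical.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kendalls_w(scores):
    """scores[r][c] is the score of method c on class r (higher is better).

    Returns (W, chi-square, mean rank per method), without tie correction.
    """
    m, k = len(scores), len(scores[0])
    rank_rows = [average_ranks(row) for row in scores]
    rank_sums = [sum(row[c] for row in rank_rows) for c in range(k)]
    mean_sum = m * (k + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    w = 12 * s / (m ** 2 * (k ** 3 - k))
    chi2 = m * (k - 1) * w
    return w, chi2, [r / m for r in rank_sums]


# Hypothetical toy scores: three classes, three methods, perfect agreement.
agree = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
w_agree, chi2_agree, ranks_agree = kendalls_w(agree)

# Two classes that rank the methods in opposite order: no agreement.
w_none, _, _ = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

As a plausibility check on the relation between the two statistics, DS1 has m = 11 classes and k = 7 methods, so W = 0.506 gives χ² of about 33.4, close to the 33.387 reported for sensitivity in Table 13.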
Table 13

Rankings of existing best performing classifier (LSVM) and AdaBoost ensemble classifiers, based on Kendall’s W test results using the MDDR dataset by sensitivity measure.

| Dataset | W | χ² | p | Mean rank: LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|---|---|
| DS1 | 0.506 | 33.387 | 0.000 | 4.45 | 5.91 | 4.18 | 2.36 | 2.68 | 5.86 | 2.55 |
| DS2 | 0.086 | 5.176 | 0.521 | 4.4 | 4.1 | 3.1 | 4.1 | 3.25 | 4.4 | 4.65 |
| DS3 | 0.397 | 23.827 | 0.001 | 5.10 | 5.70 | 3.70 | 2.55 | 2.85 | 5.35 | 2.75 |
Table 15

Rankings of existing best performing classifier (LSVM) and AdaBoost ensemble classifiers, based on Kendall’s W test results using the MDDR dataset by AUC measure.

| Dataset | W | χ² | p | Mean rank: LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|---|---|
| DS1 | 0.600 | 39.573 | 0.000 | 4.50 | 5.91 | 4.41 | 2.18 | 2.27 | 6.00 | 2.73 |
| DS2 | 0.122 | 7.293 | 0.295 | 4.35 | 3.90 | 2.95 | 4.30 | 3.15 | 4.55 | 4.80 |
| DS3 | 0.486 | 29.133 | 0.000 | 5.25 | 5.85 | 3.50 | 2.55 | 2.75 | 5.50 | 2.60 |
The analysis in Table 13 shows that Kendall's coefficients for DS1 and DS3 using the sensitivity measure were significant (p < 0.05, χ² > 12.59) and that Ada_Bag ranked above all of the other methods. The overall rankings for DS1 were Ada_Bag > Ada_RF > LSVM > Ada_Jrip > Ada_PART > Ada_RT > Ada_J48. For DS3, they were Ada_Bag > Ada_RF > LSVM > Ada_Jrip > Ada_PART > Ada_RT > Ada_J48. Table 14 illustrates that Kendall's coefficients for DS1 and DS3 using the specificity measure were also significant (p < 0.05, χ² > 12.59) and that Ada_Bag in DS1 and Ada_RF in DS3 ranked above all of the other methods. The overall rankings for DS1 were Ada_Bag > Ada_RF > LSVM > Ada_Jrip > Ada_RT > Ada_PART > Ada_J48. For DS3 the rankings were Ada_RF > Ada_Bag > Ada_Jrip > LSVM > Ada_RT > Ada_J48 > Ada_PART.
Table 14

Rankings of existing best performing classifier (LSVM) and AdaBoost ensemble classifiers, based on Kendall’s W test results using the MDDR dataset by specificity measure.

| Dataset | W | χ² | p | Mean rank: LSVM | Ada_Bag | Ada_Jrip | Ada_J48 | Ada_PART | Ada_RF | Ada_RT |
|---|---|---|---|---|---|---|---|---|---|---|
| DS1 | 0.413 | 27.287 | 0.000 | 4.64 | 5.45 | 4.45 | 2.27 | 2.73 | 5.36 | 3.09 |
| DS2 | 0.043 | 2.562 | 0.862 | 3.70 | 4.30 | 3.90 | 3.80 | 3.70 | 3.95 | 4.65 |
| DS3 | 0.432 | 25.895 | 0.000 | 4.05 | 5.65 | 4.50 | 2.55 | 2.55 | 5.70 | 3.00 |
Table 15 illustrates that Kendall's coefficients for DS1 and DS3 using the AUC measure were also significant (p < 0.05, χ² > 12.59), with Ada_RF ranking highest in DS1 and Ada_Bag ranking highest in DS3. The overall rankings for DS1 were Ada_RF > Ada_Bag > LSVM > Ada_Jrip > Ada_RT > Ada_PART > Ada_J48. For DS3 they were Ada_Bag > Ada_RF > LSVM > Ada_Jrip > Ada_PART > Ada_RT > Ada_J48. In contrast, Tables 13–15 show that the results for DS2 were not significant for any measure (sensitivity, specificity or AUC; p > 0.05, χ² < 12.59): the performance of all classifiers in DS2, even though good, was very similar, so the differences between them were not significant.
Fig 1 illustrates that the highest accuracy was obtained by Ada_PART with 96.72% in DS1, Ada_J48 with 98.11% in DS2, and Ada_Bag with 94.54% in DS3. Thus, from the results in Fig 1, we can also conclude that AdaBoost classifiers were able to handle all the datasets.
Fig 1

Accuracy rates for the prediction of new bioactive molecules with MDDR (DS1, DS2 and DS3).

Most importantly, the results for DS3 (Fig 1) show that using Ada_Bag as the AdaBoost classifier improved the effectiveness of the prediction of new bioactive molecules in highly diverse data when compared with the existing best classification method (LSVM): Ada_Bag reached an accuracy of 94.54% on DS3, compared to 93.73% for LSVM. Our proposed methods also outperform the method adopted by Liu et al. [44], which was itself reported to supersede four other works.

Conclusions

In this paper, we have presented various machine learning and ensemble methods that were applied to three MDDR benchmark datasets. The results of the experiments illustrate that the incorporation of the boosting algorithm (AdaboostM1), in conjunction with Bagging (Ada_Bag) and Random Forest (Ada_RF) as the nominal classifiers into the in silico discovery of drugs, provides a significant improvement with regard to highly diverse datasets. In future research, other ensemble methods will be examined to see if they improve the effectiveness of the prediction of new bioactive molecules.
Table 2

Activity class for dataset DS2.

| Activity Index | Activity Class | Active Molecules | Pairwise Similarity (Mean) |
|---|---|---|---|
| 07707 | Adenosine (A1) agonists | 207 | 0.424 |
| 07708 | Adenosine (A2) agonists | 156 | 0.484 |
| 31420 | Renin inhibitors | 1130 | 0.584 |
| 42710 | Monocyclic β-lactams | 111 | 0.596 |
| 64100 | Cephalosporins | 1301 | 0.512 |
| 64200 | Carbacephems | 158 | 0.503 |
| 64220 | Carbapenems | 1051 | 0.414 |
| 64300 | Penicillin | 126 | 0.444 |
| 65000 | Antibiotic, macrolide | 388 | 0.673 |
| 75755 | Vitamin D analogous | 455 | 0.569 |

Review 1.  Similarity-based virtual screening using 2D fingerprints.

Authors:  Peter Willett
Journal:  Drug Discov Today       Date:  2006-10-20       Impact factor: 7.851

2.  Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases.

Authors:  Meir Glick; John W Davies; Jeremy L Jenkins
Journal:  J Chem Inf Model       Date:  2006 May-Jun       Impact factor: 4.956

3.  Discovering structural alerts for mutagenicity using stable emerging molecular patterns.

Authors:  Jean-Philippe Métivier; Alban Lepailleur; Aleksey Buzmakov; Guillaume Poezevara; Bruno Crémilleux; Sergei O Kuznetsov; Jérémie Le Goff; Amedeo Napoli; Ronan Bureau; Bertrand Cuissart
Journal:  J Chem Inf Model       Date:  2015-05-07       Impact factor: 4.956

4.  Plant-derived flavones as inhibitors of aurora B kinase and their quantitative structure-activity relationships.

Authors:  Yearam Jung; Soon Young Shin; Yeonjoong Yong; Hyeryoung Jung; Seunghyun Ahn; Young Han Lee; Yoongho Lim
Journal:  Chem Biol Drug Des       Date:  2014-10-28       Impact factor: 2.817

5.  Virtual screening of bioassay data.

Authors:  Amanda C Schierz
Journal:  J Cheminform       Date:  2009-12-22       Impact factor: 5.514

6.  Predicting a small molecule-kinase interaction map: A machine learning approach.

Authors:  Fabian Buchwald; Lothar Richter; Stefan Kramer
Journal:  J Cheminform       Date:  2011-06-27       Impact factor: 5.514

7.  Calculating an optimal box size for ligand docking and virtual screening against experimental and predicted binding pockets.

Authors:  Wei P Feinstein; Michal Brylinski
Journal:  J Cheminform       Date:  2015-05-15       Impact factor: 5.514

8.  Using cheminformatics to predict cross reactivity of "designer drugs" to their currently available immunoassays.

Authors:  Matthew D Krasowski; Sean Ekins
Journal:  J Cheminform       Date:  2014-05-10       Impact factor: 5.514

9.  A document classifier for medicinal chemistry publications trained on the ChEMBL corpus.

Authors:  George Papadatos; Gerard Jp van Westen; Samuel Croset; Rita Santos; Simone Trubian; John P Overington
Journal:  J Cheminform       Date:  2014-08-12       Impact factor: 5.514

10.  The influence of negative training set size on machine learning-based virtual screening.

Authors:  Rafał Kurczab; Sabina Smusz; Andrzej J Bojarski
Journal:  J Cheminform       Date:  2014-06-11       Impact factor: 5.514
