
Asymmetric bagging and feature selection for activities prediction of drug molecules.

Guo-Zheng Li, Hao-Hua Meng, Wen-Cong Lu, Jack Y Yang, Mary Qu Yang.

Abstract

BACKGROUND: Activities of drug molecules can be predicted by QSAR (quantitative structure activity relationship) models, which overcome the high cost and long cycle of the traditional experimental method. Because drug molecules with positive activity are far fewer than those with negative activity, it is important to take this unbalanced situation into account when predicting molecular activities.
RESULTS: Here, asymmetric bagging and feature selection are introduced into the problem, and asymmetric bagging of support vector machines (asBagging) is proposed for predicting drug activities under class imbalance. The features extracted from the structures of drug molecules also affect the prediction accuracy of QSAR models. Therefore, a novel algorithm named PRIFEAB is proposed, which applies an embedded feature selection method to remove redundant and irrelevant features for asBagging. Numerical experiments on a data set of molecular activities show that asBagging improves the AUC and sensitivity values of molecular activity prediction, and that PRIFEAB, with feature selection, further improves prediction ability.
CONCLUSION: Asymmetric bagging can help to improve the prediction accuracy of activities of drug molecules, and this can be further improved by performing feature selection to select relevant features from the drug molecule data sets.

Year: 2008 PMID: 18541060 PMCID: PMC2423448 DOI: 10.1186/1471-2105-9-S6-S7

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Modeling the quantitative structure activity relationship (QSAR) of drug molecules helps to predict molecular activities, which reduces the cost of traditional experiments and improves the efficiency of drug molecular design [1]. Molecular activity is determined by molecular structure, so structure parameters are extracted by different methods to build QSAR models. Many machine learning methods have been applied to QSAR modeling, such as multiple linear regression, k-nearest neighbor [2], partial least squares [3], Kriging [4], artificial neural networks [5] and support vector machines (SVM); among these, SVM is a state-of-the-art method that achieved satisfactory results in previous studies [6-8]. Ensemble learning, which has been widely used to improve the generalization performance of single learning machines, is becoming a hot topic in the machine learning and bioinformatics communities [9]. A good ensemble is one whose individuals are both accurate and make their errors on different parts of the input space [9]. The most popular methods for creating ensembles are Bagging and Boosting [10-12]. The effectiveness of such methods comes primarily from the diversity caused by re-sampling the training set. Agrafiotis et al. [13] compared bagging with single learning machines on QSAR problems and found that bagging is not always the best choice. Signal, proposed in [14], creates an ensemble of meaningful descriptors chosen from a much larger property space and showed better performance than other methods. Random forest has also been used in QSAR problems [15]. Dutta et al. [16] built QSAR models with an ensemble of different learning machines, using feature selection to produce different feature subsets for the different learning machines.
Although the above learning methods obtained satisfactory results, most previous works ignored a critical problem in QSAR modeling: the number of positive examples is often far smaller than the number of negatives. Hou et al. [17] addressed this problem by assigning different costs to the two classes of SVM, which improved the prediction results. Here, combining ensemble methods, we propose to use asymmetric bagging of SVM to address the unbalanced problem. Asymmetric bagging of SVM has been used to improve relevance feedback in image retrieval [18]. Instead of re-sampling from the whole data set, asymmetric bagging keeps the positive examples fixed and re-samples only from the negatives, so that the data subset of each individual is balanced. Furthermore, we employ AUC (area under the ROC curve) [19] as the measure of predictive results, because prediction accuracy (correction) alone cannot show the overall performance. We will analyze the experimental results in terms of AUC and several other popular measures such as sensitivity, specificity and correction. In QSAR problems, many parameters are extracted from the molecular structures as features, but some features are redundant or even irrelevant, and these features hurt the generalization performance of learning machines [20]. Feature selection methods can be categorized into the filter model, the wrapper model and the embedded model [20-22]: the filter model is independent of the learning machine, whereas both the embedded model and the wrapper model depend on the learning machine, with the embedded model having lower computational complexity than the wrapper model. Different feature selection methods have been applied to QSAR problems [23-25] and have shown that proper selection of molecular descriptors helps to improve prediction accuracy.
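The re-sampling step of asymmetric bagging can be illustrated with a minimal sketch. It assumes labels coded as 1 (positive) and 0 (negative), negatives drawn with replacement, and as many negatives per subset as there are positives; the function name `asymmetric_bags` and the toy label vector are hypothetical and are not taken from the paper.

```python
import random


def asymmetric_bags(labels, n_bags, seed=0):
    """Build index subsets that keep every positive and resample negatives.

    Assumption: negatives are drawn with replacement, one per positive,
    so each subset is twice the size of the positive class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    bags = []
    for _ in range(n_bags):
        # Only the negative class is re-sampled; positives stay fixed.
        sampled_neg = [rng.choice(neg) for _ in range(len(pos))]
        bags.append(pos + sampled_neg)
    return bags


# Toy data: 3 positives followed by 160 negatives (illustrative only).
labels = [1] * 3 + [0] * 160
bags = asymmetric_bags(labels, n_bags=5)
```

Each index list would then be used to train one individual classifier, and the individual predictions combined as in ordinary bagging.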
In order to improve the accuracy of asymmetric bagging, we use feature selection to improve the accuracy of the individuals. This is motivated by the work of Li and Liu [26], who found that embedded feature selection is effective in improving the accuracy of bagging of SVM and proposed an algorithm, PRIFEB, which improved the generalization performance of ordinary bagging. Here we propose to combine PRIFEB with asymmetric bagging and develop a novel algorithm named PRIFEAB to solve the prediction problem of unbalanced QSAR.

Results and discussion

In order to demonstrate the effect of unbalanced learning methods, we performed the following series of experiments using support vector machines (SVM) as base classifiers.

1. SVM is the baseline method, which uses a 2-norm soft margin version of SVM.
2. unSVM assigns a different C to each class. The parameter balanced_bridge is set to the ratio of the number of positive examples to that of negatives, which is 0.0188 in this paper.
3. Bagging is a commonly used ensemble method, which uses SVM as base learners. The number of individuals is 55.
4. unBagging is also a commonly used bagging method, which uses unSVM as base learners. There are also 55 individuals.
5. asBagging is asymmetric bagging, which uses SVM as base learners.
6. PRIFEAB is a bagging method that employs feature selection for asBagging to remove irrelevant and redundant features.
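The difference between SVM and unSVM can be written as a cost-weighted objective. The following is one common formulation of a 2-norm soft-margin SVM with class-dependent costs, given only as a sketch: the symbols w, b, ξ_i, φ, C_+ and C_- are standard notation assumed here, and the source does not state the exact objective that was optimized.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2}
  + C_{+}\sum_{i:\,y_i=+1}\xi_i^{2}
  + C_{-}\sum_{i:\,y_i=-1}\xi_i^{2}
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\qquad i=1,\dots,n
```

Setting C_+ = C_- = C recovers the baseline SVM, while unSVM uses different values of C_+ and C_- for the two classes.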

Prediction performance

Experiments are performed to investigate whether asymmetric bagging and feature selection help to improve the performance of bagging. Support vector machines with C = 100 and σ = 0.1 are used as individual classifiers, and the number of individuals is 55 for all bagging methods. For unSVM, balanced_bridge denotes the ratio of C+ to C-, which is 0.0188. For ordinary bagging, each individual is trained on one tenth of the training data set, while for asBagging, the size of each individual data subset is twice the number of positive examples in the whole data set. A 3-fold cross validation scheme is used to validate the results, and the experiments for each algorithm are repeated 10 times. We test the learning methods on individual molecular descriptors: the BCUT, Constitutional, Prop and Topological descriptors, denoted BCUT, CONST, PROP and TOPO respectively. The average BACC values are shown in Figure 1, from which we can clearly see that:

Figure 1

Performance of different learning algorithms. Both graphs show BACC scores. Top: Results grouped by descriptor. Bottom: Results grouped by learning algorithm.

(1) unSVM does improve the performance of SVM.
(2) Bagging does not meet our expectation: it does not improve the performance of SVM, and neither does unBagging, whose results are similar to those of Bagging.
(3) asBagging greatly improves the performance of SVM, and PRIFEAB slightly improves the results of asBagging.

Tables 1-7 list the results for the different measures, i.e. AUC, BACC, sensitivity, specificity, PPV, NPV and correction, using the above SVM and bagging methods. Table 8 lists the ratio of the number of features used in PRIFEAB to the total number of features. From Tables 1-8, we can see that:

Table 1

Statistics values of AUC (%).

Descriptor	SVM	unSVM	Bagging	unBagging	asBagging	PRIFEAB
BCUT	59.4(1.3)	61.2(1.3)	55.0(1.0)	55.2(1.0)	75.3(0.8)	75.8(0.6)
CONST	50.8(0.8)	59.3(1.9)	50.3(1.1)	50.4(1.1)	75.0(0.3)	75.3(0.5)
PROP	62.3(1.4)	63.0(1.2)	55.4(1.5)	55.5(1.3)	78.0(0.9)	78.3(0.9)
TOPO	57.7(1.0)	50.8(0.8)	54.0(1.1)	54.1(2.0)	73.4(0.5)	73.6(0.7)
Average	57.6(1.1)	58.6(1.3)	53.7(1.2)	53.8(1.4)	75.4(0.6)	75.8(0.7)
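For readers who want to reproduce measures of this kind, the sketch below shows how AUC and the confusion-matrix measures named in the text could be computed. It rests on assumptions that the source does not state: BACC is taken to be balanced accuracy, i.e. the mean of sensitivity and specificity, and "correction" is taken to be overall accuracy. The function names and the toy numbers are hypothetical.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_measures(tp, fn, fp, tn):
    """Measures derived from confusion counts (definitions assumed, see above)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "bacc": (sensitivity + specificity) / 2,          # assumed meaning of BACC
        "correction": (tp + tn) / (tp + fn + fp + tn),    # assumed to be accuracy
    }


# Toy inputs, not data from the paper.
example_auc = auc_from_scores([0.9, 0.8, 0.3, 0.6, 0.2], [1, 0, 1, 0, 0])
m = confusion_measures(tp=8, fn=2, fp=30, tn=60)
```

With few positives, correction is dominated by the negative class, which is why the text reports AUC, BACC and sensitivity alongside it.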