Meng-yu Wang, Peng Li, Pei-li Qiao.
Abstract
Using machine learning to assist virtual screening (VS) has proven to be an effective approach. However, the quality of the training set can degrade when incorrect docking poses are mixed in, which reduces screening efficiency. To address this problem, we present a method that uses ensemble learning to improve the support vector machine that processes the generated protein-ligand interaction fingerprints (IFPs). By combining multiple classifiers, ensemble learning avoids the performance limitations of a single classifier and achieves better generalization. In virtual screening experiments with SRC and Cathepsin K as targets, the ensemble learning method effectively reduced the error caused by low sample quality and improved the performance of the whole virtual screening process.
Year: 2016 PMID: 27127534 PMCID: PMC4834164 DOI: 10.1155/2016/4809831
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1: Virtual screening process.
Figure 2: The calculation process of Pharm-IF [14].
SVM parameter setting.
| Parameter name | Parameter values |
|---|---|
| SVM type | C-SVM |
| Class number | 2 |
| Kernel function | RBF |
| The degree in kernel function | 3 |
|  | 0.001 |
| Cost factor | 5 |
| Cache size | 500 MB |
| Tolerance in the termination criteria | 0.001 |
| The weight value of penalty factor for all kinds of samples | 1 |
| Cross validation | 5 |
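
For reference, the settings above can be read against the standard soft-margin C-SVM formulation with an RBF kernel. The symbols $C$, $\gamma$, $\xi_i$, and $\phi$ follow common SVM notation and are not taken from the source. Under this reading the cost factor corresponds to $C = 5$. The table row with no parameter name (value 0.001) may correspond to $\gamma$, but the source does not say so.

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0
$$

$$
K(x, x') = \phi(x)^{\top}\phi(x') = \exp\bigl(-\gamma\,\lVert x - x'\rVert^{2}\bigr)
$$

The degree parameter enters only polynomial kernels, so it has no effect when the RBF kernel is selected.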
Adaboost-SVM parameter setting.
| Parameter name | Parameter values |
|---|---|
| Ensemble learning type | Adaboost |
| Class number | 2 |
| The basic classifier type | C-SVM |
| Number of classifiers per layer | 100 |
| The max false alarm rate | 0.5 |
| The min hit rate | 0.9 |
| Number of iterations | 5 |
| Weight trim rate | 0.9 |
| Cache size | 500 MB |
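
The boosting settings above describe a loop of roughly the following shape. This is a minimal sketch of discrete AdaBoost, not the authors' implementation. To stay dependency-free it substitutes one-feature decision stumps for the C-SVM base classifier, and it ignores the hit-rate, false-alarm, and weight-trim settings. The names `train_stump`, `stump_predict`, `adaboost`, and `predict`, and the toy data, are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # Predict `sign` at or above the threshold, `-sign` below it.
                pred = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def adaboost(X, y, rounds=5):
    """Discrete AdaBoost for labels in {-1, +1}; returns [(alpha, stump), ...]."""
    n = len(X)
    w = [1.0 / n] * n                      # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = stump[0]
        if err >= 0.5:                     # no better than chance: stop
            break
        err = max(err, 1e-10)              # avoid division by zero below
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump can separate.
X_demo = [[0], [1], [2], [3], [4], [5]]
y_demo = [1, 1, -1, -1, 1, 1]
model = adaboost(X_demo, y_demo, rounds=3)
print([predict(model, row) for row in X_demo])
```

On this toy set a single stump misclassifies two of the six points, while the weighted vote of three stumps fits all of them. This is the effect the abstract attributes to combining multiple classifiers.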
Experimental data structure.
| Data set | Positive sample | Negative sample | Total |
|---|---|---|---|
| Training set | 100 | 2000 | 2100 |
| Test set | 100 | 2000 | 10500 |
Figure 3: Two ROC curves X and Y.
Random Forest parameter setting.
| Parameter name | Parameter values |
|---|---|
| Tree number | 1000 |
| Node size | 5 |
| The number of different descriptors tried at each split | 50 |
Figure 4: Comparative experiment of SRC.
Figure 5: Comparative experiment of Cathepsin K.
Experimental comparison of SVM, Adaboost-SVM, and Random Forest.
| Algorithm | Target protein | 10% EF | AUC |
|---|---|---|---|
| SVM | SRC | 4.7 | 0.734 |
| SVM | Cathepsin K | 3.9 | 0.683 |
| Adaboost-SVM | SRC | 5.5 | 0.821 |
| Adaboost-SVM | Cathepsin K | 4.8 | 0.802 |
| Random Forest | SRC | 5.3 | 0.805 |
| Random Forest | Cathepsin K | 4.5 | 0.783 |
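
The two metrics reported above, 10% EF and AUC, can be computed from a ranked list of screening scores roughly as follows. The source does not give its exact formulas, so this sketch assumes the common definitions: the enrichment factor is the hit rate in the top-ranked fraction divided by the hit rate in the whole library, and AUC is the probability that a randomly chosen active outscores a randomly chosen decoy. The function names and toy scores are hypothetical.

```python
def enrichment_factor(scores, labels, fraction=0.10):
    """Hit rate in the top `fraction` of the ranking over the overall hit rate."""
    n = len(scores)
    n_top = max(1, int(round(n * fraction)))
    # Rank compounds from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    total_hits = sum(labels)
    return (hits_top / n_top) / (total_hits / n)

def roc_auc(scores, labels):
    """AUC as the fraction of (active, decoy) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy ranking: 2 actives (label 1) among 10 compounds.
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
demo_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(demo_scores, demo_labels))  # EF at 10%
print(roc_auc(demo_scores, demo_labels))            # AUC
```

Under this definition the 10% EF cannot exceed 10, and a random ranking gives an EF near 1 and an AUC near 0.5, which provides a scale for reading the table.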