Ozgur Demir-Kavuk, Mayumi Kamada, Tatsuya Akutsu, Ernst-Walter Knapp.
Abstract
BACKGROUND: Machine learning methods are now used for many biological prediction problems involving drugs, ligands, or polypeptide segments of a protein. Building a prediction model requires a so-called training data set of molecules with measured target properties. For many such problems the size of the training data set is limited, because the measurements have to be performed in a wet lab. Furthermore, the problems considered are often complex, so it is not clear which molecular descriptors (features) are suitable for establishing a strong correlation with the target property. In many applications all available descriptors are used. This can lead to difficult machine learning problems when thousands of descriptors are considered and only a few (e.g. fewer than one hundred) molecules are available for training.
Year: 2011 PMID: 22026913 PMCID: PMC3224215 DOI: 10.1186/1471-2105-12-412
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Prediction results
| rank | task I | task II | task III | task IV (a) |
|---|---|---|---|---|
| first | 0.677 | 0.735 | 0.237 | -2.578 (0.593) |
| second | 0.627 | 0.612 | 0.201 | -2.560 (0.565) |
| third | 0.615 | 0.455 | 0.154 | -2.561 (0.472) |
| λ1 | 0.05 | 0.05 | 0.08 | 0.1 |
| predict (stage 1) | 0.667 | 0.642 | 0.205 | -2.573 (0.548) |
| features (b) | 50 | 43 | 56 | 41 |
| λ2 | 0.1 | 0.01 | 0.3 | 0.2 |
| predict (stage 2) | 0.691 | 0.668 | 0.131 | -2.574 (0.586) |
a Numbers in brackets are Spearman Rank Correlation Coefficients (SRCC) [29].
b Number of features remaining after L1 regularization.
Prediction results given as q² values, eq. (5), for all four CoEPrA regression tasks using a two-step optimization procedure. The first three rows show the results of the three best predictions for the respective CoEPrA tasks. Stage 1: only L1 regularization is used; all features whose corresponding parameters have absolute values smaller than 10⁻⁸ after optimization are removed. Stage 2: only L2 regularization is applied to the features remaining after stage 1. The regularization parameters λ1 and λ2 were determined using five repetitions of a 10-fold cross-validation procedure.
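The two-step procedure described in this caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the record does not say which optimizer or objective scaling was used, so cyclic coordinate descent is swapped in as a standard L1 solver, the loss is scaled by 1/(2n), and the function names `lasso_cd`, `ridge_fit`, and `two_stage_fit` as well as the synthetic data are hypothetical. Only the 10⁻⁸ pruning threshold is taken from the caption. Because the meaning of λ depends on the objective scaling, the λ1 and λ2 values in the table do not transfer directly to this sketch.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent.
    (Assumed solver; the source does not name one.)"""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # per-feature curvature
    resid = y.astype(float).copy()         # residual y - Xw for w = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of feature j with the residual that excludes feature j
            rho = X[:, j] @ resid / n + col_sq[j] * w[j]
            # soft-thresholding sets small coefficients exactly to zero
            w_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w

def ridge_fit(X, y, lam):
    """Minimise (1/2n)*||y - Xw||^2 + (lam/2)*||w||^2 in closed form."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def two_stage_fit(X, y, lam1, lam2, tol=1e-8):
    """Stage 1: L1 fit, drop features with |w| < tol. Stage 2: L2 refit on the rest."""
    w1 = lasso_cd(X, y, lam1)
    keep = np.abs(w1) >= tol               # threshold 1e-8 as stated in the caption
    w = np.zeros(X.shape[1])
    if keep.any():
        w[keep] = ridge_fit(X[:, keep], y, lam2)
    return w, keep

# Synthetic stand-in data (hypothetical): few samples, many features.
rng = np.random.default_rng(0)
X = rng.standard_normal((80, 200))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise features
true_w = np.zeros(200)
true_w[:5] = [3.0, -2.0, 1.5, 2.5, -3.0]
y = X @ true_w + 0.1 * rng.standard_normal(80)
y = y - y.mean()                           # centre target, no intercept needed

w, keep = two_stage_fit(X, y, lam1=0.3, lam2=0.1)
print(keep.sum(), "of", X.shape[1], "features kept after stage 1")
```

Separating the two stages means the L1 penalty is used only to decide which features survive, while the final coefficients come from the L2 fit on the reduced feature set. In practice λ1 and λ2 would be chosen by repeated cross-validation, as the caption states, rather than fixed by hand as in this sketch.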
Figure 1. Correlation diagrams for the four CoEPrA regression tasks. Blue crosses: recall performance on the training set. Red dots: prediction performance on the test set.
Figure 2. Overview of selected features. Feature selection with L1 regularization with increasing λ1 values (from top to bottom) for CoEPrA regression task I. Left vertical axis: λ1. Right vertical axis: number of remaining features after a cycle of L1 regularization with the corresponding λ1 value. Horizontal axis: feature index. The initial number of features is 6219. The colors, varying from black to yellow, show how often specific features reappear in the selection procedure after they disappeared in the preceding selection round. For instance, yellow marks features that reappeared seven times for the eight different λ1 values considered.
Figure 3. Prediction performance during feature selection. Prediction performance measured as q² values, eq. (5), on the test set plotted versus the number of features used.
Figure 4. Percentage of selected features with increasing λ1. Vertical axis: percentage of selected features. Horizontal axis: selected λ1 values on a logarithmic scale.
Overview of the data sets used.
| CoEPrA task | training set (a) | prediction set (b) | peptide length (c) | features (d) |
|---|---|---|---|---|
| 1 | 89 | 88 | 9 | 6219 |
| 2 | 76 | 76 | 8 | 5528 |
| 3 | 133 | 133 | 9 | 6219 |
| 4 | 133 | 47 | 9 | 6219 |
a Number of ligands in the training set. b Number of ligands in the prediction set. c Lengths of the oligopeptides for the four regression tasks. d Total number of considered features.