Robert Lowe, Robert C Glen, John B O Mitchell.
Abstract
Phospholipidosis is an adverse effect caused by numerous cationic amphiphilic drugs and can affect many cell types. It is characterized by the excess accumulation of phospholipids and is most reliably identified by electron microscopy of cells revealing the presence of lamellar inclusion bodies. The development of phospholipidosis can cause a delay in the drug development process, and the importance of computational approaches to the problem has been well documented. Previous work on predictive methods for phospholipidosis showed that state-of-the-art machine learning methods produced the best results. Here we extend this work by looking at a larger data set mined from the literature. We find that circular fingerprints lead to better models than either E-Dragon descriptors or a combination of the two. We also observe very similar performance in general between Random Forest and Support Vector Machine models.
Year: 2010 PMID: 20799726 PMCID: PMC2949053 DOI: 10.1021/mp100103e
Source DB: PubMed Journal: Mol Pharm ISSN: 1543-8384 Impact factor: 4.939
Average MCC Value for the Internal Validation and Test Set for Each of 10 Different Definitions of a 10-fold Cross Validation for the E-Dragon Descriptors
| | RF ( | | | | SVM (γ = 0.0003, | | | |
|---|---|---|---|---|---|---|---|---|
| fold definitions | int vals | σV | test | σT | int vals | σV | test | σT |
| 1 | 0.524 | 0.162 | | 0.144 | 0.517 | 0.208 | 0.500 | 0.228 |
| 2 | 0.538 | 0.143 | | 0.211 | 0.451 | 0.274 | 0.481 | 0.191 |
| 3 | 0.558 | 0.179 | | 0.176 | 0.570 | 0.212 | 0.462 | 0.182 |
| 4 | 0.538 | 0.111 | | 0.133 | 0.434 | 0.230 | 0.474 | 0.207 |
| 5 | 0.519 | 0.218 | | 0.194 | 0.439 | 0.258 | 0.476 | 0.245 |
| 6 | 0.543 | 0.259 | | 0.257 | 0.434 | 0.188 | 0.409 | 0.206 |
| 7 | 0.457 | 0.161 | | 0.188 | 0.521 | 0.132 | 0.519 | 0.198 |
| 8 | 0.497 | 0.147 | | 0.229 | 0.534 | 0.130 | 0.464 | 0.238 |
| 9 | 0.452 | 0.182 | 0.485 | 0.225 | 0.543 | 0.111 | | 0.293 |
| 10 | 0.567 | 0.155 | | 0.176 | 0.554 | 0.179 | 0.517 | 0.196 |
| av | 0.519 | 0.172 | 0.532 | 0.193 | 0.500 | 0.192 | 0.485 | 0.218 |
The average reported at the bottom is the averaged value of the columns. σV represents the standard deviation across the 10 different folds for the internal validation. Similarly, σT represents the standard deviation for the test set. Highlighted in bold is the highest MCC on the test fold for each fold definition.
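The quantities in this footnote (a per-fold MCC, its average, and its standard deviation across folds) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy confusion-matrix counts are hypothetical, and the sketch assumes the sample standard deviation (n − 1 denominator), which the source does not specify.

```python
import math
import statistics

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: return 0.0 when a marginal is empty (MCC undefined).
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def summarise_folds(confusions):
    """Return (mean MCC, sample standard deviation) over a list of folds."""
    scores = [mcc(*counts) for counts in confusions]
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical (tp, tn, fp, fn) counts for three folds, for illustration only.
toy_folds = [(8, 9, 1, 2), (7, 8, 2, 3), (9, 9, 1, 1)]
mean_mcc, sd_mcc = summarise_folds(toy_folds)
print(f"mean MCC = {mean_mcc:.3f}, sigma = {sd_mcc:.3f}")
```

Returning 0.0 for an undefined MCC is one common convention; a different choice would change the fold averages whenever a fold contains predictions of only one class.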
Average MCC Value for the Internal Validation and Test Set for Each of 10 Different Definitions of a 10-fold Cross Validation for the Combination of Descriptors
| | RF ( | | | | SVM (γ = 0.019, | | | |
|---|---|---|---|---|---|---|---|---|
| fold definitions | int vals | σV | test | σT | int vals | σV | test | σT |
| 1 | 0.502 | 0.171 | | 0.134 | 0.553 | 0.223 | 0.403 | 0.215 |
| 2 | 0.535 | 0.227 | 0.506 | 0.233 | 0.527 | 0.151 | 0.506 | 0.220 |
| 3 | 0.564 | 0.218 | | 0.165 | 0.584 | 0.229 | 0.516 | 0.140 |
| 4 | 0.512 | 0.070 | | 0.117 | 0.581 | 0.220 | 0.564 | 0.217 |
| 5 | 0.449 | 0.200 | | 0.140 | 0.429 | 0.260 | 0.478 | 0.221 |
| 6 | 0.511 | 0.169 | | 0.212 | 0.529 | 0.242 | 0.481 | 0.240 |
| 7 | 0.511 | 0.224 | 0.545 | 0.158 | 0.509 | 0.235 | | 0.214 |
| 8 | 0.476 | 0.221 | | 0.147 | 0.522 | 0.239 | 0.430 | 0.162 |
| 9 | 0.542 | 0.226 | 0.517 | 0.272 | 0.579 | 0.177 | | 0.269 |
| 10 | 0.531 | 0.096 | | 0.199 | 0.546 | 0.173 | 0.540 | 0.190 |
| av | 0.513 | 0.182 | 0.539 | 0.178 | 0.536 | 0.215 | 0.505 | 0.209 |
The average reported at the bottom is the averaged value of the columns. σV represents the standard deviation across the 10 different folds for the internal validation. Similarly, σT represents the standard deviation of the test set. Highlighted in bold is the highest MCC on the test fold for each fold definition.
Average MCC Value for the Internal Validation and Test Set for Each of 10 Different Definitions of a 10-fold Cross Validation for the CFP Descriptors
| | RF ( | | | | SVM (γ = 0.0110, | | | |
|---|---|---|---|---|---|---|---|---|
| fold definitions | int vals | σV | test | σT | int vals | σV | test | σT |
| 1 | 0.650 | 0.191 | 0.639 | 0.207 | 0.706 | 0.182 | | 0.212 |
| 2 | 0.638 | 0.259 | | 0.227 | 0.679 | 0.188 | 0.629 | 0.198 |
| 3 | 0.634 | 0.223 | 0.619 | 0.238 | 0.648 | 0.188 | | 0.205 |
| 4 | 0.591 | 0.167 | 0.639 | 0.171 | 0.600 | 0.142 | | 0.180 |
| 5 | 0.512 | 0.396 | 0.536 | 0.361 | 0.680 | 0.080 | | 0.152 |
| 6 | 0.650 | 0.264 | 0.626 | 0.243 | 0.668 | 0.223 | | 0.182 |
| 7 | 0.696 | 0.169 | 0.607 | 0.191 | 0.672 | 0.179 | | 0.168 |
| 8 | 0.624 | 0.188 | 0.610 | 0.190 | 0.663 | 0.182 | | 0.206 |
| 9 | 0.643 | 0.222 | 0.611 | 0.204 | 0.696 | 0.296 | | 0.234 |
| 10 | 0.675 | 0.161 | 0.653 | 0.180 | 0.708 | 0.146 | 0.653 | 0.202 |
| av | 0.631 | 0.224 | 0.619 | 0.221 | 0.672 | 0.181 | 0.650 | 0.194 |
The average reported at the bottom is the averaged value of the columns. σV represents the standard deviation across the 10 different folds for the internal validation. Similarly, σT represents the standard deviation for the test set. Highlighted in bold is the highest MCC on the test fold for each fold definition.
Figure 1. (a) SVM (γ = 0.0110, C = 0.841) results for y-scrambling. The MCC is plotted against the accuracy, ACC (the fraction of correct predictions). Blue stars show the repeated runs on different scrambled data. Red stars show our best model run on this split of the unscrambled data, repeated 10 times. For SVM this produces the same confusion matrix for all runs, as expected. (b) Random Forest (mtry = 4) results for y-scrambling. Here Random Forest produces 3 distinct confusion matrices for the 10 runs on the unscrambled data.
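The y-scrambling check described in this caption can be illustrated with a small, self-contained Python sketch. It is only an illustration under stated assumptions: a one-dimensional nearest-neighbour classifier stands in for the SVM and Random Forest models, the toy data are invented, only the training labels are permuted, and all function names are hypothetical.

```python
import math
import random

def mcc_and_acc(y_true, y_pred):
    """Return (MCC, ACC) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return mcc, (tp + tn) / len(y_true)

def predict_1nn(train_x, train_y, x):
    """Stand-in classifier: label of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def evaluate(train_x, train_y, test_x, test_y):
    preds = [predict_1nn(train_x, train_y, x) for x in test_x]
    return mcc_and_acc(test_y, preds)

def y_scramble(train_x, train_y, test_x, test_y, n_runs=10, seed=0):
    """Refit on permuted training labels; return one (MCC, ACC) per run."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        shuffled = list(train_y)
        rng.shuffle(shuffled)  # breaks the feature-label relationship
        results.append(evaluate(train_x, shuffled, test_x, test_y))
    return results

# Invented, well-separated toy data (not from the paper).
train_x = [i * 0.1 for i in range(10)] + [5 + i * 0.1 for i in range(10)]
train_y = [0] * 10 + [1] * 10
test_x = [0.05 + i * 0.2 for i in range(5)] + [5.05 + i * 0.2 for i in range(5)]
test_y = [0] * 5 + [1] * 5

real = evaluate(train_x, train_y, test_x, test_y)
scrambled = y_scramble(train_x, train_y, test_x, test_y)
print("unscrambled (MCC, ACC):", real)
print("scrambled runs:", scrambled)
```

If a model has learned a genuine relationship, the unscrambled point should sit well apart from the cloud of scrambled runs in the MCC versus ACC plane, which is the comparison the figure makes.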
Average Rank of the Features across All Runs for the Combined Data Set
| | feature | average rank |
|---|---|---|
| 1 | 0-Cac;1-C3;1-O.co2;1-O.co2 | 9.86 |
| 2 | 0-N3;1-C3;1-C3;1-C3 | 18.71 |
| 3 | 0-C2;1-C2 | 21.98 |
| 4 | 0-Car;1-Car;1-Car;1-Nar | 23.09 |
| 5 | 0-C3;1-C3 | 23.58 |
| 6 | Mor23u | 28.51 |
| 7 | 0-C3;1-C2;1-C3;1-N3 | 29.35 |
| 8 | 0-O.co2;1-Cac | 29.60 |
| 9 | Hypertens-50 | 29.70 |
| 10 | 0-C2;1-Nam;1-Nam;1-O2 | 30.12 |
The rank is determined from the feature selection performed using SVMAttributeEval on the training folds. Here the top 10 highest ranked features are shown. All features apart from 6 and 9 are represented in circular fingerprint notation. Mor23u relates to 3D-MoRSE—signal 23/unweighted and Hypertens-50 relates to Ghose−Viswanadhan−Wendoloski antihypertensive-like index at 50%. Both are descriptors calculated by E-Dragon.
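Averaging a feature's rank across runs, as in the table above, can be sketched as follows. This is a hedged illustration rather than the authors' procedure: it does not call SVMAttributeEval, it assumes every run ranks every feature, and the function name and placeholder feature names are hypothetical.

```python
from collections import defaultdict

def average_ranks(rankings):
    """Average each feature's 1-based position over several ranked lists.

    rankings: list of lists, each ordering the same features, best first.
    Returns (feature, average rank) pairs sorted by average rank.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for position, feature in enumerate(ranking, start=1):
            totals[feature] += position
    n_runs = len(rankings)
    averages = {feature: total / n_runs for feature, total in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1])

# Placeholder rankings from three hypothetical training folds.
runs = [
    ["feat_a", "feat_b", "feat_c"],
    ["feat_b", "feat_a", "feat_c"],
    ["feat_a", "feat_c", "feat_b"],
]
for feature, avg in average_ranks(runs):
    print(f"{feature}: {avg:.2f}")
```

A lower average rank means the feature was consistently placed near the top across training folds.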