| Literature DB >> 31007759 |
Robert Ancuceanu, Mihaela Dinu, Iana Neaga, Fekete Gyula Laszlo, Daniel Boda.
Abstract
SK-MEL-5 is a human melanoma cell line that has been used in various in vitro studies exploring new therapies against melanoma. In this study we report the development of quantitative structure-activity relationship (QSAR) models able to predict the cytotoxic effect of diverse chemical compounds on this cancer cell line. The dataset of cytotoxic and inactive compounds was downloaded from the PubChem database; it contains the data for all chemical compounds for which cytotoxicity results expressed as GI50 were recorded. In total, 13 blocks of molecular descriptors were computed and used, after appropriate pre-processing, in building QSAR models with four machine learning classifiers: random forests (RF), gradient boosting, support vector machines and random k-nearest neighbors. Among the 186 models reported, none had a positive predictive value (PPV) higher than 0.90 in both nested cross-validation and testing on an external dataset, but 7 models had a PPV higher than 0.85 in both evaluations; all seven used the RF algorithm as a classifier, with topological descriptors, information indices, 2D-autocorrelation descriptors, P-VSA-like descriptors, and edge-adjacency descriptors as the sets of features used for classification. The y-scrambling test was associated with considerably worse performance (confirming the non-random character of the models), and the applicability domain was assessed through three different methods.
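The y-scrambling test mentioned above can be illustrated with a toy sketch: a model refit on randomly permuted activity labels should perform much worse than the same model fit on the true labels. For brevity this uses a hypothetical 1-D nearest-centroid classifier on synthetic data, not the study's random forests or descriptors.

```python
# Hedged illustration of y-scrambling, under assumed toy data:
# separable 1-D features, a nearest-centroid classifier, and a
# comparison of accuracy on true vs. permuted labels.
import random

def fit_centroids(xs, ys):
    """Mean of x for each class label (0/1)."""
    c = {}
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        c[label] = sum(vals) / len(vals)
    return c

def accuracy(xs, ys, centroids):
    preds = [min((0, 1), key=lambda l: abs(x - centroids[l])) for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(4, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100

true_acc = accuracy(xs, ys, fit_centroids(xs, ys))

shuffled = ys[:]
random.shuffle(shuffled)
scrambled_acc = accuracy(xs, shuffled, fit_centroids(xs, shuffled))

# The true-label model should score far higher than the scrambled one,
# which is the non-randomness check the abstract describes.
print(true_acc, scrambled_acc)
```

A large drop under scrambling is evidence that the original model learned a real structure-activity signal rather than chance correlations.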
Keywords: QSAR; SK-MEL-5; gradient boosting; k-nearest neighbors; melanoma; random forests; support vector machines
Year: 2019 PMID: 31007759 PMCID: PMC6466999 DOI: 10.3892/ol.2019.10068
Source DB: PubMed Journal: Oncol Lett ISSN: 1792-1074 Impact factor: 2.967
Figure 1. Heat map depicting the chemical diversity of the substances used in our study, based on the Gower distance. The left column shows their activity (active or inactive), whereas in the heat map proper darker regions correspond to higher dissimilarity and lighter regions to lower dissimilarity. The density plot shows the distribution of the (scaled) Gower distances (dissimilarity).
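The Gower distance behind Figure 1 averages per-feature dissimilarities: numeric features are compared by range-scaled absolute difference and categorical features by equality. A minimal pure-Python sketch follows; the descriptor names and ranges are hypothetical, not the study's actual blocks.

```python
# Hedged sketch of the Gower distance, assuming mixed numeric/categorical
# descriptors stored as dicts. Feature names and ranges are invented.

def gower_distance(a, b, ranges):
    """Gower distance between two compounds described by mixed features.

    a, b   : dicts mapping feature name -> value
    ranges : dict mapping each *numeric* feature name to its dataset-wide
             range (max - min); features absent from `ranges` are
             treated as categorical.
    """
    total = 0.0
    for feat in a:
        if feat in ranges:                       # numeric: range-scaled difference
            total += abs(a[feat] - b[feat]) / ranges[feat]
        else:                                    # categorical: 0 if equal, else 1
            total += 0.0 if a[feat] == b[feat] else 1.0
    return total / len(a)

# Toy example: two numeric descriptors and one categorical flag
c1 = {"mol_weight": 300.0, "logp": 2.5, "aromatic": True}
c2 = {"mol_weight": 450.0, "logp": 4.0, "aromatic": False}
ranges = {"mol_weight": 600.0, "logp": 6.0}     # assumed dataset-wide ranges
print(round(gower_distance(c1, c2, ranges), 4))  # -> 0.5
```

Because every per-feature term lies in [0, 1], the resulting distance is directly comparable across compounds, which is what makes it usable for the heat map and density plot.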
Figure 2. Distribution of the two data sets (learning, n=316 and external, n=106) in bi-dimensional chemical space (molecular weight and atomic LogP). The triangles correspond to the training data set, whereas the circles correspond to the test set.
Performance of selected classification models with PPV higher than 75% for the 10-fold nested cross-validation.
| Models | Specificity | Sensitivity | PPV | Balanced accuracy | MMCE |
|---|---|---|---|---|---|
| Topological descriptors-RF ( | 0.9374 | 0.3583 | 0.8424 | 0.6479 | 0.3022 |
| Topological descriptors-RF ( | 0.9298 | 0.3628 | 0.7964 | 0.6463 | 0.3105 |
| Topological descriptors-RF ( | 0.9148 | 0.5752 | 0.8749 | 0.745 | 0.2548 |
| Topological descriptors-RF ( | 0.8946 | 0.499 | 0.8158 | 0.6968 | 0.3086 |
| Walk and path-RF ( | 0.9465 | 0.285 | 0.7587 | 0.6158 | 0.3231 |
| Information indices-RF ( | 0.9486 | 0.3434 | 0.8368 | 0.646 | 0.3003 |
| Information indices-RF ( | 0.9685 | 0.3448 | 0.8848 | 0.6566 | 0.2878 |
| Information indices-RF ( | 0.9022 | 0.634 | 0.8715 | 0.7681 | 0.2319 |
| Information indices-RF ( | 0.9023 | 0.5438 | 0.851 | 0.723 | 0.2776 |
| Information indices-BST ( | 0.78 | 0.7536 | 0.7803 | 0.7668 | 0.2344 |
| 2D-autocorrelation-RF ( | 0.927 | 0.3414 | 0.776 | 0.6342 | 0.3063 |
| 2D-autocorrelation-RF ( | 0.9687 | 0.3005 | 0.8707 | 0.6346 | 0.3063 |
| 2D-autocorrelation-RF ( | 0.9453 | 0.611 | 0.9201 | 0.7782 | 0.2289 |
| 2D-autocorrelation-RF ( | 0.9174 | 0.4858 | 0.8583 | 0.7016 | 0.2993 |
| Burden eigenvalues-RF ( | 0.941 | 0.3373 | 0.7943 | 0.6391 | 0.3063 |
| Burden eigenvalues-RF ( | 0.8803 | 0.6373 | 0.8417 | 0.7588 | 0.2427 |
| Burden eigenvalues-RF ( | 0.8445 | 0.6265 | 0.8057 | 0.7355 | 0.2641 |
| P-VSA-like-RF ( | 0.9327 | 0.3528 | 0.7825 | 0.6428 | 0.3058 |
| P-VSA-like-RF ( | 0.9332 | 0.3716 | 0.7996 | 0.6524 | 0.2967 |
| P-VSA-like-RF ( | 0.9149 | 0.6159 | 0.8891 | 0.7654 | 0.2369 |
| P-VSA-like-RF ( | 0.8919 | 0.5541 | 0.8273 | 0.723 | 0.283 |
| Eta indices-RF ( | 0.9384 | 0.3807 | 0.8394 | 0.6596 | 0.2872 |
| Edge adjacency-RF ( | 0.9412 | 0.3453 | 0.8242 | 0.6432 | 0.307 |
| Edge adjacency-RF ( | 0.9301 | 0.3652 | 0.8006 | 0.6477 | 0.3038 |
| Edge adjacency-RF ( | 0.9031 | 0.6477 | 0.8635 | 0.7754 | 0.2239 |
| Edge adjacency-SVM ( | 0.7663 | 0.7113 | 0.7519 | 0.7388 | 0.2696 |
| Global-BST ( | 0.793 | 0.8137 | 0.7899 | 0.8034 | 0.1994 |
| Global-BST ( | 0.7974 | 0.7957 | 0.7927 | 0.7966 | 0.202 |
RF, random forest classifier; BST, gradient boosting classifier; SVM, support vector machines; PPV, positive predictive value; MMCE, mean misclassification error. Numbers in brackets indicate the subset of features selected by the different feature selection algorithms (1, random forest importance and information gain; 2, symmetrical uncertainty); over denotes the training set balanced through oversampling; smote denotes the training set balanced through the SMOTE technique (synthetic minority oversampling technique). The first term in the name of each model indicates the block of descriptors used in its building.
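The five columns in the tables follow the usual binary-classification definitions and can all be derived from one confusion matrix. As a sketch (the counts below are illustrative and hypothetical, not the study's confusion matrices):

```python
# Hedged sketch: the table's metrics from a binary confusion matrix.
# tp/fp/tn/fn counts here are invented for illustration.

def classification_metrics(tp, fp, tn, fn):
    """Return the five metrics reported in the tables."""
    sensitivity = tp / (tp + fn)            # true-positive rate (recall)
    specificity = tn / (tn + fp)            # true-negative rate
    ppv = tp / (tp + fp)                    # positive predictive value (precision)
    balanced_accuracy = (sensitivity + specificity) / 2
    mmce = (fp + fn) / (tp + fp + tn + fn)  # mean misclassification error
    return {
        "Specificity": round(specificity, 4),
        "Sensitivity": round(sensitivity, 4),
        "PPV": round(ppv, 4),
        "Balanced accuracy": round(balanced_accuracy, 4),
        "MMCE": round(mmce, 4),
    }

# Illustrative (hypothetical) counts on a 106-compound test set:
print(classification_metrics(tp=40, fp=4, tn=50, fn=12))
```

Note that PPV depends on the class balance of the evaluation set, which is why a model can reach a high PPV while its sensitivity stays modest, as several RF rows above show.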
Performance of selected classification models with PPV higher than 75% on the independent data set.
| Models | Specificity | Sensitivity | PPV | Balanced accuracy | MMCE |
|---|---|---|---|---|---|
| Topological descriptors-RF ( | 0.9194 | 0.5 | 0.8148 | 0.7097 | 0.2547 |
| Topological descriptors-RF ( | 0.9194 | 0.5227 | 0.8214 | 0.721 | 0.2453 |
| Topological descriptors-RF ( | 0.9355 | 0.5682 | 0.8621 | 0.7518 | 0.217 |
| Topological descriptors-RF ( | 0.9516 | 0.5909 | 0.8966 | 0.7713 | 0.1981 |
| Walk and path-RF ( | 0.9516 | 0.2727 | 0.8 | 0.6122 | 0.3302 |
| Information indices-RF ( | 1 | 0.5 | 1 | 0.75 | 0.2075 |
| Information indices-RF ( | 0.9839 | 0.5227 | 0.9583 | 0.7533 | 0.2076 |
| Information indices-RF ( | 1 | 0.5227 | 1 | 0.7614 | 0.1981 |
| Information indices-RF ( | 1 | 0.5682 | 1 | 0.7841 | 0.1792 |
| Information indices-BST ( | 0.9355 | 0.75 | 0.8919 | 0.8427 | 0.1415 |
| 2D-autocorrelation-RF ( | 0.9355 | 0.3864 | 0.8095 | 0.6609 | 0.2924 |
| 2D-autocorrelation-RF ( | 0.9677 | 0.4091 | 0.9 | 0.6884 | 0.2642 |
| 2D-autocorrelation-RF ( | 0.9032 | 0.5 | 0.7857 | 0.7016 | 0.2642 |
| 2D-autocorrelation-RF ( | 0.9194 | 0.4773 | 0.8077 | 0.6983 | 0.2642 |
| Burden eigenvalues-RF ( | 0.9516 | 0.4773 | 0.875 | 0.7144 | 0.2453 |
| Burden eigenvalues-RF ( | 0.9516 | 0.5909 | 0.8966 | 0.7713 | 0.1981 |
| Burden eigenvalues-RF ( | 0.9355 | 0.5682 | 0.8621 | 0.7518 | 0.217 |
| P-VSA-like-RF ( | 0.9783 | 0.6562 | 0.9545 | 0.8173 | 0.1538 |
| P-VSA-like-RF ( | 0.9783 | 0.6875 | 0.9565 | 0.8329 | 0.141 |
| P-VSA-like-RF ( | 0.9783 | 0.7812 | 0.9615 | 0.8798 | 0.1026 |
| P-VSA-like-RF ( | 0.9783 | 0.9062 | 0.9667 | 0.9423 | 0.0513 |
| Eta indices-RF ( | 0.9032 | 0.4318 | 0.76 | 0.6675 | 0.2924 |
| Edge adjacency-RF ( | 0.9839 | 0.4545 | 0.9524 | 0.7192 | 0.2358 |
| Edge adjacency-RF ( | 0.9839 | 0.3864 | 0.9444 | 0.6851 | 0.2642 |
| Edge adjacency-RF ( | 0.9516 | 0.4545 | 0.8696 | 0.7031 | 0.2547 |
| Edge adjacency-SVM ( | 0.9023 | 0.6364 | 0.8235 | 0.7698 | 0.2076 |
| Global-BST ( | 0.8871 | 0.9318 | 0.8542 | 0.9095 | 0.0943 |
| Global-BST ( | 0.9032 | 0.9318 | 0.8723 | 0.9175 | 0.0849 |
RF, random forest classifier; BST, gradient boosting classifier; SVM, support vector machines; PPV, positive predictive value; MMCE, mean misclassification error. Numbers in brackets indicate the subset of features selected by the different feature selection algorithms (1, random forest importance and information gain; 2, symmetrical uncertainty); over denotes the training set balanced through oversampling; smote denotes the training set balanced through the SMOTE technique (synthetic minority oversampling technique). The first term in the name of each model indicates the block of descriptors used in its building.
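The SMOTE balancing referenced in the footnotes synthesizes new minority-class samples by interpolating between a minority compound's descriptor vector and one of its minority-class nearest neighbours. A minimal pure-Python sketch, under assumed toy 2-D descriptor vectors (not the study's code or data):

```python
# Hedged sketch of SMOTE (synthetic minority oversampling technique).
# `minority` is a list of descriptor vectors (lists of floats); all
# names and numbers below are illustrative.
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic samples from `minority`."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (Euclidean, excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        partner = rng.choice(neighbours)
        t = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, partner)])
    return synthetic

# Toy minority class (e.g. "active" compounds in a 2-descriptor space)
actives = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [1.1, 2.1]]
new_points = smote(actives, n_new=4)
print(len(new_points))   # 4 synthetic minority samples
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the minority class's region of descriptor space, unlike plain oversampling, which only duplicates existing rows.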