María Jimena Martínez, Ignacio Ponzoni, Mónica F Díaz, Gustavo E Vazquez, Axel J Soto.
Abstract
BACKGROUND: The design of QSAR/QSPR models is a challenging problem in which selecting the most relevant descriptors is a key step. Several feature selection methods that address this step concentrate on statistical associations among descriptors and target properties and leave chemical knowledge out of the analysis. As a result, the interpretability and generality of the QSAR/QSPR models obtained with these feature selection methods are drastically affected. An approach that integrates domain experts' knowledge into the selection process is therefore needed to increase confidence in the final set of descriptors.
Keywords: Cheminformatics; Feature selection; QSAR; Visual analytics
Year: 2015 PMID: 26300983 PMCID: PMC4540751 DOI: 10.1186/s13321-015-0092-4
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 (a) Primary (left) and secondary (right) undirected graphs (Gp and Gs, respectively), which focus on pairwise associations between descriptors. (b) Bipartite graph that represents the molecular descriptors grouped in each model. (c) Scatterplot and histograms showing the relationship between descriptors and the target property.
Candidate models obtained for log P_liver using the dataset reported by [11]
| Model | Predictive accuracy | Subset cardinality | #Frequent descriptors | #Descriptors shared with other model |
|---|---|---|---|---|
| M1 (ALOGP, Mor29u, AMW, Se, Pol) | R² = 0.81 | 5 | 2 | 2 |
| M2 (ALOGP, SP15, RDF015v, RDF020e, H6v) | R² = 0.76 | 5 | 1 | 1 |
| M3 (ALOGP, Mor29u, X4Av, ESpm15, Mor31e, Ui) | R² = 0.79 | 6 | 3 | 3 |
| M4 (ALOGP, Mor29u, X4Av, DP06, QZZv, Mor02v, F01[C–C]) | R² = 0.79 | 7 | 3 | 3 |
The second column shows the predictive accuracy of the “best” model after applying 4-fold cross validation with three different methods (linear regression, decision trees, and neural networks). In this case, the best predictive accuracy for the four models was obtained by using a decision tree (M5P). The parameter setup and predictive accuracy for all methods are available in Additional file 1: Table S1.
Fig. 2 (a) Relationships among models and descriptors. Frequent descriptors correspond to nodes that are filled with a darker gray color. (b) Visualization when hovering over M3. (c) Visualization when hovering over M4.
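The membership relation behind the bipartite graph, and the frequent-descriptor counts in the candidate-model table, can be approximated in a few lines. This is a minimal sketch, not the tool's implementation: it assumes that a descriptor counts as "frequent" when it appears in at least two candidate models, which is consistent with the table but is not stated in this excerpt. The names `models`, `usage`, `frequent` and `frequent_per_model` are hypothetical.

```python
from collections import Counter

# Descriptor subsets copied from the candidate-model table for log Pliver.
models = {
    "M1": ["ALOGP", "Mor29u", "AMW", "Se", "Pol"],
    "M2": ["ALOGP", "SP15", "RDF015v", "RDF020e", "H6v"],
    "M3": ["ALOGP", "Mor29u", "X4Av", "ESpm15", "Mor31e", "Ui"],
    "M4": ["ALOGP", "Mor29u", "X4Av", "DP06", "QZZv", "Mor02v", "F01[C–C]"],
}

# Number of models that use each descriptor (degree of the descriptor node).
usage = Counter(d for descriptors in models.values() for d in descriptors)

# Assumption: "frequent" means the descriptor is used by two or more models.
frequent = {d for d, count in usage.items() if count >= 2}

# Count of frequent descriptors in each model.
frequent_per_model = {
    name: sum(d in frequent for d in descriptors)
    for name, descriptors in models.items()
}

print(sorted(frequent))
print(frequent_per_model)
```

Under this assumption the per-model counts agree with the "#Frequent descriptors" column of the table.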
Fig. 3 Links with the four levels of mutual information (columns) between the descriptors of M1, M3 and M4 (rows). This filtering can be obtained by following these steps: (1) select each model by clicking on the corresponding node on the bipartite graph; (2) having selected the entropy-based mode, move the edge threshold on Gp to the right until it stops; (3) filter edges by double-clicking over the color range above the graph.
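As a rough illustration of the quantity behind the entropy-based mode, the sketch below estimates the mutual information between two descriptor columns from equal-width histograms. The binning scheme, the default number of bins and the function names are assumptions made for this example; the excerpt does not say how the tool discretizes descriptor values or how it maps them to the four levels shown in Fig. 3.

```python
import math
from collections import Counter


def discretize(values, bins=4):
    """Map numeric values to equal-width bin indices 0..bins-1 (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # The maximum value would fall in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=4):
    """Histogram estimate of I(X;Y) in bits for two equally long columns."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (px[a] * py[b]))
    return mi


# Synthetic columns, for illustration only (not data from the paper).
identical = mutual_information(list(range(8)), list(range(8)))
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1], bins=2)
print(identical, unrelated)
```

A descriptor paired with itself yields its own entropy, while two columns whose bins are statistically independent yield zero, which is the contrast the edge colors are meant to convey.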
Fig. 4 Medium and high degree of co-occurrence between descriptors of M1 and M3. This filtering can be obtained by selecting each model by clicking on the corresponding node on the bipartite graph, and then modifying the edge threshold on Gs.
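One plausible reading of "co-occurrence" is the number of candidate models in which two descriptors appear together. The sketch below computes that count for every descriptor pair and then applies an edge threshold, loosely mirroring the threshold slider on Gs. Both the definition and the names `cooccurrence`, `edges_at_least` and `threshold` are assumptions; the excerpt does not specify how the tool derives the edge weights.

```python
from collections import Counter
from itertools import combinations

# Descriptor subsets copied from the candidate-model table for log Pliver.
models = {
    "M1": ["ALOGP", "Mor29u", "AMW", "Se", "Pol"],
    "M2": ["ALOGP", "SP15", "RDF015v", "RDF020e", "H6v"],
    "M3": ["ALOGP", "Mor29u", "X4Av", "ESpm15", "Mor31e", "Ui"],
    "M4": ["ALOGP", "Mor29u", "X4Av", "DP06", "QZZv", "Mor02v", "F01[C–C]"],
}

# Count, for each unordered descriptor pair, how many models contain both.
cooccurrence = Counter()
for descriptors in models.values():
    for pair in combinations(sorted(descriptors), 2):
        cooccurrence[pair] += 1


def edges_at_least(threshold):
    """Return the descriptor pairs whose co-occurrence count meets the threshold."""
    return {pair for pair, count in cooccurrence.items() if count >= threshold}


print(edges_at_least(2))
```

Raising `threshold` removes the weaker edges first, so only pairs that recur across several models remain visible.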
Prediction accuracy and cardinality for the best ten models obtained by Soto’s method [5]
| Model | Predictive accuracy | Cardinality |
|---|---|---|
| M1 (Mn/MW, Sp, RHyDp, ETA_EtaP_F_L) | R² = 0.26 | 4 |
| M2 (Mn/MW, MDEO-11, D/Dr09, SMTIV) | R² = 0.32 | 4 |
| M3 (Mn/MW, nHBint4, nHBint10, ETA_dEpsilon_B) | R² = 0.56 | 4 |
| M4 (Mn/MW, nsCH3, nF6Ring, ALOGP2, RDCHI) | R² = 0.41 | 5 |
| M5 (Mn/MW, nROH, n6Ring, nHCsatu, ALOGP2) | R² = 0.68 | 5 |
| M6 (Mn/MW, nP, minHBa, T(O..P), ETA_Epsilon_3) | R² = 0.25 | 5 |
| M7 (Mn/MW, ETA_dEpsilon_B, C-005, SHaaCH, nHBint9, nCt) | R² = 0.31 | 6 |
| M8 (Mn/MW, ndssC, minHBint9, MSD, C-004, Mw/Mn (PDI), crosshead speed (CHS)) | R² = 0.39 | 7 |
| M9 (Mn/MW, Pol, Wap, maxHAvin, nHAvin, MWC04) | R² = 0.15 | 6 |
| M10 (Mn/MW, maxHBint6, ETA_dEpsilon_A, TIC2, ndO, nHdCH2) | R² = 0.48 | 6 |
The second column shows the predictive accuracy of the “best” model after applying 4-fold cross validation with three different methods (linear regression, decision trees, and neural networks). The parameter setup and predictive accuracy for all methods are available in Additional file 1: Table S2.
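The selection of a "best" method by 4-fold cross validation can be sketched as follows. This is an illustration under stated assumptions, not the authors' setup: the two learners (`fit_mean`, `fit_line`) are simple stand-ins for the linear regression, decision tree (M5P) and neural network methods named above, R² is computed on the pooled out-of-fold predictions, the folds are interleaved rather than randomized, and the data are synthetic.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_mean(xs, ys):
    """Baseline learner: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least squares on a single descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cross_validated_r2(xs, ys, fit, k=4):
    """Train on k-1 folds, predict the held-out fold, score pooled predictions."""
    n = len(xs)
    preds = [None] * n
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # interleaved folds (assumption)
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        model = fit(train_x, train_y)
        for i in test_idx:
            preds[i] = model(xs[i])
    return r2_score(ys, preds)


# Synthetic, perfectly linear data used only to exercise the code.
xs = [float(i) for i in range(12)]
ys = [2.0 * x + 1.0 for x in xs]

methods = {"mean": fit_mean, "line": fit_line}
scores = {name: cross_validated_r2(xs, ys, fit) for name, fit in methods.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Reporting only the highest cross-validated score per descriptor subset, as the tables do, keeps the comparison between subsets independent of any single learning method.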
Fig. 5 (a) Membership relation between models and descriptors. Frequent descriptors correspond to nodes that are filled in shades of gray. (b) View after hovering over the most frequent descriptor.
Fig. 6 Mutual information among the three most frequent descriptors: Mn/MW is more independent of the other two descriptors, whereas ETA_dEpsilon_B and ALOGP2 show the opposite behavior with respect to each other. This filtering can be obtained by following these steps: (1) having selected the entropy-based mode, check the option “Deselect All” on the list of descriptors; (2) select these three descriptors from the list.
Fig. 7 Co-occurrence degree among frequent descriptors. This filtering can be obtained using the same steps described in Fig. 6 and then modifying the edge threshold.
Fig. 8 Mutual information (high, medium and low) between descriptors of Models 3, 4, 5 and 7. This filtering can be obtained by following these steps: (1) select each model by clicking on the corresponding node on the bipartite graph; (2) having selected the entropy-based mode, move the edge threshold on Gp to the right until it stops.
Fig. 9 Scatterplots of descriptor values vs. elongation at break for M5. This can be obtained by clicking on the corresponding Gs node. Descriptors Mn/MW, ALOGP2, n6Ring, nROH and nHCsatu are plotted in panels a–e, respectively.
Predictive accuracy of the models M3, (M3 + CHS) and (M3 + CHS − nHBint4)
| Model | Predictive accuracy |
|---|---|
| M3 (Mn/MW, nHBint4, nHBint10, ETA_dEpsilon_B) | R² = 0.56 |
| M3 + CHS (Mn/MW, nHBint4, nHBint10, ETA_dEpsilon_B, CHS) | R² = 0.62 |
| M3 + CHS − nHBint4 (Mn/MW, nHBint10, ETA_dEpsilon_B, CHS) | R² = 0.69 |
The second column shows the predictive accuracy of the “best” model after applying 4-fold cross validation with three different methods (linear regression, decision trees, and neural networks). In this case, the best predictive accuracy for the three models was obtained by using a decision tree (M5P). The parameter setup and predictive accuracy for all methods are available in Additional file 1: Table S3.
Fig. 10 Scatterplots of descriptor values vs. elongation at break for M3 + CHS. Descriptors nHBint4, nHBint10, crosshead speed (CHS), ETA_dEpsilon_B and Mn/MW are plotted in panels a–e, respectively.