Matthias Dehmer, Frank Emmert-Streib, Shailesh Tripathi.
Abstract
Molecular descriptors have been explored extensively. From these studies, it is known that a large number of descriptors are strongly correlated and capture similar characteristics of molecules. In this paper, we evaluate 919 Dragon descriptors from 6 different categories by means of clustering. We also analyze these categories of descriptors and identify a subset of descriptors that are least correlated with each other and, hence, characterize molecular graphs distinctively.
Year: 2013 PMID: 24391854 PMCID: PMC3877108 DOI: 10.1371/journal.pone.0083956
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
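A contingency table that defines the overlap between two cluster solutions, $U$ and $V$.
| $U \backslash V$ | $V_1$ | $V_2$ | $\cdots$ | $V_C$ | Sums |
| $U_1$ | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1C}$ | $a_1$ |
| $U_2$ | $n_{21}$ | $n_{22}$ | $\cdots$ | $n_{2C}$ | $a_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $U_R$ | $n_{R1}$ | $n_{R2}$ | $\cdots$ | $n_{RC}$ | $a_R$ |
| Sums | $b_1$ | $b_2$ | $\cdots$ | $b_C$ | $N$ |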
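As an illustration of how such a contingency table can be built, the following is a minimal sketch that counts, for two label vectors over the same objects, how many objects fall into each pair of clusters, together with the row and column sums. The function name `contingency_table` and the toy label vectors are hypothetical and are not taken from the paper.

```python
from collections import Counter


def contingency_table(labels_u, labels_v):
    """Cross-tabulate two cluster solutions defined on the same objects.

    Returns (rows, cols, counts, row_sums, col_sums, total), where
    counts[i][j] is the number of objects placed in cluster rows[i] by the
    first solution and in cluster cols[j] by the second solution.
    """
    if len(labels_u) != len(labels_v):
        raise ValueError("both clusterings must label the same objects")
    rows = sorted(set(labels_u))
    cols = sorted(set(labels_v))
    pair_counts = Counter(zip(labels_u, labels_v))
    counts = [[pair_counts[(r, c)] for c in cols] for r in rows]
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    return rows, cols, counts, row_sums, col_sums, len(labels_u)


# Toy example (made-up labels): six objects, two cluster solutions.
u = [0, 0, 0, 1, 1, 2]
v = ["a", "a", "b", "b", "b", "b"]
rows, cols, counts, row_sums, col_sums, total = contingency_table(u, v)
for label, row, row_sum in zip(rows, counts, row_sums):
    print(label, row, row_sum)
print("Sums", col_sums, total)
```

The row and column sums are the marginals that pair-counting measures such as the adjusted Rand index are computed from.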
Figure 1. Hierarchical clustering using the average-linkage algorithm for the three data sets (left, middle, right).
The total number of descriptors equals 919. They belong to 6 different categories which are as follows: connectivity indices (24), edge adjacency indices (301), topological indices (57), walk path counts (28), information indices (40) and 2D Matrix-based (469).
Figure 2. The normalized mutual information (NMI) between the reference clusters and the clusters obtained by hierarchical clustering, as a function of the number of clusters, for the three data sets (left, right, bottom). The NMI for each number of clusters was generated by sampling the data sets.
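The normalized mutual information between a reference partition and a predicted partition can be sketched as follows. The normalization used in the paper is not recoverable from this text; the version below divides the mutual information by the arithmetic mean of the two entropies, which is one common convention and is an assumption here. All function names and the toy labels are illustrative.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a partition given as a label list."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_u, labels_v):
    """Mutual information between two partitions of the same objects."""
    n = len(labels_u)
    count_u = Counter(labels_u)
    count_v = Counter(labels_v)
    joint = Counter(zip(labels_u, labels_v))
    return sum(
        (c / n) * log(c * n / (count_u[u] * count_v[v]))
        for (u, v), c in joint.items()
    )


def normalized_mutual_information(labels_u, labels_v):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    h_u, h_v = entropy(labels_u), entropy(labels_v)
    if h_u == 0.0 and h_v == 0.0:
        return 1.0  # both partitions consist of a single cluster
    return mutual_information(labels_u, labels_v) / ((h_u + h_v) / 2)


# Toy example (made-up labels).
reference = [0, 0, 1, 1]
same_up_to_relabeling = ["x", "x", "y", "y"]
independent = [0, 1, 0, 1]
nmi_same = normalized_mutual_information(reference, same_up_to_relabeling)
nmi_indep = normalized_mutual_information(reference, independent)
print(nmi_same, nmi_indep)
```

A value near 1 means the predicted clusters reproduce the reference categories; a value near 0 means the two partitions are statistically independent.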
Figure 3. Consensus indices using the adjusted Rand index for estimating the number of clusters in the data.
These plots have been generated by sampling the data sets; the panels show the three data sets (left, right, bottom). The dotted red line shows the optimal number of clusters.
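One common way to define a consensus index is to cluster several samples of the data with the same number of clusters and average the pairwise adjusted Rand index (ARI) over the resulting solutions; the number of clusters with the highest average agreement is then taken as optimal. The sketch below follows that formulation. Whether the paper uses exactly this definition is an assumption, and the function names and toy clusterings are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_u, labels_v):
    """Adjusted Rand index between two partitions of the same objects."""
    n = len(labels_u)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_u, labels_v)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_u).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_v).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both solutions have a single cluster
    return (sum_cells - expected) / (maximum - expected)


def consensus_index(clusterings):
    """Average pairwise ARI over a list of clusterings (assumed definition)."""
    pairs = list(combinations(clusterings, 2))
    return sum(adjusted_rand_index(u, v) for u, v in pairs) / len(pairs)


# Toy example (made-up labels): clusterings keyed by the number of clusters.
candidates = {
    2: [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]],  # stable across samples
    3: [[0, 0, 1, 2], [0, 1, 1, 2], [0, 1, 2, 2]],  # unstable across samples
}
scores = {k: consensus_index(c) for k, c in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In this toy setting the two-cluster solutions agree perfectly up to relabeling, so they receive the highest consensus index.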
The optimal number of clusters for the three data-sets obtained by using consensus indices (CI).
| Data-set | CI | # of clusters | # Descriptors in each cluster |
| First | 0.942 | 5 | |
| Second | 0.9878 | 16 | |
| Third | 1.00 | 7 | |
For each of the three data sets, the optimal clustering solution is represented by a set of clusters whose cardinality is the optimal number of clusters in the data.
The descriptors in predicted clusters (rows) overlapping with different categories of descriptors.
| Cluster number | connectivity indices | edge adjacency indices | topological indices | walk path counts | information indices | 2D Matrix-based |
| First data set | | | | | | |
| 1 | 24 | 261 | 56 | 28 | 25 | 469 |
| 2 | 0 | 22 | 0 | 0 | 0 | 0 |
| 3 | 0 | 18 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 15 | 0 |
| Second data set | | | | | | |
| 1 | 17 | 214 | 51 | 22 | 34 | 426 |
| 2 | 4 | 21 | 3 | 2 | 0 | 2 |
| 3 | 3 | 6 | 1 | 2 | 0 | 0 |
| 4 | 0 | 26 | 0 | 0 | 0 | 0 |
| 5 | 0 | 2 | 0 | 0 | 0 | 0 |
| 6 | 0 | 10 | 0 | 0 | 0 | 0 |
| 7 | 0 | 9 | 0 | 0 | 0 | 0 |
| 8 | 0 | 6 | 0 | 0 | 0 | 0 |
| 9 | 0 | 6 | 0 | 0 | 0 | 0 |
| 10 | 0 | 1 | 0 | 0 | 0 | 0 |
| 11 | 0 | 0 | 1 | 0 | 0 | 0 |
| 12 | 0 | 0 | 1 | 0 | 0 | 0 |
| 13 | 0 | 0 | 0 | 2 | 0 | 0 |
| 14 | 0 | 0 | 0 | 0 | 6 | 0 |
| 15 | 0 | 0 | 0 | 0 | 0 | 24 |
| 16 | 0 | 0 | 0 | 0 | 0 | 17 |
| Third data set | | | | | | |
| 1 | 24 | 287 | 56 | 28 | 14 | 425 |
| 2 | 0 | 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 12 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 26 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 27 |
| 6 | 0 | 0 | 0 | 0 | 0 | 14 |
| 7 | 0 | 0 | 0 | 0 | 0 | 3 |
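The overlap table can be checked for internal consistency, and the size of every predicted cluster can be derived from it, by summing rows and columns. The sketch below copies the counts from the table; the keys `first`, `second` and `third` are placeholder labels for the three data sets in the order in which they appear, not their actual names.

```python
# Category sizes as stated in the text: connectivity, edge adjacency,
# topological, walk path counts, information, 2D matrix-based.
CATEGORY_SIZES = [24, 301, 57, 28, 40, 469]

# Rows are predicted clusters, columns are descriptor categories.
overlap = {
    "first": [
        [24, 261, 56, 28, 25, 469],
        [0, 22, 0, 0, 0, 0],
        [0, 18, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 15, 0],
    ],
    "second": [
        [17, 214, 51, 22, 34, 426],
        [4, 21, 3, 2, 0, 2],
        [3, 6, 1, 2, 0, 0],
        [0, 26, 0, 0, 0, 0],
        [0, 2, 0, 0, 0, 0],
        [0, 10, 0, 0, 0, 0],
        [0, 9, 0, 0, 0, 0],
        [0, 6, 0, 0, 0, 0],
        [0, 6, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 2, 0, 0],
        [0, 0, 0, 0, 6, 0],
        [0, 0, 0, 0, 0, 24],
        [0, 0, 0, 0, 0, 17],
    ],
    "third": [
        [24, 287, 56, 28, 14, 425],
        [0, 2, 1, 0, 0, 0],
        [0, 12, 0, 0, 0, 0],
        [0, 0, 0, 0, 26, 0],
        [0, 0, 0, 0, 0, 27],
        [0, 0, 0, 0, 0, 14],
        [0, 0, 0, 0, 0, 3],
    ],
}


def cluster_sizes(table):
    """Number of descriptors in each predicted cluster (row sums)."""
    return [sum(row) for row in table]


def category_totals(table):
    """Number of descriptors in each category (column sums)."""
    return [sum(col) for col in zip(*table)]


sizes = {name: cluster_sizes(table) for name, table in overlap.items()}
for name, table in overlap.items():
    # Every column must add up to the stated category size.
    assert category_totals(table) == CATEGORY_SIZES
    print(name, sizes[name], sum(sizes[name]))
```

In all three data sets the first cluster holds the large majority of the 919 descriptors, and the remaining clusters are small and mostly confined to a single category.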
Given the selected subset of descriptors, each of the remaining descriptors forms at least one pair with a descriptor of the subset for which the summary statistic exceeds the threshold.
| Data-set | Names of the descriptors |
| First | SM3_L, H_Dt, AVS_B.v., SM02_EA.dm., Eig11_AEA.bo., SpMAD_AEA.ed., CIC2, Eig13_AEA.bo., AVS_B.s., SM06_AEA.dm., Eig14_AEA.dm., MAXDP, J_Dz.v., BIC4, SpDiam_AEA.dm., SpMAD_X, PJI2, SpPosA_B.m., IDDE |
| Second | SM2_B.s., PW4, Chi1_EA.ri., SM02_EA.dm., VE1_A, IC2, CENT, SM13_AEA.bo., Eig03_EA.bo., SM03_AEA.dm., VE3_Dz.p., piPC05, Eig04_AEA.bo., SpDiam_AEA.dm., piPC06, Eig02_AEA.dm., IVDE, MAXDP, PJI2, Eig05_AEA.dm., Chi0_EA.dm., Eig07_AEA.ed. |
| Third | QW_L, TIE, VE3_B.i., BIC1, VE3_Dz.i., Eig10_AEA.dm., SpPosLog_B.m., SM03_AEA.dm., Eig11_AEA.ri., SM04_AEA.dm., CSI, VE1_Dt, Eig08_EA.ed., SpMaxA_AEA.bo., Yindex, Ram, IVDE, Chi1_EA.dm. |
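A subset with the property described in the table caption can be obtained by a simple greedy pass: a descriptor is kept only if its summary statistic with every descriptor already kept stays at or below the threshold, so every rejected descriptor has at least one partner in the subset above the threshold. The sketch below is not the authors' procedure. It uses the absolute Pearson correlation as a stand-in for the summary statistic, and the threshold value, function names and toy descriptors `d1` to `d3` are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant descriptor: correlation undefined, treated as 0
    return cov / sqrt(var_x * var_y)


def select_weakly_correlated(descriptors, threshold):
    """Greedily keep descriptors whose |r| with all kept ones is <= threshold."""
    selected = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[s])) <= threshold for s in selected):
            selected.append(name)
    return selected


# Toy data (made-up values): one value per molecule for each descriptor.
toy = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.5],    # almost proportional to d1
    "d3": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with d1
}
subset = select_weakly_correlated(toy, threshold=0.9)
print(subset)
```

The result of a greedy pass depends on the order in which descriptors are visited, so different orderings can give different, equally valid subsets.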
Figure 4. Level plot of the correlations among the descriptors of the subset for the three data sets (left, right, bottom).
The number of descriptors of the subset that belong to each of the six categories, for the three data sets.
| Descriptor category | First data set | Second data set | Third data set |
| Connectivity indices | 0 | 0 | 0 |
| Edge adjacency indices | 7 | 11 | 7 |
| Topological indices | 2 | 4 | 3 |
| Walk path counts | 0 | 2 | 0 |
| Information indices | 3 | 2 | 3 |
| 2D Matrix-based | 7 | 3 | 5 |
The overlap between the subset of descriptors and the predicted clusters (rows).
| Cluster number | Descriptors of the subset |
| First data set | |
| 1 | SpMAD_AEA.ed., SpDiam_AEA.dm., Eig13_AEA.bo., Eig14_AEA.dm., MAXDP, IDDE, SM3_L, SpMAD_X, H_Dt, J_Dz.v., SpPosA_B.m., AVS_B.v., AVS_B.s. |
| 2 | SM02_EA.dm. |
| 3 | SM06_AEA.dm., Eig11_AEA.bo. |
| 4 | PJI2 |
| 5 | CIC2, BIC4 |
| Second data set | |
| 1 | SpDiam_AEA.dm., Eig03_EA.bo., Eig07_AEA.ed., Eig02_AEA.dm., PW4, IC2, SM2_B.s. |
| 2 | Chi1_EA.ri. |
| 3 | CENT, piPC05 |
| 4 | SM02_EA.dm. |
| 5 | Chi0_EA.dm. |
| 6 | Eig04_AEA.bo. |
| 7 | SM13_AEA.bo. |
| 8 | SM03_AEA.dm. |
| 9 | _ |
| 10 | Eig05_AEA.dm. |
| 11 | PJI2 |
| 12 | MAXDP |
| 13 | piPC06 |
| 14 | IVDE |
| 15 | VE1_A |
| 16 | VE3_Dz.p. |
| Third data set | |
| 1 | Eig08_EA.ed., Eig10_AEA.dm., Eig11_AEA.ri., CSI, TIE, Yindex, QW_L, SpMaxA_AEA.bo., IVDE, SpPosLog_B.m. |
| 2 | Chi1_EA.dm., Ram |
| 3 | SM03_AEA.dm., SM04_AEA.dm. |
| 4 | BIC1 |
| 5 | VE3_B.i. |
| 6 | VE3_Dz.i. |
| 7 | VE1_Dt |