Dávid Bajusz, Anita Rácz, Károly Héberger.
Abstract
BACKGROUND: Cheminformaticians have a very rich toolbox for carrying out molecular similarity calculations. A large number of molecular representations exist, and there are several methods (similarity and distance metrics) to quantify the similarity of molecular representations. In this work, eight well-known similarity/distance metrics are compared on a large dataset of molecular fingerprints using sum of ranking differences (SRD) and analysis of variance (ANOVA). The effects of molecular size, selection methods and data pretreatment methods on the outcome of the comparison are also assessed.
Keywords: Analysis of variance; Data fusion; Distance metrics; Fingerprint; Ranking; Similarity; Sum of ranking differences
Year: 2015 PMID: 26052348 PMCID: PMC4456712 DOI: 10.1186/s13321-015-0069-3
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Size classes of molecules and their definitions
| Size class | Definition | No. of compounds | Ref. |
|---|---|---|---|
| Fragment | MW ≤ 250 | 166,458 | [ |
| | log | | |
| | rotB ≤ 5 | | |
| Leadlike | 250 ≤ MW ≤ 350 | 1,234,403 | [ |
| | log | | |
| | rotB ≤ 7 | | |
| Druglike | 150 ≤ MW ≤ 500 | 3,745,649 | [ |
| | log | | |
| | rotB ≤ 7 | | |
| | PSA < 150 | | |
| | HBD ≤ 5 | | |
| | HBA ≤ 10 | | |
Formulas for the various similarity and distance metrics
| Metric | Formula for continuous variables | Formula for dichotomous variables |
|---|---|---|
| Manhattan distance | | |
| Euclidean distance | | |
| Cosine coefficient | | |
| Dice coefficient | | |
| Tanimoto coefficient | | |
| Soergel distance (b) | | |
| Substructure similarity | See Ref [ | |
| Superstructure similarity | See Ref [ | |
(a) S denotes similarities, while D denotes distances (according to the more commonly used formula for the given metric). Note that distances and similarities can be converted to one another using the equation given in the main text. x_jA denotes the j-th feature of molecule A. a is the number of on bits in molecule A, b is the number of on bits in molecule B, and c is the number of bits that are on in both molecules.
(b) The Soergel distance is the complement of the Tanimoto coefficient.
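The bit-count notation of the footnote (a, b, c) can be illustrated with a short sketch. The formula cells of the table are empty in this record, so the expressions below are the standard textbook forms of these metrics for binary fingerprints, not a transcription of the table. The function names and the toy fingerprints are hypothetical, and both fingerprints are assumed to have at least one on bit.

```python
from math import sqrt

def bit_counts(fp_a, fp_b):
    """Return (a, b, c): on bits in A, on bits in B, on bits shared by both."""
    a = sum(fp_a)
    b = sum(fp_b)
    c = sum(x & y for x, y in zip(fp_a, fp_b))
    return a, b, c

def binary_metrics(fp_a, fp_b):
    """Standard dichotomous forms; assumes neither fingerprint is all zeros."""
    a, b, c = bit_counts(fp_a, fp_b)
    tanimoto = c / (a + b - c)
    return {
        "manhattan": a + b - 2 * c,        # distance: number of differing bits
        "euclidean": sqrt(a + b - 2 * c),  # distance
        "cosine": c / sqrt(a * b),         # similarity
        "dice": 2 * c / (a + b),           # similarity
        "tanimoto": tanimoto,              # similarity
        "soergel": 1 - tanimoto,           # distance: complement of Tanimoto
    }

# Toy 8-bit fingerprints (illustrative only): a = 4, b = 4, c = 3.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = binary_metrics(fp_a, fp_b)
```

For these toy fingerprints the Tanimoto coefficient is 3/5 and the Soergel distance is 2/5, which matches the complement relation stated in footnote (b).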
Figure 1. Scheme of the procedure to calculate sum of ranking differences. The input matrix contains the similarity measures (n = 8) in the columns and the molecules (m = 99) in the rows. A reference column (gold standard; here, the average of the eight similarity measures) is added in the data fusion step (red). Then all columns are doubled (green) and the molecules in each column are ranked by increasing magnitude (columns r1, r2, … rn). For each similarity measure and each molecule (i.e. each cell), the difference (yellow columns) is calculated between its rank (r11, r12 to rnm) and the rank assigned by the known reference method (rR = q1, q2, … qm). In the last step, the absolute values of the differences are summed for each measure to give the final SRD values, which are then compared. A smaller SRD means closer proximity to the reference: the smaller, the better.
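The procedure in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the toy matrix are hypothetical, ties are given their average rank (an assumption, since the caption does not say how ties are handled), and all measures are assumed to be oriented the same way (all similarities or all distances) before averaging.

```python
def rank_ascending(values):
    """Rank by increasing magnitude (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks

def srd(matrix):
    """matrix[i][j] is the value of measure j for molecule i.

    Returns one SRD value per measure (column), using the row average
    as the reference column (the data fusion step).
    """
    n_measures = len(matrix[0])
    reference = [sum(row) / n_measures for row in matrix]
    ref_ranks = rank_ascending(reference)
    result = []
    for j in range(n_measures):
        col_ranks = rank_ascending([row[j] for row in matrix])
        result.append(sum(abs(r - q) for r, q in zip(col_ranks, ref_ranks)))
    return result

# Toy input: 4 molecules (rows) x 3 measures (columns), illustrative only.
example = [
    [0.1, 0.2, 0.8],
    [0.4, 0.3, 0.6],
    [0.6, 0.7, 0.5],
    [0.9, 0.8, 0.4],
]
srd_values = srd(example)
```

In this toy case the first two measures rank the molecules exactly as the row average does, so their SRD is 0, while the third measure reverses the order and reaches the largest SRD possible for four molecules.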
Distribution of SRD runs in terms of molecule size and selection method
| SRD runs | Molecule size | Selection method | No. of runs |
|---|---|---|---|
| 0-124 | Fragment | Random | 125 |
| 125-249 | Fragment | Diverse | 125 |
| 250-374 | Leadlike | Random | 125 |
| 375-499 | Leadlike | Diverse | 125 |
| 500-624 | Druglike | Random | 125 |
| 625-749 | Druglike | Diverse | 125 |
| 750-874 | All | Random | 125 |
| 875-999 | All | Diverse | 125 |
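The layout of the table above is regular (eight consecutive blocks of 125 runs, alternating between random and diverse selection within each size class), so a run index can be mapped back to its factor levels with integer arithmetic. The sketch below is only an illustration of that layout; `describe_run` and the constant names are hypothetical.

```python
SIZE_CLASSES = ["Fragment", "Leadlike", "Druglike", "All"]
SELECTIONS = ["Random", "Diverse"]
RUNS_PER_BLOCK = 125
TOTAL_RUNS = RUNS_PER_BLOCK * len(SIZE_CLASSES) * len(SELECTIONS)  # 1000

def describe_run(run_index):
    """Map an SRD run index (0-999) to (molecule size, selection method)."""
    if not 0 <= run_index < TOTAL_RUNS:
        raise ValueError(f"run index must be in 0..{TOTAL_RUNS - 1}")
    block = run_index // RUNS_PER_BLOCK  # 0..7, one block per table row
    return SIZE_CLASSES[block // 2], SELECTIONS[block % 2]

first = describe_run(0)
last = describe_run(999)
```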
Figure 2. Scheme of the data generation. The SRD procedure was repeated 1000 times to eliminate the effect of random choices. The sum of ranking differences was calculated for 1000 datasets and the results were gathered in an output file. The final output file contains a table with all of the SRD values for each similarity measure (n) on every dataset (m).
Figure 3. Visualization of SRD ranking and grouping. The average was used as reference. Scaled SRD values (between 0 and 100) are plotted on the x axis and the left y axis. The right y axis shows the relative frequencies for the Gaussian curve fitted to the SRD values of random numbers (black) (XX1 = 5% error limit, med = median, XX19 = 95% limit). If the SRD value of a similarity metric overlaps with the Gaussian curve, its ranking is not distinguishable from a random ranking.
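The comparison with random rankings can be approximated by simulation. The sketch below shuffles rankings of m objects, scales each SRD to the 0 to 100 range by dividing by the largest possible SRD, and reads off empirical 5%, 50% and 95% points. It is an assumption-laden stand-in for the fitted Gaussian curve in the figure: it uses empirical percentiles rather than a fitted distribution, and the function names, trial count and seed are hypothetical.

```python
import random

def srd_of_permutation(perm):
    """SRD between a ranking `perm` (a permutation of 1..m) and the reference 1..m."""
    return sum(abs(r - q) for q, r in enumerate(perm, start=1))

def max_srd(m):
    """Largest possible SRD for m ranked objects (reached by the reversed ranking)."""
    return (m * m) // 2

def random_srd_limits(m, trials=10000, seed=0):
    """Empirical 5%, 50% and 95% points of scaled SRD (0-100) for random rankings."""
    rng = random.Random(seed)
    ranks = list(range(1, m + 1))
    scaled = []
    for _ in range(trials):
        rng.shuffle(ranks)
        scaled.append(100.0 * srd_of_permutation(ranks) / max_srd(m))
    scaled.sort()

    def pick(p):
        return scaled[int(p * (trials - 1))]

    return pick(0.05), pick(0.50), pick(0.95)

# m = 99 molecules, as in the input matrix described for Figure 1.
xx1, med, xx19 = random_srd_limits(99)
```

A metric whose scaled SRD falls below `xx1` ranks the molecules closer to the reference than a random ranking would at the 5% level; one that falls between `xx1` and `xx19` cannot be told apart from random ranking.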
Figure 4. Box and whisker plot of the SRD values for eight similarity (and distance) metrics (with range scaling as data pretreatment method) in the SRDall dataset. The uncertainties (distributions) of the SRD values reveal equivalent similarity metrics (e.g. Eucl and Manh). The high SRD values of the Euclidean, Manhattan and Substructure similarities indicate that their ranking behavior is significantly different from the average of the eight metrics (consensus), while the Cosine, Dice, Soergel and Tanimoto similarities better represent the ranking based on the averages. A coefficient of 1 was used for the non-outlier range; a coefficient of 1.5 is the limit for outliers, and points beyond it are marked as extreme values.
Figure 5. An illustrative example of two-way ANOVA (sigma-restricted parameterization). A general, but not exclusive, trend is that SRD values are higher for the ranking of diversity-picked molecules, which implies that the consensus of the discussed similarity metrics gets weaker as more diverse compound sets are investigated. Influential factors are shown using weighted means. The line plots are shifted horizontally on the categorical x axis for clarity. The vertical bars denote 95% confidence intervals.
Figure 6. Effect of data pretreatment in the three-way ANOVA (sigma-restricted parameterization). The changes in SRD values are shown for different combinations of the factors. The data scaling methods are on the x axis and the selection method was (A) random draw or (B) diversity picking. With random draw, Substructure similarities produce significantly higher SRD values for the ranking of fragment-like compounds than for bigger molecules. With diversity-picked molecules, Euclidean (and also Manhattan) similarities tend to produce higher SRD values (i.e. deviate more from the consensus) as the size of the molecules increases. Weighted means were used to create the plot. The vertical bars denote 95% confidence intervals. (Manhattan and Soergel similarities were omitted for clarity.)
Figure 7. Comparison of diverse and random picking (three-way ANOVA with sigma-restricted parameterization) in the case of fragment-like molecules. With standardization, the SRD values differ markedly from those obtained with the other pretreatment methods. (This effect seems to be less pronounced for intentionally diverse molecules.) Weighted means were used to create the plot. The vertical bars denote 95% confidence intervals. (Manhattan and Soergel coefficients were omitted for clarity.)
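Two of the pretreatment methods named in the captions, range scaling and standardization, can be sketched as column-wise transformations of the input matrix. This is a minimal illustration under assumptions: the function names are hypothetical, the sample standard deviation is used for standardization (the captions do not specify which estimator was applied), and each column is assumed to contain at least two distinct values.

```python
from statistics import mean, stdev

def range_scale(column):
    """Map a column linearly onto [0, 1]; assumes max(column) > min(column)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def standardize(column):
    """Center to zero mean and scale to unit sample standard deviation."""
    mu, sd = mean(column), stdev(column)
    return [(x - mu) / sd for x in column]

# Toy column of metric values for four molecules (illustrative only).
column = [2.0, 4.0, 6.0, 8.0]
scaled = range_scale(column)
standardized = standardize(column)
```

Both transformations are monotonic, so neither changes the ranking within a single column. They can still change the SRD values, because they change how much each metric contributes to the row average that serves as the reference.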