Yiqun Cao, Tao Jiang, Thomas Girke.
Abstract
MOTIVATION: Similarity searching and clustering of chemical compounds by structural similarities are important computational approaches for identifying drug-like small molecules. Most algorithms available for these tasks are limited by their speed and scalability, and cannot handle today's large compound databases with several million entries.
Year: 2010 PMID: 20179075 PMCID: PMC2844998 DOI: 10.1093/bioinformatics/btq067
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Performance tests of EI-Search
| Dataset | NCI | PubChem Subset | PubChem Compound | PubChem Compound |
|---|---|---|---|---|
| Descriptor type | Atom pair | Atom pair | Atom pair | Fingerprint |
| Average search time (s) | | | | |
| Sequential search | 0.800 | 11.570 | 93.121 | 19.658 |
| EI-Search | 0.067 | 0.170 | 0.427 | 0.499 |
| Recall of EI-Search (%) | | | | |
| Mean | 99.95 | 99.60 | 97.38 | 96.32 |
| SD | 0.44 | 1.82 | 5.61 | 11.54 |
Search times and recall rates are listed for searching three large compound sets with EI-Search and the sequential search methods. The same descriptor type was used for each comparison pair. The experiments were performed on the same hardware using the same embedding and relaxation parameters (R = 300, D = 120 and γ = 30). The LSH parameters were supplied by lshkit.
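The search times above are easier to compare as speedup factors, and the recall rows presuppose a definition of recall. The following Python snippet is a minimal sketch of both. The timing values are copied from the table. The function names `speedup` and `recall` are illustrative, and the recall definition used here (the fraction of sequential-search hits that the approximate search also returns) is an assumption, since this record does not spell it out.

```python
# Average search times in seconds, copied from the table above:
# (sequential search, EI-Search) per dataset and descriptor type.
SEARCH_TIMES = {
    "NCI (atom pair)": (0.800, 0.067),
    "PubChem Subset (atom pair)": (11.570, 0.170),
    "PubChem Compound (atom pair)": (93.121, 0.427),
    "PubChem Compound (fingerprint)": (19.658, 0.499),
}


def speedup(sequential_s, ei_search_s):
    """Return how many times faster EI-Search is than sequential search."""
    return sequential_s / ei_search_s


def recall(approx_hits, exact_hits):
    """Fraction of the exact (sequential) hits recovered by the approximate search.

    Assumed definition: |approx ∩ exact| / |exact|.
    """
    exact = set(exact_hits)
    if not exact:
        return 1.0  # nothing to recover, so nothing was missed
    return len(exact & set(approx_hits)) / len(exact)


if __name__ == "__main__":
    for name, (seq, ei) in SEARCH_TIMES.items():
        print(f"{name}: {speedup(seq, ei):.1f}x faster")
    # Hypothetical compound IDs, for illustration only.
    print(recall(approx_hits=[1, 2, 3], exact_hits=[1, 2, 4]))
```

With the table values, the speedup ranges from roughly 12x on NCI to roughly 218x on PubChem Compound with atom pair descriptors.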
Performance tests for EI-Clustering
| Dataset | NCI | PubChem Subset | PubChem Compound | PubChem Compound |
|---|---|---|---|---|
| Similarity measure | Atom pair | Atom pair | Atom pair | Fingerprint |
| Total clustering time (h) | | | | |
| Jarvis–Patrick | 72.9 | 7355.6 | N/A | N/A |
| EI-Clustering | 3.5 | 92.2 | 1517.2 | 2869.71 |
| Jaccard coefficient | 0.9913 | 0.9887 | N/A | N/A |
The table compares the time and accuracy of EI-Clustering with Jarvis–Patrick clustering when the latter uses exhaustive search methods to generate the required nearest neighbor information. Compute time is given in hours of total CPU time. The agreement between the two clustering results is given in the last row in the form of Jaccard partition coefficients. Clustering the PubChem Compound dataset was not possible with the exhaustive search methods because they are too slow for a dataset of this size.
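The Jaccard partition coefficient in the last row can be illustrated with a short sketch. It assumes the common pair-counting definition: among all item pairs, the number placed in the same cluster by both clusterings, divided by the number placed in the same cluster by at least one of them. The function name and the toy labels are hypothetical. The quadratic pair loop is for illustration only and would not scale to the datasets in the table.

```python
from itertools import combinations


def jaccard_partition_coefficient(labels_a, labels_b):
    """Pair-counting Jaccard agreement between two clusterings.

    labels_a[i] and labels_b[i] are the cluster labels that the two
    clusterings assign to item i.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")

    both = 0    # pairs co-clustered in both clusterings
    either = 0  # pairs co-clustered in at least one clustering
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        if same_a or same_b:
            either += 1

    # If no pair is co-clustered anywhere, the partitions agree trivially.
    return both / either if either else 1.0


if __name__ == "__main__":
    # Toy example: the second clustering splits the last cluster in two.
    print(jaccard_partition_coefficient([0, 0, 1, 1], [0, 0, 1, 2]))
```

A value of 1.0 means the two clusterings group every pair identically, so the values reported in the table indicate near-identical partitions.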