Noel M O'Boyle, Geoffrey R Hutchison.
Abstract
BACKGROUND: Open Source cheminformatics toolkits such as OpenBabel, the CDK and the RDKit share the same core functionality but support different sets of file formats and forcefields, and calculate different fingerprints and descriptors. Despite their complementary features, using these toolkits in the same program is difficult as they are implemented in different languages (C++ versus Java), have different underlying chemical models and have different application programming interfaces (APIs).
Year: 2008 PMID: 19055766 PMCID: PMC2646723 DOI: 10.1186/1752-153X-2-24
Source DB: PubMed Journal: Chem Cent J ISSN: 1752-153X Impact factor: 4.215
Some features of toolkits which are not shared by all three toolkits.
| CDK |
| A large number of descriptors (some overlap with RDKit) |
| Pharmacophore searching (like RDKit*) |
| Calculation of maximum common substructure |
| 2D structure layout (like RDKit) and depiction |
| MACCS keys (also RDKit) and E-State fingerprints |
| Integration with the R statistical programming environment |
| Support for mass-spectrometry analysis (representations for cleavage reactions, structure generation from formulae) |
| Fragmentation schemes (ring fragments, Murcko) |
| 3D structure generation using a template and heuristics (like OpenBabel) |
| 3D similarity using ultrafast shape descriptors |
| Gasteiger π charge calculation |
| OpenBabel |
| Not just focused on cheminformatics |
| Supports a very large number of chemical file formats including quantum mechanics file formats, molecular mechanics trajectories, 2D sketchers |
| 3D structure generation using a template method (like CDK) |
| Included in all major Linux distributions |
| Bindings available from several scripting languages apart from Python, as well as the Java and .NET platforms |
| Conformation generation and searching |
| InChI (also CDK) and InChIKey generation |
| Support for crystallographic space groups |
| Several forcefield implementations: UFF (also RDKit), MMFF94, MMFF94s, Ghemical |
| Ability to add custom data types to atoms, bonds, residues, molecules |
| RDKit |
| A large number of descriptors (some overlap with CDK) |
| Fragmentation using RECAP rules |
| 2D coordinate generation (like CDK) and depiction |
| 3D coordinate generation using geometry embedding |
| Calculation of Cahn-Ingold-Prelog stereochemistry codes (R/S) |
| Pharmacophore searching (like CDK) |
| Calculation of shape similarity (based on volume overlap) |
| Chemical reaction handling and transforms |
| Atom pairs and topological torsions fingerprints |
| Feature maps and feature-map vectors |
| Machine-learning algorithms |
* Where the term "like" is used, it indicates that the implementation details differ.
An overview of the Cinfony API.
| Class | Purpose |
| Molecule | Wraps a molecule instance of the underlying toolkit and provides access to methods that act on molecules |
| Atom | Wraps an atom instance of the underlying toolkit |
| MoleculeData | Provides dictionary-like access to the information contained in the tag fields in SDF and MOL2 files |
| Outputfile | Handles multimolecule output file formats |
| Smarts | Wraps the SMARTS functionality of the toolkit in an analogous way to the Python 're' module for regular expression matching |
| Fingerprint | Simplifies Tanimoto calculation of binary fingerprints |
| Function | Purpose |
| readfile | Returns an iterator over the Molecules in a file |
| readstring | Returns a Molecule |
| Variable | Purpose |
| descs | A list of descriptor IDs |
| forcefields | A list of forcefield IDs |
| fps | A list of fingerprint IDs |
| informats | A list of input format IDs |
| outformats | A list of output format IDs |
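The Fingerprint entry in the overview above can be illustrated with a minimal sketch. It assumes that two fingerprints are compared with the `|` operator and that the result is their Tanimoto coefficient; the class below is a hypothetical, pure-Python stand-in that stores set bit indices and does not call any of the underlying toolkits, so it should not be read as Cinfony's actual implementation.

```python
class Fingerprint:
    """Hypothetical stand-in for a binary fingerprint wrapper.

    Stores the indices of the bits that are set; no toolkit is involved.
    """

    def __init__(self, bits):
        self.bits = frozenset(bits)

    def __or__(self, other):
        # Tanimoto coefficient: |A intersect B| / |A union B|
        union = self.bits | other.bits
        if not union:
            return 0.0  # assumed convention for two empty fingerprints
        return len(self.bits & other.bits) / len(union)


# Two made-up fingerprints sharing bits 1 and 5.
fp_a = Fingerprint([1, 5, 9, 12])
fp_b = Fingerprint([1, 5, 7])

similarity = fp_a | fp_b  # 2 shared bits out of 5 distinct bits
print(f"Tanimoto = {similarity:.2f}")
```

Overloading an operator in this way keeps a similarity calculation to a single expression, which is the kind of simplification the table entry describes.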
Figure 1. Relationship of Cinfony modules to Open Source toolkits. Python modules are accessible from CPython (green), Jython (pale blue), or both (striped green and pale blue). Java libraries are indicated by dark blue, while C++ libraries are yellow.
Performance of Cinfony modules compared to a native Java or C++ implementation.
| | Time (s) | Normalised | Time (s) | Normalised |
| Native Java | 21.2 | 1.00 | 36.8 | 1.00 |
| | 23.1 | 1.09 | 41.6 | 1.13 |
| | 33.0 | 1.57 | 69.5 | 1.89 |
| Native C++ | 31.9 | 1.00 | 43.0 | 1.00 |
| | 34.1 | 1.07 | 45.1 | 1.05 |
| | 38.0 | 1.19 | 49.6 | 1.15 |
| Native C++ | 99.7 | 1.00 | 100.7 | 1.00 |
| | 99.9 | 1.00 | 101.0 | 1.00 |
The times reported are wall-clock times from the best of three runs on a dual-core Intel Pentium 4 3.2 GHz machine with 1 GB RAM.
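The "Normalised" columns appear to be each time divided by the corresponding native time. The sketch below checks that reading against the second block of rows (the first native C++ row and the two rows beneath it); the function and variable names are illustrative only. Values recomputed from the rounded timings can differ from a published figure in the last digit, presumably because the original ratios were taken from unrounded timings.

```python
def normalise(times, native):
    """Return each time divided by the native time, rounded to 2 decimal places."""
    return [round(t / native, 2) for t in times]


# Timings in seconds from the table, native C++ row first.
first_column = [31.9, 34.1, 38.0]
second_column = [43.0, 45.1, 49.6]

norm_first = normalise(first_column, native=first_column[0])
norm_second = normalise(second_column, native=second_column[0])

print(norm_first)
print(norm_second)
```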
Figure 2. Comparison of depictions of PubChem CID 7250053 using different toolkits. The depiction using the development version of RDKit showed incorrect stereochemistry for the isopropyl substituent of the thiazole ring.