Egon L Willighagen, John W Mayfield, Jonathan Alvarsson, Arvid Berg, Lars Carlsson, Nina Jeliazkova, Stefan Kuhn, Tomáš Pluskal, Miquel Rojas-Chertó, Ola Spjuth, Gilleain Torrance, Chris T Evelo, Rajarshi Guha, Christoph Steinbeck.
Abstract
BACKGROUND: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, however, the code base has grown significantly, resulting in many complex interdependencies among components and poor performance of many algorithms.
Keywords: Bioinformatics; Cheminformatics; Depiction; Java; Metabolomics
Year: 2017 PMID: 29086040 PMCID: PMC5461230 DOI: 10.1186/s13321-017-0220-4
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Atom type information specified for a -hybridized carbon
Fig. 2 Relative storage of stereochemistry. The type and focus of stereochemistry are fixed for a given stereocenter description, but the carriers and configuration are relative. The multiple rows for each stereochemistry type are different internal representations that would be considered equivalent. In the tetrahedral types, hydrogens may be suppressed in a molecular graph, so the focus is reused in the carriers list as a placeholder
Fig. 3 The raw input files of CHEMBL23970 and CHEMBL444314 are displayed (ChEMBL 21). Without perceiving the stereochemistry indicated by the Haworth projection in CHEMBL23970, the database entries are incorrectly considered distinct. Downstream aggregation databases mirror this separation (PubChem CID 5280, CID 65119)
Fig. 4 Integrated example showing the rendering and SMILES parsing functionality. Example from U.S. Patent US 2014 231770 A1 para 287
Fig. 5 The improved structure diagram generation (SDG) resolves overlaps better. The original SDG code used general heuristics (left), and the OverlapResolver would fine-tune the layout to ensure atoms were not placed at the same location (middle). The new SDG algorithm is able to make more rigorous changes, making the final output much more pleasing (right)
Fig. 6 Structure diagram generation for structures with double bond and tetrahedral stereochemistry
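The relative storage described in the Fig. 2 caption can be illustrated with a small sketch. It assumes a tetrahedral center whose carriers are listed in some order together with a winding, and treats two descriptions as equivalent when an even permutation of the carriers keeps the winding or an odd permutation flips it. The class name, the `Winding` enum, and the string atom labels are hypothetical and are not the CDK's own types.

```java
import java.util.Arrays;
import java.util.List;

class RelativeStereoSketch {
    // Hypothetical configuration labels for a tetrahedral center.
    enum Winding { CLOCKWISE, ANTICLOCKWISE }

    /** True if reordering carriers 'a' into the order of 'b' is an even permutation. */
    static boolean isEvenPermutation(List<String> a, List<String> b) {
        int n = a.size();
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) {
            perm[i] = b.indexOf(a.get(i));
            if (perm[i] < 0) {
                throw new IllegalArgumentException("carrier lists differ: " + a.get(i));
            }
        }
        // Count inversions; their parity is the parity of the permutation.
        int inversions = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (perm[i] > perm[j]) inversions++;
            }
        }
        return inversions % 2 == 0;
    }

    /** Two descriptions of the same focus are equivalent if parity and winding agree. */
    static boolean equivalent(List<String> c1, Winding w1, List<String> c2, Winding w2) {
        if (c1.size() != c2.size()) return false;
        boolean even = isEvenPermutation(c1, c2);
        return even == (w1 == w2);
    }

    public static void main(String[] args) {
        List<String> ref = Arrays.asList("a", "b", "c", "d");
        // Swapping two carriers and flipping the winding describes the same center.
        System.out.println(equivalent(ref, Winding.CLOCKWISE,
                Arrays.asList("b", "a", "c", "d"), Winding.ANTICLOCKWISE));
        // Swapping two carriers without flipping the winding describes the mirror image.
        System.out.println(equivalent(ref, Winding.CLOCKWISE,
                Arrays.asList("b", "a", "c", "d"), Winding.CLOCKWISE));
    }
}
```

The sketch assumes the carriers are distinct, which still holds when the focus stands in for a suppressed hydrogen.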
Evaluation of molecular formula generators
| Input | Mass tolerance (±Da) | Formulas generated (HR2) | Formulas generated (PFG) | Formulas generated (CDK) | Runtime in s (HR2) | Runtime in s (PFG) | Runtime in s (CDK) |
|---|---|---|---|---|---|---|---|
| 10,000 small masses | 0.001 | 616,846 | 616,846 | 616,843 | 669 | 168 | 41 |
| 10,000 small masses | 0.01 | 6,163,303 | 6,163,302 | 6,163,326 | 689 | 501 | 212 |
| 20 large masses | 0.001 | 4,912,939 | 4,912,939 | 4,912,904 | 26,370 | 1,292 | 177 |
| 20 large masses | 0.01 | 49,128,811 | 49,128,810 | 49,128,815 | 26,587 | 3,406 | 1,580 |
The resulting formula counts and runtimes of the HR2, PFG, and CDK chemical formula generators on two different inputs with two different mass tolerance settings. For the set of small masses, 10,000 mass values in the range of 0–500 Da were randomly selected from the Global Natural Products Social Molecular Networking database [64]. For the set of large masses, 20 mass values in the range of 1500–3500 Da were randomly selected from the same database. Formulas were generated using chemical elements C, H, N, O, P, S without bounds (the allowed atom count was set to 0–10,000 for each element). All heuristic filtering rules were disabled for the purpose of the evaluation. The slight differences in the number of generated formulas were caused by different isotope masses embedded in each software and/or by rounding errors during calculation. The runtimes are average values from three independent runs performed on three different 16-core Intel Xeon 2.9 GHz CPU workstations equipped with 189 GB RAM, running Ubuntu Linux version 12.04.5 LTS and OpenJDK Java runtime version 1.7.0_101
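To make the evaluated task concrete, the following is a minimal sketch of formula generation for the elements C, H, N, O, P, and S within a mass tolerance and with no heuristic filters. It is a naive exhaustive search written for illustration, not the algorithm used by HR2, PFG, or the CDK. The class name and the hard-coded monoisotopic masses are assumptions of this sketch; as the note above observes, different embedded isotope masses can change the counts slightly.

```java
import java.util.ArrayList;
import java.util.List;

class FormulaEnumerationSketch {
    // Elements in Hill order with assumed monoisotopic masses in Da.
    static final String[] SYMBOLS = {"C", "H", "N", "O", "P", "S"};
    static final double[] MASSES = {12.0, 1.00782503, 14.0030740, 15.9949146, 30.9737620, 31.9720707};
    // Search order: heaviest elements first, hydrogen last.
    static final int[] ORDER = {5, 4, 3, 2, 0, 1};

    /** All formulas whose mass lies within target ± tol. */
    static List<String> generate(double target, double tol) {
        List<String> out = new ArrayList<>();
        search(0, target, tol, new int[SYMBOLS.length], out);
        return out;
    }

    private static void search(int depth, double remaining, double tol, int[] counts, List<String> out) {
        int element = ORDER[depth];
        double m = MASSES[element];
        if (depth == ORDER.length - 1) {
            // Last element: compute the admissible counts directly instead of looping from zero.
            int lo = Math.max(0, (int) Math.ceil((remaining - tol) / m));
            int hi = (int) Math.floor((remaining + tol) / m);
            for (int n = lo; n <= hi; n++) {
                counts[element] = n;
                String formula = format(counts);
                if (!formula.isEmpty()) out.add(formula);
            }
            counts[element] = 0;
            return;
        }
        // Prune as soon as this element alone would overshoot the remaining mass.
        for (int n = 0; n * m <= remaining + tol; n++) {
            counts[element] = n;
            search(depth + 1, remaining - n * m, tol, counts, out);
        }
        counts[element] = 0;
    }

    private static String format(int[] counts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < SYMBOLS.length; i++) {
            if (counts[i] == 0) continue;
            sb.append(SYMBOLS[i]);
            if (counts[i] > 1) sb.append(counts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Candidate formulas for a mass close to that of glucose.
        for (String formula : generate(180.06339, 0.001)) {
            System.out.println(formula);
        }
    }
}
```

Because nothing filters chemically implausible compositions, the number of candidates grows quickly with mass and tolerance, which is the regime the benchmark above measures.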
Summary of systematic benchmark comparing v1.4.19 to v2.0
| Benchmark | Data set | Format | v1.4.19 skip | v1.4.19 time | v1.4.19 per min | v2.0 skip | v2.0 time | v2.0 per min | Improvement |
|---|---|---|---|---|---|---|---|---|---|
| countheavy | ChEBI 149 | smi | 2112 | 22.51s | 108.2K | 9 | 0.85s | 2.9M | 26.48 |
| | | sdf | 0 | 7.21s | 355.4K | 25 | 3s | 854.1K | 2.4 |
| | ChEMBL 22.1 | smi | 0 | 8m39.3s | 193.9K | 9 | 10.74s | 9.4M | 48.35 |
| | | sdf | 0 | 3m17.29s | 510.4K | 0 | 53.27s | 1.9M | 3.7 |
| rings | ChEBI 149 | smi | 2112 | 22.91s | 106.3K | 9 | 1.06s | 2.3M | 21.61 |
| | | sdf | 0 | 8.71s | 294.2K | 25 | 3.11s | 823.9K | 2.8 |
| | ChEMBL 22.1 | smi | 0 | 8m45.78s | 191.5K | 9 | 17.09s | 5.9M | 30.77 |
| | | sdf | 0 | 4m12.01s | 399.6K | 0 | 1m6.54s | 1.5M | 3.79 |
| rings | ChEBI 149 | smi | 2112 | 27.4s | 88.9K | 9 | 1.43s | 1.7M | 19.16 |
| | | sdf | 0 | 11.84s | 216.4K | 25 | 3.78s | 677.8K | 3.13 |
| | ChEMBL 22.1 | smi | 0 | 12m4.62s | 139K | 9 | 27.16s | 3.7M | 26.68 |
| | | sdf | 0 | 7m9.58s | 234.4K | 0 | 1m8.17s | 1.5M | 6.3 |
| rings | ChEBI 149 | smi | 2126 | 45.28s | 53.8K | 26 | 1.26s | 1.9M | 35.94 |
| | | sdf | 16 | 36.56s | 70.1K | 40 | 3.51s | 730K | 10.42 |
| | ChEMBL 22.1 | smi | 88 | 12m40.2s | 132.5K | 9 | 24.97s | 4M | 30.44 |
| | | sdf | 90 | 8m5.64s | 207.4K | 0 | 1m5.68s | 1.5M | 7.39 |
| cansmi | ChEBI 149 | smi | 2112 | 36.58s | 66.6K | 9 | 1.91s | 1.3M | 19.15 |
| | | sdf | 35 | 21.15s | 121.1K | 26 | 4.37s | 586.3K | 4.84 |
| | ChEMBL 22.1 | smi | 14 | 14m33.86s | 115.2K | 9 | 40.84s | 2.5M | 21.4 |
| | | sdf | 0 | 8m59.82s | 186.6K | 0 | 1m29.33s | 1.1M | 6.04 |
| convert | ChEBI 149 | smi | 2112 | 35.63s | 68.4K | 16 | 1.47s | 1.7M | 24.24 |
| | | sdf | 35 | 20.91s | 122.5K | 25 | 4.55s | 563.1K | 4.6 |
| | ChEMBL 22.1 | smi | 14 | 14m26.02s | 116.3K | 37 | 26.2s | 3.8M | 33.05 |
| | | sdf | 0 | 8m59.38s | 186.7K | 1 | 1m12.49s | 1.4M | 7.44 |
| convert | ChEBI 149 | smi | 2112 | 32.42s | 75.1K | 9 | 10.39s | 234.4K | 3.12 |
| | | sdf | 13 | 17s | 150.7K | 25 | 13.96s | 183.5K | 1.22 |
| | ChEMBL 22.1 | smi | 0 | 14m25.82s | 116.3K | 9 | 5m26.29s | 308.6K | 2.65 |
| | | sdf | 1 | 8m51.33s | 189.5K | 0 | 6m34.5s | 255.3K | 1.35 |
| convert | ChEBI 149 | smi | 2112 | 24m28.02s | 1.7K | 9 | 35.86s | 67.9K | 40.94 |
| | | sdf | 13 | 35m12.03s | 1.2K | 25 | 42.43s | 60.4K | 49.78 |
| | ChEMBL 22.1 | smi | 0 | 3h27m7s | 8.1K | 9 | 17m44.64s | 94.6K | 11.67 |
| | | sdf | 1 | 5h58m30s | 4.7K | 0 | 19m42.77s | 85.1K | 18.19 |
| fpgen | ChEBI 149 | smi | 2112 | 1m38s | 24.9K | 9 | 10.28s | 236.9K | 9.53 |
| | | sdf | 0 | 2m11.03s | 19.6K | 25 | 13.03s | 196.6K | 10.06 |
| | ChEMBL 22.1 | smi | 0 | 42m56.15s | 39.1K | 9 | 6m34.67s | 255.2K | 6.53 |
| | | sdf | 0 | 47m5.58s | 35.6K | 0 | 7m52.32s | 213.2K | 5.98 |
| fpgen | ChEBI 149 | smi | 2150 | 1h37m35s | 416 | 9 | 19.51s | 124.8K | 300.1 |
| | | sdf | 48 | 1h44m17s | 409 | 25 | 21.25s | 120.6K | 294.45 |
| | ChEMBL 22.1 | smi | 214 | 20h24m57s | 1.4K | 9 | 13m31.21s | 124.1K | 90.6 |
| | | sdf | 225 | 24h41m46s | 1.1K | 0 | 13m26.41s | 124.9K | 110.25 |
| fpgen | ChEBI 149 | smi | 0 | – | | 9 | 4.37s | 557.4K | 0 |
| | | sdf | 0 | – | | 25 | 6.81s | 376.2K | 0 |
| | ChEMBL 22.1 | smi | 0 | – | | 9 | 2m43.45s | 616.1K | 0 |
| | | sdf | 0 | – | | 0 | 3m42.01s | 453.6K | 0 |
The total elapsed real time was measured with the Unix time utility. The throughput is reported in molecules per minute (K = thousand, M = million) as a relatable metric. It was calculated by dividing the number of molecules in the data set (42704 for ChEBI 149, and 1678393 for ChEMBL 22.1) by the total elapsed time. The ChEBI SMILES input contains 2107 blank (but valid) inputs, which account for the majority of records skipped in v1.4.19. The throughput calculation was adjusted to account for this
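As a worked check of the throughput figures, the sketch below recomputes a few countheavy entries for ChEBI 149. Two points are assumptions inferred from the tabulated values rather than stated in the text: the adjustment for the SMILES input is taken to be the removal of the 2107 blank records from the molecule count, and the Improvement column is taken to be the ratio of elapsed times. The class and method names are illustrative.

```java
import java.util.Locale;

class ThroughputSketch {
    /** Molecules processed per minute for a run of the given wall-clock duration. */
    static double perMinute(int molecules, double elapsedSeconds) {
        return molecules / elapsedSeconds * 60.0;
    }

    /** Speed-up factor: old elapsed time divided by new elapsed time (assumed definition). */
    static double speedup(double oldSeconds, double newSeconds) {
        return oldSeconds / newSeconds;
    }

    public static void main(String[] args) {
        int chebi = 42704;       // molecules in ChEBI 149, from the table note
        int blankSmiles = 2107;  // blank SMILES records, from the table note

        // countheavy on ChEBI 149 SMILES, blank records excluded (assumed adjustment).
        double v1Smi = perMinute(chebi - blankSmiles, 22.51);
        double v2Smi = perMinute(chebi - blankSmiles, 0.85);
        // countheavy on ChEBI 149 SDF, no adjustment.
        double v1Sdf = perMinute(chebi, 7.21);

        System.out.println(String.format(Locale.ROOT, "v1.4.19 smi: %.1fK per min", v1Smi / 1e3));
        System.out.println(String.format(Locale.ROOT, "v2.0 smi: %.1fM per min", v2Smi / 1e6));
        System.out.println(String.format(Locale.ROOT, "v1.4.19 sdf: %.1fK per min", v1Sdf / 1e3));
        System.out.println(String.format(Locale.ROOT, "improvement smi: %.2f", speedup(22.51, 0.85)));
    }
}
```

Under these assumptions the sketch gives 108.2K, 2.9M, 355.4K, and 26.48, matching the corresponding table entries.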
CTfile format support
| Format | V2000 | V3000 |
|---|---|---|
| MOLfile | Read and write | Read and write |
| RXNfile | Read and write | Read |
| SDfile MOLfile | Read and write | Read |
| RGfile | Read and write | |
| RDfile | | |
Fig. 8 Examples of Sgroups now captured by the CDK and encoded in molfiles and CXSMILES. a Ethyl esterification as a fully expanded reaction. b Using Sgroup abbreviations allows display shortcuts and a more compact depiction. c An example of a structure repeat unit in DNA 5′-phosphate (CHEBI:4294)
The molecular fingerprints in CDK
| Fingerprinter | Bit version | Count version | CDK version | Default size |
|---|---|---|---|---|
| CircularFingerprinter [ | | | v2.0 | 1024/ |
| EStateFingerprinter [ | | | v1.2.0 | 79 |
| ExtendedFingerprinter | | | v1.0 | |
| Fingerprinter | | | v1.0 | |
| GraphOnlyFingerprinter | | | v1.0 | |
| HybridizationFingerprinter | | | v1.4.0 | |
| KlekotaRothFingerprinter [ | | | v1.4.6 | 4860 |
| LingoFingerprinter [ | | | v2.0 | NA |
| MACCSFingerprinter | | | v1.2.0 | 166 |
| PubchemFingerprinter [ | | | v1.4.0 | 881 |
| ShortestPathFingerprinter | | | v2.0 | 1024 |
| SignatureFingerprinter [ | | | v2.0 | |
| SubstructureFingerprinter | | | v1.0 | 307 |
Listed are the molecular fingerprints currently available in the CDK, with information about whether they come as a bit and/or count version, which CDK version they were introduced in, their default size, and relevant references, where applicable
* For the CircularFingerprinter the bit version is folded to 1024 whereas the count version is unfolded
The LingoFingerprinter does not have a default size
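The difference between a folded bit fingerprint and an unfolded count fingerprint, as mentioned in the CircularFingerprinter footnote, can be sketched as follows. The folding rule used here (feature hash modulo the fingerprint length) is an assumption chosen for illustration; the CDK's own folding may differ, and the class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class FoldingSketch {
    /** Folded bit version: each feature hash sets one of 'size' bits, so collisions are possible. */
    static BitSet fold(int[] featureHashes, int size) {
        BitSet bits = new BitSet(size);
        for (int hash : featureHashes) {
            bits.set(Math.floorMod(hash, size)); // floorMod keeps negative hashes in range
        }
        return bits;
    }

    /** Unfolded count version: every distinct feature hash keeps its own counter. */
    static Map<Integer, Integer> count(int[] featureHashes) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int hash : featureHashes) {
            counts.merge(hash, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] hashes = {5, 1029, -1, 5}; // made-up feature hashes
        System.out.println("folded bits set: " + fold(hashes, 1024));
        System.out.println("unfolded counts: " + count(hashes));
    }
}
```

In this example the hashes 5 and 1029 collide on bit 5 after folding to 1024 bits, whereas the unfolded map keeps them apart and records that 5 occurred twice.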
A selection of key CDK modules with major changes
| Module | Description | Major changes | Dependencies |
|---|---|---|---|
| interfaces | Interfaces for the data models | | Vecmath 1.5.2 |
| core | Core functionality | | Google Guava 17.0 |
| standard | Common functionality | | |
| render | Graphical rendering | Redesigned to make it more modular and support multiple widget toolkits, like AWT and SWT | |
| isomorphism | Isomorphism and substructure searching | | |
| atomtype | Various non-core atom type schemes | Unified approach where atom typing is separated from other algorithms | |
| ioformats | Definitions of (chemical) input/output formats | | |
| io | Readers and writers for input/output formats | The molfile reader has been rewritten and supports atom types defined in the specification | XPP3 1.1.4c |
| iordf | Stores data models in the Resource Description Framework serialization formats | New | Jena 2.7.4 |
| inchi | IUPAC International Chemical Identifier support | | JNI-InChI 0.8 [ |
| libiocml | Writer for the Chemical Markup Language format | | XOM 1.2.5, CMLXOM 3.1 [ |
| sdg | Structure diagram generation | Much improved overlap resolution | |
| smiles | Reading and writing in the SMILES format | Performance and coverage of SMILES support are greatly improved | Beam 0.9.1 [ |
| smarts | Substructure searching with the SMARTS format | | Beam 0.9.1 [ |
| hash | Molecular hash codes [ | | |
| formula | Chemical formula support | New | |
| fingerprint | Calculate fingerprints | Many new fingerprint types (see text) | Apache Commons Math 3.1.1 |
| qsar and qsarmolecular | Molecular descriptors | | XOM 1.2.5, JAMA 1.0.3 [ |
| signatures | Calculation of molecular and atomic signatures | | Signatures 1.1 |
An overview of a selection of often used CDK modules with description, dependencies on third-party libraries, and the major changes since version 1.2. Dependencies between modules are depicted in Fig. 7
Fig. 7 Dependencies between CDK modules. For example, the cdk-core module depends on the cdk-interfaces module. A few higher-level modules have been left out: cdk-builder3dtools, cdk-legacy, and cdk-depict
Summary of systematic benchmark comparing v1.4.19 to v2.0 without read times
| Benchmark | Data set | Format | v1.4.19 skip | v1.4.19 time | v1.4.19 per min | v2.0 skip | v2.0 time | v2.0 per min | Improvement |
|---|---|---|---|---|---|---|---|---|---|
| countheavy | ChEBI 149 | smi | 0 | 0s | – | 0 | 0s | – | |
| | | sdf | 0 | 0s | – | 0 | 0s | – | |
| | ChEMBL 22.1 | smi | 0 | 0s | – | 0 | 0s | – | |
| | | sdf | 0 | 0s | – | 0 | 0s | – | |
| rings | ChEBI 149 | smi | 0 | 0.4s | 6.1M | 0 | 0.21s | 11.6M | 1.9 |
| | | sdf | 0 | 1.5s | 1.7M | 0 | 0.11s | 23.3M | 13.6 |
| | ChEMBL 22.1 | smi | 0 | 6.48s | 15.5M | 0 | 6.35s | 15.9M | 1 |
| | | sdf | 0 | 54.72s | 1.8M | 0 | 13.27s | 7.6M | 4.1 |
| rings | ChEBI 149 | smi | 0 | 4.89s | 498.1K | 0 | 0.58s | 4.2M | 8.4 |
| | | sdf | 0 | 4.63s | 553.4K | 0 | 0.78s | 3.3M | 5.9 |
| | ChEMBL 22.1 | smi | 0 | 3m25.32s | 490.5K | 0 | 16.42s | 6.1M | 12.5 |
| | | sdf | 0 | 3m52.29s | 433.5K | 0 | 14.9s | 6.8M | 15.6 |
| rings | ChEBI 149 | smi | 14 | 22.77s | 107K | 17 | 0.41s | 5.9M | 55.5 |
| | | sdf | 16 | 29.35s | 87.3K | 15 | 0.51s | 5M | 57.5 |
| | ChEMBL 22.1 | smi | 88 | 4m0.9s | 418K | 0 | 14.23s | 7.1M | 16.9 |
| | | sdf | 90 | 4m48.35s | 349.2K | 0 | 12.41s | 8.1M | 23.2 |
| cansmi | ChEBI 149 | smi | 0 | 14.07s | 173.1K | 0 | 1.06s | 2.3M | 13.3 |
| | | sdf | 35 | 13.94s | 183.8K | 1 | 1.37s | 1.9M | 10.2 |
| | ChEMBL 22.1 | smi | 14 | 5m54.56s | 284K | 0 | 30.1s | 3.3M | 11.8 |
| | | sdf | 0 | 5m42.53s | 294K | 0 | 36.06s | 2.8M | 9.5 |
| convert | ChEBI 149 | smi | 0 | 13.12s | 185.7K | 7 | 0.62s | 3.9M | 21.2 |
| | | sdf | 35 | 13.7s | 187K | 0 | 1.55s | 1.7M | 8.8 |
| | ChEMBL 22.1 | smi | 14 | 5m46.72s | 290.4K | 28 | 15.46s | 6.5M | 22.4 |
| | | sdf | 0 | 5m42.09s | 294.4K | 1 | 19.22s | 5.2M | 17.8 |
| convert | ChEBI 149 | smi | 0 | 9.91s | 245.8K | 0 | 9.54s | 255.3K | 1 |
| | | sdf | 13 | 9.79s | 261.7K | 0 | 10.96s | 233.8K | 0.9 |
| | ChEMBL 22.1 | smi | 0 | 5m46.52s | 290.6K | 0 | 5m15.55s | 319.1K | 1.1 |
| | | sdf | 1 | 5m34.04s | 301.5K | 0 | 5m41.23s | 295.1K | 1 |
| convert | ChEBI 149 | smi | 0 | 24m5.51s | 1.7K | 0 | 35.01s | 69.6K | 41.3 |
| | | sdf | 13 | 35m4.82s | 1.2K | 0 | 39.43s | 65K | 53.4 |
| | ChEMBL 22.1 | smi | 0 | 3h18m28s | 8.5K | 0 | 17m33.9s | 95.6K | 11.3 |
| | | sdf | 1 | 5h55m13s | 4.7K | 0 | 18m49.5s | 89.2K | 18.9 |
| fpgen | ChEBI 149 | smi | 0 | 1m15.49s | 32.3K | 0 | 9.43s | 258.3K | 8 |
| | | sdf | 0 | 2m3.82s | 20.7K | 0 | 10.03s | 255.5K | 12.3 |
| | ChEMBL 22.1 | smi | 0 | 34m16.85s | 49K | 0 | 6m23.93s | 262.3K | 5.4 |
| | | sdf | 0 | 43m48.29s | 38.3K | 0 | 6m59.05s | 240.3K | 6.3 |
| fpgen | ChEBI 149 | smi | 38 | 1h37m12s | 418 | 0 | 18.66s | 130.5K | 312.6 |
| | | sdf | 48 | 1h44m10s | 410 | 0 | 18.25s | 140.4K | 342.5 |
| | ChEMBL 22.1 | smi | 214 | 20h16m18s | 1.4K | 0 | 13m20.47s | 125.8K | 91.2 |
| | | sdf | 225 | 24h38m29s | 1.1K | 0 | 12m33.14s | 133.7K | 117.8 |
| fpgen | ChEBI 149 | smi | 0 | – | | 0 | 3.52s | 692K | |
| | | sdf | 0 | – | | 0 | 3.81s | 672.5K | |
| | ChEMBL 22.1 | smi | 0 | – | | 0 | 2m32.71s | 659.4K | |
| | | sdf | 0 | – | | 0 | 2m48.74s | 596.8K | |
The number of records skipped and the time taken to run the countheavy benchmark (Table 5) have been subtracted. The remaining results provide a relative comparison without the overhead of reading the input
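The subtraction can be traced for a single row. The sketch below takes the first rings entry for ChEBI 149 SMILES input and removes the countheavy baseline for the same input, using values copied from the full benchmark table. Treating the improvement as the ratio of the remaining times is an inference from the tabulated values, and the class and method names are illustrative.

```java
import java.util.Locale;

class BaselineSubtractionSketch {
    /** Time left after removing the read-only (countheavy) baseline, in seconds. */
    static double netSeconds(double benchmarkSeconds, double baselineSeconds) {
        return benchmarkSeconds - baselineSeconds;
    }

    /** Records skipped by the benchmark beyond those already skipped while reading. */
    static int netSkipped(int benchmarkSkipped, int baselineSkipped) {
        return benchmarkSkipped - baselineSkipped;
    }

    public static void main(String[] args) {
        // rings (first block), ChEBI 149, SMILES input.
        double oldNet = netSeconds(22.91, 22.51); // v1.4.19: rings minus countheavy
        double newNet = netSeconds(1.06, 0.85);   // v2.0: rings minus countheavy

        System.out.println(String.format(Locale.ROOT,
                "net time: %.2fs vs %.2fs, ratio %.1f", oldNet, newNet, oldNet / newNet));
        System.out.println("net skipped: " + netSkipped(2112, 2112) + " vs " + netSkipped(9, 9));
    }
}
```

The results, 0.40 s against 0.21 s with a ratio of about 1.9 and no remaining skipped records, agree with the corresponding row of the table above.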