Hirotomo Moriwaki, Yu-Shi Tian, Norihito Kawashita, Tatsuya Takagi.
Abstract
Molecular descriptors are widely used to represent molecular characteristics in cheminformatics, and various descriptor-calculation programs have been developed. However, users of those programs must contend with several issues, including software bugs, infrequent updates, and licensing constraints. To address these issues, we propose Mordred, a descriptor-calculation software application that can calculate more than 1800 two- and three-dimensional descriptors. It is freely available via GitHub. Mordred can be easily installed and used through the command line interface, as a web application, or as a highly flexible Python package on all major platforms (Windows, Linux, and macOS). Performance benchmark results show that Mordred is at least twice as fast as the well-known PaDEL-Descriptor, and that it can calculate descriptors for large molecules, which other software cannot do. Owing to its good performance, convenience, number of descriptors, and permissive license, Mordred is a promising choice of molecular descriptor calculation software for cheminformatics studies, such as those on quantitative structure-property relationships.
Keywords: Calculation software; Cheminformatics; Molecular descriptor; Python; QSPR
Year: 2018 PMID: 29411163 PMCID: PMC5801138 DOI: 10.1186/s13321-018-0258-y
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Feature comparison of major descriptor calculation software
| | Mordred | PaDEL-Descriptor | BlueDesc | ChemoPy | PyDPI | Rcpi | Cinfony | Dragon |
|---|---|---|---|---|---|---|---|---|
| Number of descriptors | 1825 | 1875 | 174 | 1135 | 615 | 307 | –<sup>a</sup> | 5270 |
| Citation count<sup>b</sup> | – | 598 | – | 48 | 17 | 21 | 38 | 148 |
| Library | Python2/3 | – | – | Python2 | Python2 | R | Python2/3 | – |
| Parallel computation | ✓ | ✓ | – | – | – | – | – | – |
| GUI | – | ✓ | – | – | – | – | – | ✓ |
| CLI | ✓ | ✓ | ✓ | – | – | – | – | ✓ |
| KNIME | – | ✓ | – | – | – | – | – | ✓ |
| RapidMiner | – | ✓ | – | – | – | – | – | – |
| Web interface | ✓ | – | – | – | – | – | – | ✓<sup>c</sup> |
| Last release | 2018/1/20 | 2014/7/21 | 2008/10/3 | 2013/2/1 | 2015/11/10 | 2017/11/18 | 2015/8/1 | ?<sup>d</sup> |
| License | BSD-3-Clause | <sup>e</sup> | GPL | GPL | GPL | Artistic license | BSD-2-Clause/GPLv2/GPLv3 | Proprietary |
| Source code distribution | GitHub | Official site | Official site | Google Code | PyPI | GitHub | GitHub | – |
| Other advantages | | | Easy to use with libSVM | | Can also calculate protein descriptors | Can also calculate protein descriptors | | Includes analysis tool |
| Other disadvantages | | Some bugs have been found | No configurable options | | | | Requires many manually installed dependencies | Payware |
<sup>a</sup> Depends on backends
<sup>b</sup> Citation counts on Google Scholar (accessed on 2018/01/16)
<sup>c</sup> Provided by e-Dragon, which uses the old version of Dragon
<sup>d</sup> Unknown; however, Dragon is being actively developed
<sup>e</sup> “This software is free for all (e.g. personal, academic, non-profit, non-commercial, government, commercial, etc.) to use.” (http://yapcwsoft.com/dd/padeldescriptor, accessed on 2018/01/16)
Defects identified in the descriptor calculation software
| Software | Details |
|---|---|
| CDK | Theoretically, roto-translation of a molecule should not change any molecular property. However, the TPSA and LengthOverBreadth descriptors gave different values for the same molecule before and after roto-translation |
| | The value of ChiPathCluster is invalid because the patterns are not adequate (fixed in the latest version of CDK) |
| PaDEL | Several molecules (e.g., cyanidin) gave invalid values for many descriptors (e.g., nH (hydrogen count) returned 12) under the default configuration, owing to a bug in the aromaticity detection procedure and/or the 3D conformer generator of PaDEL-Descriptor. This broke an aromatic ring and attached an invalid hydrogen |
| | Some descriptors use the log sum exponential (LSE) function ( |
| | In the constitutional descriptors, discrepancies might be induced in the algorithm implementation owing to incorrect code reuse |
| ChemoPy | Cannot calculate exact values of modified Zagreb index 2 |
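
The first CDK entry suggests a simple regression check: evaluate a descriptor on a set of 3D coordinates, apply a rigid rotation plus translation, and compare the two values. The following is a minimal sketch in plain Python, not CDK or Mordred code. The two toy descriptors, the geometry, and all function names are hypothetical and chosen only to show one descriptor that passes and one that fails.

```python
import math
from itertools import combinations


def roto_translate(coords, angle, shift):
    """Rotate points about the z axis by `angle` (radians), then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


def sum_of_distances(coords):
    """Toy descriptor built only from interatomic distances (frame independent)."""
    return sum(math.dist(p, q) for p, q in combinations(coords, 2))


def x_extent(coords):
    """Toy descriptor tied to the coordinate frame (changes under rotation)."""
    xs = [x for x, _, _ in coords]
    return max(xs) - min(xs)


def is_invariant(descriptor, coords, angle=0.7, shift=(1.5, -2.0, 3.0), tol=1e-9):
    """Return True if the descriptor value survives one roto-translation."""
    moved = roto_translate(coords, angle, shift)
    return math.isclose(descriptor(coords), descriptor(moved), abs_tol=tol)


# Hypothetical four-atom geometry, used only for illustration.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
print(is_invariant(sum_of_distances, toy_coords))
print(is_invariant(x_extent, toy_coords))
```

A single transformation can only expose a failure, not prove invariance, so a real check would likely repeat this over several random rotations and translations.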
List of Mordred descriptors
| Descriptor name | Number of descriptors (preset) |
|---|---|
| 2D descriptors | |
| ABCIndex | 2 |
| AcidBase | 2 |
| AdjacencyMatrix | 13 |
| Aromatic | 2 |
| AtomCount | 16 |
| Autocorrelation | 606 |
| BCUT<sup>a</sup> | 24 |
| BalabanJ<sup>a</sup> | 1 |
| BaryszMatrix<sup>a</sup> | 104 |
| BertzCT | 1 |
| BondCount | 9 |
| CarbonTypes | 10 |
| Chi | 56 |
| Constitutional | 16 |
| DetourMatrix | 14 |
| DistanceMatrix | 13 |
| EState | 316 |
| EccentricConnectivityIndex | 1 |
| ExtendedTopochemicalAtom | 45 |
| FragmentComplexity | 1 |
| Framework | 1 |
| HydrogenBond<sup>a</sup> | 2 |
| InformationContent | 42 |
| KappaShapeIndex | 3 |
| Lipinski | 2 |
| McGowanVolume | 1 |
| MoeType<sup>a</sup> | 53 |
| MolecularDistanceEdge | 19 |
| MolecularId | 12 |
| PathCount | 21 |
| Polarizability | 2 |
| RingCount | 138 |
| RotatableBond<sup>a</sup> | 2 |
| SLogP<sup>a</sup> | 2 |
| TopoPSA<sup>a</sup> | 2 |
| TopologicalCharge | 21 |
| TopologicalIndex | 4 |
| VdwVolumeABC | 1 |
| VertexAdjacencyInformation | 1 |
| WalkCount | 21 |
| Weight | 2 |
| WienerIndex | 2 |
| ZagrebIndex | 4 |
| 3D descriptors | |
| CPSA | 43 |
| GeometricalIndex | 4 |
| GravitationalIndex | 4 |
| MoRSE | 160 |
| MomentOfInertia | 3 |
<sup>a</sup> RDKit wrapper
Fig. 1 Overview of the Mordred library. Mordred consists of two main classes: Descriptor and Calculator. Users register descriptors on a Calculator instance, which can then calculate the registered descriptors in parallel
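
The register-then-calculate pattern described in the caption can be sketched with the standard library alone. This is an illustrative toy, not Mordred's implementation or API: `ToyCalculator`, its methods, the two descriptor functions, and the use of a formula dictionary as a stand-in for a molecule are all assumptions made for the example. Threads are used only to keep the sketch short; for CPU-bound descriptor work a process pool would be the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyCalculator:
    """Holds registered descriptor functions and applies them to molecules."""

    def __init__(self):
        self._descriptors = {}

    def register(self, name, func):
        """Register a descriptor function under a result name."""
        self._descriptors[name] = func

    def __call__(self, mol):
        """Calculate every registered descriptor for one molecule."""
        return {name: func(mol) for name, func in self._descriptors.items()}

    def map(self, mols, workers=2):
        """Calculate descriptors for many molecules concurrently, keeping input order."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(self, mols))


def atom_count(formula):
    """Total number of atoms in a formula given as {element: count}."""
    return sum(formula.values())


def heavy_atom_count(formula):
    """Number of non-hydrogen atoms."""
    return sum(n for element, n in formula.items() if element != "H")


calc = ToyCalculator()
calc.register("n_atoms", atom_count)
calc.register("n_heavy_atoms", heavy_atom_count)

ethanol = {"C": 2, "H": 6, "O": 1}
water = {"H": 2, "O": 1}
results = calc.map([ethanol, water])
print(results)
```

Keeping descriptors as independent callables is what makes the per-molecule work easy to distribute: each molecule can be handed to a worker without shared state.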
Summary of descriptor improvement
| Descriptor | Summary |
|---|---|
| Chi | Depth-first search is used instead of SMARTS pattern matching |
| DetourMatrix | The molecular graph is divided into smaller subgraphs at its articulation points |
| Framework | Specialized for the Framework descriptor |
| MolecularId | Parts of the calculation are cached to avoid redundant computation |
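
To make the Chi row concrete, the sketch below enumerates simple paths by depth-first search and sums the usual Kier-Hall path term, $\chi = \sum_{\text{paths}} \prod_{i \in \text{path}} \delta_i^{-1/2}$, over them. It is an assumption-laden illustration rather than Mordred's code: it uses plain vertex degrees of a hydrogen-suppressed graph for $\delta_i$, handles only the path type (no cluster, path-cluster, chain, or valence variants), and the function names are hypothetical.

```python
from math import prod, sqrt


def simple_paths(adj, order):
    """Enumerate simple paths with `order` edges by DFS, each path reported once."""
    found = set()

    def dfs(path):
        if len(path) == order + 1:
            # A path and its reverse are the same subgraph; keep one orientation.
            found.add(min(tuple(path), tuple(reversed(path))))
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                dfs(path + [nxt])

    for start in adj:
        dfs([start])
    return sorted(found)


def chi_path(adj, order):
    """Simple path connectivity index of the given order, using vertex degrees."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    return sum(
        1 / sqrt(prod(degree[v] for v in path))
        for path in simple_paths(adj, order)
    )


# Hydrogen-suppressed graph of n-butane: a chain of four carbons.
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(simple_paths(butane, 1))
print(chi_path(butane, 1))
print(chi_path(butane, 2))
```

For n-butane the first-order value is $2/\sqrt{2} + 1/2 \approx 1.914$ and the second-order value is $1.0$, which gives a quick hand check of the enumeration.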
Fig. 2 DetourMatrix algorithm. The chemical structure is split into subgraphs at all articulation points. Then, the detour matrix of each subgraph is calculated. Finally, the detour matrices of the subgraphs are merged and the remaining elements are filled in
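
The idea behind this split can be checked on a small graph. A detour matrix entry is the length of the longest simple path between two vertices, and any simple path between vertices in different subgraphs must pass through the articulation point that separates them, so the entry is the sum of the two within-subgraph entries. The sketch below is a brute-force illustration under that reading of the caption, not Mordred's implementation; the function names and the example graph (two triangles sharing vertex 2) are hypothetical.

```python
def articulation_points(adj):
    """Find articulation points with Tarjan's low-link DFS."""
    disc, low, points = {}, {}, set()

    def dfs(u, parent):
        disc[u] = low[u] = len(disc)  # discovery index
        children = 0
        for v in adj[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])
        if parent is None and children > 1:
            points.add(u)

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return points


def longest_simple_path(adj, src, dst, visited=frozenset()):
    """Edge count of the longest simple path src -> dst (exhaustive DFS), or -1."""
    if src == dst:
        return 0
    visited = visited | {src}
    best = -1
    for nxt in adj[src]:
        if nxt not in visited:
            rest = longest_simple_path(adj, nxt, dst, visited)
            if rest >= 0:
                best = max(best, rest + 1)
    return best


def detour_matrix(adj):
    """Detour matrix as a nested dict, computed by brute force."""
    nodes = sorted(adj)
    return {u: {v: longest_simple_path(adj, u, v) for v in nodes} for u in nodes}


# Two triangles (0-1-2 and 2-3-4) sharing vertex 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
cut = articulation_points(graph)
full = detour_matrix(graph)
left = detour_matrix({0: [1, 2], 1: [0, 2], 2: [0, 1]})
right = detour_matrix({2: [3, 4], 3: [2, 4], 4: [2, 3]})

print(cut)
print(full[0][3], left[0][2] + right[2][3])
```

Because the exhaustive search grows quickly with ring size, running it on each small subgraph and combining the results is much cheaper than running it on the whole molecule.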
Fig. 3 Web interface. a Top page of the web interface, used to upload a structure file; b preview page to check the conformation of compounds; c result page to review descriptive statistics of the calculated descriptors and download the results
Number of atoms in the benchmark dataset
| Number of atoms | Compounds | Cumulative percentage |
|---|---|---|
| (0, 25] | 917 | 12.74 |
| (25, 50] | 3412 | 60.15 |
| (50, 75] | 2180 | 90.44 |
| (75, 100] | 366 | 95.53 |
| (100, 125] | 147 | 97.57 |
| (125, 150] | 79 | 98.67 |
| (150, 175] | 37 | 99.18 |
| (175, 200] | 33 | 99.64 |
| (200, 225] | 12 | 99.81 |
| (225, 250] | 7 | 99.90 |
| (250, 275] | 2 | 99.93 |
| (275, 300] | 3 | 99.97 |
| (300, 325] | 1 | 99.99 |
| (325, 350] | 0 | 99.99 |
| (350, 375] | 1 | 100.00 |
(·,·] denotes a left-open and right-closed interval
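
The cumulative percentage column can be reproduced from the compound counts. The short sketch below copies the counts from the table and recomputes the column; the variable names are illustrative only.

```python
from itertools import accumulate

# (upper bound of the (lower, upper] bin, number of compounds), copied from the table.
bins = [
    (25, 917), (50, 3412), (75, 2180), (100, 366), (125, 147),
    (150, 79), (175, 37), (200, 33), (225, 12), (250, 7),
    (275, 2), (300, 3), (325, 1), (350, 0), (375, 1),
]

counts = [n for _, n in bins]
total = sum(counts)

# Running total of compounds divided by the dataset size, as a percentage.
cumulative_pct = [100 * running / total for running in accumulate(counts)]

for (upper, n), pct in zip(bins, cumulative_pct):
    print(f"({upper - 25}, {upper}]  {n:5d}  {pct:6.2f}")
```

The counts sum to 7197 compounds, and the recomputed values agree with the tabulated column to two decimal places.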
Fig. 4 Calculation time for all descriptors. Comparison of Mordred and PaDEL-Descriptor in terms of the mean descriptor calculation time per molecule, grouped by number of atoms. The vertical axis shows the mean time to calculate all descriptors of a single molecule. The horizontal axis shows the class interval of the number of atoms
Fig. 5 Calculation time for each descriptor. Comparison of Mordred and PaDEL-Descriptor in terms of the mean calculation time for each kind of descriptor that takes over 0.1 s in Mordred and/or PaDEL-Descriptor. The vertical axis shows the mean time to calculate the descriptor for a single molecule. The horizontal axis shows the class interval of the number of atoms
Fig. 6 Throughput of CLI. Comparison of Mordred and PaDEL-Descriptor in terms of CLI throughput using one to six threads