Roman Martin, Dominik Heider.
Abstract
In recent years, machine learning techniques have been widely used in biomedical research to predict unseen data based on models trained on experimentally derived data. In the current study, we used machine learning algorithms to emulate computationally complex predictions in a reverse engineering-like manner and developed ContraDRG, a software tool that predicts partial charges for small molecules based on PRODRG and Automated Topology Builder (ATB) predictions. Both tools generate molecular topology files, including partial atomic charges, using different procedures. We show that ContraDRG can accurately predict partial charges in a fraction of the time because it uses machine learning to emulate existing complex models that rely on intensive calculations, and it can therefore also be applied to screening projects with large numbers of molecules. We provide ContraDRG as a web server that automatically assigns partial charges to user-specified molecules using our machine learning models. In this study, we statistically compared the predictive performance of ContraDRG with that of PRODRG and ATB. ContraDRG predicts ATB-derived partial charges with an R² value of up to 0.980 and PRODRG-derived partial charges with an R² value of up to 1.00. While ATB requires hours or days for its quantum mechanically accurate calculations and refinements, ContraDRG produces its approximation within seconds.
Keywords: ATB; PRODRG; machine learning; molecular dynamics simulations; partial charge prediction
Year: 2019 PMID: 31737032 PMCID: PMC6831742 DOI: 10.3389/fgene.2019.00990
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Schematic overview of the feature encoding. (A) Each atom is selected in turn (red dot), and its encodings are generated (B–D). (B) Overall circular structures (green line) and nested ones (colored areas) are detected by a depth-first search. (C) Distance searches with three different radii are applied. (D) Path tracing to second-level neighbors is implemented (orange arrows, first level; green arrows, second level). Chemical structures were drawn with MolView (https://molview.org).
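The graph-based parts of this encoding, panels (B) and (D), could be sketched roughly as follows. This is a minimal illustration and not the ContraDRG implementation: the function names (`build_graph`, `in_ring`, `neighbor_levels`, `encode_atom`), the bond-list input format, and the returned feature layout are all assumptions made for the example. The distance searches of panel (C) are omitted because they require 3D coordinates.

```python
from collections import defaultdict


def build_graph(bonds):
    """Build an undirected adjacency map from (atom_i, atom_j) bond pairs."""
    graph = defaultdict(set)
    for a, b in bonds:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def in_ring(graph, start):
    """Depth-first search: True if `start` lies on at least one cycle."""
    for first in graph[start]:
        stack, seen = [first], {first}
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if node == first and nxt == start:
                    continue  # do not walk straight back over the entry bond
                if nxt == start:
                    return True  # reached the start atom by a second route
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return False


def neighbor_levels(graph, atom):
    """Return (first-level, second-level) neighbor sets of `atom`."""
    first = set(graph[atom])
    second = set()
    for n in first:
        second |= graph[n]
    second -= first | {atom}  # keep only atoms exactly two bonds away
    return first, second


def encode_atom(graph, elements, atom):
    """Hypothetical per-atom feature record (layout is an assumption)."""
    first, second = neighbor_levels(graph, atom)
    return {
        "element": elements[atom],
        "in_ring": in_ring(graph, atom),
        "first_level": sorted(elements[i] for i in first),
        "second_level": sorted(elements[i] for i in second),
    }


# Heavy atoms of phenol: a six-membered carbon ring (0-5) plus oxygen (6).
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
elements = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "O"}
graph = build_graph(bonds)
print(encode_atom(graph, elements, 6))
```

In this sketch the oxygen is reported as outside any ring with one first-level and two second-level carbon neighbors, while every ring carbon is flagged as a ring member.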
Figure 2. (A) The violin plots show the Tanimoto coefficient for both datasets. The plot width correlates with the relative frequencies of the coefficient. The white dot represents the median, the black box the interquartile range, and the black lines the 95% confidence intervals. One-sample t tests for both sets of Tanimoto coefficients yield p < 0.001 for a mean below 0.15. (B) The distribution of atom types for each dataset is represented by relative bar plots.
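For reference, the Tanimoto coefficient of two molecules is the size of the intersection of their fingerprint bit sets divided by the size of the union. The sketch below assumes fingerprints are given as sets of "on" bit indices; how the fingerprints were generated in the study is not stated here, and the helper names (`tanimoto`, `mean_pairwise_tanimoto`) and the empty-set convention are choices made for this example.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| of two sets of 'on' bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0  # arbitrary convention for two empty fingerprints
    return len(a & b) / union


def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto coefficient over all unordered pairs of molecules."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints (hypothetical bit indices, not real molecules).
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(tanimoto(fps[0], fps[1]))
print(mean_pairwise_tanimoto(fps))
```

A low mean pairwise coefficient, as reported for both datasets, indicates that the molecules are structurally diverse.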
Figure 3. Smoothed kernel density estimates represent the distribution of partial charges (units of e) for each molecule in the datasets. The distribution from the PRODRG dataset (green) reveals more clustered peaks than that from the ATB dataset (red).
Performance comparison for partial charge prediction (units of e) by random forest and support vector machine (SVM) with linear kernel on the PRODRG and ATB datasets.
| | PRODRG | | | | | | ATB | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Random forest | | | SVM linear | | | Random forest | | | SVM linear | | |
| Atom | RMSE | NRMSE | R² | RMSE | NRMSE | R² | RMSE | NRMSE | R² | RMSE | NRMSE | R² |
| C | 0.011 | 1.443 | 0.989 | 0.054 | 7.073 | 0.738 | 0.069 | 2.398 | 0.961 | 0.152 | 5.268 | 0.810 |
| H | 0.005 | 2.878 | 0.955 | 0.026 | 13.924 | 0.010 | 0.018 | 2.313 | 0.980 | 0.046 | 5.794 | 0.879 |
| N | 0.048 | 1.986 | 0.990 | 0.249 | 10.374 | 0.730 | 0.113 | 5.391 | 0.919 | 0.163 | 7.772 | 0.834 |
| O | 0.051 | 3.184 | 0.971 | 0.153 | 9.494 | 0.739 | 0.047 | 4.200 | 0.887 | 0.071 | 6.302 | 0.746 |
| P | 0.002 | 0.152 | 1.000 | 0.073 | 7.157 | 0.965 | 0.075 | 3.712 | 0.892 | 0.097 | 4.803 | 0.823 |
| S | 0.015 | 0.678 | 1.000 | 0.120 | 5.454 | 0.985 | 0.068 | 3.095 | 0.982 | 0.087 | 3.962 | 0.971 |
| F | 0.003 | 2.436 | 0.993 | 0.007 | 5.184 | 0.968 | 0.017 | 4.179 | 0.897 | 0.037 | 9.205 | 0.520 |
| Cl | 0.004 | 2.724 | 0.980 | 0.020 | 15.293 | 0.415 | 0.030 | 5.490 | 0.895 | 0.054 | 9.796 | 0.705 |
| Br | 0.011 | 8.625 | 0.791 | 0.016 | 12.222 | 0.589 | 0.033 | 8.796 | 0.778 | 0.049 | 13.033 | 0.531 |
| I | 0.004 | 2.575 | 0.955 | 0.010 | 6.592 | 0.706 | 0.036 | 12.840 | 0.888 | 0.062 | 22.082 | 0.624 |
| Mean | 0.015 | 2.668 | 0.962 | 0.073 | 9.277 | 0.685 | 0.051 | 5.241 | 0.908 | 0.082 | 8.802 | 0.744 |
The root mean square error (RMSE) represents the magnitude of the prediction errors, while NRMSE is a normalized RMSE.
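The metrics named in this note, together with the R2 value reported in the abstract, can be computed as in the sketch below. Two points are assumptions and not confirmed by the text: NRMSE is taken to be the RMSE divided by the range of the reference charges and expressed in percent, and R2 is taken to be the coefficient of determination (a squared Pearson correlation is another common definition). The function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted charges."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def nrmse_percent(y_true, y_pred):
    """RMSE normalized by the range of the reference values, in percent.

    Range normalization is an assumption; other conventions divide by the
    mean or the standard deviation.
    """
    return 100.0 * rmse(y_true, y_pred) / (max(y_true) - min(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy reference and predicted partial charges (made-up numbers, units of e).
reference = [-0.40, -0.10, 0.05, 0.30]
predicted = [-0.38, -0.12, 0.07, 0.27]
print(rmse(reference, predicted))
print(nrmse_percent(reference, predicted))
print(r_squared(reference, predicted))
```

Reporting the normalized error alongside the RMSE makes atom types comparable even when their charge ranges differ widely.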