| Literature DB >> 31737032 |
Roman Martin 1,2; Dominik Heider 1.
Abstract
In recent years, machine learning techniques have been widely used in biomedical research to predict unseen data based on models trained on experimentally derived data. In the current study, we used machine learning algorithms to emulate computationally complex predictions in a reverse engineering-like manner and developed ContraDRG, a so…
Keywords: ATB; PRODRG; machine learning; molecular dynamics simulations; partial charge prediction
Year: 2019 PMID: 31737032 PMCID: PMC6831742 DOI: 10.3389/fgene.2019.00990
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Schematic overview of the feature encoding. (A) Each atom is selected in turn (red dot), and encodings are generated (B–D). (B) Overall circular structures (green line) and nested ones (colored areas) are detected by a depth-first search. (C) Distance searches with three different radii are applied. (D) Second-level neighbor path tracing (orange arrows, first level; green arrows, second level). Chemical structures were drawn with MolView (https://molview.org).
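The depth-first ring detection described for panel (B) can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the `ring_atoms` helper, the atom indexing, and the bond-list input format are all hypothetical.

```python
def ring_atoms(n_atoms, bonds):
    """Return the set of atom indices that lie on at least one ring.

    bonds: list of undirected (u, v) atom-index pairs.
    A back edge found during DFS closes a ring; every atom on the
    tree path between its endpoints is marked as a ring atom.
    """
    adj = [[] for _ in range(n_atoms)]
    for u, v in bonds:
        adj[u].append(v)
        adj[v].append(u)

    depth = [-1] * n_atoms   # -1 means "not yet visited"
    parent = [-1] * n_atoms
    in_ring = set()

    def dfs(u, d):
        depth[u] = d
        for w in adj[u]:
            if depth[w] == -1:
                parent[w] = u
                dfs(w, d + 1)
            elif w != parent[u] and depth[w] < depth[u]:
                # Back edge to an ancestor: walk the parent chain u -> w
                x = u
                while x != w:
                    in_ring.add(x)
                    x = parent[x]
                in_ring.add(w)

    for start in range(n_atoms):  # handle disconnected fragments
        if depth[start] == -1:
            dfs(start, 0)
    return in_ring
```

For example, toluene-like connectivity (a six-membered ring with one substituent) yields exactly the six ring atoms; nested ring systems are covered because each back edge marks its own cycle.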
Figure 2. (A) The violin plots show the Tanimoto coefficients for both datasets. The plot width correlates with the relative frequency of each coefficient value. The white dot marks the median, the black box the interquartile range, and the black lines the 95% confidence intervals. One-sample t tests on both sets of Tanimoto coefficients yield p < 0.001 for a mean below 0.15. (B) The distribution of atom types in each dataset is shown as relative bar plots.
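The Tanimoto coefficient used in Figure 2 is the Jaccard index over molecular fingerprint bits. The excerpt does not state which fingerprint type the authors used, so the sketch below treats fingerprints abstractly as sets of set bits; the `tanimoto` helper is hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two fingerprint bit sets:
    |A ∩ B| / |A ∪ B|. 1.0 means identical fingerprints, 0.0 no shared bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)
```

With a mean coefficient below 0.15, the two datasets share few fingerprint bits on average, supporting the claim that their molecules are structurally dissimilar.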
Figure 3. Smoothed kernel density estimates represent the distribution of partial charges (units of e) for each molecule in the datasets. The distribution from the PRODRG dataset shows more clustered peaks (green) than that from ATB (red).
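A kernel density estimate of the kind plotted in Figure 3 averages a smoothing kernel placed on each sample. The minimal pure-Python Gaussian KDE below is a sketch under assumed conventions (hand-picked bandwidth, Gaussian kernel); it is not the authors' plotting code.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a callable density estimate built from Gaussian kernels.

    Each sample contributes a Gaussian bump of width `bandwidth`;
    the estimate at x is the average of all bumps evaluated at x.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density
```

Evaluating such a density on a grid of partial-charge values and plotting it produces the smoothed curves shown in the figure; narrow bandwidths would reveal the clustered peaks of the PRODRG distribution.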
Performance comparison for partial charge prediction (units of e) by random forests and support vector machines with a linear kernel on the PRODRG and ATB datasets.
| Atom | PRODRG RF RMSE | PRODRG RF NRMSE | PRODRG RF R² | PRODRG SVM RMSE | PRODRG SVM NRMSE | PRODRG SVM R² | ATB RF RMSE | ATB RF NRMSE | ATB RF R² | ATB SVM RMSE | ATB SVM NRMSE | ATB SVM R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C | 0.011 | 1.443 | 0.989 | 0.054 | 7.073 | 0.738 | 0.069 | 2.398 | 0.961 | 0.152 | 5.268 | 0.810 |
| H | 0.005 | 2.878 | 0.955 | 0.026 | 13.924 | 0.010 | 0.018 | 2.313 | 0.980 | 0.046 | 5.794 | 0.879 |
| N | 0.048 | 1.986 | 0.990 | 0.249 | 10.374 | 0.730 | 0.113 | 5.391 | 0.919 | 0.163 | 7.772 | 0.834 |
| O | 0.051 | 3.184 | 0.971 | 0.153 | 9.494 | 0.739 | 0.047 | 4.200 | 0.887 | 0.071 | 6.302 | 0.746 |
| P | 0.002 | 0.152 | 1.000 | 0.073 | 7.157 | 0.965 | 0.075 | 3.712 | 0.892 | 0.097 | 4.803 | 0.823 |
| S | 0.015 | 0.678 | 1.000 | 0.120 | 5.454 | 0.985 | 0.068 | 3.095 | 0.982 | 0.087 | 3.962 | 0.971 |
| F | 0.003 | 2.436 | 0.993 | 0.007 | 5.184 | 0.968 | 0.017 | 4.179 | 0.897 | 0.037 | 9.205 | 0.520 |
| Cl | 0.004 | 2.724 | 0.980 | 0.020 | 15.293 | 0.415 | 0.030 | 5.490 | 0.895 | 0.054 | 9.796 | 0.705 |
| Br | 0.011 | 8.625 | 0.791 | 0.016 | 12.222 | 0.589 | 0.033 | 8.796 | 0.778 | 0.049 | 13.033 | 0.531 |
| I | 0.004 | 2.575 | 0.955 | 0.010 | 6.592 | 0.706 | 0.036 | 12.840 | 0.888 | 0.062 | 22.082 | 0.624 |
| Mean | 0.015 | 2.668 | 0.962 | 0.073 | 9.277 | 0.685 | 0.051 | 5.241 | 0.908 | 0.082 | 8.802 | 0.744 |
The root-mean-square error (RMSE) quantifies the magnitude of the prediction errors, while NRMSE is a normalized RMSE.
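The two error metrics in the table can be sketched as follows. The normalization used for NRMSE is not stated in this excerpt, so the sketch assumes one common convention (RMSE divided by the observed value range, expressed in percent); the function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def nrmse_percent(y_true, y_pred):
    """RMSE normalized by the observed range (one common convention), in percent.

    Normalization makes errors comparable across atom types whose
    partial charges span very different ranges.
    """
    return 100.0 * rmse(y_true, y_pred) / (max(y_true) - min(y_true))
```

Under this convention, an atom type with a narrow charge range can show a sizable NRMSE even when its absolute RMSE is tiny, which would explain rows such as Br in the table.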