Fritz Mayr, Marcus Wieder, Oliver Wieder, Thierry Langer.
Abstract
Enumerating protonation states and calculating microstate pKa values of small molecules is an important yet challenging task for lead optimization and molecular modeling. Commercial and non-commercial solutions have notable limitations such as restrictive and expensive licenses, high CPU/GPU hour requirements, or the need for expert knowledge to set up and use. We present a graph neural network model that is trained on 714,906 calculated microstate pKa predictions from molecules obtained from the ChEMBL database. The model is fine-tuned on a set of 5,994 experimental pKa values, significantly improving its performance on two challenging test sets. Combining the graph neural network model with Dimorphite-DL, an open-source program for enumerating ionization states, we have developed the open-source Python package pkasolver, which is able to generate and enumerate protonation states and calculate pKa values with high accuracy.
Keywords: Graph Neural Network (GNN); PKA; physical properties; protonation states; transfer learning
Year: 2022 PMID: 35721000 PMCID: PMC9204323 DOI: 10.3389/fchem.2022.866585
Source DB: PubMed Journal: Front Chem ISSN: 2296-2646 Impact factor: 5.545
The performance of state-of-the-art knowledge-based approaches and commercial software solutions in predicting pKa values on the Novartis and Literature test sets is shown. For each data set, the mean absolute error (MAE) and root mean squared error (RMSE) are calculated. For MolGpKa, Epik, pkasolver-epic, and pkasolver-light, the median value and the 90% confidence interval are reported.
| Model | Novartis data set, MAE | Novartis data set, RMSE | Literature data set, MAE | Literature data set, RMSE |
|---|---|---|---|---|
| Random Forest | 1.15 | 1.51 | 0.53 | 0.76 |
| ChemAxon Marvin (V20.1.0) | 0.86 | 1.17 | 0.57 | 0.87 |
| MolGpKa | 0.87 [0.77;0.97] | 1.27 [1.08;1.45] | 0.49 [0.40;0.65] | 1.00 [0.56;1.53] |
| Epik | 0.83 [0.75;0.91] | 1.16 [1.06;1.26] | 0.58 [0.48;0.67] | 0.92 [0.74;1.12] |
| pkasolver-epic | 0.71 [0.64;0.74] | 0.93 [0.85;0.97] | 0.52 [0.49;0.56] | 0.82 [0.76;0.86] |
| pkasolver-light | 0.86 [0.81;0.94] | 1.13 [1.04;1.20] | 0.56 [0.51;0.64] | 0.82 [0.71;0.93] |
The random forest model used 1,000 estimators and the FCFP6 fingerprint. Values for the best-performing random forest implementation are shown.
For the Novartis data set, Epik identified different protonation centers than those reported in the data set for 26 out of 280 molecules. These molecules were excluded from the MAE and RMSE calculation for Epik.
Values were obtained from Baltruschat and Czodrowski (Baltruschat and Czodrowski, 2020a).
The reason for the large confidence interval is the incorrect prediction for a single molecule (isomeric SMILES: CCNC) by MolGpKa, with an error of 8.86 pKa units.
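The two error metrics reported in the table can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the four pKa values are hypothetical and are not taken from either test set.

```python
import math


def mean_absolute_error(reference, predicted):
    """Average of |predicted - reference| over all samples."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for r, p in zip(reference, predicted)) / len(reference)


def root_mean_squared_error(reference, predicted):
    """Square root of the average squared difference over all samples."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared / len(reference))


# Hypothetical pKa values, for illustration only.
reference_pka = [4.8, 9.2, 7.1, 3.5]
predicted_pka = [5.0, 8.7, 7.1, 4.8]

mae = mean_absolute_error(reference_pka, predicted_pka)
rmse = root_mean_squared_error(reference_pka, predicted_pka)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

Because RMSE squares each error before averaging, one badly mispredicted molecule inflates it far more than it inflates MAE, which is consistent with a single outlier widening an RMSE interval while leaving the MAE interval comparatively narrow.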
FIGURE 1. The fine-tuned GNN model is able to predict the pKa values of the Novartis and Literature test sets with high accuracy. Panel (A) shows the performance of the fine-tuned model (initially trained with the ChEMBL data set and subsequently fine-tuned on the experimental data set) on the Literature test set. Panel (B) shows the performance of the same model on the Novartis test set. The solid red line in the scatter plot indicates the ideal agreement between reference and predicted pKa values; the dashed lines mark the ±1 pKa unit interval. Mean absolute error (MAE) and root mean squared error (RMSE) are shown; the values in brackets indicate the 90% confidence interval calculated from 50 repetitions with random training/validation splits. N indicates the number of investigated samples.
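A median with a 90% interval over repeated runs can be summarized as in the sketch below. It assumes the interval is taken as the 5th to 95th percentile of the per-repetition metric, which the caption does not state explicitly. The 50 MAE values are randomly generated placeholders rather than results from the study, and `percentile` and `summarize` are hypothetical helper names.

```python
import random
import statistics


def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)  # floor, since position is non-negative
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(values, coverage=90.0):
    """Return (median, lower bound, upper bound) of a central percentile interval."""
    tail = (100.0 - coverage) / 2.0
    return (
        statistics.median(values),
        percentile(values, tail),
        percentile(values, 100.0 - tail),
    )


# Placeholder MAE values standing in for 50 models trained on random splits.
rng = random.Random(0)
mae_per_repetition = [rng.gauss(0.70, 0.03) for _ in range(50)]

median, low, high = summarize(mae_per_repetition)
print(f"MAE = {median:.2f} [{low:.2f};{high:.2f}]")
```

The printed string follows the `median [lower;upper]` layout used for the bracketed values in the results table.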
FIGURE 2. Panel (A) shows the general workflow used to train the GNN on pKa values for a single molecule. During the training and testing phase, each molecule was provided in the structure dominant at pH 7.4, with atom indices indicating the protonation sites and the corresponding pKa values connecting them. In the "Enumeration of protonation states" phase, we generate the protonation state for each pKa value. In the "Graph representation" phase, the molecular species for each of the protonation states are then translated into their graph representation, using nodes for atoms and edges for bonds, with node feature vectors encoding atom properties. In the "pKa prediction" phase, the graphs of two neighboring protonation states are combined and used as input for the GNN model to predict the pKa value for the acid-base pair [using the Brønsted–Lowry acid/base definition (McNaught and Wilkinson, 2014)]. The architecture of the GNN model is shown in detail in panel (B). For a pair of neighboring protonation states, two independent GIN (graph isomorphism network) convolution layers with ReLU activation functions are used for the protonated and the deprotonated molecular graph to pass information between neighboring atoms and to embed the chemical environment of each atom (Xu et al., 2019). The output of the convolutional layer is summarized using a global average pooling layer, generating the condensed input for the multilayer perceptron (MLP). A dropout layer was added to provide regularization and to prevent co-adaptation of neurons.
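The data flow described in panel (B) can be illustrated with a small, untrained sketch in plain Python. It is not the pkasolver implementation and makes several assumptions: each GIN update uses a single linear map instead of an MLP, the two pooled embeddings are concatenated, the readout is one linear layer rather than a full MLP, and dropout is omitted because it only acts during training. The four node features, the layer sizes, and all function names are hypothetical.

```python
import random


def gin_layer(features, neighbors, weights, bias, eps=0.0):
    """GIN-style update: ReLU(W @ ((1 + eps) * h_v + sum of neighbor h_u) + b)."""
    updated = []
    for v, h_v in enumerate(features):
        # Scale the node's own features, then sum-aggregate its neighbors.
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors[v]:
            agg = [a + x for a, x in zip(agg, features[u])]
        # Linear transform followed by ReLU (a full GIN layer would use an MLP).
        updated.append([
            max(0.0, sum(w * a for w, a in zip(row, agg)) + b)
            for row, b in zip(weights, bias)
        ])
    return updated


def global_mean_pool(features):
    """Average node embeddings into one fixed-size graph embedding."""
    return [sum(column) / len(features) for column in zip(*features)]


def random_linear(rng, n_out, n_in):
    """Random (untrained) weight matrix and zero bias, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_pka(protonated, deprotonated, params):
    """Embed both protonation states with independent layers, then read out a scalar."""
    embedding = []
    for (features, neighbors), (weights, bias) in zip(
        (protonated, deprotonated), params["conv"]
    ):
        embedding += global_mean_pool(gin_layer(features, neighbors, weights, bias))
    w_out, b_out = params["readout"]
    return sum(w * x for w, x in zip(w_out[0], embedding)) + b_out[0]


# Toy acid-base pair with two heavy atoms (C-N) joined by one bond.
# Hypothetical node features: [is_carbon, is_nitrogen, formal_charge, n_hydrogens]
bonds = [[1], [0]]
protonated = ([[1.0, 0.0, 0.0, 3.0], [0.0, 1.0, 1.0, 3.0]], bonds)
deprotonated = ([[1.0, 0.0, 0.0, 3.0], [0.0, 1.0, 0.0, 2.0]], bonds)

rng = random.Random(0)
params = {
    "conv": [random_linear(rng, 8, 4), random_linear(rng, 8, 4)],
    "readout": random_linear(rng, 1, 16),
}
# The weights are random, so this number has no chemical meaning.
print(predict_pka(protonated, deprotonated, params))
```

Sum aggregation over neighbors is the defining feature of the GIN update, while mean pooling makes the graph embedding independent of the number of atoms, so molecules of different sizes feed the same readout.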