
Improving Small Molecule pKa Prediction Using Transfer Learning With Graph Neural Networks.

Fritz Mayr¹, Marcus Wieder¹, Oliver Wieder¹, Thierry Langer¹.

Abstract

Enumerating protonation states and calculating microstate pKa values of small molecules is an important yet challenging task for lead optimization and molecular modeling. Commercial and non-commercial solutions have notable limitations such as restrictive and expensive licenses, high CPU/GPU hour requirements, or the need for expert knowledge to set up and use. We present a graph neural network model that is trained on 714,906 calculated microstate pKa predictions from molecules obtained from the ChEMBL database. The model is fine-tuned on a set of 5,994 experimental pKa values, significantly improving its performance on two challenging test sets. Combining the graph neural network model with Dimorphite-DL, an open-source program for enumerating ionization states, we have developed the open-source Python package pkasolver, which is able to generate and enumerate protonation states and calculate pKa values with high accuracy.

Entities: Chemical

Keywords: Graph Neural Network (GNN); pKa; physical properties; protonation states; transfer learning

Year: 2022 PMID: 35721000 PMCID: PMC9204323 DOI: 10.3389/fchem.2022.866585

Source DB: PubMed Journal: Front Chem ISSN: 2296-2646 Impact factor: 5.545

1 Introduction

The acid dissociation constant (Ka), most often written as its negative logarithm (pKa), plays a significant role in molecular modeling, as it influences the charge, tautomer configuration, and overall 3D structure of molecules with accessible protonation states in the physiological pH range. All these factors further shape the mobility, permeability, stability, and mode of action of substances in the body (Manallack et al., 2013). When empirical data are insufficient or missing, the correct determination of pKa values is thus essential to correctly predict the aforementioned molecular properties. Authors and studies disagree on the exact percentage of drugs with ionizable groups, but a conservative estimate suggests that at least two-thirds of all drugs contain one or more ionizable groups (in a pH range of 2–12) (Manallack, 2007). The importance of pKa predictions for drug discovery has been widely recognized and has been the topic of multiple blind predictive challenges, most notably the Statistical Assessment of Modeling of Proteins and Ligands (SAMPL) series: SAMPL6 (Işık et al., 2021), SAMPL7 (Bergazin et al., 2021), and the ongoing SAMPL8 challenge. Multiple methods have been developed to estimate pKa values of small molecules, ranging from physical models based on quantum chemistry calculations (Selwa et al., 2018; Tielker et al., 2018) and/or free energy calculations (Prasad et al., 2018; Zeng et al., 2018) to empirical models based on linear free energy relationships using the Hammett-Taft equation or more data-driven methods using quantitative structure-property relationship (QSPR) and machine learning (ML) approaches like deep neural network or random forest models (Liao and Nicklaus, 2009a; Rupp et al., 2011; Mansouri et al., 2019; Baltruschat and Czodrowski, 2020a; Bergazin et al., 2021). 
In general, empirical methods require significantly less computational effort than their physics-based counterparts once they are parameterized, but they require a relatively large number of high-quality data points as a training set (Bergazin et al., 2021). In recent years, machine learning methods have been widely applied to predict different molecular properties, including pKa values. Many of these approaches learn pKa values on fingerprint representations of molecules (Baltruschat and Czodrowski, 2020a; Yang et al., 2020). The pKa value of an acid and conjugate base pair is determined by the molecular structure and the molecular effects on the reaction center exerted by its neighborhood, including mesomeric, inductive, steric, and entropic effects (Perrin et al., 1981). Ideally, these effects should be included and encoded in a suitable fingerprint or set of descriptors. For many applications, extended-connectivity fingerprints (ECFPs) in combination with molecular features have proven to be a suitable and powerful tool to learn structure-property relationships (Rogers and Hahn, 2010; Jiang et al., 2021). The emergence of graph neural networks (GNNs) has shifted some focus from descriptors and fingerprints designed by domain experts to these emerging deep learning methods. GNNs are a class of deep learning methods designed to perform inference on data described by graphs and provide straightforward ways to perform node-level, edge-level, and graph-level prediction tasks (Wu et al., 2019; Wieder et al., 2020; Zhou et al., 2020). GNNs are capable of learning representations and features for a specific task in an automated way, eliminating the need for excessive feature engineering (Gilmer et al., 2017). Another aspect of their attractiveness for molecular property prediction is the ease with which a molecule can be described as an undirected graph, transforming atoms to nodes and bonds to edges and encoding both atom and bond properties. 
GNNs have proven to be useful and powerful tools in the machine learning molecular modeling toolbox (Gilmer et al., 2017; Deng et al., 2021). Pan et al. (2021) have shown that GNNs can be successfully applied to pKa predictions of chemical groups of a molecule, outperforming more traditional machine learning models relying on human-engineered descriptors and fingerprints, and developed MolGpKa, a web server for predicting pKa values. MolGpKa was trained on molecules extracted from the ChEMBL database (Gaulton et al., 2012) containing predicted pKa values (predicted with ACD/Labs Physchem software). Only the most acidic and most basic pKa values were considered for the training of the GNN models. The goal of this work was to extend the scope of predicting pKa values for independently ionizable atoms (realized in MolGpKa) and develop a workflow that is able to enumerate protonation states and predict the corresponding pKa values connecting them (sometimes referred to as “sequential pKa prediction”). To achieve this we implemented and trained a GNN model that is able to predict values for both acidic and basic groups by considering the protonated and deprotonated species involved in the corresponding acid-base reaction. We trained the model in two stages. First, we pre-trained the model on calculated microstate pKa values for a large set of molecules obtained from the ChEMBL database (Gaulton et al., 2012). The pre-trained model already performs well on the two independent test sets used to measure the performance of the trained models. To improve its performance we then fine-tuned the model on a small training set of molecules for which experimental pKa values were available. The fine-tuned model shows excellent and improved performance on the two test sets. 
We have implemented the training routine and prediction pipeline in an open-source Python package named pkasolver, which is freely available and can be obtained as described in the Code and data availability section. Due to the terms of its licence agreement we are unable to distribute models trained using results generated with Epik. Users with an Epik licence can follow the instructions outlined in the data repository to obtain the fine-tuned models. For users without such a licence we provide models trained without Epik. We also provide a ready-to-use Google Colab Jupyter notebook which includes trained models and can be used to predict pKa values for molecules without locally installing the package (for further information see the Code and data availability section) (Bisong, 2019).

2 Results and Discussion

We will start by discussing the performance of the model on the validation set of the ChEMBL data set (which contains pKa values calculated with Epik on a subset of the ChEMBL database) and the two independent test sets: the Novartis test set (280 molecules) and the Literature test set (123 molecules). This will be followed by a discussion of the fine-tuned model on its validation set (experimental data set), on both test sets, and on the ChEMBL data set. Subsequently, we will discuss the performance of the models trained only on the monoprotic experimental data set (without transfer learning). Finally, we will discuss the developed pkasolver package, its use cases, and limitations. Performance of the different predictive models is reported using the mean absolute error (MAE) and root mean squared error (RMSE). For each metric (MAE and RMSE) the median value from 50 repetitions with different training/validation set splits is reported and the 90% confidence interval is shown. To visualize training results a single training run (out of the 50) was randomly selected and the results on the validation set plotted. In the following sections we will use the term pkasolver to describe the sequential pKa prediction pipeline using trained GNN models. To distinguish between the transfer learning approach (models trained on both the ChEMBL and the experimental data set) and the models trained only on the experimental data set we will refer to the former as pkasolver-epic and the latter as pkasolver-light.
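The reporting scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the 90% interval is assumed here to be the 5th to 95th percentile of the per-run metric values, since the text does not state how the interval was computed.

```python
import math
import statistics


def mae(reference, predicted):
    """Mean absolute error between reference and predicted pKa values."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def rmse(reference, predicted):
    """Root mean squared error between reference and predicted pKa values."""
    squared = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared / len(reference))


def percentile(values, q):
    """Percentile with linear interpolation between ranks, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(metric_per_run):
    """Median and (assumed) 5th-95th percentile interval over repeated runs."""
    return (
        statistics.median(metric_per_run),
        percentile(metric_per_run, 5),
        percentile(metric_per_run, 95),
    )
```

With one MAE (or RMSE) value per training repetition, `summarize` returns the three numbers that appear in the text as `median [lower; upper]`.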

2.1 Pre-Training Model Performance

The initial training of the GNN model was performed using the ChEMBL data set (microstate pKa values calculated with Epik). Supplementary Figure S3A shows the results of the best performing model on the hold-out validation set. With an MAE of 0.29 [90% CI: 0.28; 0.31] and an RMSE of 0.45 [90% CI: 0.44; 0.49] pKa units, the model shows a good fit across the reference pKa values. The kernel density estimates (KDE) of the distribution of the reference and predicted pKa values shown in Supplementary Figure S3A highlight the ability of the GNN to correctly learn to predict pKa values throughout the investigated pH range. The performance of the trained GNN model was assessed on two independent test sets: the Novartis and the Literature test set (both test sets are described in detail in the Methods section) (Baltruschat and Czodrowski, 2020a). The trained model performs well on both test sets with an MAE of 0.62 [90% CI: 0.57; 0.67] and an RMSE of 0.97 [90% CI: 0.89; 1.10] pKa units on the Literature test set and an MAE of 0.82 [90% CI: 0.77; 0.85] and an RMSE of 1.13 [90% CI: 1.05; 1.21] pKa units on the Novartis test set (shown in Supplementary Figure S2). The performance is comparable to the performance of Epik and Marvin on both test sets (shown in Table 1).

TABLE 1

Model	Novartis MAE	Novartis RMSE	Literature MAE	Literature RMSE
Random Forest ¹,³	1.15	1.51	0.53	0.76
ChemAxon Marvin (V20.1.0) ³	0.86	1.17	0.57	0.87
MolGpKa (Pan et al., 2021)	0.87 [0.77;0.97]	1.27 [1.08;1.45]	0.49 [0.40;0.65]	1.00 [0.56;1.53] ⁴
Epik ² (Pan et al., 2021)	0.83 [0.75;0.91]	1.16 [1.06;1.26]	0.58 [0.48;0.67]	0.92 [0.74;1.12]
pkasolver-epic	0.71 [0.64;0.74]	0.93 [0.85;0.97]	0.52 [0.49;0.56]	0.82 [0.76;0.86]
pkasolver-light	0.86 [0.81;0.94]	1.13 [1.04;1.20]	0.56 [0.51;0.64]	0.82 [0.71;0.93]

¹ Used a random forest implementation with 1,000 estimators and the FCFP6 fingerprint. Values for the best performing random forest implementation are shown.

² Epik identified different protonation centers than were reported in the data sets for the Novartis data set for 26 out of 280 molecules. These molecules were excluded from the MAE and RMSE calculation for Epik.

³ Values were obtained from Baltruschat and Czodrowski (2020a).

⁴ The reason for the large confidence interval is the incorrect prediction for a single molecule (isomeric SMILES: CCNC) by MolGpKa with an error of 8.86 pKa units.

Performance of state-of-the-art knowledge-based approaches and commercial software solutions to predict pKa values on the Novartis and Literature test sets. For each data set, the mean absolute error (MAE) and root mean squared error (RMSE) are reported in pKa units. For MolGpKa, Epik, pkasolver-epic, and pkasolver-light the median value and the 90% confidence interval are reported.

2.2 Fine-Tuned Model Performance

While the performance on the test sets of the pre-trained model was already acceptable we were able to further increase model accuracy by fine-tuning the pre-trained model using a data set of experimentally measured pK values. The performance of the fine-tuned model on the validation set of the experimental data set is shown in Supplementary Figure S3B. The median performance of the fine-tuned model was improved from a RMSE of 0.97 [90% CI:0.89;1.10] to 0.82 [90% CI:0.76;0.88] pK units on the Literature test set and from a RMSE of 1.13 [90% CI:1.05;1.21] to 0.93 [90% CI:0.85;0.97] pK units on the Novartis test set (shown in Figure 1).

FIGURE 1

The fine-tuned GNN model is able to predict the pKa values of the Novartis and Literature test sets with high accuracy. Panel (A) shows the performance of the fine-tuned model (initially trained with the ChEMBL data set and subsequently fine-tuned on the experimental data set) on the Literature test set. Panel (B) shows the performance of the same model on the Novartis test set. The solid red line in the scatter plot indicates the ideal behavior of the reference and predicted pKa values; the dashed lines mark the ±1 pKa unit interval. Mean absolute error (MAE) and root mean squared error (RMSE) are shown; the values in brackets indicate the 90% confidence interval calculated from 50 repetitions with random training/validation splits. N indicates the number of investigated samples.

In order to avoid model performance degradation on the ChEMBL data set we randomly added molecules from the ChEMBL data set during the fine-tuning workflow. Adding molecules from the ChEMBL data set to restrict model parameters and avoid overfitting decreased the performance of the fine-tuned model on the ChEMBL data set only slightly (shown in Supplementary Figure S4). This was necessary since previous attempts without regularization showed decreased accuracy of the fine-tuned model in regions outside the limited pH range of the experimental data set while improving the performance on the test sets (details on the pH range of both the ChEMBL and experimental data set are shown in Supplementary Figure S6). An example of the performance of the fine-tuned model on the ChEMBL data set without regularization is shown in Supplementary Figure S7. To put the performance of the fine-tuned model in context we compare it with two recent publications investigating pKa predictions using machine learning. In Table 1 the results are summarized for the methods presented in both Baltruschat and Czodrowski (2020a) and Pan et al. (2021). 
We extracted data from these publications where appropriate and recalculated values if needed. Pan et al. (2021) split the reported results into basic and acidic groups, making it necessary to recalculate the values reported there for MolGpKa and Epik; the values for Marvin were taken directly from Baltruschat and Czodrowski (2020a) (reported values were calculated without confidence interval). The fine-tuned GNN model (shown as pkasolver-epic in Table 1) performs on a par with the best performing methods reported there. It is difficult to rationalize MAE/RMSE differences between the different methods/models shown in Table 1 since training sets and methods are different. The small difference in performance between pkasolver-epic and MolGpKa could be attributed to the transfer learning routine, which added experimentally measured pKa values. The random forest model was trained on significantly less data (only on the 5,994 pKa values present in the experimental data set) than either pkasolver or MolGpKa, yet it performs comparably to both on the Literature data set while performing significantly worse on the Novartis data set. This might highlight the complexity of the Novartis data set, an observation previously made and investigated in Pan et al. (2021). Both Epik and Marvin perform well on both test data sets. It is surprising that pkasolver-epic can slightly outperform Epik, even though its initial training was based on data calculated by Epik. We think this emphasizes the potential of transfer learning as used in this work and data-driven deep learning in general.

2.3 Training on the Experimental Data Set Without Transfer Learning

To provide a ready-to-use pKa prediction pipeline for which we can distribute the trained models under the MIT licence we trained models exclusively on the experimental data set. The performance of these models on the Novartis and Literature data sets is shown in Supplementary Figure S5 and summarized in Table 1 (shown as pkasolver-light). While the results are comparable to Epik and MolGpKa on the test sets, it is important to stress that both test sets contain only monoprotic molecules (Baltruschat and Czodrowski, 2020a).

2.4 Sequential pKa Predictions With Pkasolver

Combining the trained GNN models with Dimorphite-DL, a tool that identifies potential protonation sites and enumerates protonation states, enabled us to perform sequential pKa predictions. A detailed description of this approach is given in the Detailed methods section. We investigated multiple mono- and polyprotic molecules for qualitative and quantitative agreement between prediction and experimental data. The results for the investigated systems were of excellent consistency using pkasolver-epic and of reasonable accuracy using pkasolver-light. The list of molecules that we tested is included in the pkasolver repository, and a subset of molecules of general interest for drug discovery is discussed in detail in the Supplementary Materials section.

2.5 Limitations of Pkasolver

The sequential pKa prediction of pkasolver generates microstates and the calculated pKa values are microstate pKa values. One limitation of pkasolver is that only a single microstate per macrostate is generated. Tautomeric and mesomeric states are never changed during the sequential de-/protonation (i.e., double bond positions are fixed). For each protonation state the bond pattern of the molecule that was proposed by Dimorphite-DL at pH 7.4 is used. This shortcoming has several consequences. First, it leads to unusual protonation states. One example that has been observed throughout the sequential pKa prediction tests with pkasolver-epic is amide groups with a negative charge on the nitrogen atom. The more likely position of the charge is the more electronegative oxygen atom. This has little practical consequence since this pattern was also present in the pKa prediction training set generated with Epik (the mesomeric state was fixed in training too). A far more severe limitation is the fact that it is not possible to model multiple microstates within a single macrostate, since tautomers cannot be changed (Gunner et al., 2020). To overcome this limitation it is necessary to enumerate tautomers for each protonation state and estimate their relative population. Solving this particular problem will be part of future work.

2.5.1 Limitations of Pkasolver-Light

The training set of pkasolver-light contains only monoprotic pKa data with the majority of pKa values between 4 and 10 (as shown in Supplementary Figure S6) (Baltruschat and Czodrowski, 2020a). The trained models are not necessarily suitable for polyprotic molecules. This limitation becomes apparent in the in-depth discussion of some mono- and polyprotic molecules in the Supplementary Materials section. For polyprotic molecules it is highly recommended to use pkasolver-epic instead of pkasolver-light.

2.5.2 Limitations of Pkasolver-Epic

The pre-training data set imposes limitations on the applicability domain of the pKa predictions with pkasolver-epic. The selection criteria of the pre-training data set are described in the Methods section. In Supplementary Figure S8 the distributions of several molecular properties (molecular weight, number of heteroatoms, number of hydrogen bond acceptors/donors, frequency of elements) are shown. The transferability of the trained models to molecules outside these distributions has not been tested and the usage of pkasolver-epic for such molecules is not recommended.

3 Detailed Methods

3.1 Data Set Generation and Pre-processing

Four different data sets were used in this work: the ChEMBL data set, the experimental data set, the Novartis data set and the Literature data set. The ChEMBL data set used for pre-training was obtained from the ChEMBL database using the number of Rule-of-Five violations (set to a maximum of one violation) as filter criterion (Gaulton et al., 2012; Davies et al., 2015). For each of the molecules, a pKa scan for the pH range between 0 and 14 was performed using the Schrödinger tool Epik (Shelley et al., 2007; Greenwood et al., 2010) (Version-2121-1). The sequential pKa scan indicated one or multiple protonation states for 320,800 molecules, resulting in a total of 729,375 pKa values. For each pKa value, Epik further indicated the protonation center using the atom index of the heavy atom at which a hydrogen is either attached or removed. To perform transfer learning we obtained a second data set with experimental pKa values. This data set (subsequently called 'experimental data set') was developed by Baltruschat and Czodrowski (2020b) and can be acquired from their GitHub repository. For a detailed description of the curation steps taken to generate this data set, we point the reader to the Methods section of Baltruschat and Czodrowski (2020b). The experimental data set consists of 5,994 unique molecules, each with a single pKa value and an atom index indicating the reaction center. Some of the molecules had to be corrected to obtain their protonation state at pH 7.4 (examples shown in ??). To test the performance of the models, two independent data sets were used, which were provided and curated by Baltruschat and Czodrowski (2020b). The Literature data set contains 123 compounds collected by manually curating the literature. The Novartis data set contains 280 molecules provided by Novartis (Liao and Nicklaus, 2009b). For each molecule, a pKa value and an atom index indicating the reaction center were provided. 
To avoid training the model on molecules present in the Literature or Novartis data set we filtered the ChEMBL data set using the InChIKey and canonical SMILES strings of the neutralized molecules as matching criteria. In total, 50 molecules were identified and removed from the ChEMBL data set. All checks were performed using RDKit (RDKit: Open-Source Cheminformatics, 2022).

3.2 Enumerate Protonation States During Training/Testing

The goal of calculating microstate pK values is to find the pH value at which the concentration of two molecular species is equal. To do this efficiently, we provide as input the protonated and deprotonated molecular species of the acid-base pair for which we want to calculate the pK value (the Brønsted acid/base definitions are used here and subsequently (McNaught and Wilkinson, 2014)). This approach enables a consistent treatment of acids and bases with a single data structure (the acid-base pair). This workflow made it necessary that we generate the molecular species at each protonation state starting from the molecule at pH 7.4 by removing or adding hydrogen to the reaction center (which was calculated by Marvin for the experimental, Novartis, and Literature data set and Epik for the ChEMBL data set). We do this by sequentially adding hydrogen atoms from highest to lowest pK for acids (i.e., at pH = 0 all possible protonation sites are protonated) and removing hydrogen atoms from lowest to highest pK value for bases on the structure present at pH 7.4 (at pH = 14 all possible protonation sites are deprotonated). This approach presented challenges for the ChEMBL data set for which sequential pK values and reaction centers were calculated with Epik. Epik calculates the microstate pK value on the most probable tautomeric/mesomeric structure. This leads to potential protonation states that require changes in the double bond pattern and redistribution of hydrogen. Since we do not consider tautomeric changes to the molecular structure in the present implementation, such tautomeric changes can introduce invalid molecules in either the sequential removal or addition of hydrogen atoms. Whenever such molecular structures were encountered we removed these protonation states from further consideration. Additionally, we used RDKit’s sanitize function to identify cases for which protonation state changes introduce invalid atom valences. 
In other cases in which the protonation state change on a mesomeric structure introduces valid yet improbable molecular structures (e.g. protonating the oxygen in an amide instead of the nitrogen) we keep these structures. This reduced the number of molecules and protonation states in the ChEMBL data set to 286,816 molecules and 714,906 protonation states. The distribution of pK values for the ChEMBL and experimental data set is shown in Supplementary Figure S6.
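The enumeration of acid-base pairs described in this section can be illustrated with a small, toolkit-free sketch. It is not the pkasolver implementation: a protonation state is represented here as a hypothetical dictionary mapping the atom index of each reaction center to its hydrogen count, and the split at pH 7.4 (sites with a pKa below 7.4 gain a hydrogen, sites above lose one) is our reading of the text. Validity checks such as RDKit sanitization are omitted.

```python
def enumerate_pairs(state_at_ph74, sites, reference_ph=7.4):
    """Build (pKa, protonated, deprotonated) pairs from the pH 7.4 state.

    state_at_ph74: dict atom_index -> hydrogen count (hypothetical encoding)
    sites: list of (pka, atom_index) tuples for the reaction centers
    """
    pairs = []

    # Sites with pKa below the reference pH are deprotonated at pH 7.4:
    # add hydrogens step by step, from the highest pKa to the lowest.
    current = dict(state_at_ph74)
    for pka, atom in sorted((s for s in sites if s[0] < reference_ph), reverse=True):
        protonated = dict(current)
        protonated[atom] += 1
        pairs.append((pka, protonated, current))
        current = protonated

    # Sites with pKa above the reference pH are protonated at pH 7.4:
    # remove hydrogens step by step, from the lowest pKa to the highest.
    current = dict(state_at_ph74)
    for pka, atom in sorted(s for s in sites if s[0] >= reference_ph):
        deprotonated = dict(current)
        deprotonated[atom] -= 1
        pairs.append((pka, current, deprotonated))
        current = deprotonated

    return sorted(pairs, key=lambda pair: pair[0])
```

For an amino-acid-like toy input with illustrative values, `enumerate_pairs({0: 3, 4: 0}, [(9.6, 0), (2.3, 4)])` yields two pairs: one in which atom 4 gains a hydrogen (pKa 2.3) and one in which atom 0 loses a hydrogen (pKa 9.6). Each pair shares one species with the pH 7.4 state, which is what allows acids and bases to be handled with a single data structure.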

3.3 Training and Testing With PyTorch Geometric

We use PyTorch and PyTorch geometric (subsequently abbreviated as PyG) for model training, testing, and prediction of pK values on the graph data structures (Fey and Lenssen, 2019; Paszke et al., 2019).

3.3.1 Graph Data Structure

A graph G is defined as a set of nodes V and edges E connecting the nodes. Each node v ∈ V has a feature vector x_v, which encodes atom properties like element, charge, and number of hydrogens, as well as the presence of particular SMARTS patterns, as a one-hot-encoded bit vector (all atom properties are shown in Supplementary Table S1). The adjacency matrix A defines the connectivity of the graph. A is defined as a square matrix with A_uv = 1 if there is an edge between nodes u and v and A_uv = 0 otherwise. We used RDKit to generate a graph representation of the molecule with atoms represented as nodes and bonds as edges (in coordinate list format to efficiently represent the sparse matrix).
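To make the data structure concrete, the following sketch builds the adjacency matrix, its coordinate list (COO) form, and a one-hot element encoding in plain Python. The helper names and the three-element vocabulary are illustrative assumptions; the actual feature set is the one listed in Supplementary Table S1, and the package derives the graph from RDKit molecules.

```python
def to_adjacency(num_nodes, bonds):
    """Symmetric adjacency matrix of an undirected molecular graph."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in bonds:
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix


def to_coo(adjacency):
    """Coordinate list: two parallel index lists, one entry per directed edge."""
    rows, cols = [], []
    for u, row in enumerate(adjacency):
        for v, entry in enumerate(row):
            if entry:
                rows.append(u)
                cols.append(v)
    return [rows, cols]


def one_hot(value, vocabulary):
    """One-hot bit vector for a categorical atom property."""
    return [1 if value == item else 0 for item in vocabulary]


# Heavy-atom graph of ethanol: C0 - C1 - O2
adjacency = to_adjacency(3, [(0, 1), (1, 2)])
edge_index = to_coo(adjacency)
features = [one_hot(element, ["C", "N", "O"]) for element in ["C", "C", "O"]]
```

Because most atoms are bonded to only a few neighbors, the COO form stores two integers per directed edge instead of a full matrix with one entry per atom pair.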

3.3.2 Graph Neural Network Architecture

To predict a single pKa value the graph neural network (GNN) architecture takes as input two graphs representing the conjugate acid-base pair as shown in Figure 2. Figure 2B shows the high-level architecture of the GNN used.

FIGURE 2

Panel (A) shows the general workflow used to train the GNN on pKa values for a single molecule. During the training and testing phase, each molecule was provided in the structure dominant at pH 7.4 with atom indices indicating the protonation sites and corresponding pKa values connecting them. In the Enumeration of protonation states phase we generate the protonation state for each pKa value. The molecular species for each of the protonation states are then translated into their graph representation using nodes for atoms and edges for bonds, with node feature vectors encoding atom properties in the Graph representation phase. In the pKa prediction phase graphs of two neighboring protonation states are combined and used as input for the GNN model to predict the pKa value for the acid-base pair [using the Brønsted–Lowry acid/base definition (McNaught and Wilkinson, 2014)]. The architecture of the GNN model is shown in detail in panel (B). For a pair of neighboring protonation states two independent GIN (graph isomorphism network) convolution layers and ReLU activation functions are used for the protonated and the deprotonated molecular graph to pass information of neighboring atoms and achieve the embedding of the chemical environment of each atom (Xu et al., 2019). The output of the convolutional layer is summarized using a global average pooling layer, generating the condensed input for the multilayer perceptron (MLP). To add regularization and to prevent co-adaptation of neurons a dropout layer was added.

There are three stages to predict a pKa value from a pair of molecular graphs. The first stage involves recurrently updating the node states using GIN (graph isomorphism network) convolution layers and ReLU activation functions (Xu et al., 2019). We used 3 GIN layers with an embedding size of 64 bits each to propagate information throughout the graph and update each node with information about the extended environment. 
In the second stage, a global average pooling is performed to produce the embedding of the protonated and deprotonated graph, resulting in two 32 bit vectors. Concatenating the two 32 bit vectors produces the input for the third stage, the multilayer perceptron (MLP) with 3 fully connected layers (each with an embedding size of 64). To add regularization and to prevent co-adaptation of neurons a dropout layer randomly zeros out elements of the pooling output vector with p = 0.5 during training. Additionally, batch normalization is applied as described in (Ioffe and Szegedy, 2015).

3.3.3 GNN Model Training

Before each training run the ChEMBL and experimental data sets were shuffled and randomly split into a training set (90% of the data) and a validation set (10% of the data). To ensure that we can reproduce these splits the seed for each split was recorded. The mean squared error (MSE) of predicted and reference pKa values on the training data set was calculated and parameter optimization was performed using the Adam optimizer with decoupled weight decay regularization (Loshchilov and Hutter, 2019) as implemented in PyTorch. Model performance was evaluated on the validation set and the model with the best performance was selected either for fine-tuning or further evaluation on the test data sets. The performance on the validation set was calculated after every fifth epoch and the corresponding weights were saved. The learning rate for all training runs was dynamically reduced by a factor of 0.5 if the validation set performance did not change within 150 epochs (validation set performance threshold was set to 0.1). Pre-training of the GNN was performed on the ChEMBL data set with a learning rate of 1 × 10⁻³ and a batch size of 512 molecules for 1,000 epochs. Fine-tuning was performed using the experimental data set with a learning rate of 1 × 10⁻³ and a batch size of 64 molecules for 1,000 epochs. All parameters of the GNN models were optimized during fine-tuning. To avoid overfitting to the experimental data set we added to each batch of the fine-tuning data set a randomly selected batch (1,024 molecules) of the pre-training data set. To calculate the confidence intervals of the model performance, pre-training and fine-tuning were repeated 50 times, each with a random training-validation set split, resulting in 50 independently fine-tuned models.
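Two data-handling steps of this protocol, the seeded 90/10 split and the mixing of pre-training molecules into every fine-tuning batch, can be sketched independently of any deep learning framework. The function names below are hypothetical and the data are stand-in lists; the sketch only illustrates the batching logic, not the authors' PyTorch training loop.

```python
import random


def split_train_validation(items, seed, train_fraction=0.9):
    """Shuffle with a recorded seed and split into training/validation sets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mixed_batches(experimental, pretraining, rng, batch_size=64, replay_size=1024):
    """Yield fine-tuning batches, each extended by random pre-training samples.

    Re-using pre-training data in every batch acts as a regularizer that keeps
    the fine-tuned model close to what it learned during pre-training.
    """
    for start in range(0, len(experimental), batch_size):
        batch = experimental[start:start + batch_size]
        replay = rng.sample(pretraining, min(replay_size, len(pretraining)))
        yield batch + replay
```

Recording the seed makes each of the 50 repetitions reproducible: calling `split_train_validation` twice with the same seed returns the same split.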

3.4 Sequential pKa Value Prediction With Pkasolver

We use Dimorphite-DL to identify the proposed structure at pH 7.4 and all de-/protonation sites for a given molecule (Ropp et al., 2019). We iteratively protonate each of the proposed de-/protonation sites, generating a molecular pair consisting of the protonated and deprotonated molecular species (in the first iteration the deprotonated molecule is the molecule at pH 7.4). For each of the protonated/deprotonated pairs a pKa value is calculated. The protonated structure with the highest pKa value (but below pH 7.4) is kept and the protonation site is removed from the list of possible protonation sites. This is repeated until either (1) all protonation sites are protonated, (2) no more valid molecules can be generated, or (3) the calculated pKa values are outside the allowed pKa range. To enumerate all deprotonated structures we start again with the structure at pH 7.4 and iteratively deprotonate each of the proposed de-/protonation sites. Here, we always keep the deprotonated structure with the lowest pKa value that is above 7.4. pKa values are calculated using 25 of the 50 fine-tuned GNN models. For each protonation state, the average pKa value is calculated and the standard deviation is shown to enable the user to identify molecules or protonation states for which the GNN model estimates are uncertain. We provide a ready-to-use implementation of pkasolver to predict sequential pKa values in our GitHub repository (for further information see the Code and data availability section).
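The greedy protonation sweep and the ensemble averaging can be sketched as follows. This is an illustration under stated assumptions, not the pkasolver API: a state is a hypothetical frozenset of protonated site indices, `predict` stands in for the trained GNN, the lower pKa bound of 0 is an assumed value, the sample standard deviation is used because the text does not say which estimator is reported, and the validity check on generated molecules (stopping criterion 2) is left out because it needs a chemistry toolkit.

```python
import statistics


def ensemble_pka(models, protonated, deprotonated):
    """Mean and sample standard deviation over an ensemble of predictors."""
    predictions = [model(protonated, deprotonated) for model in models]
    return statistics.mean(predictions), statistics.stdev(predictions)


def protonation_sweep(start_state, sites, predict, reference_ph=7.4, pka_min=0.0):
    """Greedily protonate sites, starting from the state at the reference pH.

    start_state: frozenset of already protonated site indices (hypothetical)
    predict: callable(protonated_state, deprotonated_state) -> pKa
    """
    results = []
    current = frozenset(start_state)
    remaining = [site for site in sites if site not in current]
    while remaining:
        candidates = []
        for site in remaining:
            protonated = current | {site}
            pka = predict(protonated, current)
            if pka_min <= pka < reference_ph:
                candidates.append((pka, site, protonated))
        if not candidates:
            break  # every remaining pKa is outside the allowed range
        pka, site, protonated = max(candidates, key=lambda c: c[0])
        results.append((pka, site))
        remaining.remove(site)
        current = protonated
    return results
```

The deprotonation sweep mirrors this loop: it starts again from the pH 7.4 state, removes one site at a time, and keeps the candidate with the lowest pKa above 7.4. A large standard deviation returned by `ensemble_pka` flags protonation states on which the individual models disagree.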

4 Conclusion

We have shown that GNNs can be used to predict mono- and polyprotic pKa values and achieve excellent performance on two external test sets. Training the GNN model in two stages, with a pre-training phase on a large set of molecules with calculated pKa values and a fine-tuning phase on a small set of molecules with experimentally measured pKa values, improves the performance of the GNN model significantly. This performance boost is especially noteworthy on the challenging Novartis test set (the RMSE decreased from 1.18 [1.05;1.27] to 0.93 [0.85;0.97] pKa units). A direct comparison with other software solutions and machine learning models on the two test sets shows that the fine-tuned GNN model performs consistently on a par with the best results of other commercial and non-commercial tools. We have implemented pkasolver as an open-source and free-to-use Python package under a permissive license (MIT license). We provide two versions of the package: pkasolver-epic and pkasolver-light. The former performs best on both test sets and is suitable for sequential pKa prediction on polyprotic molecules. It was pre-trained on a subset of the ChEMBL data set for which pKa values were predicted using Epik and fine-tuned on experimental monoprotic pKa values. Due to the terms of the Epik license agreement, we are unable to supply the trained pkasolver-epic models, but we provide the training pipeline to reproduce them (which requires an active Epik license). pkasolver-light performs well on both test sets, but its application domain is limited to monoprotic molecules. The pkasolver-light models are the trained models distributed with the pkasolver package.
