Literature DB >> 34643666

Multi-instance learning of graph neural networks for aqueous pKa prediction.

Jiacheng Xiong1,2, Zhaojun Li3, Guangchao Wang4, Zunyun Fu1, Feisheng Zhong1,2, Tingyang Xu5, Xiaomeng Liu1,2, Ziming Huang1,2, Xiaohong Liu1,3,6, Kaixian Chen1,2, Hualiang Jiang1,2,6, Mingyue Zheng1,2.   

Abstract

MOTIVATION: The acid dissociation constant (pKa) is a critical parameter to reflect the ionization ability of chemical compounds and is widely applied in a variety of industries. However, the experimental determination of pKa is intricate and time-consuming, especially for the exact determination of micro pKa information at the atomic level. Hence, a fast and accurate prediction of pKa values of chemical compounds is of broad interest.
RESULTS: Here, we compiled a large scale pKa dataset containing 16595 compounds with 17489 pKa values. Based on this dataset, a novel pK a prediction model, named Graph-pKa, was established using graph neural networks. Graph-pKa performed well on the prediction of macro pK a values, with a mean absolute error around 0.55 and a coefficient of determination around 0.92 on the test dataset. Furthermore, combining multi-instance learning, Graph-pKa was also able to automatically deconvolute the predicted macro pKa into discrete micro pK a values. AVAILABILITY: The Graph-pK a model is now freely accessible via a web-based interface (https://pka.simm.ac.cn/). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) (2021). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

Entities:  

Year:  2021        PMID: 34643666      PMCID: PMC8756178          DOI: 10.1093/bioinformatics/btab714

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The acid dissociation constant pKa, an equilibrium constant defined as the negative logarithm of the ratio of the protonated and deprotonated form of a compound, is a key parameter to describe the ionization ability of substances. It has been reported that about two-thirds of marketed drugs are ionizable in the aqueous solution (Manallack, 2007). Hence, in the design of new drugs, pKa is a crucial physical property to be considered, which has profound effects on biological activities, ADMET (absorption, distribution, metabolism, excretion and toxicity) properties and other properties of drugs (Charifson and Walters, 2014; Manallack ). Apart from the pharmaceutical industry, the pKa is also related to environmental ecotoxicology, agriculture and chemical industries. Hence, the fast and accurate prediction of pKa values of chemical compounds from their structures is of great interest. Graph neural networks (GNN) are a type of neural network to process graph structure data (Defferrard ; Niepert ). Since first introduced into the prediction of molecular properties several years ago (Duvenaud ), reports of different GNN architectures and their successful applications have been rapidly accumulating in this field (Sun ; Zhang ). However, so far, graph neural networks have rarely been applied in the prediction of pKa, presumably because the pKa values are not only molecular-level ‘global’ properties but also atomic-level ‘local’ properties (Fig. 1). The molecular-level ‘global’ properties refer to the macro-pKa, the acid dissociation constant related to the observable loss or gain of a proton from a molecule regardless of specific ionization site. The ‘local’ properties refer to micro-pKa, the acid dissociation constant related to the loss or gain of a proton from a single titratable site (Işık ). Apart from the macro-pKa, a powerful pKa prediction model should also be capable of providing micro-pKa information at the atomic level. Such information can not only enhance our confidence in the predicted results but also provide useful reference information for the structural modification of compounds, chemical reaction prediction and other related studies. However, for a molecule with multiple ionization sites, usually, we can only measure one or a few macro-pKa values experimentally, but not the micro-pKa values of all individual sites. Thus, it is intricate to predict micro-pKa values, posing a significant challenge to the overall prediction of pKa.
Fig. 1.

The relationship between macro- and micro-pKa of basic compounds. pKa(macro) refers to the macro-pKa; pKa1, pKa2 and pKa3 refer to the micro-pKa

The relationship between macro- and micro-pKa of basic compounds. pKa(macro) refers to the macro-pKa; pKa1, pKa2 and pKa3 refer to the micro-pKa In 2019, Roszak et al. built a graph convolution model for the prediction of the pKa value of the C-H bond in organic solvents and applied this model to predict the products of hydrogen abstraction reaction (Roszak ). To the best of our knowledge, this study was the only attempt to predict compound pKa with graph neural networks. However, the training of their model relied on the pKa dataset containing atomic level labels, which were mainly obtained from quantum chemical calculation or molecules with a single ionizable site. Hence, this method is difficult to extend to the pKa prediction of heterogeneous chemical classes with multiple ionizable sites. Another alternative strategy to obtain micro-pKa data is to assign the macro-pKa value of a molecule to its major responsible ionization site and take it as an approximation of the micro-pKa value. Recently, some pKa prediction models have used this strategy (Hunt ; Yang ), but there are also two significant problems. As illustrated in Figure 1, (i) for molecules such as propane-1,2,3-triamine, there are multiple sites having similar ionization capacity, this approximate treatment may bring large errors; (ii) the selection of major responsible ionization site is a non-trivial process and requires substantial chemical domain knowledge, and in many cases, a macro-pKa value could not be unambiguously assigned to one major ionizable group. Multi-instance learning (MIL) is a kind of weakly supervised learning algorithm for data with only coarse-grained labels (Zhou, 2018). In classic MIL, the training set is composed of many ‘bags’, each of which contains a series of ‘instances’. A bag is labeled as positive if containing at least one positive instance; otherwise, it is labeled as negative. The goal of MIL is to train a classifier that can correctly label unseen bags. Due to the ability to provide instance-level interpretation, MIL has attracted extensive attention in many classification tasks such as medical image analysis, text classification and video annotation (Carbonneau ; Wang ; Zhou ). However, so far, MIL has rarely been used in regression tasks. This is because a necessary prerequisite for obtaining instance labels through MIL is that there should be a clear mathematical relationship between instance labels and bag labels. This relationship is common in classification tasks (such as ‘or’ relationship) but rare in regression tasks. For pKa, there is a relatively clear relationship between macro-pKa and micro-pKa. For example, Figure 1 shows the formula between macro-pKa and micro-pKa of basic compounds. Here, combining multi-instance learning and graph neural networks, we designed a novel pKa prediction model named Graph-pKa. In Graph-pKa, a molecule is regarded as a ‘bag’, and those ionizable atoms in this molecule are regarded as ‘instances’. It means that the macro-pKa value of a molecule is designated as the label of a bag, which is available in the training set, and the unavailable information regarding to the micro-pKa values of ionizable sites are considered as the labels of instances. Under this scheme, Graph-pKa can follow the MIL framework to learn the labels of instances through training against the labels of bags (Fig. 2). Furthermore, it should be noted that those molecules containing multiple ionization sites may have multiple macro-pKa values. In this work, we only consider the most acidic and basic pKa values, which are key parameters that can unambiguously and concisely describe the ionization capabilities of compounds. Some chemical information websites, including ChEMBL (Gaulton ) and DrugBank (Wishart ) also describe the prediction for the pKa of compounds in terms of the most acidic and basic pKa values.
Fig. 2.

The schematic representation of the proposed Graph-pKa model

The schematic representation of the proposed Graph-pKa model

2 Materials and methods

2.1 S-pka dataset

A large pKa dataset named S-pKa was compiled, mainly from three main sources: (i) datasets used in several previous studies on pKa, (ii) a free software named QSAR Toolbox, (iii) manual extraction from various literature. Those chemical structures from different sources were standardized and then merged. The structure standardization procedure includes removing all salts from molecules, neutralizing charged molecules, and standardizing SMILES strings. In addition, considering that the accuracy of publicly available experimentally determined pKa values was often dubious (Rupp ), each data would undergo manual inspection to ensure that it belongs to the most acidic or basic pKa value of its corresponding molecule before adding to the S-pKa dataset. The detailed processes of data collection and cleaning is given in Supplementary Material and Supplementary Figure S1. The S-pKa dataset can be separated into an acidic subset and a basic subset, containing the most acidic pKa values of 9043 chemical structures and the most basic pKa values of 8436 chemical structures, respectively (Supplementary Fig. S2a). The distribution of pKa values in the acidic and basic subset is shown in Figure 3a. The most acidic pKa values varied from -3.3 to 40, while the most basic pKa values varied from -10.1 to 14. Since to learn micro-pKa via MIL is a critical concept utilized in the establishment of the Graph-pKa model, the acidic or basic ionizable sites of compounds in the S-pKa dataset are all enumerated and displayed (Fig. 3b). In this study, the acidic ionizable sites are defined as non-carbon atoms connected with at least one hydrogen atom, and the basic ionizable sites are defined as nitrogen atoms with no positive formal charge. The distribution of the molecular weight of compounds across the S-pKa dataset is also shown in Supplementary Figure S2b.
Fig. 3.

The distributions of simple compound properties in the S-pKa dataset. (a) The experimental pKa values. (b) The number of ionizable sites

The distributions of simple compound properties in the S-pKa dataset. (a) The experimental pKa values. (b) The number of ionizable sites

2.2 Graph-pKa model

The architecture of Graph-pKa is shown in Figure 2. It begins by describing each molecule as an undirected graph where nodes and edges correspond to atoms and chemical bonds, respectively. The molecular graph is then input into the graph neural layers where atoms receive the message of other atoms in the molecule and use the aggregated messages to update their own features. The graph neural layer in Graph-pKa is the same as our previously developed Attentive FP (Xiong ), a molecular representation learning scheme that uses a graph attention mechanism. Here, six graph neural layers are stacked in Graph-pKa for the extraction of atom features. The major difference between Graph-pKa and other graph neural networks lies in the approach to deal with the features of nodes extracted by graph neural network layers. In molecular graph neural networks such as GCN (Duvenaud ), MPNN (Gilmer ) and Attentive FP (Xiong ), those node features are aggregated with various pooling operations such as average pooling and Set2Set to generate the features of the whole molecule, which are next used to fit and predict the molecular properties. However, in Graph-pKa those learned node features are directly fed into a fully connected (FC) layer to predict the pKa values of atoms. Since some atoms in molecules are not ionizable, their predicted pKa values will be masked. In the acidic and basic pKa prediction model, the mask values are respectively positive infinity and negative infinity. Finally, the macro-pKa values of molecules are calculated according to the approximate mathematical relationships between them and the predicted pKa values of ionizable atoms. More specifically, given an atom A with features X, the above process can be formulated as follows: In acidic pKa prediction model: In basic pKa prediction model:    where is referred to a fully connected neural network layer, P is the acidic ionizable sites, Q is the basic ionizable sites, is the number of heavy atoms in a molecule, is the positive infinity, and are the most acidic/basic pKa values of a molecule. Obviously, formula 3 and 6 are the key formulas for MIL. Yang et al. also had used formula 6 to calculate the macro-pKa values in their study (Yang ). Here, we provided the derivation of formula 3 and 6 in Supplementary Material and Supplementary Figure S3.

2.3 Implement of Graph-pKa and other benchmark methods

In Graph-pKa, the conversion from a SMILES string to an undirected graph and initialization for it was implemented with the DGL-LifeSci package. The representations of the graph were initialized with eight kinds of atom features and four kinds of bond features (Supplementary Table S1). The Graph-pKa model was implemented using the PyTorch and DGL. The loss function used to train Graph-pKa was MSELoss. Attentive FP and four machine learning models, including SVM, RF, XGBoost and ANN were implemented as baseline models. XGBoost was implemented with the XGBoost package, SVM, RF and ANN were implemented with the Scikit-learn package. Attentive FP is a graph neural network with the same GNN layers as Graph-pKa but without MIL, which was also implemented as a control for model performance evaluation. For baseline models except Attentive FP, the molecular fingerprints used to encode the molecular structures were a kind of combined molecular fingerprint that integrated eight types of common molecular fingerprints including CDK, Estate, CDK graph only, MACCS, PubChem, Substructure, Klekota-Roth and 2D atom pairs. Those molecular fingerprints had 9121 bits in total and were calculated using PaDEL(Yap, 2011).

2.4 Model training and evaluation

In the experiment of predicting macro-pKa, the S-pKa dataset was randomly split into training/validation/test set in a 70:15:15 ratio. Graph-pKa and other models were trained on the same training set. The best set of hyperparameters for each model were determined based on the result on the validation set. The search ranges and optimal values of these hyperparameters are provided in Supplementary Table S2. The final model performance was assessed on the test set and two external tests set through three independent runs. The metrics for evaluating model performance were mean absolute error (MAE), root mean squared error (RMSE) and coefficient of determination (R2). The structural similarity between the two molecules was calculated using the 1024-bit Morgan2 fingerprints and the Tanimoto coefficient. In the experiment of predicting micro-pKa, about 500 molecules that possessed multiple different acidic/basic ionization sites and whose dominant ionization sites had been uniquely assigned by Hunt were extracted as test data. Those molecules were then removed from the S-pKa Dataset. Graph-pKa was retrained on the remaining dataset with the same set of hyperparameters previously used. The metrics for evaluating the model are consistency rate and difference values. Consistency rate is the probability that the dominant ionization sites of molecules selected by Graph-pKa are the same as that of human experts. Different value is used to quantify the degree of divergence between Graph-pKa and human experts. They are calculated as follows:    where is the most acidic/basic atom of molecule selected by human experts, is the most acidic/basic atoms of molecule predicted by Graph-pKa, is an arbitrary element in , the reason why is a collection is that some molecules have multiple dominant ionization sites with the same ionization ability, is referred to a function of Graph-pKa for atomic pKa prediction.

3 Results and discussion

3.1 Comparison with benchmark methods

In order to evaluate the performance of Graph-pKa, four conventional machine learning models were implemented and taken as benchmark methods. A kind of combined molecular fingerprints was used as the representation of molecules and the input of these machine learning models, due to its good performance on a previous study for pKa prediction (Mansouri ). The comparison between Graph-pKa and other models was carried out on the S-pKa dataset that was randomly divided into training, validation and test set. The performances of those models on the test set are shown in Figure 4. Among the four machine learning models, ANN and XGBoost performed comparatively well, which was consistent with some previous studies (Mansouri ; Yang ). However, the performances of these two models still obviously fell behind Graph-pKa, which achieved a MAE around 0.55 and a R2 around 0.92 on the test sets (Fig. 4a and b). As known, the performance of QSAR models is closely related to the similarity between predicted molecules and the molecules of the training set. To evaluate the generalization capability of different models, we also calculated the pairwise similarity of test set molecules to the training set molecules, and split the test set molecules into five individual subsets according to their maximum similarity to training set molecules (Supplementary Table S3). Then, the MAE of those models on each subset was compared. As shown in Figure 4c and d, the Graph-pKa outperformed other machine learning models on nearly all similarity subsets, which demonstrated it possesses high robustness and generalization ability. For the molecules with max similarity higher than 0.5 to the training set, the MAE of the model was lower than 0.65. If using it as the threshold for acceptable errors, 81.1% of test molecules were within the applicability domain of the models. Furthermore, the performance of Attentive FP on macro-pKa prediction was not better than that of Graph-pKa, meaning that MIL could endow Graph-pKa with the prediction ability of micro-pKa without significant trade-off on its prediction ability of macro-pKa.
Fig. 4.

The performance of the various model on macro-pKa prediction on the S-pKa dataset. (a,b) The MAE and R2 of those models on the test dataset. (c,d) The MAE of those models for acidic (c) and basic (d) pKa prediction on a series of similarity subsets. Error bars represent standard deviations

The performance of the various model on macro-pKa prediction on the S-pKa dataset. (a,b) The MAE and R2 of those models on the test dataset. (c,d) The MAE of those models for acidic (c) and basic (d) pKa prediction on a series of similarity subsets. Error bars represent standard deviations

3.2 Evaluation on external datasets

The performance of Graph-pKa was further validated by testing against two external datasets that were obtained from two blind pKa prediction challenges named SAMPL6 and SAMPL7. These two challenges were launched by the Drug Design Data Resource Community in 2018 and 2020, respectively. The SAMPL6 dataset comprises 24 kinase inhibitor-like molecules with 31 experimental pKa values, and the SAMPL7 dataset comprises 22 molecules (most are sulfonamides) with 20 experimental pKa values. There are two pKa values not belonging to the most acidic or basic pKa values in the SAMPL6 dataset and two molecules without corresponding experiment pKa values in the SAMPL7 dataset, they were excluded from this testing. The performances of Graph-pKa and some commonly used software and models on these two external datasets are shown in Table 1 and Supplementary Table S4. Graph-pKa achieved a low MAE of 0.594 and 0.758 as well as a high R2 of 0.918 and 0.839 on SAMPL6 and SAMPL7 datasets, respectively, comparable to the performance of those commercial software established based on large collections of proprietary data.
Table 1.

Performance of Graph-pKa and other models on the SAMPL6 and SAMPL7 external test sets

DatasetModel nameModel classMAERMSE R 2

SAMPL6
Epik ScanaCommercial0.7840.9620.857
Epik MicroaCommercial0.7830.9720.854
ACD/pKaaCommercial 0.550 0.7830.905
MoKaaCommercial0.8540.9700.854
ChemAxonaCommercial1.0071.2480.759
Hunt’s modelbAcademic0.6870.8640.885
Yang’s XGBbAcademic0.7671.0110.842
Yang’s NNbAcademic0.8321.1410.799
OPERAdAcademic0.9701.2830.619

Graph-pKa

Academic

0.594

0.726

0.918
SAMPL7Epik ScancCommercial1.1211.6480.508
ChemAxoncCommercial 0.559 0.708 0.909
Yang’s XGBcAcademic1.4761.6220.523
Yang’s NNcAcademic0.9321.1560.758
OPERAdAcademic2.1352.515−3.752
Graph-pKaAcademic0.7580.9340.839

The bold entries in the “MAE”, “RMSE”, and “R2” columns represent the best results in corresponding datasets.

The results are cited from a summary of the SAMPL6 challenge results. (https://github.com/samplchallenges/SAMPL6/blob/master/physical_properties/pKa/analysis/).

The results are cited from articles of Hunt and Yang .

The results of Epik predictions are from Schrödinger Suite 2017; the results of ChemAxon predictions are from ChemAxon Marvin Suite 20.15.0. The results of Yang’s XGB and Yang’s NN are from a webserver (http://pka.luoszgroup.com/prediction).

The results are from OPERA 2.7. Nine pKa values that OPERA2.7 failed to predict were excluded.

Performance of Graph-pKa and other models on the SAMPL6 and SAMPL7 external test sets The bold entries in the “MAE”, “RMSE”, and “R2” columns represent the best results in corresponding datasets. The results are cited from a summary of the SAMPL6 challenge results. (https://github.com/samplchallenges/SAMPL6/blob/master/physical_properties/pKa/analysis/). The results are cited from articles of Hunt and Yang . The results of Epik predictions are from Schrödinger Suite 2017; the results of ChemAxon predictions are from ChemAxon Marvin Suite 20.15.0. The results of Yang’s XGB and Yang’s NN are from a webserver (http://pka.luoszgroup.com/prediction). The results are from OPERA 2.7. Nine pKa values that OPERA2.7 failed to predict were excluded. Although our Graph-pKa model has achieved satisfactory prediction performance, there are potentially two limitations. Frist, Graph-pKa is only trained to predict the most acidic and basic pKa values and its capability to predict other types of pKa values such as the 2nd strongest acidic and basic pKa values has not been fully evaluated. This is mainly because of the difficulty in the collection, cleaning, and labeling of this kind of training data. Second, the tautomerism of molecules has not been taken into account in Graph-pKa, which means that the model will give different prediction results for different tautomers of the same molecule. We leave this issue to follow-up studies, such as averaging the predicted values of different tautomers.

3.3 Performance on micro-pKa prediction

Macro-pKa values can describe the ionization degree of the molecule in the solvent but can't pinpoint the ionization state of each atom in this molecule. To acquire more comprehensive knowledge about the ionization of molecules, the prediction of micro-pKa values is equally important. Thus, the performance of Graph-pKa on predicting micro-pKa was also evaluated here. Unfortunately, the experimental determination of micro-pKa values is highly complicated, and there is currently no available micro-pKa dataset. Given this situation, a Turing-like test was designed to determine if Graph-pKa exhibited the intelligent behavior (i.e. to designate the most acidic/basic atoms in a molecule structure) that was indistinguishable from that of a human expert. The results of expert judgments were obtained from a recent work of Hunt et al. for pKa prediction (Hunt ), where each pKa value in their collected dataset (Hunt’s dataset) and two external test sets (Jensen’s dataset and SAMLP6 dataset) was carefully inspected and assigned to a specific site by human experts. As shown in Figure 5a, the overall consistency rates between the most acidic/basic atoms predicted by Graph-pKa and the most acidic/basic atoms selected by the human experts were over 90%. To further quantify the degree of divergence between Graph-pKa and human experts on those controversial molecules, the difference values of the predicted pKa between the most acidic/basic atoms predicted by Graph-pKa and those selected by the human experts are shown in Figure 5b. It could be observed that the difference values of 80% these controversial molecules were within 1.2 pKa units, which indicated that the divergences between Graph-pKa and human expert mainly derived from those molecules whose several atoms had similar ionization capability.
Fig. 5.

Application of Graph-pKa to predict the dominant ionization sites of molecules. (a) The consistency rates between the prediction of Graph-pKa and the judgment of human experts. (b) The distribution of difference values representing the degree of divergence between Graph-pKa and human experts on controversial molecules. (c,d) Some examples of molecules on which the predictions of Graph-pKa and human experts are consistent (c) and different (d), the arrows and circles denote to the dominant ionization sites selected by Graph-pKa and human experts, respectively, red and blue numbers, respectively, denote to the predicted acidic and basic pKa values of atoms by Graph-pKa. (e) Some molecules and their pKa values for reference

Application of Graph-pKa to predict the dominant ionization sites of molecules. (a) The consistency rates between the prediction of Graph-pKa and the judgment of human experts. (b) The distribution of difference values representing the degree of divergence between Graph-pKa and human experts on controversial molecules. (c,d) Some examples of molecules on which the predictions of Graph-pKa and human experts are consistent (c) and different (d), the arrows and circles denote to the dominant ionization sites selected by Graph-pKa and human experts, respectively, red and blue numbers, respectively, denote to the predicted acidic and basic pKa values of atoms by Graph-pKa. (e) Some molecules and their pKa values for reference Some examples of agreement and disagreement between Graph-pKa and human experts are respectively shown in Figure 5c and d. In the assignment of the most acidic atoms, two controversial molecules of note were A1 and A2, and most of the others were hydroxamic acid derivatives. Hunt et al. attributed the acidities of hydroxamic acid derivatives all to their hydroxyls. In fact, the dissociation ability of hydroxylic hydrogen and amino hydrogen in hydroxamic acids was quite similar (Bartmess, 2010) (also see R1, R2 in Fig. 5e, http://ibond.nankai.edu.cn), and the prediction results of Graph-pKa supported their equivalent protonation potential. In the assignment of the most basic atoms, the two most controversial molecules were B1 and B2. Our prediction for B1 was supported by a record from PubChem that the pKa value of the amine in B1 was 7.75 (https://pubchem.ncbi.nlm.nih.gov/compound/135398737). In addition, the basicity of the 1,3,4-Oxadiazol ring in B2 should be very weak, given that the pKa of 1,3,4-thiadiazole was only -4.9 (R3 in Fig. 5e, https://www.scripps.edu/baran/heterocycles/Essentials1-2009.pdf). According to Graph-pKa prediction, the basicity of B2 was attributed to the pyridine ring, instead of the 1,3,4-Oxadiazol ring. This assignment was further confirmed by quantum chemical calculation. As shown in Supplementary Figure S4, the protonation energies of nitrogen atom in the pyridine ring were -5.25 kcal/mol, significantly lower than that of nitrogen atoms in the 1,3,4-Oxadiazol ring (4.74 and 5.39 kcal/mol). The methods of quantum chemistry calculation are described in Supplementary Material. Besides, two molecules (B3, B4) in SAMPL6 datasets (Işık ), whose dominant ionization sites have been determined by nuclear magnetic resonance, are also shown in Figure 5d. The predicted results of Graph-pKa were consistent with the experimental results. The above results demonstrated that Graph-pKa performed outstandingly in the prediction of micro-pKa. It is impressive that in many cases the capability of Graph-pKa to locate the most acidic/basic sites of molecules is equivalent to or better than that of human experts, while all the chemical insight has been learned without explicit supervision in multi-instance learning. It can be expected that when there are more available training data in the future, the capability will be further improved.

3.4 Visualization of the atomic embeddings

In order to visualize the features of the atoms learned by the Graph-pKa model, the embeddings in the last hidden layer of several types of acidic ionization sites in the training data were extracted and submitted to principal component analysis. As shown in Figure 6, after training, the atomic embeddings from phenol hydroxyl, carboxyl, and sulfonamide groups were respectively gathered together. However, the distributions of atomic embeddings from alcoholic hydroxyl and amide groups were still relatively dispersed. These patterns suggest that alcoholic hydroxyl or amide groups in different chemical environments exhibit relatively larger variances, posing challenges for accurate micro-pKa prediction. We speculated that a possible reason was that, although alcoholic hydroxyl and amide groups widely existed in the training set, they have less contribution to the macro-pKa of the whole molecule due to their weak acidity. Therefore, they had lower weights and were less supervised during model training. Three molecules and their atomic embeddings visually display such a situation. After training, the atomic embeddings from carboxyl groups of the three similar molecules are close, whereas the atomic embeddings from amide groups of the three molecules are dispersed. Apparently, adding more samples whose dominant ionization groups are alcoholic hydroxyl groups or amide groups into training data may alleviate this problem.
Fig. 6.

Visualizing the atomic embeddings in last hidden layer using principal component analysis

Visualizing the atomic embeddings in last hidden layer using principal component analysis

3.5 Web server for the prediction of pKa

For the convenience of the community, a free web server wrapping the Graph-pKa model has been developed (https://pka.simm.ac.cn/). This web server was built using the python language and could be simultaneously accessed by multiple users. The web server can take multiple types of inputs including drawing a molecule from the molecular editor or uploading a txt/mol/sdf file. There are two main functions in this web server: pKa prediction and similarity search (Supplementary Fig. S5). In the pKa prediction module, the most acidic/basic pKa values and their corresponding micro-pKa values of the input molecule are predicted. The Monte Carlo dropout is used to evaluate the uncertainty of the prediction results and calculate the 95% confidence interval of the predicted value (Gal and Ghahramani, 2016). It is noteworthy that, due to our definition of possible ionization sites and the processing of input molecules, the web server does not support the pKa prediction for C-H bonds and ionized molecules. In the similarity search module, the most acidic/basic atoms of the molecules from the S-pKa dataset and the most acidic/basic atoms of the molecule input by the user are first predicted by Graph-pKa. Then, the embeddings of those predicted most acidic/basic atoms in the last hidden layer are extracted. Finally, the Euclidean distances between the atomic embeddings of the input molecule and that of the molecules in the S-pKa dataset are calculated. If the Euclidean distance is close enough (the threshold is set as less than 0.05), molecules are considered to be similar, and for each input molecule, up to four similar molecules and their experimentally determined pKa values will be output for reference.

4 Conclusions

In this work, we have developed a novel in silico pKa prediction model named Graph-pKa. Combining multi-instance learning into graph neural network, Graph-pKa not only outperforms those conventional machine learning models based on molecular fingerprints in predicting macro-pKa, but more significantly, can learn the micro-pKa values of atoms through training against the macro-pKa values of molecules. A Turing-like test demonstrated that it gained chemical insights to locate the most acidic/basic sites of molecules, which compared favorably with that of human experts. Such micro-pKa inference ability greatly enhances the interpretability and practicability of this model. Furthermore, in Graph-pKa, the fitting and prediction of macro-pKa are all dependent on the reasoning of micro-pKa, which can also avoid shortcut learning to some extent (Geirhos ). In the end, a Web application based on Graph-pKa model has been made freely available at https://pka.simm.ac.cn.

Funding

This work was supported by the Project supported by Shanghai Municipal Science and Technology Major Project, National Natural Science Foundation of China [81773634] and Tencent AI Lab Rhino-Bird Focused Research Program [JR202002]. Conflict of Interest: none declared. Click here for additional data file.
  16 in total

1.  PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints.

Authors:  Chun Wei Yap
Journal:  J Comput Chem       Date:  2010-12-17       Impact factor: 3.376

2.  Graph convolutional networks for computational drug development and discovery.

Authors:  Mengying Sun; Sendong Zhao; Coryandar Gilvary; Olivier Elemento; Jiayu Zhou; Fei Wang
Journal:  Brief Bioinform       Date:  2020-05-21       Impact factor: 11.622

Review 3.  Acidic and basic drugs in medicinal chemistry: a perspective.

Authors:  Paul S Charifson; W Patrick Walters
Journal:  J Med Chem       Date:  2014-09-18       Impact factor: 7.446

4.  Rapid and Accurate Prediction of pKa Values of C-H Acids Using Graph Convolutional Neural Networks.

Authors:  Rafał Roszak; Wiktor Beker; Karol Molga; Bartosz A Grzybowski
Journal:  J Am Chem Soc       Date:  2019-10-21       Impact factor: 15.419

5.  Predicting pKa Using a Combination of Semi-Empirical Quantum Mechanics and Radial Basis Function Methods.

Authors:  Peter Hunt; Layla Hosseini-Gerami; Tomas Chrien; Jeffrey Plante; David J Ponting; Matthew Segall
Journal:  J Chem Inf Model       Date:  2020-05-18       Impact factor: 4.956

6.  RMDL: Recalibrated multi-instance deep learning for whole slide gastric image classification.

Authors:  Shujun Wang; Yaxi Zhu; Lequan Yu; Hao Chen; Huangjing Lin; Xiangbo Wan; Xinjuan Fan; Pheng-Ann Heng
Journal:  Med Image Anal       Date:  2019-08-30       Impact factor: 8.545

7.  Holistic Prediction of the pKa in Diverse Solvents Based on a Machine-Learning Approach.

Authors:  Qi Yang; Yao Li; Jin-Dong Yang; Yidi Liu; Long Zhang; Sanzhong Luo; Jin-Pei Cheng
Journal:  Angew Chem Int Ed Engl       Date:  2020-08-25       Impact factor: 15.336

8.  The ChEMBL database in 2017.

Authors:  Anna Gaulton; Anne Hersey; Michał Nowotka; A Patrícia Bento; Jon Chambers; David Mendez; Prudence Mutowo; Francis Atkinson; Louisa J Bellis; Elena Cibrián-Uhalte; Mark Davies; Nathan Dedman; Anneli Karlsson; María Paula Magariños; John P Overington; George Papadatos; Ines Smit; Andrew R Leach
Journal:  Nucleic Acids Res       Date:  2016-11-28       Impact factor: 16.971

9.  FraGAT: a fragment-oriented multi-scale graph attention model for molecular property prediction.

Authors:  Ziqiao Zhang; Jihong Guan; Shuigeng Zhou
Journal:  Bioinformatics       Date:  2021-03-26       Impact factor: 6.937

10.  The pK(a) Distribution of Drugs: Application to Drug Discovery.

Authors:  David T Manallack
Journal:  Perspect Medicin Chem       Date:  2007-09-17
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.