Marcel Baltruschat, Paul Czodrowski.
Abstract
We present a small-molecule pKa prediction tool written entirely in Python. It predicts the macroscopic pKa value and is trained on a literature compilation of monoprotic compounds. Several machine learning models were tested, and random forest performed best in a five-fold cross-validation (mean absolute error = 0.682, root mean squared error = 1.032, correlation coefficient r² = 0.82). We test our model on two external validation sets, where it performs comparably to Marvin and better than a recently published open-source model. Our Python tool and all data are freely available at https://github.com/czodrowskilab/Machine-learning-meets-pKa.
Keywords: dissociation; machine learning; pKa value; protonation
Year: 2020 PMID: 32226607 PMCID: PMC7096188 DOI: 10.12688/f1000research.22090.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
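The five-fold cross-validation mentioned in the abstract can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' pipeline: the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `predict_mean`), the toy data, and the mean-predicting stand-in model are all assumptions made for this example. The default seed of 24 simply echoes the value shown in the table header; how the authors used it is not stated here.

```python
import random
import statistics


def k_fold_indices(n_samples, n_folds=5, seed=24):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(X, y, fit, predict, n_folds=5, seed=24):
    """Return the MAE obtained on each held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(y), n_folds, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(y)) if i not in held_out_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        y_pred = predict(model, [X[i] for i in held_out])
        scores.append(mean_absolute_error([y[i] for i in held_out], y_pred))
    return scores


# Stand-in "model" (assumption): always predicts the mean pKa of the training fold.
def fit_mean(X_train, y_train):
    return statistics.mean(y_train)


def predict_mean(model, X_test):
    return [model] * len(X_test)


# Toy data, invented for illustration only.
X = [[float(i)] for i in range(20)]
y = [2.0 + 0.5 * i for i in range(20)]

fold_mae = cross_validate(X, y, fit_mean, predict_mean)
print(f"MAE = {statistics.mean(fold_mae):.3f} +/- {statistics.stdev(fold_mae):.3f}")
```

In a real run, `fit` and `predict` would wrap a random forest regressor (or any of the other tested models), and RMSE and R² would be collected per fold in the same way.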
Figure 1. Distribution of the individual pKa values.
Figure 2. (A) Pairwise comparison between the training set and the Novartis test set (fingerprint: 4096-bit MorganFeatures, radius=3). (B) Pairwise comparison between the training set and the test set compiled by manual curation (fingerprint: 4096-bit MorganFeatures, radius=3).
Figure 3. Correlation of Novartis compounds measured in potentiometric and high-throughput (capillary electrophoresis) set-up.
Figure 4. Intersection between ChEMBL and DataWarrior data sets.
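The pairwise comparison in Figure 2 is based on 4096-bit MorganFeatures fingerprints. The caption does not name the similarity measure, so the sketch below assumes the Tanimoto coefficient, a common choice for bit fingerprints. Fingerprints are represented as plain sets of on-bit positions; the example bit sets and the function names are hypothetical, and in practice the fingerprints would be generated with a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit positions."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def nearest_training_similarity(test_fps, train_fps):
    """For each test fingerprint, the highest similarity to any training fingerprint."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


# Hypothetical on-bit positions within a 4096-bit fingerprint.
train_fps = [{1, 17, 256, 4000}, {1, 17, 300}, {5, 6, 7}]
test_fps = [{1, 17, 256}, {8, 9}]

print(nearest_training_similarity(test_fps, train_fps))  # [0.75, 0.0]
```

A test compound whose nearest-neighbour similarity is low, like the second one here, lies far from the training set, which is one way to read why an external set can be harder to predict.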
Statistics of the five-fold cross-validation of the machine learning models.
The two best and worst performing models are highlighted in green and red. For those neural networks where the values were specified as "not available" (“#NA”), the weights could not be optimized properly due to the large value range of the RDKit descriptors, so training failed here.
| Model (seed=24) | Train Configuration | MAE | MAE | RMSE | RMSE | R² | R² |
|---|---|---|---|---|---|---|---|
| | | 0.718 | 0.022 | 1.077 | 0.021 | 0.804 | 0.01 |
| | | 0.708 | 0.021 | 1.094 | 0.029 | 0.797 | 0.008 |
| | | 0.683 | 0.017 | 1.032 | 0.013 | 0.82 | 0.005 |
| | | 0.717 | 0.022 | 1.076 | 0.022 | 0.804 | 0.011 |
| | | 0.708 | 0.021 | 1.094 | 0.029 | 0.797 | 0.008 |
| | | 0.682 | 0.017 | 1.032 | 0.013 | 0.82 | 0.005 |
| | | | | | | | |
| | | 2.1 | 0.037 | 2.436 | 0.035 | -0.004 | 0.004 |
| | | 0.851 | 0.025 | 1.24 | 0.035 | 0.74 | 0.012 |
| | | 2.1 | 0.037 | 2.436 | 0.035 | -0.004 | 0.004 |
| | | 0.876 | 0.033 | 1.282 | 0.047 | 0.722 | 0.015 |
| | | 1.09 | 0.034 | 1.466 | 0.041 | 0.637 | 0.014 |
| | | 1.02 | 0.037 | 1.4 | 0.047 | 0.668 | 0.016 |
| | | | | | | | |
| | | 2.016 | 0.042 | 2.362 | 0.039 | 0.056 | 0.009 |
| | | 1.612 | 0.031 | 1.926 | 0.033 | 0.373 | 0.007 |
| | | 1.642 | 0.061 | 2.052 | 0.06 | 0.288 | 0.027 |
| | | 0.882 | 0.035 | 1.288 | 0.048 | 0.719 | 0.016 |
| | | 1.09 | 0.034 | 1.465 | 0.041 | 0.637 | 0.014 |
| | | 1.019 | 0.037 | 1.4 | 0.047 | 0.669 | 0.016 |
| | | | | | | | |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 0.866 | 0.025 | 1.27 | 0.047 | 0.727 | 0.019 |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 0.726 | 0.018 | 1.102 | 0.05 | 0.794 | 0.022 |
| | | 1.037 | 0.045 | 1.457 | 0.057 | 0.64 | 0.024 |
| | | 0.968 | 0.032 | 1.383 | 0.04 | 0.677 | 0.014 |
| | | | | | | | |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 0.894 | 0.024 | 1.297 | 0.04 | 0.715 | 0.016 |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 0.768 | 0.034 | 1.161 | 0.09 | 0.77 | 0.038 |
| | | 1.031 | 0.037 | 1.447 | 0.057 | 0.645 | 0.026 |
| | | 0.984 | 0.029 | 1.404 | 0.035 | 0.666 | 0.017 |
| | | | | | | | |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 0.869 | 0.023 | 1.265 | 0.039 | 0.729 | 0.016 |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 0.775 | 0.008 | 1.158 | 0.033 | 0.773 | 0.013 |
| | | 1.026 | 0.038 | 1.455 | 0.053 | 0.642 | 0.022 |
| | | 0.973 | 0.035 | 1.388 | 0.053 | 0.674 | 0.023 |
| | | | | | | | |
| | | 1.02 | 0.014 | 1.353 | 0.021 | 0.691 | 0.007 |
| | | 1.094 | 0.027 | 1.423 | 0.036 | 0.657 | 0.011 |
| | | 1.018 | 0.01 | 1.346 | 0.022 | 0.694 | 0.005 |
| | | 1.02 | 0.014 | 1.353 | 0.021 | 0.691 | 0.007 |
| | | 1.094 | 0.027 | 1.423 | 0.036 | 0.657 | 0.011 |
| | | 1.018 | 0.01 | 1.346 | 0.022 | 0.694 | 0.005 |
MAE, mean absolute error; RMSE, root mean square error.
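For reference, the three reported metrics can be written out as below. The symbols are notation assumed for this note rather than taken from the source: $y_i$ is the measured pKa of compound $i$, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the measured values, and $n$ the number of compounds. R² is taken here to be the coefficient of determination, which is consistent with the slightly negative R² entries in the table (a squared correlation coefficient could not be negative).

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

Two properties help when reading the tables: RMSE is never smaller than MAE on the same predictions, and R² drops below zero whenever a model does worse than always predicting $\bar{y}$.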
Predictive performance of the machine learning models on the two external test sets.
The two best and worst performing models are highlighted in green and red. For those neural networks where the values were specified as "not available" (“#NA”), the weights could not be optimized properly due to the large value range of the RDKit descriptors, so training failed here.
| Model (seed=24) | Train Configuration | Novartis MAE | Novartis RMSE | Novartis R² | AvLiLuMoVe MAE | AvLiLuMoVe RMSE | AvLiLuMoVe R² |
|---|---|---|---|---|---|---|---|
| | | 1.259 | 1.607 | 0.513 | 0.689 | 0.979 | 0.828 |
| | | 1.147 | 1.513 | 0.569 | 0.532 | 0.785 | 0.889 |
| | | 1.2 | 1.532 | 0.558 | 0.628 | 0.884 | 0.86 |
| | | 1.259 | 1.607 | 0.513 | 0.688 | 0.979 | 0.828 |
| | | 1.147 | 1.513 | 0.569 | 0.532 | 0.785 | 0.889 |
| | | 1.198 | 1.531 | 0.558 | 0.628 | 0.884 | 0.86 |
| | | | | | | | |
| | | 2.177 | 2.451 | -0.132 | 2.18 | 2.441 | -0.07 |
| | | 1.423 | 1.732 | 0.435 | 0.688 | 0.981 | 0.827 |
| | | 2.177 | 2.451 | -0.132 | 2.18 | 2.441 | -0.07 |
| | | 1.382 | 1.735 | 0.433 | 0.772 | 1.058 | 0.799 |
| | | 1.771 | 2.035 | 0.219 | 1.115 | 1.422 | 0.637 |
| | | 1.746 | 2.015 | 0.235 | 1.044 | 1.345 | 0.675 |
| | | | | | | | |
| | | 2.162 | 2.428 | -0.111 | 1.921 | 2.242 | 0.097 |
| | | 1.686 | 1.932 | 0.297 | 1.429 | 1.67 | 0.499 |
| | | 2.161 | 2.442 | -0.124 | 1.611 | 2.004 | 0.279 |
| | | 1.378 | 1.732 | 0.435 | 0.766 | 1.049 | 0.802 |
| | | 1.77 | 2.034 | 0.22 | 1.114 | 1.421 | 0.637 |
| | | 1.744 | 2.013 | 0.236 | 1.043 | 1.343 | 0.676 |
| | | | | | | | |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 1.414 | 1.773 | 0.407 | 0.852 | 1.169 | 0.755 |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 1.318 | 1.634 | 0.497 | 0.688 | 0.942 | 0.841 |
| | | 1.627 | 2.033 | 0.221 | 1.102 | 1.569 | 0.558 |
| | | 1.542 | 1.941 | 0.29 | 1.001 | 1.427 | 0.634 |
| | | | | | | | |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 1.404 | 1.772 | 0.408 | 0.846 | 1.154 | 0.761 |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 1.298 | 1.626 | 0.502 | 0.701 | 0.936 | 0.843 |
| | | 1.611 | 2.028 | 0.225 | 1.141 | 1.575 | 0.554 |
| | | 1.605 | 1.998 | 0.248 | 0.987 | 1.365 | 0.665 |
| | | | | | | | |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 1.363 | 1.717 | 0.445 | 0.86 | 1.164 | 0.757 |
| | | #NA | #NA | #NA | #NA | #NA | #NA |
| | | 1.354 | 1.705 | 0.452 | 0.777 | 1.057 | 0.799 |
| | | 1.584 | 1.989 | 0.254 | 1.053 | 1.468 | 0.613 |
| | | 1.581 | 1.963 | 0.274 | 0.953 | 1.352 | 0.672 |
| | | | | | | | |
| | | 1.367 | 1.704 | 0.453 | 0.819 | 1.04 | 0.806 |
| | | 1.28 | 1.624 | 0.503 | 0.782 | 0.992 | 0.823 |
| | | 1.293 | 1.637 | 0.495 | 0.774 | 0.995 | 0.822 |
| | | 1.367 | 1.704 | 0.453 | 0.819 | 1.04 | 0.806 |
| | | 1.28 | 1.624 | 0.503 | 0.782 | 0.992 | 0.823 |
| | | 1.293 | 1.637 | 0.495 | 0.774 | 0.995 | 0.822 |
| | | | | | | | |
| | | 0.856 | 1.166 | 0.744 | 0.566 | 0.865 | 0.866 |
| | | 2.274 | 3.059 | -0.754 | 1.737 | 2.182 | 0.124 |
*For OPERA, 6 molecules from AvLiLuMoVe and 31 molecules from Novartis were left out because OPERA predicted either zero or two pKa values.
MAE, mean absolute error; RMSE, root mean square error.
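The exclusion rule described in the OPERA footnote (molecules with zero or two predicted pKa values are left out) amounts to keeping only molecules with exactly one prediction. A minimal sketch of such a filter is shown below; the function name, the dictionary layout, and the molecule identifiers are hypothetical and do not reflect OPERA's actual output format.

```python
def keep_single_pka(predictions):
    """Keep molecules for which the predictor returned exactly one pKa value."""
    kept = {mol: values[0] for mol, values in predictions.items() if len(values) == 1}
    dropped = sorted(mol for mol in predictions if mol not in kept)
    return kept, dropped


# Hypothetical predictor output: molecule identifier -> list of predicted pKa values.
predictions = {
    "mol_a": [4.2],
    "mol_b": [],          # no pKa value predicted
    "mol_c": [3.1, 9.8],  # two pKa values predicted
    "mol_d": [7.5],
}

kept, dropped = keep_single_pka(predictions)
print(kept)     # {'mol_a': 4.2, 'mol_d': 7.5}
print(dropped)  # ['mol_b', 'mol_c']
```

Because the error metrics are then computed on the retained molecules only, the OPERA row is evaluated on slightly smaller sets than the other rows.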