Nicolas Tielker, Lukas Eberlein, Gerhard Hessler, K Friedemann Schmidt, Stefan Güssregen, Stefan M Kast.
Abstract
Joint academic-industrial projects supporting drug discovery are frequently pursued to deploy and benchmark cutting-edge methodological developments from academia in a real-world industrial environment at different scales. The tasks range from small-molecule physicochemical property assessment and protein-ligand interaction to statistical analyses of biological data. Both method development and usability benefit from the insights gained at each end: the predictiveness and readiness of novel approaches are confirmed, while pharmaceutical drug makers gain early access to novel tools that improve the quality of drug products and benefit patients. Quantum-mechanical and simulation methods fall squarely into this group, as they require skill and expense to develop and significant resources to apply, and have therefore entered industrial use comparatively slowly. Nevertheless, these physics-based methods are becoming more and more useful. Starting with a general overview of these methods, and of quantum-mechanical methods for drug discovery in particular, we review a decade-long and ongoing collaboration between Sanofi and the Kast group focused on the application of the embedded cluster reference interaction site model (EC-RISM), a solvation model for quantum chemistry, to study small-molecule chemistry in the context of joint participation in several SAMPL (Statistical Assessment of Modeling of Proteins and Ligands) blind prediction challenges. Starting with an early application to tautomer equilibria in water (SAMPL2), the methodology was developed further over the years to allow for challenge contributions on distribution coefficients (SAMPL5) and acidity constants (SAMPL6). Particular emphasis is put on a frequently overlooked aspect of measuring model quality, namely the retrospective analysis of earlier datasets and predictions in light of more recent and advanced developments.
We therefore demonstrate the performance of the current methodological state of the art, as developed and optimized for the SAMPL6 pKa and octanol-water log P challenges, when re-applied to the earlier SAMPL5 cyclohexane-water log D and SAMPL2 tautomer equilibria datasets. Systematic improvement is not found consistently, despite the similarity of the problem class, i.e. protonation reactions and phase distribution. Hence, it is possible to learn about hidden bias in model assessment, as results derived from more elaborate methods do not necessarily improve quantitative agreement. This points to the role of chance or coincidence in model development on the one hand, which allows systematic errors and opportunities for improvement to be identified, and reveals possible sources of experimental uncertainty on the other. These insights are particularly useful for further academia-industry collaborations, as both partners are then enabled to optimize both the computational and experimental settings for data generation.
Keywords: Drug discovery; EC-RISM; Integral equation theory; Quantum chemistry; SAMPL; Solvation model
Year: 2020 PMID: 33079358 PMCID: PMC8018924 DOI: 10.1007/s10822-020-00347-5
Source DB: PubMed Journal: J Comput Aided Mol Des ISSN: 0920-654X Impact factor: 3.686
Regression parameters of optimized 3D/EC-RISM/PSE-2-based Gibbs energy of solvation models (c, c/kcal mol−1 Å−3, c/kcal mol−1 e−1, c/kcal mol−1) along with statistical metrics (root-mean-square error RMSE/kcal mol−1, mean absolute error MAE/kcal mol−1, mean signed error MSE/kcal mol−1, slope m′, intercept b′/kcal mol−1, and coefficient of determination R2 from descriptive regression). Water model data correspond to the “MP2/6-311+G(d,p)/φopt” approach in [69]
| Solvent | RMSE | MAE | MSE | m′ | b′ | R² | c | c/kcal mol−1 Å−3 | c/kcal mol−1 e−1 | c/kcal mol−1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Water | ||||||||||
| All | 2.04 | 1.43 | − 0.26 | 1.00 | − 0.35 | 1.00 | – | – | ||
| Neutrals | 1.56 | 1.13 | − 0.36 | 0.97 | − 0.47 | 0.89 | – | – | – | – |
| Anions | 3.07 | 2.46 | 0.01 | 1.10 | 7.18 | 0.94 | – | – | – | – |
| Cations | 2.98 | 2.10 | 0.02 | 0.96 | − 2.62 | 0.85 | – | – | – | – |
| Cyclohexane | ||||||||||
| Uncorrected | 5.86 | 5.60 | 5.60 | 0.13 | 1.53 | 0.05 | – | – | – | – |
| 1-par | 1.07 | 0.86 | 0.20 | 0.73 | − 1.04 | 0.62 | – | − 0.14923 | – | – |
| 2-par | 0.77 | 0.58 | 0.11 | 0.99 | 0.06 | 0.83 | 2.0184 | − 0.17795 | – | – |
| 2-par-I | 0.90 | 0.73 | 0.00 | 0.57 | − 2.00 | 0.76 | – | − 0.10894 | – | − 1.6593 |
| 2-par-I(5) | 0.88 | 0.70 | 0.00 | 0.59 | − 1.94 | 0.77 | – | − 0.10811 | – | − 1.6566 |
| 3-par | 0.68 | 0.50 | 0.00 | 0.84 | − 0.75 | 0.83 | 1.8516 | − 0.14692 | – | − 1.0842 |
| 3-par(5) | 0.76 | 0.56 | 0.00 | 0.84 | − 0.73 | 0.84 | 1.8444 | − 0.14703 | – | − 1.0479 |
For consistency with the SAMPL6 part II representations, the coefficient c given in kcal mol−1 Å−3 corresponds to partial molar volumes (PMVs) computed via the total correlation function route [84, 85] using an experimental isothermal compressibility of 1.1197 × 10−9 Pa−1 for cyclohexane [86] and the RISM estimate of 0.717062 × 10−9 Pa−1 for water. “(5)” after the solvent model code indicates SAMPL5 models from [60]. Optimized solution and gas phase structures are provided as Online Resource 1; calculated data, also split into separate components, are provided as Online Resource 2.
Fig. 1 Gibbs energies of solvation in cyclohexane from optimized 3D RISM calculations vs. the experimental results from the MNSOL database. Uncorrected data are shown by red squares in both panels. Panel A: 1-par (dark blue) and 2-par (green). Panel B: 3-par (green) and 2-par-I (dark blue)
Statistical metrics (root-mean-square error RMSE, mean absolute error MAE, mean signed error MSE, and slope m′, intercept b′, and coefficient of determination R2 from descriptive regression) for all compounds from SAMPL6-type models for water [MP2/6-311+G(d,p)/PSE-2] and cyclohexane (PSE-2) and the original SAMPL5 setup
| Setup | Observable | Cyclohexane model | Batches | RMSE | MSE | MAE | R² | m′ | b′ |
|---|---|---|---|---|---|---|---|---|---|
| SAMPL6 | log P | 1-par | 0 + 1 + 2 | 3.40 | 1.25 | 2.59 | 0.52 | 1.76 | 1.60 |
| | | 2-par | 0 + 1 + 2 | 4.36 | 3.38 | 3.67 | 0.56 | 1.65 | 3.69 |
| | | 2-par-I | 0 + 1 + 2 | 2.33 | − 0.01 | 1.76 | 0.54 | 1.4 | 0.18 |
| | | 3-par | 0 + 1 + 2 | 3.18 | 2.21 | 2.68 | 0.57 | 1.45 | 2.42 |
| | log D7.4 | 1-par | 0 + 1 + 2 | 3.23 | 0.59 | 2.50 | 0.63 | 2.02 | 1.07 |
| | | 2-par | 0 + 1 + 2 | 3.97 | 2.72 | 3.40 | 0.65 | 1.92 | 3.15 |
| | | 2-par-I | 0 + 1 + 2 | 2.46 | − 0.67 | 1.71 | 0.67 | 1.69 | − 0.35 |
| | | 3-par | 0 + 1 + 2 | 2.88 | 1.55 | 2.44 | 0.66 | 1.72 | 1.89 |
| SAMPL5 | log P | 2-par-I(5) | 0 + 1 + 2 | 2.33 | 0.55 | 1.79 | 0.55 | 1.39 | 0.73 |
| | | 3-par(5) | 0 + 1 + 2 | 3.63 | 2.85 | 3.10 | 0.57 | 1.43 | 3.04 |
| | log D7.4 | 2-par-I(5) | 0 + 1 + 2 | 2.32 | − 0.37 | 1.76 | 0.68 | 1.69 | − 0.05 |
| | | 3-par(5) | 0 + 1 + 2 | 3.11 | 1.92 | 2.74 | 0.66 | 1.73 | 2.27 |
Optimized solution structures are provided as Online Resource 3; calculated data, also split into separate components, as Online Resource 4
Statistical metrics (root-mean-square error RMSE, mean absolute error MAE, mean signed error MSE, and slope m′, intercept b′, and coefficient of determination R2 from descriptive regression) separated by batches using the SAMPL6-type models for water and cyclohexane and the original SAMPL5 setup excluding SAMPL5_083
| Setup | Observable | Cyclohexane model | Batches | RMSE | MSE | MAE | R² | m′ | b′ |
|---|---|---|---|---|---|---|---|---|---|
| SAMPL6 | log P | 1-par | 0 + 1 | 2.29 | 0.13 | 1.77 | 0.63 | 1.56 | 0.43 |
| | | | 2 | 4.74 | 3.18 | 4.01 | 0.54 | 2.04 | 3.56 |
| | | 2-par | 0 + 1 | 3.18 | 2.37 | 2.59 | 0.66 | 1.52 | 2.64 |
| | | | 2 | 5.87 | 5.14 | 5.53 | 0.59 | 1.82 | 5.44 |
| | | 2-par-I | 0 + 1 | 1.99 | − 0.65 | 1.57 | 0.62 | 1.31 | − 0.49 |
| | | | 2 | 2.83 | 1.10 | 2.09 | 0.53 | 1.58 | 1.31 |
| | | 3-par | 0 + 1 | 2.44 | 1.49 | 1.97 | 0.63 | 1.36 | 1.68 |
| | | | 2 | 4.15 | 3.47 | 3.93 | 0.60 | 1.56 | 3.67 |
| | log D7.4 | 1-par | 0 + 1 | 2.45 | − 0.59 | 1.88 | 0.77 | 1.89 | − 0.12 |
| | | | 2 | 4.26 | 2.62 | 3.58 | 0.63 | 2.18 | 3.05 |
| | | 2-par | 0 + 1 | 2.88 | 1.65 | 2.49 | 0.74 | 1.85 | 2.09 |
| | | | 2 | 5.36 | 4.59 | 4.98 | 0.66 | 1.95 | 4.94 |
| | | 2-par-I | 0 + 1 | 2.44 | − 1.37 | 1.73 | 0.74 | 1.64 | − 1.04 |
| | | | 2 | 2.48 | 0.55 | 1.66 | 0.64 | 1.71 | 0.80 |
| | | 3-par | 0 + 1 | 2.33 | 0.77 | 1.91 | 0.72 | 1.69 | 1.13 |
| | | | 2 | 3.65 | 2.92 | 3.38 | 0.68 | 1.69 | 3.17 |
| SAMPL5 | log P | 2-par-I(5) | 0 + 1 | 1.99 | − 0.09 | 1.48 | 0.61 | 1.35 | 0.09 |
| | | | 2 | 2.83 | 1.67 | 2.32 | 0.52 | 1.39 | 1.81 |
| | | 3-par(5) | 0 + 1 | 2.86 | 2.08 | 2.41 | 0.65 | 1.41 | 2.30 |
| | | | 2 | 3.86 | 2.98 | 3.54 | 0.67 | 1.81 | 3.27 |
| | log D7.4 | 2-par-I(5) | 0 + 1 (a) | 2.25 | − 0.86 | 1.63 | 0.71 | 1.60 | − 0.54 |
| | | | 2 | 2.44 | 0.48 | 1.99 | 0.69 | 1.81 | 0.77 |
| | | 3-par(5) | 0 + 1 (b) | 2.59 | 1.31 | 2.29 | 0.70 | 1.66 | 1.67 |
| | | | 2 | 4.68 | 4.17 | 4.29 | 0.56 | 1.40 | 4.32 |
(a, b) Corrected results for the SAMPL5 setup; original values [60] for RMSE, MSE, R2, m′, b′:
(a) 2.15, − 0.53, 0.59, 1.36, − 0.34
(b) 2.76, 1.64, 0.59, 1.42, 1.87
Experimental distribution coefficients and calculated partition (log P) and distribution (log D) coefficients for all models of the SAMPL5 challenge, for SAMPL5 [60] and SAMPL6-type setups
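| SAMPL5 ID | log D7.4 (exp) | log P 2-par-I(5) | log P 3-par(5) | log P 1-par | log P 2-par | log P 2-par-I | log P 3-par | log D7.4 2-par-I(5) | log D7.4 3-par(5) | log D7.4 1-par | log D7.4 2-par | log D7.4 2-par-I | log D7.4 3-par |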
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Batch 0 | |||||||||||||
| 003 | 1.90 | 1.17 | 3.19 | 2.09 | 4.22 | 1.54 | 3.51 | 1.17 | 3.19 | 2.09 | 4.22 | 1.54 | 3.51 |
| 015 | − 2.20 | − 5.28 | − 2.87 | − 4.76 | − 1.92 | − 4.79 | − 2.41 | − 8.08 | − 5.67 | − 7.07 | − 4.23 | − 7.10 | − 4.72 |
| 017 | 2.50 | 3.39 | 6.39 | 3.20 | 6.14 | 1.81 | 4.75 | 3.39 | 6.39 | 3.20 | 6.14 | 1.81 | 4.75 |
| 020 | 1.60 | 1.98 | 3.83 | 3.83 | 5.12 | 2.28 | 3.91 | 1.98 | 3.83 | 3.83 | 5.12 | 2.28 | 3.91 |
| 037 | − 1.50 | − 3.79 | − 2.31 | − 3.91 | − 2.29 | − 4.27 | − 2.79 | − 3.95 | − 2.47 | − 4.92 | − 3.30 | − 5.27 | − 3.80 |
| 045 | − 2.10 | − 2.42 | − 0.64 | − 2.26 | − 0.22 | − 2.43 | − 0.67 | − 2.42 | − 0.64 | − 2.26 | − 0.22 | − 2.43 | − 0.67 |
| 055 | − 1.50 | − 3.13 | − 1.31 | − 3.91 | − 1.53 | − 3.50 | − 1.65 | − 3.13 | − 1.31 | − 3.91 | − 1.53 | − 3.50 | − 1.65 |
| 058 | 0.80 | − 0.83 | 1.16 | 0.47 | 2.64 | 0.03 | 2.00 | − 0.83 | 1.16 | 0.47 | 2.64 | 0.03 | 2.00 |
| 059 | − 1.30 | − 0.25 | 1.32 | − 2.17 | − 0.17 | − 1.96 | − 0.36 | − 0.25 | 1.32 | − 2.17 | − 0.17 | − 1.96 | − 0.36 |
| 061 | − 1.45 | − 1.19 | 0.08 | − 2.76 | − 1.37 | − 3.22 | − 1.89 | − 1.91 | − 0.65 | − 3.39 | − 2.00 | − 3.86 | − 2.53 |
| 068 | 1.40 | 0.95 | 3.33 | 0.91 | 2.99 | − 0.76 | 1.57 | 0.95 | 3.33 | 0.91 | 2.99 | − 0.76 | 1.57 |
| 070 | 1.60 | 7.32 | 8.25 | 8.76 | 8.52 | 5.84 | 6.65 | 3.56 | 4.48 | 6.40 | 6.16 | 3.48 | 4.29 |
| 080 | − 2.20 | − 3.42 | − 0.71 | − 4.69 | − 1.21 | − 4.11 | − 1.40 | − 3.42 | − 0.71 | − 4.69 | − 1.21 | − 4.11 | − 1.40 |
| Batch 1 | |||||||||||||
| 004 | 2.20 | 2.60 | 4.96 | 3.85 | 6.12 | 2.64 | 4.96 | 2.60 | 4.96 | 3.84 | 6.12 | 2.63 | 4.95 |
| 005 | − 0.86 | − 1.44 | 1.68 | − 1.17 | 2.41 | − 1.54 | 1.58 | − 1.44 | 1.68 | − 1.18 | 2.41 | − 1.54 | 1.58 |
| 007 | 1.40 | 2.91 | 4.90 | 3.73 | 5.59 | 2.22 | 4.30 | 2.91 | 4.90 | 3.73 | 5.59 | 2.22 | 4.30 |
| 010 (a) | − 1.70 | − 3.45 | − 1.43 | − 3.60 | − 1.38 | − 4.05 | − 2.03 | − 5.88 | − 3.85 | − 5.77 | − 3.55 | − 6.23 | − 4.21 |
| 011 (b) | − 2.96 | 1.03 | 3.43 | 1.36 | 4.05 | 0.95 | 3.34 | − 1.67 | 0.74 | − 2.48 | 0.21 | − 2.89 | − 0.50 |
| 021 | 1.20 | 1.22 | 3.72 | − 0.28 | 2.65 | − 0.48 | 2.04 | 1.22 | 3.72 | − 0.28 | 2.65 | − 0.48 | 2.04 |
| 026 (c) | − 2.60 | − 2.08 | − 0.82 | − 0.31 | 0.77 | − 1.18 | 0.02 | − 5.02 | − 3.76 | − 2.82 | − 1.74 | − 3.69 | − 2.49 |
| 027 | − 1.87 | − 3.44 | − 1.16 | − 4.29 | − 1.48 | − 4.12 | − 1.83 | − 3.44 | − 1.16 | − 4.34 | − 1.53 | − 4.17 | − 1.88 |
| 042 | − 1.10 | 0.40 | 2.63 | 0.01 | 2.12 | − 1.44 | 0.83 | 0.40 | 2.63 | 0.01 | 2.12 | − 1.44 | 0.83 |
| 044 | 1.00 | − 0.74 | 2.97 | 1.00 | 5.21 | 0.50 | 4.19 | − 0.74 | 2.97 | 1.00 | 5.21 | 0.50 | 4.19 |
| 046 | 0.20 | 0.70 | 3.38 | 1.79 | 4.42 | 0.53 | 3.17 | 0.70 | 3.38 | 1.79 | 4.42 | 0.53 | 3.17 |
| 047 | − 0.40 | − 0.35 | 2.53 | 1.26 | 4.48 | 0.79 | 3.64 | − 0.35 | 2.53 | 1.26 | 4.48 | 0.79 | 3.64 |
| 048 | 0.90 | 1.47 | 5.07 | 2.08 | 5.86 | 1.28 | 4.74 | 1.47 | 5.07 | 2.08 | 5.86 | 1.28 | 4.74 |
| 056 | − 2.50 | − 1.10 | 1.12 | − 3.02 | − 0.63 | − 3.56 | − 1.37 | − 1.10 | 1.12 | − 3.63 | − 1.24 | − 4.17 | − 1.98 |
| 060 (d) | − 3.90 | − 4.19 | − 1.79 | − 4.17 | − 1.21 | − 3.99 | − 1.58 | − 6.86 | − 4.45 | − 6.13 | − 3.17 | − 5.95 | − 3.54 |
| 063 | − 3.00 | − 6.93 | − 5.06 | − 6.88 | − 5.15 | − 7.86 | − 6.08 | − 8.77 | − 6.90 | − 9.41 | − 7.68 | − 10.39 | − 8.61 |
| 071 | − 0.10 | − 0.99 | 1.02 | − 1.03 | 0.61 | − 2.47 | − 0.60 | − 1.02 | 0.99 | − 1.04 | 0.61 | − 2.48 | − 0.60 |
| 072 | 0.60 | 3.49 | 4.30 | 4.53 | 4.55 | 2.27 | 3.09 | − 0.05 | 0.76 | 3.04 | 3.07 | 0.78 | 1.60 |
| 081 | − 2.20 | − 6.02 | − 4.20 | − 4.41 | − 2.96 | − 5.72 | − 4.05 | − 7.69 | − 5.86 | − 6.68 | − 5.23 | − 7.99 | − 6.32 |
| 090 | 0.80 | 2.04 | 4.46 | 1.87 | 3.82 | − 0.08 | 2.23 | 2.04 | 4.46 | 1.87 | 3.82 | − 0.08 | 2.23 |
| Batch 2 | |||||||||||||
| 002 | 1.40 | 2.17 | 4.35 | 3.07 | 5.22 | 2.06 | 4.21 | 2.17 | 4.35 | 3.07 | 5.22 | 2.06 | 4.21 |
| 006 | − 1.02 | 0.20 | 1.41 | − 0.28 | 0.71 | − 1.26 | − 0.09 | 0.20 | 1.41 | − 0.28 | 0.71 | − 1.26 | − 0.09 |
| 013 | − 1.50 | − 2.53 | 1.28 | − 0.44 | 3.64 | − 1.45 | 2.31 | − 2.53 | 1.28 | − 0.44 | 3.64 | − 1.45 | 2.31 |
| 019 | 1.20 | 2.81 | 5.61 | 3.74 | 6.59 | 2.61 | 5.38 | 2.77 | 5.57 | 3.74 | 6.59 | 2.61 | 5.38 |
| 024 | 1.00 | 3.46 | 6.75 | 5.40 | 8.43 | 3.51 | 6.70 | 3.46 | 6.75 | 5.40 | 8.43 | 3.51 | 6.70 |
| 033 | 1.80 | 5.06 | 6.72 | 9.80 | 10.24 | 6.33 | 7.90 | 5.06 | 6.72 | 9.80 | 10.24 | 6.33 | 7.90 |
| 049 | 1.30 | 1.80 | 3.81 | 2.50 | 4.79 | 2.25 | 4.25 | 1.80 | 3.81 | 2.50 | 4.79 | 2.25 | 4.25 |
| 050 | − 3.20 | − 0.11 | 2.49 | − 1.00 | 2.12 | − 0.91 | 1.67 | − 5.58 | − 2.98 | − 4.36 | − 1.24 | − 4.27 | − 1.69 |
| 065 | 0.70 | 1.88 | 7.06 | 6.16 | 9.79 | 0.54 | 5.53 | 1.88 | 7.06 | 6.16 | 9.79 | 0.54 | 5.53 |
| 067 | − 1.30 | 1.40 | 3.15 | 3.23 | 4.54 | 1.59 | 3.26 | 0.17 | 1.94 | 3.23 | 4.54 | 1.59 | 3.26 |
| 069 | − 1.30 | 2.34 | 5.18 | 2.01 | 4.64 | 0.28 | 3.08 | 0.95 | 3.79 | 1.86 | 4.49 | 0.13 | 2.93 |
| 074 | − 1.90 | − 6.61 | − 3.04 | − 9.85 | − 5.62 | − 9.76 | − 6.25 | − 6.61 | − 3.04 | − 9.85 | − 5.62 | − 9.76 | − 6.26 |
| 075 | − 2.80 | 1.35 | 3.07 | 1.22 | 2.46 | − 0.48 | 1.15 | − 0.36 | 1.37 | − 1.05 | 0.18 | − 2.75 | − 1.13 |
| 082 | 2.50 | 8.17 | 9.06 | 12.15 | 10.96 | 7.34 | 8.02 | 4.94 | 5.84 | 9.88 | 8.69 | 5.06 | 5.75 |
| 083 (e) | − 1.90 | – | – | – | – | – | – | – | – | – | – | – | – |
| 084 | 0.00 | 3.79 | 6.52 | 4.66 | 6.42 | 1.77 | 4.25 | 1.25 | 3.97 | 3.90 | 5.67 | 1.02 | 3.50 |
| 085 | − 2.20 | − 2.33 | − 0.57 | − 1.24 | 0.39 | − 2.29 | − 0.56 | − 8.14 | − 6.39 | − 1.24 | 0.39 | − 2.29 | − 0.56 |
| 086 | 0.70 | 4.15 | 6.59 | 7.23 | 7.80 | 3.74 | 5.52 | 2.89 | 5.32 | 5.58 | 6.15 | 2.09 | 3.87 |
| 088 | − 1.90 | − 1.46 | − 0.62 | 2.19 | 2.02 | − 0.41 | 0.35 | − 1.46 | − 0.62 | 2.19 | 2.02 | − 0.41 | 0.35 |
| 092 | − 0.40 | − 0.71 | 3.52 | 2.91 | 5.61 | − 1.51 | 2.33 | − 0.71 | 3.52 | 2.87 | 5.56 | − 1.55 | 2.28 |
(a–d) Corrected results for the SAMPL5 setup; original data [60] for log P (2-par-I(5), 3-par) and log D7.4 (2-par-I(5), 3-par):
(a) − 3.45, − 1.43, − 3.46, − 1.43
(b) 1.03, 3.43, 1.03, 3.43
(c) − 2.08, − 0.82, − 2.08, − 0.82
(d) − 4.19, − 1.79, − 4.19, − 1.79
(e) Excluded as MP2 energies could not be calculated
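For some compounds in the table above (e.g. 003) the calculated log P and log D values coincide, whereas for others (e.g. 010) log D lies about two log units below log P, which is the signature of ionization at the measurement pH. The textbook relation for a single ionizable group illustrates how an acidity constant links the two quantities. The sketch below is only an illustration under stated assumptions: one ionizable site, only the neutral species entering the organic phase, and pH 7.4. The function name and the example values are hypothetical, and this is not the EC-RISM workflow itself.

```python
import math

def log_d_monoprotic(log_p, pka, ph=7.4, acid=True):
    """Textbook log D for one ionizable group.

    Assumes that only the neutral form partitions into the organic phase.
    For an acid the ionized fraction grows with (pH - pKa),
    for a base with (pKa - pH).
    """
    exponent = (ph - pka) if acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)

# Hypothetical acid, pKa two units below the pH: log D drops by about two log units.
print(round(log_d_monoprotic(3.0, 5.4), 2))

# Hypothetical acid with pKa 10: essentially neutral at pH 7.4, so log D stays close to log P.
print(round(log_d_monoprotic(3.0, 10.0), 2))
```

Under these assumptions an error of one pKa unit translates almost directly into an error of one log D unit once the compound is predominantly ionized, which is why pKa and log P errors are hard to separate in log D benchmarks.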
Statistical metrics (root-mean-square error RMSE, mean absolute error MAE, mean signed error MSE, and slope m′, intercept b′, and coefficient of determination R2 from descriptive regression) for the pKa values predicted using the SAMPL5 and SAMPL6 setups compared with the Chemicalize [87] predictions
| pKa | RMSE | MSE | MAE | R² | m′ | b′ |
|---|---|---|---|---|---|---|
| SAMPL5 | 2.07 | − 0.57 | 1.54 | 0.72 | 0.88 | 0.21 |
| SAMPL6 | 2.10 | 0.58 | 1.45 | 0.73 | 0.94 | 0.99 |
Fig. 2 Partition (dark blue) and distribution coefficients (green) calculated using EC-RISM with the SAMPL6-type (A–D: 1-par, 2-par, 2-par-I, 3-par) and SAMPL5-type [E and F: 2-par(5), 3-par(5)] models for water and cyclohexane compared with the experimental results (excluding SAMPL5_083, for which MP2 energies could not be obtained). Compounds of batches 0, 1, and 2 are shown as triangles, squares, and pentagons, respectively. Dashed lines indicate descriptive linear regression results
Fig. 3 Acidity constants calculated with the original SAMPL5 setup compared to the SAMPL6 setup (A), and results obtained from the SAMPL5 (green) and SAMPL6 setups (dark blue) compared to Chemicalize [87] predictions (B). Dashed lines indicate descriptive regression results. Raw data are provided as Online Resource 5
Fig. 4 Distribution coefficients calculated using the best-performing SAMPL6-type EC-RISM model (2-par-I, RMSE 2.49, excluding SAMPL5_083) compared with predictions from the COSMO-RS model applied to the SAMPL5 challenge [89] with an RMSE of 2.11 (before correction for water presence, including special correction for SAMPL5_069). The diagonal line indicates perfect correlation
Statistical metrics (root-mean-square error RMSE/kcal mol−1, mean absolute error MAE/kcal mol−1, mean signed error MSE/kcal mol−1, and slope m', intercept b', and coefficient of determination R2 from descriptive regression) for all SAMPL2 tautomer pairs from SAMPL6-type models for water [MP2/6-311+G(d,p)/PSE-2] and the original SAMPL2 setup (MP2/aug-cc-pVDZ/PSE-3), the latter reported for calculations including minimum rotamer (“min”) free energies only and from partition function (Z) averaging, while SAMPL6-style calculations—also for explicit consideration of thermally corrected gas phase legs [“CCSD(T)”] of the thermodynamic cycle as in [71]—are shown for the partition function approach only
| Model | Group | RMSE | MAE | MSE | m′ | b′ | R² |
|---|---|---|---|---|---|---|---|
| SAMPL6/Z | |||||||
| 1–6 | Obscure | 1.59 | 1.26 | 1.26 | 1.21 | 2.19 | 0.95 |
| 10–16 | Explanatory | 3.36 | 3.08 | 2.78 | 0.02 | 2.00 | 0.00 |
| 1–16 | All | 2.69 | 2.24 | 2.04 | 1.00 | 2.04 | 0.79 |
| SAMPL6/CCSD(T) | |||||||
| 1–6 | Obscure | 1.52 | 1.13 | 0.62 | 1.31 | 2.02 | 0.93 |
| 10–16 | Explanatory | 2.62 | 2.39 | 2.39 | 0.82 | 2.24 | 0.46 |
| 1–16 | All | 2.20 | 1.83 | 1.32 | 1.03 | 1.38 | 0.79 |
| SAMPL2/min | |||||||
| 1–6 | Obscure | 2.90 (2.91) | 2.67 | − 2.67 | 1.10 | − 2.20 | 0.89 |
| 10–16 | Explanatory | 0.58 (0.57) | 0.46 | 0.12 | 0.83 (0.89) | − 0.02 (− 0.05) | 0.78 (0.77) |
| 1–16 | All | 1.98 (1.93) | 1.49 | − 1.00 | 1.18 | − 0.64 (− 0.63) | 0.86 |
| SAMPL2/Z | |||||||
| 1–6 | Obscure | 2.78 | 2.53 | − 2.53 | 1.10 | − 2.10 | 0.89 |
| 10–16 | Explanatory | 0.66 | 0.52 | 0.21 | 0.84 | 0.09 | 0.74 |
| 1–16 | All | 1.93 | 1.47 | − 0.94 | 1.16 | − 0.63 | 0.86 |
Numbers in parentheses denote original values from the SAMPL2 paper [54] where equilibrium constants have been transformed to reaction Gibbs energies, whereas we here show metrics relative to reference Gibbs energies from the SAMPL2 overview paper [37]. Structures are provided as Online Resource 6; calculated data, also split into separate components, as Online Resource 7
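The footnote above mentions that equilibrium constants were transformed to reaction Gibbs energies. The standard relation for this conversion is ΔG = −RT ln K. The following minimal sketch assumes T = 298.15 K, since the temperature is not stated in this excerpt, and the function names are illustrative only.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1

def gibbs_from_k(k_eq, temperature=298.15):
    """Reaction Gibbs energy (kcal/mol) from an equilibrium constant."""
    return -R_KCAL * temperature * math.log(k_eq)

def k_from_gibbs(delta_g, temperature=298.15):
    """Equilibrium constant from a reaction Gibbs energy (kcal/mol)."""
    return math.exp(-delta_g / (R_KCAL * temperature))

# One decade in K corresponds to roughly 1.36 kcal/mol at 298.15 K (assumed temperature).
print(round(gibbs_from_k(10.0), 3))

# A reaction Gibbs energy of -4.8 kcal/mol expressed as log10 K.
print(round(math.log10(k_from_gibbs(-4.8)), 2))
```

Because one log unit in K is worth about 1.36 kcal mol−1 at this temperature, an RMSE of 2 to 3 kcal mol−1 corresponds to roughly two orders of magnitude in the tautomer ratio.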
Experimental [37] tautomerization Gibbs energies (kcal mol−1) including estimated errors, calculated values from original SAMPL2 setup for rotamer minima (“min”) and partition functions (“Z”) [54], and from direct (“Z”) and indirect (“CCSD(T)”) [71] approaches using the SAMPL6 setup
| Reaction | Exp | Error | SAMPL2/min | SAMPL2/Z | SAMPL6/Z | SAMPL6/CCSD(T) |
|---|---|---|---|---|---|---|
| 1A → 1B | − 4.8 | 0.3 | − 7.73 | − 7.57 | − 3.38 | − 4.52 |
| 2A → 2B | − 6.1 | 0.3 | − 9.66 | − 9.29 | − 5.40 | − 6.74 |
| 3A → 3B | − 7.2 | 0.3 | − 11.17 | − 11.12 | − 7.04 | − 8.12 |
| 4A → 4B | − 2.3 | 0.4 | − 4.57 | − 4.43 | 0.96 | − 0.52 |
| 5A → 5B | − 4.8 | 0.5 | − 6.16 | − 5.83 | − 3.28 | − 4.19 |
| 5B → 5C | 0.5 | 0.2 | − 0.51 | − 0.51 | 1.50 | 1.25 |
| 6A → 6B | − 9.2 | 0.4 | − 11.15 | − 11.12 | − 9.05 | − 9.59 |
| 6A → 6Z | − 2.4 | 0.3 | − 6.72 | − 6.69 | − 0.43 | 1.17 |
| 7A → 7B | 7.0 | 1.5 | 5.11 | 4.71 | 6.50 | 3.94 |
| 8A → 8B | − 3.0 | 3.0 | − 1.01 | − 1.38 | 0.38 | − 2.34 |
| 10B → 10C | − 2.9 | 0.4 | − 2.84 | − 2.83 | 0.91 | − 0.20 |
| 10D → 10C | − 1.2 | 0.2 | − 0.55 | − 0.45 | 3.54 | 2.70 |
| 11D → 11C | − 0.5 | 0.2 | − 0.39 | − 0.23 | 3.64 | 2.96 |
| 12D → 12C | − 1.8 | 0.7 | − 0.79 | − 0.60 | 2.73 | 1.57 |
| 13D → 13C | 0.1 | 0.1 | 0.81 | 1.09 | 4.31 | 3.20 |
| 13D → 14C | 0.3 | 0.3 | 0.16 | 0.32 | 1.64 | 0.84 |
| 15A → 15B | 0.9 | 0.3 | 0.02 | 0.01 | − 0.62 | 2.65 |
| 15A → 15C | − 1.2 | 0.3 | − 1.87 | − 1.87 | 0.53 | 1.18 |
| 15B → 15C | − 2.2 | 0.3 | − 1.88 | − 1.88 | 1.15 | − 1.47 |
| 16A → 16C | 0.5 | 0.1 | 0.56 | 0.56 | 1.90 | 2.46 |
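The statistical metrics used throughout can be recomputed from the tabulated reaction Gibbs energies. The sketch below does so for the eight reactions of pairs 1–6 with the SAMPL2/min values copied from the table above. It assumes that the signed error is calculated minus experimental and that the descriptive regression fits calculated against experimental values. Neither convention is stated explicitly in this excerpt, but with both the result matches the SAMPL2/min row for pairs 1–6 in the metrics table (RMSE 2.90, MAE 2.67, MSE −2.67, m′ 1.10, b′ −2.20, R2 0.89) to within rounding. The function name is illustrative.

```python
import math

# Reaction Gibbs energies (kcal/mol) for pairs 1-6, copied from the table above.
exp_dg = [-4.8, -6.1, -7.2, -2.3, -4.8, 0.5, -9.2, -2.4]
calc_dg = [-7.73, -9.66, -11.17, -4.57, -6.16, -0.51, -11.15, -6.72]  # SAMPL2/min

def error_metrics(calc, ref):
    """RMSE, MAE, MSE and a least-squares fit calc = m' * ref + b'."""
    n = len(ref)
    err = [c - r for c, r in zip(calc, ref)]  # assumed sign convention: calc - exp
    rmse = math.sqrt(sum(e * e for e in err) / n)
    mae = sum(abs(e) for e in err) / n
    mse = sum(err) / n
    mean_x = sum(ref) / n
    mean_y = sum(calc) / n
    sxx = sum((x - mean_x) ** 2 for x in ref)
    syy = sum((y - mean_y) ** 2 for y in calc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ref, calc))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy * sxy / (sxx * syy)  # squared Pearson correlation
    return {"RMSE": rmse, "MAE": mae, "MSE": mse,
            "slope": slope, "intercept": intercept, "R2": r2}

for name, value in error_metrics(calc_dg, exp_dg).items():
    print(f"{name}: {value:.2f}")
```

Here MAE equals the magnitude of MSE because every calculated value lies below its experimental counterpart, i.e. the error for this group is purely systematic in sign.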
Fig. 5 Calculated and experimental standard reaction Gibbs energies for the tautomer pairs of the SAMPL2 dataset (A–C) [37, 54] and comparison of explicit thermodynamic cycle data with corresponding explicit COSMO-RS (MP2+vib-CT-BP-TZVP) results [90] (D). Data using the SAMPL6 workflow (MP2/6-311+G(d,p)/φopt/PSE-2) are shown as orange squares (obscure pairs 1–6), green triangles (explanatory pairs 10–16) and green crosses (explanatory pairs 7 and 8). Linear regressions are depicted as dashed lines in corresponding colors, with the total regression over all pairs in light blue (A–C). The data of the original SAMPL2 submission are shown by red squares (1–6), blue triangles (10–16) and blue crosses (7 and 8) with regression lines again in corresponding color and total regression in magenta for the best performing SAMPL2 model (MP2/aug-cc-pVDZ/PSE-3) using only minimum conformations for the SAMPL2 setup (A: SAMPL2/min and SAMPL6/Z) or the Boltzmann-weighted free energies of the conformational ensemble (B: SAMPL2/Z and SAMPL6/Z). Results from the explicit thermodynamic cycle combining SAMPL6-style Gibbs free energies of hydration and CCSD(T)/cc-pVTZ gas phase free energies including B3LYP/6-311+G(d,p) thermal corrections are shown by analogously color-coded symbols in (C)