Literature DB >> 34152755

Machine Learning Strategies When Transitioning between Biological Assays.

Staffan Arvidsson McShane¹, Ernst Ahlberg^1,2,3, Tobias Noeske⁴, Ola Spjuth¹.

Abstract

Machine learning is widely used in drug development to predict activity in biological assays based on chemical structure. However, the process of transitioning from one experimental setup to another for the same biological endpoint has not been extensively studied. In a retrospective study, we here explore different modeling strategies of how to combine data from the old and new assays when training conformal prediction models using data from hERG and NaV assays. We suggest to continuously monitor the validity and efficiency of models as more data is accumulated from the new assay and select a modeling strategy based on these metrics. In order to maximize the utility of data from the old assay, we propose a strategy that augments the proper training set of an inductive conformal predictor by adding data from the old assay but only having data from the new assay in the calibration set, which results in valid (well-calibrated) models with improved efficiency compared to other strategies. We study the results for varying sizes of new and old assays, allowing for discussion of different practical scenarios. We also conclude that our proposed assay transition strategy is more beneficial, and the value of data from the new assay is higher, for the harder case of regression compared to classification problems.

Entities: Chemical Disease Gene

Year: 2021 PMID： 34152755 PMCID： PMC8317157 DOI： 10.1021/acs.jcim.1c00293

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Assessing properties of novel compounds using one or several biological and biochemical assays is a common methodology in preclinical drug discovery.[1] Important properties include on- and off-target effects, ADME (Absorption, Distribution, Metabolism, Excretion) and Toxicity, and there exist a large number of assays developed for these and other endpoints.[2] Predicting the result of an assay using in silico methods such as Machine Learning (ML), prior to performing the assay or even before synthesizing the compound,[3−5] has increased in popularity over the years. When the chemical structure is used to represent the compound in such ML modeling, the method is referred to as QSAR or SAR (Quantitative Structure–Activity Relationships)[6] and falls under what is called ligand-based methods. QSAR has been used to model a wide range of endpoints, such as interaction with various targets.[7,8] The database ChEMBL[9] collects a great deal of curated data from different compounds and assays, and it is a common approach to merge data for the same target from different assays into a single data set that is subjected to ML modeling.[10,11] When evaluating the accuracy of a model, the standard protocol is to split the data set in a training set for training the model and a test set to evaluate its accuracy. There are also methods such as cross-validation that can be used to produce a balanced accuracy measure. The accuracy of QSAR models typically depends on the number of compounds/experiments in the training set. In pharmaceutical companies, the data generating process leads to a continuous expansion of assay data and hence over time increases the accuracy of their models. However, sometimes the organization might want to switch to another experimental setup to report on the same endpoint; this might be due to better capture the underlying phenomenon, decrease variance, or to reduce time and cost. Before transitioning to a new assay, it is common to run a set of compounds with both the old and the new assay in order to study the agreement between the measurements, referred to as assay concordance.[12,13] These sets typically contain well characterized tool compounds covering the biological phenomena of interest (e.g., mode of actions) and compounds that are chemically diverse and span a wide potency range. Assay concordance may be determined by a statistical approach that analyzes the mean difference of both assays and produces an agreement interval accounting for 95% of the differences between both assays.[14] However, how to use data from the old assay as efficiently as possible in downstream ML applications leads to several questions: (1) When the organization starts to generate data from the new assay, how should the data from the old assay be used; (2) Should models be trained exclusively on the new assay data, potentially resulting in low accuracy until a sufficient number of experiments have been run; (3) Should data from the old assay be pooled with the new assay, even though there is a known difference between them. A core assumption of all ML methods is that the data used for training the model is i.i.d., i.e., independent and identically distributed. If data from, e.g., an old assay and a new assay stem from different distributions, then a model trained on pooled data from both assays might not be valid and predictions cannot necessarily be trusted. There are methods devised to detect violations against i.i.d., commonly called data set shifts,[15,16] but these are restricted to specific versions of shifts (e.g., covariate shift or concept shift). Several methods have also been proposed to increase the accuracy of the trained model when knowing that a data set shift is present.[17−20] Conformal prediction (CP) is a mathematical framework developed for ML with the objective to produce well-calibrated predictions where the predictions adhere to a user-defined confidence (e.g., requiring 80% confidence results in at least 80% accurate predictions).[21,22] CP assumes exchangeability between all data, which is a similar but a slightly weaker assumption than i.i.d. A benefit of using CP is that the calibration of test data can be inspected, and poor calibration can intrinsically reveal data shifts and improper handling of data. In recent work, we assessed CP for improving the calibration when a data drift has occurred and concluded that updating the calibration set improved calibration across all evaluated data sets.[23] CP has been used extensively in various drug discovery applications.[24−29] In this manuscript, we perform a retrospective analysis of how old data can be used most efficiently when a decision has been made to switch to a new assay system, using data from the hERG and NaV endpoints at AstraZeneca. We refer to this specific problem as Assay Transition, and a distinguishing property of the problem is that it includes a continuous decision making process during the transition from one specific assay to another. We apply conformal prediction which enables us to evaluate the level of calibration and efficiency for different modeling strategies and discuss their implications.

Materials and Methods

Data

In-house bioassay data sets from AstraZeneca were used, containing dose–response data for the two ion channels hERG and NaV.[30,31] These are routinely screened in the early phases of drug discovery as they are tightly linked to cardiovascular risks[32] and are thus among the largest data sets generated from single assays. The raw, unprocessed data sets contained in excess of 152,000 and 16,000 records for hERG and NaV, respectively. Some records included a qualifier (“>” or “<”), indicating that the endpoint value was not determined exactly but that the IC50 value was either larger or smaller than the tested concentrations. From late 2016 until early 2017, the in-house routine voltage-gated ion-channel assays were moved from the existing medium-throughput electrophysiology IonWorks[33] device to the high-throughput SyncroPatch 384 PE platform. This switch facilitated technical improvements (reduced screening turnaround time, increased capacity, reduction of consumables spent) as well as the ability to detect slow onset ion-channel blockers. To assess how well the assays are agreeing, compounds tested in both assays were studied (Figure ). Overall, 78.2% of hERG and 89.3% of NaV compounds retained their categorical label after the transition (assuming a 10 μM threshold, see Data Preparation), see Figure B,E. Considering the measured endpoint values, the measurements for hERG were statistically different (p = 3.06 × 10–12 for IC50 and p = 1.95 × 10–9 for pIC50, using paired t-tests). NaV only contained 17 compounds with exact assay values, too few to give a statistical difference, albeit visually there is a large variance in assay measurements in Figure D.

Figure 1

Analyzing compounds measured in both the old and new assays. The top panels show the overlap in hERG (1,169 compounds), and the bottom panels show NaV (56 compounds). Panels A and D plot new measurements vs old measurements in pIC50, requiring exact assay values without qualifier, cutting down the overlap to 972 and 17 compounds, respectively. Fitting a linear model between old and new endpoint values (equations shown in the panels) indicate a positive correlation between new and old assays but with a slope far from the ideal (slope of 1). For hERG, the measurements were statistically different using a paired t-test (p = 1.95 × 10–9). The NaV overlap, only containing 17 compounds, was too small to prove statistically different. Panels B and E display the change in categorical activity, when applying a 10 μM threshold. Both data sets retain the same activity-class for the majority of the compounds. Finally, panels C and F plot the computed logP (cLogP) versus molecular weight, with no apparent clusters in this “chemical space” and indicate that all compounds originate from druggable chemical space.[34]

Data Preparation

Data was acquired in CSV format including compound ID, signature feature counts[35] using heights 1–3, test date, measured endpoint value, an optional qualifier (“>” or “<”), and some additional descriptors such as molecular weight and computed LogP. The qualifier indicates whether the endpoint value was determined to be that exact value, or if the IC50 value was either larger or smaller than the tested concentrations. A 10 μM threshold was used for categorizing compounds as active (A) or nonactive (N), according to ref (36). We define Anew to represent a data set containing observations from the new assay and Aold to represent the equivalent for observations from the old assay; the actual number of observations depends on the context. The data preparation steps are outlined in Box 1. For the NaV data set, step 2 also included removing all records with an IC50 of 33.3 μM, as it was found to be over-represented in the data set and likely an artifact. The exclusion of compounds recorded at over 100 μM was due to the typical experimental range was only up to 100 μM. Too few active compounds, i.e., with IC50 ≤ 10, were measured in the new NaV assay, so the classification data set was excluded for further analysis. The size of the final data sets used in the analysis is found in Table , and the total number of signature descriptors was in excess of 27,000 (NaV), 80,000 (hERG regression), and 120,000 (hERG classification).

Table 1

Final Data Sets Used in the Experiments, after Filtration Steps and Applying the 10 μM Threshold for Generating Categorical Labelsa

data set	A_new	A_old	% active [new]	% active [old]
hERG classification	4,800	64,000	45%	30%
hERG regression	3,300	36,400
Na_V regression	190	4,900

Note that the final sizes are slightly rounded off for the confidentiality of AstraZeneca.

Note that the final sizes are slightly rounded off for the confidentiality of AstraZeneca. Investigating potential divergences between the assays was also performed using the final data sets, in both descriptor space and the distribution of the measured assay values (Figure A–F). Truncated singular value decomposition (Truncated SVD) was used to compress the descriptor space into two dimensions, the score space was computed using the new assay data matrices (independently for hERG and NaV), and all records were then projected into the resulting score space (Figure A,D). Although only new data was used for computing the score space, there is an overall high similarity between the old and new assays. The old assays (green) are covering a larger area, suggesting that the diversity of compounds was higher for the old assays, or alternatively this could be a remnant of the sheer difference in the number of tested compounds, with the total explained variance of around 13% it is impossible to determine.

Figure 2

Visual analysis of the data sets. Aold is plotted in green, Anew is plotted in orange, the first row (A–C) shows hERG data, the second row (D–F) shows NaV, and the third row (G–I) shows the hERG augmented data set. Panels A and D plot the truncated SVD of the signature descriptors, explaining 13.28% and 13.62% of the total variance in data. Panels B and E plot the computed signature descriptors using t-SNE dimensionality reduction. Panels C and F plot the distribution of measured assay values, where the dashed lines show the mean for each assay. Panels G and H display the augmentation made to the hERG measurements in Aold, expressed in IC50; the dashed line in panel G corresponds to a 1:1 relation between the original and augmented data (i.e., no change). The dashed line in panel H is the mean value of the plotted distribution. Figure B,E shows the result of t-SNE (t-distributed Stochastic Neighbor Embedding) dimensionality reduction computed using both old and new data combined (independently for hERG and NaV) after an initial step of truncated SVD into 50 dimensions. Similarly as for the truncated SVD, the old assay (green) covers a larger area compared to the new assay (orange). Here the records group into separate clusters, characteristic of the t-SNE algorithm, and not necessarily all due to true clusters in data. Nonetheless, panel E shows that many of the clusters are exclusively green (old assay records), and many of the orange triangles (new assay records) cluster together rather than spread uniformly with the green records. These clusters could be due to the typical drug discovery process, where in the lead optimization process many similar compounds are tested.[37] Lastly, the distribution of the measured endpoint values from the assays was plotted in Figure C,F. Interestingly the difference between the old and new assays is the opposite for hERG and NaV, where the new hERG assay had a higher proportion of compounds with a stronger interaction and thus indication of a higher safety risk. For NaV, the change is the opposite, where the new assay has measured compounds with a lower degree of interaction. Possibly this difference could be due to the development process where the hERG assay is performed prior to NaV, and high risk compounds are excluded before further testing is performed. The combination of the results shown in Figure A–F indicates that both hERG and NaV assays are not i.i.d. between the old and new data, both in terms of descriptor space and the measured activity. Simply merging data from the legacy assay and the new assay would violate the requirement of i.i.d., and the resulting models would likely be unreliable.

Augmented Data Set

Apart from the hERG and NaV data sets, corresponding to real life data, an augmented data set was generated for simulating a scenario when transitioning between assays with less agreement. To keep the relevance to biological assays in the simulation, the augmented data set was constructed based on the hERG regression data sets. The descriptors were kept unchanged, while the assay measurements were altered with the goal of increasing the mismatch between the old and new assays. For hERG, the old assay reported compounds on average 2.54 μM (median 0.70 μM) higher than the new assay when analyzing the compounds measured in both assays (see Figure A). To increase the disagreement between the assays, the measurements of the old assay data were altered according to eq , where G5,0.5 denotes Gaussian random noise with μ = 5 and σ = 0.5. The transition of the augmented data set (altered old assay values) to the new hERG assay thus corresponds to a larger mismatch between the assay measurements compared to the original problem, see Figure G,H. The alteration includes a quadratic term in order to increase the mismatch for higher concentrations, as well as a randomized term that both performs a fixed shift and adds noise, resulting in an assay with different characteristics compared with the original.

Conformal Prediction

Conformal prediction (CP)[21,25] is a mathematical framework sitting on top of standard ML algorithms, proven to produce well-calibrated predictions adhering to user-defined confidence levels. To achieve this, the conformal predictor outputs prediction intervals (regression) or prediction sets (classification). A prediction is considered accurate if the true label is located within the interval or is part of the prediction set. Two types of conformal predictors, the transductive conformal predictor (TCP) and the inductive conformal predictor (ICP), are proven to produce well-calibrated predictions where the accuracy of the predictions is equal to or greater than the specified confidence the user asks for, given that data is exchangeable.[21] Standard ML methods already impose the stricter requirement for data being i.i.d., so CP does not introduce any further requirements from what is already present. Another group of inductive conformal predictors, collectively called Aggregated Conformal Predictors (ACPs), is based on training and combining several ICPs and merging their individual predictions into a single, final prediction. These are probably the most practically useful, with an improved informational efficiency (see the definition in the section Efficiency and Validity) compared to a single ICP,[38] but do not retain the guaranteed validity which thus requires more effort to be put in validating the calibration of the resulting models.

Nonconformity

CP operates on the notion of nonconformity, or “strangeness”, of observations. The nonconformity of an object is calculated using a nonconformity function (or measure) which is typically derived from an underlying ML algorithm, and the nonconformity of test objects is what is used for generating the final prediction by a ranking against the nonconformity scores of a calibration set. For inductive conformal predictors, the type of predictors that is used herein, the calibration set is derived by sampling observations without replacement from the full training set. The remaining observations in the training set are called the proper training set, and these are used for training the underlying ML algorithm that is used in the nonconformity function. The nonconformity function is a parameter of the CP algorithm, and choosing a good function is key to optimal predictive performance.[39]

Efficiency and Validity

Conformal predictors are evaluated based on two concepts; validity and efficiency. Validity refers to the calibration of the predictions, verifying that the predictor adheres to the user-provided confidence level, and is typically confirmed with calibration curves where the accuracy is plotted against the desired confidence. Deviations from perfect calibration is, however, possible due to, e.g., test set size or calibration set size, and there is no strict method to definitively decide if a model is valid or not. The efficiency of a predictor quantifies the informativeness of the predictions and can be measured in many different ways,[40] e.g., by the width of the prediction intervals (regression) or by the fraction of prediction sets that include a single label (classification). Compared with traditional model accuracy estimates for ML based on an external test set or cross-validation, where the same estimate is given to all test examples, CP delivers object-specific prediction intervals that depend both on the predicted object and on the user-defined confidence level.

Study Design

The design of the experiments was aimed at benefiting readers in many scenarios of transitioning between assays, hence a large range of different sizes of Aold and Anew was evaluated, trying to simulate a wide range of possible combinations. After a transition to a new assay, the goal will be to predict the measurement that the new assay would generate for new compounds. Thus, testing was exclusively performed on data from the new assay. Each combination of Aold and Anew was evaluated with a 10-fold cross-validation (CV) of all the Anew data, repeated with ten replicates (N = 10). For each fold in the CV, the fixed number of samples was then drawn randomly from the training split of Anew from the CV and from the full Aold data set. Each replicate experiment had a fixed seed used for shuffling, CV splitting, sampling of data, and seeding the modeling algorithm, and the seeds were reused for all combinations of Aold, Anew, and a modeling strategy. Using this experimental setup facilitates the comparison of all results from the figures, as the test sets were identical at all points. All experiments using a specific Aold size (i.e., a single panel of the result plots) will have had access to exactly the same observations for training, and the same applies to a specific Anew size. An artifact that comes with the study design is that the variance between replicates will become smaller as the data sizes increase, simply due to sampling more records out of the full data set will lead to more overlaps among the replicates. E.g., when using all observations in Aold, the only difference between replicate experiments will be the splits of Anew in the CV, the sampling into calibration and proper training set in the CP algorithm, and the seed used for the modeling algorithm. We consider this to still be practically useful, even though statistically the plotted confidence intervals in the results are for the mean result of the population fixed to the complete data sets, rather than the mean for any possible set of compounds.

Evaluated Assay Transitioning Strategies

In this section, we describe the evaluated strategies for transitioning between two assays; a list with definitions can be found in Table . As mentioned previously, the ICP and TCP types of conformal predictors have been mathematically proven to be valid given exchangeable data. However, the TCP version is computationally demanding, as it requires retraining the underlying algorithm for every new prediction and is thus impossible for all but the smallest modeling problems. ICP, on the other hand, trains a single model using the proper training set, and this model is then used for all predictions, still producing valid models; but to some degree, reduced efficiency due to some training examples is set aside in a calibration set. One of the evaluated strategies was an ICP, termed ICPoldnew, where the proper training set was fixed to be all of the Aold data and the calibration set of all of the Anew data. Exchangeability is thus preserved between the testing data and calibration data, and it should thus be guaranteed to be valid and act at least as a reference point when inspecting the calibration curves, albeit presumably with lower efficiency than the other strategies.

Table 2

Modeling Strategies Evaluated in the Studya

strategy	aggregated models	proper training set	calibration set	exchangeable
CCP_new	10	A_new	A_new	×
CCP_old	10	A_old	A_old
CCP_pool	10	(A_old ∪A_new)	(A_old ∪A_new)
ICP_old^new	1	∀A_old	∀A_new	×
CCP_AT	10	A_new ∪ ∀A_old	A_new	×
CCP_AT2	10	A_old	A_old ∪ ∀A_new

The A and A notation should be interpreted as 1 or 9 parts of a 10-fold split of the data set A, where the folds are shifted for each ICP model in a similar fashion as in cross-validation test-train splits. The last column indicates whether the model’s calibration set is exchangeable with the test data, i.e., theoretically guarantees valid models. All remaining strategies were based on Cross-Conformal Predictors (CCPs),[41] a type of ACP where data is randomly split into a calibration set and a proper training set in a folded fashion similar to k-fold cross-validation, consequently training k independent Inductive Conformal Predictors (ICPs), each with one fold for a calibration set and k–1 folds for a proper training set. The conformal p-values (classification) and intervals (regression) from the k predictions were aggregated using the median value, being the preferred method to retain good calibration of the final models.[42] For the classification data set, the calibration was conducted in a Mondrian fashion, where the calculation of p-values is performed independently for each class, which has been shown to work well for imbalanced data sets without requiring under/oversampling, boosting, or similar techniques.[43,44] Mondrian calibration was also performed in the ICPoldnew modeling strategy. The evaluated CCP-based strategies and their rationales were as follows (see also Table for definitions):From these strategies, we expect the CCPnew and CCPAT strategies to produce well-calibrated predictions, as the calibration sets are exchangeable with the test data which is only drawn from Anew. CCPnew which only uses Anew data and thus avoids potential issues of mixing data from different distributions. CCPold which only uses Aold data, maximizing the number of training observations while avoiding mixing of data. CCPpool which pools all Aold and Anew data, producing the largest data set, but potentially violating the i.i.d. assumption. CCPAT which uses Anew data in the k-fold CCP splits but augments the proper training set of all ICPs by adding all Aold data–maximizing the amount of data in the proper training set but exclusively calibrating the predictions using observations from the new assay. CCPAT2 which uses Aold in the k-fold CCP splits and instead augments the calibration set of all ICPs by adding all available Anew data to the calibration set.

Hyperparameters

A linear Support Vector Machine (SVM)[45] was used as an underlying learning algorithm, successfully applied in previous QSAR studies in combination with the signature molecular descriptor,[46,47] while being computationally less demanding compared to other kernel-based SVMs.[47] The SVM cost parameter was set to 0.25, and ϵ (termed p in LIBLINEAR) in ϵ-SVR was set to 0.01. These parameters were found using grid-search of cost and ϵ without applying CP and instead optimizing the RMSE of the trained models. The full hERG regression data set, both exclusively using the new assay data and with a combination of both assays, and NaV, using only the new assay, were evaluated in the grid-search for cost and ϵ. The results in terms of RMSE were stable across the data sets, and the obtained cost value was close to an earlier benchmark study of SVM parameters,[48] so no further tuning was conducted. The LIBLINEAR solver type was set to L2R_L2LOSS_SVR_DUAL for regression and L2R_L2LOSS_SVC for classification. The tolerance of the termination criterion was set to 0.001 following the default in the software that was used. This study was conducted utilizing the implementation of ICP and CCP from the software CPSign version 1.5.0-beta4,[49] with customized sampling strategies to match the sampling strategies of Table . The goal of the study was to compare different strategies of how the available data should be used, and improvements in absolute efficiency were of less interest. To facilitate the extensive number of experimental runs, no effort was put into finding optimal hyperparameters for each experimental setup; instead the cost and ϵ from the non-CP grid-search were used in all setups to give similar advantage/disadvantage to all setups. The number of folds in CCP (k) was set to 10. The nonconformity function for classification was defined as the negative distance to the decision surface of the SVM (termed “NegativeDistanceToHyperplane” in CPSign). For regression, a normalized nonconformity function was used, which normalizes the prediction interval width depending on the predicted accuracy of the scoring model, following the definition in Papadopoulos and Haralambous[50] (termed “LogNormalized” in CPSign). Both the scoring and error model were linear SVMs, with the nonconformity function outlined in eq , where α is the nonconformity score, y is the true label, ŷ is the predicted label from the scoring model, and μ̂ is the predicted error from the error model, all for instance i. The smoothing factor, β, was set to 0.01. Linear interpolation of p-values and prediction intervals were used to accommodate for small calibration sets.[51,52]

Code Availability

All essential code and instructions needed for running the experiments can be found in a GitHub repository at https://github.com/pharmbio/assay-transition-study. Note that the data preparation steps were left out as they were deemed too specific for the particular data sets, and the reader is instead referred to Box 1 and adapting it depending on their particular data.

Results

The results are divided into subsections for each data set, and to simplify the interpretations, the models with insufficient calibration were excluded from the analysis in the paper; but all results can be found in the Supporting Information, alongside calibration plots for all experiments. As pointed out in the Conformal Prediction section, there is no definitive way to decide if a model is valid or not; here, we based our assessment on whether the calibration plot for the largest Aold size put the accuracy clearly below the expected accuracy to judge a modeling strategy to be invalid. The result plots are arranged so that the reader can pick the Aold size that is of interest (i.e., picking one of four panels) and follow the trends of what happens when more data is generated with the new assay (i.e., changes along the x-axis). Efficiency is plotted using the 95% confidence intervals (CI) of the mean result values assuming a t-distribution, computed from the 10 replicates for each experiment combination (Anew size, Aold size, and strategy). As described in the Study Design section, the CI will become smaller for larger Anew and Aold sizes, due to training data having larger overlaps of observations as more data is sampled.

hERG Classification Data Set

hERG classification was the largest data set, and it was evaluated using Observed Fuzziness (OF) as efficiency metric. OF is a confidence-independent metric, which is favorable as the desired (or required) confidence can vary between use cases. OF is calculated using the average sum of the p-values for the nontrue classes; following Vovk et al.,[40] smaller values are preferable. The results are shown in Figure , where the 95% CIs are plotted with colored ribbons. Only strategies CCPnew, CCPAT, and ICPoldnew produced well-calibrated models for all setups, whereas CCPpool produced valid models for the three smallest Aold data sets and were slightly below the desired accuracy when using all Aold data (see Supporting Information Figure S1 for all calibration plots). It was thus included, even though borderline invalid.

Figure 3

hERG classification results for all well-calibrated strategies, plotting the Observed Fuzziness (smaller values are better). The colored areas correspond to the 95% confidence intervals computed from the ten replicate runs. Note that the CCPnew strategy is independent of Aold size and will thus be the same in all four panels. The overall winning strategies are CCPpool and CCPAT, depending on the combination of Anew and Aold size, shifting from CCPpool to CCPAT when the number of compounds in Anew exceeds 10% of that of Aold. Strategies CCPnew and CCPAT were removed when using Anew only contained 50 observations, as the calibration sets became too small (10-fold CCP uses 10% of all data as the calibration set, i.e., only five observations out of two different classes, which was not allowed in the CP implementation that was used). The overall trend is that the efficiency improves as the number of training observations increases, which is expected. The ICPoldnew strategy is the worst in terms of efficiency, consistent with literature and the reason for using ACPs. CCPnew and CCPAT are very similar in terms of OF and have overlapping CIs in many cases, with only a clear separation in favor for CCPAT when including all Aold data. Analyzing the four panels jointly, the overall most efficient strategy is to use the CCPpool when having more than ten times as many compounds in the Aold assay and then start to use CCPAT (the CCPpool, CCPnew, and CCPAT overlap at 500, 2,500, and 4,300 for panels 2–4). If requiring strictly well-calibrated models, the CCPpool has to be excluded in panel 4, making CCPAT the preferred strategy in that case.

hERG Regression Data Set

The hERG regression data set was slightly smaller than the corresponding categorical data set. Efficiency was computed as the median prediction interval width at a fixed confidence of 0.8, expressed in pIC50 (negative log molar concentration). The choice of 80% confidence was based on domain and data set knowledge, as the data was known to have too much variance and noise for realistically expecting informative results at a higher confidence. For this data set and the following regression data sets, only three strategies produced well-calibrated models in all experimental setups: CCPnew, CCPAT, and ICPoldnew. The results are shown in Figure , plotting the 95% CI for the ten replicate runs.

Figure 4

hERG regression results for all valid models. Efficiency is expressed in terms of prediction interval width at a fixed confidence of 0.8; smaller values are preferable. The colored areas correspond to the 95% CI computed from the ten replicate runs. The overall winning strategy is the CCPAT, having overlapping CIs for the smaller Anew sizes but favorable when more data is used from Anew. The results are more prominent in the second row of panels, where more Aold observations are used. Similarly as for the classification data set, the ICPoldnew strategy was overall inferior in terms of efficiency. The only scenario where ICPoldnew is preferable was for the smallest Anew size, explainable by the size of the calibration set where CCPnew and CCPAT only have five observations and the ICPoldnew can use all 50 observations for calibration. The overall best strategy was CCPAT, especially in panels 3–4 with more available data from Aold. Compared to the classification setting, the CCPpool strategy was mostly invalid (see the Supporting Information), having a lower accuracy than the desired confidence. In the calibration plot in Figure S2, it looks like the CCPpool is overconservative for the smallest Aold data set and thus a valid strategy, but when reviewing the calibration for each combination of Aold and Anew size (data not shown), it was evident that valid models were only produced when Anew made up at least 20% of the total training set (i.e., the two largest Anew sizes, where CCPpool was less efficient than both CCPAT and CCPnew). Thus, the CCPpool either produced invalid models or was outperformed by other strategies.

NaV Regression Data Set

The NaV data set was the smallest data set, only containing 190 tested compounds in the new assay, possibly making it the most interesting and potentially beneficial to include additional data in the modeling. Results are shown in Figure , again using median prediction interval width at confidence 0.8 as the efficiency metric, expressed in units of pIC50. Similar to hERG, the three strategies CCPnew, CCPAT, and ICPoldnew were the only ones producing well-calibrated models at all experimental setups. The CCPpool was again mostly invalid, but for panel 1, with the least amount of Aold data, it was producing valid predictions and thus would be the preferred strategy when only having 30 compounds in Anew and then getting surpassed by the CCPnew and CCPAT (Figures S3 and S7).

Figure 5

NaV regression results for all valid models. Efficiency is expressed in terms of prediction interval width at a fixed confidence of 0.8; smaller values are preferable. The colored areas correspond to the 95% CI computed from the ten replicate runs. The ICPoldnew was the only possible strategy to use when only having 30 available compounds in Anew but was surpassed in efficiency in all other experimental setups. CCPAT and CCPnew have very similar efficiencies, and there is no clear winning strategy. The results are similar to those found for the hERG Regression Data Set section but less pronounced. CCPAT and CCPnew are in most cases overlapping in terms of efficiency, with possibly CCPAT improving the efficiency compared to CCPnew with more data in the old assay, but nothing definitive can be said. Interestingly, the ICPoldnew improves in efficiency with the inclusion of more Aold data, more so than for the hERG experiments; extrapolation of the results from all panels indicate that there could be a scenario where this strategy could become better than the other two, if more Aold data were available.

hERG Augmented Data Set

For the augmented data set, simulating a larger divergence between assay measurements by altering the labels of the Aold observations, the evaluation was performed in the same way as for the hERG regression data set experiments, and the results are shown in Figure . Comparing these results to those of the unaltered data (Figure ) makes it possible to assess the usefulness of additional data even when there is less agreement between the assays. Note that strategy CCPnew is identical to those of the unaltered experiments as it uses no augmented data.

Figure 6

hERG augmented data set results for all valid models. Efficiency is expressed in terms of prediction interval width at a fixed confidence of 0.8; smaller values are preferable. The colored areas correspond to the 95% CI computed from the ten replicate runs. The overall results are similar to those in Figure but with CCPAT performing slightly worse here. CCPAT and CCPnew strategies are mostly overlapping but with CCPAT having a small advantage in the last two panels and the two largest sizes of Anew. The results are similar to the unaltered experiments, but the improvement in efficiency of CCPAT over CCPnew seen in the previous experiment is less pronounced and only found when including 2,500 or all of the Anew data. Another interesting difference is that the ICPoldnew had an improved efficiency compared to the unaltered experiments and is arguably the best strategy when only having 50 records in Anew.

Discussion

Over time as new assays are incorporated and old ones are phased out, organizations are faced with challenges on how to maintain predictive models that are valid and as accurate as possible. If the accumulated data from old assays is small in relation to the data generation in the new assay, this might not constitute a big problem. However, if the organization has invested significant efforts in building up a knowledge base for predictive modeling based on one type of assay, it would be highly profitable to maximize the usefulness of this data after transitioning to a new assay–especially if it will take some time for the accumulated data from the new assay to reach levels where models with high accuracy can be trained. It is generally advised to perform experiments to assert assay concordance between the old and new assays before deciding and implementing a change; but once a new assay has been implemented, the problem still remains as to how new machine learning models should be trained on data from both the old and new assays. Herein, we investigated modeling strategies based on conformal prediction in order to produce valid (well-calibrated) models with the highest informational efficiency. The overall trends in the results are consistent across the four data sets, with the exception of the CCPpool modeling strategy which was producing valid models for most runs for the classification experiments (Figure ), whereas for the regression data sets, it was invalid. Out of the remaining strategies, the three modeling strategies CCPnew, CCPAT, and ICPoldnew were found to always be valid, which can be linked back to the discussion in the section Conformal Prediction, as these three strategies are the only ones that theoretically would be valid according to standard conformal prediction proofs (disregarding the potential invalidity of CCP and variances in calibration due to the finite number of test samples). These conflicting results are likely due to classification problems being typically easier to model than regression problems but nevertheless make the CCPpool a strategy necessary to evaluate as it had cases with a clear advantage over the other strategies in terms of efficiency. Between the regression experiments, the overall best strategy was to use ICPoldnew when there is an insufficient amount of data in the new assay (50 examples for hERG and 30 for NaV) to produce efficient models and then start to use either CCPnew or CCPAT. Overall, there are no scenarios where the CCPnew is preferred over the CCPAT, as their CIs either overlap or CCPAT is superior. Comparing the augmented data set experiments with the unaltered experiments demonstrates that there is less advantage of the CCPAT when there is a larger discrepancy between the assays, where CCPAT is only favorable in the scenarios with access to a great deal of data from both the old and new assays. For the classification data set, these three strategies had similar trends as for the regression data sets, but the CCPpool strategy was clearly preferable in scenarios with access to at least a 10-fold excess of Aold observations over that of Anew. Our overall objective was to demonstrate and evaluate several different scenarios of transitioning between assays and how to use data from a legacy assay with conformal prediction. The results show that the best possible usage of old data depends both on the amount of data each assay has and agreement between the assays. Figures –6 for hERG and NaV assays allow for discussing different practical scenarios relating to size of old and new assays and assay concordance. A key finding is that the best possible strategy can change while gathering more data using the new assay, and we therefore propose that the way data from the old assay is used should be evaluated continuously when data from the new assay is produced, instead of performing a single evaluation and then persisting with that strategy indefinitely. Such one-time evaluation could potentially result in a suboptimal strategy in the long term. The results in our study show that a simple strategy such as pooling old and new data might lead to invalid models, even if the assays are deemed to have enough concordance. Further, we show that training models based only on new data will lead to less efficient models until enough data has been produced. Classification is normally a simpler task than regression, and when comparing Figures and 4, we see that the value of old data and our proposed assay transition modeling strategy (CCPAT) is larger in the case of regression compared to classification. Further, comparing the results in Figure with Figure shows that the value of data from the old assay is lower when it is disturbed (hence a lower assay concordance), and the benefit of the CCPAT strategy is also lower. A benefit of using conformal prediction compared to traditional ML is that the calibration of the predictions can be monitored alongside the efficiency metrics, making it possible to discover data drifts or improper handling of data (e.g., by pooling data incorrectly). Although we point out that there is no absolute way to determine the validity of a model in a strict sense, deviation from perfect calibration can occur due to the finite number of test examples (making statistical fluctuations have an impact) and choice of predictor type. One assumption that was made in this study was that the goal is to predict the outcome of the new assay, exclusively evaluating performance on observations from the new assay. We argue that this is preferable as new experiments will only be conducted in the new setup. We also expect a natural drift in the tested compounds (i.e., exploring new regions of chemical space), making compounds tested more recently being more relevant for future projects. The evaluation will thus be more useful compared to performing the evaluation of pooled data from both assays, but we expect the validity of the strategies to be very much affected based on the testing strategy. To facilitate the large number of evaluated scenarios, minimal parameter tuning was performed, which could have some effects on the results as the SVM cost and ϵ parameters were determined using all available data which could be suboptimal for the smaller data set sizes. Another angle of investigation would be to replace the linear SVM with a more complex learning algorithm, such as an RBF kernel SVM, in those cases where it is feasible with respect to run time. This alternative strategy could lower, e.g, the benefit of using CCPAT over CCPnew in scenarios where a great deal of old data is used, making the CCPAT infeasible to train using an RBF kernel, while practically possible when only using new data. This, however, is arguably a more data set dependent analysis and thus not pursued herein but could potentially have an impact in some situations.

Conclusions

We show that it is important to continuously monitor predictive models when transitioning between assays, in order to maximize the usefulness of data from the old assay. We suggest to use conformal prediction in this transitioning process to measure both the level of calibration in addition to efficiency of models, thereby ensuring that invalid models can be identified and dismissed. We also propose a modeling strategy where data from the old assay is used to expand the proper training set of an inductive conformal predictor and where calibration is performed exclusively on data from the new assay, resulting in valid models and the best overall efficiency across all experiments.

30 in total

Review 1. Computational methods to predict drug safety liabilities.

Authors: S K Durham; G M Pearl
Journal: Curr Opin Drug Discov Devel Date: 2001-01

2. Ionworks HT: a new high-throughput electrophysiology measurement platform.

Authors: Kirk Schroeder; Brad Neagle; Derek J Trezise; Jennings Worley
Journal: J Biomol Screen Date: 2003-02

3. Optimisation and validation of a medium-throughput electrophysiology-based hERG assay using IonWorks HT.

Authors: M H Bridgland-Taylor; A C Hargreaves; A Easter; A Orme; D C Henthorn; M Ding; A M Davis; B G Small; C G Heapy; N Abi-Gerges; F Persson; I Jacobson; M Sullivan; N Albertson; T G Hammond; E Sullivan; J-P Valentin; C E Pollard
Journal: J Pharmacol Toxicol Methods Date: 2006-03-06 Impact factor: 1.950

Review 4. High-throughput in vitro profiling assays: lessons learnt from experiences at Novartis.

Authors: Bernard Faller; Jianling Wang; Alfred Zimmerlin; Leslie Bell; Jacques Hamon; Steven Whitebread; Kamal Azzaoui; Dejan Bojanic; Laszlo Urban
Journal: Expert Opin Drug Metab Toxicol Date: 2006-12 Impact factor: 4.481

5. Conformal prediction of HDAC inhibitors.

Authors: U Norinder; J J Naveja; E López-López; D Mucs; J L Medina-Franco
Journal: SAR QSAR Environ Res Date: 2019-04 Impact factor: 3.000

6. Ligand-based target prediction with signature fingerprints.

Authors: Jonathan Alvarsson; Martin Eklund; Ola Engkvist; Ola Spjuth; Lars Carlsson; Jarl E S Wikberg; Tobias Noeske
Journal: J Chem Inf Model Date: 2014-10-03 Impact factor: 4.956

Review 7. Principles of early drug discovery.

Authors: J P Hughes; S Rees; S B Kalindjian; K L Philpott
Journal: Br J Pharmacol Date: 2011-03 Impact factor: 8.739