Literature DB >> 26750305

Open Source Bayesian Models. 3. Composite Models for Prediction of Binned Responses.

Alex M Clark¹, Krishna Dole², Sean Ekins^2,3.

Abstract

Bayesian models constructed from structure-derived fingerprints have been a popular and useful method for drug discovery research when applied to bioactivity measurements that can be effectively classified as active or inactive. The results can be used to rank candidate structures according to their probability of activity, and this ranking benefits from the high degree of interpretability when structure-based fingerprints are used, making the results chemically intuitive. Besides selecting an activity threshold, building a Bayesian model is fast and requires few or no parameters or user intervention. The method also does not suffer from such acute overtraining problems as quantitative structure-activity relationships or quantitative structure-property relationships (QSAR/QSPR). This makes it an approach highly suitable for automated workflows that are independent of user expertise or prior knowledge of the training data. We now describe a new method for creating a composite group of Bayesian models to extend the method to work with multiple states, rather than just binary. Incoming activities are divided into bins, each covering a mutually exclusive range of activities. For each of these bins, a Bayesian model is created to model whether or not the compound belongs in the bin. Analyzing putative molecules using the composite model involves making a prediction for each bin and examining the relative likelihood for each assignment, for example, highest value wins. The method has been evaluated on a collection of hundreds of data sets extracted from ChEMBL v20 and validated data sets for ADME/Tox and bioactivity.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2016 PMID： 26750305 PMCID： PMC4764945 DOI： 10.1021/acs.jcim.5b00555

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Bayesian inference is a category of machine learning that has been greatly beneficial to computer-aided drug design.[1−16] One category in particular, the Laplacian-modified naïve Bayesian variant using extended connectivity fingerprints or molecular function class fingerprints of maximum diameter 6 (ECFP6 or FCFP6) derived from chemical structures has established itself as a powerful workhorse tool since it was originally popularized in Pipeline Pilot.[17−19] This Bayesian method has some key advantages over other types of model building techniques applied to 2D structures (e.g., quantitative structure/property activity relationships (QSAR/QSPR)); it is very fast,[20,21] requires little expertise, is relatively robust with regard to overtraining, and can be interpreted intuitively since model characteristics are directly related to structural features, which medicinal chemists are very well attuned to. The Laplacian modification to the conventional naïve Bayes formula solves the problem of numerical precision and skewing caused by use of thousands of priors (structure-derived fingerprints) at the cost of returning an unscaled result, which, unlike for the conventional formula, is not a probability.[18] We have previously described a simple calibration method that allows results to be interpreted as a probability-like value.[22] The most obvious drawback to using a Bayesian model rather than quantitative structure–activity relationships or quantitative structure–property relationships (QSAR/QSPR)[23−26] is that the training data inputs must be classified as one of two states (e.g., active vs inactive), and predictions return an indication of the likelihood that the molecule represented by the chemical structure is one rather than the other.[17−19] Because the kinds of biological measurements that are being used for these models generally originate as continuous data (e.g., an IC50, MIC, Ki, EC50, solubility, a value of clearance or metabolic stability, etc.), it is necessary to precede the model building step by the user selecting a threshold to partition the collection into two states. The choice of threshold can depend on a variety of circumstantial or historical factors. For instance, there may be hard scientific reasons; for an underdeveloped target where few strong inhibitors are known, a low threshold is likely to furnish a model with better predictivity, whereas if there are already a number of strong inhibitors in the training set, a high threshold may be more useful if the objective is to double-down on the structural features required to achieve very high potency. We have recently developed a new method for fully automated creation of thousands of Bayesian models using publicly available data, and for this purpose, we needed to design an algorithm for automatically detecting a suitable threshold for splitting the data set.[22] One of the first lessons we learned is that the choice of threshold has a profound effect on the quality of the model, which is consistent with intuition; if a group of structurally similar compounds have similar activities, then drawing a line through the middle will result in a model with very limited ability to resolve the two categories, whereas drawing the line so that any such clusters of related structures are on the same side of the threshold results in highly predictive models. As we experimented with ways to score proposed thresholds, we found that one of the most effective ways to ascertain the suitability of a threshold was to actually build a Bayesian model (using a diverse subset of the data to ensure scalability) and use the computed receiver operator characteristic (ROC) integral as part of the score for evaluating the suitability of the threshold. Using this approach, we were able to propose thresholds that led to very effective automated model building on a large scale.[22] Besides the need to find a threshold before building a Bayesian model, the other obvious drawback is that the result of a model prediction is a probabilistic indicator, which is in contrast to traditional QSAR/QSPR methods, for which the result is a continuous value with the same units as the training data, with an estimated error. Attempts to map the outcome of a Bayesian model (which is a floating point number) to a continuous property value generally gives poor results, which is intuitive and should be expected. The inputs are partitioned based on above/below a certain threshold, and so there is no particular reason why a model should be able to distinguish between multiple states (e.g., high vs medium vs low). We have a keen interest in expanding the scope of Bayesian models for prediction of biological properties because we have found the method to be highly valuable from a pragmatic point of view.[22,27] In particular, we have the need to provide computer-aided drug design software for use by scientists who are not computational experts or even necessarily have any insight into the nature of the data they are modeling, but are nonetheless in a position to benefit from machine learning technologies.[27] For many scenarios, structure–activity data can be effectively treated as if it were binary, such as high throughput screening results, which are classified as either hits or misses. However, this classification is often less reasonable for more thoroughly determined dose–response assays, which often require more resolution than active/inactive, although at times it is still useful, as shown by our recent success using dose response data for Mycobacterium tuberculosis Bayesian model building.[3,4,7,28−31] One alternative approach is to divide the training set not into two categories, but rather three or more, i.e., each measurement is assigned to a “bin”, which represents a range of activities. For example, a data set might be divided into four bins, defined as [<5, 5..6, 6..7, > 7]. For each of these bins, a Bayesian model can be created, for which its own partitioning is defined as does/does not belong in the bin. Evaluating a predicted test molecule would involve submitting it to each of the four Bayesian models, each of which competes for ownership; the predicted bin membership is based on assignment to the model with the highest prediction. This idea of competitive Bayesian models is not particularly novel or difficult to implement, but during its development, we found that the effectiveness of the method was profoundly affected by the choice of segmentation boundaries used to mark where one bin ends and the next one begins. As with the development of an algorithm for selecting a single threshold for a two-state Bayesian model building exercise, the selection of number of bins, population size, and boundary thresholds should be done by trying to avoid splitting up clusters of compounds that have similar structures and similar activities. Because the objective of this work has always been to provide inexperienced users with an essentially turn-key user experience that does not require preexisting knowledge of the data, we have put considerable effort into designing an algorithm that can propose a suitable binning scheme without the need for unexperienced users to engage in trial-and-error attempts to coax a good multistate model out of their data. The benefits of being able to provide a predicted bioactivity measurement as a range of values, rather than greater/less than a particular threshold value, are self-evident for many data sets. As with our previous recent work on Bayesian models,[22,27] we have made the code available to the community under the terms of the GNU Lesser General Public License, and data sets used as validation materials can be freely obtained from Github (http://github.com/cdd/bayseg) or CDD Public.[32]

Methods

Laplacian-Modified Naïve Bayesian Composite Models

The basic idea involves dividing up the ranges of activity measurements into bins, each of which is defined by a scope of possible values. For example, a training set where molecules are represented by pIC50 or pKi measurements might be divided up into four bins, as shown in Figure . Once the number of bins and the boundaries between them have been decided on, the next step is to create a Bayesian model for each bin, using the Laplacian-modified naïve Bayesian method described previously.[27] Rather than using a cutoff threshold for activity, the classification is defined as active if the molecule is in the current bin and inactive if it is not.

Figure 1

Visual example of the binning system: Four groups of molecular structures are divided according to an activity boundary. For each bin, a conventional two-state Laplacian-corrected naïve Bayesian model is constructed using in bin versus not in bin as the classifier. By making use of the calibration method that we described previously for converting the Laplacian-modified predictions into a probability-like range (i.e., most in-domain predictions are in the range 0..1),[22,27] it is meaningful to compare the prediction values for each of the bins. For a proposed molecular structure, the Bayesian model for each bin is applied. The most rudimentary interpretation is that the bin whose model provides the highest calibrated prediction value is the most probable range. During our initial experimentation with this approach, we found that the success rate for recalling the correct bin was extremely sensitive to the choice of boundary positions separating the bins. Using a convenient scheme (e.g., whole numbers), or selecting bins based on evenly sized proportions, produced results that were very inconsistent, i.e., minor changes in selection of boundaries had a major impact on the rate at which the models were able to predict the correct bin. This is an intuitively justifiable observation since the fundamental prerequisite of a database of structures and activity measurements is that certain structural fragments are correlated with positive or negative trends in activity. If a cluster of compounds with similar activity values and a similar ensemble of significant structural fragments is arbitrarily segregated by a boundary partition drawn through the cluster, one would expect poor results. Consider the schematic shown in Figure ; the graph shows a histogram of frequency plotted against activity, for which the compounds are divided into three groups. Representative structures from each of these groups are shown, and each can be clearly seen to have their own distinctive structural characteristics. In this contrived example, the solution is obvious; the data set should be partitioned into three bins, and the boundaries denoted by the heavy green lines show the optimum partitioning. If one were to choose a partitioning scheme that balances the bin sizes evenly, creating three equally proportioned bins would cause the partitions to be drawn through clusters of distinct structure–activity groups. Proceeding with this suboptimal partitioning would be expected to lead to generally poor results because the ability of any model to distinguish between two structures that are very similar but have been classified differently due to miniscule differences in activity will be poor. Similarly unsophisticated approaches, such as setting boundaries by using round numbers (e.g., [4, 8] or [3, 6, 9]), leads to inconsistent results for the same reason.

Figure 2

Demonstration of separation of structure activity groups by population integral (red line) or by boundaries between different categories of structure–activity relationships (green line).

Demonstration of separation of structure activity groups by population integral (red line) or by boundaries between different categories of structure–activity relationships (green line). To solve this problem it is necessary to compose a method for identifying suitable boundary points that distinguish between clusters of structure–activity relationships. In general, this is not straightforward since model building exercises can be expected to encounter widely diverging scenarios. In some cases, clusters of structures with similar composition and similar activity are encountered, i.e., as shown in Figure , but many data sets have much more amorphous relationships between structural fragments and activity. There are of course many real world experimental data sets that have essentially no structure–activity relationship, meaning that a treatment with structure-derived fingerprints like ECFP6 is unlikely to elucidate any correlation better than noise. As mentioned in the introduction, a further constraint of the requirements for the method is that it has to function as a “black box” in that the user provides a collection of structures and activities and is not required to provide any further parameters or even have any prior knowledge of the content or whether it is even suitable for modeling. The method we have developed involves proposing a series of candidate cut points that might be used to divide bins, and each of them is scored by criteria such as the relative sizes of the new bins that it would create and how well a test-model would be able to predict which side of the threshold the newly separated molecules reside. The scores are made up of three components: (1) ROC integral from the trial Bayesian model, (2) second derivative of the activity population, and (3) ratio of actives vs inactives. Each of these terms are used in a way that encourages splitting the data set at points that distinctively segregate boundaries between structure–activity groups. The method is iterative and greedy, starting by dividing the data into two bins, and continuing on to subdivide bins until no more favorable opportunities exist. Detection of the partitioning boundaries is performed using the following steps: In this way, each of the putative boundaries (cut points) that has a reasonable balance of entries above and below is recorded and assigned an initial desirability score. The algorithms referred to in bold are summarized only briefly and can be examined in more detail in the source code (http://github.com/cdd/bayseg): greedy_linear_clustering: If the list of compounds exceeds a maximum count, the ECFP6 fingerprints are used to reduce the size to the given count in a way that maximizes diversity of structural features and diversity of activity values, while also running in O(N) time for large data sets. all_interstitial_midpoints: The list of activity values is sorted, and for each adjacent pair of nonequal values, the midpoint is retained. sample_bayesian_ROC: A Bayesian model is created from the list of compounds where activity is based on being above/below the putative threshold, and the receiver-operator-characteristic (ROC) integral is returned; this is fast because fingerprints are already calculated and the collection is guaranteed to be small. plot_gaussian_intensity: An array of values is defined, which represents an evenly sampled distribution from lowest to highest activity, plus an edge buffer; for each value, a Gaussian distribution is added to the entire array, resulting in a smooth/smeared out frequency histogram. numerical_2nd_derivative: The intensity array is differentiated twice numerically by setting each value to the difference between its two neighbors. After these scores are calculated, the cut point that is most desirable (lowest score) is recorded and used as the first partition. The remaining cut points are sorted and enumerated, and this list is used in the following section, which iteratively adds additional partition boundaries: The iterative addition of new partition boundaries stops when the process runs out of candidates or the maximum limit is encountered. The iterative addition is greedy; at each step, it attempts to add a new cut point in the midst of an existing partition in a way that best separates the partition, from a model building point of view. Each iteration involves going through all of the candidates and all of the bins, which is in itself O(N2), but since the number of bins is small and capped and the list of candidates is initially capped and consists of only cut points that were considered desirable for applying to the data overall, the total number of evaluations is never more than hundreds, depending on the parameters and the properties of the incoming data set. For large data sets, the process of creating a Bayesian model for each iteration is kept brief by using the greedy linear clustering method, which is fast and O(N). As mentioned in the Introduction, the process of selecting good partition boundaries is by far the most difficult part of the method since the remaining steps can be composed from existing methods for working with Bayesian models. The objective is to create a model for each bin such that for any proposed compound the model that gives the highest prediction is that which corresponds to the range of activities that includes the actual activity for the compound. In our previous article,[22] we described a method for post-calibrating the Laplacian-corrected naïve Bayesian models that are effective for structure-derived fingerprint-based models, so that the results of each prediction are probability-like. This calibration is done by analyzing the ROC curve and is used to transform the numeric results so that most predictions within the domain of the model fall between the range 0..1, which means that they can be independently interpreted. It also means that these calibrated results can be compared between models, and therefore, it is reasonable to expect that for two mutually exclusive models the one that produces the higher value can be considered the leading contender. The extent to which this hypothesis holds true is described in the Results section. For conventional binary Bayesian models, the method of choice for evaluating structure–activity relationships is the venerable receiver-operator-characteristic (ROC) curve, which can be neatly summarized by its integral. This does not apply to the composite models, and instead, a different set of visual and quantitative metrics is needed. The primary objective is for all models to predict the correct bin by ensuring that its corresponding model produces the highest value, and so the fraction of the time for which compounds in a training set accomplish this is relevant, relative to the chance of guessing randomly. For example, a composite model with five bins might presume that a random guess would be correct 20% of the time, and hence, if the model predicts the correct bin 60% of the time, this represents a 3× improvement. For simplicity purposes, it is assumed that no prior knowledge of the bin distribution of the training set is available, and hence, each bin is equally likely to be guessed correctly. It is also assumed that the bin sizes are approximately the same in terms of membership frequency, which is reasonable since the detection method ensures that the training set is partitioned so that no one bin represents a disproportionate fraction of the training set. For increasingly large bin counts, the consolation prize for predicting the adjacent bin (off-by-1) becomes more useful, and so it is relevant to provide more information than just the “direct hits”. Figure shows some of the results obtained by evaluating the test set for a composite model of aminopeptidase-N inhibitor candidates (extracted from ChEMBL v20[22,33,34]). The response matrix for the test set is shown in (a); as can be seen quickly with the chosen visual style, the majority of predicted responses are concentrated along the diagonal, which corresponds to correct prediction of bins. The off-diagonal cells correspond to predictions that were incorrect. The table in (b) provides summary details; the first row indicates that 55.4% of the test set entries were predicted to be in the correct bin. Since there are seven bins, a random guess would have a 14.3% chance of being correct if an even distribution is assumed, and so the composite model provides an enrichment factor of 3.88. The next row in the table expands the definition of success to include results that were predicted to be in the adjacent bin, for which 73.8% of results qualify, relative to a random chance of 38.8%. Subsequent rows in the table converge to 100% as the bar is lowered.

Figure 3

Example of a data set containing activities against aminopeptidase N-protein with 651 measurements, of which 65 were reserved for use in the training set. The correctness matrix is shown in (a), while (b) shows enrichment statistics, and (c–i) show examples of molecular predictions from the test set. On the right-hand side of Figure are a number of visual representations of individual predictions for selected molecules from the test set. In each case a series of seven bars are shown, and these are assigned height and color-coding in proportion to the calibrated prediction of the underlying model. The highest (and most green-tinted) bar is taken to be the predicted bin of choice for bulk evaluation purposes, but for evaluation of a small number of prospective molecules, it is useful to provide all of this information in a visually accessible way, so that the user can make a more informed judgment about the value of the model prediction. The most definitive example is shown in Figure c, for which a single bin corresponds to a very high prediction, and the other six are essentially zero. As it happens, the molecular structure contains a hydroxyamide functional group, which is unique within the training and test sets and represents an overwhelmingly dominant structure–activity trend. The example in Figure d is quite definitive, with one prediction—the correct one—standing out, but with nonzero predictions for several other bins. The cases shown for Figure e–g all predict the correct bin, but there are patterns of competing structure–activity relationship that make other options quite plausible and hence might be thought of skeptically. For case Figure h, the correct result is marked by a red dot above the bar, which does not correspond to the highest prediction; however, it can be seen that the model is working quite well and hit the adjacent bin by a moderate margin. Case Figure i, on the other hand, shows a poor result, where the actual activity is in the lowest category, but the intermediate level models showed quite high predictions. Figure shows three response matrices for selected data sets (Caspase3, Death kinase, and dihydrofolate reductase), also extracted from ChEMBL v20. In Figure a, the Caspase3 distribution was resolved into only three distinct bins. While the models performed well and the correct prediction rate for the test set was 75.5%, the enrichment rate is only 2.27 since the chance of randomly guessing the correct bin is 1 in 3. The other two examples, Figure b and c, were divided into six bins, and both indicate a respectable success rate. However, the rate of correct prediction for each of the bins is quite variable. While there is significant opportunity for random error when using a small test set, an abnormally low success rate for a particular bin tends to be indicative of the inability to extract a meaningful structure–activity relationship for the compounds within the bin using Bayesian methods and ECFP6 fingerprints.

Figure 4

Three examples of recall rates: (a) Caspase (980 rows), (b) Death kinase (926 rows), and (c) dihydrofolate reductase (1056 rows). In each case, 10% of the structures were retained for use as the testing set.

Results

In order to determine the boundaries of performance with composite models using real data, we applied the method to a selection of 1843 data sets, extracted from ChEMBL v 20,[33,34] using a method similar to that which we described previously.[22] Each of the models contained at least 100 structure–activity data points assigned to the same target and consisted of either Ki or IC50 type measurements but not both. The validation data sets were not filtered in any way by the presence or absence of an actual structure–activity relationship, and so it can be expected that some proportion of the data sets are simply not suited to modeling. We consider it reasonable to operate under the assumption that the data sets extracted from ChEMBL are representative of the kinds of real world drug discovery scenarios for which this method will be used. The collection of structures and activities that were used for this validation exercise can be downloaded from http://www.collaborativedrug.com/composite-bayes.[32] For each data set, 10% of the entries were set aside to use as the test set. These entries were selected using the greedy clustering algorithm described earlier, which means that in general the choice of test set is nondiabolical and falls within the same domain as the training set (although it should be noted that when we repeated the experiment with a random selection of testing sets there was no bias in favor of preclustering). Data sets for which the partition detection method was not able to detect at least three bins were left out of the results. Figure shows the distribution of results, where the percentage of successful prediction of compounds within the testing set is plotted against the size of the data set. The first row shows the enrichment rate of correct detection for each of the bin sizes from 3 through 8. In each case, a horizontal purple line shows an enrichment rate of 1, which corresponds to no better than random. The average and standard deviation is indicated to the left of the vertical axis. As shown, the large majority of data sets demonstrate substantial predictive power relative to random guessing. As the size of the data set grows, the enrichment rate converges toward the overall average, which indicates that larger data sets are less prone to random fluctuations.

Figure 5

Unsupervised model building based on 1843 data sets extracted from ChEMBL v20. Results are divided into bin sizes (columns). Each point corresponds to the ratio of correctly predicted bins versus chance of random guessing (enrichment), with a purple line indicating the null hypothesis. The average and standard deviation are marked on the Y-axis. Training set size is shown on the X-axis. The testing sets were made up of 10% of each total data set. The second row of graphs in Figure shows the enrichment rate for predictions that indicated either the correct bin or a bin adjacent to the correct bin as the highest performer. While the chance of meeting this criterion by random guessing is increased, especially for smaller bins, the same overall trend is observed. Three data sets were selected for further study: solubility (log S),[35] mouse epoxide hydroxylase (mouse, pIC50),[32,36−39] and Chagas disease (pIC50).[40] In order to establish that the composite models are adding predictive value relative to the much simpler method of correlating the raw prediction values from a single Bayesian model, each of these data sets was partitioned into equally sized training and testing sets (using the greedy linear clustering method, as described above) with the threshold set to the median activity. Figure shows each of the predictions plotted against the actual value, and in each case, the correlation is very poor. This is to be expected since the two-state Bayesian model is quite simply not provided with enough information to distinguish between intermediate predictions.

Figure 6

Three examples of correlating raw Laplacian-modified naïve Bayesian predictions with activity: (a) solubility (test set = 649, R2 = 0.513), (b) mouse epoxide hydroxylase (test set = 328, R2 = 0.582), and (c) activity against Chagas disease (test set = 371, R2 = 0.393). The results for running the composite model method on three selected data sets are shown in Figure . The axis positions represent actual vs predicted bins, whereby each occupancy along the diagonal represents an instance where the correct bin had the highest prediction. The grid sizes are scaled so that the area of each cell on the diagonal is proportional to the actual population of the bin. Off-diagonal occupancies show how far and how often the strongest prediction strayed from the correct value. Each cell is color-coded using a shade that is scaled according to the average actual population of the corresponding bins, i.e., diagonal cells are darkest if there are no incorrect predictions, and a pair of off-diagonal cells (i, j) would be darkest if all predictions for either i or j were incorrectly transposed.

Figure 7

Analysis for three data sets: aqueous solubility, mouse hydrolase epoxide, and Chagas disease. The correctness matrix and enrichment statistics results for the training sets (a, c, e) are shown on the top; test sets are shown below (b, d, f). The analyses for using the training set data are shown in Figure a, c, and e, while the test set predictions are shown in Figure b, d, and f. As would be expected, the results for the training set have a much higher recall rate. Nonetheless, the testing set predictions demonstrate substantial enrichment relative to random, and in each case, the correct result ± off by 1 portion is greater than 50%. The effectiveness of the method is largely a function of the extent to which the data set is organized into islands of structure–activity: clusters of molecules with similar structure characteristics and activity values within a distinctive range. The presence of molecules with similar structural features and a wide range of activities spanning multiple groups degrades the ability of the method to prioritize the correct bin as the primary choice, but as shown in Figure , the probabilistic results can at least favor a subset of all the available options. The underlying fingerprints (ECFP6) are designed to statistically express explicit structural features, which has the advantage of being highly interpretable by chemists examining the structures, but it should be noted that this is not necessarily the most effective way to capture more abstract features like shape, size, polarity, etc. The ability to resolve effects like bioisosterism is only possible if the applicable patterns are present in the training set. It is important to think of the composite Bayesian model approach as a solution to a particular use case, rather than an alternative to methods such as QSAR/QSPR[41−44] modeling of continuous properties, which it is not intended to displace. As mentioned in the Introduction, the requirements for the method are that the user possess no expertise in modeling, has no preexisting understanding of the structure–activity trends, and is not required to perform any analysis or refinement of the model. The method is required to operate with the user simply uploading a collection of structures and activities and from this provide a way to evaluate proposed structures and gain insight that is qualitatively useful.

Discussion

This study extends the preceding work describing the development of open source Bayesian models.[22,27] In addition, it complements earlier efforts to partition training or test sets.[22] For example, our recent work on microsomal stability in mouse demonstrated improved binary Bayesian models by “pruning” out the moderately unstable/moderately stable compounds from the training set.[45] Earlier work had used a support vector machine to perform novelty detection and margin detection to remove uncertain predictions from models using Kernel–PLS.[46] These represent attempts to refine the way we approach using data for test or training sets with Bayesian and other machine learning models. We have now extended the use case scenario for the popular Laplacian-modified naïve Bayesian method based on ECFP6 fingerprints to include predictions for data sets that are more appropriately divided into multiple categories rather than just two states. In doing so, we have preserved some of the most desirable properties of this Bayesian method, such as requiring little or no expertise on behalf of the user, being quite robust to the effects of over/under-training, producing intuitive results that are closely related to the underlying structure activity relationship, rapid performance on modest computing hardware, and being constructed entirely of open source and easily portable algorithms. We have also made a point of evaluating the method on a huge collection of curated data sets (extracted from ChEMBL v20[33,34,47]), which is intended to simulate the diversity of data encountered in drug discovery research. Whereby, some collections have a strong and easily identifiable correlation between structural features and activity, while other collections have very little correlation, for a variety of possible reasons (e.g., multiple binding modes, multiple targets, complex biology, low data quality, noise, frequent hitters,[48−50] aggregators,[51−55] PAINS,[56] etc.). These new models can be downloaded at http://www.collaborativedrug.com/composite-bayes. The large majority of data sets demonstrate substantial predictive power relative to random. As the size of the data set grows, the enrichment rate converges toward the overall average (as is demonstrated in Figure ), which suggests that larger data sets are less prone to random fluctuations. Three additional data sets were studied with the composite Bayesian model approach: solubility,[35] epoxide hydrolase,[32,36−39] and Chagas disease.[40] While conventional two-state Bayesian modeling can be used to some effect on these data sets (as we have previously described[22,27,40]), the classification of all molecules into two categories is a blunt instrument and is generally ineffective at predicting intermediate responses. The process of bin assignment followed by composite model generation provides more degrees of freedom for structure features to be associated with finer grained activity levels. Whether this is the appropriate method for the data set, i.e., multiple groups of structure–activity correlations actually exist, can be evaluated by the quality of the metrics returned by the method. The enrichment value for correctly predicted bins from within the training set can also be used as a top level indication of model quality, much like ROC values are used for conventional Bayesian models. While they do not tell the whole story, they do convey a significant amount of information about the likely effectiveness of the model building exercise, and low values definitively indicate poor results from this modeling technique. Machine learning approaches have found wide applications in numerous areas such as genetics and genomics[57] (e.g., predicting multiple cancer classes[58]). The Bayesian classifier approach has also been used widely in cheminformatics for target prediction where there are multiple classes/targets.[59−64] The composite Bayesian model approach that we have described should be considered as an extension of the two-state Bayesian method rather than a replacement for prediction of continuous properties by conventional QSAR/QSPR methods, as the outcome is probabilistic in nature rather than an attempt to simulate an experimental measurement, complete with error bars. The composite Bayesian method described has been made available as open source (http://github.com/cdd/bayseg) along with the corresponding validation materials. We are currently creating a user interface that fits within the CDD Models(27) extension to the CDD Vault(65,66) service, and we intend to bring this additional form of structure–activity modeling to nonexpert users in this commercial software, much as we did when the binary Bayesian modeling was implemented in the same software. Because the composite Bayesian method is designed for unsupervised use, we will be able to design the functionality in such a way that the user is taken directly to the results of the model building rather than having to setup various model parameters and iteratively examine their effectiveness. We will also apply variations of the methods described in this article for making an automated determination of whether a data set is more appropriate for binary Bayesian modeling or for the multistate composite Bayesian method. The composite model technique that we have described is comprised of two distinct steps: selecting the partitions and modeling them competitively. While we have applied the same set of Bayesian/ECFP6 technologies to solve both of these problems, it may well be productive to explore other techniques for scoring molecules for membership within these bins (such as other machine learning methods). The use of sampled Bayesian models to detect boundary thresholds and separate groups of structure–activity relationships is effective and has desirable properties (e.g., performance, unit agnostic), but methods such as QSAR/QSPR may turn out to be more effective for scoring the resulting bins. In future work, we intend to investigate such methods, as it may well be possible to deliver improved predictions while still adhering to the same use-case constraints, namely, zero user input. In conclusion, while much has been published on the use of Bayesian models in cheminformatics,[67] we have now developed an approach (which to our knowledge has not been addressed before) that may extend them further, making them potentially more useful, when further granularity is required. The advantages of this approach are that it is fast, designed for use without operator intervention, easily implemented, and open source.

62 in total

1. A common mechanism underlying promiscuous inhibitors from virtual and high-throughput screening.

Authors: Susan L McGovern; Emilia Caselli; Nikolaus Grigorieff; Brian K Shoichet
Journal: J Med Chem Date: 2002-04-11 Impact factor: 7.446

2. Global mapping of pharmacological space.

Authors: Gaia V Paolini; Richard H B Shapland; Willem P van Hoorn; Jonathan S Mason; Andrew L Hopkins
Journal: Nat Biotechnol Date: 2006-07 Impact factor: 54.908

3. Analysis of pharmacology data and the prediction of adverse drug reactions and off-target effects from chemical structure.

Authors: Andreas Bender; Josef Scheiber; Meir Glick; John W Davies; Kamal Azzaoui; Jacques Hamon; Laszlo Urban; Steven Whitebread; Jeremy L Jenkins
Journal: ChemMedChem Date: 2007-06 Impact factor: 3.466

4. Use of ligand based models for protein domains to predict novel molecular targets and applications to triage affinity chromatography data.

Authors: Andreas Bender; Dmitri Mikhailov; Meir Glick; Josef Scheiber; John W Davies; Stephen Cleaver; Stephen Marshall; John A Tallarico; Edmund Harrington; Ivan Cornella-Taracido; Jeremy L Jenkins
Journal: J Proteome Res Date: 2009-05 Impact factor: 4.466

5. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays.

Authors: Jonathan B Baell; Georgina A Holloway
Journal: J Med Chem Date: 2010-04-08 Impact factor: 7.446

6. Open Source Bayesian Models. 2. Mining a "Big Dataset" To Create and Validate Models with ChEMBL.

Authors: Alex M Clark; Sean Ekins
Journal: J Chem Inf Model Date: 2015-06-03 Impact factor: 4.956

7. A collaborative database and computational models for tuberculosis drug discovery.

Authors: Sean Ekins; Justin Bradford; Krishna Dole; Anna Spektor; Kellan Gregory; David Blondeau; Moses Hohman; Barry A Bunin
Journal: Mol Biosyst Date: 2010-02-09

8. A predictive ligand-based Bayesian model for human drug-induced liver injury.

Authors: Sean Ekins; Antony J Williams; Jinghai J Xu
Journal: Drug Metab Dispos Date: 2010-09-15 Impact factor: 3.922

9. Computational models for drug inhibition of the human apical sodium-dependent bile acid transporter.

Authors: Xiaowan Zheng; Sean Ekins; Jean-Pierre Raufman; James E Polli
Journal: Mol Pharm Date: 2009 Sep-Oct Impact factor: 4.939

10. Development and use of hydrophobic surface area (HSA) descriptors for computer-assisted quantitative structure-activity and structure-property relationship studies.

Authors: David T Stanton; Brian E Mattioni; James J Knittel; Peter C Jurs
Journal: J Chem Inf Comput Sci Date: 2004 May-Jun

7 in total

7. Machine Learning Model Analysis and Data Visualization with Small Molecules Tested in a Mouse Model of Mycobacterium tuberculosis Infection (2014-2015).

Authors: Sean Ekins; Alexander L Perryman; Alex M Clark; Robert C Reynolds; Joel S Freundlich
Journal: J Chem Inf Model Date: 2016-07-01 Impact factor: 4.956