Literature DB >> 30718737

OptiPharm: An evolutionary algorithm to compare shape similarity.

S Puertas-Martín1,2, J L Redondo3, P M Ortigosa3, H Pérez-Sánchez4.   

Abstract

Virtual Screening (VS) methods can drastically accelerate global drug discovery processes. Among the most widely used VS approaches, Shape Similarity Methods compare in detail the global shape of a query molecule against a large database of potential drug compounds. Even so, the databases are so enormously large that, in order to save time, the current VS methods are not exhaustive, but they are mainly local optimizers that can easily be entrapped in local optima. It means that they discard promising compounds or yield erroneous signals. In this work, we propose the use of efficient global optimization techniques, as a way to increase the quality of the provided solutions. In particular, we introduce OptiPharm, which is a parameterizable metaheuristic that improves prediction accuracy and offers greater computational performance than WEGA, a Gaussian-based shape similarity method. OptiPharm includes mechanisms to balance between exploration and exploitation to quickly identify regions in the search space with high-quality solutions and avoid wasting time in non-promising areas. OptiPharm is available upon request via email.

Entities:  

Mesh:

Substances:

Year:  2019        PMID: 30718737      PMCID: PMC6361934          DOI: 10.1038/s41598-018-37908-6

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

The discovery of new drugs is a very expensive process, frequently taking around 15 years with success rates that are usually very low[1,2]. Many experimental approaches have been used for discovering new compounds with the desired pharmacological properties, ranging from traditional medicine[3,4] to High Throughput Screening (HTS) infrastructures[5,6]. The latter is mostly used by the Pharma Industry, but little by academic research groups; in other words, its application is not widespread outside the industrial domain. In order to avoid these limitations, new techniques based on principles of Physics and Chemistry were developed about three or four decades ago for the computer simulation (mainly using high-performance computing architectures) of systems of biological relevance[7,8]. Computational chemistry was later applied for processing large compound databases, and also for predicting their bioactivity or other relevant pharmacologic properties. Using this approach, it was shown that it was possible to use such computational methodology to pre-filter compound databases into much smaller subsets of compounds that could be characterized experimentally. This idea was named Virtual Screening (VS), and it reduces the time needed and expenses involved when working on drug discovery campaigns[9,10]. Nonetheless, the accuracy of the predictions made with VS methods still needs to be improved to avoid discarding promising compounds or providing erroneous signals and the time needed for their calculations still needs to be reduced. The inaccuracies in the predictions of VS methods are mostly due to the simplifications used in their scoring functions[11]. VS methods can be divided into Structure-Based (SBVS) and Ligand-Based (LBVS) methods. When the structure of the protein target is known, SBVS can be applied, and methods such as molecular docking[12] and Molecular Dynamics[13] are employed. But the number of already resolved crystallographic structures is still insufficient[14], so SBVS methods cannot always be applied. Another option is to use LBVS methods, where only data about known compounds with desired properties are used to derive new improved ones. In practice, whether SBVS or LBVS methods should be used, or even both at the same time, will depend on the specific drug discovery project. This study focuses on LBVS methods, which can be divided into several categories[15] such as pharmacophore methods[16,17], shape similarity methods (SSM)[18], QSAR[19], Machine Learning[20], atom-based clique-matching such as SQ/SQW[21] and Lisica[22], property-based (USR[23]) or atom distribution triplet based (Phase-Shape[24]). In SSM, a large database of compounds is processed against a molecular query, to provide information concerning which of the molecules from the database is geometrically similar, in terms of global molecular shape, to the input molecule used. Indeed, different strategies exist for shape calculation. One of the most widely used is the Gaussian[25] model. Tools such as ROCS[26], WEGA[27], SHAFTS[28] and Shape-IT[29] use it. The main differences between SSM reside in the accuracy of the predictions. It has been demonstrated that, depending on the compound dataset, some methods perform better than others[30], but there is currently no one-size-fits-all approach that can be considered first choice for any molecular dataset. Besides, the computational time needed for the calculations is also of the utmost importance. Among the previously commented SSM methods, we consider WEGA to be the state of the art in terms of accuracy of the predictions, while ROCS is considered to be the state of the art in terms of computational speed. For achieving such performance, ROCS introduced a number of drastic short-cuts for efficiency for computing overlap volumes between molecules[31]. For instance, all hydrogen atoms are ignored as they make very small contribution for the overall molecular shape, and all heavy atoms are set with equal radii. Besides, the most critical simplification in ROCS is that the shape density function of each molecule contains only the first-order terms, and all higher order terms in the original Gaussian approach[25] are omitted. This significantly simplified ROCS computations but also received criticism for the inaccuracy of this approximation[23]; mainly that the molecular volumes are significantly overestimated. And since the Gaussian shape algorithms are widely used in various VS methods, it is important to avoid errors introduced to the shape similarity calculation due to this overestimation of the volumes. WEGA was the first method that partially solved some of these ROCS issues by avoiding the use of only first-order terms and incorporating more terms, at the expenses of increasing computation costs, but increasing accuracy of the calculations, which is desirable in the drug discovery context. In this work, we introduce a novel SSM method named OptiPharm, which introduces a new optimization scheme that can be adapted through extensive parameterization to relevant features of molecular datasets, such as average size, shape, etc. In other words, OptiPharm is an evolutionary method for global optimization, which can be parametrized to different aims. SSM methods with extensive parameterization at the search level have not been practically explored in the VS context. The most of techniques are local optimizers which do not sufficiently explore the search space. As the results we later show, making an effort to deeply explore the whole search space can be of a great interest to increase hit rates in drug discovery projects.

Method

This section describes the main idea behind shape similarity calculations and its application in the drug discovery process using the new OptiPharm method. Next, the optimization algorithm used in similarity screening calculations is presented and, finally, the benchmarks used in this study are explained in detail.

Shape Similarity

The similarity score between molecules A and B is computed as the overlapping volume of their atoms. In particular, to compare the results obtained by OptiPharm with those achieved by WEGA, the similarity function is implemented as in WEGA[27]. For the sake of completeness, this function is written in the following form:where w and w are weights associated with the atoms i and j, respectively. This weight is calculated by solving the following mathematical expression:where k is a universal constant, which is set to 0.8665, v is the volume of atom i, whose value is computed as , similarly to how it was done in the original work of WEGA[27]. Finally, the overlapping is represented as a product of Gaussian representations:where p is a parameter that controls the softness of the Gaussian spheres, i.e., the height of the original Gauss function, and σ is the radius of the atom. More precisely, the radius represents the well-known van der Waals radius. The values associated to those two parameters are obtained by empirical knowledge. For the problem under consideration, the same figures proposed in WEGA[27] are considered. Notice that the score obtained from Equation 1 depends on the number of atoms of the two compared molecules, i.e., the higher this number, the longer the value of . In reality, it lies in the interval [0, +inf). To be able to measure the grade of similarity between compounds, independently of the number of atoms that compose them, the Tanimoto Similarity[32] value is computed:where V and V is the self-overlap volume of molecules A and B, respectively. It has a value in the range , where 0 means there is no overlapping, and 1 means the shape densities of both molecules are the same.

Previous approaches

WEGA is a local optimizer conceived to maximize the overlapping between two molecules A and B, given as input parameters. To direct the search, it computes the derivate of the objective function Tc, which specifically considers Equation 1. It means that WEGA can be only applied when the similarity of two molecules is measured by means of such an equation. WEGA starts the search with an initial solution and moves it from neighbor to neighbor as long as possible while increasing the objective function value. The main advantage of WEGA is its ability to find a solution in a sufficiently short period of time. On the contrary, its main drawback is its difficulty to escape from local optima where the search cannot find any further neighbor solution that improves the objective function value, i.e., the quality of the final solution closely depends on the considered starting ligand pose, obtained from the conformation of the molecular query. To deal with this drawback and to increase its probability of success, WEGA considers more than a single starting point. More precisely, it applies the local optimizer from four different poses. The first one is obtained by aligning and centering the two input molecules at the origin of the coordinates. The remaining ones are obtained by rotating the first one 180 grades at each axis[27]. The interested reader can revise literature[33-35] for the research progress of WEGA algorithm and some of its applications. In this work, we consider that it is possible to find a better trade-off between quality of the solution and computing time.

Optimization algorithm

OptiPharm is an evolutionary global optimizer, available upon request. It can be considered a general-purpose algorithm, in the sense that it can be used to solve any optimization problem that involves the computation of the similarity of two compounds given as input parameters. In other words, it is independent of the objective function used to measure the similarity between two given molecules. Nevertheless, in this work, its performance is illustrated by solving a maximization problem which consists on finding the solution which maximizes the Tc function previously defined. OptiPharm is a global optimization method in the sense that it makes an effort to analyze the whole search space looking for promising areas where the local and global optima can be. In other words, instead of focusing on a set of pre-specified starting points, as WEGA does, it applies procedures to find promising subareas of the search space, which will be deeper analyzed during the optimization procedure. OptiPharm applies procedures based on species evolution to gradually adjust one of the molecules (the query) to the other one (the target), which remain fixed during the optimization procedure. A solution represents the rotation and translation to be accomplished by the query. More precisely, is a quaternion of the form  = (θ, , , Δ), where θ is the rotation angle to be carried out over a rotation edge defined by the points  = (x1, y1, z1) and  = (x2, y2, z2), and Δ = (Δx, Δy, Δz) represents a displacement vector. It should be borne in mind throughout that a quaternion indicates the rotation and the translation applied to the variable molecule from its initial state. The parameters associated to a quaternion are bounded. Since each pair of input compounds can have different sizes, the corresponding limits are dynamically computed by OptiPharm, for each particular instance. To do so, the 3D boxes containing the input compounds are calculated. Then, the bound values for both and are set to the borders of the box containing the variable molecule. Notice that the same axis can be given by an infinite number of two coordinates. In this way, redundancy is prevented, which is very important from an optimization point of view, since exploring the same solutions several times makes the algorithm inefficient. The interval of Δ is set to [−, ], being the maximum difference between the boxes. This avoids the evaluation of situations where no overlapping exists between molecules, and the similarity between them is clearly zero (see Fig. 1). Finally, the angle θ is always set in the interval [0, 2π], independently of the compounds considered as input parameters.
Figure 1

The correct bounding of the parameter Δ prevents the evaluation of poor quality solutions, such as that considered in this figure, where no overlapping exists and hence the shape similarity of both molecules is equal to zero.

The correct bounding of the parameter Δ prevents the evaluation of poor quality solutions, such as that considered in this figure, where no overlapping exists and hence the shape similarity of both molecules is equal to zero. The bound values of the quaternion components define a multidimensional search space (or feasible region) with multiple local and global optima. OptiPharm is a new metaheuristic for global optimization. OptiPharm includes mechanisms to detect promising subareas of the search space and to discard those in which no global optima are expected to be found. In other words, instead of focusing on some fixed starting solutions, OptiPharm attempts to detect new ones which have the potential to become local or global optima. To do so, OptiPharm initially works on a set of M solutions (quaternions), called population. The quaternions can be considered as independent starting points on which OptiPharm applies reproduction procedures based on natural evolution. The term independent signifies that a point has the ability to discover new promising poses (in this work we use the concept of pose as rigid body rotations and translations obtained from starting conformation of query compound) without the participation of the rest of the population. As a consequence, offsprings of new promising solutions can appear. Then, from among all the existing poses, the best M solutions will be promoted to the next stage, where they are improved by means of a local optimizer. This reproduction-replacement-improvement sequence is repeated until a number of iterations t is achieved (see Fig. 2).
Figure 2

OptiPharm algorithm structure.

OptiPharm algorithm structure. But the real strength of OptiPharm lies on the concept of radius: each solution in the population has an associated radius value, which determines a multidimensional subarea of the search space. It can be understood as a window, where the reproduction and improvement methods are applied. The radius associated to a pose depends on the iteration i where it has been discovered. More precisely, the radius R of a new point, found during the reproduction procedure at iteration i, comes from an exponential function that decreases as the index level (cycles or generations) increases, and which depends on the initial domain landscape (the radius at the first level, R1) and the radius of the smallest candidate solution , which is given as input parameter. This radius mechanism, designed as a balance between exploration and exploitation, is inherited from UEGO, a general optimization method widely used in the literature with promising results[36]. During the execution of OptiPharm, several candidate solutions with different radii can coexist simultaneously which means that the method is able to analyze both big and small subregions at the same stage of the optimization procedure as it looks for valuable new solutions (see Fig. 3).
Figure 3

Several solutions with different radii can coexist simultaneously. Therefore, at the same stage of the optimization procedure, new promising regions are systematic analyzed, while others are examined thoroughly. This figure illustrates an example for a 2-dimensional case.

Several solutions with different radii can coexist simultaneously. Therefore, at the same stage of the optimization procedure, new promising regions are systematic analyzed, while others are examined thoroughly. This figure illustrates an example for a 2-dimensional case. Apart from the maximum number of starting solutions M, the number of iterations t and the smallest radius value , OptiPharm has another input given parameter: the maximum number of function evaluations for the whole optimization procedure, N. These function evaluations are distributed among the candidate solutions at each iteration, in such a way that each one has a budget to generate new solutions and to improve them. These budgets are mathematically computed by means of equations that depend on the previously mentioned input parameters. Again, this idea has been borrowed from UEGO[36]. In a previous work[37] the effects of the different parameters of UEGO and, hence, of OptiPharm were analyzed. Moreover, some guidelines to fine-tune the parameters depending on the problem to be solved were also proposed. Finally, it should be noted that, unlike most heuristics in the literature, the termination criteria of OptiPharm is not based on the number of function evaluations N, but on the number of iterations t. This point is important since the number of function evaluations consumed by OptiPharm depends on the particular case being solved. In other words, OptiPharm adapts itself to the complexity of the problem considered. In the following subsubsections, the key stages of OptiPharm are explained.

Initialization method

In the initialization phase, the two input molecules are aligned and centered at the origin of the coordinates (see Fig. 4). Then, from this initial situation, a population of M poses is composed. The first pose represents this initial stage, i.e. the former candidate solution will be equal to  = (θ, , , Δ) = (0, (0, 0, 0), (0, 0, 0), (0, 0, 0)), indicating than the molecule to be optimized is not moved with respect to the target, which remains fixed. Three more initial poses are obtained by rotating the variable molecule π radians at each axis (always from the initial state), resulting in the following candidate solutions  = (π, (1, 0, 0), (0, 0, 0), (0, 0, 0)),  = (π, (0, 1, 0), (0, 0, 0), (0, 0, 0)) and  = (π, (0, 0, 1), (0, 0, 0), (0, 0, 0)). Finally, in order to introduce some randomness and prevent a possible drift to local optima, M − 4 molecular poses, with all their randomly obtained parameters, are also included.
Figure 4

Initially both molecules are aligned and centered at the origin of the coordinates (see figure above). The variable molecule is depicted in green, while the target is represented in red. Then, OptiPharm applies procedures based on species evolution to gradually adjust the variable molecule to the target. The two figures below show intermediate solutions obtained by OptiPharm when, from the initial state (top), a rotation is carried out (left) and a consecutive translation is accomplished (right).

Initially both molecules are aligned and centered at the origin of the coordinates (see figure above). The variable molecule is depicted in green, while the target is represented in red. Then, OptiPharm applies procedures based on species evolution to gradually adjust the variable molecule to the target. The two figures below show intermediate solutions obtained by OptiPharm when, from the initial state (top), a rotation is carried out (left) and a consecutive translation is accomplished (right). Figure 5 shows the five initial solutions achieved for a particular instance with M = 5. As can be seen, there is always some overlap between both molecules. Consequently, the objective function is always greater than zero, while the radius value associated to all the initial poses is equal to R1. Notice that such a value is equal to the diameter of the search space.
Figure 5

Initial solutions for a case with M = 5: (a) , initial situation; (b) , obtained when rotating π rad at x-axis; (c) , obtained when rotating π rad at y-axis; (d) , obtained when rotating π rad at z-axis; (e) , all the parameter (θ, , , Δ) are randomly computed in the limits dynamically calculated by OptiPharm, for this particular instance.

Initial solutions for a case with M = 5: (a) , initial situation; (b) , obtained when rotating π rad at x-axis; (c) , obtained when rotating π rad at y-axis; (d) , obtained when rotating π rad at z-axis; (e) , all the parameter (θ, , , Δ) are randomly computed in the limits dynamically calculated by OptiPharm, for this particular instance.

Reproduction method

The reproduction method is in charge of exploring the different subareas defined by the radius of each pose in the population (see Fig. 3). The idea is to find new promising solutions which can evolve toward local or global optima at later phases of the algorithm. Each subarea is analyzed independently of the remaining ones. The process is as follows: From each pose in the population, new candidate solutions are randomly computed in the area defined by its radius (see Fig. 6(a)). Additionally, for each pair of trial solutions ( and ), the middle point ((, )) of the segment connecting the pair is computed (see Fig. 6(b)). Then, the objective function value of the extreme points (f() and f()), as well as the middle point (f((, ))), is computed. If any objective function value of these new generated points is better of the original solution , it will be updated, i.e., the centre of that subarea will be the one with the best objective function value. Additionally, if the objective function value in the middle solution is better than that of the extreme points, it may mean that it is in a hill (see Fig. 6(b)), so that it is considered a candidate to be included in the population list. On the contrary, the endpoints will be inserted as new poses. The radius of the new pose in the population will be that one associated with the current iteration. Figure 6(c) shows a summary of the whole process by keeping the references to the names in Fig. 6(a,b).
Figure 6

Reproduction method.

Reproduction method.

Replacement method

After the reproduction method has been applied, it is highly probable that the size of the population will be greater than the population size given by the input parameter M. Therefore, a mechanism for selecting the surviving solutions must be applied. Different types of replacements exist but, in this work, a deterministic and highly elitist one has been implemented: the original population and their corresponding offspring are grouped in an intermediate population, and then the M best solutions, i.e., the best poses, are selected as members of the population. The remaining ones are eliminated. The implementation of this direct replacement involves the use of a sorting procedure whereby the poses are sorted according to their shape similarity value.

Improvement method

In order to introduce some noise into the search process, and hence avoid the convergence to local optima, a mutation operator is usually applied to the new offspring. Then, in most evolutionary or genetic algorithms, mutation mechanisms are included in the optimization procedure, which runs small random changes to the new individuals. However, for the present problem, the use of improvement methods has better shown to better approximate the poses towards the optima. The improvement method implemented in OptiPharm is the local search method SASS, initially proposed by Solis and Wets[38]. It has been chosen mainly because it is a derivative-free optimization algorithm that can be applied to maximize any arbitrary function over a bounded subset of . Several modifications have been included to adapt SASS to the problem at hand. In the following they are briefly described. Algorithm SASS internally assumes that the range in which each variable is allowed to vary is the interval . Since this is not our case, when necessary we use a function to rescale (normalize) the variable values to the interval , and the function denorm to invert this process. In SASS, the new points are generated using a Gaussian perturbation over the search point (x,α) and a normalized bias term to direct the search. The standard deviation σ specifies the size of the sphere that most likely contains the perturbation vector. In this work, its upper bound σ should have the same value as the normalized radius of the caller solution. Then, the parameter σ is also considered an argument of SASS. Hence, any single step taken by the optimizer is no longer than the radius of the calling candidate solution. Finally, the stopping rules are determined by a maximum number of function evaluations (fe) and by the maximum number of consecutive failures (Maxfcnt). OptiPharm applies SASS to every pose in the population. See Fig. 7 for an illustrative example of its performance.
Figure 7

Example. The local optimizer SASS has been used as Improvement method. This figure shows the performance of SASS for a 2D case. SASS is a derivative-free optimization algorithm that can be applied to maximize an arbitrary function over a bounded subset of . It looks for an improving direction and moves the starting point along it by making changes of different sizes (if the number of consecutive successes is larger than a pre-specified value, then the advance along the suggested searching direction will be longer; otherwise, the size of the step will be reduced. The area of action of the optimizer is limited by the corresponding radius. In OptiPharm, the stopping rule of SASS is determined by a maximum number of function evaluations and by the maximum number of consecutive failures.

Example. The local optimizer SASS has been used as Improvement method. This figure shows the performance of SASS for a 2D case. SASS is a derivative-free optimization algorithm that can be applied to maximize an arbitrary function over a bounded subset of . It looks for an improving direction and moves the starting point along it by making changes of different sizes (if the number of consecutive successes is larger than a pre-specified value, then the advance along the suggested searching direction will be longer; otherwise, the size of the step will be reduced. The area of action of the optimizer is limited by the corresponding radius. In OptiPharm, the stopping rule of SASS is determined by a maximum number of function evaluations and by the maximum number of consecutive failures.

Computational Experiments Framework

Hardware setup

All the experiments carried out in this work have been executed in a Bullx R424-E3, which consists of 2 Intel Xeon E5 2650v2 (16 cores), 128 GB of RAM memory and 1 TB HDD.

Methodology to test the performance of the algorithms

OptiPharm is a computer program which implements an evolutionary optimization algorithm which includes randomness in the search procedure. Then, in order to test its performance, we run each particular instance several times and we provide some statistical metrics, as usually is done when testing any heuristic algorithm in works in literature[39-43]. From a statistic point of view, a minimum number of 30 samples need to be considered for this[44]. Nevertheless, in this work, each particular instance has been run 100 times to increase confidence in the results. Then, figures as the average value and the standard deviation are computed to analyze its effectiveness and efficiency. It is important to highlight that executing several times a particular instance is only a methodology to analyze the robustness of the algorithm, but in the real world scenario, OptiPharm only needs a single run to provide reliable results. Regarding WEGA, it is only run once for each particular instance, since it is deterministic (it uses a descent gradient method) and different executions always produce the same result.

Benchmarks

Unlike OptiPharm, WEGA does not consider the hydrogen atoms in the shape similarity calculations. To be able to compare the results provided by both algorithms, OptiPharm has been configured to omit the hydrogens when computing the shape similarity score. Additionally, as WEGA does[27], all the heavy atom radii have been set to 1.7 Å. Furthermore, all compound pairs are centred and aligned in the same way. Consequently, the molecule centroids have been located at the coordinates centre of the search space. Finally, each molecule has been aligned in such a way that its longest axis has been oriented at X-axis and the shortest along the Z-axis. The underlying OptiPharm algorithm is parameterizable, which means that it can be fine-tuned depending on the user’s preferences. So users may prefer to obtain high-quality solutions at the expense of slightly increasing the computational effort, while others may want an acceptable solution with reasonable computing time. In this work, the parameters that control OptiPharm were tuned by trying several combinations of parameter values with a reduced set of problems, and following the guidelines described in a previous work[37]. As a consequence, two different sets of input parameters are proposed, given rise to two versions of OptiPharm with different aims: OptiPharm Robust (OpR). In this case, the set of input parameters is chosen to make OptiPharm reliable and robust; in other words, to allow OptiPharm to deeply explore and exploit the search space in the search for the best possible pose. In particular, the following values were considered: N = 200000 function evaluations, M = 5 starting poses, t = 5 iterations and as the smallest possible radius. OptiPharm Fast (OpF). On this occasion, the parameters are tuned so that the running times are lower or similar to those of WEGA, enabling a fair comparison between both algorithms. The following values were considered: N = 1000 function evaluations, M = 5 starting poses, t = 5 iterations and a minimum radius of . From the previous paragraphs, one could infer that the number of starting poses, M = 5, and the number of iterations, t = 5, can be fixed independently of the goal pursued, while the smallest radius , and most importantly, the number of function evaluations N have a bigger influence in both the effectiveness and the efficiency of the algorithm. Four computational studies were designed by considering the well-known Food and Drug Administration (FDA)[45], Directory of Useful Decoys (DUD)[46], Directory of Useful Decoys - Enhanced (DUD-E)[47] and Maybridge datasets. In the following sections, they are briefly described.

FDA

The FDA, a federal agency of the United States Department of Health and Human Services, is responsible for protecting and promoting public health by controlling, among other things, prescription and over-the-counter pharmaceutical drugs (medications). This agency provides a data set containing 1751 molecules, which represents approved medicines that can be used with safety in humans in the USA. It is a common practice[48], in the current scenario, to identify which compound pairs in the FDA database share a high degree of shape similarity. To compare the performance of both OptiPharm and WEGA, a set of 40 query compounds were randomly selected from this database. In order to obtain a representative set of samples, the FDA dataset was initially sorted according to the number of atoms of the compounds, and divided into 24 intervals (see Fig. 8). Then, a subset of compounds was randomly chosen for each interval. The number of selected samples in each interval was proportional to the number of compounds it included.
Figure 8

Number of compounds included on the FDA database, according to their number of atoms.

Number of compounds included on the FDA database, according to their number of atoms.

DUD

Tests were also carried out applying shape similarity calculations and using different sets of molecules that are known to be active or inactive, and standard VS benchmark tests, such as the DUD[46], whereby VS methods check how efficient they are at differentiating ligands that are known to bind to a given protein target, from non-binders or decoys. Input data for each molecule of each set contain its molecular structure and information about whether it is active or not. Information about active molecules for each protein of the DUD set was taken from experimental data. Decoys were prepared in order to resemble active ligands physically, but at the same time, to be chemically different from active molecules, making it very unlikely that they would act as binders. On average, for each ligand it is possible to find 36 decoy molecules that are very similar in physical terms, but with a very different topology. Details about how decoys were prepared (selected from already existing molecules in the ZINC database) can be found in the literature[46], so that we shall only mention here the principal details required to understand the present study. The initial database was built using 3.5 million Lipinski-compliant molecules from the ZINC database of commercially available compounds (version 6, December 2005). Feature key fingerprints were calculated using the default type 2 substructure keys of CACTVS[49] and the fingerprint-based similarity analysis was performed with the program SUBSET. Compounds with Tc values lower than 0.9 to any annotated ligand (named as actives) were selected. This reduced the number of ZINC compounds to 1.5 million molecules topologically dissimilar to the ligands. The program QikProp (Schrodinger, LLC, New York, NY) was used to calculate 32 physical properties of all the annotated ligands and selected ZINC compounds from the previous step, and QikSim (Schrodinger, LLC, New York, NY) was applied to prioritize ZINC compounds possessing similar physical properties to any of the ligands. A weight of 4 was used to emphasize the druglike descriptors (molecular weight, number of hydrogen bond acceptors, number of hydrogen bond donors, number of rotatable bonds, and log P), while the rest of the descriptors were ignored (weight 0) during the similarity analysis procedure. Finally, thirty-six decoy compounds were selected for each ligand, leading to a total of 95316 decoys that were similar in terms of physical properties but topologically dissimilar to the 2950 annotated ligands. The total number of decoys is less than 36 times the number of annotated ligands because some ligands had the same decoys. The original DUD database downloaded from http://zinc.docking.org has been used.

DUD-E

The DUD-E[47] is a well-known benchmark for structure-based virtual screening methods from the Shoichet Lab at UCSF[47]. The methodology of the DUDE benchmark is fully described in its original work[47]. Briefly, the benchmark is constructed by first gathering diverse sets of active molecules for a set of target proteins. Analogue bias is mitigated by removing similar actives; similar actives are eliminated by first clustering the actives based on scaffold similarity, then selecting exemplar actives from each cluster. Then, each active molecule is paired with a set of property-matched decoys (PMD)[50]. PMD are selected to be similar to each other and to known actives with respect to some 1-dimensional physicochemical descriptors (e.g., molecular weight) while being topologically dissimilar based on some 2D fingerprints (e.g., ECFP[51]). The enforcement of the topological dissimilarity supports the assumption that the decoys are likely to be inactive because they are chemically different from any know active. The benchmark consists of 102 targets, 22,886 actives (an average of 224 actives per target) and 50 PMD per active[52]. The original DUD-E database downloaded from http://dude.docking.org/ has been used in this work.

Maybridge

Maybridge[53] Screening Hit Discovery collection (over 53,000 compounds) is a commercial library of small hit-like and lead-like organic compounds of high diversity (Tanimoto Clustering at 0.9)[54], that covers ca. 87% of the 400,000 theoretical drug pharmacophores with general compliance with the Lipinsky rule of five and of good ADMET properties. The HitCreatorTM Collection (selection of 14,400 of Maybridge screening compounds) aims to represent the diversity of the main collection covering the drug-like chemical space. Maybridge also offers a fragment library (30,000 fragments), a hit-to-lead building block collection, and a Ro3 2500 diversity fragment library (2500 fragments) with a Tanimoto similarity index of 0.66 (based on standard Daylight fingerprinting), assured solubility, optimized for SPR and Ro3 compliant. It provides special collections of Fluoro[55], Fluoro and Bromo-fragment libraries[56]. The original Maybridge database downloaded from https://www.maybridge.com has been used in this study.

The AUC metric

In this work, to measure the goodness of the algorithms when distinguishing between ligands and decoys, the Area Under a ROC Curve (AUC) was computed, as previously done in other related papers[27]. See[57] for an in-depth description of calculation. Broadly speaking, the AUC of a set of elements is computed by considering a descriptor value that is associated to each element. For the problem at hand, such a descriptor is given by Equation 4, which measures the shape similarity between two molecules, A and B. However, before computing the AUC, given a query molecule and a set of molecules the similarity to which is to be computed, a optimization problem must be solved to obtain the shape similarity scores for each molecule in the set. Then, the list is sorted in descending order according to the shape similarity values. Without going into detail, an AUC value equal to 1 means that such a particular algorithm has been able to differentiate perfectly between two datasets - in our case, between ligands and decoys. In other words, it is possible to determine a cut-off point (a real value) which divides the list into two intervals that contain all the decoys and ligands, respectively. When it is not possible to determine only two intervals, more cut-off points should be considered in an incremental way. Of course, the larger the number of intervals, the smaller the AUC value. However, AUC values smaller than or equal to 0.5 mean the algorithm has poor effectiveness, i.e., a random method would have achieved a similar classification.

Results and Discussion

Results obtained for FDA database

It is important to mention that for all the algorithms and all the instances, a score equal to 1 is obtained when a molecule is compared to itself. Thus, from here on, when we mention “the molecule with the highest shape similarity to a query compound”, and noted by BestComp, we exclude the case where target molecule and query are equal. Table 1 shows, for each query compound, its number of atoms (nA), the other compound from the FDA database with the highest shape similarity (BestComp) and the associated function score (Tc), according to OpR, OpF and WEGA. As can be seen, the OpR algorithm provides the highest shape similarity values Tc, although it is also the most time-consuming method according to Table 2. This means that better predictions can be accomplished by using OpR when there are no time constraints. However, if lower execution times are required, algorithms such as OpF or WEGA should be considered.
Table 1

Results obtained for 40 query compounds from the FDA database.

queryOpROpFWEGA
namenABestCompTcBestCompTc(i, Tc(OpR))BestCompTc(i, Tc(OpR))
DB005297DB008280.921DB008280.920DB008280.921
DB003319DB011890.940DB011890.936DB011890.940
DB0136512DB001910.944DB001910.943DB001910.944
DB0135215DB003060.891DB003060.884DB002370.872(2, 0.872)
DB0038019DB008160.842DB008160.822DB008160.842
DB0621620DB003700.905DB003700.902DB093040.856(2, 0.869)
DB0067421DB016190.865DB016190.855DB003700.850(2, 0.850)
DB0063223DB004640.724DB004640.719DB004640.717
DB0761524DB012500.799DB012500.797DB012500.799
DB0069325DB016190.841DB016190.793DB010680.825(2, 0.825)
DB0088725DB066140.745DB066140.732DB049380.733(2, 0.730)
DB0921925DB004340.819DB007920.805(3, 0.812)DB007920.812(3, 0.812)
DB0035127DB048390.941DB048390.936DB006030.902(2, 0.902)
DB0038128DB010230.819DB010230.732DB067120.707(5, 0.706)
DB0923728DB010540.717DB010540.648DB011150.686(4, 0.685)
DB0119829DB004020.933DB004020.929DB004020.933
DB0087630DB090390.664DB052390.651(3, 0.653)DB052390.653(3, 0.653)
DB0162132DB011480.694DB011480.693DB011480.694
DB0923633DB002700.672DB011150.615(2, 0.669)DB014330.662(3, 0.662)
DB0890337DB003330.653DB003330.610DB067030.630(4, 0.630)
DB0072838DB013390.820DB013390.816DB013390.820
DB0141942DB066050.630DB066050.626DB066050.630
DB0032043DB014130.629DB014130.618DB014130.629
DB0123249DB010820.549DB010820.535DB010820.549
DB0024650DB012610.761DB012610.738DB012610.761
DB0050350DB008450.499DB013190.461(4, 0.496)DB013190.498(4, 0.496)
DB0911450DB089930.476DB048940.411(6, 0.416)DB089930.477
DB0025455DB005950.877DB005950.874DB005950.877
DB0030955DB005410.634DB005410.618DB005410.634
DB0643957DB002070.515DB002070.494DB002120.513(2, 0.513)
DB0119660DB002860.784DB002860.779DB002860.784
DB0107866DB005110.502DB005110.479DB005110.503
DB0159068DB008770.469DB003850.459(2, 0.464)DB008770.469
DB0489480DB003640.482DB003640.468DB008640.453(3, 0.453)
DB0478686DB010780.387DB091580.306(3, 0.369)DB010780.387
DB0073287DB010450.434DB010450.417DB010450.434
DB0040394DB000350.394DB064020.355(4, 0.376)DB088740.386(2, 0.386)
DB00050102DB005690.396DB005690.391DB005690.396
DB06699117DB000910.454DB005120.409(2, 0.414)DB090990.412(3, 0.411)
DB06219128DB005120.422DB003640.354(2, 0.409)DB003640.410(2, 0.409)

For each query, its nA and the BestComp with the highest Tc is shown, according to OpR, OpF and WEGA. Note that the score Tc is equal to 1 when the query compound is compared with itself for all the instances and algorithms, so that BestComp really represents the second most similar molecule to the query.

Table 2

Performance results obtained by the different similarity methods.

querynAOpROpFWEGAspeedup
AvSDAvSDT
DB00529761.20.5604.80.00816.43.4
DB00331977.40.7525.80.04117.53.0
DB013651296.70.7147.30.00416.92.3
DB0135215116.50.8239.10.03719.52.1
DB0038019165.11.42511.00.02820.41.9
DB0621620169.21.20311.80.03025.32.1
DB0067421169.91.12312.30.01120.61.7
DB0063223130.41.56411.30.00522.32.0
DB0761524205.41.38513.40.01022.41.7
DB0069325215.22.15814.50.01724.21.7
DB0088725213.51.54714.20.00121.61.5
DB0921925223.11.70914.30.01022.61.6
DB0035127220.71.98015.30.01723.41.5
DB0038128227.51.49915.50.01332.12.1
DB0923728227.41.22215.80.00122.81.4
DB0119829223.91.35414.60.00023.11.6
DB0087630262.01.87817.10.00223.71.4
DB0162132267.11.47517.20.01724.71.4
DB0923633280.72.23018.10.05927.01.5
DB0890337289.32.18820.00.04525.51.3
DB0072838284.61.78720.30.03225.81.3
DB0141942359.22.37121.70.03128.51.3
DB0032043355.62.37422.80.01625.61.1
DB0123249395.72.89625.70.03629.21.1
DB0024650250.71.71915.50.07322.61.5
DB0050350416.62.74326.30.00831.01.2
DB0911450388.32.78223.90.00531.61.3
DB0025455263.11.91917.40.00325.31.5
DB0030955377.32.62628.50.02230.81.1
DB0643957434.82.93729.30.04832.91.1
DB0119660244.71.59915.90.05427.11.7
DB0107866485.93.53828.90.05036.01.2
DB0159068495.13.29731.90.01039.41.2
DB0489480550.43.87637.80.00640.71.1
DB0478686598.74.72832.30.00245.41.4
DB0073287628.54.14740.10.01544.21.1
DB0040394609.55.07239.60.04149.51.3
DB00050102664.14.83445.50.05051.31.1
DB06699117725.65.25750.80.00555.01.1
DB06219128828.47.03052.00.09063.41.2
mean46330.02.40821.70.02429.71.6

Columns represent: DrugBank code for each molecule, its corresponding nA, average running time (in seconds) and standard deviation obtained by OpR and OpF (see columns 3–6), execution time spent by WEGA (see column 7), and speedup of OpF against WEGA.

Results obtained for 40 query compounds from the FDA database. For each query, its nA and the BestComp with the highest Tc is shown, according to OpR, OpF and WEGA. Note that the score Tc is equal to 1 when the query compound is compared with itself for all the instances and algorithms, so that BestComp really represents the second most similar molecule to the query. Performance results obtained by the different similarity methods. Columns represent: DrugBank code for each molecule, its corresponding nA, average running time (in seconds) and standard deviation obtained by OpR and OpF (see columns 3–6), execution time spent by WEGA (see column 7), and speedup of OpF against WEGA. To the best of our knowledge, no algorithm, method or program exists that is able to provide with certainty the most similar molecule to a given query compound. Until this work, WEGA was the algorithm providing the most optimal shape similarity values[27,34]. Now, as can be seen in Table 1, OpR improves on WEGA in terms of the ability to find higher values of shape similarity when processing a query compound against a ligand database. Therefore, to analyze the effectiveness of OpF and WEGA in term of their predictions, the solutions provided by OpR will be considered the optimal ones. As can be seen in Table 1, the predictions of WEGA coincide with those of OpR in 22 out of 40 cases, while OpF does it in 30 out of 40 occasions. This represents a small advantage to OpF against WEGA in terms of success in the predictions. Additionally, from Table 2, which shows the computing times, one can appreciate that OpF is quicker than WEGA. Furthermore, it is important to study the instances where the predictions of OpF and WEGA do not coincide with those achieved by OpR. This occurs in 18 out of 40 cases for WEGA, and 10 times for OpF. Then, for each particular query, the 1751 compounds are sorted in descending order according to the shape similarity value obtained by OpR. Next, it is computed the position i in the list where the BestComp achieved by OpF (resp. WEGA) is, and which one shape similarity value, Tc(OpR). This information is shown in Table 1, columns 6 and 9 for OpF and WEGA, respectively. Broadly speaking, in most of cases the predictions carried out by OpF are located in a better position in the OpR list than the predictions proposed by WEGA. It is important to mention that, in general, OptiPharm is designed to maintain population diversity and to investigate many promising poses in parallel, avoiding the genetic drift towards a single (local or global) optimal pose. However, depending on the selected set of parameters, the accuracy when approximating to the optima may be higher or lower. For this reason, OpF has been fine-tuned to explore the search space looking for the most promising poses, but without wasting time by “polishing” them. In optimization terms, the input parameters are selected to determine the highest peaks in the search space, but not to actually reach the top of the highest peak. Even when OpF proposes as BestComp the same compound as OpR (or even WEGA), its shape similarity value may be smaller. If the algorithm is allowed to run longer, as with OpR, the identified poses can be polished, increasing the score value. In this case we prioritize the computational effort. Figure 9 depicts a graphical example of this fact, specifically the query DB09236 from the FDA database, whose result can be seen in Table 1. Considering this query, OpR reveals that DB00270 is the compound which maximizes the shape similarity function, with a score value equal to Tc = 0.672 (see Fig. 9(a)). OpF reveals that molecule DB01115 maximises the shape similarity function with a score value equal to Tc = 0.615. Finally, WEGA reveals that the molecule DB01433 maximizes the shape similarity function with Tc = 0.662. Apparently, WEGA achieves a more similar compound than OpF, since it provides as solution a compound with a higher score than the one proposed by OpF. However, when OpR optimizes the query with the molecule DB01115 proposed by OpF, it provides a score value of 0.669 (see Fig. 9(b)). By contrast, OpR gives a value of 0.662 when it optimizes the query with the compound DB01433 given by WEGA, (see Fig. 9(c)). This means that the solution provided by OpF is more similar in terms of shape than that of WEGA.
Figure 9

Depiction of shape similarity between the query DB09236 and (a) the molecule DB00270, (b) the compound DB01115, and (c) the molecule DB01433, when they are optimized by OpR.

Depiction of shape similarity between the query DB09236 and (a) the molecule DB00270, (b) the compound DB01115, and (c) the molecule DB01433, when they are optimized by OpR. Table 2 shows performance values among the different methods. Clearly, the slowest algorithm is OpR, since it has been fine-tuned to be robust and accurate. Even so, the time values are not extremely high when compared against the other two methods. In fact, taking into account the possibility of using high-performance computing to accelerate it (please, see Future Work Section), it would be perfectly justifiable to use Robust mode to increase the percentage success in the predictions. For its part, OpF is the fastest algorithm, reducing on average the computational effort of WEGA almost 3.5 times. Besides, as can be appreciated in the Speedup column, the lower the number of atoms, the greater the increase in speed obtained by OpF. Additionally, it is important to mention that OpF is able to adapt itself to the complexity of the problem to solve. Finally, it is interesting to remark that, in spite of the randomness included at some stages of the OptiPharm algorithm, its variability is almost negligible, as can be appreciated from the standard deviation values provided in Table 2.

Results obtained for DUD and DUD-E databases

Tables 3 and 4 show the results of testing the shape-based VS performance of both OptiPharm (in its two versions) and WEGA against the DUD and DUD-E databases, respectively. Metrics of AUC values and execution time have been computed. As previously was mentioned, to test the OptiPharm reliability, each particular instance has been run 100 times and average values have been computed. Furthermore, the corresponding SD has also been provided. Regarding WEGA, since it is deterministic, only one single execution has been carried out for each particular instance and the corresponding values have been shown.
Table 3

DUD database.

nameAUCTime
OpROpFWEGAOpROpFWEGA
AvSDAvSDAUCAvSDAvSDTime
ace0.390.0130.440.0210.33278.70.04615.20.00031.0
ache0.710.0040.710.0080.72645.50.05935.50.00367.0
ada0.670.0030.710.0110.6667.80.0114.90.00012.5
alr20.240.0030.280.0120.2287.30.0156.80.00013.9
ampc0.700.0050.750.0200.7168.60.0135.00.00010.9
ar0.730.0030.730.0050.72209.20.02018.10.00141.2
cdk20.600.0100.580.0100.59184.30.02612.40.00028.7
comt0.430.0170.450.0160.3745.60.0073.30.00010.0
cox10.490.0030.510.0090.4857.20.0094.70.00012.6
cox20.950.0020.930.0040.951738.50.112109.60.0061038.6
dhfr0.650.0030.610.0070.651392.80.08183.60.006742.6
egfr0.590.0030.540.0060.572128.50.100137.30.008962.1
er_agonist0.790.0030.800.0070.79228.40.02617.60.001120.7
er_antagonist0.730.0080.730.0150.72262.40.02915.20.00070.0
fgfr10.410.0010.450.0030.40668.70.04739.40.003387.6
fxa0.600.0070.600.0100.681161.20.07365.50.005244.6
gart0.310.0070.410.0120.27197.00.02411.60.00049.6
gpb0.850.0040.820.0080.84128.90.01610.50.00035.5
gr0.620.0050.660.0080.62365.20.03427.40.00253.5
hivpr0.780.0110.710.0110.76622.70.06336.00.00151.1
hivrt0.750.0110.750.0100.75143.80.0199.80.00034.0
hmga0.750.0120.750.0150.77240.70.02714.90.00078.5
hsp900.680.0090.770.0160.66128.70.0198.20.00030.5
inha0.610.0070.530.0090.60479.70.04532.40.00284.4
mr0.840.0040.840.0070.8466.60.0115.60.00010.7
na0.860.0080.830.0080.85165.40.01712.40.00031.6
p380.500.0030.450.0120.471997.20.125112.70.006371.6
parp0.500.0030.460.0080.4996.30.0167.80.00033.8
pde50.750.0080.740.0090.75420.60.03823.50.001124.9
pdgfrb0.450.0030.470.0060.46964.00.05854.30.005145.7
pnp0.610.0080.610.0200.6371.40.0115.60.00017.2
ppar_gamma0.680.0140.720.0110.701055.60.08650.20.003134.2
pr0.620.0180.650.0290.61151.70.02410.90.00044.5
rxr_alpha0.900.0230.910.0130.91122.00.0167.30.00013.8
sahh0.890.0060.870.0070.8987.90.0126.70.00019.5
src0.320.0030.380.0080.301388.00.07274.30.006272.7
thrombin0.500.0090.570.0130.55510.20.04528.60.001145.4
tk0.560.0180.560.0170.5847.80.0084.20.00020.6
trypsin0.280.0060.330.0090.26255.40.02412.60.00041.0
vegfr20.610.0060.600.0080.61323.50.02721.20.00149.5
mean0.620.0070.630.0110.61481.40.03829.10.002142.2

For each query compound, the average AUC value and the mean running time (in seconds) over 100 independent executions were computed with both OpR and OpF. For the sake of completeness, the SD is also provided for both OpR and OpF versions. WEGA is a deterministic algorithm, so it was only executed once and its computed AUC value and the execution time are included. The last row of the table shows average values for the query molecules.

Table 4

DUD-E database.

nameAUCTime
OpROpFWEGAOpROpFWEGA
AvSDAvSDAUCAvSDAvSDTime
aa2ar0.570.0000.560.0110.572656.824.74845.61.1251363.1
abl10.520.0010.530.0030.52905.65.75635.81.197430.4
ace0.510.0000.500.0150.531392.54.36927.80.827710.6
aces0.240.0000.270.0060.241733.63.07226.90.901978.7
ada0.630.0000.580.0410.71245.61.5705.30.193232.0
ada170.480.0000.470.0020.481894.63.29747.61.7961011.2
adrb10.360.0020.330.0030.36966.51.98732.61.002634.5
adrb20.370.0010.370.0030.381155.910.91219.80.736555.6
akt10.260.0010.260.0080.261062.51.99240.40.977675.6
akt20.410.0010.340.0040.39504.15.27117.00.563210.9
aldr0.540.0010.500.0070.54565.12.70123.00.789310.2
ampc0.630.0000.520.0070.64118.91.7362.50.096101.8
andr0.630.0000.600.0020.63595.817.95326.10.774398.3
aofb0.440.0000.450.0020.44135.20.4113.10.090198.3
bace10.530.0000.460.0180.541284.84.04137.81.197758.4
braf0.560.0020.480.0050.55766.79.02514.80.424352.3
cah20.450.0030.440.0020.44765.513.34237.61.2971173.8
casp30.410.0000.440.0010.39561.40.76313.80.384436.6
cdk20.660.0010.640.0040.663007.871.46570.71.907837.4
comt0.600.0020.560.0050.62122.91.5915.30.146111.7
cp2c90.430.0000.430.0050.44408.32.05511.10.326280.0
cp3a40.530.0010.530.0070.531370.16.52632.50.904430.5
csf1r0.550.0000.580.0060.601085.737.66727.90.806397.4
cxcr40.710.0020.650.0030.73231.41.8514.00.117112.3
def0.690.0000.550.0080.69324.215.6766.50.188191.6
dhi10.640.0000.670.0020.641200.77.46726.80.689703.7
dpp40.570.0000.550.0020.573618.46.68462.81.8661402.7
drd30.300.0000.290.0020.292085.86.17156.31.8841174.4
dyr0.400.0010.380.0040.40976.626.65263.61.888624.9
egfr0.520.0010.450.0050.543601.299.47854.81.7801460.8
esr10.640.0000.640.0030.631994.76.16843.21.385749.2
esr20.690.0000.650.0030.681300.22.19625.80.919693.1
fa70.660.0010.600.0030.522691.78.408111.91.731184.1
fa100.510.0020.530.0080.67761.62.98214.00.388589.3
fabp40.690.0090.620.0100.67285.83.17611.10.425119.0
fak10.690.0020.670.0020.67648.86.13537.60.863160.3
fgfr10.470.0020.470.0040.4650.50.7881.70.05250.2
fkb1a0.680.0010.670.0080.72458.116.79210.40.308253.5
fnta0.550.0000.470.0040.556081.64.037186.25.6342102.5
fpps0.860.0010.810.0020.88250.71.8188.70.267221.3
gcr0.520.0000.480.0020.501046.01.88329.30.970624.2
glcm0.360.0020.300.0010.35132.66.1533.40.104132.2
gria20.590.0010.560.0040.60740.58.28822.90.865418.7
grik10.620.0010.670.0040.61262.64.9798.30.273253.6
hdac20.340.0000.310.0020.35521.02.44010.60.354400.5
hdac80.420.0000.400.0040.43528.35.14411.80.372353.7
hivint0.410.0010.350.0040.41384.11.26012.30.389221.2
hivpr0.700.0010.690.0090.714748.85.144133.43.8081354.1
hivrt0.520.0000.490.0010.521107.64.25042.31.469573.8
hmdh0.750.0000.710.0030.74976.87.56519.70.549399.8
hs90a0.630.0010.600.0080.64390.38.9469.70.189183.7
hxk40.640.0010.490.0020.62358.312.64711.50.382188.1
igf1r0.480.0020.460.0040.501048.32.24431.81.059401.7
inha0.390.0020.340.0050.43130.60.4072.30.07579.6
ital0.390.0020.440.0070.381157.52.99630.10.827459.8
jak20.680.0000.640.0040.68412.24.7348.10.277283.3
kif110.830.0000.580.0060.83606.86.0538.20.272318.1
kit0.430.0000.410.0030.44678.50.30715.90.559325.9
kith0.690.0030.650.0020.70153.31.2273.70.145104.5
kpcb0.580.0000.520.0040.59622.66.03913.60.471310.2
lck0.460.0010.430.0020.442110.14.28140.61.2371121.1
lkha40.520.0000.520.0030.58599.40.8669.70.298365.4
mapk20.650.0000.610.0030.65376.41.19010.40.316210.0
mcr0.640.0000.590.0020.63292.43.0966.10.200175.4
met0.680.0020.730.0070.722182.015.72446.81.571564.2
mk010.390.0010.380.0020.40256.90.7396.10.209154.7
mk100.450.0000.490.0050.44559.12.01711.50.362258.8
mk140.540.0010.520.0030.546277.225.295139.14.4521404.4
mmp130.560.0000.550.0020.603671.87.85063.42.0801525.6
mp2k10.420.0000.530.0030.45722.917.18211.70.383339.2
nos10.350.0010.330.0030.35366.68.3226.30.197267.2
nram0.850.0000.790.0020.85357.08.8986.20.163200.7
pa2ga0.600.0000.620.0050.60416.63.4338.60.324218.3
parp10.640.0000.630.0010.641481.62.18143.21.314981.3
pde5a0.590.0000.560.0020.562777.056.57437.41.2221243.3
pgh10.700.0000.720.0040.71620.41.82615.30.412425.2
pgh20.790.0000.740.0010.791130.52.54435.51.161791.9
plk10.530.0000.470.0060.54797.56.74411.30.346267.3
pnph0.740.0000.700.0010.74264.41.5215.90.185212.8
ppara0.760.0000.750.0030.772109.327.19939.31.456870.2
ppard0.470.0010.340.0020.441557.12.22331.70.856503.6
pparg0.450.0010.430.0020.452867.434.03469.31.9421122.2
prgr0.720.0010.690.0020.711148.712.21933.81.017469.9
ptn10.310.0010.290.0050.30348.11.4258.90.264290.6
pur20.370.0000.260.0090.33242.81.8444.90.153146.9
pygm0.580.0000.620.0050.57241.41.8125.90.162173.2
pyrd0.840.0000.800.0010.85343.00.9708.20.237233.1
reni0.590.0020.560.0030.58970.521.35739.61.241292.4
rock10.550.0000.520.0020.54216.73.3384.30.167207.4
rxra0.610.0000.490.0030.60410.010.7928.50.312258.6
sahh0.870.0000.600.0030.86123.90.3942.10.105131.8
src0.550.0020.530.0020.604995.26.656271.27.7811318.6
tgfr10.600.0010.490.0030.59514.39.72310.50.373350.7
thb0.790.0000.750.0010.81651.81.96312.80.428321.2
thrb0.450.0000.430.0030.452427.1114.33971.52.4441205.4
try10.570.0000.560.0010.572483.450.89360.82.1511123.2
tryb10.380.0000.360.0030.39555.02.4718.30.275277.8
tysy0.650.0020.610.0070.66705.40.34716.40.525266.7
urok0.400.0000.400.0020.41511.11.83413.80.466342.5
vgfr20.570.0000.600.0030.591816.616.64942.61.154902.3
wee10.650.0010.470.0180.62695.55.63812.10.377204.3
xiap0.790.0040.760.0100.78530.06.23316.10.448187.4
mean0.560.0010.530.0050.561152.910.25630.10.912516.6

For each query compound, the average AUC value and the mean running time (in seconds) over 100 independent executions were computed with both OpR and OpF. For the sake of completeness, the standard deviation SD is also provided for both OpR and OpF versions. WEGA is a deterministic algorithm, so it was only executed once and its computed AUC value and the execution time are included. The last row of the table shows average values for the query molecules.

DUD database. For each query compound, the average AUC value and the mean running time (in seconds) over 100 independent executions were computed with both OpR and OpF. For the sake of completeness, the SD is also provided for both OpR and OpF versions. WEGA is a deterministic algorithm, so it was only executed once and its computed AUC value and the execution time are included. The last row of the table shows average values for the query molecules. DUD-E database. For each query compound, the average AUC value and the mean running time (in seconds) over 100 independent executions were computed with both OpR and OpF. For the sake of completeness, the standard deviation SD is also provided for both OpR and OpF versions. WEGA is a deterministic algorithm, so it was only executed once and its computed AUC value and the execution time are included. The last row of the table shows average values for the query molecules. In general terms, the SD values obtained for OpR and OpF are quite small, which indicates that their variability is small, and that (i) they converge toward the same optima in spite of the included randomness and (ii) the computing time is practically the same when different executions of the same instance are carried out. Focusing now on Table 3, it is possible to infer that the three algorithms are equivalent in terms of accuracy of the predictions, i.e. they obtain about the same AUC values regardless of the considered instance. In fact, the average of the AUC values is practically equal, as can be seen in the last row of the table. Nevertheless, OpF is almost 5 times faster than WEGA and more than 16 times quicker than OpR. Finally, similar conclusions than previously can be obtained for the DUD-E database (see Table 4). In terms of effectiveness, OpR and WEGA are comparable, since they obtain practically the same mean AUC value. On the contrary, OpF obtains an average AUC value slightly smaller. Nevertheless, OpF is more than 17 times faster than WEGA and more than 38 times quicker than OpR.

Results obtained when hydrogen atoms are considered

By default, WEGA does not consider hydrogen atoms during optimization, which is a common practice for most tools in the current scenario, since evaluation without hydrogens is less time-consuming. However, this simplification may have serious consequences in a VS process. In this work, the effect of excluding the hydrogens of the molecules when optimizing is analyzed. Table 5 shows number of atoms for the 40 query molecules selected from the FDA database when the hydrogens are not taken into account and when they are considered (columns 2 and 6 respectively). Additionally, the molecule BestComp from the FDA dataset, which maximizes the shape similarity and the corresponding score value Tc, both when the input molecules include the hydrogens and when they not, is shown. Notice that these experiments were accomplished using OpR since, according to the previous results, it is the most efficient algorithm. For the sake of completeness, the average execution time (in seconds), in both cases, has also been included. As it can be seen, in 15 out of 40 cases, the BestComp molecule differs, depending on whether the hydrogens are considered or not. Additionally, and as expected, the computing time decreases when hydrogens are not considered (see columns 5 and 9). This means that excluding the hydrogens of the molecules is not an appropriate simplification; although the computing effort is shorter, the molecule that which maximizes the shape similarity can change.
Table 5

Results obtained by OpR for 40 query compounds from the FDA database.

queryWithout hydrogensWith hydrogensBestComp w/o Hevaluated with H
nABestCompTcTimenABestCompTcTimeTc
DB005297DB008280.92161.210DB092940.869135.50.701
DB003319DB011890.94077.420DB092100.862255.80.710
DB0135215DB003060.891116.529DB003060.889361.10.884
DB0136512DB001910.94496.730DB001910.935406.90.928
DB0038019DB008160.842165.135DB010410.852477.40.802
DB0621620DB003700.905169.237DB003700.876500.10.874
DB0069325DB016190.841215.237DB016190.863553.40.854
DB0761524DB012500.799205.440DB007210.790576.50.713
DB0921925DB004340.819223.140DB013200.845636.20.764
DB0067421DB016190.865169.942DB016190.801556.70.786
DB0119827DB004020.933223.945DB004020.892624.70.890
DB0088725DB066140.745213.545DB008370.742613.00.686
DB0024628DB012610.761250.750DB012610.756737.60.751
DB0038128DB010230.819227.553DB010230.828728.60.823
DB0923728DB010540.717227.454DB010540.752759.40.745
DB0087630DB090390.664262.054DB090390.674800.80.665
DB0025432DB005950.877263.155DB005950.848814.70.838
DB0035127DB048390.941220.757DB048390.934748.30.928
DB0119629DB002860.784244.760DB002860.797820.40.794
DB0162133DB011480.694267.166DB011480.715924.20.708
DB0923632DB002700.672280.766DB010540.682940.20.615
DB0890337DB003330.653289.369DB003330.679968.50.673
DB0063223DB004640.724130.469DB004640.740696.40.732
DB0141942DB066050.630359.270DB066050.6711086.40.667
DB0032043DB014130.629355.680DB007280.6171139.00.596
DB0072838DB013390.820284.691DB013390.8391094.00.837
DB0050350DB008450.499416.698DB007010.5411465.30.442
DB0123249DB010820.549395.7100DB002120.6171411.80.581
DB0030955DB005410.634377.3110DB005410.6241348.20.621
DB0478686DB010780.387598.7120DB005110.4321657.80.405
DB0911450DB089930.476388.3130DB089930.5121799.60.510
DB0643957DB002070.515434.8137DB002070.5911871.50.533
DB0107866DB005110.502485.9140DB005110.5821819.40.570
DB0159068DB008770.469495.1151DB008770.5571995.50.545
DB0489480DB003640.482550.4152DB006460.5371797.10.495
DB0040394DB000350.394609.5167DB088740.4702130.00.446
DB0073287DB010450.434628.5169DB062870.4842204.40.470
DB00050102DB005690.396664.1194DB005690.4892248.00.483
DB06699117DB000910.454725.6221DB090990.5142482.60.496
DB06219128DB005120.422828.4229DB005120.4432796.00.414
mean440.686330.0860.7041124.60.674

Two experiments were carried out, one excluding the hydrogen atoms for all the molecules (a common practice in most VS tools in the literature) and the other hand considering the hydrogens in all the molecules. For each study and query, its nA without and with hydrogens, the BestComp with the highest Tc and the computing time, in second, are shown. Finally, the optimized BestComp obtained when no hydrogens are considered is re-evaluated, but including the hydrogens (last column).

Results obtained by OpR for 40 query compounds from the FDA database. Two experiments were carried out, one excluding the hydrogen atoms for all the molecules (a common practice in most VS tools in the literature) and the other hand considering the hydrogens in all the molecules. For each study and query, its nA without and with hydrogens, the BestComp with the highest Tc and the computing time, in second, are shown. Finally, the optimized BestComp obtained when no hydrogens are considered is re-evaluated, but including the hydrogens (last column). Finally, for a fair comparison in terms of score value, the optimized BestComp obtained by OpR when no hydrogens are considered is re-evaluated, but considering now the hydrogens. As we can see, the obtained score value is always smaller than the one obtained when the hydrogens are included (compare columns 8 and 10). This means that the BestComp molecule found by OpR when the hydrogens are considered is indeed more similar than the one proposed when the hydrogens are excluded. The Fig. 10 illustrates this fact.
Figure 10

Query compound DB06439 is represented by the red structure. Hydrogens are white atoms. Colours remain fixed. (a) Tc = 0.515 where the compound DB00207 is the yellow structure. (b) Tc = 0.591 where the compound DB00207 is the green structure. (c) Tc = 0.533 where the compound DB00207 is the orange structure. (d) The three previous compounds are optimized with respect to the query.

Query compound DB06439 is represented by the red structure. Hydrogens are white atoms. Colours remain fixed. (a) Tc = 0.515 where the compound DB00207 is the yellow structure. (b) Tc = 0.591 where the compound DB00207 is the green structure. (c) Tc = 0.533 where the compound DB00207 is the orange structure. (d) The three previous compounds are optimized with respect to the query. In addition, the impact on the classification when the hydrogen atoms are considered has also been evaluated when DUD and DUD-E databases are considered as input. The algorithms OpR and OpF have been selected to this aim. The corresponding results are shown in Tables 6 and 7. Notice that WEGA has not been included in the study since it never considers the hydrogens.
Table 6

DUD database with hydrogens.

nameAUCTime
OpROpFOpROpF
AvSDAvSDAvSDAvSD
ace0.400.0010.420.021894.38.58451.30.343
ache0.720.0020.680.0072448.195.702132.80.336
ada0.790.0060.750.021227.910.04715.30.092
alr20.460.0070.480.009187.46.65615.00.073
ampc0.740.0130.730.015131.41.5619.20.021
ar0.860.0030.840.003748.137.04066.50.151
cdk20.620.0030.600.011449.611.77430.10.143
comt0.400.0080.410.008136.07.4988.70.041
cox10.590.0010.580.006141.69.84212.10.035
cox20.900.0010.880.0053768.599.262237.30.578
dhfr0.590.0040.530.0073946.177.654217.40.432
egfr0.560.0020.570.0045896.4131.069379.90.484
er_agonist0.740.0030.710.010751.720.64459.10.324
er_antagonist0.690.0040.730.008887.323.42152.90.209
fgfr10.420.0000.460.0021782.431.747112.20.309
fxa0.660.0090.610.0113089.241.870166.90.443
gart0.280.0110.340.013469.35.37428.70.183
gpb0.850.0020.820.008329.83.58027.90.178
gr0.770.0040.760.0111222.864.35895.10.280
hivpr0.740.0100.740.0072049.994.105113.90.732
hivrt0.700.0080.690.009470.917.54031.10.174
hmga0.840.0040.820.008855.823.48356.60.162
hsp900.770.0120.810.015412.618.48926.20.063
inha0.590.0100.530.0051392.143.31489.50.289
mr0.870.0030.860.004235.46.25521.40.092
na0.830.0020.800.009479.89.48440.00.275
p380.310.0040.370.0066491.8129.148346.90.598
parp0.590.0040.590.006232.48.26019.20.126
pde50.770.0060.750.0061399.212.28678.80.473
pdgfrb0.440.0040.460.0082704.093.157143.20.893
pnp0.710.0040.680.017193.91.97814.90.054
ppar_gamma0.730.0060.730.0123000.940.167139.50.172
pr0.680.0110.660.013544.425.76036.80.274
rxr_alpha0.890.0230.870.015414.19.42125.50.152
sahh0.880.0060.810.012227.113.65715.50.036
src0.440.0020.460.0053727.873.533219.20.510
thrombin0.560.0100.570.0061517.816.97792.90.210
tk0.650.0030.640.011125.63.78611.60.065
trypsin0.270.0040.300.008733.310.18936.30.187
vegfr20.620.0030.600.007861.653.93054.00.280
mean0.650.0050.640.0091389.534.81583.30.262

For each query compound, the average AUC value and the mean running time (in seconds) over 100 independent executions were computed with both OpR and OpF. For the sake of completeness, the SD is also provided for both OpR and OpF versions. The last row of the table shows average values for the query molecules.

Table 7

DUD-E database with hydrogens.

nameAUCTime
OpROpFOpROpF
AvSDAvSDAvSDAvSD
aa2ar0.540.0000.540.00112648.0307.385256.315.165
abl10.560.0000.580.0034178.4107.358184.43.043
ace0.630.0000.630.0016514.7169.817174.36.823
aces0.220.0000.230.00110542.9234.978194.17.856
ada0.680.0000.700.0031435.550.09640.21.630
ada170.530.0000.570.0019711.7254.865235.61.251
adrb10.410.0010.390.0025819.1180.508238.14.965
adrb20.410.0000.420.0018295.1243.728167.79.623
akt10.260.0010.290.0035113.7105.351205.43.847
akt20.470.0000.430.0022762.078.460115.82.685
aldr0.560.0010.550.0062156.184.68787.30.626
ampc0.680.0000.560.015465.211.38812.50.532
andr0.780.0000.750.0013845.493.711196.96.801
aofb0.410.0000.410.001941.414.38924.11.065
bace10.580.0000.530.0038931.4248.917303.57.397
braf0.530.0000.520.0034113.9105.411102.83.975
cah20.500.0020.510.0012636.363.463145.94.771
casp30.450.0000.480.0012751.963.38488.83.199
cdk20.640.0000.630.00214337.1270.933407.18.501
comt0.630.0050.560.003441.112.08123.51.007
cp2c90.450.0000.450.0021980.433.51568.42.496
cp3a40.550.0000.540.0047613.5271.865211.56.706
csf1r0.510.0000.540.0015659.4189.853146.70.106
cxcr40.750.0000.690.0011712.466.77130.80.612
def0.760.0000.720.0022013.040.48855.51.596
dhi10.750.0000.750.0026446.4158.161189.54.195
dpp40.620.0000.610.00115566.7374.754328.814.946
drd30.370.0000.390.00114175.3269.919431.713.124
dyr0.420.0030.380.0025729.796.321373.84.095
egfr0.500.0000.510.00218151.4354.857336.44.806
esr10.570.0010.600.00210530.6293.861240.50.841
esr20.640.0000.630.0038166.9185.100207.55.593
fa100.630.0040.610.00113762.4325.381628.210.031
fa70.480.0010.500.0064005.9117.43888.62.803
fabp40.740.0030.670.0051366.249.41651.30.834
fak10.710.0010.600.0062801.981.060163.63.498
fgfr10.470.0020.470.001281.75.8109.50.613
fkb1a0.780.0010.730.0052286.271.67272.13.259
fnta0.540.0010.480.00133347.0569.0401131.113.666
fpps0.780.0000.750.001902.023.60137.02.852
gcr0.640.0000.620.0015936.8117.042194.45.002
glcm0.330.0030.280.001923.632.98321.80.044
gria20.580.0000.550.0023159.495.870100.36.522
grik10.540.0000.570.0041198.122.33845.24.290
hdac20.390.0000.360.0032752.988.17655.11.536
hdac80.400.0000.360.0043001.774.00283.52.987
hivint0.380.0000.380.0011542.351.84763.82.193
hivpr0.760.0000.730.00126933.4678.027764.70.995
hivrt0.560.0010.520.0015961.3173.759233.79.952
hmdh0.850.0000.800.0044998.3136.453127.51.861
hs90a0.660.0000.650.0021772.926.21956.41.329
hxk40.650.0000.500.0031488.341.27359.92.133
igf1r0.460.0010.430.0045161.5144.369174.95.849
inha0.400.0000.420.004680.828.81811.70.104
ital0.410.0020.440.0035063.5158.899129.60.267
jak20.720.0000.690.0022058.158.98448.43.409
kif110.830.0000.680.0033439.9123.65854.43.115
kit0.380.0000.380.0013238.8110.01491.88.589
kith0.720.0010.690.003766.29.36824.10.997
kpcb0.570.0000.530.0053258.890.82494.72.935
lck0.410.0000.400.00111895.8307.804247.012.443
lkha40.570.0000.570.0033373.8105.16351.10.528
mapk20.630.0010.610.0011820.067.42560.53.590
mcr0.780.0000.730.0011616.948.16846.32.394
met0.710.0050.680.00511546.8450.466250.36.843
mk010.440.0000.390.0021259.433.75539.01.265
mk100.450.0000.460.0022520.886.61470.22.139
mk140.540.0010.530.00228472.3565.522692.514.399
mmp130.610.0000.580.00018288.5482.792358.318.123
mp2k10.450.0000.540.0013429.171.07569.36.095
nos10.340.0000.350.0012571.573.86358.72.016
nram0.880.0000.860.0021839.646.52843.63.260
pa2ga0.670.0000.660.0042310.867.88459.15.073
parp10.640.0000.660.0007534.6164.474261.37.200
pde5a0.500.0000.480.0037966.8132.170127.70.662
pgh10.700.0000.700.0033045.189.026102.05.014
pgh20.720.0000.700.0024841.0142.901201.67.323
plk10.600.0000.540.0034137.3131.70759.50.316
pnph0.720.0000.670.0051321.849.86029.90.095
ppara0.670.0000.650.0029214.1279.082184.12.299
ppard0.390.0010.370.0067555.7234.371194.54.637
pparg0.410.0000.370.00113606.9308.089388.712.057
prgr0.800.0000.750.0045894.0155.765208.64.872
ptn10.350.0000.360.0021233.629.35536.52.168
pur20.470.0000.370.0071035.231.17728.71.284
pygm0.610.0000.620.0021091.730.71836.01.144
pyrd0.810.0000.810.0041474.637.57034.40.277
reni0.680.0010.650.0056085.1234.368253.13.662
rock10.560.0000.560.0011414.944.63126.40.190
rxra0.720.0000.550.0062236.758.55550.50.268
sahh0.850.0000.610.002549.812.6089.60.279
src0.570.0030.550.00123435.6686.1521187.215.975
tgfr10.530.0000.530.0012329.276.98248.80.365
thb0.790.0000.750.0033150.582.19283.23.025
thrb0.500.0000.480.00213973.6228.488444.43.482
try10.590.0000.600.00212992.8261.488384.50.510
tryb10.420.0000.380.0043221.792.22663.82.376
tysy0.600.0000.580.0043038.2113.09972.70.099
urok0.380.0000.370.0012944.992.74477.00.357
vgfr20.520.0000.540.0019023.2150.967242.15.486
wee10.700.0010.560.0023547.3120.77277.22.831
xiap0.860.0000.800.0053272.1112.115101.84.684
mean0.580.0000.550.0035878.3148.367171.64.183

For each query compound, the average AUC value and the mean running time (in seconds) over 100 independent executions were computed with both OpR and OpF. For the sake of completeness, the standard deviation SD is also provided for both OpR and OpF versions. The last row of the table shows average values for the query molecules.

DUD database with hydrogens. For each query compound, the average AUC value and the mean running time (in seconds) over 100 independent executions were computed with both OpR and OpF. For the sake of completeness, the SD is also provided for both OpR and OpF versions. The last row of the table shows average values for the query molecules. DUD-E database with hydrogens. For each query compound, the average AUC value and the mean running time (in seconds) over 100 independent executions were computed with both OpR and OpF. For the sake of completeness, the standard deviation SD is also provided for both OpR and OpF versions. The last row of the table shows average values for the query molecules. Broadly speaking, the mean AUC value increases slightly when the hydrogen atoms are considered in DUD database, for both OpR and OpF algorithms. See last row of Tables 3 and 6. In particular, an increment of 0.03 (resp. 0.01) has been obtained for OpR (resp. OpF). In addition, for 23 out of 40 cases, OpR obtains better AUC values when the hydrogens are considered. Regarding OpF, it happens for 20 out of 40 instances. The same increasing tendency can be appreciated in the mean AUC value when the DUD-E database is considered. Please, see Tables 4 and 7. In this case, an increment of 0.02 has been obtained for both OpR and OpF algorithms. Both OpR and OpF obtain better AUC values in more than half of the cases (58 out of 102 for OpR and 67 out of 102 for OpF). In general terms, considering the hydrogens increases the average computing time. Compare again Tables 3 and 6 for DUD database, and Tables 4 and 7 for DUD-E benchmark. As can be seen, the time increases 2.9x times for both OpR and OpF when DUD is considered as input. For the DUD-E case, the increase is of 4.8x and 5.7x for OpR and OpF, respectively. Of course, the larger the number of atoms considered for a compound, the higher the computing time associated to its evaluation, but the more realistic the associated scoring function value. Therefore, based on the results, it can be concluded that a more realistic classification of compounds can be obtained if hydrogen atoms are considered. In such a case, the computing time can be reduced by using high-performance computing approaches.

Results obtained for Maybridge database

Finally, a study has been conducted to show the utility of OpR, i.e. it can find good quality solutions when possible. The effectiveness of OpR has been analyzed when it is executed with the Maybridge database considering hydrogens. In particular, a set of query compounds were selected from such a database. The choice procedure was carried out as follows: the Maybridge dataset was initially sorted according to the number of atoms of the compounds and split into 38 intervals. Then, a single compound was randomly chosen for each interval. Table 8 summarizes the obtained results. In particular, it is shown: (i) the number nC of compounds with a number of atoms included in the interval ; (ii) the randomly selected query from such an interval, and (iii) the other molecule from Maybridge (BestComp) with the highest shape similarity value (Tc) according to OpR. The last row of the table shows the total number of compounds with nA < 95 (resp. nA ≥ 95) and the average Tc value. Notice that there exist intervals with 0 compounds, we note those cases by including ‘−’ in the corresponding columns.
Table 8

Maybridge database.

[i, j)nCQueries with nA < 95[i, j)nCQueries with nA ≥ 95
queryBestCompTcqueryBestCompTc
[0, 5)0[95, 100)6JFD01206JFD012030.930
[5, 10)2CD08226RF016820.940[100, 105)3JFD00633JFD019150.875
[10, 15)93AC10702KM033310.982[105, 110)3JFD02451JFD024520.762
[15, 20)968AC10402RF033150.939[110, 115)3JFD01915JFD006330.877
[20, 25)3469AC11546NRB008910.940[115, 120)1JFD02945RH004770.512
[25, 30)7050AC10751AC119680.991[120, 125)2BTB14731JFD016020.508
[30, 35)10414AC12586RH015480.895[125, 130)1JFD01714JFD017160.676
[35, 40)10623AC10018JFD006240.939[130, 135)0
[40, 45)9015AC10608HTS013690.867[135, 140)1JFD02946RJC017010.474
[45, 50)6085AW00180AW001740.873[140, 145)0
[50, 55)3008AW00136HTS032940.849[145, 150)1JFD02949JFD006550.552
[55, 60)1479JFD00968RJC020930.993[150, 155)2BTB12204BTB122050.600
[60, 65)648JFD03035NRB032910.972[155, 160)2BTB12205BTB122040.600
[65, 70)247HTS13346HTS133430.982[160, 165)1RJC01719BTB122140.487
[70, 75)108JFD01818RJC032310.976[165, 170)2RJC01701JFD024510.645
[75, 80)57JFD01718JFD017160.957[170, 175)0
[80, 85)50NRB03718NRB037750.991[175, 180)0
[85, 90)40JFD00292JFD002940.877[180, 185)1JFD02950JFD006550.417
[90, 95)14JFD01716JFD017180.959[185, 190)0
mean533700.940290.637

The number nC of queries from the database with a number of atoms is shown. From each interval, a query has been randomly selected, and the other molecule from the database (BestComp) with the highest Tc has been computed by using OpR. Note that the score Tc is equal to 1 when the query compound is compared with itself for all the instances, so that BestComp really represents the second most similar molecule to the query.

Maybridge database. The number nC of queries from the database with a number of atoms is shown. From each interval, a query has been randomly selected, and the other molecule from the database (BestComp) with the highest Tc has been computed by using OpR. Note that the score Tc is equal to 1 when the query compound is compared with itself for all the instances, so that BestComp really represents the second most similar molecule to the query. As can be seen in Table 8, OpR obtains an average Tc value of 0.940 for queries with nA < 95. This is not rare since the number of compounds with less than 95 atoms is equal to 53370, so the probability of finding similar molecules is relatively high. On the contrary, the average Tc value obtained by OpR for molecules with more than 95 atoms is equal to 0.637, which is not a bad figure if we consider that only 29 out of 53399 molecules have more than 95 atoms. Even so, OpR obtains good quality solutions for queries with more than 95 atoms. See for example, the instances JFD0120 and JFD0063, with 96 and 104 atoms, respectively. For those two cases, OpR has found compounds with Tc values of 0.930 and 0.875, even when the number of molecules with similar sizes is not high. Let focus now on the worst cases, i.e. those where OpR obtains the lowest Tc values. They are JFD02950 and JFD02946 with 180 and 135 atoms, respectively. Notice that there are not molecules in the database with similar sizes. More precisely, there are just 10 molecules, including JFD02950 and JFD02946, with . Therefore, the probability of discovering similar molecules in terms of shape is very low, since the most likely is that they do not exist. Then, from the results, it is possible to infer that OpR finds a high-quality solution to a given query when it exists in the corresponding database.

Conclusions and Future Work

This work has introduced the SSM OptiPharm, based on novel metaheuristic approaches and illustrated its performance in terms of prediction accuracy and running time when processing well-known benchmarks such as DUD, and in addition FDA dataset. Comparison made with WEGA show that OptiPharm offers the same predictive accuracy but at a much lower computational cost (average speedup is 5x). Another of the advantages of the method compared with WEGA is that its optimization algorithm is easily parameterizable so that very different heuristic schemes can be tested, and so it adapts itself to a given database depending on the average molecular size and topology, to name a few. Also, bearing in mind that OptiPharm, unlike WEGA, allows optimizing including the hydrogen atoms of the compounds. Results have shown that its consideration improves the predictions, although it is more costly from a computational point of view. High-performance computing approaches may be a good alternative to deal with this drawback. OptiPharm has been designed with parallelism in mind. Notice that each pose in the population can generate a new offspring without the participation of the remaining quaternions in the population, meaning that the Reproduction method can be easily parallelized by dividing the poses in the population among the available processing units. Similarly, the poses can also be enhanced by distributing them in the Improvement procedure. This means that OptiPharm can be drastically accelerated by using high-performance computing with practically no effort. In the future, several programming paradigms based on both shared and distributed memory architectures will be implemented and analyzed. In particular, a parallel version of OptiPharm will be implemented to be executed on GPUs, and compared with the GPU-accelerated WEGA[58].
  34 in total

1.  Drug discovery: a historical perspective.

Authors:  J Drews
Journal:  Science       Date:  2000-03-17       Impact factor: 47.728

Review 2.  High-throughput screening in drug metabolism and pharmacokinetic support of drug discovery.

Authors:  R E White
Journal:  Annu Rev Pharmacol Toxicol       Date:  2000       Impact factor: 13.820

3.  SQ: a program for rapidly producing pharmacophorically relevent molecular superpositions.

Authors:  M D Miller; R P Sheridan; S K Kearsley
Journal:  J Med Chem       Date:  1999-05-06       Impact factor: 7.446

Review 4.  Molecular dynamics simulations of biomolecules.

Authors:  Martin Karplus; J Andrew McCammon
Journal:  Nat Struct Biol       Date:  2002-09

Review 5.  In silico research in drug discovery.

Authors:  G C Terstappen; A Reggiani
Journal:  Trends Pharmacol Sci       Date:  2001-01       Impact factor: 14.819

6.  Variable selection and model validation of 2D and 3D molecular descriptors.

Authors:  Anthony Nicholls; Norah E MacCuish; John D MacCuish
Journal:  J Comput Aided Mol Des       Date:  2004 Jul-Sep       Impact factor: 3.686

7.  Benchmarking sets for molecular docking.

Authors:  Niu Huang; Brian K Shoichet; John J Irwin
Journal:  J Med Chem       Date:  2006-11-16       Impact factor: 7.446

8.  Enrichment of high-throughput screening data with increasing levels of noise using support vector machines, recursive partitioning, and laplacian-modified naive bayesian classifiers.

Authors:  Meir Glick; Jeremy L Jenkins; James H Nettles; Hamilton Hitchings; John W Davies
Journal:  J Chem Inf Model       Date:  2006 Jan-Feb       Impact factor: 4.956

9.  Managing, profiling and analyzing a library of 2.6 million compounds gathered from 32 chemical providers.

Authors:  Aurélien Monge; Alban Arrault; Christophe Marot; Luc Morin-Allory
Journal:  Mol Divers       Date:  2006-09-21       Impact factor: 2.943

10.  DrugBank: a comprehensive resource for in silico drug discovery and exploration.

Authors:  David S Wishart; Craig Knox; An Chi Guo; Savita Shrivastava; Murtaza Hassanali; Paul Stothard; Zhan Chang; Jennifer Woolsey
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

View more
  3 in total

1.  Electrostatic-field and surface-shape similarity for virtual screening and pose prediction.

Authors:  Ann E Cleves; Stephen R Johnson; Ajay N Jain
Journal:  J Comput Aided Mol Des       Date:  2019-10-24       Impact factor: 3.686

2.  A two-layer mono-objective algorithm based on guided optimization to reduce the computational cost in virtual screening.

Authors:  Miriam R Ferrández; Savíns Puertas-Martín; Juana L Redondo; Horacio Pérez-Sánchez; Pilar M Ortigosa
Journal:  Sci Rep       Date:  2022-07-27       Impact factor: 4.996

3.  On the Value of Using 3D Shape and Electrostatic Similarities in Deep Generative Methods.

Authors:  Giovanni Bolcato; Esther Heid; Jonas Boström
Journal:  J Chem Inf Model       Date:  2022-03-10       Impact factor: 4.956

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.