Similarity-Based Virtual Screen Using Enhanced Siamese Deep Learning Methods.

Mohammed Khaldoon Altalib, Naomie Salim.

Abstract

Traditional drug development is a long and complex process. Virtual screening is a computational technique that allows chemical compounds to be screened in an acceptable time and at an acceptable cost. Several databases contain information on various aspects of biologically active substances. Simple statistical tools are difficult to apply to them because of the enormous amount of information and the complexity of the structurally heterogeneous molecules recorded in these databases. Many techniques have been established for capturing the biological similarity between a test compound and a known target ligand in ligand-based virtual screening (LBVS). These methods perform well compared to their predecessors, especially on molecules that have homogeneous active structural elements, but their performance is unsatisfactory on molecules that are structurally heterogeneous. Deep learning models have recently achieved considerable success in a variety of disciplines because of their powerful generalization and feature extraction capabilities. In addition, the Siamese network has been used in similarity models for more complicated data samples, especially heterogeneous ones. The main aim of this study is to enhance the performance of similarity searching, especially for molecules that are structurally heterogeneous. The Siamese architecture is enhanced by using two similarity distance layers followed by one fusion layer to further improve the similarity measurements between molecules; for some models, several layers are added after the fusion layer to improve the retrieval recall. Four deep learning methods are used in this architecture: long short-term memory (LSTM), gated recurrent unit (GRU), convolutional neural network-one dimension (CNN1D), and convolutional neural network-two dimensions (CNN2D). 
A series of experiments have been carried out on real-world data sets, and the results have shown that the proposed methods outperformed the existing methods.
© 2022 The Authors. Published by American Chemical Society.

Year:  2022        PMID: 35187297      PMCID: PMC8851658          DOI: 10.1021/acsomega.1c04587

Source DB:  PubMed          Journal:  ACS Omega        ISSN: 2470-1343


Introduction

Drug development is a lengthy and complicated process. In conventional drug research and development, a biomolecular target is identified and high-performance screening experiments are performed to identify bioactive compounds for the specified target. Developing high-performance screening tests is expensive and time-consuming, and the process requires specialized laboratories with chemical and biological libraries.[1] In fact, the probability of success is low: it is estimated that only about 1 out of 5000 identified drug candidates is accepted and reaches widespread use.[2] Increased computing power, on the other hand, has allowed several million chemical compounds to be screened at an acceptable pace and cost. Virtual screening (VS) is a computational method for searching huge libraries of small molecules and choosing the structures most likely to bind to a drug target.[3−6] VS is conducted in the early discovery phases, in which broad chemical libraries are narrowed down to the most promising lead compounds. In the last few years, drug development has been accelerated by VS. Two main virtual screening techniques exist, namely, structure-based virtual screening (SBVS) and ligand-based virtual screening (LBVS).[7] SBVS techniques seek compounds that fit the binding site of the biological target. The central technology of SBVS methods is molecular docking.[8] The LBVS approach, on the other hand, is widely used for predicting molecular properties and for measuring molecular similarity because its way of representing molecules is easy and accurate. 
The significance of similarity searching stems from the importance of lead optimization in drug discovery programs, in which the close neighbors of an initial lead compound are examined to find better compounds.[9−12] Modern deep learning (DL) techniques have recently been introduced in many fields and have developed rapidly in the last few years, opening a new door for researchers. The success of DL techniques benefits from the rapid growth of DL algorithms and the advancement of high-performance computing. Moreover, DL techniques have lower generalization errors, which allows them to achieve good results on benchmarks and competitive tests and to make more precise predictions of molecular properties.[13−18] Features can also be discovered automatically from the input data using deep learning techniques.[3,19,20] In addition, the Siamese network is commonly used for solving image similarity and text similarity problems. It has been used for more complicated data samples, especially heterogeneous data samples with various dimensionalities and type characteristics.[21,22] Information about different aspects of biologically active compounds is held in a variety of databases. Some data sets contain classes of molecules that have structurally homogeneous active elements, like the MDL Drug Data Report data set MDDR-DS2, and others contain classes of molecules that are structurally heterogeneous, like MDDR-DS3 and Maximum Unbiased Validation (MUV). The vast volume of stored information and the complexity of the structurally heterogeneous molecules make it difficult to apply simple statistical tools. 
In this study, the Siamese deep learning model is enhanced by using two distance layers followed by a fusion layer that combines their results. Combining the two distances further improves the similarity measurements between molecules, particularly when dealing with different types of descriptors. In some models, multiple layers have been added after the fusion layer to improve the retrieval recall. Several deep learning methods have been used in this architecture: long short-term memory (LSTM-RNN) and gated recurrent unit (GRU-RNN), both of which are recurrent neural networks, as well as convolutional neural network-one dimension (CNN1D) and convolutional neural network-two dimensions (CNN2D). The main contributions of the paper are as follows: (1) The Siamese deep learning model is enhanced using two distance layers and a fusion layer that combines their results to further improve the similarity measurements between molecules, particularly when dealing with different types of descriptors; for some models, several layers are added after the fusion layer to improve the retrieval recall. (2) In comparison to benchmark approaches, the suggested method demonstrated encouraging results in terms of overall performance, especially when dealing with heterogeneous classes of molecules.

Related Work

Similarity-based virtual screening is widely considered to be one of the essential aspects of drug discovery. Several approaches were used to increase the retrieval effectiveness of the similarity searching methods. The 2D similarity methods have become widely used and very common. The fundamental theory behind the calculation of molecular similarity is that structurally similar molecules seem to be more likely to possess similar properties than structurally dissimilar molecules. The purpose of similarity searching, therefore, is to retrieve molecules that are structurally very similar to the reference structures of the consumer. Various coefficient methods allow for the quantification of the similarity/difference between molecule pairs. Many other studies have tested the output of several similarity coefficients, showing that the Tanimoto coefficient surpassed other similarities.[23−26] The Tanimoto coefficient has thus become the most common indicator of the similarity of chemical compounds used in chemoinformatics. Different approaches have been used over the years to enhance the performance of search methods for similarities. Some experiments attempted to incorporate methods from various disciplines. Many parallels between the retrieval of text information and cheminformatics have already indicated that techniques developed for the retrieval of text documents may be employed to improve the similitude of molecular searching.[27] Therefore, some approaches to molecular similarities, such as the Bayesian inference network, used by virtual ligand screening were originally based on text retrieval domains. In virtual screening, for example, Abdo et al.[28,29] have used the Bayesian network, which outpaced Tanimoto. In addition, the reweighting techniques were used in the text field to model retrieval of documents and adapted in the cheminformatic field in the retrieval model.[28,30] The fragment reweighting techniques were also used by Ahmed et al. 
to strengthen the Bayesian network.[28] Al-Dabagh used the concepts of quantum mechanics theory to enhance molecular similarity searching and molecular ranking of chemical compounds in LBVS.[31] Himmat M. created a new similarity measure by reweighting various bit strings and derived it from existing similarity measures. In addition, the author proposed ranking strategies for developing a substitute ranking technique.[32] Nasser and colleagues employed deep belief networks (DBNs) to reweight molecular features where many descriptors were utilized, each representing distinct relevant aspects, and integrated all new features from all descriptors to provide a new descriptor for similarity searches.[33] In recent years, new technologies of deep learning (DL) have been adopted and applied in drug discovery and bioinformatics and cheminformatics studies, opening a new door to computational decision making and to assist in the understanding of molecular mechanisms and the development of new therapeutic options for a variety of diseases.[20,34] Gómez-Bombarelli et al. proposed an autoencoder model that produces new molecules by converting discrete molecular representations to multidimensional continuous representations.[35,36] Skalic et al. proposed a new model to produce new molecules by converting the seed compound into a three-dimensional (3D) representation using a variational autoencoder and then sequencing SMILES tokens using convolutional and recurrent neural network systems to explore uncharted areas of the chemical space that still have lead compound-like characteristics.[37] Gao et al. proposed a generative network complex (GNC) model to create new drug like molecules using gradient descent in the latent space of an autoencoder for multiproperty optimization.[38] Hamza et al. used CNN to determine its precision during the prediction of orphan compound activities.[39] Also, Mendolia et al. 
used convolutional neural networks (CNNs), which are intended to identify a set of candidate compounds for a specific target protein in terms of their biological activity, and both 1D and 2D CNNs were trained separately to test the performance of every single descriptor.[40] Moreover, several researchers have proposed to exploit RNN-based methods for chemoinformatics. The majority of the researchers have utilized the model as a prediction or classifier model. Wan and Zeng proposed a model for compound–protein interaction prediction using DL methods, in which they adopted a commonly used NLP approach called feature incorporation.[100] Their model was built into multidimensional vectors, both ligand details (molecular fingerprints) and protein sequences. Also, SMILES representation of molecules is used in the RNN model. The RNN model was used to learn SMILES’ coding grammar, which can be converted into a molecular graph.[41] In addition, Goh et al. used SMILES as an input feature to the RNN model for predicting the molecular properties.[42] Furthermore, other studies reported that deep learning methods in the Siamese architecture as a similarity model produce the best performance that can be used with more complicated data samples, especially with heterogeneous data samples, with various dimensionalities and type characteristics.[21,22] For example, Yu et al. employed CNN Siamese architectures to assess whether two people are from the same family, allowing missing people to be reunited with their relatives.[43] Jonas et al. used the LSTM Siamese neural network to calculate the similarity between sentences.[44] In this method, an exponential Manhattan distance was used to measure the similarity between two sentences. In the drug discovery domain, Dhami et al. was using images as an input to predict drug interactions in a Siamese convolution network architecture.[45] Jeon et al. 
proposed a method that uses MLP Siamese neural networks (ReSimNet) in structure-based virtual screening (SBVS) and calculates the distance by cosine similarity.[22] The above methods perform well compared to their predecessors, especially on molecules that have homogeneous active structural elements, like the classes of molecules in the MDL Drug Data Report data set MDDR-DS2. However, their performance is unsatisfactory on structurally heterogeneous molecules, like the classes of molecules in the MDDR-DS3 data set and the Maximum Unbiased Validation (MUV) data set. The main goal of this research is to improve the retrieval effectiveness of the similarity model, especially for molecules that are structurally heterogeneous. Deep learning and the Siamese architecture have been used in this study because of the power of deep learning for dealing with big data and the power of the Siamese architecture for dealing with complicated data samples, especially heterogeneous ones. Several deep learning methods are examined as similarity models within the enhanced Siamese architecture. These methods are long short-term memory (LSTM-RNN), gated recurrent unit (GRU-RNN), convolutional neural network-one dimension (CNN1D), and convolutional neural network-two dimensions (CNN2D).
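
As a point of reference for the Tanimoto benchmark discussed in this section, the following is a minimal sketch of the continuous (nonbinary) form of the coefficient, which divides the dot product of two count vectors by the sum of their squared norms minus that dot product. The function name `tanimoto`, the example vectors, and the convention of returning 0.0 for two all-zero vectors are illustrative assumptions, not details taken from the paper.

```python
def tanimoto(a, b):
    """Continuous Tanimoto coefficient for two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Assumed convention: two all-zero vectors get similarity 0.0.
    return dot / denom if denom else 0.0


# Hypothetical count fingerprints for a query and a database molecule.
query = [2, 0, 1, 3]
candidate = [1, 0, 1, 0]

similarity = tanimoto(query, candidate)  # dot = 3, denominator = 14 + 2 - 3 = 13
print(round(similarity, 4))
```

In a similarity search, every database molecule would be scored against the query in this way and the database sorted by decreasing score.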

Methods

A Siamese neural network contains two identical artificial neural networks, each able to learn a hidden representation of its input. The two networks are joined by a distance layer that feeds a final layer, which predicts whether or not the two input vectors fall under the same category. Because all of the weights and biases are tied, the networks that make up the Siamese architecture are called twins, which means that both networks are symmetric. Both networks are trained with feed-forward propagation and error backpropagation. The Siamese architecture has therefore been used for more complicated data samples, especially heterogeneous data samples with various dimensionalities and type characteristics. In this paper, the Siamese deep learning model is enhanced. Figure 1 shows the flowchart of steps for enhancing the Siamese architecture.
Figure 1

Flowchart of steps for enhancing the Siamese architecture.

The steps for enhancing the Siamese architecture of deep learning methods are as follows:
(1) Several models of Siamese architectures in different fields were studied and analyzed, such as those of Dhami et al. and Jeon et al. in the drug discovery field and Jonas et al. in the text field.
(2) All previous studies used one distance layer. In this study, two distance layers are used, and one fusion layer then combines the results from the distance layers. The reason for using more than one distance layer is to further improve the similarity measurements between molecules, particularly when dealing with different types of descriptors. In general, there are two inputs and one output in this architecture; the output value represents the degree of similarity between the inputs.
(3) For some models, several layers have been added after the fusion layer to improve the retrieval recall.
(4) The hyperparameters of the Siamese deep learning similarity model, such as the number of epochs, the batch size, the optimizer, and the activation function, are tuned to achieve a good retrieval recall.
(5) Four deep learning methods have been used in this architecture: long short-term memory (LSTM-RNN), gated recurrent unit (GRU-RNN), convolutional neural network-one dimension (CNN1D), and convolutional neural network-two dimensions (CNN2D).
The following subsections explain each of the methods individually.

Enhanced Siamese RNN Similarity Model

Recurrent neural networks (RNNs) are artificial neural networks in which the connections between nodes form a directed graph along a temporal sequence. Unlike feed-forward neural networks, RNNs use an internal state (memory) to process sequences. This dynamic behavior makes them useful for audio processing, handwriting recognition, and many similar applications. However, RNNs suffer from the vanishing gradient problem during backpropagation: if the gradient value is extremely small, it cannot lead to effective learning. LSTM and GRU have been developed as a solution to this short-term memory problem. They have internal mechanisms, called gates, that regulate the flow of information.[46]

Enhanced Siamese LSTM Similarity Model

Long short-term memory (LSTM) is an RNN architecture with feedback connections, which in principle allow it to compute anything that a Turing machine can compute. A single LSTM unit is made up of a cell, an input gate, an output gate, and a forget gate, which allow the cell to retain values over arbitrary time intervals. The flow of data into and out of the LSTM cell is regulated by the gates.[47] An enhanced Siamese LSTM structure was used to determine how similar two molecules are, so the architecture has two inputs, each representing the fingerprint of a molecule: one from the query and the other from the fingerprint data set. The single output represents the degree of similarity and has two classes: a value of 1 means high similarity, and a value of 0 means high dissimilarity. The weights are tied in this architecture so that LSTMa = LSTMb. In this model, each input layer has a two-dimensional (32,32) arrangement of cells, each cell is linked to one molecular fingerprint feature, and each input layer is then linked to the distance layers. Two distances have been used. The first one is the Manhattan distance,[48] which can be represented as

$$d_{AB} = \sum_{i} \left| f_{A,i} - f_{B,i} \right|$$

where $d_{AB}$ is the Manhattan distance, $f_A$ is the feature vector of the query molecule, and $f_B$ is the feature vector of the data set molecule. The second one is the exponential Manhattan distance,[44] which can be given as

$$E_{AB} = \exp\left( -\sum_{i} \left| f_{A,i} - f_{B,i} \right| \right)$$

where $E_{AB}$ is the exponential Manhattan distance, $f_A$ is the feature vector of the query molecule, and $f_B$ is the feature vector of the data set molecule. Next, a fusion layer is added to fuse the two distance layers (Manhattan and exponential Manhattan). Then, three layers are added after the fusion layer; these layers have 512, 256, and 1 cells, respectively. The output is one of two cases: 1, meaning that the two input molecules are similar, and 0, meaning that the two input molecules are dissimilar. The ReLU activation function has been used for all dense layers except the last one, in which the sigmoid activation function has been used. Moreover, the RMSprop optimizer has been used, the loss function is binary_crossentropy, and the batch size is 64. The architecture of the Siamese RNN-LSTM similarity model is illustrated in Figure 2.
Figure 2

Architecture of the enhanced Siamese RNN-LSTM similarity model.
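
The two distance layers and the fusion step can be illustrated with a small sketch. It takes the Manhattan distance to be the sum of absolute feature differences and the exponential Manhattan distance to be the exponential of its negative, as in the sentence-similarity work the paper cites. The paper does not state how the fusion layer combines the two values, so the concatenation used in `fuse` is an assumption, as are all function names and the example vectors.

```python
import math


def manhattan(f_a, f_b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(f_a, f_b))


def exp_manhattan(f_a, f_b):
    """exp(-Manhattan distance): 1.0 for identical vectors, near 0 for distant ones."""
    return math.exp(-manhattan(f_a, f_b))


def fuse(f_a, f_b):
    # Assumption: the fusion layer simply concatenates the two distance outputs,
    # which the following dense layers would then map to a similarity score.
    return [manhattan(f_a, f_b), exp_manhattan(f_a, f_b)]


# Hypothetical feature vectors produced by the twin networks.
query_features = [1.0, 2.0, 3.0]
database_features = [3.0, 2.0, 0.0]

print(fuse(query_features, database_features))
print(fuse(query_features, query_features))
```

The two values carry complementary information: the Manhattan distance is unbounded and grows with dissimilarity, whereas the exponential form is bounded in (0, 1] and grows with similarity.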

Enhanced Siamese GRU Similarity Model

The gated recurrent unit (GRU) is an RNN architecture that is similar to the LSTM unit. Instead of the input, output, and forget gates of the LSTM, the GRU consists of a reset gate and an update gate. The update gate lets the model decide how much of the previous knowledge (from previous time steps) needs to be passed along to the future.[49] An enhanced Siamese GRU framework has been used to determine how similar two molecules are; therefore, the architecture has two inputs, each representing the fingerprint of a molecule: one from the query and the other from the fingerprint data set. The architecture has one output that represents the degree of similarity: 1 means high similarity, and 0 means high dissimilarity. The weights are tied in this architecture such that GRUa = GRUb. Each input layer has a two-dimensional (32,32) arrangement of cells, each cell is connected to one feature of the molecular fingerprint, and each input layer is then connected to the distance layers. Two distances are used (as mentioned in the previous subsection): the first one is the Manhattan distance, and the second one is the exponential Manhattan distance. Next, a fusion layer is added to fuse the two distance layers (Manhattan and exponential Manhattan). Then, three layers are added after the fusion layer; these layers have 512, 256, and 1 cells, respectively. The output is one of two cases: 1, meaning that the two input molecules are similar, and 0, meaning that the two input molecules are dissimilar. The ReLU activation function has been used for all layers except the last one, in which the sigmoid activation function has been used. Moreover, the RMSprop optimizer has been used, binary_crossentropy is the loss function, and 64 is the batch size. Figure 3 shows the architecture of the Siamese RNN-GRU similarity model.
Figure 3

Architecture of the enhanced Siamese RNN-GRU similarity model.
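
To make the role of the reset and update gates concrete, the sketch below implements one time step of a GRU with a single scalar unit. It is only an illustration of the standard GRU equations, not the model used in the paper: the parameter names in `params` are hypothetical, and the state update follows the convention in which an update gate close to 0 keeps the previous state (some libraries use the opposite convention).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU time step for a single scalar unit with parameters in dict p."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate scales how much of h_prev is visible.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Assumed convention: z near 0 keeps h_prev, z near 1 takes the candidate.
    return (1.0 - z) * h_prev + z * h_cand


# Hypothetical parameters; the large negative bias closes the update gate.
params = {"wz": 0.0, "uz": 0.0, "bz": -20.0,
          "wr": 0.5, "ur": 0.5, "br": 0.0,
          "wh": 1.0, "uh": 0.0, "bh": 0.0}
kept = gru_step(x=1.0, h_prev=0.7, p=params)

# Opening the update gate replaces the state with the candidate tanh(1.0).
params_open = dict(params, bz=20.0)
replaced = gru_step(x=1.0, h_prev=0.7, p=params_open)

print(kept, replaced)
```

With the update gate closed, the previous state passes through almost unchanged, which is how the unit carries information across many time steps.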

Enhanced Siamese CNN Similarity Model

The CNN is a type of deep feed-forward network that is easier to train and generalizes better than networks with full connectivity between adjacent layers.[50,51] In this work, the Siamese CNN framework has been used to determine how similar two molecules are. CNN1D (one dimension) and CNN2D (two dimensions) have been used in this architecture as follows.

Enhanced Siamese CNN1D Similarity Model

CNNs function in the same way whether they have one, two, or three dimensions. The difference lies in the structure of the input data and in how the filter, often referred to as a convolution kernel or feature detector, moves over the data. In this work, the Siamese CNN1D framework is used to calculate the similarity between a reference molecular structure and a database molecular structure based on fingerprints. Thus, the architecture has two inputs, each representing the fingerprint of a molecule: one from the reference structure (query) and the other from the database structure. The architecture has one output, representing the degree of similarity: a value of 1 means high similarity, and a value of 0 means high dissimilarity. The weights are tied in this architecture such that CNN1Da = CNN1Db. For each of the two inputs, a one-dimensional convolutional layer (1D-CNN) receives the molecular fingerprint and is followed by another 1D-CNN layer and then by max pooling with a pool size of 2. Each convolutional layer is formed by 64 filters with a kernel size of 3 and uses the rectified linear unit (ReLU) activation function. These layers are followed by a flatten layer and then a dense layer with a sigmoid activation function. Two distances are used: the Manhattan distance and the exponential Manhattan distance. Next, a fusion layer is added to fuse the two distance layers (Manhattan and exponential Manhattan). Then, one layer is added after the fusion layer, which serves as the output layer. The ReLU activation function is used for all layers except the dense layer before the distance layers and the output layer, in which the sigmoid activation function has been used. Moreover, the RMSprop optimizer has been used, binary_crossentropy is the loss function, and 64 is the batch size. Figure 4 shows the architecture of the Siamese CNN1D similarity model.
Figure 4

Architecture of the enhanced Siamese CNN1D similarity model.
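
The effect of the two convolutional layers and the pooling step on the length of the input can be traced with a small sketch. It uses a single filter instead of the 64 filters described in the text, assumes no padding and a stride of 1 (the paper does not state either), and uses hypothetical kernel weights and a toy 10-element fingerprint, so it only illustrates the mechanics.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal without padding (stride 1)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Toy fingerprint and hypothetical kernel weights (kernel size 3).
fingerprint = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
kernel = [1.0, -1.0, 1.0]

h1 = relu(conv1d_valid(fingerprint, kernel))   # first convolutional layer
h2 = relu(conv1d_valid(h1, kernel))            # second convolutional layer
pooled = max_pool1d(h2, size=2)                # max pooling with pool size 2

print(len(fingerprint), len(h1), len(h2), len(pooled))
```

Under these assumptions, each kernel-size-3 convolution shortens the sequence by 2 and the pooling step halves it, so an input of length 10 becomes 8, then 6, then 3.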

Enhanced Siamese CNN2D Similarity Model

An enhanced Siamese CNN2D framework has been used to calculate the similarity between a reference structure and a database structure based on 2D fingerprints; therefore, the architecture has two inputs, each representing the fingerprint of a molecule: one from the reference structure (query) and the other from the database structure. The architecture has one output, which represents the degree of similarity and has two classes: a value of 1 means high similarity, and a value of 0 means high dissimilarity. The weights are tied in this architecture such that CNN2Da = CNN2Db. For each of the two inputs, a two-dimensional convolutional layer (2D-CNN) receives the molecular fingerprint. This layer is formed of 64 filters with a kernel size of (3,3) and uses the rectified linear unit (ReLU) activation function. It is followed by another 2D-CNN layer formed of 64 filters with a kernel size of (3,3), max pooling with a pool size of (2,2), a flatten layer, and then a dense layer with a sigmoid activation function. Two distances are used: the first one is the Manhattan distance, and the second one is the exponential Manhattan distance. Next, a fusion layer is added to fuse the two distance layers (Manhattan and exponential Manhattan). Then, three layers are added after the fusion layer; these layers have 512, 256, and 1 cells, respectively. The ReLU activation function is used for all dense layers except the layer before the distance layers and the output layer, in which the sigmoid activation function has been used. Moreover, the RMSprop optimizer has been used, the loss function is binary_crossentropy, and the batch size is 64. Figure 5 shows the architecture of the Siamese CNN2D similarity model.
Figure 5

Architecture of the enhanced Siamese CNN2D similarity model.

Experimental Design

Data Sets

Experiments were conducted using the MDL Drug Data Report data sets (MDDR-DS1, MDDR-DS2, and MDDR-DS3)[52] and the Maximum Unbiased Validation (MUV) data set,[53] the most commonly used cheminformatics databases. In these databases, all molecules have been converted to Pipeline Pilot ECFC-4 fingerprints, and these databases have recently been used by our research group. The screening experiments were carried out with ten reference structures chosen randomly from each activity class. MDDR-DS1 has 102 516 molecules (active and inactive). The active molecules (about 8300 molecules) comprise 11 activity classes, some with structurally homogeneous active elements and others with structurally heterogeneous (i.e., structurally diverse) active elements. MDDR-DS2 also has 102 516 molecules (active and inactive). The active molecules (about 5100 molecules) consist of 10 homogeneous activity classes. MDDR-DS3 has 102 516 molecules (active and inactive). The active molecules (about 8600 molecules) consist of 10 heterogeneous activity classes. Tables 1–3 provide descriptions of all three data sets. Each row of a table includes the activity class, the number of molecules belonging to the class, and the diversity of the class, which was measured as the average pairwise Tanimoto similarity computed with ECFC-4 over all pairs of molecules in the class. The second data collection (MUV) was reported by Rohrer and Baumann and is described in Table 4. There are 17 activity classes in this data set, with each class containing up to 30 active and 15 000 inactive molecules. The class composition of this data set indicates that it involves classes with high diversity, that is, more heterogeneous activities. Our research group has used these data collections in previous articles.
Table 1

MDDR-DS1 Structure Activity Classes

| activity index | active molecules | activity class | pairwise similarity |
| --- | --- | --- | --- |
| 31420 | 1130 | renin inhibitors | 0.290 |
| 31432 | 943 | angiotensin II AT1 antagonists | 0.229 |
| 37110 | 803 | thrombin inhibitors | 0.180 |
| 71523 | 750 | HIV protease inhibitors | 0.198 |
| 42731 | 1246 | substance P antagonists | 0.149 |
| 07701 | 395 | D2 antagonists | 0.138 |
| 06245 | 359 | 5HT reuptake inhibitors | 0.122 |
| 78374 | 453 | protein kinase C inhibitors | 0.120 |
| 06235 | 827 | 5HT1A agonists | 0.133 |
| 06233 | 752 | 5HT3 antagonist | 0.140 |
| 78331 | 636 | cyclooxygenase inhibitors | 0.108 |
Table 3

MDDR-DS3 Structure Activity Classes

| activity index | active molecules | activity class | pairwise similarity |
| --- | --- | --- | --- |
| 09249 | 900 | muscarinic (M1) agonists | 0.111 |
| 31281 | 106 | dopamine-hydroxylase inhibitors | 0.125 |
| 12464 | 505 | nitric oxide synthase inhibitors | 0.102 |
| 71522 | 700 | reverse transcriptase inhibitors | 0.103 |
| 43210 | 957 | aldose reductase inhibitors | 0.119 |
| 12455 | 1400 | NMDA receptor antagonists | 0.098 |
| 75721 | 636 | aromatase inhibitors | 0.110 |
| 78351 | 2111 | lipoxygenase inhibitors | 0.113 |
| 78348 | 617 | phospholipase A2 inhibitors | 0.123 |
| 78331 | 636 | cyclooxygenase inhibitors | 0.108 |
Table 4

MUV Structure Activity Classes

| activity index | activity class | pairwise similarity |
| --- | --- | --- |
| 466 | S1P1 rec. (agonists) | 0.117 |
| 644 | Rho-Kinase2 (inhibitors) | 0.122 |
| 600 | SF1 (inhibitors) | 0.123 |
| 689 | Eph rec. A4 (inhibitors) | 0.113 |
| 652 | HIV RT-RNase (inhibitors) | 0.099 |
| 712 | HSP 90 (inhibitors) | 0.106 |
| 692 | SF1 (agonists) | 0.114 |
| 733 | ER-b-Coact. Bind. (inhibitors) | 0.114 |
| 713 | ER-a-Coact. Bind. (inhibitors) | 0.113 |
| 810 | FAK (inhibitors) | 0.107 |
| 737 | ER-a-Coact. Bind. (potentiators) | 0.129 |
| 846 | FXIa (inhibitors) | 0.161 |
| 832 | cathepsin G (inhibitors) | 0.151 |
| 858 | D1 rec. (allosteric modulators) | 0.111 |
| 852 | FXIIa (inhibitors) | 0.150 |
| 548 | PKA (inhibitors) | 0.128 |
| 859 | M1 rec. (allosteric inhibitors) | 0.126 |

Performance Evaluation Measures

The efficiency of the proposed methods is evaluated in three ways. The first is the recall metric, which is the fraction of active chemical compounds retrieved within the top 1% and top 5% of the ranked test set. This measure has been used in previous studies.[28,31−33,54−65] The whole data set is divided into K sets of equal size: one of them serves as the test set, and the remaining sets serve as training sets. The test set changes in each iteration, and the mean of the recall values from all iterations is taken as the final result. This method is called k-fold cross validation, as shown in Figure 6. In each iteration, ten queries randomly selected from the activity class are tested, and the mean value over these ten queries is calculated. To compare the proposed methods with previous studies, an improvement percentage was also calculated. For instance, the improvement percentage of GRU was calculated using the improvement equation against TAN, BIN, SQB, and SDBN. Next, the mean value was calculated; a positive result indicates an improvement in the retrieval recall of GRU compared with previous studies, and a negative result indicates that the retrieval recall of GRU was worse. Then, the mean improvement over all classes was calculated. The same procedure was applied to all proposed methods. In addition, the improvement percentage of each previous method relative to the proposed methods was calculated; for example, the improvement percentage of TAN was calculated against GRU, LSTM, CNN1D, and CNN2D, the mean value was calculated for each class, and then the mean over all classes in the data set was calculated.
Figure 6

Idea of cross validation for training and testing data.
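
The recall measure and its averaging over folds can be sketched as follows. The function names, the toy scores, and the per-fold values are hypothetical, and rounding the cutoff up to a whole number of compounds is an assumption, since the paper does not say how fractional cutoffs are handled.

```python
import math


def recall_at_top(scores, is_active, percent):
    """Fraction of all active compounds found in the top `percent` % of the ranking."""
    # Assumption: the cutoff is rounded up to a whole number of compounds.
    n_top = max(1, math.ceil(len(scores) * percent / 100.0))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:n_top] if is_active[i])
    total = sum(1 for flag in is_active if flag)
    return hits / total if total else 0.0


def mean(values):
    return sum(values) / len(values)


# Toy test set: similarity scores for 10 compounds, 3 of which are active.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_active = [True, False, True, False, False, False, False, True, False, False]

print(recall_at_top(scores, is_active, 20))  # top 2 compounds contain 1 of 3 actives
print(recall_at_top(scores, is_active, 30))  # top 3 compounds contain 2 of 3 actives

# Final result: mean recall over the folds (hypothetical per-fold values).
fold_recalls = [0.25, 0.5, 0.75]
print(mean(fold_recalls))
```

On the real data sets the same function would be called with `percent` set to 1 and 5, once per query and per fold, before averaging.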

Comparison methods: The second way is current approaches that can be used in assessing the results of the proposed model. These approaches include the following. TAN: Over the years, the Tanimoto similarity coefficient has been the search benchmark method in LBVS. The Tanimoto-based model for similarities employs the Tanimoto coefficient in its continuous form, which is suitable to nonbinary fingerprint data.[23] The second method is Bayesian inference in the MDDR data set (DS1, DS2, DS3, and MUV) for the ECFC-4 descriptor. This is an alternative method for calculating the similarity of molecular fingerprints.[29,61] The third method is quantum similarity search SQB(Complex) in the MDDR data set (DS1, DS2, DS3, and MUV) for the ECFC-4 descriptor. This method utilizes a quantum mechanics approach.[31] SDBN: The latest study is a multidescriptor-based on Stack of deep belief networks method in the MDDR data set (DS1, DS2, and DS3) for ECFC-4, ECFP-4, and EPFP-4 descriptors. The molecular features are reweighted using deep belief networks.[33] The third significant measure that can be used to evaluate the proposed methods, known as the significance test, is the Kendall W concordance test. This significance test has been used in previous studies.[28,33,54,60,61,63,64] This test can be interpreted as the concordance coefficient, which is a measure of agreement among the raters. Each case is a judge or rater in the Kendall W test, whereas each variable is an object or person being judged. For each variable, thus, the number of ranks is computed. The Kendall W test range is between 0, indicating no agreement, and 1, indicating full agreement. For example, the rank r by judge number j, which represents an activity class, where there are n objects and m judges in total, is given to object I as the similarity search tool. 
It is then possible to calculate the total rank given to object i as[66]

$$R_i = \sum_{j=1}^{m} r_{ij},$$

where $r_{ij}$ is the rank that judge j gives to object i, whereas the mean of the total ranks is

$$\bar{R} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{m(n+1)}{2}.$$

The sum of squared deviations δ is defined as

$$\delta = \sum_{i=1}^{n} \left(R_i - \bar{R}\right)^2.$$

Then, the Kendall W statistic is defined as

$$W = \frac{12\,\delta}{m^2\left(n^3 - n\right)}.$$

Kendall's W lies between zero and one because it is the variance of the rank sums divided by its maximum possible value, which occurs when all judges are in complete agreement. This test shows whether a group of judges makes equivalent decisions about the ranking of a set of items. In this analysis, the judges are the activity classes of each data set, whereas the items are the recall rates of the different search models. The significance level associated with the Kendall coefficient is an important part of this experiment, since it verifies whether the value of the coefficient could have occurred by chance. If the value was significant (both 0.01 and 0.05 cutoff values were used), it was then possible to assign the items an overall ranking. For a clearer comparison between the recall values of the proposed methods and previous studies, the improvement percentage for each proposed method was calculated using the improvement equation.[67]
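The Kendall W computation can be illustrated with a short sketch. It uses the standard untied form of the statistic, W = 12δ / (m²(n³ − n)), where δ is the sum of squared deviations of the rank totals from their mean m(n + 1)/2. The function names and the input layout (`ranks[j][i]` is the rank from judge j for object i) are assumptions made for illustration, and no tie correction is applied.

```python
def kendall_w(ranks):
    """Kendall's W for a complete rank matrix without ties.

    ranks[j][i] is the rank that judge j (an activity class) gives to
    object i (a similarity search method).
    """
    m = len(ranks)       # number of judges
    n = len(ranks[0])    # number of objects
    totals = [sum(judge[i] for judge in ranks) for i in range(n)]
    mean_total = m * (n + 1) / 2
    delta = sum((total - mean_total) ** 2 for total in totals)
    return 12 * delta / (m ** 2 * (n ** 3 - n))


def chi_square_statistic(w, m, n):
    """Large-sample test statistic m(n - 1)W, compared against a
    chi-square distribution with n - 1 degrees of freedom."""
    return m * (n - 1) * w


# Two judges in full agreement give W = 1; two judges with exactly
# reversed rankings give W = 0.
print(kendall_w([[1, 2, 3], [1, 2, 3]]))
print(kendall_w([[1, 2, 3], [3, 2, 1]]))
```

The P value would then follow from the upper tail of the chi-square distribution, which is omitted here to keep the sketch free of third-party dependencies.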

Results and Discussion

The experimental results for the ECFC-4 descriptor on the MDDR-DS1, MDDR-DS2, MDDR-DS3, and MUV data sets are provided in Tables 5–12, using 1% and 5% cutoffs. These tables record the results of the proposed deep learning methods alongside the benchmark TAN and the previous studies BIN, SQB, and SDBN. Each row in the tables lists the recall values for the top 1% or 5% of an activity class, and the best recall rate in each row is shaded. The mean rows give the average over all activity classes, and the shaded-cells rows give, for each method, the total number of activity classes in which it achieved the best (shaded) value. The distribution of the results in the tables is shown as boxplots in Figures 7–14.
Table 5

Top 1% Retrieval Results for MDDR-DS1 Data Set for Descriptor ECFC-4

Table 12

Top 5% Retrieval Results for MUV Data Set for Descriptor ECFC-4

Figure 7

Boxplot for recall result distribution for each method in MDDR-DS1 at the top 1%.

Figure 14

Boxplot for recall result distribution for each method in MUV at the top 5%.

The MDDR-DS1 recall values for the 1% and 5% cutoffs, recorded in Tables 5 and 6, respectively, show that the proposed Siamese deep learning approaches are clearly superior to the benchmark TAN method and the other studies. Among all methods, the CNN1D approach gives the best retrieval recall results in Table 5 in both the mean and the number of shaded cells, followed by CNN2D, GRU, SDBN, BIN, LSTM, SQB, and TAN. The boxplot in Figure 7 compares the distribution of results of the methods in MDDR-DS1 at the top 1% in terms of maximum, upper quartile, mean, median, and lower quartile values. The top four methods are CNN1D, CNN2D, GRU, and LSTM in maximum values; CNN1D, CNN2D, GRU, and LSTM in upper quartile values; CNN1D, CNN2D, GRU, and SDBN in mean values; CNN1D, CNN2D, SDBN, and GRU in median values; and CNN1D, CNN2D, SDBN, and GRU in lower quartile values.

Likewise, the CNN1D approach gives the best retrieval recall results in Table 6 in both the mean and the number of shaded cells, followed by CNN2D, GRU, LSTM, SDBN, BIN, SQB, and TAN. The boxplot in Figure 8 compares the distribution of results of the methods in MDDR-DS1 at the top 5% in terms of the same five statistics. The top four methods are CNN1D, SDBN, CNN2D, and BIN in maximum values; CNN1D, CNN2D, GRU, and SDBN in upper quartile values; CNN1D, CNN2D, GRU, and LSTM in mean values; CNN1D, CNN2D, LSTM, and GRU in median values; and CNN1D, CNN2D, LSTM, and GRU in lower quartile values.
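The boxplot comparisons rank the methods by five summary statistics of their per-class recall values. A minimal sketch of that procedure is given below. The helper names are hypothetical, the quartile convention (`method="inclusive"`) is an assumption because the paper does not state which one its plots use, and the recall values in the usage example are invented solely to show the call pattern.

```python
import statistics


def boxplot_summary(values):
    """Five summary statistics used in the boxplot comparisons."""
    lower_q, median, upper_q = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "max": max(values),
        "upper_quartile": upper_q,
        "mean": statistics.fmean(values),
        "median": median,
        "lower_quartile": lower_q,
    }


def top_methods(recalls_by_method, statistic, k=4):
    """Names of the k methods with the largest value of one statistic."""
    summary = {
        name: boxplot_summary(values)[statistic]
        for name, values in recalls_by_method.items()
    }
    return sorted(summary, key=summary.get, reverse=True)[:k]


# Invented recall values for three placeholder methods (not from the paper).
example = {"A": [10.0, 20.0, 30.0], "B": [40.0, 50.0, 60.0], "C": [0.0, 0.0, 10.0]}
print(top_methods(example, "mean", k=2))
```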
Table 6

Top 5% Retrieval Results for MDDR-DS1 Data Set for Descriptor ECFC-4

Figure 8

Boxplot for recall result distribution for each method in MDDR-DS1 at the top 5%.

Furthermore, the MDDR-DS2 recall values recorded for the top 1% in Table 7 show that the proposed Siamese deep learning method CNN1D is clearly superior to the benchmark TAN method and previous studies. CNN1D gives the best retrieval recall results in both the mean and the number of shaded cells. In terms of the mean value, the next best are SDBN, BIN, and SQB, followed by CNN2D, LSTM, GRU, and finally TAN. The boxplot in Figure 9 compares the distribution of results of the methods in MDDR-DS2 at the top 1% in terms of maximum, upper quartile, mean, median, and lower quartile values. The top four methods are BIN, SQB, CNN1D, and SDBN in maximum values; SDBN, CNN1D, CNN2D, and BIN in upper quartile values; CNN1D, BIN, SQB, and CNN2D in mean values; CNN1D, CNN2D, SDBN, and BIN in median values; and CNN1D, SDBN, BIN, and SDBN in lower quartile values.

By contrast, the MDDR-DS2 recall values recorded for the 5% cutoff in Table 8 show that the BIN method gives the best retrieval recall results in terms of the mean and the number of shaded cells. In terms of the mean value, the next best are SQB, SDBN, CNN1D, CNN2D, LSTM, and finally TAN. The boxplot in Figure 10 compares the distribution of results of the methods in MDDR-DS2 at the top 5% in terms of the same five statistics. The top four methods are BIN, SQB, SDBN, and CNN1D in all five statistics (maximum, upper quartile, mean, median, and lower quartile values).
Table 7

Top 1% Retrieval Results for MDDR-DS2 Data Set for Descriptor ECFC-4

Figure 9

Boxplot for recall result distribution for each method in MDDR-DS2 at the top 1%.

Table 8

Top 5% Retrieval Results for MDDR-DS2 Data Set for Descriptor ECFC-4

Figure 10

Boxplot for recall result distribution for each method in MDDR-DS2 at the top 5%.

In addition, the MDDR-DS3 recall values recorded for the top 1% and 5% in Tables 9 and 10, respectively, show that the proposed Siamese deep learning methods are clearly superior to the benchmark TAN method and the other studies. In Table 9, the CNN1D method gives the best retrieval recall results in terms of the mean and the number of shaded cells, compared with previous studies and the other Siamese deep learning methods. The second best is CNN2D, followed by SDBN, GRU, BIN, SQB, TAN, and LSTM. The boxplot in Figure 11 compares the distribution of results of the methods in MDDR-DS3 at the top 1% in terms of maximum, upper quartile, mean, median, and lower quartile values. The top four methods are CNN1D, CNN2D, GRU, and SDBN in maximum and upper quartile values, and CNN1D, CNN2D, SDBN, and GRU in mean, median, and lower quartile values.

In Table 10, the CNN1D method again gives the best retrieval recall results in terms of the mean and the number of shaded cells, followed by CNN2D, GRU, SDBN, TAN, BIN, SQB, and finally LSTM. The boxplot in Figure 12 compares the distribution of results of the methods in MDDR-DS3 at the top 5% in terms of the same five statistics. The top four methods are CNN1D, CNN2D, GRU, and LSTM in maximum and upper quartile values; CNN1D, CNN2D, GRU, and SDBN in mean and median values; and CNN1D, CNN2D, SDBN, and GRU in lower quartile values.
Table 9

Top 1% Retrieval Results for MDDR-DS3 Data Set for Descriptor ECFC-4

Table 10

Top 5% Retrieval Results for MDDR-DS3 Data Set for Descriptor ECFC-4

Figure 11

Boxplot for recall result distribution for each method in MDDR-DS3 at the top 1%.

Figure 12

Boxplot for recall result distribution for each method in MDDR-DS3 at the top 5%.

Moreover, the MUV data set recall values recorded for the 1% and 5% cutoffs in Tables 11 and 12, respectively, show that the proposed Siamese deep learning CNN methods are clearly superior to the benchmark TAN method and previous studies. In Table 11, the CNN1D method gives the best retrieval recall results in terms of the mean. The next best are BIN and CNN2D, followed by GRU, LSTM, SQB, and finally TAN. The boxplot in Figure 13 compares the distribution of results of the methods in MUV at the top 1% in terms of maximum, upper quartile, mean, median, and lower quartile values. The top four methods are CNN1D, CNN2D, BIN, and SQB in maximum and upper quartile values; CNN1D, BIN, CNN2D, and GRU in mean values; BIN, CNN2D, CNN1D, and SQB in median values; and BIN, CNN2D, CNN1D, and GRU in lower quartile values.

In Table 12, the CNN1D method gives the best retrieval recall results in terms of the mean and the number of shaded cells, followed by CNN2D, BIN, SQB, TAN, GRU, and LSTM. The boxplot in Figure 14 compares the distribution of results of the methods in MUV at the top 5% in terms of the same five statistics. The top four methods are CNN1D, CNN2D, BIN, and SQB in maximum, upper quartile, and mean values; BIN, CNN2D, SQB, and CNN1D in median values; and BIN, CNN2D, GRU, and SQB in lower quartile values.
Table 11

Top 1% Retrieval Results for MUV Data Set for Descriptor ECFC-4

Figure 13

Boxplot for recall result distribution for each method in MUV at the top 1%.

Moreover, the Kendall W concordance test was applied. Table 13 shows the ranking of the enhanced Siamese deep learning methods (RNN-GRU, RNN-LSTM, CNN1D, CNN2D) against the previous studies TAN, BIN, SQB, and SDBN using the Kendall W test results for MDDR-DS1, MDDR-DS2, MDDR-DS3, and MUV at the top 1% and top 5%. The first method is the Tanimoto coefficient (TAN), the second is Bayesian inference (Abdo),[29] the third is the quantum similarity search SQB-Complex (Al-Dabbagh),[31] and the last is the multidescriptor method based on a stack of deep belief networks (Nasser).[33] For all of the data sets used, the Kendall W test at the top 1% gives significance (P) values below 0.05. This means that the ranking of the methods, including the enhanced Siamese deep learning methods, is statistically significant in all cases at a cutoff of 1%. The general ranking therefore indicates that the enhanced Siamese CNN methods are superior to the previous studies and the benchmark TAN. The overall ranking shows that CNN1D has the top rank in the DS1, DS2, and DS3 data sets, while BIN has the top rank in the MUV data set.
Table 13

Ranking of Enhanced Siamese Deep Learning (RNN-GRU, RNN-LSTM, CNN1D, CNN2D) Methods based on TAN, BIN, SQB, and SDBN Using Kendall W Test Results for DS1, DS2, DS3, and MUV at the Top 1% and 5%

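Each ranked cell gives the method followed by its mean rank in parentheses.

| Data set | Retrieval percentage (%) | W | P | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DS1 | 1 | 0.64 | 2.24 × 10⁻⁸ | CNN1D (7.91) | CNN2D (6.45) | SDBN (5.36) | GRU (4.27) | BIN (4.00) | LSTM (3.00) | SQB (2.55) | TAN (2.45) |
| DS1 | 5 | 0.66 | 1.1601 × 10⁻⁸ | CNN1D (7.73) | CNN2D (6.73) | GRU (4.91) | SDBN (4.64) | LSTM (4.27) | BIN (4.00) | TAN (2.64) | SQB (1.91) |
| DS2 | 1 | 0.49 | 1.471 × 10⁻⁵ | CNN1D (6.9) | SDBN (5.8) | BIN (5.65) | CNN2D (4.9) | SQB (4.85) | LSTM (3.2) | GRU (2.9) | TAN (1.8) |
| DS2 | 5 | 0.47 | 2.8157 × 10⁻⁵ | BIN (6.85) | SQB (6.25) | SDBN (5.5) | CNN1D | CNN2D | TAN | LSTM (2.95) | GRU (2.25) |
| DS3 | 1 | 0.64 | 1.4015 × 10⁻⁷ | CNN1D (7.45) | CNN2D (6.45) | SDBN (6) | BIN (4.4) | GRU (3.9) | SQB (3.1) | TAN (2.9) | LSTM (1.8) |
| DS3 | 5 | 0.74 | 7.00 × 10⁻⁹ | CNN1D (7.7) | CNN2D (7.3) | SDBN (5.1) | GRU (4.9) | LSTM (3) | SQB (2.8) | TAN (2.6) | BIN (2.6) |
| MUV | 1 | 0.52 | 9.62 × 10⁻¹⁰ | BIN (6.23) | CNN2D (5.235) | CNN1D (5) | GRU (3.76) | LSTM (3.24) | SQB (2.71) | TAN (1.82) | |
| MUV | 5 | 0.33 | 9.5856 × 10⁻⁶ | BIN (5.56) | CNN2D (5.21) | CNN1D (4.91) | SQB (3.65) | GRU (3.47) | TAN (2.76) | LSTM (2.44) | |

For DS2 at the top 5%, the mean ranks of CNN1D, CNN2D, and TAN are run together in the source as "5.146" and are left unseparated.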
The same holds for the Kendall W test at the top 5%: the associated probability (P) values are below 0.05, which indicates that the ranking of the methods, including the enhanced Siamese deep learning methods, is significant in all cases at a cutoff of 5%. The overall ranking indicates that the enhanced Siamese CNN1D is superior to previous studies for DS1 and DS3, whereas BIN has the top rank in DS2 and MUV at the top 5%. Figures 15 and 16 show the ranking of the enhanced Siamese deep learning methods (RNN-GRU, RNN-LSTM, CNN1D, CNN2D) against TAN, BIN, SQB, and SDBN using the Kendall W test results for DS1, DS2, DS3, and MUV at the top 1% and 5%.
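As a consistency check on Table 13, W can be recovered from the reported mean ranks alone. Because each rank total equals m times the mean rank, the number of judges cancels and the untied statistic reduces to W = 12 Σ(r̄ᵢ − (n + 1)/2)² / (n(n² − 1)). The sketch below applies this to two rows of Table 13; the function name is hypothetical, and the absence of a tie correction is an assumption.

```python
def w_from_mean_ranks(mean_ranks):
    """Kendall's W computed from per-object mean ranks (no tie correction)."""
    n = len(mean_ranks)
    center = (n + 1) / 2
    spread = sum((rank - center) ** 2 for rank in mean_ranks)
    return 12 * spread / (n * (n ** 2 - 1))


# Mean ranks copied from Table 13.
ds1_top1 = [7.91, 6.45, 5.36, 4.27, 4.00, 3.00, 2.55, 2.45]
ds3_top5 = [7.7, 7.3, 5.1, 4.9, 3, 2.8, 2.6, 2.6]

print(round(w_from_mean_ranks(ds1_top1), 2))  # 0.64, matching the reported W
print(round(w_from_mean_ranks(ds3_top5), 2))  # 0.74, matching the reported W
```

Agreement with the reported W values suggests that the mean ranks in these rows were extracted correctly.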
Figure 15

Ranking of enhanced Siamese deep learning (RNN-GRU, RNN-LSTM, CNN1D, CNN2D) methods based on TAN, BIN, SQB, and SDBN using Kendall W test results for DS1, DS2, DS3, and MUV at the top 1%.

Figure 16

Ranking of enhanced Siamese deep learning (RNN-GRU, RNN-LSTM, CNN1D, CNN2D) methods based on TAN, BIN, SQB, and SDBN using Kendall W test results for DS1, DS2, DS3, and MUV at the top 5%.

For another comparison between the recall values of the proposed methods and prior studies, the improvement percentage was calculated for the proposed methods and the prior methods for each data set, as shown in Table 14. In the DS1 data set, the proposed CNN methods have positive values at the top 1%, meaning that their retrieval recall improves on the prior methods; CNN1D has the highest improvement percentage, followed by CNN2D. All previous methods have negative values, meaning that their retrieval recall is worse than that of the proposed methods. At the top 5%, all proposed methods have positive values, meaning that their retrieval recall improves on the prior methods; CNN1D has the highest improvement, followed by CNN2D, GRU, and LSTM. All prior methods have negative values at the top 5%, meaning that their retrieval recall is worse than that of the proposed methods.
Table 14

Improvement Percentage of the Proposed Methods and Prior Methods for Each Data Set

 
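TAN, BIN, SQB, and SDBN are the previous studies; GRU, LSTM, CNN1D, and CNN2D are the proposed methods.

| Data set | Cutoff | TAN | BIN | SQB | SDBN | GRU | LSTM | CNN1D | CNN2D |
|---|---|---|---|---|---|---|---|---|---|
| DS1 | top 1% | -59.037 | -27.955 | -34.942 | -14.029 | -1.155 | -15.051 | 39.320 | 25.401 |
| DS1 | top 5% | -57.819 | -47.508 | -58.173 | -31.669 | 16.858 | 14.108 | 39.509 | 31.406 |
| DS2 | top 1% | -37.911 | 2.400 | 1.852 | 3.946 | -5.190 | -4.989 | 11.243 | 3.606 |
| DS2 | top 5% | -20.062 | 7.746 | 7.137 | 5.812 | -8.796 | -7.275 | 1.597 | -2.530 |
| DS3 | top 1% | -79.723 | -56.487 | -70.460 | -6.758 | -35.122 | -107.770 | 44.872 | 31.746 |
| DS3 | top 5% | -86.018 | -87.382 | -91.183 | -38.262 | 17.277 | -9.266 | 56.480 | 49.208 |
| MUV | top 1% | -93.350 | 16.035 | -77.736 | | -3.198 | -14.241 | 24.255 | 22.283 |
| MUV | top 5% | -20.652 | 10.123 | -11.820 | | -17.121 | -25.851 | 5.637 | 10.929 |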
In the DS3 data set, the proposed methods other than GRU and LSTM have positive values at the top 1%, meaning that their retrieval recall improves on the prior methods; CNN1D has the highest improvement, followed by CNN2D. Similarly, at the top 5%, all proposed methods except LSTM have positive values, and CNN1D has the highest improvement, followed by CNN2D and GRU. All prior methods have negative values at both the top 1% and 5%, meaning that their retrieval recall is worse than that of the proposed methods.

In the MUV data set at the top 1%, the proposed CNN methods have positive values, meaning that their retrieval recall improves on the prior methods; the previous BIN method also has a positive value. CNN1D has the highest improvement, followed by CNN2D and BIN. Similarly, at the top 5%, the proposed methods other than GRU and LSTM have positive values, and the previous BIN method also has a positive value. CNN2D has the highest improvement, followed by BIN and CNN1D, while GRU and LSTM have negative values, meaning that their retrieval recall is worse than that of the previous methods. TAN and SQB also have negative values, meaning that their retrieval recall is worse than that of the proposed methods.

In the DS2 data set, by contrast, the proposed CNN methods have positive values at the top 1%, meaning that their retrieval recall improves on the prior methods, but the previous studies BIN, SQB, and SDBN also have positive values, meaning that they improve on some of the proposed methods. CNN1D has the highest improvement, followed by SDBN, CNN2D, BIN, and SQB. At the top 5%, CNN1D is the only proposed method with a positive value. The previous studies BIN, SQB, and SDBN have positive values; BIN has the highest improvement, followed by SQB, SDBN, and the proposed CNN1D method.
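The orderings stated above can be checked directly against Table 14 by keeping the methods with a positive improvement percentage and sorting them in descending order. The sketch below does this for two rows whose values are copied from the table; the dictionary layout and the function name are illustrative choices, and SDBN is omitted for MUV because the table reports no value for it.

```python
# Improvement percentages copied from two rows of Table 14.
table14 = {
    ("DS2", "top 1%"): {
        "TAN": -37.911, "BIN": 2.400, "SQB": 1.852, "SDBN": 3.946,
        "GRU": -5.190, "LSTM": -4.989, "CNN1D": 11.243, "CNN2D": 3.606,
    },
    ("MUV", "top 5%"): {
        "TAN": -20.652, "BIN": 10.123, "SQB": -11.820,
        "GRU": -17.121, "LSTM": -25.851, "CNN1D": 5.637, "CNN2D": 10.929,
    },
}


def improved_methods(row):
    """Methods with a positive improvement percentage, best first."""
    positive = {method: value for method, value in row.items() if value > 0}
    return sorted(positive, key=positive.get, reverse=True)


for key, row in table14.items():
    print(key, improved_methods(row))
```

For DS2 at the top 1% this yields CNN1D, SDBN, CNN2D, BIN, SQB, and for MUV at the top 5% it yields CNN2D, BIN, CNN1D, in line with the discussion above.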

Conclusions

Many techniques for capturing the biological similarity between a test compound and a known target ligand in LBVS have been established. LBVS is based on the premise that compounds with related properties will show related target-binding behavior. Although these methods perform well compared with their predecessors, especially on molecules with homogeneous active structural elements, their performance is unsatisfactory on molecules that are structurally heterogeneous. The main goal of this research is to improve the retrieval effectiveness of the similarity model, especially for structurally heterogeneous molecules. Deep learning was used because of its powerful generalization and feature extraction capabilities and its ability to deal with big data, and the Siamese architecture was used because of its ability to deal with complicated, especially heterogeneous, data samples. The Siamese deep learning models were enhanced using two distance layers followed by a fusion layer that combines their results; for some models, multiple layers were added after the fusion layer to improve the similarity recall between a test compound and a known target ligand. Several deep learning methods were used in this architecture: LSTM, GRU, CNN1D, and CNN2D. The results showed that the proposed methods, especially the Siamese CNN similarity models, outperformed the standard Tanimoto coefficient (TAN) and previous studies (BIN, SQB, SDBN) at both the top 1% and 5%, especially on the MDDR-DS1, MDDR-DS3, and MUV data sets, which include heterogeneous classes.
Table 2

MDDR-DS2 Structure Activity Classes

| Activity index | Active molecules | Activity class | Pairwise similarity |
|---|---|---|---|
| 07707 | 207 | adenosine (A1) agonists | 0.229 |
| 42710 | 111 | CCK agonists | 0.361 |
| 31420 | 1130 | renin inhibitors | 0.290 |
| 64200 | 113 | cephalosporins | 0.322 |
| 64100 | 1346 | monocyclic β-lactams | 0.336 |
| 64500 | 126 | carbapenems | 0.260 |
| 64220 | 1051 | carbacephems | 0.269 |
| 75755 | 455 | vitamin D analogues | 0.386 |
| 07708 | 156 | adenosine (A2) agonists | 0.305 |