Mohammed Khaldoon Altalib1,2, Naomie Salim1. 1. School of Computing, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia. 2. Computer Science Department, College of Education for Pure Sciences, University of Mosul, 41002 Mosul, Iraq.
Abstract
Traditional drug development is a long and complex process. Virtual screening is a computational technique that allows large numbers of chemical compounds to be screened at an acceptable time and cost. Several databases hold information on various aspects of biologically active substances, but the enormous volume of data and the structurally heterogeneous molecules recorded in them make simple statistical tools difficult to apply. Many techniques have been established in ligand-based virtual screening (LBVS) for capturing the biological similarity between a test compound and a known target ligand. Although these methods perform well compared with earlier ones, especially on molecules that have homogeneous active structural elements, their performance is unsatisfactory on structurally heterogeneous molecules. Deep learning models have recently achieved considerable success in a variety of disciplines owing to their powerful generalization and feature extraction capabilities, and the Siamese network has been used in similarity models for more complicated data, especially heterogeneous data samples. The main aim of this study is to enhance the performance of similarity searching, especially for structurally heterogeneous molecules. The Siamese architecture is enhanced by using two similarity distance layers followed by one fusion layer to further improve the similarity measurements between molecules; for some models, several layers are added after the fusion layer to improve the retrieval recall. Four deep learning methods are used in this architecture: long short-term memory (LSTM), gated recurrent unit (GRU), one-dimensional convolutional neural network (CNN1D), and two-dimensional convolutional neural network (CNN2D). A series of experiments carried out on real-world data sets showed that the proposed methods outperformed the existing methods.
Drug development is a lengthy and complicated process that ends in the creation of a new drug. In the course of conventional drug research and development, a biomolecular target is identified, and high-throughput screening experiments are performed to identify compounds that are bioactive against that target. Developing high-throughput screening assays is expensive and time-consuming, and the process requires specialized laboratories with chemical and biological libraries.[1] In fact, the probability of success is low: it is estimated that only about 1 in 5000 identified drug candidates is approved and enters widespread use.[2] The increase in computing power, on the other hand, has allowed several million chemical compounds to be screened at an acceptable pace and cost. Virtual screening (VS) is a computational method for searching huge libraries of small molecules and choosing the structures most likely to bind to a drug target.[3−6] VS is conducted in the early discovery phases, in which broad chemical libraries are narrowed down to the most promising lead compounds. In the last few years, VS has accelerated the development of drugs. There are two main virtual screening techniques, namely, structure-based virtual screening (SBVS) and ligand-based virtual screening (LBVS).[7] SBVS techniques seek compounds that fit the binding site of the biological target, and the central technology of SBVS methods is molecular docking.[8] The LBVS approach, on the other hand, is widely used for predicting molecular properties and for measuring molecular similarity because its molecular representations are simple and accurate. The significance of similarity searching stems from the importance of lead optimization in drug discovery programs, in which the close neighbors of an initial lead compound are examined to find better compounds.[9−12]

Recently, modern deep learning (DL) techniques were introduced
in many fields and have developed rapidly in recent years, opening a new door for researchers. The success of DL techniques stems from the rapid growth of DL algorithms and the advancement of high-performance computing. Moreover, DL techniques have lower generalization error, which allows them to achieve strong results on benchmarks and competitive tests and to make more precise predictions of molecular properties.[13−18] Deep learning techniques can also discover features automatically from input data.[3,19,20] In addition, the Siamese network is commonly used for solving image similarity and text similarity problems. It has been used for more complicated data, especially heterogeneous data samples with various dimensionalities and type characteristics.[21,22]

Information about different aspects of biologically active
compounds is held in a variety of databases. Some databases contain classes of molecules that have structurally homogeneous active elements, like the MDL Drug Data Report (MDDR_DR2) data set, and others contain classes of molecules that are structurally heterogeneous, like MDDR_DR3 and Maximum Unbiased Validation (MUV). However, the vast volume of stored information and the structurally heterogeneous molecules make it difficult to apply simple statistical tools. In this study, the Siamese deep learning model is enhanced by using two distance layers followed by a fusion layer that combines their results; combining them further improves the similarity measurements between molecules, particularly when dealing with different types of descriptors. In some models, multiple layers are added after the fusion layer to improve the retrieval recall. Four deep learning methods are used in this architecture: long short-term memory (LSTM-RNN) and gated recurrent unit (GRU-RNN), both of which are recurrent neural networks, one-dimensional convolutional neural network (CNN1D), and two-dimensional convolutional neural network (CNN2D). The following are the paper’s main contributions:
(1) The Siamese deep learning model is enhanced using two distance layers followed by a fusion layer that combines the results from the two distance layers to further improve the similarity measurements between molecules, particularly when dealing with different types of descriptors; for some models, several layers are added after the fusion layer to improve the retrieval recall.
(2) In comparison to benchmark approaches, the suggested method demonstrated encouraging results in terms of overall performance, especially when dealing with heterogeneous classes of molecules.
Related Work
Similarity-based virtual screening is widely considered to be one of the essential aspects of drug discovery, and several approaches have been used to increase the retrieval effectiveness of similarity searching methods. The 2D similarity methods have become very widely used. The fundamental assumption behind molecular similarity calculations is that structurally similar molecules are more likely to possess similar properties than structurally dissimilar molecules. The purpose of similarity searching, therefore, is to retrieve molecules that are structurally very similar to the user's reference structures. Various coefficients allow the similarity or difference between a pair of molecules to be quantified. Many studies have compared the output of several similarity coefficients and shown that the Tanimoto coefficient surpassed the others.[23−26] The Tanimoto coefficient has thus become the most common measure of the similarity of chemical compounds in chemoinformatics. Different approaches have been used over the years to enhance the performance of similarity search methods, and some studies attempted to incorporate methods from other disciplines. The many parallels between text information retrieval and cheminformatics indicate that techniques developed for retrieving text documents may be employed to improve molecular similarity searching.[27] Therefore, some approaches to molecular similarity used in virtual ligand screening, such as the Bayesian inference network, originated in the text retrieval domain. In virtual screening, for example, Abdo et al.[28,29] used the Bayesian network, which outperformed Tanimoto. In addition, reweighting techniques used in the text field to model document retrieval were adapted to retrieval models in cheminformatics.[28,30] Fragment reweighting techniques were also used by Ahmed et al. to strengthen the Bayesian network.[28] Al-Dabagh used concepts from quantum mechanics to enhance molecular similarity searching and the ranking of chemical compounds in LBVS.[31] Himmat M. created a new similarity measure, derived from existing similarity measures, by reweighting various bit strings; the author also proposed ranking strategies to develop an alternative ranking technique.[32] Nasser and colleagues employed deep belief networks (DBNs) to reweight molecular features from several descriptors, each representing distinct relevant aspects, and integrated the new features from all descriptors into a new descriptor for similarity searches.[33]

In recent years, new
technologies of deep learning (DL) have been adopted and applied in drug discovery, bioinformatics, and cheminformatics studies, opening a new door to computational decision making and assisting in the understanding of molecular mechanisms and the development of new therapeutic options for a variety of diseases.[20,34] Gómez-Bombarelli et al. proposed an autoencoder model that produces new molecules by converting discrete molecular representations to multidimensional continuous representations.[35,36] Skalic et al. proposed a model that produces new molecules by converting the seed compound into a three-dimensional (3D) representation using a variational autoencoder and then generating sequences of SMILES tokens with convolutional and recurrent neural networks, to explore uncharted areas of chemical space that still have lead-compound-like characteristics.[37] Gao et al. proposed a generative network complex (GNC) model that creates new drug-like molecules using gradient descent in the latent space of an autoencoder for multiproperty optimization.[38] Hamza et al. used a CNN and assessed its accuracy in predicting the activities of orphan compounds.[39] Mendolia et al. also used convolutional neural networks (CNNs) to identify a set of candidate compounds for a specific target protein in terms of their biological activity; both 1D and 2D CNNs were trained separately to test the performance of every single descriptor.[40] Moreover, several researchers have proposed RNN-based methods for chemoinformatics, most of whom used the model as a predictor or classifier. Wan and Zeng proposed a model for compound–protein interaction prediction using DL methods, in which they adopted a commonly used NLP approach called feature incorporation.[100] Their model encoded both ligand details (molecular fingerprints) and protein sequences as multidimensional vectors. SMILES representations of molecules have also been used in RNN models: an RNN was used to learn the SMILES coding grammar, which can be converted into a molecular graph.[41] In addition, Goh et al. used SMILES as an input feature to an RNN model for predicting molecular properties.[42]

Furthermore, other studies reported that deep learning methods
in the Siamese architecture, used as a similarity model, produce the best performance and can be applied to more complicated data, especially heterogeneous data samples with various dimensionalities and type characteristics.[21,22] For example, Yu et al. employed CNN Siamese architectures to assess whether two people are from the same family, allowing missing people to be reunited with their relatives.[43] Jonas et al. used an LSTM Siamese neural network to calculate the similarity between sentences, with an exponential Manhattan distance measuring the similarity between two sentences.[44] In the drug discovery domain, Dhami et al. used images as input to a Siamese convolutional network architecture to predict drug interactions.[45] Jeon et al. proposed an MLP Siamese neural network (ReSimNet) for structure-based virtual screening (SBVS) that calculates the distance by cosine similarity.[22]

Despite the good performances of the above methods
compared to their predecessors, especially on molecules that have homogeneous active structural elements such as the classes of molecules in the MDL Drug Data Report data set MDDR_DR2, their performance is unsatisfactory on structurally heterogeneous molecules such as the classes of molecules in the MDDR_DR3 and Maximum Unbiased Validation (MUV) data sets. The main goal of this research is to improve the retrieval effectiveness of the similarity model, especially for structurally heterogeneous molecules. Deep learning and the Siamese architecture are used in this study because of the power of deep learning for dealing with big data and the power of the Siamese architecture for dealing with complicated, and especially heterogeneous, data samples. Several deep learning methods are examined as similarity models within the enhanced Siamese architecture: long short-term memory (LSTM-RNN), gated recurrent unit (GRU-RNN), one-dimensional convolutional neural network (CNN1D), and two-dimensional convolutional neural network (CNN2D).
Methods
A Siamese neural network contains two identical artificial neural networks, each able to learn a hidden representation of its input; the two are joined by a distance layer that feeds a final layer, which predicts whether or not the two input vectors fall under the same category. Because all of the weights and biases are tied, the networks that make up the Siamese architecture are called twins, and the two networks are symmetric. Both networks are feed-forward perceptrons trained with error backpropagation. The Siamese architecture has therefore been used for more complicated data, especially heterogeneous data samples with various dimensionalities and type characteristics. In this paper, the Siamese deep learning model is enhanced. Figure 1 shows the flowchart of steps for enhancing the Siamese architecture.
Figure 1
Flowchart of steps for enhancing the Siamese architecture.

The steps for enhancing the Siamese architecture of deep learning methods include the following:
(1) Studying and analyzing many models of Siamese architectures in different fields, such as those of Dhami et al. and Jeon et al. in the field of structure-based virtual screening and Jonas et al. in the text field.
(2) All previous studies used one distance layer. In this study, two distance layers are used, and one fusion layer then combines the results from the distance layers. The reason for using more than one distance layer is to further improve the similarity measurements between molecules, particularly when dealing with different types of descriptors.
(3) In general, there are two inputs and one output in this architecture; the output value represents the degree of similarity between the inputs. In this study, several layers have been added after the fusion layer for some models to improve the retrieval recall.
(4) The hyperparameters of the Siamese deep learning similarity model, such as the number of epochs, the batch size, the optimizer, and the activation function, are tuned to achieve a good retrieval recall.

Four deep learning methods have been used in this architecture: long short-term memory (LSTM-RNN), gated recurrent unit (GRU-RNN), one-dimensional convolutional neural network (CNN1D), and two-dimensional convolutional neural network (CNN2D). The following subsections explain each of the methods individually.
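The enhancement steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-layer `encode` branch, the example weights, and fusion by simple concatenation are hypothetical choices, since the text states only that the fusion layer combines the outputs of the two distance layers. The sketch shows that both inputs pass through the same weights, so swapping the query and the database molecule leaves the fused output unchanged.

```python
import math


def encode(fingerprint, weights):
    """Toy encoder branch (hypothetical): one dense layer with a sigmoid activation."""
    outputs = []
    for row in weights:
        z = sum(w * x for w, x in zip(row, fingerprint))
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


def manhattan(f_a, f_b):
    """First distance layer: sum of absolute feature differences."""
    return sum(abs(a - b) for a, b in zip(f_a, f_b))


def exp_manhattan(f_a, f_b):
    """Second distance layer: exponential of the negative Manhattan distance."""
    return math.exp(-manhattan(f_a, f_b))


def siamese_fused_distances(query_fp, database_fp, weights, distance_layers):
    """Encode both inputs with the SAME weights, then apply every distance layer."""
    f_a = encode(query_fp, weights)      # twin branch a
    f_b = encode(database_fp, weights)   # twin branch b (tied weights)
    # Assumption: the fusion layer is modelled as concatenation of the distances.
    return [distance(f_a, f_b) for distance in distance_layers]


# Hypothetical weights and 4-bit fingerprints, for illustration only.
weights = [[0.5, -0.25, 0.1, 0.0], [0.2, 0.3, -0.4, 0.6]]
query = [1, 0, 1, 1]
candidate = [1, 1, 0, 1]
fused = siamese_fused_distances(query, candidate, weights, [manhattan, exp_manhattan])
```

In the models described below, the fused values would then feed the dense layers that follow the fusion layer; those layers are omitted here to keep the sketch short.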
Enhanced Siamese RNN Similarity Model
Recurrent neural networks (RNNs) are artificial neural networks in which the connections between nodes form a directed graph along a temporal sequence. Unlike feed-forward neural networks, RNNs use an internal state (memory) to process sequences. This dynamic behavior makes them applicable to audio processing, handwriting recognition, and many similar applications. However, RNNs suffer from vanishing gradients during backpropagation: if the gradient value becomes extremely small, it cannot drive effective learning. LSTM and GRU have been developed as a solution to this short-term memory problem. They have internal mechanisms called gates, which regulate the flow of information.[46]
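The vanishing gradient problem mentioned above can be illustrated numerically. The sketch below assumes a scalar RNN of the form h_t = tanh(w * h_{t-1} + x_t), which is a simplification chosen for illustration and not a model used in this study; the weight and input values are likewise hypothetical. By the chain rule, the gradient of the last state with respect to the initial state is a product of one factor per time step, and when each factor is smaller than 1 the product shrinks exponentially with sequence length.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h * h)
    return abs(grad)


# Same weight and inputs, different sequence lengths (hypothetical values).
short_sequence_grad = gradient_through_time(0.5, [0.1] * 5)
long_sequence_grad = gradient_through_time(0.5, [0.1] * 50)
```

With these values each factor is at most 0.5, so the gradient over 50 steps is bounded by 0.5 to the power 50, far too small to produce a useful weight update.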
Enhanced Siamese LSTM Similarity Model
Long short-term memory (LSTM) is an RNN architecture with feedback connections, which in principle allow it to compute anything that a Turing machine can compute. A single LSTM unit is made up of a cell, an input gate, an output gate, and a forget gate; the cell can remember values over arbitrary time intervals, and the gates regulate the flow of data into and out of the cell.[47] An enhanced Siamese LSTM structure was used to determine how similar two molecules are. The architecture has two inputs, each representing a molecular fingerprint: one from the query and the other from the fingerprint data set. Its single output represents the degree of similarity and takes one of two classes: a value of 1 means high similarity, and a value of 0 means high dissimilarity. The weights are tied in this architecture so that LSTMa = LSTMb.

In this model, each input layer is a two-dimensional (32,32) arrangement of cells, with each cell linked to one molecular fingerprint feature. Each input branch is then connected to the distance layers. Two distances are used. The first is the Manhattan distance,[48] which can be represented as

$$d_{AB} = \sum_{i} \left| f_{A,i} - f_{B,i} \right|$$

where $d_{AB}$ is the Manhattan distance, $f_A$ is the feature vector of the query molecule, and $f_B$ is the feature vector of the data set molecule. The second is the exponential Manhattan distance,[44] which can be given as

$$E_{AB} = \exp\left( -\sum_{i} \left| f_{A,i} - f_{B,i} \right| \right)$$

where $E_{AB}$ is the exponential Manhattan distance, $f_A$ is the feature vector of the query molecule, and $f_B$ is the feature vector of the data set molecule.

Next, a fusion layer is added to fuse the two distance layers (Manhattan and exponential Manhattan). Three layers are then added after the fusion layer, with 512, 256, and 1 cells, respectively. The output takes one of two values: 1, meaning the two input molecules are similar, and 0, meaning the two input molecules are dissimilar. The ReLU activation function is used for all dense layers except the last one, which uses the sigmoid activation function. Moreover, the RMSprop optimizer is used, the loss function is binary_crossentropy, and the batch size is 64. The architecture of the Siamese RNN-LSTM similarity model is illustrated in Figure 2.
Figure 2
Architecture of the enhanced
Siamese RNN-LSTM similarity model.
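As a worked example of the two distance layers used in this model, the sketch below computes the Manhattan distance and the exponential Manhattan distance for two small feature vectors. The vectors are made up for illustration, and representing the fusion layer as a two-element list is an assumption, since the text does not specify the fusion operation.

```python
import math


def manhattan_distance(f_a, f_b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(f_a, f_b))


def exponential_manhattan(f_a, f_b):
    """exp(-Manhattan distance); equals 1 for identical vectors and decays toward 0."""
    return math.exp(-manhattan_distance(f_a, f_b))


# Hypothetical branch outputs for a query molecule and a data set molecule.
f_query = [0.9, 0.1, 0.4]
f_database = [0.7, 0.1, 0.9]

d_ab = manhattan_distance(f_query, f_database)      # 0.2 + 0.0 + 0.5
e_ab = exponential_manhattan(f_query, f_database)   # exp(-0.7)
fused = [d_ab, e_ab]                                # assumed fusion: concatenation
```

The two measures carry complementary information: the Manhattan distance is unbounded and grows with dissimilarity, whereas the exponential form is bounded in (0, 1] and grows with similarity.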
Enhanced Siamese GRU Similarity Model
The gated recurrent unit (GRU) is an RNN architecture similar to the LSTM. Instead of the input, output, and forget gates of the LSTM, the GRU has a reset gate and an update gate. The update gate lets the model decide how much of the previous information (from previous time steps) needs to be passed along to the future.[49] An enhanced Siamese GRU framework has been used to determine how similar two molecules are. The architecture has two inputs, each representing a molecular fingerprint: one from the query and the other from the fingerprint data set. It has one output that represents the degree of similarity: 1 means high similarity, and 0 means high dissimilarity. The weights are tied such that GRUa = GRUb. Each input layer is a two-dimensional (32,32) arrangement of cells, with each cell connected to one feature of the molecular fingerprint, and each input branch is then connected to the distance layers.

Two distances are used (as described in the previous subsection): the Manhattan distance and the exponential Manhattan distance. Next, a fusion layer is added to fuse the two distance layers (Manhattan and exponential Manhattan). Three layers are then added after the fusion layer, with 512, 256, and 1 cells, respectively. The output takes one of two values: 1, meaning the two input molecules are similar, and 0, meaning the two input molecules are dissimilar. The ReLU activation function is used for all layers except the last one, which uses the sigmoid activation function. Moreover, the RMSprop optimizer is used, binary_crossentropy is the loss function, and the batch size is 64. Figure 3 shows the architecture of the Siamese RNN-GRU similarity model.
Figure 3
Architecture of the enhanced Siamese RNN-GRU
similarity model.
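The role of the reset and update gates described above can be made concrete with a single scalar GRU step. This is a sketch of the standard GRU equations under one common sign convention (the new state is (1 - z) times the previous state plus z times the candidate state); the parameter names in the dictionary and all numeric values are hypothetical, and bias terms are omitted for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One step of a scalar GRU cell; p holds hypothetical weights without biases."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)                # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev))   # candidate state
    # Convention assumed here: z near 0 keeps the old state, z near 1 overwrites it.
    return (1.0 - z) * h_prev + z * h_cand


def run_gru(sequence, p, h0=0.0):
    """Feed a sequence through the cell and return the final hidden state."""
    h = h0
    for x in sequence:
        h = gru_step(x, h, p)
    return h


# Two extreme settings of the update gate (hypothetical weights).
keep = {"wz": -50.0, "uz": 0.0, "wr": 1.0, "ur": 1.0, "wh": 1.0, "uh": 1.0}
overwrite = dict(keep, wz=50.0)

kept_state = gru_step(1.0, 0.3, keep)        # update gate closed: stays near 0.3
new_state = gru_step(1.0, 0.3, overwrite)    # update gate open: moves to the candidate
```

Because the update gate can hold the previous state almost unchanged, information and gradients can pass across many time steps, which is how the GRU mitigates the short-term memory problem of a plain RNN.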
Enhanced Siamese CNN Similarity Model
The CNN is a type of deep feed-forward network that can be trained and generalized more easily than other networks with connectivity between adjacent layers.[50,51] In this work, the Siamese CNN framework has been used to determine how similar two molecules are. CNN1D (one dimension) and CNN2D (two dimensions) have been used in this architecture as follows.
Enhanced Siamese CNN1D Similarity Model
CNNs function the same way whether they have one, two, or three dimensions. The difference lies in the structure of the input data and in how the filter, often referred to as a convolution kernel or feature detector, moves over the data. In this work, the Siamese CNN1D framework is used to calculate the similarity between a reference molecular structure and a database molecular structure based on fingerprints. The architecture thus has two inputs, each representing a molecular fingerprint: one from the reference structure (query) and the other from the database structure. It has one output, representing the degree of similarity: a value of 1 means high similarity, and a value of 0 means high dissimilarity. The weights are tied such that CNN1Da = CNN1Db. In each branch, a 1D convolutional layer (1D-CNN) receives the molecular fingerprint and is followed by another 1D convolutional layer and then by max pooling with a pool size of 2. The convolutional layer is formed of 64 filters with a kernel size of 3 and uses the rectified linear unit (ReLU) activation function; it is followed by a flatten layer and then a dense layer with a sigmoid activation function. Two distances are used: the Manhattan distance and the exponential Manhattan distance. Next, a fusion layer is added to fuse the two distance layers (Manhattan and exponential Manhattan). One layer, the output layer, is then added after the fusion layer. The ReLU activation function is used for all dense layers except the layer before the distance layers and the output layer, both of which use the sigmoid activation function. Moreover, the RMSprop optimizer is used, binary_crossentropy is the loss function, and the batch size is 64. Figure 4 shows the architecture of the Siamese CNN1D similarity model.
Figure 4
Architecture of the enhanced Siamese CNN1D similarity
model.
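To make the convolution and pooling settings above concrete, the following sketch slides a single kernel of size 3 over a short binary fingerprint, applies ReLU, and then applies max pooling with a pool size of 2. It is an illustration only: the 8-bit fingerprint and the kernel values are made up, a single filter stands in for the 64 filters of the model, and stride 1 without padding is an assumption. As in most deep learning libraries, the "convolution" is computed as a sliding dot product.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal with stride 1 and no padding."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


fingerprint = [1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical 8-bit fingerprint
kernel = [1.0, -1.0, 0.5]                # hypothetical filter of kernel size 3

feature_map = relu(conv1d_valid(fingerprint, kernel))   # length 8 - 3 + 1 = 6
pooled = max_pool1d(feature_map, size=2)                # length 6 // 2 = 3
```

Here `feature_map` is `[1.0, 0.5, 0.0, 0.0, 1.5, 0.0]` and `pooled` is `[1.0, 0.0, 1.5]`: pooling halves the length while keeping the strongest response in each window.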
Enhanced Siamese CNN2D Similarity Model
An enhanced Siamese CNN2D framework has been used to calculate the similarity between a reference structure and a database structure based on 2D fingerprints. The architecture has two inputs, each representing a molecular fingerprint: one from the reference structure (query) and the other from the database structure. It has one output, which represents the degree of similarity and takes one of two classes: a value of 1 means high similarity, and a value of 0 means high dissimilarity. The weights are tied such that CNN2Da = CNN2Db. In each branch, a 2D convolutional layer (2D-CNN) receives the molecular fingerprint. This layer is formed of 64 filters with a kernel size of (3,3) and uses the rectified linear unit (ReLU) activation function. It is followed by another 2D convolutional layer formed of 64 filters with a kernel size of (3,3), max pooling with a pool size of (2,2), a flatten layer, and then a dense layer with a sigmoid activation function.

Two distances are used: the Manhattan distance and the exponential Manhattan distance. Next, a fusion layer is added to fuse the two distance layers (Manhattan and exponential Manhattan). Three layers are then added after the fusion layer, with 512, 256, and 1 cells, respectively. The ReLU activation function is used for all dense layers except the layer before the distance layers and the output layer, both of which use the sigmoid activation function. Moreover, the RMSprop optimizer is used, the loss function is binary_crossentropy, and the batch size is 64. Figure 5 shows the architecture of the Siamese CNN2D similarity model.
Figure 5
Architecture of the enhanced
Siamese CNN2D similarity model.
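The layer sequence of one CNN2D branch can be followed with simple shape arithmetic. The sketch below rests on several assumptions that are not stated for this model: a 1024-element fingerprint reshaped to a (32,32) grid (the grid size is borrowed from the LSTM and GRU descriptions), stride 1, and no padding. Under different padding or input sizes the numbers change accordingly.

```python
def to_grid(fingerprint, side=32):
    """Reshape a flat fingerprint of side * side elements into a 2D grid."""
    return [fingerprint[r * side:(r + 1) * side] for r in range(side)]


def conv2d_valid_shape(height, width, kernel=(3, 3)):
    """Output size of a stride-1 convolution without padding."""
    return height - kernel[0] + 1, width - kernel[1] + 1


def max_pool2d_shape(height, width, pool=(2, 2)):
    """Output size of non-overlapping max pooling."""
    return height // pool[0], width // pool[1]


def branch_output_size(height, width, filters=64):
    """Length of the vector produced by the flatten layer of one branch."""
    h, w = conv2d_valid_shape(height, width)   # first (3,3) convolution
    h, w = conv2d_valid_shape(h, w)            # second (3,3) convolution
    h, w = max_pool2d_shape(h, w)              # (2,2) max pooling
    return h * w * filters


grid = to_grid(list(range(1024)))              # placeholder values, not a real fingerprint
flat_length = branch_output_size(len(grid), len(grid[0]))
```

Under these assumptions the grid shrinks from 32 x 32 to 30 x 30, then 28 x 28, then 14 x 14, so the flatten layer yields 14 * 14 * 64 = 12544 values before the dense sigmoid layer.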
Experimental Design
Data Sets
Experiments were conducted using the MDL Drug Data Report data sets (MDDR-DS1, MDDR-DS2, and MDDR-DS3)[52] and the Maximum Unbiased Validation (MUV) data set,[53] which are the most common cheminformatics databases. All molecules in these databases were converted to Pipeline Pilot ECFC-4 fingerprints, and these databases have recently been used by our research group. The screening experiments were carried out with ten reference structures chosen randomly from each activity class. MDDR-DS1 has 102 516 molecules (active and inactive). The active molecules (about 8300 molecules) comprise 11 activity classes, some with structurally homogeneous active elements and others with structurally heterogeneous (i.e., structurally diverse) active elements. MDDR-DS2 also has 102 516 molecules (active and inactive). The active molecules (about 5100 molecules) consist of 10 homogeneous activity classes. MDDR-DS3 has 102 516 molecules (active and inactive). The active molecules (about 8600 molecules) consist of 10 heterogeneous activity classes. Tables 1–3 provide descriptions of all three data sets. Each row of a table gives the activity class, the number of molecules belonging to the class, and the diversity of the class, measured as the mean pairwise Tanimoto similarity computed with ECFC-4 over all pairs of molecules in the class. The second data collection (MUV), reported by Rohrer and Baumann, is described in Table 4. It contains 17 activity classes, each with up to 30 active and 15 000 inactive molecules. The class composition indicates that this data set consists of highly diverse, that is, more heterogeneous, activity classes. Our research group has used these data collections in previous articles.
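The diversity measure reported in the tables, the mean Tanimoto similarity over all pairs of molecules in a class, can be sketched as follows. The function uses the general vector form of the Tanimoto coefficient, which reduces to the familiar bit-string form for binary fingerprints and also accepts count fingerprints. The three 4-element fingerprints are made up for illustration and are not ECFC-4 data, and returning 0.0 for two all-zero vectors is a convention chosen here.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient for two equal-length non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Convention chosen for this sketch: two all-zero vectors score 0.0.
    return dot / denominator if denominator else 0.0


def mean_pairwise_similarity(fingerprints):
    """Average Tanimoto similarity over all unordered pairs in a class."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical activity class with three tiny fingerprints.
activity_class = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
]
diversity = mean_pairwise_similarity(activity_class)   # each pair shares 1 of 3 set bits
```

A low mean pairwise similarity, as in the MDDR-DS3 and MUV classes, indicates a structurally heterogeneous class, whereas higher values indicate a more homogeneous one.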
Table 1
MDDR-DS1 Structure Activity Classes
activity index | active molecules | activity class | mean pairwise similarity
31420 | 1130 | renin inhibitors | 0.290
31432 | 943 | angiotensin II AT1 antagonists | 0.229
37110 | 803 | thrombin inhibitors | 0.180
71523 | 750 | HIV protease inhibitors | 0.198
42731 | 1246 | substance P antagonists | 0.149
07701 | 395 | D2 antagonists | 0.138
06245 | 359 | 5HT reuptake inhibitors | 0.122
78374 | 453 | protein kinase C inhibitors | 0.120
06235 | 827 | 5HT1A agonists | 0.133
06233 | 752 | 5HT3 antagonist | 0.140
78331 | 636 | cyclooxygenase inhibitors | 0.108
Table 3
MDDR-DS3 Structure Activity Classes
activity index | active molecules | activity class | mean pairwise similarity
09249 | 900 | muscarinic (M1) agonists | 0.111
31281 | 106 | dopamine-hydroxylase inhibitors | 0.125
12464 | 505 | nitric oxide synthase inhibitors | 0.102
71522 | 700 | reverse transcriptase inhibitors | 0.103
43210 | 957 | aldose reductase inhibitors | 0.119
12455 | 1400 | NMDA receptor antagonists | 0.098
75721 | 636 | aromatase inhibitors | 0.110
78351 | 2111 | lipoxygenase inhibitors | 0.113
78348 | 617 | phospholipase A2 inhibitors | 0.123
78331 | 636 | cyclooxygenase inhibitors | 0.108
Table 4
MUV Structure Activity Classes
activity index | activity class | mean pairwise similarity
466 | S1P1 rec. (agonists) | 0.117
644 | Rho-Kinase2 (inhibitors) | 0.122
600 | SF1 (inhibitors) | 0.123
689 | Eph rec. A4 (inhibitors) | 0.113
652 | HIV RT-RNase (inhibitors) | 0.099
712 | HSP 90 (inhibitors) | 0.106
692 | SF1 (agonists) | 0.114
733 | ER-b-Coact. Bind. (inhibitors) | 0.114
713 | ER-a-Coact. Bind. (inhibitors) | 0.113
810 | FAK (inhibitors) | 0.107
737 | ER-a-Coact. Bind. (potentiators) | 0.129
846 | FXIa (inhibitors) | 0.161
832 | cathepsin G (inhibitors) | 0.151
858 | D1 rec. (allosteric modulators) | 0.111
852 | FXIIa (inhibitors) | 0.150
548 | PKA (inhibitors) | 0.128
859 | M1 rec. (allosteric inhibitors) | 0.126
Performance Evaluation Measures
The efficiency of the proposed methods is evaluated as follows.

The first way to evaluate the performance of the retrieval model is the
recall metric: the portion of the active chemical compounds that is found
within the top 1% and top 5% of the ranked test set. This measure has been
used in previous studies.[28,31−33,54−65]

The whole data set is divided into K sets of equal size: one serves as the
test set and the remaining sets as training sets. The test set changes in
each iteration, and the mean of the recall values over all iterations is
taken as the final result. This method is called k-fold cross validation,
as shown in Figure 6. In each iteration, ten queries randomly selected from
the activity class are tested, and the mean recall of these ten queries is
calculated.

The improvement percentage of each method was averaged as follows. For
instance, the improvement percentage of GRU was calculated using the
improvement equation against TAN, BIN, SQB, and SDBN, and the mean of these
values was taken for each class. A positive mean indicates that the
retrieval recall of GRU improved on previous studies; a negative mean
indicates that it was worse. The mean improvement over all classes was then
calculated. The same procedure was applied to all proposed methods. The
improvement percentage of each previous method was likewise calculated
against the proposed methods; for example, the improvement percentage of
TAN was calculated against GRU, LSTM, CNN1D, and CNN2D, then averaged for
each class, and then averaged over all classes in the data set.
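The recall measure described above can be illustrated with a minimal Python sketch. It assumes that each test compound has a similarity score to the query and a known active/inactive label; the function name `recall_at_top`, the rounding of the cutoff size, and the toy data are illustrative assumptions, not taken from the study.

```python
def recall_at_top(scores, is_active, fraction):
    """Share of all active compounds that appear in the top `fraction` of the ranking."""
    # Assumption: the cutoff size is the rounded fraction of the test set (at least 1).
    n_top = max(1, round(len(scores) * fraction))
    # Rank compounds by descending similarity score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = sum(is_active[i] for i in order[:n_top])
    total = sum(is_active)
    return found / total if total else 0.0

# Toy ranking of 200 compounds with 4 actives (placeholder values only).
scores = [1.0 - i / 200 for i in range(200)]
is_active = [False] * 200
for idx in (0, 1, 5, 150):
    is_active[idx] = True

print(recall_at_top(scores, is_active, 0.01))  # recall within the top 2 compounds
print(recall_at_top(scores, is_active, 0.05))  # recall within the top 10 compounds
```

In this toy case two of the four actives fall within the top 1% and three within the top 5%.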
Figure 6
Idea of
cross validation for training and testing data.
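The cross-validation protocol can be sketched as follows. This is only an outline under stated assumptions: `evaluate` is a hypothetical callback that would train a model on the training indices, rank the test indices against one query, and return the recall; `query_pool` is assumed to hold the indices of the actives of one activity class. Neither name comes from the study.

```python
import random
from statistics import mean

def k_fold_indices(n_items, k, seed=0):
    """Shuffle the item indices and split them into k folds of near-equal size."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_recall(n_items, k, query_pool, evaluate, n_queries=10, seed=0):
    """Mean over the k folds of the mean recall of randomly selected queries."""
    folds = k_fold_indices(n_items, k, seed)
    rng = random.Random(seed)
    fold_means = []
    for i, test_idx in enumerate(folds):
        # Every fold except the current one forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Draw the queries for this iteration from the activity class.
        queries = rng.sample(query_pool, min(n_queries, len(query_pool)))
        recalls = [evaluate(train_idx, test_idx, q) for q in queries]
        fold_means.append(mean(recalls))
    return mean(fold_means)
```

A call such as `cross_validated_recall(1000, 10, active_indices, my_evaluate)` would then return the final averaged recall, where `active_indices` and `my_evaluate` are supplied by the user.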
Comparison methods:
The second way is to assess the results of the proposed model against
current approaches, which include the following.

TAN: Over the years, the Tanimoto similarity coefficient has been the
benchmark search method in LBVS. The Tanimoto-based similarity model
employs the Tanimoto coefficient in its continuous form, which is suitable
for nonbinary fingerprint data.[23]

BIN: The second method is Bayesian inference on the MDDR (DS1, DS2, DS3)
and MUV data sets for the ECFC-4 descriptor. This is an alternative method
for calculating the similarity of molecular fingerprints.[29,61]

SQB: The third method is the quantum similarity search SQB(Complex) on the
MDDR (DS1, DS2, DS3) and MUV data sets for the ECFC-4 descriptor. This
method utilizes a quantum mechanics approach.[31]

SDBN: The latest study is a multidescriptor-based stack of deep belief
networks method on the MDDR data sets (DS1, DS2, and DS3) for the ECFC-4,
ECFP-4, and EPFP-4 descriptors. The molecular features are reweighted using
deep belief networks.[33]

The third measure used to evaluate the proposed methods is a significance
test, the Kendall W concordance test. This significance test has been used
in previous studies.[28,33,54,60,61,63,64] Its result can be interpreted as
a concordance coefficient, which is a measure of agreement among raters.
Each case is a judge or rater in the Kendall W test, whereas each variable
is an object or person being judged. For each variable, the sum of ranks is
computed. Kendall's W ranges between 0, indicating no agreement, and 1,
indicating full agreement. For example, suppose that object i, which
represents a similarity search method, is given the rank r_{i,j} by judge
number j, which represents an activity class, where there are n objects and
m judges in total. The total rank given to object i is then[66]

$$R_i = \sum_{j=1}^{m} r_{i,j}$$

whereas the mean of these total ranks is

$$\bar{R} = \frac{1}{2} m (n + 1)$$

The squared deviation sum δ is defined as

$$\delta = \sum_{i=1}^{n} \left( R_i - \bar{R} \right)^2$$

Then, the Kendall W statistic is defined as

$$W = \frac{12\,\delta}{m^2 \left( n^3 - n \right)}$$

Kendall's W takes values between zero and one because it is the variance of
the rank totals divided by the maximum possible value of that variance,
which occurs when all judges are in absolute agreement. The test shows
whether a group of judges makes equivalent decisions about the rating of a
set of items. In this analysis, the activity classes of each data set were
treated as the judges, and the recall rates of the different search models
as the items. The significance level associated with the Kendall
coefficient is an important part of this experiment, because it verifies
whether the value of the coefficient could have occurred by chance. If the
value was significant (both 0.01 and 0.05 cutoff values were used), it was
then possible to assign the items an overall ranking.

For a more evident comparison between the recall values of the proposed
methods and previous studies, the improvement percentage for each proposed
method was calculated using the improvement equation.[67]
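The Kendall W definition described above can be sketched in a few lines of Python. The sketch assumes a table with one row per activity class (judge) and one column per search method (object), ranks each row in ascending order so that a higher rank means a higher recall, and applies no tie correction; the function names are illustrative and not taken from the study.

```python
def rank_row(values):
    """Ascending ranks (1 = smallest value); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def kendall_w(table):
    """table[j][i] is the recall of method i on activity class j (m judges, n objects)."""
    m, n = len(table), len(table[0])
    ranks = [rank_row(row) for row in table]
    totals = [sum(r[i] for r in ranks) for i in range(n)]    # total rank of each object
    mean_total = m * (n + 1) / 2                             # mean of the total ranks
    delta = sum((t - mean_total) ** 2 for t in totals)       # squared deviation sum
    w = 12 * delta / (m ** 2 * (n ** 3 - n))
    mean_ranks = [t / m for t in totals]                     # mean rank per method
    return w, mean_ranks
```

When every class orders the methods identically, W equals 1; when the rank totals are all equal, W equals 0. The returned mean ranks can be sorted to obtain an overall ordering of the methods.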
Results and Discussion
The experimental results for the ECFC-4 descriptor on the MDDR-DS1,
MDDR-DS2, MDDR-DS3, and MUV data sets are provided in Tables 5–12, using 1%
and 5% cutoffs. These tables record the results of the proposed deep
learning methods alongside the benchmark TAN and the previous studies BIN,
SQB, and SDBN. Each row lists the recall values for one activity class at
the top 1% or top 5%, and the best recall rate in each row is shaded. The
mean rows give the average over all activity classes, and the shaded-cells
rows give, for each method, the total number of activity classes in which
it achieved the top value. The distribution of the results in the tables is
shown as boxplots in Figures 7–14.
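The two summary rows of the result tables (the mean and the number of shaded cells) can be computed as in the following sketch. The dictionary layout, the function name `summarize`, the method labels, and the numbers are placeholders chosen for illustration; none of the values come from the paper.

```python
from statistics import mean

def summarize(recall_table):
    """recall_table maps a method name to its recall values, one per activity class."""
    methods = list(recall_table)
    n_classes = len(next(iter(recall_table.values())))
    means = {m: mean(recall_table[m]) for m in methods}
    shaded = {m: 0 for m in methods}
    for c in range(n_classes):
        best = max(recall_table[m][c] for m in methods)
        for m in methods:
            # Assumption: when methods tie for the best value, each tied cell is counted.
            if recall_table[m][c] == best:
                shaded[m] += 1
    return means, shaded

# Placeholder recall values (percent) for three activity classes.
toy = {"method_a": [10.0, 20.0, 30.0], "method_b": [40.0, 20.0, 60.0]}
means, shaded = summarize(toy)
print(means)
print(shaded)
```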
Table 5
Top 1% Retrieval
Results for MDDR-DS1
Data Set for Descriptor ECFC-4
Table 12
Top 5% Retrieval Results for MUV
Data Set for Descriptor ECFC-4
Figure 7
Boxplot for recall result distribution
for each method in MDDR-DS1
at the top 1%.
Figure 14
Boxplot for recall result distribution
for each method in MUV at
the top 5%.
The MDDR-DS1 recall values for the 1% and 5% cutoffs, recorded in Tables 5
and 6, respectively, show that the proposed Siamese deep learning
approaches are clearly superior to the benchmark TAN method and other
studies. In Table 5, the CNN1D approach gives the best retrieval recall
results in both the mean and the number of shaded cells, followed by CNN2D,
GRU, SDBN, BIN, LSTM, SQB, and TAN. The boxplot in Figure 7 compares the
distribution of results of the methods in MDDR-DS1 at the top 1% in terms
of maximum, upper quartile, mean, median, and lower quartile values. The
top four methods are CNN1D, CNN2D, GRU, and LSTM by maximum value; CNN1D,
CNN2D, GRU, and LSTM by upper quartile; CNN1D, CNN2D, GRU, and SDBN by
mean; CNN1D, CNN2D, SDBN, and GRU by median; and CNN1D, CNN2D, SDBN, and
GRU by lower quartile.
Likewise, in Table 6, the CNN1D approach gives the best retrieval recall
results in both the mean and the number of shaded cells, followed by CNN2D,
GRU, LSTM, SDBN, BIN, SQB, and TAN. The boxplot in Figure 8 compares the
distribution of results of the methods in MDDR-DS1 at the top 5% in terms
of the same statistics. The top four methods are CNN1D, SDBN, CNN2D, and
BIN by maximum value; CNN1D, CNN2D, GRU, and SDBN by upper quartile; CNN1D,
CNN2D, GRU, and LSTM by mean; CNN1D, CNN2D, LSTM, and GRU by median; and
CNN1D, CNN2D, LSTM, and GRU by lower quartile.
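The boxplot comparisons above rank the methods by five summary statistics. A minimal sketch of that computation is given below; it assumes per-class recall values for each method and uses the inclusive quartile convention of Python's `statistics.quantiles`, which may differ from the convention of the plotting tool used for the figures. The function names are illustrative.

```python
from statistics import mean, median, quantiles

def boxplot_summary(values):
    """Maximum, upper quartile, mean, median, and lower quartile of one method's recalls."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "max": max(values),
        "upper_quartile": q3,
        "mean": mean(values),
        "median": median(values),
        "lower_quartile": q1,
    }

def top_methods(recall_table, statistic, k=4):
    """Names of the k methods with the highest value of the chosen statistic."""
    summaries = {m: boxplot_summary(v) for m, v in recall_table.items()}
    return sorted(summaries, key=lambda m: summaries[m][statistic], reverse=True)[:k]
```

For example, `top_methods(table, "median")` would list the four methods with the highest median recall, where `table` maps method names to lists of per-class recall values.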
Table 6
Top 5% Retrieval
Results for MDDR-DS1
Data Set for Descriptor ECFC-4
Figure 8
Boxplot
for recall result distribution for each method in MDDR-DS1
at the top 5%.
Furthermore, the MDDR-DS2 recall values recorded for the top 1% in Table 7
show that the proposed Siamese deep learning method CNN1D is clearly
superior to the benchmark TAN method and previous studies. CNN1D gives the
best retrieval recall results in both the mean and the number of shaded
cells. By mean value, it is followed by SDBN, BIN, and SQB, then by Siamese
CNN2D, LSTM, and GRU, and finally by TAN. The boxplot in Figure 9 compares
the distribution of results of the methods in MDDR-DS2 at the top 1% in
terms of maximum, upper quartile, mean, median, and lower quartile values.
The top four methods are BIN, SQB, CNN1D, and SDBN by maximum value; SDBN,
CNN1D, CNN2D, and BIN by upper quartile; CNN1D, BIN, SQB, and CNN2D by
mean; CNN1D, CNN2D, SDBN, and BIN by median; and CNN1D, SDBN, BIN, and SDBN
by lower quartile.
By contrast, the MDDR-DS2 recall values recorded for the 5% cutoff in
Table 8 show that the BIN method gives the best retrieval recall results in
both the mean and the number of shaded cells. By mean value, it is followed
by SQB, SDBN, CNN1D, CNN2D, LSTM, and finally TAN. The boxplot in Figure 10
compares the distribution of results of the methods in MDDR-DS2 at the top
5% in terms of the same statistics. The top four methods are BIN, SQB,
SDBN, and CNN1D for every one of them: maximum, upper quartile, mean,
median, and lower quartile.
Table 7
Top 1% Retrieval
Results for MDDR-DS2
Data Set for Descriptor ECFC-4
Figure 9
Boxplot for recall result distribution
for each method in MDDR-DS2
at the top 1%.
Table 8
Top 5% Retrieval Results for MDDR-DS2
Data Set for Descriptor ECFC-4
Figure 10
Boxplot for recall result distribution
for each method in MDDR-DS2
at the top 5%.
In addition, the MDDR-DS3 recall values recorded for the top 1% and 5% in
Tables 9 and 10, respectively, show that the proposed Siamese deep learning
methods are clearly superior to the benchmark TAN method and other studies.
In Table 9, the CNN1D method gives the best retrieval recall results in
both the mean and the number of shaded cells, compared to previous studies
and the other Siamese deep learning methods. It is followed by Siamese
CNN2D, then SDBN, GRU, BIN, SQB, TAN, and LSTM. The boxplot in Figure 11
compares the distribution of results of the methods in MDDR-DS3 at the top
1% in terms of maximum, upper quartile, mean, median, and lower quartile
values. The top four methods are CNN1D, CNN2D, GRU, and SDBN by maximum
value; CNN1D, CNN2D, GRU, and SDBN by upper quartile; and CNN1D, CNN2D,
SDBN, and GRU by mean, by median, and by lower quartile.
Similarly, in Table 10, the CNN1D method gives the best retrieval recall
results in both the mean and the number of shaded cells, compared to
previous studies and the other Siamese deep learning methods, followed by
Siamese CNN2D, GRU, SDBN, TAN, BIN, SQB, and finally LSTM. The boxplot in
Figure 12 compares the distribution of results of the methods in MDDR-DS3
at the top 5% in terms of the same statistics. The top four methods are
CNN1D, CNN2D, GRU, and LSTM by maximum value; CNN1D, CNN2D, GRU, and LSTM
by upper quartile; CNN1D, CNN2D, GRU, and SDBN by mean; CNN1D, CNN2D, GRU,
and SDBN by median; and CNN1D, CNN2D, SDBN, and GRU by lower quartile.
Table 9
Top 1% Retrieval
Results for MDDR-DS3
Data Set for Descriptor ECFC-4
Table 10
Top 5% Retrieval Results for MDDR-DS3
Data Set for Descriptor ECFC-4
Figure 11
Boxplot for recall result
distribution for each method in MDDR-DS3
at the top 1%.
Figure 12
Boxplot for recall result distribution
for each method in MDDR-DS3
at the top 5%.
Moreover, the MUV data set recall values recorded for the 1% and 5% cutoffs
in Tables 11 and 12, respectively, show that the proposed Siamese deep
learning CNN methods are clearly superior to the benchmark TAN method and
previous studies. In Table 11, the CNN1D method gives the best retrieval
recall results by mean value. The second best are BIN and Siamese CNN2D,
followed by GRU, LSTM, SQB, and finally TAN. The boxplot in Figure 13
compares the distribution of results of the methods in MUV at the top 1% in
terms of maximum, upper quartile, mean, median, and lower quartile values.
The top four methods are CNN1D, CNN2D, BIN, and SQB by maximum value;
CNN1D, CNN2D, BIN, and SQB by upper quartile; CNN1D, BIN, CNN2D, and GRU by
mean; BIN, CNN2D, CNN1D, and SQB by median; and BIN, CNN2D, CNN1D, and GRU
by lower quartile.
In Table 12, the CNN1D method gives the best retrieval recall results in
both the mean and the number of shaded cells, followed by CNN2D, BIN, SQB,
TAN, GRU, and LSTM. The boxplot in Figure 14 compares the distribution of
results of the methods in MUV at the top 5% in terms of the same
statistics. The top four methods are CNN1D, CNN2D, BIN, and SQB by maximum
value; CNN1D, CNN2D, BIN, and SQB by upper quartile; CNN1D, CNN2D, BIN, and
SQB by mean; BIN, CNN2D, SQB, and CNN1D by median; and BIN, CNN2D, GRU, and
SQB by lower quartile.
Table 11
Top 1% Retrieval
Results for MUV
Data Set for Descriptor ECFC-4
Figure 13
Boxplot for recall result distribution
for each method in MUV at
the top 1%.
Moreover, the Kendall W concordance test has been used. Table 13 shows the
ranking of the enhanced Siamese deep learning methods (RNN-GRU, RNN-LSTM,
CNN1D, CNN2D) against the previous studies TAN, BIN, SQB, and SDBN using
the Kendall W test results for MDDR-DS1, MDDR-DS2, MDDR-DS3, and MUV at the
top 1% and top 5%. The first previous method is the Tanimoto coefficient
TAN, the second is Bayesian inference (ABDO),[29] the third is the quantum
similarity search SQB-Complex (Al-dabagh),[31] and the last is the
multidescriptor-based stack of deep belief networks (Nasser).[33] For all
of the data sets used, the Kendall W test at the top 1% gives significance
(P) values below 0.05. This means that the agreement among the activity
classes on the ranking of the methods is statistically significant in all
cases at the 1% cutoff. The overall ranking indicates that the enhanced
Siamese CNN methods are superior to the previous studies and the benchmark
TAN: CNN1D has the top rank in the DS1, DS2, and DS3 data sets, while the
BIN method has the top rank in the MUV data set.
Table 13
Ranking of Enhanced Siamese Deep Learning (RNN-GRU, RNN-LSTM, CNN1D, CNN2D) Methods based on TAN, BIN, SQB, and SDBN Using Kendall W Test Results for DS1, DS2, DS3, and MUV at the Top 1% and 5%

data set  retrieval percentage (%)  W     P               ranked methods (mean rank)
DS1       1                         0.64  2.24 × 10⁻⁸     1-CNN1D (7.91), 2-CNN2D (6.45), 3-SDBN (5.36), 4-GRU (4.27), 5-BIN (4.00), 6-LSTM (3.00), 7-SQB (2.55), 8-TAN (2.45)
DS1       5                         0.66  1.1601 × 10⁻⁸   1-CNN1D (7.73), 2-CNN2D (6.73), 3-GRU (4.91), 4-SDBN (4.64), 5-LSTM (4.27), 6-BIN (4.00), 7-TAN (2.64), 8-SQB (1.91)
DS2       1                         0.49  1.471 × 10⁻⁵    1-CNN1D (6.9), 2-SDBN (5.8), 3-BIN (5.65), 4-CNN2D (4.9), 5-SQB (4.85), 6-LSTM (3.2), 7-GRU (2.9), 8-TAN (1.8)
DS2       5                         0.47  2.8157 × 10⁻⁵   1-BIN (6.85), 2-SQB (6.25), 3-SDBN (5.5), 4-CNN1D (5.1), 5-CNN2D (4), 6-TAN (6), 7-LSTM (2.95), 8-GRU (2.25)
DS3       1                         0.64  1.4015 × 10⁻⁷   1-CNN1D (7.45), 2-CNN2D (6.45), 3-SDBN (6), 4-BIN (4.4), 5-GRU (3.9), 6-SQB (3.1), 7-TAN (2.9), 8-LSTM (1.8)
DS3       5                         0.74  7.00 × 10⁻⁹     1-CNN1D (7.7), 2-CNN2D (7.3), 3-SDBN (5.1), 4-GRU (4.9), 5-LSTM (3), 6-SQB (2.8), 7-TAN (2.6), 8-BIN (2.6)
MUV       1                         0.52  9.62 × 10⁻¹⁰    1-BIN (6.23), 2-CNN2D (5.235), 3-CNN1D (5), 4-GRU (3.76), 5-LSTM (3.24), 6-SQB (2.71), 7-TAN (1.82)
MUV       5                         0.33  9.5856 × 10⁻⁶   1-BIN (5.56), 2-CNN2D (5.21), 3-CNN1D (4.91), 4-SQB (3.65), 5-GRU (3.47), 6-TAN (2.76), 7-LSTM (2.44)
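The source does not state how the P values for W were obtained. One common choice is the chi-square approximation, in which m(n − 1)W is compared with a chi-square distribution with n − 1 degrees of freedom, where m is the number of activity classes and n the number of methods. The sketch below implements that approximation with the standard library only, using the closed-form upper tail for integer degrees of freedom; treating it as the procedure behind the reported values is an assumption, and the function names are illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(chi) * sum_{r=1}^{(df-1)/2} chi^(2r-1) / (1*3*...*(2r-1))
    chi = math.sqrt(x)
    phi = math.exp(-half) / math.sqrt(2.0 * math.pi)  # standard normal density at chi
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + 2.0 * phi * total

def kendall_w_significance(w, m, n):
    """Chi-square statistic m*(n-1)*W and its upper-tail P value with n-1 degrees of freedom."""
    stat = m * (n - 1) * w
    return stat, chi2_sf(stat, n - 1)

# DS1 at the top 1%: W = 0.64 with 11 activity classes and 8 methods.
stat, p = kendall_w_significance(0.64, 11, 8)
print(stat, p)
```

For this example the statistic is 11 × 7 × 0.64 = 49.28 on 7 degrees of freedom, which lies far in the upper tail.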
The same holds for the Kendall W test at the top 5%: the associated
probability (P) values are below 0.05, so the ranking of the methods is
statistically significant in all cases at the 5% cutoff. The overall
ranking indicates that the enhanced Siamese CNN1D is superior to previous
studies for DS1 and DS3, whereas BIN has the top rank for DS2 and MUV at
the top 5%. Figures 15 and 16 show the ranking of the enhanced Siamese deep
learning methods (RNN-GRU, RNN-LSTM, CNN1D, CNN2D) against TAN, BIN, SQB,
and SDBN using the Kendall W test results for DS1, DS2, DS3, and MUV at
the top 1% and 5%.
Figure 15
Ranking of enhanced Siamese deep learning (RNN-GRU, RNN-LSTM,
CNN1D,
CNN2D) methods based on TAN, BIN, SQB, and SDBN using Kendall W test
results for DS1, DS2, DS3, and MUV at the top 1%.
Figure 16
Ranking
of enhanced Siamese deep learning (RNN-GRU, RNN-LSTM, CNN1D,
CNN2D) methods based on TAN, BIN, SQB, and SDBN using Kendall W test
results for DS1, DS2, DS3, and MUV at the top 5%.
For another comparison between the recall values of the proposed methods
and prior studies, the improvement percentage was calculated for the
proposed and prior methods on each data set, as shown in Table 14. In the
DS1 data set, the proposed CNN methods have positive values at the top 1%,
meaning that their retrieval recall improves on the prior methods; CNN1D
has the highest improvement percentage, followed by CNN2D. All previous
methods have negative values, meaning that their retrieval recall is worse
than that of the proposed methods. At the top 5%, all proposed methods have
positive values, with CNN1D showing the highest improvement, followed by
CNN2D, GRU, and LSTM, while all prior methods again have negative values.
Table 14
Improvement Percentage of the Proposed Methods and Prior Methods for Each Data Set
(previous studies: TAN, BIN, SQB, SDBN; proposed methods: GRU, LSTM, CNN1D, CNN2D)

data set  cutoff   TAN      BIN      SQB      SDBN     GRU      LSTM      CNN1D   CNN2D
DS1       top 1%   -59.037  -27.955  -34.942  -14.029  -1.155   -15.051   39.320  25.401
DS1       top 5%   -57.819  -47.508  -58.173  -31.669  16.858   14.108    39.509  31.406
DS2       top 1%   -37.911  2.400    1.852    3.946    -5.190   -4.989    11.243  3.606
DS2       top 5%   -20.062  7.746    7.137    5.812    -8.796   -7.275    1.597   -2.530
DS3       top 1%   -79.723  -56.487  -70.460  -6.758   -35.122  -107.770  44.872  31.746
DS3       top 5%   -86.018  -87.382  -91.183  -38.262  17.277   -9.266    56.480  49.208
MUV       top 1%   -93.350  16.035   -77.736  n/a      -3.198   -14.241   24.255  22.283
MUV       top 5%   -20.652  10.123   -11.820  n/a      -17.121  -25.851   5.637   10.929
Also, in the DS3 data set, the proposed methods except GRU and LSTM have
positive values at the top 1%, meaning that their retrieval recall improves
on the prior methods; CNN1D has the highest improvement, followed by CNN2D.
Likewise, at the top 5%, all proposed methods except LSTM have positive
values, with CNN1D showing the highest improvement, followed by CNN2D and
GRU. All prior methods have negative values at both the top 1% and 5%,
meaning that their retrieval recall is worse than that of the proposed
methods.

Moreover, in the MUV data set at the top 1%, the proposed CNN methods have
positive values, which means that their retrieval recall improves on the
prior methods; the previous BIN method also has a positive value. CNN1D has
the highest improvement, followed by CNN2D and BIN. At the top 5%, the
proposed methods except GRU and LSTM have positive values, and BIN again
has a positive value. CNN2D has the highest improvement, followed by BIN
and CNN1D, while GRU and LSTM have negative values, meaning that their
retrieval recall is worse than that of the previous methods. TAN and SQB
also have negative values, meaning that their retrieval recall is worse
than that of the proposed methods.

However, in the DS2 data set, the proposed CNN methods have positive values
at the top 1%, meaning that their retrieval recall improves on the prior
methods, and the previous studies BIN, SQB, and SDBN also have positive
values, meaning that their retrieval recall improves on the proposed
methods. The proposed CNN1D method has the highest improvement, followed by
SDBN, CNN2D, BIN, and SQB. At the top 5%, CNN1D is the only proposed method
with a positive value, whereas the previous methods BIN, SQB, and SDBN all
have positive values; BIN has the highest improvement, followed by SQB,
SDBN, and the proposed CNN1D method.
Conclusions
Many techniques for capturing the biological similarity between a test
compound and a known target ligand in LBVS have been established. LBVS is
based on the premise that compounds with related properties will show
related target-binding behavior. Although these methods perform well
compared to their predecessors, especially on molecules with homogeneous
active structural elements, their performance is not satisfactory on
molecules that are structurally heterogeneous.

The main goal of this research is to improve the retrieval effectiveness of
the similarity model, especially for structurally heterogeneous molecules.
Deep learning was used because of its powerful generalization and feature
extraction capabilities and its ability to handle big data, and the Siamese
architecture was used because of its ability to handle complicated,
especially heterogeneous, data samples. The Siamese deep learning models
were enhanced using two distance layers followed by a fusion layer that
combines their results; for some models, multiple layers were added after
the fusion layer to improve the similarity recall between a test compound
and a known target ligand. Several deep learning methods were used in this
architecture: LSTM, GRU, CNN1D, and CNN2D. The results showed that the
proposed methods, especially the Siamese CNN similarity models, clearly
outperformed the standard Tanimoto coefficient (TAN) and previous studies
(BIN, SQB, SDBN) at both the top 1% and 5%, especially on the MDDR-DS1,
MDDR-DS3, and MUV data sets, which include heterogeneous classes.