Hyun Kil Shin1,2. 1. Department of Predictive Toxicology, Korea Institute of Toxicology, Daejeon 34114, Republic of Korea. 2. Human and Environmental Toxicology, University of Science and Technology, Daejeon 34113, Republic of Korea.
Abstract
Deep learning (DL) models in quantitative structure–activity relationship (QSAR) feed the molecular structure directly to the network, without using human-designed descriptors, by representing the molecule as a graph or a string (e.g., a SMILES code). However, these two representations are oversimplifications of real molecules and do not fully reflect the chemical properties of molecular structures. Given that the choice of molecular representation determines the architecture of the DL model to apply, a novel molecular representation can open a way to apply diverse DL networks developed and used in other fields. In this study, a topological distance-based electron interaction (TDEi) tensor was developed, inspired by the quantum mechanical model of the molecule, which defines a molecule with electrons and protons. In the TDEi tensor, the atomic orbitals (AOs) of each atom are represented by an electron configuration (EC) vector, a bit string encoding the presence or absence of electrons in each AO, with spin indicated by positive and negative signs. Interactions between EC vectors were calculated based on the topological distance between atoms in a molecule. As a molecular structure was thereby translated into a 3D array, CNN models (modified VGGNet) were applied with the TDEi tensor to predict four physicochemical properties of drug-like compound datasets: MP (275,131 compounds), Lipop (4193), Esol (1127), and Freesolv (639). The models achieved good prediction accuracy. PCA showed that a stronger correlation was observed between the extracted features and the target endpoint as features were extracted from deeper layers.
Diverse
quantitative structure–activity relationship (QSAR)
models have been used in drug discovery projects to reduce the time and cost of drug discovery by predicting the properties or activities of molecules from their structures alone.[1−3] Even though QSAR models
have been successful in filtering out poor molecules in the early
phase of drug discovery, the models have failed to discover good drug
candidates based on their prediction outcomes alone, which implies
that the prediction accuracy of QSAR models is not yet satisfactory.[4] Recently, deep learning (DL) algorithms have been applied in QSAR model development with the expectation of a significant improvement in prediction accuracy.[5] Even though most DL studies claim that the application of DL algorithms improved prediction accuracy,[6−10] it is not certain whether the use of DL can truly improve the prediction accuracy of QSAR models. Jiang et al. compared the performance of machine learning (ML) models and DL models on multiple datasets and concluded that a descriptor-based support vector machine outperformed DL models.[11] In order to achieve a breakthrough in drug discovery using DL, more research is still required to test diverse deep neural network (DNN) architectures on drug-like compounds to see if DL can truly improve prediction accuracy.

An artificial neural network was frequently used in traditional
QSAR studies with a descriptor vector as an input in a shallow network
architecture having only one hidden layer. A descriptor-based DL model used a deeper network architecture by stacking more hidden layers inside the network. In more advanced DL algorithms, methods to feed the molecular
structures directly to the network were developed instead of using
molecular descriptors. The molecular structure can be represented
with different theoretical models,[7] and
a topological graph is a common way to define molecular structures
with edges (chemical bonds) and nodes (atoms). A graph convolutional
neural network (GCN) takes the molecular graph as an input, represented
by an adjacency matrix with the feature of each atom and their neighbor
information. The feature for each atom represents the character of
the atom based on its microenvironment, and neighbor information is
described by the distance between the atoms, or connectivity features,
such as chemical bond features.[8,12,13] The simplified molecular input line entry system (SMILES) code is
a string format representation for molecular structures[14] and is broadly used in public databases. As
molecular structures can be represented in a string format, techniques
used in natural language processing were adopted to input molecular
structures directly to the network through one-hot encoding on each
symbol of SMILES[15] or SMILES embeddings.[16] Even though graphs and SMILES have been widely used for molecular structure representation, they are oversimplifications of real molecules and do not fully reflect the chemical properties observed in molecular structures. Thus, more molecular representations need to be developed to represent molecules realistically. Given that the choice of molecular representation determines the architecture of the DNN to apply, the development of a novel representation can open a way to apply diverse DL algorithms, developed and used in other fields, to drug-like compound datasets.

A quantum mechanical (QM) model
of the molecule is represented
by electrons and protons. Since a chemical bond can be described by the distribution of electrons between closely located atoms, the input file for a QM calculation does not specify bonding information. In the SMILES and mol file formats, implicit hydrogens are usually omitted; however, they should be explicitly specified in a QM calculation to elaborate the electronic structure of the molecule. QM descriptors have been
used in a wide range of QSAR studies to establish a correlation between
molecular orbital (MO) energy-derived descriptors and the target endpoints.[17] In order to calculate electron interactions
between atoms in a molecule, an autocorrelation descriptor was used
to calculate the interaction of QM properties between two atoms within
a certain topological distance.[18,19] This method uses human-preprocessed
information based on different levels of QM calculation theories.
To the best of my knowledge, no DNN study has defined a molecule with a QM model so that the DNN automatically extracts QM property interactions.
Given that MO energies were calculated in QM through linear combination
of atomic orbitals (AOs), interactions between AOs in a molecule can
be used to generate an input for DL models so that the network can
efficiently weigh the interactions between AOs for prediction of the
target properties in a data-driven way.

In this study, a topological
distance-based electron interaction
(TDEi) tensor was developed to convert the molecular structure into
a 3D array format to feed the electronic structure of the molecule
into a convolutional neural network (CNN) architecture. In the TDEi
tensor calculation, AOs in each atom were represented by the electron
configuration (EC) vector, and the number of interactions between
each AO was calculated to convert the molecular structure into the
TDEi tensor. Particularly, the TDEi tensor was designed to be adjustable
according to the size of the data and the complexity of the molecular
structure by changing the EC vector and topological distance channel.
Here, a modified VGGNet was used in model development with the TDEi tensor, and it achieved good accuracy for the prediction of four physicochemical properties: normal melting point (MP), water solubility (Esol), octanol/water distribution coefficient (Lipop), and hydration free energy (Freesolv). Analysis of the features revealed that a higher correlation was achieved between the features and the target properties when the features were extracted from deeper layers of the VGGNet.
Methods
TDEi Tensor Calculation
In QM calculations, coefficients for the linear combination of AOs are estimated in the density matrix, a square matrix whose rows and columns correspond to the AOs in a molecule.
As a molecule can be translated into a matrix format with AO information,
the concept was adopted in the TDEi tensor design. The size of the
density matrix is dependent on the number of AOs in the molecule;
however, the input shape must be identical regardless of the size of the molecule in order to feed it into the CNN. Therefore, the electron interaction (Ei) matrix was designed with rows and columns of a fixed EC vector size so that every input has an identically
sized matrix. As the size of the Ei matrix was fixed, molecular geometry differences were lost in the matrix because the EC vector was based solely on the composition of the molecule. Since molecules with identical compositions can have different geometries, an Ei matrix was generated for each topological distance within a molecule to account for its topological structure.

In QM calculations, the
3D geometry of a molecule is used to calculate the distance between
atoms. The 3D structure of molecules can be obtained by geometry optimization;
however, the calculated 3D geometry is not sufficiently accurate. The 3D structures available in PubChem are also optimized structures. Thus, 2D structural information alone was used in this study due to the absence of precise 3D molecular geometries for the compounds in the datasets.
Definition of an Electron Configuration Vector
The TDEi tensor was calculated based on the EC of the atoms in a molecule. The EC vector was defined in a previous study by assigning a zero to each unoccupied AO and a one to each occupied AO, with the two electron spins marked by positive and negative signs.[20] The EC vector can be varied by combining degenerate AOs or electron spins because these electrons possess identical energy levels. Integrating such information is particularly significant when handling small datasets, for efficient model training. Such information condensation was successful in prediction model development with a small dataset.[21] The possible variations of EC vectors are summarized in Figure 1.
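To make the construction concrete, the following is a minimal Python sketch of a full EC vector in the spirit of Figure 1A. The orbital list, filling order, and vector length are illustrative assumptions (the paper's full EC vector has 53 slots, and its exact orbital ordering is not specified here); this is not the author's implementation.

```python
# Sketch of a "full" EC vector: one slot per (orbital, spin), with spin-up
# electrons stored as +1, spin-down as -1, and unoccupied slots as 0.
import numpy as np

ORBITALS = ["1s", "2s", "2px", "2py", "2pz", "3s", "3px", "3py", "3pz"]
SLOTS = [(orb, spin) for orb in ORBITALS for spin in (+1, -1)]

# Simplified aufbau filling order with Hund's rule for the p shells (assumed).
FILL_ORDER = [("1s", +1), ("1s", -1), ("2s", +1), ("2s", -1),
              ("2px", +1), ("2py", +1), ("2pz", +1),
              ("2px", -1), ("2py", -1), ("2pz", -1),
              ("3s", +1), ("3s", -1),
              ("3px", +1), ("3py", +1), ("3pz", +1),
              ("3px", -1), ("3py", -1), ("3pz", -1)]

def ec_vector(atomic_number: int) -> np.ndarray:
    """Signed occupancy vector for one atom (elements up to argon here)."""
    vec = np.zeros(len(SLOTS))
    for orb_spin in FILL_ORDER[:atomic_number]:
        vec[SLOTS.index(orb_spin)] = orb_spin[1]
    return vec

print(ec_vector(6))   # carbon: 1s2 2s2 2p2
print(ec_vector(8))   # oxygen: 1s2 2s2 2p4
```

The condensed variants in Figure 1B–F would follow by summing slots of degenerate orbitals and/or collapsing the two spin channels of this vector.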
Figure 1
Calculation of electron
configuration (EC) vectors of each atom.
The EC vector of each atom was used in order to calculate electron
interaction matrices. (A) Indices of the full EC vector are atomic
orbitals with different spins, and the EC vector was designed to be
reduced by integrating (B) atomic orbitals in an identical energy
level (degenerate orbitals) and (C) in different spins. (D–F)
Each EC vector was condensed to reduce sparsity in feature space.
Electron Interaction Matrix Calculation
Instead of using chemical bonds, the TDEi tensor checks neighboring atoms to define the topological structure of the molecule. Implicit hydrogens are not indicated in SMILES; however, they should be specified in order to correctly calculate the electronic structure of a molecule; therefore, all hydrogens in the molecule were explicitly added. The GetDistanceMatrix function implemented in RDKit (version
2018.09.01) was used to obtain the topological distance (TD) of atoms
within a molecule. Topological distance 0 means the atom itself (Figure 2A); thus, the EC vector of the atom was multiplied by itself as a column vector and a row vector (an outer product) to calculate the number of electron interactions within the atom. The topological distance 0 matrix is the sum of these Ei matrices over all atoms in the molecule. When the topological distance was one or greater, the pairs of atoms at that distance within the molecule were taken to calculate the Ei matrices between them. Since the Ei matrix between two paired atoms was calculated twice, it was divided by two, and all Ei matrices at each topological distance were summed. All Ei matrices with topological distances of one or greater were calculated as explained in Figure 2B.
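A minimal sketch of this procedure follows, reusing the ec_vector helper from the previous sketch. AddHs and GetDistanceMatrix are real RDKit functions; the rest is an illustrative reconstruction, not the author's code, and the vector size defaults to the 18 slots of the sketch above rather than the paper's 53.

```python
# Sum of EC-vector outer products for each topological distance 0..max_td
# (Figure 2); assumes ec_vector() from the previous sketch.
import numpy as np
from rdkit import Chem

def ei_matrices(smiles: str, max_td: int, dim: int = 18) -> np.ndarray:
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))   # hydrogens must be explicit
    td = Chem.GetDistanceMatrix(mol)               # topological distance matrix
    ecs = [ec_vector(a.GetAtomicNum()) for a in mol.GetAtoms()]
    mats = np.zeros((max_td + 1, dim, dim))
    for i, ec_i in enumerate(ecs):
        mats[0] += np.outer(ec_i, ec_i)            # distance 0: within an atom
        for j, ec_j in enumerate(ecs):
            d = int(td[i, j])
            if i != j and 1 <= d <= max_td:
                # each unordered pair is visited as (i, j) and (j, i): halve it
                mats[d] += np.outer(ec_i, ec_j) / 2.0
    return mats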
Figure 2
Electron interaction
matrices (Eis) in each specified topological
distance were calculated based on the EC vector of each atom in a
molecule. (A) Eis within topological distance 0 are electron interactions
within an atom, and (B) Eis at topological distances greater than 0 are Eis between pairs of atoms according to the distance. After calculation, the Eis were summed within the same topological distance.
Concatenating Electron Interaction Matrices in Different Topological Distances
The Ei matrix can be calculated at any topological distance. In the example molecule (Figure 3), atom pairs existed up to a topological distance of 4. Ei matrices from atom pairs at greater topological distances can be calculated as the size of the molecule increases. When the Ei matrices for the predetermined topological distances were prepared, they were concatenated to form the TDEi tensor (Figure 4). As the TDEi tensor size can be varied through the size of the EC vector and the topological distance, it can be flexibly adjusted according to the size of the data or the diversity of the molecular structures.
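Stacking the per-distance Ei matrices along a channel axis yields the TDEi tensor; a sketch assuming the ei_matrices helper above, using a channels-last layout for Keras:

```python
# Concatenating Ei matrices into the TDEi tensor (Figure 4).
import numpy as np

def tdei_tensor(smiles: str, max_td: int = 3, dim: int = 18) -> np.ndarray:
    mats = ei_matrices(smiles, max_td, dim)   # shape (max_td + 1, dim, dim)
    return np.transpose(mats, (1, 2, 0))      # shape (dim, dim, max_td + 1)

x = tdei_tensor("CCO")   # every molecule maps to an identically shaped array
print(x.shape)           # (18, 18, 4)
```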
Figure 3
Possible atom pairs. In this molecule,
the longest topological
distance was 4D; however, further atom pairs can be found if the size
of the molecule increases.
Figure 4
Preparation
of the TDEi tensor. The TDEi tensor was prepared by
concatenating Ei matrices in each topological distance and designed
to be adaptable to structural diversity and the size of datasets by
adjusting EC vectors and topological distances.
Preparation of Datasets
In this study, four datasets
were selected for the regression tasks: MP, Esol, Lipop, and Freesolv.
There is a wide range of datasets used in DL model development studies,[8,11,13,16] such as human immunodeficiency virus replication inhibition, human β-secretase 1 inhibition, blood–brain barrier penetration, toxicity in clinical trials, drug adverse reactions, biological targets screened in Tox21 and ToxCast, and PubChem BioAssay data; however, they are seriously imbalanced, whether binary or multi-class classification tasks. Thus, they were not used in this study.

The MP data were obtained from the study of Tetko et al., in
which 275,131 compounds were extracted with their MP values from patent
documents.[22] The dataset was divided into
training, validation, and external test sets by a random split in
a ratio of 8:1:1. Esol, Freesolv, and Lipop were obtained from the study by Jiang et al.[11] These datasets were already divided into training, validation, and external test sets by the authors; thus, I used them as such. The number of data points and the range of each endpoint are listed in Table 1, and the chemical space of the datasets was plotted to verify the structural diversity of the training, validation, and external test sets (Figure 5).
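For reference, an 8:1:1 random split such as the one used for the MP dataset can be produced as in the following sketch; the DataFrame contents and the use of scikit-learn are illustrative assumptions.

```python
# A minimal 8:1:1 random split; the toy DataFrame stands in for the MP data.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1", "CC(=O)O"] * 100,
                   "mp": range(300)})                  # hypothetical dataset

train, rest = train_test_split(df, test_size=0.2, random_state=0)
valid, test = train_test_split(rest, test_size=0.5, random_state=0)
print(len(train), len(valid), len(test))               # 240 30 30, i.e., 8:1:1
```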
Table 1. Datasets for Model Building

                          training set                 validation set              test set
endpoints   total num.    num.      range              num.     range              num.     range
MP          275,131       220,104   –199.0 to 517.0    27,513   –157.15 to 420     27,514   –185.18 to 438.65
Lipop       4193          3354      –1.5 to 4.5        420      –1.42 to 4.49      419      –1.17 to 4.5
Esol        1127          901       –11.6 to 1.58      113      –9.16 to 0.94      113      –8.40 to 1.02
Freesolv    639           511       –25.47 to 3.16     64       –9.76 to 3.43      64       –20.52 to 3.12
Figure 5
Structural diversity of the training, validation, and external test sets. (A) The normal melting point (MP) dataset was the largest dataset. The (B) Lipop, (C) Esol, and (D) Freesolv datasets correspond to the octanol/water partition coefficient, water solubility, and hydration free energy, respectively.
Model Development and Validation
A CNN was used in this study for model development, implemented in TensorFlow 2.2.0,[23] and the network architecture was modified from VGGNet as follows: (1) the size of the initial filter channel was reduced by half, from 64 to 32; (2) the filter shape was reduced from three by three to two by two; (3) average pooling was used to minimize information loss; and (4) a convolutional layer was applied once instead of twice before each pooling layer (Figure 6A). A grid search was performed over the CNN architectures, activation functions, and epoch numbers to obtain the best hyperparameters for model development. Model training was conducted using the NEURON system of the National Supercomputing Center of South Korea, which is mainly composed of GPU nodes (https://www.ksc.re.kr/eng/resource/neuron). The prediction accuracy of the model was measured using four metrics:
mean absolute error (MAE), normalized mean absolute error (NMAE), the coefficient of determination (R2), and Spearman's rank correlation coefficient (Sr):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_{\mathrm{pred},i} - y_{\mathrm{obs},i}\right|$$

$$\mathrm{NMAE} = \frac{\mathrm{MAE}}{y_{\mathrm{obs,max}} - y_{\mathrm{obs,min}}} \times 100\%$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_{\mathrm{obs},i} - y_{\mathrm{pred},i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{\mathrm{obs},i} - \bar{y}_{\mathrm{obs}}\right)^{2}}$$

$$S_{r} = 1 - \frac{6\sum_{i=1}^{n} d_{i}^{2}}{n\left(n^{2} - 1\right)}$$

where $y_{\mathrm{pred}}$ is the model's predicted values, $y_{\mathrm{obs}}$ is the observed values, $n$ is the number of compounds, $\bar{y}_{\mathrm{obs}}$ is the average of the observed values, and $d_i$ is the difference between the ranks of each compound. A prediction model with an R2 higher than 0.6 on the external test set is considered accurate. Even when a model does not achieve R2 > 0.6, it can still make accurate predictions of the target value when the NMAE is less than 10%. As QSAR models are used in the prioritization of compounds, an Sr higher than 0.6 implies that the model's predictions are valid and useful for the relative comparison of chemicals, even if the NMAE is over 10%.[24] It is important to
assess the model prediction accuracy with more than one metric since
one metric alone cannot properly represent the prediction accuracy
of the model.
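For reference, the four metrics can be computed as in the following sketch, which matches the equations above; NumPy and SciPy are used here as stand-ins for whatever implementation the author used.

```python
# The four reported metrics: MAE, range-normalized MAE, R2, and Spearman's Sr.
import numpy as np
from scipy.stats import spearmanr

def metrics(y_obs: np.ndarray, y_pred: np.ndarray) -> dict:
    mae = np.mean(np.abs(y_pred - y_obs))
    nmae = mae / (y_obs.max() - y_obs.min()) * 100      # percent of the range
    r2 = 1 - np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
    sr = spearmanr(y_obs, y_pred).correlation           # rank correlation
    return {"MAE": mae, "NMAE(%)": nmae, "R2": r2, "Sr": sr}
```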
Figure 6
CNN architecture used in this study and the electron interactions integrated through each layer. (A) VGGNet was modified by decreasing the filter channel size, reducing the filter size in the convolutional layers to two by two, and reducing the number of convolutional layers before each pooling layer. In the grid search, diverse CNN architectures were tested; however, increasing the number of layers did not improve prediction accuracy. (B) Electron interactions were combined through each convolution and average pooling layer. Since the full electron configuration vector size was 53, all electron interactions were considered by the time the TDEi tensor passed the last pooling layer. (C) Electron interactions integrated in each convolution and pooling layer are shown as an example.
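For concreteness, a minimal Keras sketch of the modified VGGNet described above follows: 32 initial filter channels, 2 × 2 filters, one convolution per block, and average pooling. The block count, dense-layer sizes, and activations here are assumptions; the actual hyperparameters were selected by grid search.

```python
# Sketch of the modified VGGNet (Figure 6A) for the full EC vector (size 53)
# with topological distance channels 0-3.
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_shape=(53, 53, 4)):
    x_in = tf.keras.Input(shape=input_shape)
    x = x_in
    filters = 32
    for _ in range(5):                       # five conv + average-pooling blocks
        x = layers.Conv2D(filters, (2, 2), padding="same", activation="relu")(x)
        x = layers.AveragePooling2D((2, 2), padding="same")(x)
        filters *= 2
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(1)(x)                 # regression output
    return tf.keras.Model(x_in, out)

model = build_model()
model.compile(optimizer="adam", loss="mae")
```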
Model Analysis
VGGNets with the TDEi tensor were developed for the four datasets; however, only the model developed for MP was analyzed because it was trained with the largest dataset in this study. In QSAR studies, the capacity to separate different molecular structures is the most significant consideration in descriptor design. As the CNN extracted features from the TDEi tensor, the performance of these features in distinguishing compounds according to MP was examined. The features were extracted from the middle of the VGGNet, before the final prediction value was calculated. Principal component analysis (PCA) was used to project the extracted tensors into 2D space.
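A sketch of this analysis with Keras and scikit-learn, assuming the build_model sketch above; the choice of layer index and the random inputs standing in for TDEi tensors are illustrative.

```python
# Extract intermediate-layer activations and project them onto two principal
# components, as done for the feature-space plots.
import numpy as np
import tensorflow as tf
from sklearn.decomposition import PCA

def project_features(model, x, layer_index=-3):
    """2D PCA projection of the activations of one intermediate layer."""
    extractor = tf.keras.Model(model.input, model.layers[layer_index].output)
    feats = extractor.predict(x).reshape(len(x), -1)   # flatten feature maps
    return PCA(n_components=2).fit_transform(feats)

x = np.random.rand(32, 53, 53, 4).astype("float32")   # stand-in TDEi tensors
coords = project_features(build_model(), x)
print(coords.shape)   # (32, 2); scatter-plot these, colored by MP value
```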
Results and Discussion
TDEi Parameter Search
The TDEi parameter
search results
are presented in the supplementary tables: MP (Table S1), Lipop (Table S2), Esol
(Table S3), and Freesolv (Table S4). As the TDEi tensor can be varied by changing the
EC vectors and topological distances, the influence of different options
in the TDEi tensor on prediction accuracy was analyzed. First, an appropriate EC vector size was searched for. In the Lipop, Esol, and Freesolv datasets, a dramatic decrease in prediction accuracy was observed, regardless of topological distance, when the EC vector size was reduced, whereas the MP model showed only a mild decrease. The full EC vector achieved the highest accuracy for MP, Lipop, and Esol, whose data sizes were greater than 1000, and the condensed full EC vector for Freesolv, whose data size was less than 1000. According to this experiment, full or condensed full EC vectors should be used in developing models for drug-like compounds.

Second,
a suitable topological distance was checked. Desirable prediction
accuracy was achieved in the MP when the TDEi tensor with topological
distance 3 was used, and a further increase in topological distance
did not lead to a significant improvement in the accuracy. In the
other three datasets, the TDEi tensor with topological distance 2
achieved the highest accuracy. When the variation in prediction accuracy on the external test set was compared between MP and the other three datasets, the prediction accuracy improved gradually for MP, whereas it fluctuated for the other three datasets. This difference may derive from the difference in data size: with larger data, the training process of the model appears more stable than that of models trained with smaller data.

As the prediction accuracy of the model varied significantly with the EC vector size and the topological distance of the TDEi tensor in each dataset, a preliminary search was required to select the most suitable TDEi tensor option: topological distance 3 with a full EC vector for MP, 2 with a full EC vector for Lipop and Esol, and 2 with a condensed full EC vector for Freesolv. In this study, experiments were performed on small drug-like compounds. The TDEi tensor options could be changed if the structural diversity of a dataset differs from that of drug-like molecules.[25,26]
Model Prediction Accuracy
The VGGNet was optimized with the best TDEi tensor option from the preliminary search for each dataset (Table 2), and the goodness-of-fit of each model is shown in Figure 7. The R2 of the MP model was 0.565 on the external test set. Prediction errors between 0 and 400 °C were relatively high, as data points were widely distributed around the best-fit line (Figure 7A). However, an NMAE of 5.27% indicates that the predicted values were accurate on average, and Sr = 0.729 implies that the model correctly ordered the molecules according to their MP values. The Lipop model achieved an R2 of 0.516 on the external test set, and the NMAE was 10.93%. Even though the NMAE is slightly over 10%, Sr = 0.724 showed that the model can be used to compare the lipophilicity of molecules, which is a common use of logP in chemical space visualization (Figure 7B). The models developed with Esol and Freesolv achieved high R2 values (Figure 7C,D). Most DL studies report their prediction accuracy with one metric; however, more than one metric should be used to examine the prediction accuracy of a model thoroughly. In particular, R2 can vary significantly even when the MAE does not vary much.
Table 2. Best Prediction Results for Each Endpoint

                               training set                     validation set                   test set
endpoints  TD  EC vector       MAE      NMAE    R2     Sr       MAE     NMAE     R2     Sr      MAE     NMAE     R2     Sr
MP         3   full            31.959   4.46%   0.584  0.742    33.352  5.78%    0.553  0.720   32.874  5.27%    0.565  0.729
Lipop      2   full            0.450    7.50%   0.726  0.867    0.654   11.07%   0.525  0.740   0.620   10.93%   0.516  0.724
Esol       2   full            0.346    2.63%   0.948  0.976    0.557   5.52%    0.872  0.922   0.465   4.94%    0.896  0.951
Freesolv   2   full (cond.a)   0.425    1.48%   0.968  0.989    0.729   5.53%    0.875  0.942   0.563   2.38%    0.961  0.979

a cond.: condensed.
Figure 7
Examination of goodness-of-fit on four endpoints: (A)
normal melting
point, (B) octanol/water partition coefficient, (C) aqueous solubility,
and (D) solvation free energy.
VGGNet was developed with more
convolutional layers to examine
whether the prediction accuracy would improve significantly. However,
adding more convolutional layers or increasing the number of nodes
within fully connected layers did not lead to a meaningful improvement
in prediction accuracy. Thus, CNN models with deeper layers, such
as ResNet and Inception, were not applied. This is similar to a previous study in which increasing the number of weights within the neural network architecture did not always improve prediction accuracy.[20] Moreover, it is important to find a model architecture with the minimum number of weights that achieves the highest prediction accuracy because the use of an excessive number of weights in the model could induce false positives in prediction outcomes.[11]

Among the four datasets used in this study, the experimental error
in MP data was available in the reference. According to the study,
the inherent experimental error was 35 °C measured from 18,058
duplicated compounds,[22] which was larger
than the MAE of the MP model in this study. This experimental error
indicated that the lowest error achievable by a model is around 35 °C in reality. Therefore, improvement in prediction accuracy for this dataset is limited by the data itself, and ML and DL algorithms cannot make further meaningful improvements. In DL-based QSAR
studies, datasets were collected from a wide range of sources to increase the volume of data; however, this inevitably increases the noise in the data since the experimental values were measured with different protocols.[27] Therefore, understanding the
experimental errors of the dataset is of great aid in determining
whether the prediction accuracy of the model is meaningful. In DL model studies, the prediction accuracy of a model is often compared with that of others to prove that the proposed method achieved an improvement. Even though the prediction accuracy of a DL model may be numerically improved compared to other methods, such achievements might not be meaningful if the prediction errors were lower than the inherent experimental errors in the dataset, since the prediction error cannot truly be smaller than the experimental error.[1] Unfortunately,
it is difficult to find studies that compared the prediction accuracy
of the QSAR model with the experimental errors of the target endpoint.
This may be attributed to dataset curation being done without an understanding of the inherent experimental errors. Unless the prediction errors of QSAR models are first analyzed against the experimental errors in the dataset, a simple comparison of the prediction accuracy of QSAR models may be inadequate to provide decisive evidence of significant improvement. As models in computer vision predict unambiguous labels, a higher prediction accuracy there always implies a better model; if the problem of mislabeling is excluded from the discussion, prediction models in computer vision achieved great success because of the certainty of their datasets. It is practically impossible to obtain noise-free experimental datasets for drug-like compounds. To make a successful case, the inherent experimental errors in the dataset must be understood precisely so that models are trained and validated reliably.

PCA was performed
to show how the feature space varied as the TDEi tensor was processed within the VGGNet. In Figure 8, dots are brighter if the value is higher and darker if it is lower. Initially, the original TDEi tensor's feature space established a low correlation with MP (Figure 8A). Once the TDEi tensor was processed up to the last convolutional layer (the 8th layer), compounds were well prioritized (Figure 8B). An additional pooling layer strengthened the trend in the data distribution by separating compounds with low melting points to the upper left and those with high melting points to the lower right of the projected space (Figure 8C). When the features extracted from the convolutional and pooling layers were processed in a fully connected layer, most of the compounds were arranged with a stronger correlation with their MP values (Figure 8D).
Figure 8
Results of principal
component analysis (PCA). The MP prediction
model was analyzed since it was developed with the largest dataset.
Each point was colored based on the MP value: a brighter color implies
a higher MP, whereas a darker color means a lower MP. (A) Initially,
the TDEi tensor itself could not sufficiently prioritize compounds
according to the MP. (B) Features extracted after the last convolutional
layer showed that data points were arranged along the trend of MP
values. (C) The trend was strengthened after the last average pooling
layer, and (D) stronger correlation was established after the fully
connected layer.
To examine the difference
in feature extraction by deepening the
VGGNet, PCA was performed identically with an increased number of convolutional layers. In Figure 6, the convolutional layer is applied once before each pooling layer; in the deeper VGGNet, convolutional layers were applied twice with identical hyperparameters before each pooling layer. Features from the last pooling layer and the second fully connected layer were extracted and visualized (Figure 9). PCA showed that the features extracted after the additional convolutional layers were strongly correlated with the MP. The lack of improvement in prediction accuracy, even with the increased correlation in the deeper VGGNet, may be attributed to inherent experimental noise in the dataset.
Figure 9
Results of using a deeper VGGNet on feature space variation. The results shown in this figure were obtained by deepening the VGGNet in Figure 6, applying a convolutional layer twice before each average pooling layer. (A)
Features extracted from the last pooling layer prioritize compounds
accurately. (B) After the fully connected layer, features were in
stronger correlation with the target endpoint.
In the CNN models trained with image data, initially, the fundamental
level of features was extracted, and a higher level of features was
found as the layer went deeper. The EC is fundamental information
compared to atom-level features; thus, the use of EC in the CNN was
expected to fully harness the CNN’s automatic feature extraction
capacity through filters establishing significant electron interactions
for prediction of the target endpoint. According to the analysis of electron interactions in the CNN (Figure 6B,C), interactions with a wider range of electrons were established as the TDEi tensor passed through each convolution and average pooling layer, and interactions between all AOs were captured by the last average pooling layer. Given that MOs are calculated in QM through a linear combination of AOs, values from the 10th layer, which is the fifth pooling layer (Figure 6A), were similar to the molecular orbital
values of a molecule. Thus, the CNN automatically extracts MO energy-like
values through five convolution and pooling layers, and then these
values were fed into fully-connected layers to calculate the physicochemical
properties. Feature space variation in PCA supported this idea because
the extracted features were rearranged with a stronger correlation
toward the MP values as the layer went deeper.
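The widening described in Figure 6B can be checked with standard receptive-field arithmetic; a small sketch, assuming stride-1 2 × 2 convolutions and stride-2 2 × 2 poolings, with the five-block count taken from the fifth pooling layer mentioned above.

```python
# Receptive field of one value after n blocks of (2x2 conv, 2x2 avg pool):
# after five blocks, each value integrates 63 input positions, covering the
# full 53-slot EC-vector axis.
def receptive_field(blocks: int) -> int:
    rf, jump = 1, 1
    for _ in range(blocks):
        rf += jump        # 2x2 stride-1 convolution adds (k - 1) * jump
        rf += jump        # 2x2 pooling adds (k - 1) * jump before its stride
        jump *= 2         # stride-2 pooling doubles the effective jump
    return rf

for b in range(1, 6):
    print(b, receptive_field(b))   # 3, 7, 15, 31, 63 (> 53 at five blocks)
```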
Comparative Study between CNN Models for Drug Molecules
Most DL studies used a GCN, defining the molecular structure as a graph. To the best of my knowledge, only a few studies have developed CNNs on drug molecule data. The application of CNNs to molecular structure data has been limited by the lack of an adequate method to convert a molecular structure into a 3D array. Meyer et al. used an image of a 2D structure
as an input for the CNN.[28] However, the
use of a 2D image caused a structure standardization issue since the pixels of the image change when the 2D depiction of a molecular structure is rotated. Moreover, chemical features cannot be extracted from the 2D image alone; thus, Meyer et al. used an additional fingerprint-based prediction model to compensate for this disadvantage of the image-based CNN model. Hirohara et al.
used a SMILES matrix in which each symbol of the SMILES is one-hot encoded.[15] They applied a normalization algorithm to the SMILES string so that only one SMILES code is produced for each molecule. However, this approach is limited in the applicable length of the SMILES code and produces sparse information in the matrix. Hirohara et al. fixed the matrix size at 400 rows, the maximum length of the SMILES string, and 42 columns, the number of SMILES symbols. Therefore, the SMILES matrix cannot be calculated for a larger molecule whose SMILES string is longer than 400 characters, and a large sparse matrix is produced when a small molecule is converted. Karpov et al. used SMILES embeddings to prepare the input for the CNN model.[16] Preparing SMILES embeddings requires data augmentation and the training of a Transformer architecture. Once the Transformer encoder is ready, its output becomes a fixed-size matrix, called the SMILES embedding, which is used as the input for a CNN trained to predict the target endpoint. Thus, its computational cost is very high.

The TDEi tensor
efficiently standardizes chemical structures without extra training processes by converting atoms into AO-level information and applying a fixed topological distance to produce an identically sized tensor regardless of the size and complexity of a molecule. A one-hot encoded matrix is sparse since every vector is zero except at one relevant position, while the use of an EC vector makes the TDEi tensor contain richer information than a one-hot encoded vector (as the sketch below illustrates). Therefore, the TDEi tensor is a better way to overcome the standardization issue and the input sparsity problem. Another advantage is that the TDEi tensor has the shape of a 3D array, so CNN architectures developed in computer vision can be easily applied to it. On the other hand, there is a disadvantage to using a QM model of a molecule: DL models that define a molecule as a graph can identify significant molecular fragments after the model is trained,[13,15,16] whereas the TDEi tensor cannot be used for molecular fragment identification since it is based on the electrons within a molecule and the topological distances between them.
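To illustrate the sparsity argument, here is a toy fixed-size one-hot SMILES matrix in the style of Hirohara et al.; the 400-row limit comes from the text above, while the stand-in vocabulary is hypothetical (the real one has 42 symbols).

```python
# Toy one-hot SMILES matrix: a 400-row fixed-size encoding that is almost
# entirely zeros for a small molecule.
import numpy as np

def onehot_smiles(smiles: str, max_len: int = 400, vocab: str = "CNOc1()=#"):
    m = np.zeros((max_len, len(vocab)))
    for i, ch in enumerate(smiles[:max_len]):
        m[i, vocab.index(ch)] = 1
    return m

m = onehot_smiles("CCO")
print(m.shape, m.sum(), m.size)   # (400, 9) 3.0 3600: only 3 nonzero cells
```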
Conclusions
Graphs and SMILES have been the most commonly used molecular representations in DL-based QSAR models. In this study, a TDEi tensor was developed to represent molecular structures based on the interactions between electrons within a molecule. The TDEi tensor was calculated in the following steps: (1) the AOs of each atom were described by an EC vector, a bit string encoding the presence or absence of electrons in each AO, with spin indicated by positive and negative signs; (2) the number of electron interactions between two atoms was calculated from their EC vectors; and (3) the electron interactions at each topological distance were concatenated into a 3D array. The TDEi tensor prepared from a molecule was used as the CNN model input.

A modified VGGNet
was trained with the TDEi tensor as an input to predict four physicochemical properties: MP, Lipop, Esol, and Freesolv. The TDEi tensor was designed to be adjustable by changing the EC vector size and the depth of the topological distance in order to cope with structural diversity and small datasets. The preliminary search showed that the appropriate EC vector and topological distance depth varied for each dataset. When the VGGNet was optimized for each endpoint, the models achieved good prediction accuracy.

In
the VGGNet, the network was expected to weigh each electron interaction automatically to predict the target endpoint. To visualize this assumption, the MP data were analyzed. When the feature space changes were traced by PCA, a stronger correlation was found between the features and the target endpoint as features were extracted from deeper layers.