Literature DB >> 35685352

Prediction of GPCR activity using machine learning.

Prakarsh Yadav1, Parisa Mollaei1, Zhonglin Cao1, Yuyang Wang1, Amir Barati Farimani1,2,3.   

Abstract

GPCRs are the target of one-third of FDA-approved drugs; however, the development of new drug molecules targeting GPCRs is limited by the lack of mechanistic understanding of the GPCR structure-activity-function relationship. To modulate GPCR activity with highly specific drugs and minimal side effects, it is necessary to quantitatively describe the important structural features of the GPCR and correlate them to the activation state of the GPCR. In this study, we developed 3 ML approaches to predict the conformation state of GPCR proteins. Additionally, we predict the activity level of GPCRs based on their structure. We leverage the unique advantages of each of the 3 ML approaches: the interpretability of XGBoost, the minimal feature engineering of the 3D convolutional neural network, and the graph representation of protein structure for the graph neural network. Using these ML approaches, we are able to predict the activation state of GPCRs with high accuracy (91%-95%) and to predict the activity level of GPCRs with low error (MAE of 7.15-10.58). Furthermore, the interpretation of the ML approaches allows us to determine the importance of each feature in distinguishing between GPCR conformations.
© 2022 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology.


Keywords:  Convolutional Neural Networks; G-Protein Coupled Receptors; GPCRs; Graph Neural Networks; Machine Learning; Protein activation; protein featurization

Year:  2022        PMID: 35685352      PMCID: PMC9163700          DOI: 10.1016/j.csbj.2022.05.016

Source DB:  PubMed          Journal:  Comput Struct Biotechnol J        ISSN: 2001-0370            Impact factor:   6.155


Introduction

G-Protein Coupled Receptors (GPCRs) are important proteins in signaling networks, such as the Ras signaling pathway and neurotransmission [1], [2], [3], [4], [5]. Through these networks, GPCR proteins regulate physiological processes and the pathogenesis of diseases associated with such signaling networks [6], [7], [8]. The GPCR proteins have been the subject of significant academic attention aimed at elucidating the mechanism of activation of novel GPCRs by structural biology and biochemical methods [9]. The ubiquitous presence of GPCRs in signaling networks makes them an important target of drug molecule-based therapeutics [10]. Approximately one-third of all the drugs approved by the FDA target GPCRs, with the objective of either activating (agonist) or deactivating (antagonist) the receptor [11]. With the developments in computational biology, biotechnology, and pharmacology, there is an increased emphasis on efforts to regulate GPCR activity via allosteric sites [12], [13]. To accomplish this task and design therapeutics specific to the conformation of GPCRs, it is important to have a quantitative description of the protein conformations. Such a quantitative description will allow for improvements in rational drug design and accelerate the development of highly specific therapeutics with minimal side effects [14]. The GPCR proteins are clustered together as a protein superfamily, which is further divided into subclasses based on protein sequence similarity. The common feature of all GPCRs [15] is the presence of characteristic transmembrane helices, which create the ligand binding site in the extracellular domain and the binding site for the G-protein or arrestin in the intracellular domain [16]. The binding of a ligand to the receptor causes conformational changes in the ligand binding site.
These conformational changes are communicated across the protein toward the intracellular region and result in large translations and bending of the transmembrane helices [17]. A comprehensive understanding of the amino acid-level conformational changes which cause transmembrane helix movement is critical for the design of effective therapeutics [18]. The GPCR protein structures can be represented as quantitative models of their features, such as dihedral angles and contact distances. These quantitative models of GPCRs are expected to be high-dimensional and difficult to interpret without high-throughput computational methods [19], [20]. This leads to an interesting problem: can the features extracted from a GPCR structure model be used to predict its conformation? Machine Learning (ML) can help answer this question by learning the important features which distinguish between the different conformations of the GPCR proteins. The developments in ML, especially Deep Learning (DL) methods, have increased their applicability to niche biological problems. For example, AlphaFold [21] and RoseTTAFold [22] have leveraged DL approaches to predict protein structure from sequence input. AlphaFold has also made available predicted structure models for more than 350,000 protein sequences. Other works have used Convolutional Neural Networks (CNNs) for the prediction of protein–protein interactions [23], protein–ligand binding [24], [25], protein folding [26], protein phosphorylation sites [27], and protein structure classification [28]. However, CNNs are designed for structured data like images, while the features which describe a GPCR conformation are unstructured. To this end, Graph Neural Networks (GNNs) [29], [30], [31], [32] were introduced for modeling nodes and their relationships (edges) within unstructured data. Modern GNNs learn representations of graphs via aggregated message passing between the nodes [33].
Recently, GNNs have been leveraged in multiple protein-related domains, including protein–compound interaction [34], [35], [36], protein folding prediction [21], [22], and function estimation [37], [38], [39]. The application of ML has also been extended to the discovery of GPCR agonists [40] and GPCR bioactive ligands [41]. We build upon the motivation of previous works utilizing ML for biological problems and create ML models which can predict the state of a given GPCR protein structure and also describe the extent of its activation. To accomplish this goal, we require a large amount of training data comprising GPCR proteins in different conformations. We used the structure models for more than 500 GPCR proteins determined by experimental methods, which include the active and inactive conformations of the GPCRs, in addition to some intermediate conformations which may represent the transition between the two key states. We use the structure information for refined models of GPCRs from the GPCRdb server to develop quantitative models for the different conformations of the GPCR proteins by predicting their state (active, inactive, or intermediate) [42]. In this work, we present three approaches for this task: biophysics-aware feature engineering followed by shallow ML methods, 3D Convolutional Neural Networks (CNNs) with voxelization, and graph representation of the protein followed by Graph Neural Networks (GNNs). The shallow ML method takes the biophysics-aware features as input, while the DL methods, including the CNN and GNN, automatically extract features from the GPCR protein structures, thus requiring little or no domain knowledge. We interpret and rank the importance of all the engineered features. The biophysics-aware feature engineering allowed us to discover important residue interactions and contact distances and incorporate them into the ML models.
The ranking of engineered features also enabled us to identify the residue positions which are important for elucidating the GPCR conformation. The accuracy and robustness of the three approaches are benchmarked against each other in the tasks of predicting the activation state (classification) and the percentage of activation of the conformation (regression).

Methods

Machine Learning Models and GPCR structure dataset

The GPCRdb contains experimental data, phylogenetic diagrams, structures, and analysis tools for GPCRs [42]. This database provides insights into the molecular mechanisms of GPCR activation, signal transduction, protein binding, and allosteric modulation [42]. From the GPCRs listed on the GPCRdb, we collect the protein data bank (PDB) structures of the biological complexes containing the G-Protein Coupled Receptor [43]. Our dataset contains 555 PDB structures from 105 unique receptor types included in GPCRdb (see supporting information for the structures and their properties). Each PDB structure of a GPCR is then converted into 3 different representations: manually selected features, a voxelized representation, and a graph representation. These representations are used as input for the different ML models (i.e., engineered features for XGBoost [44], the voxel representation for the 3D convolutional neural network, and the graph representation for the graph neural network) to predict the activation state or the activity level (percentage activation) (Fig. 1).
Fig. 1

The three frameworks that we developed to predict the GPCR activation state. The PDB structure of each GPCR is converted into a representation (i.e. manually selected features, voxelized space, or graph representation) before being used as input to each machine learning model. The model will then output either the state or activation percentage of the GPCR.


Feature Extraction and Shallow Machine Learning

The PDB structures for GPCRs often contain co-crystallized lipid molecules, non-GPCR proteins, and drug molecules. Therefore, the PDB structures are preprocessed before being used to generate the input for the ML methods. As a preprocessing step, we extract only the structure of the transmembrane (TM) region of the receptors and remove all other segments of the proteins from the complex. The activation state labels for training the ML models are obtained from GPCRdb. The labels for the classification task correspond to 3 states of the receptors: active, inactive, and intermediate. The intermediate state corresponds to a structure transitioning between the active and inactive conformations. The labels for the regression task, also obtained from GPCRdb, give the activation level for each receptor. The truncated structures are then aligned with respect to each other to identify the positions of amino acids in GPCRs with different peptide sequences. The aligned structures are used for feature extraction to describe the receptors' state. The input features for the shallow ML models comprise the contact distances and the distribution of reciprocal interatomic distances (DRID) of the identified amino acid pairs and the TM helix pairs. To assess the performance of shallow ML models, we train and evaluate different ML models, including Random Forest, Support Vector Machine, XGBoost, and Logistic Regression. 5-fold cross validation is performed for these models on both the classification task and the regression task. To manually engineer the protein features for shallow ML prediction, we select 420 residue pairs by randomly picking 20 residue pairs on each of the 21 possible TM pairs. The average closest heavy-atom distance of the residue pairs is then calculated from all 555 protein structures in the dataset.
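The closest heavy-atom distance for a residue pair can be computed directly from the coordinates. A minimal numpy sketch (the function name is ours, and it assumes heavy-atom coordinates have already been extracted per residue, with hydrogens filtered out):

```python
import numpy as np

def closest_heavy_atom_distance(res_a, res_b):
    """Closest distance between any two non-hydrogen atoms of two residues.

    res_a, res_b: arrays of shape (n_atoms, 3) holding the heavy-atom
    cartesian coordinates (in Å) of each residue.
    """
    a = np.asarray(res_a, dtype=float)
    b = np.asarray(res_b, dtype=float)
    # Pairwise Euclidean distances between every heavy atom in res_a and res_b.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(d.min())
```

Averaging this quantity for a fixed residue pair over all 555 structures gives one candidate feature.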
It is observed that for a given contact distance pair, a larger difference between the average values of the active and inactive conformations (measured as the distance between the histogram peaks of the active (green plot) and inactive (red plot) states, Fig. 6c-f) correlates with higher prediction accuracy (Fig. 6a). To achieve high classification and regression accuracy, we rank the 420 random residue pairs based on the prediction accuracy obtained by using each individual feature. We then choose the top 105 features (5 features per TM pair) to train our model (Table S3). Furthermore, in the list of 105 pairs, we observe 21 residue pairs containing at least one residue belonging to the polar network of GPCR activation (Table S2). For these features we extract the contact distance and the distribution of reciprocal interatomic distances (DRID). The concatenated vector of contact distances and DRID is the input vector for the ML models. An XGBoost model and a Random Forest classifier are implemented to predict the states of the GPCRs, and the corresponding regressors are implemented to predict the activation level. XGBoost is a gradient boosting framework which utilizes second-order derivatives, and Random Forest is an ensemble method which uses decision trees and takes the mode of the trees' outputs [44], [45]. We implement these algorithms with the Scikit-learn python library [46]. The Random Forest classifier is implemented with 25 decision trees in the ensemble, bootstrapped initialization of the trees, and Gini impurity for the calculation of information gain. The XGBoost classifier is implemented with the gbtree boosting method, a learning rate of 0.3, a max tree depth of 6, and uniform sampling of the training instances.
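A rough sketch of the shallow-ML setup with the stated Random Forest configuration, using Scikit-learn. The random features and placeholder labels stand in for the real contact-distance/DRID vectors and GPCRdb state labels, which are not reproduced here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the concatenated contact-distance + DRID feature matrix:
# one row per GPCR structure, one column per engineered feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = (X[:, 0] > 0).astype(int)  # placeholder labels (e.g. active vs. inactive)

# Random Forest as described in the text: 25 bootstrapped trees, Gini impurity.
# The XGBoost counterpart would use booster="gbtree", learning_rate=0.3,
# and max_depth=6 via the xgboost package.
clf = RandomForestClassifier(
    n_estimators=25, bootstrap=True, criterion="gini", random_state=0
)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross validation
print(scores.mean())
```

Swapping `RandomForestClassifier` for `RandomForestRegressor` (and accuracy for MAE/RMSE scoring) gives the regression counterpart.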
Fig. 6

(a) Correlation between prediction accuracy and the difference of the means of the closest heavy-atom distances in the active and inactive states. (b) Comparison between the prediction accuracies of the Random Forest and XGBoost methods. The 105-pairs are the 105 residue pairs with the top accuracies among the 420 random residue pairs. The 21-pairs are the 21 residue pairs out of the 105-pairs with at least one residue belonging to the polar network of GPCR activation. Each data point is taken over 555 proteins. (c),(d),(e),(f) Normalized histograms of the closest distance between two non-hydrogen atoms of a residue pair over 555 proteins in the active, intermediate, and inactive states, with the corresponding representation of the conformational changes of the residue pairs on the NTS1 protein. The Ballesteros-Weinstein numbers are used to label the amino acids.

Voxelization and 3D Convolutional Neural Network

Voxelization is a technique for mapping continuous 3D space to a discrete 3D mesh grid of unit cubic cells (voxels) [47]. It has been used to generate geometrical representations of 3D objects for applications such as computer-aided design model classification [48] and 3D vision [49], [50]. Recently, voxelization has been introduced to the domain of molecular/atomic property prediction. The voxelized 3D space can preserve 3D atomic structural information while making an ideal input to a CNN [23]. CNNs have achieved outstanding performance in applications such as image classification [51], [52], [53], [54]. The convolutional layers in the CNN function as automatic feature extractors that detect important features from the input without human supervision. Combining voxelization and a 3D CNN creates a framework to learn the representation efficiently via filters. The CNN can automatically extract relevant features from the input, so that feature engineering is no longer needed. The combination of voxelization and CNNs has achieved high accuracy in tasks such as predicting the bioactivity of small molecules [55], interatomic force and potential prediction [56], [57], and prediction of the binding affinity of protein–ligand complexes [58]. The combination of 3-dimensional voxelization and a CNN can thus be another approach to featurize the GPCRs. The voxelization of the GPCRs is shown by the cartoon at the top of Fig. 2. The 3D space including and around an atomic structure is represented using a grid of voxels, called the voxelized space. To enforce a consistent input size to the CNN, we take the maximum value of the length, width, and height over all the GPCRs in the dataset and construct a voxelized space large enough to fit all GPCRs along the x, y, and z dimensions. In this work, each voxel is a cube of fixed side length, giving 458,964 voxels in total. All voxels have an initial value of 0.
If the cartesian coordinates of an atom fall within a voxel, the voxel is assigned the atomic number of the atom in it (e.g., 6 for a carbon and 8 for an oxygen atom). In the rare cases where two atoms appear in the same voxel (for example, a hydrogen atom diagonally opposite another atom within one voxel), the voxel is given the greater atomic number of the two atoms. Since most overlaps happen between a non-hydrogen atom and a hydrogen atom, which has less effect on the GPCR property [23], taking the greater atomic number minimizes the information loss. 3D CNN differs from 2D CNN in the dimension of the convolution filters. For this specific work, we choose 3D CNN over 2D CNN because the former can better extract 3D structural features from the GPCR data [25], [23], [26]. The voxelized GPCR structures are then fed into a 4-layer 3D CNN to extract features. There are 8, 16, 24, and 32 filters, and the kernels are cubes with sizes of 10, 6, 3, and 2 for each layer of the CNN, respectively (see Fig. 2). The stride for the convolution calculation in all layers is 1. 3D batch normalization and average pooling are applied after each convolutional layer [59]. The feature vector output from the CNN has a size of 256. A 3-layer fully-connected neural network, which has 400, 256, and 64 neurons in its layers, is used to predict the activity of the GPCR from the feature vector. ReLU activation [60] is used in both the CNN and the fully-connected neural network.
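The voxel-assignment scheme described above can be sketched in numpy as follows. The function name, grid shape, and side length are illustrative (the paper fixes one grid sized to the largest GPCR in the dataset, 458,964 voxels in total):

```python
import numpy as np

def voxelize(coords, atomic_numbers, side=1.0, grid_shape=(64, 64, 112)):
    """Map atoms into a fixed grid of cubic voxels (sketch of the scheme).

    coords: (n_atoms, 3) cartesian coordinates in Å.
    atomic_numbers: per-atom atomic numbers, e.g. 6 for carbon, 8 for oxygen.
    side: voxel edge length; grid_shape: number of voxels per dimension.
    """
    grid = np.zeros(grid_shape, dtype=np.int8)
    origin = np.asarray(coords).min(axis=0)  # shift the structure into the grid
    idx = np.floor((coords - origin) / side).astype(int)
    idx = np.clip(idx, 0, np.array(grid_shape) - 1)
    for (i, j, k), z in zip(idx, atomic_numbers):
        # If two atoms fall in the same voxel, keep the greater atomic number;
        # this mostly drops hydrogens and so minimizes information loss.
        grid[i, j, k] = max(grid[i, j, k], z)
    return grid
```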
Fig. 2

Voxelization and 3D convolutional neural network for GPCRs. At the top of the figure is a cartoon showing the voxelization process of atomic structures. The left part of the figure shows the voxelized 3D structure of a GPCR. Cyan, blue, yellow, and red voxels represent carbon, nitrogen, sulfur, and oxygen atoms, respectively. Voxels containing hydrogen atoms are not shown to avoid an over-crowded view. A zoom-in view of the voxelization of one residue is shown. A 4-layer 3D CNN extracts a feature vector from the voxelized GPCR structure for activity prediction.

CNN performance scales very well with the size of the training set. In addition, the CNN is translation-rotation invariant, making data augmentation a viable technique for enhancing its performance. We perform data augmentation [61], [53], [62] on the voxelized GPCR structures. Each voxelized GPCR structure is augmented by flipping along either the x, y, or z axis, and the augmented samples are given the same label (Figure S2). The size of the dataset is increased threefold through this method. During the 5-fold cross validation, the original GPCR structures and their augmented samples in the training set are used for training the model, while only the original GPCR structures in the test set are used for testing.
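The flip-based augmentation can be sketched as a one-liner over the three grid axes (the helper name is ours; each augmented copy keeps the original label):

```python
import numpy as np

def augment_by_flips(voxel_grid):
    """Return the original voxel grid plus one flipped copy per axis (x, y, z).

    The three flipped copies triple the amount of augmented training data,
    matching the threefold dataset increase described in the text.
    """
    return [voxel_grid] + [np.flip(voxel_grid, axis=ax) for ax in range(3)]
```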

Graph Neural Network

The previous shallow ML and 3D CNN models require either manually engineered features or voxelization to process proteins into Euclidean data, which can lose important information. Graph Neural Networks (GNNs) [29], [31], on the other hand, are designed to learn representations directly from unstructured graphs, such as chemical compounds [24], [32], [21]. A GNN operates on graphical data consisting of varying numbers of nodes and the edges between them. Such a strategy provides a more flexible way to model the protein structure and helps preserve structural information. In our case, a GPCR is considered as a graph G = (V, E), where V denotes all the residues (nodes) within a GPCR and E denotes the connections (edges) between residues [63], [64], [34]. We further define h_v as the node attribute for node v ∈ V and e_uv as the edge attribute for edge (u, v) ∈ E. In the GPCR graph shown in Fig. 3, each node represents a residue and each edge represents the distance between neighboring residues. We define the distance between residues as the distance between their alpha carbons (Cα), which captures the backbone structure of GPCRs. Neighboring residues within 6 Å are connected by edges, which nicely covers the adjacent residues while excluding remote residues to increase computational efficiency.
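Building the edge set from the Cα coordinates with the 6 Å cutoff might look like the following numpy sketch (the function name is ours):

```python
import numpy as np

def build_edges(ca_coords, cutoff=6.0):
    """Connect residue pairs whose alpha-carbon distance is below `cutoff` Å.

    ca_coords: (n_residues, 3) Cα coordinates. Returns a list of (u, v, d_uv)
    tuples with u < v, i.e. the undirected edge set of the GPCR graph.
    """
    ca = np.asarray(ca_coords, dtype=float)
    # Full pairwise Cα-Cα distance matrix.
    dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    edges = []
    n = len(ca)
    for u in range(n):
        for v in range(u + 1, n):
            if dist[u, v] < cutoff:
                edges.append((u, v, dist[u, v]))
    return edges
```

The stored distance d_uv (and the normalized direction vector, if needed) then becomes the edge attribute.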
Fig. 3

Overview of the Graph Neural Network (GNN) framework for GPCR activation prediction. Left: Each node in the GPCR graph represents a residue and edges are created to connect residue pairs within 6Å. Nodes are described by the amino acid type and the dihedral angles, while edges are defined by the distance and direction between residues. Right: The GNN model takes in the GPCR graph and sequentially runs through the graph convolutional layers, a pooling layer, and fully-connect layers to predict the activation.

For each node v, the node attribute h_v is defined as the 20-dimensional one-hot encoding over all 20 amino acid types, together with the dihedral angles φ and ψ. Each edge attribute e_uv is defined as the concatenation of the distance d_uv and the normalized directional vector between nodes u and v. In the experiments, combinations of different node and edge features are investigated: we compare the test accuracy of GNN models with different features included to explore which input features contribute more to the prediction of GPCR activation. For example, a GNN model trained with only the amino acid type as the node feature is compared against a model with both the amino acid type and the dihedral angles included. The Graph Neural Network (GNN) takes in the node and edge attributes and updates the node representations iteratively through aggregation and combination operations [65], [66], [67], [29]. Let h_u^(k) denote the representation of node u at the k-th layer in the GNN; h_u^(0) is initialized by a linear projection which maps the node attribute to the embedding dimension. Similarly, the edge attributes are mapped to the same dimension through a linear projection. We build our GNN following the Graph Isomorphism Network (GIN) [68] with edge attributes included [69], as this has been demonstrated to be a powerful GNN model in various applications [33], [70]. The update rule of the node representations in each graph convolution layer is defined in Eq.
1:

h_u^(k) = MLP^(k)( (1 + ε) · h_u^(k-1) + Σ_{v ∈ N(u)} σ( h_v^(k-1) + e_uv ) ),     (Eq. 1)

where N(u) denotes all the nodes directly connected to u through edges, ε is a learnable scalar weighting the node's own representation, σ is the activation function, and MLP^(k) is the non-linear update function. In our case, ReLU [60] is used as the activation function and MLP^(k) is modeled by fully-connected layers. After K layers of graph convolutions, we obtain the updated node representations h_v^(K) for all v ∈ V. Average pooling is implemented to extract the graph representation as given in Eq. 2:

h_G = (1 / |V|) Σ_{v ∈ V} h_v^(K).     (Eq. 2)

A prediction head built from fully-connected layers takes h_G and outputs the prediction of the activation, which can be either the classification of whether the GPCR is activated or the regression of the activation percentage. In our GNN model, we use 4 graph convolutional layers with a node representation dimension of 128, and a batch normalization layer [59] is added after each graph convolutional layer. The prediction head contains 2 hidden layers with dimensions 64 and 32, respectively. The model is trained for 100 epochs with a batch size of 32. Adam with weight decay is used to update the weights, with an initial learning rate of 0.001. The learning rate decays to 0.0001 after training for the first 50 epochs [71]. As with the shallow machine learning and 3D CNN models, we run 5-fold validation, and within each fold the whole dataset is randomly split into training/validation sets at a ratio of 4:1.
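A toy numpy version of one such GIN-style update with edge attributes may clarify the mechanics. To keep the sketch self-contained, the learned MLP is replaced by the identity and ε is a plain parameter; the function name and data layout are ours:

```python
import numpy as np

def gin_layer(h, edges, e_attr, eps=0.0):
    """One GIN-style graph convolution with edge attributes (numpy sketch).

    h: (n, d) node representations; edges: list of undirected (u, v) pairs;
    e_attr: dict mapping (u, v) to a (d,) edge embedding.
    """
    relu = lambda x: np.maximum(x, 0.0)  # sigma in Eq. 1
    agg = np.zeros_like(h)
    for u, v in edges:
        e = e_attr[(u, v)]
        # Messages flow both ways along an undirected edge: each neighbor's
        # representation is shifted by the edge embedding, then activated.
        agg[u] += relu(h[v] + e)
        agg[v] += relu(h[u] + e)
    # The real model applies a learned MLP to this sum; identity is used here.
    return (1.0 + eps) * h + agg
```

Stacking K such layers and averaging the final node representations gives the pooled graph vector h_G of Eq. 2.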

Results and discussion

Experimental Results

Table 1 lists the GPCR state classification results of the different machine learning methods under 5-fold validation. We use accuracy and F1 score as the two metrics to evaluate the classification performance of each model. The accuracy measures the ratio of correctly predicted instances. The F1 score is defined as the harmonic mean of the precision and recall of the predictions, where precision is the fraction of true positives among all positive predictions and recall is the fraction of true positives among all actual positive samples. Compared to accuracy, the F1 score provides a better metric when the data is imbalanced, i.e., when the number of samples varies across categories. In the shallow ML tests, two protein featurization schemes were used: the closest distance between any two non-hydrogen atoms in the 420 random residue pairs, and the selected 105 pairs providing the highest ML accuracies together with the residues involved in the polar network of the proteins. The XGBoost model with the 105-pairs surpasses the other models on both prediction accuracy and F1 score, indicating that the selected features capture the important residue networks which reflect the GPCR activation state. The GNN achieves comparable performance to XGBoost while requiring much less feature engineering, as it learns to extract the important features related to the activity of GPCRs. The regression results for the percentage of GPCR activation are shown in Table 2, with both the root mean square error (RMSE) and the mean absolute error (MAE) reported. The results show that the 3D CNN fails to compete with the other methods.
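The per-class precision, recall, and F1 defined above can be computed as (helper name is ours; a macro F1 would average this over the three classes):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class, as defined in the text."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    predicted_pos = np.sum(y_pred == positive)   # all positive predictions
    actual_pos = np.sum(y_true == positive)      # all actual positive samples
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```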
Table 1

Performance of different models for GPCR state classification. We report the mean and standard deviation of the accuracy and F1 score in 5-fold validation.

Model                         | Accuracy        | F1 score
XGBoost for 105-pairs         | 0.9586 (0.0044) | 0.9571 (0.0033)
XGBoost for 21-pairs          | 0.9369 (0.0098) | 0.9339 (0.0102)
3D CNN                        | 0.9117 (0.0438) | 0.7708 (0.1411)
3D CNN w/o data augmentation  | 0.8829 (0.0446) | 0.7604 (0.1634)
GNN                           | 0.9585 (0.0176) | 0.9386 (0.0296)
Table 2

Performance of different models for regression to degree of GPCR activation. We report the mean and standard deviation of the RMSE and MAE in 5-fold validation.

Model                         | RMSE            | MAE
XGBoost for 105-pairs         | 0.1291 (0.0701) | 0.0715 (0.0074)
XGBoost for 21-pairs          | 0.1605 (0.0682) | 0.0969 (0.0077)
3D CNN                        | 0.1420 (0.0394) | 0.1058 (0.0289)
3D CNN w/o data augmentation  | 0.1750 (0.0238) | 0.1310 (0.0231)
GNN                           | 0.1449 (0.0048) | 0.0897 (0.0041)
One possible explanation for the lower performance of the 3D CNN is that the voxelization of amino acids may lose important information about the type of residues and their stereochemical properties. Additionally, GPCR structures from X-ray diffraction and cryo-EM can feature missing side-chain information for certain residues and, in certain cases, no information for the hydrogen atoms. Since the graph featurization method does not require the atomic positions of all atoms, it is agnostic to the lack of such information. The graph featurization of proteins captures the global structural features of GPCRs, such as the TM3-TM6 distance, in addition to the individual amino acid features. Therefore, even in cases of missing residues in X-ray structures, this method is able to accurately classify the GPCR structure as active, inactive, or intermediate. Additionally, in the case of GPCRs, the global conformational changes are encoded in the relative distances between residue pairs. The absence of side-chain atom information will not eliminate the positional information of the residues. Since in the manual featurization scheme we calculate the closest heavy-atom distances, the relative distance between residue pairs is still encoded in the XGBoost model input. This ensures high robustness against missing residues in the model input: even if the positional information of a few atoms is missing, the remainder of the residue is sufficient for capturing the global features of the GPCR conformation.
In the GNN, we conduct mean pooling, which averages over all the updated node features to extract the feature for the whole GPCR structure. Thus, even if some residues are missing in the data, the implementation of the GNN models is not affected. Finally, even the 3D CNN methodology, which explicitly voxelizes the atomic positions of all residues, can capture the global features very well and generates accurate predictions for the GPCRs. To verify the impact of hydrogens on the prediction accuracy of the 3D CNN, we created an augmented dataset of GPCR structures with all hydrogens removed. We then evaluated the pre-trained model performance on the structures in both instances (GPCRs with and without hydrogens). The results in Tables S5 and S6 demonstrate that the performance of the 3D CNN drops slightly for both the classification and regression tasks in the absence of atomic information for the hydrogen atoms, indicating that the presence of hydrogens is beneficial to the 3D CNN in predicting the activation of the GPCRs. To further boost the performance of the 3D CNN, more training data can be incorporated. It is noteworthy that data augmentation improves the classification accuracy of the 3D CNN from 0.8829 to 0.9117 and the regression MAE from 0.1310 to 0.1058. Therefore we can expect better performance of the 3D CNN from a larger training dataset. To better understand the impact of mutations on GPCR conformations and the sensitivity of the presented methods to such mutations, we analyzed reported cases of mutations in GPCRs and their corresponding effect on protein conformation. For instance, in the case of the Neurotensin Receptor 1 (NTSR1), the experimental method of directed evolution has been used to obtain the NTSR1 protein in the inactive conformation (PDB: 3ZEV). Directed evolution caused the protein to acquire 11 point mutations (namely A86L, H103D, H105Y, A161V, R167L, R213L, V234L, I253A, H305R, F358V, and S362A) [72], [73].
These mutations were attributed to the higher stability of the inactive conformation of NTSR1, and they have a cumulative effect on the inactivation of NTSR1. Similarly, in the case of the Glucagon-like Peptide-1 Receptor (GLP-1R), the active (PDB: 5VAI) and inactive (PDB: 6LN2) structures have a sequence identity of 90.48% [74], [75], indicating that mutations were accumulated during the experimental processes of determining these structures. However, the featurization schemes proposed in this work can explicitly encode such mutation information in the inputs to the GNN, 3D CNN, and XGBoost models. The featurization schemes are able to incorporate atomic-level information and the spatial orientation of amino acids; therefore, the feature space is influenced by the mutations occurring in the structures. Voxelization encodes the spatial organization of atoms and atom types, so even a single-atom change is encoded into the model input. In the case of the GNN, each node denotes a residue; the mutation information is therefore naturally encoded in the GNN models, as a mutated GPCR possesses different residues at certain positions. Finally, the inter-residue distance-based featurization for XGBoost encodes the spatial atomic organization by considering the closest heavy-atom distance between the amino acid pairs. These featurization schemes allow the encoding of point mutations into the model inputs, which then predict the GPCR conformation and percentage of activation with high accuracy.

Misclassified GPCRs and tSNE visualization

To further compare the machine learning models, we investigate the distribution of misclassified GPCRs by class and the confusion matrix for each model. Fig. 4a shows that GPCRs in the intermediate state are difficult for the 3D CNN model to classify: the 3D CNN misclassified 33, 6, and 10 GPCRs in the intermediate, inactive, and active states, respectively. Similar to the 3D CNN, the XGBoost model also misclassified more intermediate-state GPCRs than those of the other two states. Compared with the 3D CNN and XGBoost models, the GNN demonstrates a consistent level of accuracy across GPCRs of all activation states. The confusion matrix of each model is normalized over the ground-truth condition (each row). Although the 3D CNN model achieves 98% and 95.5% accuracy in classifying inactive and active GPCRs (Fig. 4d), it misclassifies 75% of the intermediate GPCRs as inactive. The low accuracy of the 3D CNN on intermediate-state GPCRs not only corresponds to its lower F1 score (Table 1) compared with the other two models but also prevents it from reaching an accuracy as high as the GNN's. Moreover, the XGBoost and GNN models tend to misclassify inactive and active GPCRs as the opposite activation state rather than as the intermediate state; for example, the XGBoost model misclassifies none of the inactive or active GPCRs as intermediate. The 3D CNN, in contrast, tends to misclassify inactive or active GPCRs as intermediate rather than as the opposite activation state. One reason for this difference is that the 3D CNN makes predictions based on the extracted structural features of the GPCRs, and these structural features can be very different between the inactive and active states.
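Row normalization of a confusion matrix over the ground-truth condition can be sketched as below; the counts are invented for illustration and are not the paper's results.

```python
import numpy as np

def normalize_confusion_matrix(cm: np.ndarray) -> np.ndarray:
    """Divide each row by its sum so that entry (i, j) is the fraction
    of true-class-i samples predicted as class j. The diagonal then
    gives per-class recall."""
    return cm / cm.sum(axis=1, keepdims=True)

# Illustrative counts; rows are the true classes
# (inactive, intermediate, active), columns the predictions.
cm = np.array([
    [98,  1,  1],
    [30,  5,  5],
    [ 2,  2, 86],
])
norm = normalize_confusion_matrix(cm)
assert np.allclose(norm.sum(axis=1), 1.0)  # each row sums to 1
```

With this normalization a weak class (here the second row) is easy to spot even when its absolute counts are small, which is exactly how the intermediate-state weakness of the 3D CNN shows up in Fig. 4d.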
Fig. 4

(a) Number of misclassified GPCRs in each activation state for the three models. (b)-(d) Normalized confusion matrix of each model.

The t-SNE [76] visualization of the features learnt by the 3D CNN and GNN can help us rationalize the superior performance of the GNN method compared to the 3D CNN. t-SNE is a dimensionality reduction technique that preserves the local structure of the data (i.e., similar data points are clustered together in the embedding) and has been widely adopted in the field of bioinformatics [77], [78], [79]. A latent feature vector for each GPCR is extracted during the training of the GNN and 3D CNN models: the feature maps learnt by the GNN and 3D CNN are passed through a series of fully connected (FC) layers (a multi-layer perceptron, MLP), and the latent feature vector is the output of the last FC layer. This output is a dense representation of the input, as it captures all the information learnt by the preceding layers of the network. Here, t-SNE is used to map the latent feature vectors of all GPCRs into a 2D space (Fig. 5), where t-SNE embedding 1 and embedding 2 are the mappings of the latent feature vector onto the first and second dimensions, respectively. Although the active and inactive GPCRs are well separated in the 2D visualization of the latent features learned by the 3D CNN (Fig. 5a), the boundary of the intermediate-state cluster is not obvious. This corresponds to the high misclassification rate of the 3D CNN on intermediate-state GPCRs, because the features extracted by the 3D CNN do not distinguish the intermediate state from the others. In the case of the GNN latent feature visualization, by contrast, almost all GPCRs are clustered with other members of the same activation class. 
The comparison between the t-SNE visualizations of the 3D CNN- and GNN-extracted features shows that the GNN maximizes the difference between GPCRs of different activation states, which results in its higher accuracy on the classification task. Moreover, the t-SNE visualization can help identify mislabeled GPCRs. For example, the GLP1R_5NX2 GPCR (bottom right corner of both Fig. 5a and Fig. 5b) is labeled as intermediate but clusters with active-state GPCRs using both the GNN- and 3D CNN-learnt features. Classifying the intermediate class is also particularly challenging due to the lack of consensus in the biophysics literature on defining the intermediate state of GPCRs [80], [81]. The resulting ambiguity in the intermediate label creates a confusing classification task for both the 3D CNN and GNN models.
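The t-SNE mapping of latent feature vectors into 2D can be sketched with scikit-learn; the 200 structures and 64-dimensional latent vectors here are placeholder assumptions rather than the paper's actual dimensions.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical latent vectors: one 64-d feature per GPCR structure,
# taken from the last fully connected layer of a trained model.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 64))

# Map the latent vectors into 2D; the two output columns are
# t-SNE embedding 1 and embedding 2 used for the scatter plots.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latent)
assert embedding.shape == (200, 2)
```

Coloring the resulting 2D points by activation label (inactive / intermediate / active) then reproduces the kind of cluster picture shown in Fig. 5.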
Fig. 5

t-SNE dimension-reduced visualization of GPCRs using features learned by (a) 3D CNN and (b) GNN. Each GPCR is colored based on its activate level. Rectangle, triangle, and circle scatter points represent inactive, intermediate and active GPCRs, respectively. Snapshots of some GPCRs that are clustered to other activation states are selected to be shown.


Feature Engineering and XGBoost

We investigate the prediction accuracy using features involved in the polar network underlying GPCR activation (Table S1) [82]. Such features include the hydrogen bonds stabilizing the active and inactive states of opioid receptors, which must be rearranged to achieve the active conformation. Most of the residues engaged in the polar network are conserved, suggesting that they may have similar functions in GPCR activation. We observe that the list of top 105-pairs includes 21 residue pairs in which at least one residue belongs to the polar network in GPCRs (Table S2). By using these 21-pairs of contact distances as input, we were able to predict the GPCR state with 92.25% accuracy with the Random Forest model and 93.69% accuracy with the XGBoost model. The comparison of prediction accuracies reveals that the difference between the Random Forest and XGBoost models across datasets is no more than 1.44 percentage points (Fig. 6b).
Fig. 6

(a) Correlation between prediction accuracy and the difference between the means of the closest heavy-atom distances in the active and inactive states (Δd). (b) Comparison between the prediction accuracy of the Random Forest and XGBoost methods. The 105-pairs are the 105 residue pairs with the top accuracies among 420 random residue pairs; the 21-pairs are the 21 residue pairs out of the 105-pairs in which at least one residue belongs to the polar network of GPCR activation. Each data point is taken over 555 proteins. (c)-(f) Normalized histograms of the closest distance between two non-hydrogen atoms of a residue pair over 555 proteins in the active, intermediate, and inactive states, with the corresponding representation of the conformational changes of the residue pairs on the NTS1 protein. Ballesteros-Weinstein numbers are used to label the amino acids. 
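The closest heavy-atom distance between a residue pair, the per-pair feature used above, can be sketched as follows; the coordinate arrays are assumed to already contain only non-hydrogen atoms, and the toy values are purely illustrative.

```python
import numpy as np

def closest_heavy_atom_distance(res_a: np.ndarray, res_b: np.ndarray) -> float:
    """Minimum pairwise distance between the heavy (non-hydrogen)
    atoms of two residues, given as (n_atoms, 3) coordinate arrays
    in Angstroms."""
    diffs = res_a[:, None, :] - res_b[None, :, :]   # (n_a, n_b, 3)
    dists = np.linalg.norm(diffs, axis=-1)          # all pairwise distances
    return float(dists.min())

# Toy coordinates for two small residues, atoms along the x-axis.
res_a = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
res_b = np.array([[4.0, 0.0, 0.0], [5.5, 0.0, 0.0]])
assert closest_heavy_atom_distance(res_a, res_b) == 2.5
```

Evaluating this function for each selected residue pair of a structure yields the fixed-length distance vector (21 or 105 entries) fed to the tree-based models.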
We explore the normalized histogram of the closest distance between two non-hydrogen atoms of each residue pair over 555 proteins and the prediction accuracy of the XGBoost model, with the corresponding PyMOL [83] representation of the residue pair on the NTS1 protein (Fig. 6c-f). Here, Δd is measured as the distance between the peaks of the histograms in the active (green plot) and inactive (red plot) states. The 3.40–7.49 residue pair achieves 72% prediction accuracy, where residue 7.49 belongs to the polar network of GPCR activation; on the NTS1 protein, N7.49 rotates 46.6° and translates 1.6 Å, while A3.40 rotates 6.6° and translates 1.5 Å (Fig. 6c). The 6.38–7.46 pair, in which residue 7.46 is engaged in the polar network, obtains 69% prediction accuracy; its rearrangement on the NTS1 protein is associated with a 3.6 Å translation and 69.4° rotation of S7.46 and a 4.1 Å translation and 32.5° rotation of R6.38 (Fig. 6d). The 6.31–3.32 residue pair, where residue 3.32 belongs to the polar network, provides 68% prediction accuracy; R6.31 translates 17.6 Å and rotates 114.2°, while R3.32 has a 14.7° rotation and a 1.1 Å translation on the NTS1 protein (Fig. 6e). The 6.34–7.41 residue pair provides 67% prediction accuracy, where residue 7.41 is involved in the polar network; on the NTS1 protein, L7.41 shows no significant conformational change, while V6.34 translates 6.2 Å and rotates 33.2° (Fig. 6f). Comparing the accuracy of the XGBoost model for the 21-pairs and the 105-pairs reveals that the higher-dimensional features improve the classification accuracy from 93.69% to 95.86% and the regression MAE from 0.0969 to 0.0715 (Table 1, Table 2). The manually engineered features thus make the XGBoost model comparable to the GNN model on the classification task (95.86% vs 95.85%) and better than the GNN model on the regression task (MAE 0.0715 vs 0.0897; Table 1, Table 2). 
In addition to using ML methods, correlating important residue positions with the GPCR structure can provide insights into the GPCR conformational landscape. By comparing the ranked list of features from our feature engineering method with the important residue interactions reported in the literature, we conclude that the feature engineering method identifies the important descriptors of GPCR states and conformations. The top-ranking features, which best distinguish between the active and inactive conformations of GPCRs, can be potential targets of therapeutic molecules to regulate the GPCR structure–activity-function relationship.
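Training a gradient-boosted tree classifier on such pairwise-distance features and ranking them by importance can be sketched as below; scikit-learn's GradientBoostingClassifier is used here as a stand-in for XGBoost, and the data are synthetic placeholders rather than the paper's 555-protein dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 555 structures x 21 contact-distance
# features with three activation-state labels (0=inactive,
# 1=intermediate, 2=active). Real inputs would be the 21-pair
# closest heavy-atom distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(555, 21))
y = rng.integers(0, 3, size=555)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient-boosted trees as a stand-in for XGBoost.
model = GradientBoostingClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

# Per-feature importances rank residue pairs by how strongly they
# separate the activation states; the top entries would correspond
# to the most informative contact distances.
ranked = np.argsort(model.feature_importances_)[::-1]
assert len(ranked) == 21
```

The same `feature_importances_` attribute (also exposed by the xgboost package) is the kind of interpretability handle that makes tree-based models attractive for identifying biophysically meaningful residue pairs.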

Conclusion

In this work, we have developed three ML-based approaches to predict the discrete state and activation level of GPCR structures. To learn the structure-activity relations in GPCRs, we developed 3D CNN, GNN, and XGBoost models. The 3D CNN approach requires minimal feature engineering and can extract important features from the voxelized representation of the protein structure; however, it also has the lowest accuracy of the three methods (91%). The GNN approach, in comparison to the 3D CNN, offers improved prediction accuracy on both the classification (95.85%) and regression tasks (MAE 0.0897). The GNN incorporates a degree of feature engineering by generating a graph representation of the protein structure that encodes residue types and dihedral angles. In the third approach, we designed and engineered biophysics-aware features and ranked them; the top features were used to train the XGBoost model for the classification and regression tasks. The biophysics-aware features perform on par with the GNN model on the classification task (95.86%) and outperform the GNN method on the regression task (MAE 0.0715). We then interpret the important features learnt by our models and use the feature importances to identify the residue pairs that distinguish between the active, inactive, and intermediate conformations of GPCRs. Finally, we propose a list of residue pairs that can be used to develop a quantitative description of the GPCR states. This work and its conclusions can be extended to understand the transition between the active and inactive conformations of a GPCR, to identify the residue pairs important for that transition, and to design therapeutics accordingly.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References: 46 in total

Review 1.  Impact of GPCR Structures on Drug Discovery.

Authors:  Miles Congreve; Chris de Graaf; Nigel A Swain; Christopher G Tate
Journal:  Cell       Date:  2020-04-02       Impact factor: 41.582

Review 2.  Energy landscapes as a tool to integrate GPCR structure, dynamics, and function.

Authors:  Xavier Deupi; Brian K Kobilka
Journal:  Physiology (Bethesda)       Date:  2010-10

3.  Structure-based protein function prediction using graph convolutional networks.

Authors:  Vladimir Gligorijević; P Douglas Renfrew; Tomasz Kosciolek; Julia Koehler Leman; Daniel Berenberg; Tommi Vatanen; Chris Chandler; Bryn C Taylor; Ian M Fisk; Hera Vlamakis; Ramnik J Xavier; Rob Knight; Kyunghyun Cho; Richard Bonneau
Journal:  Nat Commun       Date:  2021-05-26       Impact factor: 14.919

Review 4.  Action of molecular switches in GPCRs--theoretical and experimental studies.

Authors:  B Trzaskowski; D Latek; S Yuan; U Ghoshdastider; A Debinski; S Filipek
Journal:  Curr Med Chem       Date:  2012       Impact factor: 4.530

5.  Understanding Ligand Binding Selectivity in a Prototypical GPCR Family.

Authors:  Giulio Mattedi; Francesca Deflorian; Jonathan S Mason; Chris de Graaf; Francesco L Gervasio
Journal:  J Chem Inf Model       Date:  2019-06-04       Impact factor: 4.956

6.  GPCRdb in 2021: integrating GPCR sequence, structure and function.

Authors:  Albert J Kooistra; Stefan Mordalski; Gáspár Pándy-Szekeres; Mauricio Esguerra; Alibek Mamyrbekov; Christian Munk; György M Keserű; David E Gloriam
Journal:  Nucleic Acids Res       Date:  2020-12-03       Impact factor: 16.971

Review 7.  A Comprehensive Survey on Graph Neural Networks.

Authors:  Zonghan Wu; Shirui Pan; Fengwen Chen; Guodong Long; Chengqi Zhang; Philip S Yu
Journal:  IEEE Trans Neural Netw Learn Syst       Date:  2021-01-04       Impact factor: 10.451

8.  Development and evaluation of a deep learning model for protein-ligand binding affinity prediction.

Authors:  Marta M Stepniewska-Dziubinska; Piotr Zielenkiewicz; Pawel Siedlecki
Journal:  Bioinformatics       Date:  2018-11-01       Impact factor: 6.937

9.  DeepPhos: prediction of protein phosphorylation sites with deep learning.

Authors:  Fenglin Luo; Minghui Wang; Yu Liu; Xing-Ming Zhao; Ao Li
Journal:  Bioinformatics       Date:  2019-08-15       Impact factor: 6.937

10.  Highly accurate protein structure prediction with AlphaFold.

Authors:  John Jumper; Richard Evans; Alexander Pritzel; Tim Green; Michael Figurnov; Olaf Ronneberger; Kathryn Tunyasuvunakool; Russ Bates; Augustin Žídek; Anna Potapenko; Alex Bridgland; Clemens Meyer; Simon A A Kohl; Andrew J Ballard; Andrew Cowie; Bernardino Romera-Paredes; Stanislav Nikolov; Rishub Jain; Demis Hassabis; Jonas Adler; Trevor Back; Stig Petersen; David Reiman; Ellen Clancy; Michal Zielinski; Martin Steinegger; Michalina Pacholska; Tamas Berghammer; Sebastian Bodenstein; David Silver; Oriol Vinyals; Andrew W Senior; Koray Kavukcuoglu; Pushmeet Kohli
Journal:  Nature       Date:  2021-07-15       Impact factor: 49.962

