
PIPENN: Protein Interface Prediction from sequence with an Ensemble of Neural Nets.

Bas Stringer¹, Hans de Ferrante¹, Sanne Abeln¹, Jaap Heringa¹, K Anton Feenstra¹, Reza Haydarlou¹.

Abstract

MOTIVATION: The interactions between proteins and other molecules are essential to many biological and cellular processes. Experimental identification of interface residues is a time-consuming, costly, and challenging task, while protein sequence data is ubiquitous. Consequently, many computational and machine learning approaches have been developed over the years to predict such interface residues from sequence. However, the effectiveness of different deep learning architectures and learning strategies for protein-protein, protein-nucleotide, and protein-small molecule interface prediction, has not yet been investigated in great detail. Therefore, we here explore the prediction of protein interface residues using six deep learning architectures and various learning strategies with sequence-derived input features.
RESULTS: We constructed a large data set dubbed BioDL, comprising protein-protein interactions from the PDB, and DNA/RNA and small molecule interactions from the BioLip database. We also constructed six DL architectures, and evaluated them on the BioDL benchmarks. This shows that no single architecture performs best on all instances. An ensemble architecture, which combines all six architectures, does consistently achieve peak prediction accuracy. We confirmed these results on the published benchmark set by Zhang & Kurgan (ZK448), and on our own existing curated homo- and heteromeric protein interaction data set. Our PIPENN sequence-based ensemble predictor outperforms current state-of-the-art sequence-based protein interface predictors on ZK448 on all interaction types, achieving an AUC-ROC of 0.718 for protein-protein, 0.823 for protein-nucleotide and 0.842 for protein-small molecule.
AVAILABILITY: Source code and data sets at https://github.com/ibivu/pipenn/.


Year: 2022 PMID: 35150231 PMCID: PMC9004643 DOI: 10.1093/bioinformatics/btac071

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Protein interactions are crucial in many biological and cellular processes (Jones and Thornton, 1996), such as transcription, signal transduction or enzymatic activity. Proteins, through their binding interfaces, interact with each other and a variety of other molecules, giving rise to all manner of cell functions. Knowledge about these interfaces provides essential clues about the mechanisms underlying associated activities. This molecular level knowledge can be obtained by experimental and computational methods, and is applied in many scientific and therapeutic areas (Sperandio, 2012). Protein interaction prediction refers to a set of computational methods that aim to predict different protein interaction types: protein–protein (PPI), protein–small molecule, protein–nucleotide (DNA/RNA), as summarized in Supplementary Figure S1 (Cui et al.; Wang et al.; Zhang and Kurgan, 2018). Such methods may utilize various classic Machine Learning (ML) (Cheng et al.; Hou et al., 2021) and Deep Learning (DL) architectures (Hanson et al.; Shi et al.). Input features may be based on information from protein structure and/or sequence. Notwithstanding the enormous progress that has been made in the area of structure prediction, no reliable structural information is available (e.g. Su et al.; Tunyasuvunakool et al.) for many organisms, types of proteins and protein regions. Moreover, the usefulness of predicted structures for interface prediction may be limited (e.g. Xie and Xu, 2021). Therefore, we aim to predict protein bindings of the mentioned interaction types at residue level, using only information related to protein sequence. We explore the following question: Which DL architecture and composition of architectural building blocks are able to improve the performance of sequence-based interface predictors? DL architectures are composed of multiple building blocks, such as initialization, regularization, loss and activation, each containing various parameters.
The simplest neural architecture considered here is the fully connected Artificial Neural Network (ANN): every neuron in a layer is connected to all neurons in the next layer. This configuration is general purpose and structure agnostic. The input to this network is a single protein residue at a time, which means that its sequence context is not considered. Convolutional Neural Network (CNN) architectures are already extensively used in various flavors in computational biology (e.g. Shi et al.), including protein interaction site prediction (e.g. Cui et al.). A CNN consists of three types of hidden layers: convolutional layers, pooling layers and fully connected layers. For predictions in a protein sequence, neurons are organized sequentially (1D spatial form) and every neuron is connected to the neurons of a local region (receptive field) in the previous layer. Pooling layers perform down-sampling to speed up computation, and fully connected layers perform the actual classification task based on the abstract representations of the original protein sequence input. Dilated Convolutional Networks (DCN) achieve large receptive fields by gradually increasing the dilation rate in subsequent layers (Yu and Koltun, 2016). This allows DCNs to remain relatively shallow, require few parameters and converge quickly. DCNs typically maintain high output resolutions without up-sampling, and have been successfully applied in many areas (Ho and Lin, 2018), including computational genomics (Gupta and Rush, 2017; Kelley et al.). The U-Net is the most commonly used CNN architecture, especially when the amount of training data is limited (Ronneberger et al.). In a U-Net, high-dimensional information is first reduced (contracted) to a smaller latent space, and subsequently increased (expanded) to the original dimensionality. Visually, this gives rise to a U-shape, from which this architecture derives its name.
Though more typically used in image recognition tasks, we find U-Nets, like CNNs in general, equally applicable for 1D (sequence-based) predictions. The ResNet residual learning framework is a variation on CNN, which aims to solve the vanishing or exploding gradient problem, while retaining the ability to learn complex features (He et al.). Recurrent Neural Networks (RNN) are particularly suitable for learning nonlinear dependencies in sequential data, such as protein sequences. No limit is imposed on the size of the input series of amino acid symbols, but each symbol is represented by additional layers. To mitigate the vanishing/exploding gradient in the resulting many-layer models (Hochreiter and Schmidhuber, 1997) and retain computationally efficient training, the Gated Recurrent Unit (GRU) was proposed by Cho et al. Later, Chung et al. showed that the performance of GRU is on par with the more complex LSTM (Hochreiter and Schmidhuber, 1997) on sequence modeling tasks, while taking much less time to train. Recently, hybrid architectures of CNNs and RNNs have been increasingly employed in computational biology. The hybrid models aim to get the best of both worlds: the spatial aspects of CNNs combined with the temporal aspects of RNNs. For instance, Hanson et al. combined a residual convolutional network with LSTM to predict a protein’s residue–residue contacts (contact map prediction task). They use ResNet to capture spatial relationships between local residues, and LSTM to capture long-range relations between non-local residues. In the field of genomics, Quang and Xie (2016) integrated CNNs with bidirectional LSTMs for modeling of the properties and functions of non-coding DNA sequences. To learn a regulatory grammar in the DNA motifs, they use convolution layers to capture local patterns in the motif sequences, and recurrent layers to capture long-term dependencies between the motifs.
Here, we explore the effectiveness of each of the above neural net architectures: ann, dnet, unet, rnet, rnn and cnet. We expect they will each capture overlapping but distinct patterns in protein sequence data, yielding different predictions for which residues are part of an interface. We therefore also include an ensemble architecture, which combines the outputs of these six neural nets into one. All models, collectively referred to as PIPENN, are trained on five different training sets. Their performance is benchmarked with 11 test sets, including the standardized benchmark dataset introduced by Zhang and Kurgan (2018), referred to as ZK448. The latter is also used to assess and compare the performance of the PIPENN ensemble method with competing models. To our knowledge, we are the first to apply DL architectures to different types of protein interface prediction at this scale, and obtain the best prediction results to date.

2 Materials and methods

2.1 Datasets

To perform experiments on different types of protein interaction data, i.e. PPI, small molecule and nucleotide (DNA/RNA), we used five datasets for training (see Supplementary Table S1) and 11 independent datasets for testing (see Supplementary Table S3). The HHC dataset, containing homo- and heteromeric proteins annotated with PPI interface residues, was already available in our group from Hou et al. (2017). Another part was obtained from Zhang and Kurgan (2018): the benchmark test set ZK448, which contains proteins annotated with residues for all types of protein interaction. The final part was newly constructed: the BioDL dataset, which contains proteins annotated with residues for all types of protein interaction, as described below and illustrated in Figure 1.

Fig. 1.

Generation of the BioDL dataset from the PDB and BioLip databases

For small molecule and nucleotide interactions, we retrieved the whole BioLip (Yang et al.) database and extracted the interaction annotation data. For PPIs, we downloaded the coordinates of 138 729 protein structures of 2.5 Å resolution or lower, excluding fragments, from the PDB (Berman et al.) on 3 April 2019. Following the annotation criterion in BioLip, we annotated a residue as interacting if the distance between one of its atoms and any atom of the ligand is less than the sum of their Van der Waals radii plus 0.5 Å. The newly derived annotations and the existing annotations from BioLip are associated with PDB sequences. In order to test our models on the benchmark test set ZK448, where annotations are associated with Uniprot sequences, we mapped PDB sequences to Uniprot as follows. We used SIFTS (Velankar et al.) to retrieve all Uniprot sequences corresponding to the PDB entries. Entries with missing or conflicting Uniprot IDs were discarded. We mapped residues between PDB and Uniprot sequences by alignment with harsh penalties (mismatch: 5, gap opening: 20 and gap extension: 50). Alignments were discarded if <80% of interaction site residues were mapped, or if they contained more than two inserts, five deletions, three mismatches in the interaction sites, or five mismatches overall. Subsequently, BLASTClust was used to cluster the obtained Uniprot sequences at 25% sequence similarity. All clusters containing a Uniprot ID used in ZK448 were removed, and one sequence was then chosen at random from each remaining cluster. From this non-redundant dataset, based on the criterion used by Wang et al., proteins with sequences longer than 700 or shorter than 26 amino acids were removed. The final dataset was split into a training set BioDL_A_TR (95%) and a testing set BioDL_A_TE (5%), containing annotations for all three types of interactions.
We further split these into proteins annotated with PPI (BioDL_P_TR & _TE), small molecule (BioDL_S_TR & _TE) and nucleotide (BioDL_N_TR & _TE) interactions. See Supplementary Tables S1 and S3 for dataset statistics of the training (TR) and test (TE) sets, respectively.
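The distance criterion used above for annotating interface residues can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the radii table holds approximate textbook Van der Waals values chosen for illustration (BioLip may use different ones), and the function name and atom representation are hypothetical.

```python
import math

# Approximate Van der Waals radii in Å (illustrative assumption, not taken from the paper).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80, "P": 1.80}


def is_interface_residue(residue_atoms, ligand_atoms, margin=0.5):
    """Return True if any residue atom is closer to any ligand atom than
    the sum of their Van der Waals radii plus `margin` (in Å).

    Each atom is a tuple (element, (x, y, z)).
    """
    for elem_r, xyz_r in residue_atoms:
        for elem_l, xyz_l in ligand_atoms:
            cutoff = VDW_RADII[elem_r] + VDW_RADII[elem_l] + margin
            if math.dist(xyz_r, xyz_l) < cutoff:
                return True
    return False


# Example: a carbon and an oxygen 3.5 Å apart (cutoff 1.70 + 1.52 + 0.5 = 3.72 Å).
residue = [("C", (0.0, 0.0, 0.0))]
near_ligand = [("O", (3.5, 0.0, 0.0))]
far_ligand = [("O", (4.0, 0.0, 0.0))]
print(is_interface_residue(residue, near_ligand))
print(is_interface_residue(residue, far_ligand))
```

The same function applies to protein, nucleotide and small molecule partners, since the criterion only depends on atom pairs.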

2.2 Data features

For each residue in our BioDL dataset, we record the amino acid type (AA); its conservation, represented by a Position Specific Scoring Matrix (PSSM) score; whether it is part of a domain; and the length of the sequence it belongs to. Furthermore, we predict its accessible surface area (SA) and secondary structure (SS) from sequence. We include four training/testing labels for each residue, tracking whether the residue is part of a known interface with (i) other proteins, (ii) DNA or RNA, (iii) small molecule ligands or (iv) any of the above. PSSM profiles were generated for each sequence using PSI-BLAST (Altschul et al.), retrieving at most 500 sequence homologs from the NR70 database, using three iterations and an E-value threshold of 0.001. As a result, we obtained 20 PSSM values (one per amino acid type) for each residue. The PSSM scores were normalized using the sigmoid function. We used NetSurfP to predict from sequence the Absolute/Relative Surface Accessibility (ASA/RSA) and the propensities for α-helix (PA), β-sheet (PB) and coil (PC) (Petersen et al.). To indicate whether or not a residue belongs to a conserved protein domain, we employed the protein domain information from the Pfam (Mistry et al.) database for the Uniprot sequences. Finally, we included four windowed aggregates for each of the features. Each aggregated feature is the unweighted average of the feature value over a specific number of neighboring residues: 3, 5, 7 or 9. They are abbreviated as _wm_, e.g. 9_wm_PB refers to the window mean of the predicted β-sheet probabilities (PB) across a window of nine adjacent residues, annotating the one in the center. This leads to a total of 128 features, as detailed in Supplementary Table S13.
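The sigmoid normalization and the windowed aggregates described above can be sketched in a few lines. This is an illustration under stated assumptions: the text does not say how windows are handled at the sequence ends, so the sketch simply truncates them, and the helper names are hypothetical.

```python
import math


def sigmoid(x):
    """Squash a raw PSSM score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def window_mean(values, window):
    """Unweighted mean over a window centered on each residue.

    Assumption: windows are truncated at the sequence ends.
    """
    half = window // 2
    means = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        means.append(sum(values[lo:hi]) / (hi - lo))
    return means


def windowed_features(name, values, windows=(3, 5, 7, 9)):
    """Build the four aggregates for one feature, keyed like '9_wm_PB'."""
    return {f"{w}_wm_{name}": window_mean(values, w) for w in windows}


# Example with made-up β-sheet propensities for a five-residue stretch.
pb = [0.0, 3.0, 6.0, 9.0, 12.0]
print(windowed_features("PB", pb)["3_wm_PB"])
print(sigmoid(0.0))
```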

2.3 Learning architectures

The input layer of our ANN-based architecture ann consists of a number of neurons, each representing a feature of one amino acid (see Supplementary Fig. S4). The network has eight hidden layers and its output is a binary classification predicting whether or not an amino acid is part of an interface. In order to capture as much information as possible about a residue, the number of neurons in the first hidden layer is the highest, decreasing gradually in the subsequent layers to achieve more general representations. However, this architecture has a large number of parameters (weights) and a relatively long convergence time. We designed three different CNN architectures: dnet, unet and rnet. The input layers for these consist of one neuron per feature per amino acid in the protein sequence. dnet is based on dilated CNN, and has five convolutional layers, no pooling layers and one fully connected layer (Supplementary Fig. S5). The dilation rate increases from 1 to 16. unet is based on the U-Net architecture; its contraction block starts at a length of 1024 (the length of the padded protein sequence), which is gradually down-sampled by max-pooling to 32 at the bottleneck (Supplementary Fig. S6). For up-sampling, transposed convolutions are used (Dumoulin and Visin, 2016). rnet is a residual CNN that allows us to experiment with deeper CNNs (Supplementary Fig. S7). We use a slightly modified version of the full pre-activation configuration (He et al.). In rnet, a residual block consists of two sequential full pre-activation configurations, each containing a Dropout, BatchNormalization and Parametric Rectified Linear Unit (PReLU) followed by a 1D convolution. We use eight such residual blocks. Our recurrent NN architecture rnn consists of two GRU layers, each containing 1024 cells with each cell having an output dimension of 128 that goes to the next cell (Supplementary Fig. S8). Finally, our cnet (Supplementary Fig. S8) is a hybrid architecture that simply combines our rnet residual and rnn recurrent architectures.
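Two of the numbers above can be checked with simple arithmetic. The sketch below is only illustrative: the pool size of 2, the kernel size of 3 and the doubling of the dilation rate between layers are assumptions, since the text gives only the end points (1024 to 32, and dilation 1 to 16 over five layers).

```python
def unet_lengths(start=1024, bottleneck=32, pool=2):
    """Sequence lengths along the contracting path, assuming each
    max-pooling step divides the length by `pool`."""
    lengths = [start]
    while lengths[-1] > bottleneck:
        lengths.append(lengths[-1] // pool)
    return lengths


def receptive_field(kernel_size, dilations):
    """Receptive field (in residues) of stacked 1D dilated convolutions
    with stride 1: each layer adds (kernel_size - 1) * dilation positions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed pool size 2: five pooling steps take 1024 down to 32.
print(unet_lengths())
# Assumed kernel size 3 and dilation rates 1, 2, 4, 8, 16.
print(receptive_field(3, [1, 2, 4, 8, 16]))
```

Under these assumptions, five dilated layers already see a context of several dozen residues, which is the motivation for dilation given in the Introduction.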

2.4 Architecture building blocks and parameters

DL architectures are composed of multiple building blocks, each containing various parameters. Different compositions enable learning architectures to provide great flexibility and a broad application area. The choice for building blocks and parameter values strongly influences the performance that may be achieved; however, this depends both on the architecture and on the dataset, making this one of the main challenges in DL. Below, we briefly motivate each of our choices.

Initialization method (IM)

We experimented with uniform and normal variants of the Glorot (Glorot and Bengio, 2010) and He (He et al.) IMs. He-Uniform initialization proved most suitable for our architectures.
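For reference, the uniform variants of the two initializers differ only in the bound of the sampling interval. The sketch below uses the standard definitions (He: sqrt(6 / fan_in); Glorot: sqrt(6 / (fan_in + fan_out))); the function names and the plain-list weight matrix are illustrative choices, not part of PIPENN.

```python
import math
import random


def he_uniform_limit(fan_in):
    """Bound of the He-Uniform sampling interval [-limit, limit]."""
    return math.sqrt(6.0 / fan_in)


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the Glorot-Uniform sampling interval [-limit, limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out weight matrix with He-Uniform initialization."""
    rng = random.Random(seed)
    limit = he_uniform_limit(fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


weights = he_uniform(fan_in=128, fan_out=64)
print(he_uniform_limit(128), glorot_uniform_limit(128, 64))
```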

Regularization method (RM)

To overcome overfitting, we explored combinations of RMs: Lasso (L1), Ridge (L2), BatchNormalization, Dropout and Early-Stopping. A combination of BatchNormalization, Dropout (20%) and Early-Stopping (based on the AUC metric) yielded best results.

Loss function (LF)

We explored a number of LFs: Mean Squared Error (MSE), Jaccard, Tversky and CrossEntropy. After several experiments, and because our data are strongly class-imbalanced, we chose the CrossEntropy loss with an additional term that compensates for class imbalance.
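The exact form of the compensating term is not given here. One common way to compensate for class imbalance, shown below purely as an assumption, is to up-weight the positive (interface) class in the binary cross-entropy; the `pos_weight` parameter and the function name are hypothetical.

```python
import math


def weighted_bce(y_true, y_pred, pos_weight, eps=1e-7):
    """Mean binary cross-entropy in which errors on positives are
    multiplied by `pos_weight` (assumed form of the imbalance term)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# With roughly 11% interface residues, a weight near n_neg / n_pos (about 8)
# would balance the two classes; this choice is illustrative.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0]
preds = [0.3, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1]
print(weighted_bce(labels, preds, pos_weight=8.0))
```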

Activation function (AF)

Our choice of AF is based on a neuron’s position in the network, the computational speed of calculating its gradient, and its differentiability. We used Sigmoid at the last layer of all our architectures. For the hidden layers, using Tanh for the rnn architecture and PReLU for all other architectures provided best performance.

Encoding scheme (ES)

We performed experiments with One-Hot and ProtVec to encode amino acid sequences. One-Hot represents each amino acid simply as a 20D sparse vector. ProtVec constructs 3-gram words of amino acids for each protein sequence from Swiss-Prot, and trains a skip-gram Neural Network on these data (Asgari and Mofrad, 2015). The output of the network is an embedding space of 100D dense vectors, which we used as input features for our models. We achieved better performance with the One-Hot representation.
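A One-Hot encoding of a sequence can be sketched as below. The ordering of the 20 amino acids and the all-zero treatment of non-standard residues are assumptions made for this illustration.

```python
# Alphabetical one-letter codes of the 20 standard amino acids (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode a protein sequence as one 20D sparse vector per residue.

    Assumption: residues outside the 20 standard types stay all-zero.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1
        rows.append(row)
    return rows


encoded = one_hot("MKV")
print(len(encoded), len(encoded[0]))
```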

2.5 Training and testing procedures

We split each of the training datasets into two parts: train-set and val-set (Supplementary Table S1). The train-set part (80%) is used exclusively for training, and the val-set part (20%) is used for three purposes: (i) for gaining insight into the performance of an architecture, (ii) for terminating the training when the performance does not show any improvement and (iii) for training of the ensnet architecture. The training procedure is as follows. After all six architectures are trained on a train-set, their trained models are applied on the corresponding val-set. The obtained predictions of each model become the training data for the ensnet architecture, see overview in Figure 2.
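The stacking step described above, in which the base models' val-set predictions become the training data for the ensemble, can be sketched as follows. The internal structure of ensnet is not described in this section, so a small logistic-regression combiner is used here purely as a stand-in; all function names are hypothetical.

```python
import math


def stack_predictions(per_model_predictions):
    """Turn {model_name: [p_residue0, p_residue1, ...]} into one feature
    row per residue, with models in sorted name order."""
    names = sorted(per_model_predictions)
    n_residues = len(per_model_predictions[names[0]])
    return [[per_model_predictions[m][i] for m in names] for i in range(n_residues)]


def predict_meta(weights, bias, row):
    """Combined interface probability for one residue."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_logistic(X, y, lr=0.5, epochs=2000):
    """Fit the stand-in combiner by batch gradient descent on val-set predictions."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(X, y):
            err = predict_meta(weights, bias, row) - label
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# Toy val-set predictions from two base models (PIPENN uses six).
val_preds = {"ann": [0.9, 0.8, 0.2, 0.1], "dnet": [0.7, 0.9, 0.3, 0.2]}
val_labels = [1, 1, 0, 0]
X_val = stack_predictions(val_preds)
w, b = train_meta_logistic(X_val, val_labels)
print([round(predict_meta(w, b, row), 2) for row in X_val])
```

Training the combiner on held-out predictions, rather than on train-set predictions, keeps it from simply trusting whichever base model overfits most.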

Fig. 2.

Training and testing procedure of our predictors (Section 2.5 for the explanation of the procedure)

For all our architectures and datasets, we used the same hyperparameters, software platform and infrastructure. The hyperparameters that influence the performance and convergence time of the architectures are: batch size (8), optimization algorithm (Adam), learning rate (1e–4), maximum number of epochs (300), padding constants, bias vector usage and the float size (64). We used the Keras API of TensorFlow 2.1.0 to build our architectures and trained them on a Linux machine having 32 CPUs, 2 GPUs and 256 GB memory (see Supplementary Table S2 for the run-time statistics). We evaluated the performance of our trained models on the completely independent test sets shown in Supplementary Table S3, using Accuracy (ACC), Specificity (SPEC), Precision (PREC), Sensitivity/Recall (SENS), Balanced F-score (F1), Matthews Correlation Coefficient (MCC), average precision on the PR-curve (AP) and area under the ROC curve (AUC). We calculate P-values for differences in AUC-ROC using the approach by Hanley and McNeil (1982). Except for AP and AUC, all metrics are derived from the confusion matrix. MCC and AP are insensitive to class imbalance (on average about 11% of residues are interface, see Supplementary Table S1). All results shown are based on the Equal method (Zhang and Kurgan, 2018), by which a cutoff point is selected where FP and FN are equal. Hence, the values of PREC, SENS and F1 are always the same. Output (predictions) of the models is unpadded before applying the metrics. Performance plots were created using the Plotly package.
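The Equal method can be illustrated with a short sketch: FP equals FN exactly when the number of predicted positives equals the number of true positives, so the cutoff can be found by labeling the top-scoring residues as interface. The handling of tied scores (by rank) and the function name are assumptions of this illustration.

```python
import math


def equal_cutoff_metrics(y_true, scores):
    """Label the n_pos highest-scoring residues as positive, so that
    FP == FN, and compute confusion-matrix metrics.

    Assumption: ties in score are broken by rank order.
    """
    n_pos = sum(y_true)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = [0] * len(scores)
    for i in order[:n_pos]:
        predicted[i] = 1

    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 0)

    prec = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "ACC": (tp + tn) / len(y_true), "SPEC": spec,
            "PREC": prec, "SENS": sens, "F1": f1, "MCC": mcc}


m = equal_cutoff_metrics([1, 0, 1, 0, 0], [0.9, 0.8, 0.3, 0.2, 0.1])
print(m["FP"], m["FN"], m["PREC"], m["SENS"], m["F1"])
```

Because FP and FN coincide, precision and sensitivity share the same numerator and denominator, which is why PREC, SENS and F1 are identical in the tables.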

2.6 Feature importance

For scoring and ranking the importance of features, we used the KernelExplainer from the SHAP package (Lundberg and Lee, 2017). For each sample in the test set, the contribution of each feature to the predicted outcome of a model is estimated; this contribution is called a SHAP value.

3 Results

3.1 BioDL is sufficiently large for DL

DL architectures learn best when trained on a large dataset. We constructed the BioDL dataset based on BioLip and PDB, for training and testing of our various architectures. In total, over 6800 proteins are included, containing over 2 million residues of which 10.7% are interface residues; see Supplementary Table S1 for details. Runtimes for prediction are dominated by the feature generation, notably running PSI-BLAST to generate the PSSMs, which takes up to 5 min per input protein. Within BioDL we have collected annotations of three specific types of protein interaction: PPI in BioDL_P, small molecule ligands in BioDL_S and nucleic acid interaction in BioDL_N. A fourth set annotates any of these three interactions: BioDL_A. For better comparison with our previous work using Random Forest models, we also include the HHC dataset from Hou et al. (2019). Each of these datasets is further split into training (_TR) and test (_TE) sets as shown in Figure 2; see Supplementary Table S1 for an overview and statistics of training sets. For HHC_TE, we also report separately for homomeric (Homo_TE) and heteromeric (Hetero_TE) PPI. Training duration can be found in Supplementary Table S2. As a further independent test set, we also used ZK448_TE from Zhang and Kurgan (2018), which also allows comparison with their benchmark results. The ZK448_TE, like BioDL, contains protein (ZK448_P_TE), small molecule (ZK448_S_TE) and nucleotide interaction annotations (ZK448_N_TE), as well as all combined (ZK448_A_TE). See Supplementary Table S3 for an overview and statistics of test sets. The overlap between the BioDL training sets is shown in a Venn diagram in Supplementary Figure S2; out of 6832 proteins in the training set there are only 85 with annotations for all three interaction types. This pattern is similar in the BioDL and ZK448 test sets (Supplementary Fig. S3). We make distinctions between the models that were trained on the BioDL_A_TR dataset containing all interaction types, i.e. generic, and those trained on type-specific interaction data, i.e. BioDL_P_TR for PPI, BioDL_S_TR for small molecules and BioDL_N_TR for nucleic acids. We suffix those models correspondingly: _a, _p, _s and _n. Models trained on the PPI-specific HHC dataset are suffixed with _hhc.

3.2 Building block composition matters

We tested our predictors with different compositions of the architectural building blocks introduced in Section 2.4. These were trained on the smaller HHC PPI dataset for efficiency. Table 1 shows the metrics for different choices of building blocks for the dnet_hhc predictor. Other architectures and datasets show very similar trends (data not shown). dnet_hhc (the first row) corresponds to the composition with the highest performance in AUC: HeUniform kernel initialization, CrossEntropy loss function, 1D spatial form, PReLU activation function, Padding for unifying the input shape, One-Hot amino acid encoding and Dropout and BatchNormalization for regularization. Subsequent rows show the effect of substituting (→), adding (+) or removing (–) blocks. GlorotNormal kernel initialization yields the highest AP; the MeanSquaredError loss function yields the highest accuracy (ACC), F1 and MCC. It is worth mentioning that the best composition in dnet_hhc does not use MaxPooling. All other variations yield lower performances, and the impact of omitting Dropout, Padding and BatchNormalization is the strongest.

Table 1.

Impact of different architectural building blocks on the performance of the dnet_hhc PPI predictor trained on HHC_TR and tested on HHC_TE

Model	ACC	SPEC	F1	MCC	AP	AUC	ΔAUC
dnet_hhc	0.784	0.868	0.403	0.272	0.381	0.733	0
hu → gn	0.783	0.868	0.401	0.269	0.398	0.730	–0.003
ce → mse	0.785	0.868	0.404	0.273	0.391	0.728	–0.005
1d → 2d	0.781	0.866	0.394	0.261	0.379	0.723	–0.010^*
pre → rel	0.780	0.866	0.392	0.258	0.390	0.720	–0.013^*
+ mp	0.784	0.868	0.403	0.272	0.387	0.718	–0.015^*
oh → pv	0.774	0.862	0.373	0.235	0.358	0.714	–0.019^**
− bn	0.770	0.860	0.364	0.224	0.327	0.696	–0.037^**
− pa	0.750	0.848	0.309	0.157	0.267	0.661	–0.072^**
− do	0.752	0.849	0.314	0.163	0.291	0.646	–0.087^**

Note: ‘hu → gn’: kernel initialization GlorotNormal instead of HeUniform used; ‘ce → mse’: loss function MeanSquaredError instead of CrossEntropy used; ‘1d → 2d’: spatial form 2D instead of 1D used; ‘pre → rel’: activation function RELU instead of PRELU used; ‘+ mp’: MaxPooling layer used; ‘oh → pv’: ProtVec encoding instead of One-Hot used; ‘− bn’: no BatchNormalization layer used; ‘− pa’: no padding used; ‘− do’: no Dropout layer used. ΔAUC is the difference in AUC with respect to dnet_hhc. Highest score per metric indicated in bold.

^* P < 0.05.

^** P < 0.0005.


3.3 Ensembling improves performance

We compared the performance of our ensemble predictor with those of our other DL predictors. Table 2 shows that ensnet_p improves the AUC by 0.016 (P < 0.0006) with respect to the best scoring other DL predictor (dnet_p) on the PPI data, and it achieves the highest sensitivity (TPR) at any false positive rate (FPR) in the ROC plot (Fig. 3A). Moreover, the P/R plot in Figure 3B shows that rnet_p obtains the highest precision at low recall values but ensnet_p achieves the highest precision at high recall values. To further investigate the reason behind this improvement, we explored the relation between the accuracy of ensnet_p and the other predictors in scatter plots of MCC scores of ensnet_p versus the average and standard deviation of MCC scores of our six other predictors. Our analyses suggest that ensnet especially improves predictions for individual proteins where the average performance of the other models was already relatively high (Supplementary Fig. S12).
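The significance of an AUC difference can be estimated from the Hanley and McNeil (1982) standard error, as mentioned in Section 2.5. The sketch below implements that standard-error formula and a two-sided z-test; treating the two AUCs as independent is a simplifying assumption of this sketch, and the residue counts in the example are made up, so it is not expected to reproduce the reported P-values.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC estimate (Hanley and McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)


def auc_difference_p_value(auc_a, auc_b, n_pos, n_neg):
    """Two-sided P-value for auc_a - auc_b.

    Assumption: the two AUC estimates are treated as independent.
    """
    se = math.sqrt(hanley_mcneil_se(auc_a, n_pos, n_neg) ** 2
                   + hanley_mcneil_se(auc_b, n_pos, n_neg) ** 2)
    z = (auc_a - auc_b) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical counts of interface and non-interface residues in a test set.
print(auc_difference_p_value(0.755, 0.739, n_pos=5000, n_neg=45000))
```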

Table 2.

Performance of the ensemble PPI predictor ensnet_p compared with all other predictors trained on BioDL_P_TR and tested on BioDL_P_TE

Model	ACC	SPEC	F1	MCC	AP	AUC
ensnet_p	0.840	0.909	0.339	0.249	0.302	0.755
dnet_p	0.834	0.905	0.312	0.218	0.276	0.739
rnn_p	0.833	0.905	0.310	0.215	0.276	0.736
rnet_p	0.833	0.905	0.309	0.215	0.279	0.735
cnet_p	0.832	0.904	0.303	0.208	0.273	0.733
ann_p	0.833	0.905	0.309	0.214	0.270	0.729
unet_p	0.829	0.903	0.292	0.196	0.253	0.717

Note: Highest score per metric indicated in bold; AUC differences >0.10 are P < 0.05.

Fig. 3.

(A) ROC and (B) P/R plots of all six architecture models and the ensemble models, trained on BioDL_P_TR and tested on BioDL_P_TE PPI data. The ensnet_p clearly outperforms the six architecture models in the ROC plot, and in the P/R plot only rnet_p and rnn_p yield somewhat higher precision (∼0.6) at very low recall (0.01–0.02)

3.4 Type-specific predictions are more accurate

In Table 3, we compare the generic ensnet_a with type-specific ensnet_p, ensnet_s and ensnet_n models, each trained on their corresponding BioDL training sets. When tested on the corresponding interaction-specific datasets BioDL_P_TE, BioDL_S_TE and BioDL_N_TE, the specific models consistently obtain performances higher than the interaction-generic ensnet_a, as one may expect.

Table 3.

Performance of ensnet_a, trained on the generic BioDL_A_TR dataset, compared with the ensnet p, s and n models trained on type-specific datasets containing protein, small molecule or nucleotide interaction interfaces and scored performance on the interaction-specific test sets as indicated

Model	Test set	ACC	SPEC	F1	MCC	AP	AUC
ensnet_a	BioDL_P_TE	0.828	0.902	0.289	0.192	0.248	0.733
ensnet_p		0.840	0.909	0.339	0.249	0.302	0.755
ensnet_a	BioDL_S_TE	0.937	0.967	0.339	0.306	0.289	0.826
ensnet_s		0.944	0.970	0.413	0.384	0.388	0.864
ensnet_a	BioDL_N_TE	0.901	0.947	0.272	0.219	0.238	0.835
ensnet_n		0.921	0.957	0.418	0.376	0.399	0.894

Note: Highest AUC per metric per test set indicated in bold (P < 1e−6).

3.5 Network architecture matters

We compared the performance of all seven of our models (six separate architectures and an ensemble) trained on HHC_TR and the four BioDL_*_TR, yielding 35 trained predictors. Each was applied to its corresponding test sets: three for HHC_TR-trained models (Homo_TE, Hetero_TE and combined HHC_TE), and two each for the BioDL_*_TR (BioDL_*_TE and ZK448_*_TE). See Supplementary Tables S4, S5, S6, S7 and S8, for HHC, BioDL_A, _P, _S and _N, respectively, for details. The ensemble ensnet predictors perform best on all test sets, as we already saw for the BioDL_P_TR models on BioDL_P_TE in Table 2. We further compared our ensnet_hhc, ensnet_a, ensnet_s and ensnet_n predictors with other published and available state-of-the-art sequence-based predictors on the same datasets. The published predictors use various methods and architectures including Random Forest, Logistic Regression, Support Vector Machine and Neural Networks. Table 4 shows comparisons with SeRenDIP and other PPI predictors benchmarked by Zhang and Kurgan (2019) on the PPI datasets, and SCRIBER and DRNApred on the ZK448 small molecule and nucleotide (DNA/RNA) datasets, respectively. All predictors mentioned here in Table 4 use similar input features, such as protein length, ASA, RSA, PSSM and secondary structure predicted from sequence.

Table 4.

Performance comparison of our ensnet models and other state-of-the-art sequence-based interaction prediction methods on applicable test sets

Model	Test set	ACC	SPEC	F1	MCC	AP	AUC
Protein–protein interaction (PPI)
ensnet_hhc	Homo_TE	0.767	0.849	0.485	0.335	0.491	0.769 ^**
SeRenDIP					0.277		0.724
ensnet_hhc	Hetero_TE	0.849	0.916	0.197	0.114	0.155	0.661 ^*
SeRenDIP					0.122		0.636
ensnet_a	ZK448_A_TE	0.785	0.870	0.385	0.254	0.357	0.729 ^**
SCRIBER^a		n.a.	0.896	0.333	0.230	0.287	0.715
SSWRF^a		n.a.	0.891	0.287	0.178	0.256	0.687
CRFPPI^a		n.a.	0.887	0.266	0.154	0.238	0.681
LORIS^a		n.a.	0.887	0.263	0.151	0.228	0.656
SPRINGS^a		n.a.	0.882	0.229	0.111	0.201	0.625
PSIVER^a		n.a.	0.874	0.191	0.066	0.170	0.581
SPRINT^a		n.a.	0.873	0.183	0.057	0.167	0.570
SPPIDER^a		n.a.	0.870	0.198	0.071	0.159	0.517
Protein–small molecule interaction
ensnet_s	ZK448_S_TE	0.899	0.945	0.419	0.364	0.409	0.849 ^**
SCRIBER		0.874	0.931	0.278	0.209	0.259	0.706
Protein–DNA/RNA interaction
ensnet_n	ZK448_N_TE	0.871	0.927	0.469	0.396	0.460	0.823 ^**
DRNApred (DNA)		0.830	0.903	0.294	0.198	0.240	0.609
DRNApred (RNA)		0.814	0.894	0.230	0.124	0.248	0.547

Note: Highest scores per metric per test set indicated in bold; confidence for difference in AUC-ROC with runner-up.

^a Metrics according to Zhang and Kurgan (2019) on their ZK448 test set.

^* P < 0.05.

^** P < 0.005.

For comparing ensnet_hhc with SeRenDIP, we used exactly the same test sets as used by SeRenDIP. For comparing ensnet_a with the predictors as published in Zhang and Kurgan (2019), we followed their testing approach exactly: we calculated the average of each metric over 10 subsets of randomly selected 50% of ZK448_A_TE, and we treated only PPIs as interactions to be predicted and all other interaction types as non-interacting. For comparing ensnet_s with SCRIBER, we randomly selected 10 proteins (UniProt IDs in Supplementary Table S12) from ZK448_S_TE and calculated the comparison metrics based on the protein–small molecule binding propensities returned by their webserver. For comparing ensnet_n with DRNApred, we used all proteins in ZK448_N_TE (38 proteins) and calculated the comparison metrics based on the nucleotide interaction propensities returned by their webserver. The cutoff points were selected such that the numbers of false positives (FPs) and false negatives (FNs) are equal; this affects ACC, SPEC, F1 and MCC. As can be seen from Table 4, our ensnet_hhc, ensnet_a, ensnet_s and ensnet_n predictors perform better than the corresponding state-of-the-art methods, on virtually all considered metrics.
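The subset-averaging protocol used for the ZK448_A_TE comparison can be sketched as follows. The sketch assumes sampling without replacement within each subset and a fixed random seed; the function names and the label representation (a set of interaction type codes per residue) are hypothetical.

```python
import random


def ppi_only_labels(residue_types):
    """Relabel residues so that only PPI ('P') counts as interacting;
    every other interaction type is treated as non-interacting."""
    return [1 if "P" in types else 0 for types in residue_types]


def mean_metric_over_subsets(proteins, metric_fn, n_subsets=10, fraction=0.5, seed=0):
    """Average `metric_fn` over random subsets of the test proteins.

    Assumptions: each subset is drawn without replacement, and the seed
    is fixed only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    k = max(1, int(len(proteins) * fraction))
    values = [metric_fn(rng.sample(proteins, k)) for _ in range(n_subsets)]
    return sum(values) / len(values)


# Toy usage: 448 protein identifiers and a placeholder metric (subset size).
proteins = list(range(448))
print(mean_metric_over_subsets(proteins, len))
print(ppi_only_labels([{"P"}, {"S"}, set(), {"P", "N"}]))
```

In practice `metric_fn` would compute a metric such as AUC from the predictions for the proteins in the subset.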

3.6 Feature importance

Figure 4 shows the 15 most important features, as measured by SHAP, for one of our models (ann_p), estimated from 5000 randomly selected amino acids from ZK448_P. As seen previously (Hou et al., 2021), protein sequence length is by far the most important feature. High values of length (red dots; residues of longer proteins) generally have lower SHAP values, i.e. a lower likelihood of a residue being predicted to be part of an interface. The high importance of secondary structure, particularly coil (WP_PC) and α-helix (WP_PA), is also consistent with our previous work. A high probability of coil across the windows (red WM_PC) also has a notable impact on the predictions, as can also be seen from the correlations in Supplementary Figure S14.

Fig. 4.

The 15 most important features, ranked by SHAP value, for 5000 randomly selected amino acids in ZK448_P. A SHAP value indicates the contribution of a feature to a residue’s interface prediction. Colors represent the input values of a feature: blue for low and red for high values. The width of the distribution of a feature’s SHAP values shows its relative importance across the sampled residues. For aggregated features only the sum is shown, e.g. WM_PC is the sum of 3_wm_PC, 5_wm_PC, 7_wm_PC and 9_wm_PC.
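The aggregation described in the caption can be sketched as follows. This is an illustration only, not the authors' code: we assume that the sum is taken over the per-residue SHAP values of the window features, that features are ranked by mean absolute SHAP value (the common default for SHAP summary plots), and that window features follow the `<window>_wm_<name>` pattern seen in the caption. The function names and the feature name `length` are hypothetical.

```python
import re
from collections import defaultdict

import numpy as np

# Matches window-mean features such as "3_wm_PC" or "9_wm_PC".
WINDOW_FEATURE = re.compile(r"^\d+_wm_(.+)$")


def group_name(feature):
    """Map "3_wm_PC", "5_wm_PC", ... to "WM_PC"; leave other names unchanged."""
    match = WINDOW_FEATURE.match(feature)
    return f"WM_{match.group(1)}" if match else feature


def aggregate_shap(shap_values, feature_names):
    """Sum SHAP columns that belong to the same aggregated feature.
    shap_values has shape (n_residues, n_features)."""
    shap_values = np.asarray(shap_values, dtype=float)
    groups = defaultdict(list)
    for column, name in enumerate(feature_names):
        groups[group_name(name)].append(column)
    summed = np.column_stack(
        [shap_values[:, columns].sum(axis=1) for columns in groups.values()]
    )
    return summed, list(groups)


def rank_features(shap_values, feature_names, top_k=15):
    """Rank aggregated features by mean absolute SHAP value (assumed criterion)."""
    summed, names = aggregate_shap(shap_values, feature_names)
    importance = np.abs(summed).mean(axis=0)
    order = np.argsort(-importance, kind="stable")[:top_k]
    return [(names[i], float(importance[i])) for i in order]
```

Because SHAP values are additive, summing the columns of related features gives the joint contribution of the group to each residue's prediction, which is why a single WM_PC row can stand in for the four window sizes.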

4 Discussion and conclusion

This work presents an in-depth and systematic comparison of multiple DL architectures for sequence-based prediction of protein interface residues. We include a series of neural nets, PIPENN, whose ensemble method performs well on generic and type-specific interface prediction tasks, including PPI, small molecule and nucleotide (DNA/RNA) interface prediction from sequence. We explored multiple combinations of DL architecture building blocks, such as spatial forms, encoding schemes, network initializations, loss and activation functions and regularization mechanisms. Selected combinations resulted in six models and an ensemble, which we trained on existing and newly constructed training datasets. Performance was benchmarked on several independent test sets, facilitating fair comparison. Comparing the performance of our models with that of other published and available state-of-the-art sequence-based predictors on the same test sets shows that our ensemble predictors obtain the most accurate predictions on all interface types.

It is worth noting that the different prediction tasks are not equally difficult. We reproduce an earlier observation that homomeric PPI interface residues can be predicted better than heteromeric interface residues, as noted by Hou et al. (2019). Moreover, all architectures predict protein–small molecule and protein–nucleotide interface residues more accurately than protein–protein interface residues. This might be explained by differences in the size, specificity and structural heterogeneity of the interfaces involved in these respective types of interaction: nucleotide interfaces are generally much smaller than PPI interfaces, affecting their variety and relative sequence locality; small molecule interactions typically require very specific chemical properties, leading to equally specific interface compositions; and the structural similarity between two random strands of DNA is much larger than that between two random proteins, so it stands to reason that the similarity between (and consequently, the predictability of) two protein–DNA interfaces is also greater than the similarity between two protein–protein interfaces.

One issue that likely affects our observations, and those of related studies as well, is that we necessarily operate under a ‘closed world’ assumption: not all interfaces are known, yet we assume that residues which were never observed to be part of an interface truly are not part of one. Consequently, future experimental data are likely to reveal that some of the residues we label as negative (not interacting) should in fact have been labeled positive.

Recent advances in protein structure prediction mean that structures are now available for an increasing number of proteins (e.g. Su et al., 2021; Tunyasuvunakool et al., 2021), which opens up new types of features to be included in DL methods for interface prediction (e.g. Dai and Bailey-Kellogg, 2021; Xie and Xu, 2021). These are exciting developments; for example, Dai and Bailey-Kellogg (2021) report an AUC-ROC and F1 of around 0.6 for their ‘Unbound PInet (Aug 50)’ on PPI interface prediction on the DBD3 and DBD5 test sets. Some of the underlying methodology for structure prediction may also be directly applied to interface contact prediction, as recently reviewed by Cui et al. (2021).

In summary, we contribute the following: (i) BioDL, a dataset of protein sequences with residue-level and type-specific interface annotation, of sufficient size to perform DL; (ii) a systematic characterization of different combinations of architectural building blocks and their impact on the predictive performance of the resulting neural nets; (iii) the PIPENN suite of neural net models, whose ensemble method outperforms state-of-the-art sequence-based models in predicting various types of protein interface. This conclusively demonstrates that DL can contribute much to current efforts in computational protein interface prediction from sequence. We provide a public repository containing source code, datasets and pretrained models: https://github.com/ibivu/pipenn/.

30 in total

1. Sequence representation approaches for sequence-based protein prediction tasks that use deep learning.

Authors: Feifei Cui; Zilong Zhang; Quan Zou
Journal: Brief Funct Genomics Date: 2021-03-02 Impact factor: 4.241

Review 2. Review and comparative assessment of sequence-based predictors of protein-binding residues.

Authors: Jian Zhang; Lukasz Kurgan
Journal: Brief Bioinform Date: 2018-09-28 Impact factor: 11.622

3. The meaning and use of the area under a receiver operating characteristic (ROC) curve.

Authors: J A Hanley; B J McNeil
Journal: Radiology Date: 1982-04 Impact factor: 11.105

4. SeRenDIP-CE: Sequence-based Interface Prediction for Conformational Epitopes.

Authors: Qingzhen Hou; Bas Stringer; Katharina Waury; Henriette Capel; Reza Haydarlou; Fuzhong Xue; Sanne Abeln; Jaap Heringa; K Anton Feenstra
Journal: Bioinformatics Date: 2021-05-11 Impact factor: 6.937

5. SmoPSI: Analysis and Prediction of Small Molecule Binding Sites Based on Protein Sequence Information.

Authors: Wei Wang; Keliang Li; Hehe Lv; Hongjun Zhang; Shixun Wang; Junwei Huang
Journal: Comput Math Methods Med Date: 2019-11-13 Impact factor: 2.238

6. Improved Protein Structure Prediction Using a New Multi-Scale Network and Homologous Templates.

Authors: Hong Su; Wenkai Wang; Zongyang Du; Zhenling Peng; Shang-Hua Gao; Ming-Ming Cheng; Jianyi Yang
Journal: Adv Sci (Weinh) Date: 2021-10-31 Impact factor: 16.806

7. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics.

Authors: Ehsaneddin Asgari; Mohammad R K Mofrad
Journal: PLoS One Date: 2015-11-10 Impact factor: 3.240

8. Sequential regulatory activity prediction across chromosomes with convolutional neural networks.

Authors: David R Kelley; Yakir A Reshef; Maxwell Bileschi; David Belanger; Cory Y McLean; Jasper Snoek
Journal: Genome Res Date: 2018-03-27 Impact factor: 9.043

9. Pfam: The protein families database in 2021.

Authors: Jaina Mistry; Sara Chuguransky; Lowri Williams; Matloob Qureshi; Gustavo A Salazar; Erik L L Sonnhammer; Silvio C E Tosatto; Lisanna Paladin; Shriya Raj; Lorna J Richardson; Robert D Finn; Alex Bateman
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

10. Highly accurate protein structure prediction for the human proteome.

Authors: John Jumper; Demis Hassabis; Kathryn Tunyasuvunakool; Jonas Adler; Zachary Wu; Tim Green; Michal Zielinski; Augustin Žídek; Alex Bridgland; Andrew Cowie; Clemens Meyer; Agata Laydon; Sameer Velankar; Gerard J Kleywegt; Alex Bateman; Richard Evans; Alexander Pritzel; Michael Figurnov; Olaf Ronneberger; Russ Bates; Simon A A Kohl; Anna Potapenko; Andrew J Ballard; Bernardino Romera-Paredes; Stanislav Nikolov; Rishub Jain; Ellen Clancy; David Reiman; Stig Petersen; Andrew W Senior; Koray Kavukcuoglu; Ewan Birney; Pushmeet Kohli
Journal: Nature Date: 2021-07-22 Impact factor: 69.504
