Bhawna Mewara, Soniya Lalwani.
Abstract
Protein-protein interactions (PPIs) are prominent in systems biology and in diverse biological processes, because they play a fundamental part in predicting the function of a target protein and the druggability of molecules. Numerous studies have been published on predicting PPIs computationally, because such methods provide an alternative to laboratory trials and a cost-effective way of predicting the most likely set of interactions at the scale of an entire proteome. Among recent computational methods, deep learning has drawn attention across many scientific studies. This paper presents, for the first time, a comprehensive survey of sequence-based PPI prediction by three popular deep learning architectures: deep neural networks, convolutional neural networks, and recurrent neural networks and their variants. The survey gathers the available information in detail and can help researchers further explore this area.
Keywords: Deep learning; Deep networks; Long short term memory; Protein–protein interactions; Recurrent neural network
Year: 2022 PMID: 35611239 PMCID: PMC9119573 DOI: 10.1007/s42979-022-01197-8
Source DB: PubMed Journal: SN Comput Sci ISSN: 2661-8907
Fig. 1 Structure of amino acid
Fig. 2 Formation of peptide bond
Fig. 3 Basic structure of DNNs with input units I, three hidden units h1, h2 and h3 in each layer, and output units O. At each layer, the weighted sum and non-linear function of its inputs are computed to obtain an abstract representation
Fig. 4 Basic structure of RNNs with an input unit I, a hidden unit h and an output unit O. The recurrent computation can be expressed more explicitly if the RNNs are unrolled in time. The index of each symbol represents the time step. In this way, h_t receives input from I_t and h_(t-1) and then propagates the computed results to O_t and h_(t+1)
Fig. 5 Basic structure of BRNNs unrolled in time. For each time step, there are two hidden layers. The information from both hidden units is propagated to O
Fig. 6 The baseline structure of CNN
Fig. 7 Publication analysis of PPI prediction approaches using DNs
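The computations described in the captions of Figs. 3 and 4 can be sketched in a few lines. This is a minimal illustration, not the implementation of any surveyed paper: the function names, the choice of `tanh` as the non-linearity, the zero initial hidden state and all weight values are assumptions made for the example.

```python
import math

def dense_layer(inputs, weights, biases):
    """One DNN layer: weighted sum of the inputs, then a non-linearity.

    tanh is an assumed activation; the surveyed models mostly report ReLU.
    """
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def rnn_unrolled(sequence, w_in, w_rec, w_out):
    """Scalar RNN unrolled in time.

    At each step the hidden state h_t is computed from the current input
    and the previous hidden state h_(t-1), then passed to the output and
    to the next time step.
    """
    h = 0.0  # initial hidden state, assumed to be zero
    outputs = []
    for x_t in sequence:
        h = math.tanh(w_in * x_t + w_rec * h)
        outputs.append(w_out * h)
    return outputs

# Hypothetical weights: two inputs feeding three hidden units (as in Fig. 3).
hidden = dense_layer(
    [1.0, 0.5],
    [[0.2, -0.4], [0.7, 0.1], [0.0, 0.3]],
    [0.0, 0.1, -0.1],
)

# Hypothetical three-step input sequence for the recurrent model (Fig. 4).
outputs = rnn_unrolled([1.0, 0.0, -1.0], w_in=0.5, w_rec=0.8, w_out=1.0)
```

In the recurrent example the second input is zero, yet the second output is not, because the hidden state carries information forward from the first time step.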
Publication analysis of DN approaches in prediction of sequence-based PPIs
| Year [References] | Research objective | Approach | Considered dataset/corpora | Hyperparameters | Highest reported accuracy (in %) |
|---|---|---|---|---|---|
| 2017 [ | Showed the effectiveness of PPI prediction by applying a DL algorithm for the very first time (as per the article) | AC and CT + SAE; AC chosen for the final model design. Evaluation: tenfold CV | SSD: | AC model with 400 neurons and CT model with 700 neurons, each with one hidden layer | 97.19 ( |
| 2017 [ | Aimed to improve PPI prediction performance by effectively learning the representations of proteins from common protein descriptors | Five descriptors: AAC; Dipeptide Composition; Composition, Transition and Distribution; Quasi-Sequence-Order Descriptors; Amphiphilic Pseudo-Amino Acid Composition. Used two separate DNN models (differing in input structure). Evaluation: fivefold CV | SSD: | LR: 0.01; batch size: 64; momentum rate: 0.9; adaptive LR parameter: SGD; AF: ReLU; dropout rate: 0.2 | 98.14 ( |
| 2019 [ | Considered the order relationship of the entire amino acid sequence and time-constraint issues | Proposed a sequence-based method built on a novel matrix of sequence (MOS) representation, combined with a DNN to predict PPIs | | Number of hidden layer nodes: 64; AF: ReLU; Adam optimizer; batch size: 128; dropout: 0; LR: 0.01 | 94.34 |
| 2019 [ | Accurately predict PPIs based on the properties of AAs in protein primary sequences | Conjoint AAindex modules descriptor (CAM); ensemble of hybrid deep neural networks (fully connected recurrent neural network). Evaluation: tenfold CV | | Input features: 343*6; LR: 1.4792765037709115E-5; L2 regularization: 1.978894473010271E-5; ADAM; 6 layers; dropout rate: 0.9 | 94.72 ( |
| 2019 [ | Discover effective features, underlying patterns and inherent mappings | Used two separate DNN modules to squeeze out latent features from two embedding vectors obtained by the Res2vec method of representation learning. Evaluation: fivefold CV | SSD: | Residue dim: 20; window size: 4; protein length: 850; network depth: 4 | 98.71 ( |
| 2019 [ | Predict PPIs based on different representations of amino acid sequences | AC, LD, MCD; 9 independent DNNs with different parameter settings, then a final two-layered NN for the prediction. Evaluation: fivefold CV | SSD: | Different parameters according to the considered features; ADAM optimizer and ReLU AF | 95.29 ( |
| 2019 [ | PPI prediction that addresses long operational time, large costs and low prediction accuracy | ProtVec and protein signature methods; LSTM architecture | VCP protein data of ‘ BioGRID | Input seq length: 400; 4 1D convolutional layers, 4 pooling layers with average pooling, 1 LSTM layer and 1 fully connected layer with Softmax with 1024 neurons | 92 ( |
| 2019 [ | Extract deeply hidden protein features and remove redundant information to enhance the prediction results | Used CNN for feature extraction and feature selective rotation forest (FSRF) for noise elimination | SSD: | N/A | 97.75 ( |
| 2020 [ | Optimize the predictive performance of PPI | AC, CT, LD; built a DNN model and used the dropout method | SSD: | LR: 0.001; batch size: 128; AF: ReLU; Adam optimizer; cost function: cross-entropy; dropout: multiple | 98.60 ( |
| 2020 [ | Improve the variety of the features considered in the prediction; insure the training time caused by the extra training, either a backward one or a forward one | Physicochemical properties, then DWT and CWT with a 25-scale mexh wavelet function; “Y-type” NN model comprising a weight-sharing Bi-RNN layer, a buffer layer and a dense layer. Evaluation: fivefold CV | SSD: | Batch size: 128; LR: 0.05 | 99.57 ( |
| 2020 [ | Employed multimodal information that integrates sequence-based and 3D structural information of proteins to improve prediction capability | AC and CT for sequence-based information; ResNet50 for structure-based information; LSTM classifier. Evaluation: threefold CV | | N/A | 97.20 ( |
| 2020 [ | Considered the issues of gradient-descent learning to the optimum global solution with increasing network size | CT; SAE and ML-ELM DNN for prediction. Evaluation: fivefold CV | Indonesian Herbal Medicine-Herbs Analytics, STRING-DB | N/A | 89 (Herbs with SAE) |
| 2021 [ | Hybrid method is an effective tool to accurately predict potential protein interactions | AAC, CT, LD; the DNN extracts the hidden information through a layer-wise abstraction from the raw features, which are then passed through the XGB classifier. Evaluation: fivefold CV | Four standard intraspecies ( ( Human host with | For DNN: LR: 0.01; batch size: 64; momentum rate: 0.9; AF: ReLU; dropout rate: 0.2; loss function: binary_crossentropy | 98.35 ( |
| 2021 [ | Identify protein functions faster | Used an AVL tree for numerical representation; 3-layered BiRNN with ReLU AF, followed by flatten, batch normalization and dropout; then two FC layers. Evaluation: tenfold CV | From NCBI and BioGrid database | Dropout: 0.25; first FC layer: 512 neurons; second FC layer: prediction with sigmoid function; SGD optimizer; LR: 0.0001; momentum: 0.9; binary cross-entropy for model loss; 500 epochs | Multiple |
| 2021 [ | The impact of integrating multiple features that are used in the prediction of PPIs either alone or integrated with some other features | 43 features generated by three different methods: 22 evolutionary features based on generation of a PSSM using the PSI-Blast algorithm, 17 structural features generated via a DL model (SPIDER2), 7 features generated from commonly used physicochemical properties; SAE used as classifier. Evaluation: tenfold CV | | Input: 92; units: 75; AF: sigmoid; LR: 1; momentum: 0.5 | 83.55 ( |
| 2021 [ | Address the problem of protein–protein interaction by employing AEs solely | AC, CT; an ensemble of two AEs; three types of NN architectures were used: Joint-Joint, Siamese-Joint and Siamese-Siamese. Evaluation: tenfold CV | SSD: | Two-layered encoder having 600 neurons linked to a bottleneck layer of 300 neurons, followed by a symmetric decoder; SELU AF; ADAM optimizer; initial LR: 0.0005; batch size: 64; 2000 epochs | 97.9 ( |
| 2021 [ | Improving prediction performance | A new feature extraction method called MSR, based on the spectral radius and BLOSUM62 matrix; AC was applied to supplement and extract effective sequence information; GRNN used as the classifier. Evaluation: tenfold CV | SSD: Two significant PPI networks (CD9 and Wnt) | N/A | 99.97 ( |
| 2021 [ | Proposed a data encoding method, Sequence-Statistic-Content (SSC), for feature extraction to enhance precision by providing more features with extra information | Protein sequence encoded in a three-channel format using statistical information; a 2D CNN is then used with SSC-encoded features for the prediction task | | 4 conv. layers; dropout: 0.25; AF: Leaky ReLU | 78.40 ( |
| 2018 [ | Investigated the capability of auto feature engineering | Embedding layer + 3-layered CNNs, LSTM + FC. Evaluation: fivefold CV | SSD: | Input length: 1200; batch size: 128; 3 conv. layers with filter length 10 and ReLU; 3 max-pooling layers; LSTM layer; Adam optimizer | 98.78 ( |
| 2018 [ | The features are learned through an optimization process, leveraging the increasing amount of available PPI data | Tokenization, embedding layer, a recurrent layer (with GRU units) and a fully connected layer for each of two branches; batch normalization; dropout layer. Evaluation: fivefold CV | | Categorical cross-entropy paired with the RMSProp gradient descent optimization algorithm; input length: 1000; embedding: 512 features; RNN output dimension: 64 | 97.98 ( |
| 2018 [ | Leveraging existing high-quality experimental PPI data and evolutionary information of a protein pair under prediction | Three modules: convolutional module (convolutional layer, ReLU, batch normalization and pooling layer); random projection module (2 FC sub-networks); prediction module (element-wise multiplication to calculate the probability score). Evaluation: tenfold CV | SSD: | Multiple | 94.55 ( |
| 2019 [ | Focus on both robust local features and contextualized information, which are significant for capturing the mutual influence of protein sequences. Addresses three PPI prediction tasks: interaction prediction, estimation of binding affinity and prediction of interaction type | Deep Siamese architecture of residual RCNN: convolution layers with pooling and bidirectional residual gated recurrent units. Evaluation: fivefold CV | | N/A | 97.09 ( |
| 2019 [ | Compare two carefully designed deep learning models and show pitfalls to avoid while predicting PPIs | Compared two DL models, a FC model and a recurrent model, to show the pitfalls to avoid while predicting PPIs | | Multiple parameters. Common to both models: loss function: binary cross-entropy; Adam optimizer; LR: 0.001 | Multiple |
| 2020 [ | Efficient computation performance to accelerate PPI prediction | Embedding method to represent AA sequences; feature extraction using the proposed ResNet algorithm; the ResPPI algorithm combines five residual units, each comprising three 2D convolution layers, each followed by batch normalization, then a mapping function and ReLU; FC layer with a softmax function for binary classification. Evaluation: fivefold CV | | LR: 0.001; batch size: 32; input length: 128; 2 conv. layers: 32; 2 conv. layers: 64; ReLU; softmax for prediction; binary cross-entropy for loss minimization | 96.69 |
| 2021 [ | Generalizes better to new species and is robust to limitations in training data size. Checks for compatible residues responsible for interaction in two proteins | Protein embedding projection module; contact module; interaction module. Evaluation: fivefold CV | SSD: | Projection dimension: 100; hidden dimension: 50; convolutional filter width: 2; local max-pooling width: 9; weights initialized using PyTorch defaults; batch size: 25; Adam optimizer with LR of 0.001; all models trained for 10 epochs | Multiple |
| 2022 [ | Used mask multi-scale CNN to contribute to prediction enhancement by providing additional insights into each input neuron | Numerous convolution filters arranged in parallel to extract deeper and refined protein features from the profiles; also employed single-protein class and masking operation | SSD: | LR: 0.001; AMSGrad optimizer | 98.12 ( |
| 2017 [ | Identify PPIs in biological literature / Incorporate linguistic and semantic information | Embedding layer, Bi-RNN, FC. Evaluation: tenfold CV and cross-corpus (CC) | | Embedding: 200; LSTM: 400; RMSProp optimizer; dropout rate: 0.5 | P: 87; R: 87.4; F-1: 87.2 ( |
| 2018 [ | Efficient information extraction from the large collection of biomedical texts for PPI identification | A Shortest Dependency Path (SDP) was created to interpret more relevant information using a bidirectional LSTM (Bi-LSTM); Part-of-Speech (POS) and position features were also explored. Evaluation: tenfold CV | | Number of LSTM units: 64; dropout rate: 0.3; activation function: sigmoid; optimization algorithm: Adam; epochs: 130; size of MLP layer output: 30 | P: 91.1; R: 82.2; F-1: 86.45 ( |
| 2019 [ | Introducing an attention mechanism to pay more attention to the most influential segments of texts for a relationship category | Underlying architecture is same as 6 with minor changes: includes an attention layer and uses a stacking strategy in the Bi-LSTM unit | | Number of LSTM units: 64; dropout ratio: 0.3; activation function: sigmoid; optimization algorithm: Adam; epochs (AiMed & BioInfer): 115; epochs (HPRD50, IEPA & LLL): 50; size of MLP layer output: 30; no. of LSTM layers: 6; context vector size: 75 | P: 93.96; R: 92.63; F-1: 93.29 ( |
| 2019 [ | Identify PPI from biomedical text | Traversed the PPI-related sentences through a tree-like network topology such that each tLSTM unit gains information from its children; combined tLSTM with a structure attention mechanism. Evaluation: tenfold CV | | Number of layers: 1/2; embedding dimensions: 200; hidden dimensions: 300/400/500; batch size: 10/16/20; number of epochs: 30/40/50; dropout rate: 0.5/0.1; LR: 0.001/0.015; LR decay: 0.05; ADAM and SGD optimizers | P: 88.9; R: 89.3; F-1: 89.1 ( |
SSD: Species Specific Dataset; P: Precision; R: Recall; F-1: F-measure
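The P, R and F-1 values reported for the text-mining approaches are related through the standard definition of the F-measure as the harmonic mean of precision and recall. The sketch below only illustrates that relationship; the function name is an assumption, and the surveyed papers may average their scores differently (for example per fold or per corpus).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall.

    Inputs and output share the same unit (here, percentages).
    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall pairs taken from the table above.
reported_pairs = [(87.0, 87.4), (91.1, 82.2), (93.96, 92.63), (88.9, 89.3)]
recomputed = [f1_score(p, r) for p, r in reported_pairs]
```

Recomputing from the tabulated P and R reproduces three of the four reported F-1 values after rounding; for the pair 91.1 and 82.2 the formula gives about 86.42 rather than the reported 86.45, which may reflect averaging over folds rather than an error.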
Short names given for datasets considered by cited papers
| S. No | Dataset | Short Name | S. No | Dataset | Short Name |
|---|---|---|---|---|---|
| 1 | AiMed | | 11 | | |
| 2 | | | 12 | | |
| 3 | | | 13 | HPRD50 | |
| 4 | | | 14 | IEPA | |
| 5 | | | 15 | LLL | |
| 6 | BioInfer | | 16 | | |
| 7 | Benchmark Dataset | | 17 | | |
| 8 | C | | 18 | | |
| 9 | Drosophila melanogaster | | 19 | | |
| 10 | | | 20 | Yersinia pestis | |
Benchmark Dataset: 2010 HPRD, the 2010 HPRD NR, the DIP (Human), HIPPIE, inWeb_inbiomap
Intuition behind some popular manually crafted features used by cited papers under Strategy A
| S. No | Features | Perception behind chosen features |
|---|---|---|
| 1 | AC | A protein sequence is treated as a set of signals, which is then digitized using suitable physicochemical properties that help to examine protein features |
| 2 | CT | |
| 3 | LD | Extracts fine-grained information on protein interaction from segments of continuous and discontinuous amino acids simultaneously |
| 4 | MCD | Employs the interactions between serially remote but spatially near amino acid residues to appropriately cover the many overlapping continuous and discontinuous segments present in a sequence |
| 5 | Protein Signature | Signature generation approach that considers the amino acid sequence and its length and generates a numerical representation for each protein sequence |
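To make the AC row of the table concrete, the sketch below digitizes a sequence with one physicochemical scale and measures how values at positions a fixed lag apart co-vary. It assumes that AC stands for auto covariance, as is common in sequence-based PPI work; the function name, the `TOY_SCALE` values and the example sequence are hypothetical and chosen only for illustration. Real pipelines typically use several normalized physicochemical properties and concatenate the resulting vectors.

```python
def auto_covariance(sequence, scale, max_lag):
    """Auto covariance descriptor for one physicochemical property.

    The sequence is mapped to numbers through `scale`, centred on its
    mean, and for each lag the average product of values `lag`
    positions apart is returned.
    """
    n = len(sequence)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    values = [scale[aa] for aa in sequence]
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]

# Hypothetical property scale, NOT a published amino acid index.
TOY_SCALE = {"A": 1.0, "L": 2.0, "K": -1.0, "E": -2.0}

# Toy sequence with a repeating four-residue pattern.
features = auto_covariance("ALKEALKE", TOY_SCALE, max_lag=4)
```

Because the toy sequence repeats every four residues, the lag-4 feature is strongly positive and the lag-2 feature strongly negative. The descriptor has a fixed length (`max_lag` per property) regardless of the protein length, which is what makes it usable as input to a fixed-size network.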
Fig. 8 Performance analysis of the highest accuracy reported by various approaches of Strategy-A (in %). The dataset name is given in brackets along with the best accuracy. The approach used by [69] performs best, on dataset ‘k’
Fig. 9 Performance analysis of the highest accuracy reported by various approaches of Strategy-B (in %). The dataset name is given in brackets along with the best accuracy. The best accuracy is achieved by the approach used in [74] on ‘g’, which supports the proficiency of auto-feature engineering
Fig. 10 Analysis of the highest performance reported by cited papers under Strategy-C (in %). The attention layer approach used in [92] performed best, on corpus ‘a’
Fig. 11 Categorization of the number of published papers according to strategy
Fig. 12 Performance analysis of manual implementation of approaches employed by [61, 75]. A: Implementation of [61] on k dataset; B: Implementation of [61] on r dataset; C: Implementation of [75] on r dataset
Comparison of the deliberated approaches with state-of-the-art methods
| References | Approach | Acc (%) |
|---|---|---|
| [ | ||
| [ | ||
| [ | ||
| [ | ||
| [ | AC + SVM | 87.36 |
| [ | ACC + SVM | 89.33 |
| [ | LD + SVM | 88.56 |
| [ | MCD + SVM | 91.36 |
| [ | CT + SVM | 83.9 |
| [ | AC + CT + LD + MAC + E-ELM | 87.5 |
| [ | MLD + RF | 88.3 |
| [ | LD + KNN | 86.15 |
| [ | Phylogenetic bootstrap | 75.8 |
| [ | HKNN | 84 |
| [ | Signature products | 83.4 |
| [ | Ensemble of HKNN | 86.6 |
a Performance values highlighted in bold belong to the approaches discussed in previous sections that used DNs for PPI prediction