Farzan Soleymani, Eric Paquet, Herna Viktor, Wojtek Michalowski, Davide Spinello.
Abstract
Most proteins perform their biological function by interacting with themselves or other molecules. Thus, one may obtain biological insights into protein functions, disease prevalence, and therapy development by identifying protein-protein interactions (PPI). However, finding the interacting and non-interacting protein pairs through experimental approaches is labour-intensive and time-consuming, owing to the variety of proteins. Hence, protein-protein interaction and protein-ligand binding problems have drawn attention in the fields of bioinformatics and computer-aided drug discovery. Deep learning methods have paved the way for scientists to predict the 3-D structure of proteins from genomes, predict the functions and attributes of a protein, and modify and design new proteins to provide desired functions. This review focuses on recent deep learning methods applied to problems including predicting protein functions, protein-protein interactions and their sites, protein-ligand binding, and protein design.
Keywords: Deep learning; Protein design; Protein–protein interaction; Sequence-based; Structure-based
Year: 2022 PMID: 36212542 PMCID: PMC9520216 DOI: 10.1016/j.csbj.2022.08.070
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. Primary structure.
Fig. 2. Amino acids.
Physicochemical properties of the 20 amino acids. Columns: (a) steric parameter (graph shape index) [54], [57]; (b) volume; (c) isoelectric point; (d) helix probability [58]; (e) sheet probability [58]; (f) hydrophobicity [59]; (g) hydrophilicity [59]; (h) side-chain residue size [52], [54], [60]; (i) polarity [52]; (j) polarizability [52]; (SASA) solvent-accessible surface area; (NCN) net charge number [52].
| Amino acid | Symbol | a | b | c | d | e | f | g | h | i | j | SASA | NCN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alanine | A | 1.28 | 1.00 | 6.11 | 0.42 | 0.23 | 0.62 | −0.50 | 27.50 | 8.10 | 0.046 | 1.181 | 0.007187 |
| Cysteine | C | 1.77 | 2.43 | 6.35 | 0.17 | 0.41 | 0.29 | −1.00 | 44.60 | 5.50 | 0.128 | 1.461 | −0.03661 |
| Aspartate | D | 1.60 | 2.78 | 2.95 | 0.25 | 0.20 | −0.90 | 3.00 | 40.00 | 13.00 | 0.105 | 1.587 | −0.02382 |
| Glutamate | E | 1.56 | 3.78 | 3.09 | 0.42 | 0.21 | −0.74 | 3.00 | 62.00 | 12.30 | 0.151 | 1.862 | 0.006802 |
| Phenylalanine | F | 2.94 | 5.89 | 5.67 | 0.30 | 0.38 | 1.19 | −2.50 | 115.50 | 5.20 | 0.29 | 2.228 | 0.037552 |
| Glycine | G | 0.00 | 0.00 | 6.07 | 0.13 | 0.15 | 0.48 | 0.00 | 0.00 | 9.00 | 0.00 | 0.881 | 0.179052 |
| Histidine | H | 2.99 | 4.66 | 7.69 | 0.27 | 0.30 | −0.40 | −0.50 | 79.00 | 10.40 | 0.23 | 2.025 | −0.01069 |
| Isoleucine | I | 4.19 | 4.00 | 6.04 | 0.30 | 0.45 | 1.38 | −1.80 | 93.50 | 5.20 | 0.186 | 1.81 | 0.021631 |
| Lysine | K | 1.89 | 4.77 | 9.99 | 0.32 | 0.27 | −1.50 | 3.00 | 100.00 | 11.30 | 0.219 | 2.258 | 0.017708 |
| Leucine | L | 2.59 | 4.00 | 6.04 | 0.39 | 0.31 | 1.06 | −1.80 | 93.50 | 4.90 | 0.186 | 1.931 | 0.051672 |
| Methionine | M | 2.35 | 4.43 | 5.71 | 0.38 | 0.32 | 0.64 | −1.30 | 94.10 | 5.70 | 0.221 | 2.034 | 0.002683 |
| Asparagine | N | 1.60 | 2.95 | 6.52 | 0.21 | 0.22 | −0.78 | 2.00 | 58.70 | 11.60 | 0.134 | 1.655 | 0.005392 |
| Proline | P | 2.67 | 2.72 | 6.80 | 0.13 | 0.34 | 0.12 | 0.00 | 41.90 | 8.00 | 0.131 | 1.468 | 0.23953 |
| Glutamine | Q | 1.56 | 3.95 | 5.65 | 0.36 | 0.25 | −0.85 | 0.20 | 80.70 | 10.50 | 0.18 | 1.932 | 0.049211 |
| Arginine | R | 2.34 | 6.13 | 10.74 | 0.36 | 0.25 | −2.53 | 3.00 | 105.00 | 10.50 | 0.291 | 2.56 | 0.043587 |
| Serine | S | 1.31 | 1.60 | 5.70 | 0.20 | 0.28 | −0.18 | 0.30 | 29.30 | 9.20 | 0.062 | 1.298 | 0.004627 |
| Threonine | T | 3.03 | 2.60 | 5.60 | 0.21 | 0.36 | −0.05 | −0.40 | 51.30 | 8.60 | 0.108 | 1.525 | 0.003352 |
| Valine | V | 3.67 | 3.00 | 6.02 | 0.27 | 0.49 | 1.08 | −1.50 | 71.50 | 5.90 | 0.14 | 1.645 | 0.057004 |
| Tryptophan | W | 3.21 | 8.08 | 5.94 | 0.32 | 0.42 | 0.81 | −3.40 | 145.50 | 5.40 | 0.409 | 2.663 | 0.037977 |
| Tyrosine | Y | 2.94 | 6.47 | 5.66 | 0.25 | 0.41 | 0.26 | −2.30 | 117.30 | 6.20 | 0.298 | 2.368 | 0.023599 |
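A property table like this is typically used to turn a sequence into numeric features. The following is a minimal sketch, assuming only three of the columns are needed: (f) hydrophobicity, (g) hydrophilicity and (i) polarity, with values copied from the table above. The function names `encode` and `mean_profile` are illustrative and do not come from the review.

```python
# Columns (f) hydrophobicity, (g) hydrophilicity, (i) polarity,
# copied from the physicochemical property table above.
PROPERTIES = {
    "A": (0.62, -0.50, 8.10),  "C": (0.29, -1.00, 5.50),
    "D": (-0.90, 3.00, 13.00), "E": (-0.74, 3.00, 12.30),
    "F": (1.19, -2.50, 5.20),  "G": (0.48, 0.00, 9.00),
    "H": (-0.40, -0.50, 10.40), "I": (1.38, -1.80, 5.20),
    "K": (-1.50, 3.00, 11.30), "L": (1.06, -1.80, 4.90),
    "M": (0.64, -1.30, 5.70),  "N": (-0.78, 2.00, 11.60),
    "P": (0.12, 0.00, 8.00),   "Q": (-0.85, 0.20, 10.50),
    "R": (-2.53, 3.00, 10.50), "S": (-0.18, 0.30, 9.20),
    "T": (-0.05, -0.40, 8.60), "V": (1.08, -1.50, 5.90),
    "W": (0.81, -3.40, 5.40),  "Y": (0.26, -2.30, 6.20),
}


def encode(sequence):
    """Return one (hydrophobicity, hydrophilicity, polarity) tuple per residue."""
    rows = []
    for residue in sequence.upper():
        if residue not in PROPERTIES:
            # Nonstandard or ambiguous symbols have no row in the table.
            raise ValueError(f"no table entry for residue {residue!r}")
        rows.append(PROPERTIES[residue])
    return rows


def mean_profile(sequence):
    """Average each property over the sequence (a fixed-length descriptor)."""
    rows = encode(sequence)
    if not rows:
        raise ValueError("empty sequence")
    return tuple(sum(column) / len(rows) for column in zip(*rows))
```

`encode` keeps the per-residue order, which suits sequence models, while `mean_profile` collapses a protein of any length into three numbers. Real pipelines usually standardise each column first, since the properties are on very different scales.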
The nonstandard amino acids [65], [66], [67].
| Name | Symbol | Abbr |
|---|---|---|
| Aspartic acid or Asparagine | B | Asx |
| Leucine or Isoleucine | J | Xle |
| Pyrrolysine | O | Pyl |
| Selenocysteine | U | Sec |
| Glutamic acid or Glutamine | Z | Glx |
| Unknown amino acid | X | Unk |
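Sequence-based encoders defined over the 20 standard residues need some policy for these symbols. As a minimal sketch, assuming the ambiguity codes B, J and Z should be expanded into every standard-alphabet sequence they allow, and that O, U and X are simply rejected (a simplifying assumption, not a recommendation from the review), one could write:

```python
from itertools import product

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Ambiguity codes from the table above, mapped to the residues they may stand for.
AMBIGUOUS = {"B": "DN", "J": "LI", "Z": "EQ"}


def expand(sequence, max_variants=64):
    """List every standard-alphabet sequence consistent with B, J and Z codes.

    `max_variants` is a hypothetical safety cap, since the number of
    variants doubles with each ambiguous position.
    """
    choices = []
    total = 1
    for residue in sequence.upper():
        if residue in STANDARD:
            choices.append(residue)
        elif residue in AMBIGUOUS:
            choices.append(AMBIGUOUS[residue])
            total *= len(AMBIGUOUS[residue])
        else:
            # O (Pyl), U (Sec) and X (Unk) have no standard equivalent here.
            raise ValueError(f"cannot expand symbol {residue!r}")
    if total > max_variants:
        raise ValueError(f"{total} variants exceed the cap of {max_variants}")
    return ["".join(variant) for variant in product(*choices)]
```

For example, `expand("ABC")` yields `["ADC", "ANC"]`. Other policies, such as masking the position or dropping the sequence, are equally defensible; the choice depends on the downstream model.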
Amino acids classified by side-chain properties.
| Property | Class | Amino acids |
|---|---|---|
| Charge | Positive | H, K, R |
| | Negative | D, E |
| | Neutral | A, C, N, P, Q, S, F, G, I, L, M, T, V, W, Y |
| Polarity | Polar | C, D, E, H, K, N, Q, R, S, T |
| | Nonpolar | A, F, G, I, L, M, P, V, W |
| Aromaticity | Aliphatic | I, L, V |
| | Aromatic | F, H, W, Y |
| | Neutral | A, C, D, E, Q, R, S, G, K, M, N, P, T |
| Size | Small | A, G, P, S |
| | Medium | D, N, T |
| | Large | |
Fig. 3. Peptide bond formation. The N-terminus is on the left, and the C-terminus is on the right.
Fig. 5. Tertiary structure [70].
Fig. 6. Myoglobin illustrates a type of tertiary structure consisting of helices connected by loop segments.
Fig. 7. Aspartate transcarbamoylase, an enzyme at the beginning of the pathway for pyrimidine synthesis, presents a remarkable example of quaternary structure.
Fig. 8. A small part of collagen separated by chains.
Fig. 9. Haemoglobin, a globular protein.
Fig. 10. Bacteriorhodopsin, a membrane protein.
Fig. 11. Autoencoder architecture.
Fig. 12. LSTM architecture.
Fig. 13. Convolutional neural network architecture. The input is the feature matrices of two proteins. The output predicts the interaction score between the two proteins.
Fig. 14. Graph convolutional network architecture for PPI prediction.
Fig. 15. GAN architecture.
Fig. 16. Variational autoencoder architecture.
Fig. 17. The conditional GAN architecture.
Fig. 18. Schematic illustration of the PRISM algorithm as an example of a template-based docking method for PPI prediction [236]. (a) If the two complementary sides of a template interface (IL and IR) are similar to the surfaces of any two targets (TL and TR), these two targets may interact and form a protein complex. The black points mark hot spot residues. (b) The algorithm flowchart, which takes the template data set and the target data set as inputs. The surface of each partner of the template interface is aligned with the target surfaces. If the match passes the threshold for hot spot residues, the target proteins may form an interacting pair [236], [224].
Summary of advantages and disadvantages of structure-based deep learning methods for PPI prediction.
| Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| GCN-based [2017] | This study proposed a pairwise classification architecture in which | The proposed convolution operators and | The accuracy of this approach is examined based on |
| IntPred [2018] | This method uses a random forest to predict protein–protein interface sites at | The performance of a binary classifier can be evaluated using different measurements, | The performance of this method depends on the application. |
| Graph-based generative model [2019] | This method uses a graph transformer model for designing protein sequences given | This method uses a self-attention mechanism to capture higher-order, | The evaluation dataset only contained chains up to a length of 500, |
| Struct2Graph [2020] | In this method, graph embeddings of each protein are obtained using an assigned GCN. | Struct2Graph only uses 3-D structural information to predict the PPI. | Limited availability of 3-D structural information |
| LSTM-based [2020] | The proposed method integrates the 3-D structure and sequence-based information of proteins to predict PPIs. | This method performs well despite being trained on a low number of instances. | Limited availability of 3-D structural information |
Datasets of structure-based methods.
| Method | Dataset |
|---|---|
| GCN-based [2017] | Version 5 of the docking benchmark dataset was used by this study |
| IntPred [2018] | The training dataset comprised 58,397 biological units from protein, interfaces, structures and assemblies (PISA), |
| Graph-based generative model [2019] | The dataset was obtained from the CATH (version.4.2) |
| Struct2Graph [2020] | The database was generated based only on direct/physical protein interactions. Therefore, IntAct |
| LSTM-based [2020] | This study used two PPI datasets, Pan’s PPI dataset |
Summary of advantages and disadvantages of sequence-based deep learning methods for PPI prediction.
| Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| SVM-conjoint triad [2007] | Each protein sequence was represented in this study by a vector of amino acid features. | The 20 standard amino acids were clustered into several classes based on | The limited available information on protein pairs restricts the applicability of this method. |
| SVM-autocovariance [2008] | This method combined a new feature representation using autocovariance (AC) and a support vector machine (SVM). | The conjoint triad (CT) method only considered the attributes of an amino acid and | The model achieved a low prediction accuracy of |
| UNISPPI [2013] | This method used a decision tree model, predicting PPIs using only 20 amino acid frequency combinations from | This method was scalable due to using a limited number of attributes. | Instances with a classification score of 0.50 were classified as neither PPIs nor non-PPIs, |
| SVM-based method [2015] | PPI prediction was addressed by integrating a support vector machine (SVM) and | This method extracts more information hidden in protein primary sequences than | SVM algorithms performed relatively poorly with noisy data and are unsuitable for |
| DeepPPI [2017] | The DeepPPI method used a deep neural network architecture for each protein to | This method can capture informative features of protein pairs by a layer-wise abstraction. | The accuracy of DeepPPI on the All Human/Yeast datasets is relatively low, |
| Stacked Autoencoder (SAE) [2017] | This method used a stacked autoencoder to predict PPI. | SAE can learn hidden interaction features of protein sequences. | They used a synthetic negative interaction dataset, |
| DPPI [2018] | This method performed sequence-based PPI prediction using a deep, | DPPI addresses interactions for both homodimeric and heterodimeric proteins. | This method yields lower PPI prediction accuracy on the S. cerevisiae core dataset from |
| PIPR [2019] | This study proposed an end-to-end framework for PPI prediction based on amino acid sequences using a | The Siamese-based learning architecture captured the mutual influence of protein pairs and | RCNN was built using bidirectional gated recurrent units (bidirectional-GRU). |
| S-VGAE [2020] | This model proposed a signed variational graph autoencoder (S-VGAE) that combined sequence information and graph structure. | In this method, the cost function was modified only to consider highly confident interactions, | This model used the conjoint triad (CT) method |
| ACT-SVM [2020] | This method performed feature extraction on the protein sequence to obtain a vector, | The authors observed that the SVM method outperforms K-Nearest Neighbour (KNN), ANN, RFM, | Finding the proper kernel and hyperparameters was challenging, |
| D-SCRIPT [2021] | Deep sequence contact residue interaction prediction transfer (D-SCRIPT) is an interpretable deep learning method | D-SCRIPT generalised to new species considering the sparsity of training data for most model organisms | Despite its performance for cross-species PPI prediction, D-SCRIPT underperformed on within-species evaluations. |
| SPNet [2021] | The Siamese pyramid network (SPNet) architecture used self-binding and folding amino acid sequences to | This architecture consisted of a multilevel pyramid feature structure encompassing various PPI mechanisms | |
| BiLSTM-RF [2021] | The BiLSTM-RF model extracted features of protein pairs in the human database. | BiLSTM extracted the sequence and position of the biological information in the protein sequence. | LSTM models are computationally demanding and slow. |
| Heterogeneous Network [2021] | PPI prediction was performed using a computational sequence and network representation learning-based model. | The protein node contained protein attribute and | Model accuracy is relatively low compared to other deep learning methods such as DPPI |
| OR-RCNN [2021] | This method was called ordinal regression and recurrent convolutional neural network (OR-RCNN), | This method offered better accuracy compared to some existing models, such as autocovariance | The RCNN was built using bidirectional gated recurrent units (bidirectional-GRU). |
| DeepTrio [2022] | The DeepTrio method used a deep-learning framework based on a mask multiscale CNN architecture that | DeepTrio is available both online and offline. | DeepTrio yields lower PPI prediction accuracy on the S. cerevisiae core dataset from |
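Several rows above rely on the conjoint triad (CT) representation, in which residues are clustered into classes and each protein is described by the frequencies of three consecutive classes. The following is a minimal sketch under stated assumptions: the seven-class grouping below is the one commonly associated with the CT method and is not spelled out in this table, the min-max scaling is a simplification, and the names `conjoint_triad` and `pair_features` are illustrative.

```python
from itertools import product

# Assumed seven-class grouping (by side-chain dipole and volume);
# the table above does not list the classes, so treat this as an assumption.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}

# Every ordered triple of classes gets one slot: 7 * 7 * 7 = 343 features.
TRIAD_INDEX = {triad: i for i, triad in enumerate(product(range(7), repeat=3))}


def conjoint_triad(sequence):
    """Return a 343-dimensional, min-max scaled triad frequency vector."""
    try:
        classes = [CLASS_OF[aa] for aa in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"nonstandard residue {err.args[0]!r}") from None
    counts = [0] * len(TRIAD_INDEX)
    # Slide a window of three residues along the sequence.
    for i in range(len(classes) - 2):
        counts[TRIAD_INDEX[tuple(classes[i:i + 3])]] += 1
    low, high = min(counts), max(counts)
    if high == low:
        return [0.0] * len(counts)
    return [(c - low) / (high - low) for c in counts]


def pair_features(seq_a, seq_b):
    """Concatenate both proteins' vectors into one 686-dimensional input."""
    return conjoint_triad(seq_a) + conjoint_triad(seq_b)
```

The resulting vector has a fixed length regardless of protein length, which is what lets it feed an SVM or a dense network. It only sees each residue's immediate neighbours, the limitation that the autocovariance row in the table refers to.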
Datasets of sequence-based methods.
| Method | Dataset |
|---|---|
| SVM-conjoint triad [2007] | A dataset comprising 16,443 nonredundant entries of experimentally verified PPI was extracted from |
| SVM-autocovariance [2008] | The PPI data was extracted from the S. cerevisiae core subset of the Database of Interacting Proteins |
| SVM-based [2015] | This method was evaluated using S. cerevisiae and H. pylori PPI datasets. |
| DeepPPI [2017] | The dataset evaluating the DeepPPI comprised 11,188 negative and positive protein pairs from S. cerevisiae obtained from |
| Stacked Autoencoder (SAE) [2017] | Pan’s PPI dataset was acquired from |
| DPPI [2018] | This study used human and yeast datasets from |
| PIPR [2019] | Guo’s dataset |
| ACT-SVM [2020] | Following |
| S-VGAE [2020] | This study used data from the human protein reference database (HPRD) and the |
| D-SCRIPT [2021] | This study used a dataset from the STRING database (version 11) |
| SPNet [2021] | This study used the dataset of |
| BiLSTM-RF [2021] | A nonredundant human dataset was retrieved from the DIP database. |
| Heterogeneous Network [2021] | The 20 amino acids are divided into four groups based on their side chain polarity |
| OR-RCNN [2021] | This study used datasets derived from the STRING database |
| DeepTrio [2022] | The training and testing datasets were obtained from the Biological General Repository for Interaction Datasets (BioGRID) |
Databases for PPI prediction.
| Category | Database | Description | Year | URL |
|---|---|---|---|---|
| Protein–protein interactions | STRING | Functional associations between protein pairs; contains 67,592,464 proteins from 14,094 organisms and 20,052,394,042 interactions. | 2021 | https://string-db.org/ |
| | IntAct | Contains manually curated datasets (topical), interactomes (for 16 different species) and annotations of experimental evidence. | 2021 | https://www.ebi.ac.uk/intact/home |
| | BioGRID | Contains 2,467,140 protein and genetic interactions, 29,417 chemical interactions and 1,128,339 PTMs from major model organism species. | 2020 | http://www.thebiogrid.org/ |
| | DIP | Experimentally determined PPI database including biological information of proteins, PPIs and experimental techniques for identifying interactions. | 2020 | https://dip.doe-mbi.ucla.edu/dip/Main.cgi |
| | Negatome 2.0 | Contains 21,795 interactions, with scores of zero and one, derived by text mining the literature and analysing protein complexes from PDB. | 2014 | http://mips.helmholtz-muenchen.de/proj/ppi/negatome/ |
| | MINT | Experimentally curated PPI database that includes approximately 117,001 PPIs from 607 different species. | 2012 | https://mint.bio.uniroma2.it/ |
| | HPRD | Consists of 41,327 PPIs, 93,710 PTMs, 22,490 subcellular localizations and 112,158 protein expressions. | 2010 | http://www.hprd.org |
| | BIND | PPIs collected from humans, yeasts, nematodes, etc. | 2005 | http://download.baderlab.org/BINDTranslation |
| Protein sequences | UniProt | A collection of protein sequence and functional information, including UniProtKB, UniParc, UniRef and Proteomes. UniProtKB contains 567,483 reviewed (Swiss-Prot, manually annotated) and 231,354,261 unreviewed (TrEMBL, computationally analysed) protein sequences. | 2020 | http://www.uniprot.org |
| | SWISS-MODEL | A web-based integrated service providing information for protein structure homology modelling. The repository contains 2,217,470 models from SWISS-MODEL for UniProtKB targets, as well as 180,107 structures from PDB with mapping to UniProtKB. | 2020 | https://swissmodel.expasy.org/ |
| | PIR | Integrated protein resources, including protein sequences and high-quality annotations by integrating more than 90 biological databases. | 2022 | http://pir.georgetown.edu/ |
| Higher-level structures | RCSB PDB | Information about the 3-D structure of proteins, nucleic acids, and complex assemblies. Contains 191,144 structures, 57,349 human sequence structures, and 14,406 nucleic acid-containing structures. | 2021 | https://www.rcsb.org/ |
| | SCOP | Classification of known proteins and a comprehensive description of the structural and evolutionary relationships between them. As of 2022-05-30, this dataset contains 72,448 non-redundant domains, representing 858,316 protein structures. | 2022 | http://scop.mrc-lmb.cam.ac.uk/scop |
| Genomic information | CGD | A resource for genomic sequence data, genes and protein information for | 2022 | http://www.candidagenome.org/ |