Identification of Protein-Protein Interactions via a Novel Matrix-Based Sequence Representation Model with Amino Acid Contact Information.

Abstract

Identification of protein-protein interactions (PPIs) is a difficult and important problem in biology. Since experimental methods for identifying PPIs are both expensive and time-consuming, many computational methods have been developed to predict PPIs and interaction networks, which can complement experimental approaches. However, these methods have limitations: they require a large number of homologous proteins or a large body of literature. In this paper, we propose a novel matrix-based protein sequence representation approach to predict PPIs, using an ensemble learning method for classification. We construct the Amino Acid Contact (AAC) matrix from a statistical analysis of residue-pairing frequencies in a database of 6323 protein-protein complexes. We first represent the protein sequence as a Substitution Matrix Representation (SMR) matrix. Then, the feature vector is extracted by applying the Histogram of Oriented Gradient (HOG) and Singular Value Decomposition (SVD) algorithms to the SMR matrix. Finally, we feed the feature vector into a Random Forest (RF) to distinguish interacting pairs from non-interacting pairs. Our method is applied to several PPI datasets to evaluate its performance. On the S. cerevisiae dataset, our method achieves 94.83% accuracy and 92.40% sensitivity; compared with existing methods, its accuracy is higher by 0.11 percentage points. On the H. pylori dataset, our method achieves 89.06% accuracy and 88.15% sensitivity, and its accuracy is higher by 0.76 percentage points. On the Human PPI dataset, our method achieves 97.60% accuracy and 96.37% sensitivity, and its accuracy is higher by 1.30 percentage points. In addition, we test our method on a very important PPI network, the Wnt-related network, where it achieves 92.71% accuracy, an increase of 16.67 percentage points.
The source code and all datasets are available at https://figshare.com/s/580c11dce13e63cb9a53.

Keywords: amino acid contact; feature extraction; protein sequence; protein–protein interactions; substitution matrix representation

Substances：
Amino Acids
Proteins

Year: 2016 PMID： 27669239 PMCID： PMC5085656 DOI： 10.3390/ijms17101623

Source DB: PubMed Journal: Int J Mol Sci ISSN： 1422-0067 Impact factor: 5.923

1. Introduction

Protein–protein interactions (PPIs) are of fundamental importance for discovering molecular mechanisms in biological systems. Identification of PPIs is important for elucidating protein functions and studying biological processes in a cell. In recent years, many prediction methods have been developed for the large-scale analysis of PPIs. Generally, these methods draw on three categories of information: co-evolution information, natural language processing, and protein sequence features. Many methods analyze the co-evolution trend of protein–protein interactions [1,2,3,4,5,6,7,8]. They extract the evolutionary information of homologous proteins via multiple sequence alignment, and evaluate the relationship between protein pairs by a linear correlation coefficient, a similarity measurement of phylogenetic trees or a log-likelihood score. Several technologies have been developed to find PPI evidence from PubMed abstracts, based on Natural Language Processing (NLP) [9,10]. Because a large number of known PPIs are recorded in the biological and medical literature, these methods automatically extract relevant pieces of information from the literature according to a semantic model. However, co-evolution methods are very difficult to compute because they need a large number of homologous proteins. The problem with NLP is that PPI information can be missing from the literature, so prediction may be incomplete.
A large number of studies accurately predict PPIs using protein sequence features to describe amino acids. When machine learning methods are used for this task, one of the most important computational challenges is to extract useful features from protein sequences. Guo et al. [11] use auto-correlation (AC) values of seven different physicochemical scales to describe an amino acid sequence, and apply this method to PPI prediction. Shen et al. [12] describe a protein sequence by amino acid groups, and form its feature vector from the occurrence of conjoint triads (CT). Zhou [13] and Yang [14] split the amino acid sequence into ten local regions of varying length, whose compositions represent multiple overlapping continuous and discontinuous interaction information within one protein sequence. For each local region, they calculate three local descriptors (LD): composition (C), transition (T) and distribution (D). On the basis of LD, You et al. [15,16] expand the range of description by constructing multi-scale local descriptor (MLD) regions, and achieve higher prediction accuracy on the PPI dataset. Huang et al. [17] use BLOSUM62 [18] to construct a new matrix representation from the protein sequence, and also achieve higher prediction accuracy on the PPI dataset. Wong et al. adopt the Physicochemical Property Response Matrix combined with the Local Phase Quantization descriptor (PR-LPQ) [19] as the feature of the protein sequence. In summary, existing approaches use physical and chemical properties of amino acids, position information of amino acids and evolutionary information to represent protein sequences. However, they do not consider the contact information between the various types of amino acids, which is important for predicting PPIs. Therefore, we use amino acid contact information to improve the prediction accuracy of PPI identification.
In this paper, we propose a novel matrix-based protein sequence representation approach for predicting PPIs, using amino acid contact information to improve prediction accuracy and an ensemble learning method for classification. First, we construct the Amino Acid Contact (AAC) matrix, based on 6323 protein–protein complexes from the Protein Data Bank. We use the AAC matrix to represent the protein sequence as a Substitution Matrix Representation (SMR) matrix. Then, we extract the feature vector by applying the Histogram of Oriented Gradient (HOG) and Singular Value Decomposition (SVD) algorithms to the SMR matrix. Finally, we feed the feature vector into a Random Forest (RF) to distinguish interacting pairs from non-interacting pairs. For the performance evaluation, our method is applied to the S. cerevisiae PPI dataset. The prediction results show that our method achieves 94.83% accuracy and 92.40% sensitivity. Compared with existing methods, the accuracy of our method is higher by 0.11 percentage points. To further demonstrate the effectiveness of our method, we also test it on the H. pylori PPI dataset, where it achieves 89.06% accuracy and 88.15% sensitivity, an accuracy increase of 0.76 percentage points. On the Human PPI dataset, our method achieves 97.60% accuracy and 96.37% sensitivity, an accuracy increase of 1.30 percentage points. In addition, we test our method on an important PPI network, where it achieves 92.71% accuracy. In the Wnt-related network [12,20], the accuracy of our method is 16.67 percentage points higher than that of the CT method [12]. We also use the S. cerevisiae PPI dataset to construct a model to predict the other five independent species PPI datasets. Compared with state-of-the-art works, the accuracy of our method is higher overall.

2. Results

In our experiment, we test our method on eight different PPI datasets to evaluate the performance of our proposed approach. The benchmark PPI datasets include one S. cerevisiae dataset, two H. pylori datasets, one Human dataset, one C. elegans dataset, one E. coli dataset, one H. sapiens dataset, and one M. musculus dataset. First, we independently analyze the performance of the two protein representations, the Histogram of Oriented Gradient (HOG) and the Singular Value Decomposition (SVD). Second, we compare our method with other leading methods on the S. cerevisiae, H. pylori and Human datasets. Because our proposed method achieves high performance on these three datasets, we then use the S. cerevisiae PPI dataset to construct a model and evaluate its prediction performance on five independent species PPI datasets. Our experiments suggest that experimentally identified interactions in one organism can be used to predict interactions in other organisms. In addition, we test our method on an important PPI network, and compare it to state-of-the-art works. We use primary sequence information to predict a real PPI network, which is assembled from pairwise PPI data. Finally, we analyze the performance of different protein representation approaches within our method.

2.1. PPI Datasets

We test on eight different PPI datasets to evaluate the performance of our proposed approach. The first PPI dataset (S. cerevisiae), described by You et al. [16], is collected from the core subset of the Database of Interacting Proteins (DIP) [21]. Proteins with fewer than 50 residues or with more than 40% sequence identity to one another are removed. The remaining 5594 protein pairs form the final positive dataset. Non-interacting pairs are selected uniformly, based on the assumption that proteins occupying different subcellular localizations do not interact. The negative dataset thus consists of 5594 protein pairs whose subcellular localizations differ. The positive and negative datasets are combined into a total of 11,188 protein pairs. The second PPI dataset (H. pylori), described by Martin et al. [22], is composed of 2916 protein pairs (1458 interacting pairs and 1458 non-interacting pairs). The third PPI dataset (Human) is collected from the Human Protein Reference Database (HPRD) as described by Huang et al. [17], and contains 8161 protein pairs (3899 interacting pairs and 4262 non-interacting pairs). The other five datasets are C. elegans (4013 interacting pairs), E. coli (6954 interacting pairs), H. sapiens (1412 interacting pairs), M. musculus (313 interacting pairs), and one additional H. pylori dataset (1420 interacting pairs) used by Zhou et al. [13]. These species-specific PPI datasets are employed in our experiment to verify the effectiveness of our proposed method.
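The negative-set construction described above (pairs drawn from proteins with different subcellular localizations) can be illustrated with a minimal sketch. The function name, the toy protein identifiers and the localization labels are all hypothetical; the original datasets were assembled by the cited authors, and this is only one plausible way to sample such pairs.

```python
import random

def sample_negative_pairs(localization, n_pairs, seed=0, max_attempts=100000):
    """Draw unordered protein pairs whose subcellular localizations differ.

    `localization` maps a protein identifier to a localization label.
    Hypothetical helper for illustration, not the authors' procedure.
    """
    rng = random.Random(seed)
    proteins = sorted(localization)
    negatives = set()
    attempts = 0
    while len(negatives) < n_pairs:
        attempts += 1
        if attempts > max_attempts:
            raise ValueError("not enough eligible pairs")
        a, b = rng.sample(proteins, 2)
        # Keep the pair only if the two proteins sit in different compartments.
        if localization[a] != localization[b]:
            negatives.add(tuple(sorted((a, b))))
    return sorted(negatives)

# Toy data: P1 and P3 share a compartment, so (P1, P3) can never be selected.
toy_localization = {"P1": "nucleus", "P2": "cytoplasm", "P3": "nucleus", "P4": "membrane"}
negative_pairs = sample_negative_pairs(toy_localization, n_pairs=3)
```

In a balanced setting such as the S. cerevisiae dataset, `n_pairs` would be set to the number of positive pairs.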

2.2. Evaluation Measurements

To test the robustness of our method, we repeat the process of randomly selecting training and test sets, building the model and evaluating it. This process is five-fold cross validation. We use seven measures: overall prediction accuracy (ACC), sensitivity (SN), specificity (Spec), positive predictive value (PPV), negative predictive value (NPV), the weighted average of the PPV and sensitivity (F1), and Matthew’s correlation coefficient (MCC), which are defined as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\mathrm{SN} = \frac{TP}{TP + FN}$$
$$\mathrm{Spec} = \frac{TN}{TN + FP}$$
$$\mathrm{PPV} = \frac{TP}{TP + FP}$$
$$\mathrm{NPV} = \frac{TN}{TN + FN}$$
$$F_1 = \frac{2 \times \mathrm{SN} \times \mathrm{PPV}}{\mathrm{SN} + \mathrm{PPV}}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$

where the true positive (TP) is the number of actual PPIs that are predicted correctly by our model; the false negative (FN) is the number of true interacting pairs that are missed; the true negative (TN) is the number of true non-interacting pairs that are predicted correctly; and the false positive (FP) is the number of true non-interacting pairs that are predicted as interacting. In our experiment, the ACC is the proportion of true results (correctly identified interacting and non-interacting protein pairs) among the total number of samples. The SN is the proportion of interacting protein pairs that are correctly identified. The Spec is the proportion of non-interacting protein pairs that are correctly identified. The PPV and NPV are the probabilities that a positive and a negative prediction are correct, respectively. The F1 is a weighted average of the SN and PPV, so it takes both into account. The MCC is a more stringent measure that takes into account true and false positives and negatives; it is a correlation coefficient between the observed and predicted binary classifications. The MCC returns a value in [−1, 1]. A coefficient of −1 indicates total disagreement between prediction and observation, 0 is no better than random prediction, and +1 represents a perfect prediction of PPIs.
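As an illustration of the seven measures, the following sketch computes them from the four confusion-matrix counts using their standard definitions, with F1 taken as the harmonic mean of SN and PPV. The function name and the example counts are hypothetical, and the sketch does not guard against zero denominators.

```python
import math

def ppi_metrics(tp, fn, tn, fp):
    """Compute ACC, SN, Spec, PPV, NPV, F1 and MCC from confusion counts."""
    sn = tp / (tp + fn)        # sensitivity: interacting pairs found
    spec = tn / (tn + fp)      # specificity: non-interacting pairs found
    ppv = tp / (tp + fp)       # positive predictive value (precision)
    npv = tn / (tn + fn)       # negative predictive value
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "SN": sn,
        "Spec": spec,
        "PPV": ppv,
        "NPV": npv,
        "F1": 2 * sn * ppv / (sn + ppv),
        "MCC": (tp * tn - fp * fn)
        / math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)),
    }

# Hypothetical counts for a balanced test fold of 200 pairs.
metrics = ppi_metrics(tp=90, fn=10, tn=95, fp=5)
```

The tables in this paper report all measures, including MCC, as percentages, so each value would be multiplied by 100 for comparison.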

2.3. Experimental Environment

In this paper, our proposed sequence-based PPI predictor is implemented in MATLAB (R2009a, The MathWorks, Inc., Natick, MA, USA). All programs are run on a computer with a 6-core CPU, 32 GB of memory and the Windows operating system (Microsoft Corporation, Redmond, WA, USA). The two RF parameters, the number of decision trees and the split parameter, are set to 1000 and 30, respectively.

2.4. Performance of PPI Prediction

We use eight different PPI datasets to evaluate the performance of our proposed method. The proposed approach is compared with other common methods on the S. cerevisiae, H. pylori and Human datasets. Then, we test our method on five other datasets: C. elegans, E. coli, H. sapiens, M. musculus and H. pylori.

2.4.1. Performance on the S. cerevisiae Dataset

We use the first PPI dataset as investigated in You et al. [16] to evaluate the performance of our model.

Performance of HOG and SVD

To understand the contribution of each feature representation component, we test the performance of HOG and SVD separately for PPI prediction. We use the S. cerevisiae dataset, which is randomly divided into five subsets for five-fold cross validation. In each round, four subsets are used for training and the remaining subset for testing. Cross validation reduces the impact of data dependency and improves the reliability of the experimental results. The prediction result is shown in Table 1. The average accuracies for HOG, SVD and the ensemble representation are 93.86%, 92.93% and 94.83%, respectively. The HOG approach performs better than the SVD approach. Using the ensemble representation, the average accuracy is raised by 0.97 percentage points over HOG alone.
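The five-fold protocol can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name and the fixed seed are assumptions, and the original experiments were run in MATLAB, so the actual partitioning code may differ.

```python
import random

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Return a list of (train_indices, test_indices), one pair per fold."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal subsets.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# 11,188 pairs, as in the S. cerevisiae dataset.
splits = five_fold_indices(11188)
```

Each sample appears in exactly one test subset, so every protein pair is predicted once by a model that did not see it during training.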

Table 1

Performance of the Histogram of Oriented Gradient (HOG) and Singular Value Decomposition (SVD) features on the S. cerevisiae dataset with the Random Forest (RF) classifier.

Feature	Classifier	ACC (%)	SN (%)	Spec (%)	PPV (%)	NPV (%)	F1 (%)	MCC (%)
HOG	RF	93.86 ± 0.47	90.67 ± 0.47	97.05 ± 0.74	96.86 ± 0.72	91.22 ± 0.64	93.66 ± 0.40	87.90 ± 0.94
SVD	RF	92.93 ± 0.52	90.25 ± 0.70	95.59 ± 1.36	95.38 ± 1.22	90.76 ± 0.42	92.74 ± 0.47	85.99 ± 1.10
HOG + SVD	RF	94.83 ± 0.26	92.40 ± 0.50	97.26 ± 0.31	97.10 ± 0.35	92.79 ± 0.59	94.69 ± 0.24	89.77 ± 0.50

Five-Fold Cross-Validation Results

The prediction result of our method on the S. cerevisiae dataset is shown in Table 2. The average accuracy, precision (PPV), sensitivity, and MCC are 94.83%, 97.10%, 92.40%, and 89.77%, respectively. The standard deviations of these values are 0.26%, 0.35%, 0.50%, and 0.50%, respectively. The high accuracies and low standard deviations show that our proposed model is effective and stable for predicting PPIs.

Table 2

Five-fold cross validation results obtained by our proposed method on the S. cerevisiae dataset.

Testing Set	ACC (%)	SN (%)	Spec (%)	PPV (%)	NPV (%)	F1 (%)	MCC (%)
1	94.73	92.70	96.72	96.53	93.08	94.58	89.52
2	95.13	92.80	97.31	97.01	93.51	94.86	90.31
3	95.04	92.67	97.47	97.40	92.84	94.98	90.19
4	94.81	92.24	97.40	97.27	92.59	94.69	89.75
5	94.46	91.60	97.39	97.28	91.91	94.35	89.09
Average	94.83 ± 0.26	92.40 ± 0.50	97.26 ± 0.31	97.10 ± 0.35	92.79 ± 0.59	94.69 ± 0.24	89.77 ± 0.50

Comparing with Existing Methods

We compare the prediction performance of our proposed method with that of other existing methods on the S. cerevisiae dataset, as shown in Table 3.

Table 3

Comparison of the prediction performance between our proposed method and other state-of-the-art works on the S. cerevisiae dataset. N/A means not available.

Method	Feature	Classifier	ACC (%)	SN (%)	PPV (%)	MCC (%)
Our method	HOG + SVD	RF	94.83 ± 0.26	92.40 ± 0.50	97.10 ± 0.35	89.77 ± 0.50
You’s work [15]	MLD	RF	94.72 ± 0.43	94.34 ± 0.49	98.91 ± 0.33	85.99 ± 0.89
You’s work [23]	AC + CT + LD + MAC	E-ELM	87.00 ± 0.29	86.15 ± 0.43	87.59 ± 0.32	77.36 ± 0.44
You’s work [16]	MCD	SVM	91.36 ± 0.36	90.67 ± 0.69	91.94 ± 0.62	84.21 ± 0.59
Wong’s work [19]	PR-LPQ	Rotation Forest	93.92 ± 0.36	91.10 ± 0.31	96.45 ± 0.45	88.56 ± 0.63
Guo’s work [11]	ACC	SVM	89.33 ± 2.67	89.93 ± 3.68	88.87 ± 6.16	N/A
Guo’s work [11]	AC	SVM	87.36 ± 1.38	87.30 ± 4.68	87.82 ± 4.33	N/A
Zhou’s work [13]	LD	SVM	88.56 ± 0.33	87.37 ± 0.22	89.50 ± 0.60	77.15 ± 0.68
Yang’s work [14]	LD	KNN	86.15 ± 1.17	81.03 ± 1.74	90.24 ± 1.34	N/A

* The feature representations include the Histogram of Oriented Gradient (HOG), Singular Value Decomposition (SVD), Multi-scale Local Descriptor (MLD), Auto-Correlation (AC), Conjoint Triads (CT), Local Descriptors (LD), Moran autocorrelation (MAC), Multi-scale Continuous and Discontinuous (MCD), Physicochemical Property Response Matrix combined with the Local Phase Quantization descriptor (PR-LPQ) and Auto Cross Covariance (ACC). The classifiers include the Random Forest (RF), Ensemble Extreme Learning Machine (E-ELM), Support Vector Machine (SVM), Rotation Forest and K-Nearest Neighbor (KNN).

It can be observed that our proposed model obtains a high prediction accuracy of 94.83%. We use the same PPI dataset and compare our experimental result with You et al. [15,16,23], Wong et al. [19], Guo et al. [11], Zhou et al. [13] and Yang et al. [14]. These works use Random Forest (RF) with MLD, Ensemble Extreme Learning Machines (E-ELM) with AC + CT + LD + Moran autocorrelation (MAC), Support Vector Machine (SVM) with Multi-scale Continuous and Discontinuous (MCD), Rotation Forest with PR-LPQ, SVM with AC, SVM with ACC, SVM with LD, and k-Nearest Neighbor (KNN) with LD, respectively. Their prediction accuracies are 94.72%, 87.00%, 91.36%, 93.92%, 87.36%, 89.33%, 88.56%, and 86.15%, respectively, whereas our prediction accuracy is 94.83%. Our method has the highest prediction accuracy on the S. cerevisiae PPI dataset among all of the above methods, and its MCC (89.77%) is also the best.

2.4.2. Performance on the H. pylori Dataset

To highlight the advantage of our method, we also test it on the H. pylori dataset described by Martin et al. [22]. We compare the prediction performance of our proposed method with previous works including MLD [15], AC + CT + LD + MAC [23], MCD [16], Discrete Cosine Transformation (DCT) + Substitution Matrix Representation (SMR) [17], LD [13], phylogenetic bootstrap [24], signature products [22], K-local hyperplane distance nearest neighbor algorithm (HKNN) [25], ensemble of HKNN [26] and boosting. Table 4 shows that the average sensitivity, PPV, accuracy and MCC achieved by the proposed predictor are 88.15%, 89.79%, 89.06%, and 78.15%, respectively. The prediction accuracy of our method is better than that of all of the above methods, and its PPV is also the best.

Table 4

Comparison of the prediction performance between our proposed method and other methods on the H. pylori dataset. N/A means not available.

Methods	ACC (%)	SN (%)	PPV (%)	MCC (%)
Our method	89.06	88.15	89.79	78.15
You’s work (MLD) [15]	88.30	92.47	85.99	79.19
You’s work (AC + CT + LD + MAC) [23]	87.50	88.95	86.15	78.13
You’s work (MCD) [16]	84.91	83.24	86.12	74.40
Huang’s work (DCT + SMR) [17]	86.74	86.43	87.01	76.99
Zhou’s work [13]	84.20	85.10	83.30	N/A
Phylogenetic bootstrap [24]	75.80	69.80	80.20	N/A
HKNN [25]	84.00	86.00	84.00	N/A
Signature products [22]	83.40	79.90	85.70	N/A
Ensemble of HKNN [26]	86.60	86.70	85.00	N/A
Boosting	79.52	80.37	81.69	70.64

2.4.3. Performance on the Human Dataset

We also test our method on the Human dataset used by Huang et al. [17]. We compare the prediction performance of our proposed method with Huang’s work [17] on the Human dataset, as shown in Table 5. Our method achieves a prediction accuracy, sensitivity and MCC of 97.60%, 96.37%, and 95.21%, respectively. The prediction accuracy, sensitivity and MCC reported by Huang et al. [17] are 96.30%, 92.63%, and 92.82%, respectively. Again, our method obtains better prediction results than Huang’s work on the Human dataset in terms of accuracy and MCC.

Table 5

Comparison of the prediction performance between our proposed method and other methods on the Human dataset.

Methods	ACC (%)	SN (%)	PPV (%)	MCC (%)
Our method	97.60	96.37	98.59	95.21
Huang’s work (DCT + SMR) [17]	96.30	92.63	99.59	92.82

2.5. PPI Identification on Independent Cross-Species Datasets

Our tests on the datasets above show very good prediction results. In addition, our method is tested on five other independent species datasets. If a large number of physically interacting proteins in one organism have co-evolved, their respective orthologs in other organisms are expected to interact as well. In this section, we use all 11,188 samples of the S. cerevisiae dataset as the training set and the other species datasets (E. coli, C. elegans, H. sapiens, H. pylori and M. musculus) as test sets. We use the same feature extraction method as described above. The performance of these five experiments is summarized in Table 6. The accuracies are 93.18%, 90.28%, 94.58%, 92.03%, and 92.25% on the E. coli, C. elegans, H. sapiens, H. pylori and M. musculus datasets, respectively. This shows that the model is capable of predicting PPIs in other species. The prediction result of our method is better than You’s work [15], Huang’s work [17] and Zhou’s work [13] in terms of accuracy.

Table 6

Prediction results on five independent species by our proposed method, with the S. cerevisiae dataset as the training set. N/A means not available.

Species	Testing Pairs	Our Method ACC (%)	You’s Work [15] ACC (%)	Huang’s Work [17] ACC (%)	Zhou’s Work [13] ACC (%)
E. coli	6954	93.18	89.30	66.08	71.24
C. elegans	4013	90.28	87.71	81.19	75.73
H. sapiens	1412	94.58	94.19	82.22	76.27
H. pylori	1420	92.03	90.99	82.18	N/A
M. musculus	313	92.25	91.96	79.87	76.68

2.6. PPI Network Prediction

A useful application of a PPI prediction method is predicting PPI networks. We apply our method to an important PPI network assembled from pairwise PPIs. The Wnt-related network is a typical crossover network, and its related pathway is essential in signal transduction. Ulrich et al. [20] have demonstrated the protein interaction topology of the Wnt-related network, and Shen et al. [12] have tested their method on it. There are 96 PPI pairs in this network, of which 73 are predicted correctly by their method, an accuracy of 76.04%. We also predict PPIs in the Wnt-related network. Our method discovers 89 of the 96 PPIs in the network, an accuracy of 92.71%, which is better than Shen’s work [12]. The prediction result and the Wnt-related network are shown in Figure 1. Dark blue lines are correct predictions, and red lines are incorrect predictions.
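The accuracies on the Wnt-related network follow directly from the counts of correctly predicted pairs among the 96 PPIs, and their difference is the 16.67 percentage point improvement quoted in the abstract:

$$\frac{73}{96} \approx 76.04\%, \qquad \frac{89}{96} \approx 92.71\%, \qquad 92.71\% - 76.04\% = 16.67 \text{ percentage points}$$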

Figure 1

A crossover network for the Wnt-related pathway.

2.7. Comparison of Different Protein Representation Approaches

Loris Nanni et al. [27,28] described several matrix representations of proteins, including the Amino-Acid Sequence (AAS), the Position-Specific Scoring Matrix (PSSM), and the Physicochemical Property Response Matrix (PR). We analyze the performance of BLOSUM62 [18], the AAC matrix, AAC + BLOSUM62, AAS, PSSM and PR as the protein representation matrix in our method (HOG and SVD algorithms), as shown in Table 7. PR cannot be used with the SVD algorithm, so it is processed by the HOG algorithm only. We test these protein representation matrices on the S. cerevisiae, H. pylori and Human datasets. The accuracies of the AAC matrix with our method are 94.83%, 89.06%, and 97.60% on the three datasets, respectively. The prediction accuracy of AAC is better than that of all the other protein representation methods on the S. cerevisiae and Human datasets.

Table 7

Comparison of different protein representation approaches by our method.

Dataset	AAC ACC (%)	BLOSUM62 ACC (%)	AAC + BLOSUM62 ACC (%)	PSSM ACC (%)	AAS ACC (%)	PR ACC (%)
S. cerevisiae	94.83 ± 0.26	94.32 ± 0.21	94.34 ± 0.63	94.21 ± 0.57	94.19 ± 0.66	93.37 ± 0.38
H. pylori	89.06 ± 0.96	88.62 ± 1.13	89.16 ± 1.09	88.51 ± 1.04	87.59 ± 1.27	84.67 ± 1.29
Human	97.60 ± 0.29	97.56 ± 0.13	97.59 ± 0.16	97.55 ± 0.33	97.46 ± 0.48	96.56 ± 0.91

3. Discussion

At present, many computational methods are used to predict PPIs. However, the performance and effectiveness of previous prediction models can still be enhanced. In this paper, we develop a new method for predicting PPIs from the primary sequences of two proteins. The prediction model is based on an ensemble feature representation scheme: we use HOG and SVD features with a Random Forest to improve the performance of PPI prediction. To test the performance of the AAC matrix, we compare it with other common protein representation approaches, namely BLOSUM62, AAS, PSSM and PR, each of which represents a protein sequence as a matrix. Features are extracted from these representation matrices by the same HOG and SVD algorithms. The performance of the AAC matrix is better than that of all these approaches on the S. cerevisiae and Human datasets. Our method is applied to three datasets, and its prediction ability is better than that of other existing state-of-the-art PPI prediction methods. It achieves 94.83% accuracy on the S. cerevisiae dataset, 89.06% accuracy on the H. pylori PPI dataset, and 97.60% accuracy on the Human dataset. Our proposed method also obtains good prediction accuracy in cross-species experiments on five other independent datasets, achieving more than 90% accuracy on the E. coli, C. elegans, H. sapiens, H. pylori and M. musculus datasets. Our results indicate that the proposed model can be successfully applied to other species for which experimental PPI data is not available. It should be noted that the biological hypothesis behind mapping PPIs from one species to another is that large numbers of physically interacting proteins in one organism have co-evolved.
The most important issue for PPI prediction methods is the accurate prediction of PPI networks. We extend our method to predict an important PPI network, and its accuracy is 16.67 percentage points higher than that of CT. General PPI networks are crossover networks, so our method is useful in practical applications. All of these results indicate that our proposed method is a useful support tool for future PPI network research. The method performs well on the above datasets because it adopts an effective feature extraction scheme and captures useful protein sequence information. In future work, we will extend our method to predict other important PPI networks.

4. Materials and Methods

In this paper, we propose a novel method to extract features from protein sequences for predicting protein–protein interactions. First, we construct the Amino Acid Contact (AAC) matrix, based on 6323 protein–protein complexes from the Protein Data Bank. We use the AAC matrix to represent the protein sequence as a Substitution Matrix Representation (SMR) matrix. Then, we use the Histogram of Oriented Gradient (HOG) and Singular Value Decomposition (SVD) algorithms to extract the feature vector from the SMR matrix. Finally, we feed the feature vector into a classifier for PPI prediction.

4.1. Amino Acid Contact Matrix

Inspired by previous work [29], we consider 20 amino acid types and one solvent contacting residues in protein surfaces. The Amino Acid Contact (AAC) matrix is obtained from the statistical analysis of residue-pairing frequencies in one protein–protein complex database. We select 6323 complexes from the Protein Data Bank [30]. These complexes are made up of two or more protein subunits and their structures are determined by X-rays with cutoff values of resolution Å and sequence identity . We define a pair of residues from two subunits as a contact pair, if two atoms (one from each subunit) are within distance d (set to be in our method). The AAC matrix is correlated to statistical observed numbers of pairwise contacts on the interface. The amino acid contact between two amino acid types i and j is defined as follows: where type 0 corresponds to the solvent. The number of i-j contact is defined as , and the number of i-0 contact is defined as . These values are the estimation of actual numbers of contacts, where is the contact number between residue types i and j, and is the contact number between residue type i and water in each complex. In addition, the expected number of contacts is defined as follows: and where p denotes a complex of protein pair in the data set; is the fraction of residue type i in all residues for each complex; and are total numbers of residue–residue contacts and residue–water contacts in each complex, respectively.

4.2. Substitution Matrix Representation

We represent the protein sequence as a Substitution Matrix Representation (SMR) matrix, as described by Yu et al. [31] and Huang et al. [17]. A protein sequence of length L can be represented as a 20 × L matrix, based on a substitution matrix. We use the AAC matrix above as the substitution matrix, which is used for replacing a residue–water contact with a residue–residue contact. SMR(i, j) represents the distance of amino acid type i contacting position j of the given protein sequence in the interaction process, and is defined as follows:

$$\mathrm{SMR}(i, j) = S(i, P(j)), \quad i = 1, \ldots, 20, \quad j = 1, \ldots, L$$

where i is one of the twenty amino acid types, j is one of the L positions in the given protein sequence, P(j) is the amino acid type at position j, and S denotes the substitution matrix.
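The SMR construction can be sketched as a lookup into the substitution matrix: row i corresponds to one of the twenty amino acid types and column j to a sequence position. In the sketch below, the alphabet ordering, the function name and the toy scores are assumptions for illustration; the toy matrix is not the AAC matrix.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 types

def smr_matrix(sequence, substitution):
    """Return a 20 x L nested list whose (i, j) entry is
    substitution[type_i][residue at position j]."""
    return [[substitution[aa][residue] for residue in sequence]
            for aa in AMINO_ACIDS]

# Hypothetical toy scores (NOT the AAC matrix): 1.0 on the diagonal, -0.5 elsewhere.
toy_substitution = {a: {b: (1.0 if a == b else -0.5) for b in AMINO_ACIDS}
                    for a in AMINO_ACIDS}
smr = smr_matrix("MKV", toy_substitution)
```

The matrix always has 20 rows, while the number of columns follows the sequence length, which is why fixed-size descriptors are extracted from it afterwards.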

4.3. Histogram of Oriented Gradient

Nanni et al. [32] explored a method that represents a protein as an image and extracts features from the image using the continuous wavelet transform for protein classification. The Histogram of Oriented Gradients (HOG) [33,34] is a feature descriptor used in computer vision and image processing for object detection. In our work, the SMR matrix is regarded as a special image matrix that contains the AAC information. The essential idea of the HOG descriptor is that local object appearance and shape can be described by the distribution of intensity gradients, which capture local detail features of the signal. The schematic diagram of HOG is shown in Figure 2.

Figure 2

The schematic diagram for calculating Histogram of Oriented Gradient (HOG).

4.3.1. Gradient Computation

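The most common method of gradient computation is to apply the one-dimensional centered discrete derivative mask $[-1, 0, 1]$ in both the horizontal and vertical directions. The gradient values $G_x$ and $G_y$ in the horizontal and vertical directions are computed as follows:

$$G_x(i, j) = \mathrm{SMR}(i, j + 1) - \mathrm{SMR}(i, j - 1)$$
$$G_y(i, j) = \mathrm{SMR}(i + 1, j) - \mathrm{SMR}(i - 1, j)$$

Then, the gradient magnitude and the gradient direction are calculated as follows:

$$\gamma(i, j) = \sqrt{G_x(i, j)^2 + G_y(i, j)^2}$$
$$\alpha(i, j) = \arctan\frac{G_y(i, j)}{G_x(i, j)}$$

Here, we get the gradient magnitude matrix γ and the gradient direction matrix α, which have the same size as the SMR matrix. Each entry of γ corresponds to the entry of α at the same location. The quadrant of α is determined by the signs of $G_x$ and $G_y$, so the values of the gradient direction range from 0 to 360 degrees.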
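A minimal NumPy sketch of this gradient step is given below. It applies the centered mask [-1, 0, 1] along rows and columns and maps directions to [0, 360) with `arctan2`. The border handling (replicating edge values) is an assumption, as is the function name; the paper does not state how borders are treated.

```python
import numpy as np

def smr_gradients(smr):
    """Return (magnitude, direction_in_degrees) for a 2-D matrix."""
    smr = np.asarray(smr, dtype=float)
    # Assumption: replicate border values so the output keeps the input shape.
    padded = np.pad(smr, 1, mode="edge")
    gx = padded[1:-1, 2:] - padded[1:-1, :-2]   # horizontal: M[i, j+1] - M[i, j-1]
    gy = padded[2:, 1:-1] - padded[:-2, 1:-1]   # vertical:   M[i+1, j] - M[i-1, j]
    magnitude = np.hypot(gx, gy)
    direction = np.degrees(np.arctan2(gy, gx)) % 360.0
    return magnitude, direction

# Toy 3 x 4 matrix standing in for an SMR matrix.
toy = np.arange(12, dtype=float).reshape(3, 4)
magnitude, direction = smr_gradients(toy)
```

For the interior entry (1, 1) of the toy matrix, the horizontal difference is 6 − 4 = 2 and the vertical difference is 9 − 1 = 8, so the magnitude is √68.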

4.3.2. Dividing Matrix and Calculating Histogram

The gradient magnitude matrix γ and the gradient direction matrix α are each divided into 9 sub-matrices of the same size. Each cell within a sub-matrix carries a gradient magnitude and a gradient direction. The sub-matrices overlap at their edge regions, which simplifies the calculation and the division, and keeps the information continuous between sub-matrices. A location mapping relates each sub-matrix to the full matrix, where p and q are the subscripts of the sub-matrix (9 sub-matrices in total), and a and b are the location subscripts inside the sub-matrix. For every sub-matrix, we create 9 orientation-based histogram channels according to the gradient direction: 0–40, 40–80, …, 320–360 degrees. Then, we cast a weighted vote for each orientation-based histogram channel, based on the gradient magnitude: in sub-matrix k (k = 1, …, 9), the gradient direction of a cell determines the histogram channel to which the cell belongs, and that channel is increased by the gradient magnitude of the cell. Since each sub-matrix yields 9 histogram channels, we obtain 9 × 9 = 81 channels for the 9 sub-matrices. Therefore, we get an 81-dimensional vector from one protein sequence.
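The histogram step can be sketched as follows: a 3 × 3 grid of sub-matrices, 9 direction bins of 40 degrees each, and magnitude-weighted votes, giving 81 values. This sketch makes a simplifying assumption that differs from the text: the sub-matrices are non-overlapping, with boundaries obtained by integer rounding. The function name is hypothetical.

```python
import numpy as np

def hog_histogram(magnitude, direction, grid=3, n_bins=9):
    """Magnitude-weighted direction histograms over a grid x grid partition.

    Assumption: non-overlapping sub-matrices (the paper uses overlapping edges).
    """
    magnitude = np.asarray(magnitude, dtype=float)
    direction = np.asarray(direction, dtype=float)
    rows, cols = magnitude.shape
    row_edges = np.linspace(0, rows, grid + 1).astype(int)
    col_edges = np.linspace(0, cols, grid + 1).astype(int)
    bin_width = 360.0 / n_bins  # 40 degrees per channel
    features = []
    for p in range(grid):
        for q in range(grid):
            block = (slice(row_edges[p], row_edges[p + 1]),
                     slice(col_edges[q], col_edges[q + 1]))
            # Channel index for each cell; clamp guards against direction == 360.
            bins = np.minimum((direction[block] // bin_width).astype(int), n_bins - 1)
            hist = np.bincount(bins.ravel(), weights=magnitude[block].ravel(),
                               minlength=n_bins)
            features.append(hist)
    return np.concatenate(features)

# Toy input: unit magnitudes, every direction at 45 degrees (channel 40-80).
hog = hog_histogram(np.ones((20, 30)), np.full((20, 30), 45.0))
```

Because the grid and bin counts are fixed, the output length is 81 regardless of the sequence length L.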

4.3.3. Normalization

To obtain invariance within every local sub-matrix, we normalize the vector v by a normalization factor that includes a small constant ϵ.
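One common normalization for HOG descriptors is the L2 form, which divides v by the square root of its squared norm plus ϵ squared. The sketch below assumes that form. Both the L2 choice and the default value of `eps` are assumptions made for illustration and are not taken from the text above.

```python
import math

def l2_normalize(v, eps=1e-3):
    """Scale v by 1 / sqrt(||v||_2^2 + eps^2).

    Both the L2 form and the default eps are assumptions of this sketch.
    """
    factor = math.sqrt(sum(x * x for x in v) + eps * eps)
    return [x / factor for x in v]

normalized = l2_normalize([3.0, 4.0])
```

The small constant keeps the factor non-zero, so an all-zero vector is returned unchanged instead of causing a division by zero.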

4.4. Singular Value Decomposition

In linear algebra, the Singular Value Decomposition (SVD) is a factorization of a real or complex matrix. The SVD is often used for image signal compression and de-noising. Formally, the SVD of an $m \times n$ matrix M is a factorization of the form

$$M = U \Sigma V^{*},$$

where U is a real or complex unitary matrix ($m \times m$), Σ is a rectangular diagonal matrix with nonnegative real numbers on the diagonal ($m \times n$), and $V^{*}$ (the conjugate transpose of V) is a real or complex unitary matrix ($n \times n$). The diagonal entries of Σ are known as the singular values of M. The columns of U and the columns of V are called the left-singular vectors and right-singular vectors of M, respectively. We apply SVD to decompose the transpose of the SMR matrix, $\mathrm{SMR}^{T}$, in order to extract fixed-size features from variable-length protein sequences. SVD can capture the latent pattern of the original matrix and yields a fixed number of entries, all of which are collected into one feature vector.
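The following NumPy sketch shows how decomposing the transposed matrix can give a feature vector whose length does not depend on the sequence length. It assumes that the SMR matrix has one row per residue and 20 columns, and that the flattened matrix of left-singular vectors is used as the feature vector. Both points are assumptions of this illustration, and `svd_features` is a hypothetical name.

```python
import numpy as np

def svd_features(smr):
    """Flatten the left-singular vectors of smr.T into one feature vector.

    Assumes smr has shape (L, 20); smr.T is then 20 x L, and U is 20 x 20
    whatever the sequence length L is.
    """
    u, s, vh = np.linalg.svd(smr.T, full_matrices=True)
    return u.flatten(), s

rng = np.random.default_rng(0)
short, _ = svd_features(rng.normal(size=(50, 20)))    # sequence of length 50
long_, _ = svd_features(rng.normal(size=(300, 20)))   # sequence of length 300
```

Both calls return 400 values. Singular vectors are only defined up to sign, so features built this way depend on the sign convention of the SVD routine.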

4.5. Random Forest Classifier

In this paper, the feature space of each pair of proteins is composed of HOG and SVD features. Specifically, each protein sequence is encoded by its HOG and SVD features, and the features of the two proteins in a pair are combined, so each pair is represented by a 962-dimensional feature vector, which is the input to the classifier model. The class label t distinguishes interacting pairs (t = 1) from non-interacting pairs. We feed the feature vector into a Random Forest model to separate interacting pairs from non-interacting pairs.

Random Forest (RF) is a classification algorithm developed by Leo Breiman [35] that uses an ensemble of classification trees. Each classification tree is built from a bootstrap sample of the training data, and the candidate set for each split is a random subset of the variables. Bagging and random variable selection keep the correlation between individual trees low. RF has been demonstrated to perform well in classification tasks. To grow a classification tree, we randomly choose N cases from the original data with replacement to build its training set. At each node, k variables are selected at random out of the K input variables, and the best split on these k variables is used to split the node. The value of k is held constant while the forest is grown. For new cases, the classification result is obtained by voting over these trees.
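The three ingredients described here (bootstrap sampling, random selection of k out of K variables, and voting) can be illustrated with a deliberately small toy forest. In this sketch every tree is a single split (a decision stump) rather than a fully grown classification tree, so the random variable selection happens once per tree instead of at every node. The function names, the toy data, the value 0 for the non-interacting label, and the settings `n_trees=25` and `k=1` are assumptions made for illustration.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_stump(X, y, features):
    """Pick the best one-threshold split among the candidate features."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [t for row, t in zip(X, y) if row[f] <= threshold]
            right = [t for row, t in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, threshold, majority(left), majority(right))
    if best is None:  # no usable split: fall back to a constant prediction
        label = majority(y)
        return (0, float("inf"), label, label)
    return best[1:]

def fit_forest(X, y, n_trees=25, k=1, seed=0):
    rng = random.Random(seed)
    n, K = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # N cases, with replacement
        features = rng.sample(range(K), k)          # k of the K variables
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def predict(forest, row):
    votes = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return majority(votes)  # majority vote over all trees

# Toy data: label 1 = interacting, 0 = non-interacting (0 is an assumed value).
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
```

A full implementation would grow each tree over many nodes and repeat the random selection of k variables at each of them.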

5. Conclusions

In this paper, we develop a new method for predicting PPIs from the primary sequences of two proteins. The prediction model is built on a Random Forest and an ensemble feature representation scheme (HOG and SVD features). The experimental results show that the proposed method outperforms previous methods on several common data sets. Moreover, we extend our method to predict an important PPI network, where its accuracy is clearly higher than that of CT. All these results indicate that our proposed method is a promising and useful support tool for future proteomics research.

28 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Similarity of phylogenetic trees as indicator of protein-protein interaction.

Authors: F Pazos; A Valencia
Journal: Protein Eng Date: 2001-09

3. The Database of Interacting Proteins: 2004 update.

Authors: Lukasz Salwinski; Christopher S Miller; Adam J Smith; Frank K Pettit; James U Bowie; David Eisenberg
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

4. Amino acid substitution matrices from protein blocks.

Authors: S Henikoff; J G Henikoff
Journal: Proc Natl Acad Sci U S A Date: 1992-11-15 Impact factor: 11.205

5. Wavelet images and Chou's pseudo amino acid composition for protein classification.

Authors: Loris Nanni; Sheryl Brahnam; Alessandra Lumini
Journal: Amino Acids Date: 2011-10-13 Impact factor: 3.520

6. Predicting subcellular location of apoptosis proteins with pseudo amino acid composition: approach from amino acid substitution matrix and auto covariance transformation.

Authors: Xiaoqing Yu; Xiaoqi Zheng; Taigang Liu; Yongchao Dou; Jun Wang
Journal: Amino Acids Date: 2011-02-23 Impact factor: 3.520

7. An ensemble of K-local hyperplanes for predicting protein-protein interactions.

Authors: Loris Nanni; Alessandra Lumini
Journal: Bioinformatics Date: 2006-02-15 Impact factor: 6.937

8. Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method.

Authors: Lukas Burger; Erik van Nimwegen
Journal: Mol Syst Biol Date: 2008-02-12 Impact factor: 11.429

9. Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners.

Authors: Carlo Baldassi; Marco Zamparo; Christoph Feinauer; Andrea Procaccini; Riccardo Zecchina; Martin Weigt; Andrea Pagnani
Journal: PLoS One Date: 2014-03-24 Impact factor: 3.240

10. Using Weighted Sparse Representation Model Combined with Discrete Cosine Transformation to Predict Protein-Protein Interactions from Protein Sequence.

Authors: Yu-An Huang; Zhu-Hong You; Xin Gao; Leon Wong; Lirong Wang
Journal: Biomed Res Int Date: 2015-10-28 Impact factor: 3.411
