Xiaodi Yang, Shiping Yang, Qinmengge Li, Stefan Wuchty, Ziding Zhang.
Abstract
The identification of human-virus protein-protein interactions (PPIs) is an essential and challenging research topic, potentially providing a mechanistic understanding of viral infection. Given that the experimental determination of human-virus PPIs is time-consuming and labor-intensive, computational methods play an important role in providing testable hypotheses, complementing the determination of large-scale interactomes between species. In this work, we applied an unsupervised sequence embedding technique (doc2vec) to represent protein sequences as rich feature vectors of low dimensionality. Training a Random Forest (RF) classifier on a dataset that covers known PPIs between human and all viruses, we obtained excellent predictive accuracy, outperforming various combinations of machine learning algorithms and commonly used sequence encoding schemes. In a rigorous comparison with three existing human-virus PPI prediction methods, our proposed computational framework further provided very competitive and promising performance, suggesting that the doc2vec encoding scheme effectively captures context information of protein sequences pertaining to the corresponding protein-protein interactions. Our approach is freely accessible through our web server as part of our host-pathogen PPI prediction platform (http://zzdlab.com/InterSPPI/). Taken together, we hope the current work not only contributes a useful predictor to accelerate the exploration of human-virus PPIs, but also provides some meaningful insights into human-virus relationships.
Keywords: AC, Auto Covariance; ACC, Accuracy; AUC, area under the ROC curve; AUPRC, area under the PR curve; Adaboost, Adaptive Boosting; CT, Conjoint Triad; Doc2vec; Embedding; Human-virus interaction; LD, Local Descriptor; MCC, Matthews correlation coefficient; ML, machine learning; MLP, Multiple Layer Perceptron; MS, mass spectroscopy; Machine learning; PPIs, protein-protein interactions; PR, Precision-Recall; Prediction; Protein-protein interaction; RBF, radial basis function; RF, Random Forest; ROC, Receiver Operating Characteristic; SGD, stochastic gradient descent; SVM, Support Vector Machine; Y2H, yeast two-hybrid
Year: 2019 PMID: 31969974 PMCID: PMC6961065 DOI: 10.1016/j.csbj.2019.12.005
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Workflow of our computational pipeline to predict human-virus PPIs. In the dataset preparation step, we constructed positive and negative data samples, utilizing human-virus protein interaction data from HPIDB as well as the SwissProt database. We then randomly sampled 80% of the data as training data, while the remaining data was used as an independent test set. In the feature extraction step, we formed a corpus of sequence information from these protein data to train a doc2vec model, allowing us to extract/infer protein sequence-specific features. Representing the 80% of interactions in the training data through such feature embeddings, we used Random Forests (RF) to predict protein interactions, assessed by 5-fold cross-validation and on the independent test set (the remaining 20% of the interaction data). In the final step, we compared our doc2vec + RF model with combinations of different encoding schemes such as the Conjoint Triad (CT), Local Descriptor (LD) and Auto Covariance (AC) and widely used ML methods such as Support Vector Machine (SVM), Multiple Layer Perceptron (MLP) and Adaptive Boosting (Adaboost).
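The data partitioning described in the Fig. 1 caption (a random 80/20 split, then 5-fold cross-validation on the training portion) can be sketched as follows. This is a minimal illustration using only the Python standard library. The function names, the toy identifiers, the fixed seed and the contiguous fold layout are assumptions made for the example, not details taken from the paper.

```python
import random

def split_train_test(pairs, train_fraction=0.8, seed=0):
    """Shuffle labelled pairs and split them into training and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed: an assumption, for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, validation_idx) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# Hypothetical toy data: (human_protein_id, virus_protein_id, label)
pairs = [(f"H{i}", f"V{i}", i % 2) for i in range(50)]

train, test = split_train_test(pairs)        # 80% training, 20% independent test
folds = list(k_fold_indices(len(train), 5))  # 5-fold CV over the shuffled training set
```

Because the training set is shuffled before folding, contiguous folds are equivalent to random folds here. The independent test set is never touched during cross-validation.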
Fig. 2. Performance of various classifiers in predicting human-virus PPIs based on doc2vec encoding. Areas under the Precision-Recall curves (AUPRC) indicate that Random Forests (RF) outperformed Support Vector Machine (SVM), Multiple Layer Perceptron (MLP) and Adaptive Boosting (Adaboost) (A) applying 5-fold cross-validation and (B) using an independent test set.
Fig. 3. Performance of the RF classifier in predicting human-virus PPIs based on different sequence-based encoding schemes. Areas under the Precision-Recall curves (AUPRC) indicate that doc2vec encoding provided the best prediction performance compared to a combination of Local Descriptor (LD), Conjoint Triad (CT) and Auto Covariance (AC) as well as each of these encoding techniques separately (A) applying 5-fold cross-validation and (B) using an independent test set.
Fig. 4. Performance of various combinations of ML algorithms and sequence-based encoding schemes in predicting human-virus PPIs. Areas under the Precision-Recall curves (AUPRC) show that our pipeline, which combines doc2vec embedding and Random Forests (RF), outperforms other combinations (A) applying 5-fold cross-validation and (B) using an independent test set. Note that, given the computational cost of SVM, only half of all samples were used to train and assess the SVM classifiers.
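The figures above compare models by AUPRC. The text here does not state how the area was computed, so the sketch below shows one common estimator, the step-wise average precision, and should be read as an illustration, not as the authors' procedure. The function name and the toy labels and scores are hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: iterable of 0/1 ground-truth values (1 = interacting pair).
    scores: classifier scores, higher meaning more likely positive.
    Tied scores are not treated specially in this simplified sketch.
    """
    labels = list(labels)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            area += true_positives / rank  # precision at each new recall level
    return area / n_pos

# Hypothetical toy example: positives ranked 1st and 3rd out of four predictions.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Unlike the area under the ROC curve, this quantity depends on the ratio of positive to negative samples, so AUPRC values are only comparable across models evaluated on the same dataset.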
Performance comparison of our doc2vec + RF model with Alguwzizani et al.’s and Barman et al.’s methods using Barman et al.’s dataset.
| Method | SN (%) | SP (%) | ACC (%) | PPV (%) | NPV (%) | MCC | AUC | F1 (%) |
|---|---|---|---|---|---|---|---|---|
| Our model | 81.85 | 76.45 | 79.17 | 77.83 | 80.67 | 0.584 | 0.871 | 79.79 |
| Alguwzizani et al.’s SVM | 73.72 | 83.48 | 78.60 | 81.69 | 76.06 | 0.575 | 0.847 | 77.50 |
| Barman et al.’s SVM | 67.00 | 74.00 | 71.00 | 72.00 | NA | 0.440 | 0.730 | 69.41 |
| Barman et al.’s RF | 55.66 | 89.08 | 72.41 | 82.26 | NA | 0.480 | 0.760 | 66.39 |
The performance was assessed through 5-fold cross-validation.
The values for Alguwzizani et al.’s SVM were retrieved from [54].
The values for Barman et al.’s SVM and RF were retrieved from [53].
NA means the corresponding parameter is not available. SN: Sensitivity; SP: Specificity; ACC: Accuracy; PPV: Positive Predictive Value (PPV = Precision); NPV: Negative Predictive Value (NPV = TN/(TN + FN)); MCC: Matthews Correlation Coefficient; AUC: the area under the ROC curve; F1 = 2 × (Precision × Recall)/(Precision + Recall).
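The measures defined in the table footnote can all be derived from the four confusion-matrix counts. The sketch below uses the standard definitions together with the NPV and F1 formulas given above. The function names and the toy counts are hypothetical, and the code assumes every denominator is non-zero.

```python
import math

def f1_score(precision, recall):
    """F1 = 2 x (Precision x Recall) / (Precision + Recall)."""
    return 2 * precision * recall / (precision + recall)

def binary_metrics(tp, fp, tn, fn):
    """Return the table's measures as fractions (multiply by 100 for %).

    Assumes all denominators are non-zero (both classes present and
    both classes predicted at least once).
    """
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    ppv = tp / (tp + fp)                   # positive predictive value (precision)
    npv = tn / (tn + fn)                   # negative predictive value
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                      # Matthews correlation coefficient
    return {"SN": sn, "SP": sp, "ACC": acc, "PPV": ppv,
            "NPV": npv, "MCC": mcc, "F1": f1_score(ppv, sn)}

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fp=3, tn=7, fn=2)
```

As a consistency check, inserting the SN (81.85%) and PPV (77.83%) reported for our model into the F1 formula gives approximately 79.79%, in agreement with the F1 column of the table.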
Performance comparison of our doc2vec + RF model with DeNovo and Alguwzizani et al.’s method using the test set of DeNovo.
| Method | SN (%) | SP (%) | ACC (%) | PPV (%) | NPV (%) | MCC | AUC | F1 (%) |
|---|---|---|---|---|---|---|---|---|
| Our model | 90.33 | 96.17 | 93.23 | 95.99 | 90.74 | 0.866 | 0.981 | 93.07 |
| Alguwzizani et al.’s SVM | 86.35 | 86.59 | 86.47 | 86.56 | 86.39 | 0.729 | 0.926 | NA |
| DeNovo | 80.71 | 83.06 | 81.90 | NA | NA | NA | NA | NA |
The values for Alguwzizani et al.’s SVM were retrieved from [54].
The values for DeNovo were retrieved from [43].
NA means the corresponding parameter is not available. SN: Sensitivity; SP: Specificity; ACC: Accuracy; PPV: Positive Predictive Value (PPV = Precision); NPV: Negative Predictive Value (NPV = TN/(TN + FN)); MCC: Matthews Correlation Coefficient; AUC: the area under the ROC curve; F1 = 2 × (Precision × Recall)/(Precision + Recall).