Sho Tsukiyama, Hiroyuki Kurata.
Abstract
Viral infections represent a major health concern worldwide. The alarming rate at which SARS-CoV-2 spreads, for example, led to a worldwide pandemic. Viruses incorporate genetic material into the host genome to hijack host cell functions such as the cell cycle and apoptosis. In these viral processes, protein-protein interactions (PPIs) play critical roles. Therefore, the identification of PPIs between humans and viruses is crucial for understanding the infection mechanism and host immune responses to viral infections and for discovering effective drugs. Experimental methods including mass spectrometry-based proteomics and yeast two-hybrid assays are widely used to identify human-virus PPIs, but these methods are time-consuming, expensive, and laborious. To overcome this problem, we developed a novel computational predictor, named cross-attention PHV, that combines two key technologies: a cross-attention mechanism and a one-dimensional convolutional neural network (1D-CNN). The cross-attention mechanisms were very effective in enhancing prediction and generalization abilities. Application of 1D-CNN to the word2vec-generated feature matrices reduced computational costs, thus extending the allowable length of protein sequences to 9000 amino acid residues. Cross-attention PHV outperformed existing state-of-the-art models on a benchmark dataset and accurately predicted PPIs for unknown viruses. Cross-attention PHV also predicted human-SARS-CoV-2 PPIs with area under the curve values >0.95. The Cross-attention PHV web server and source code are freely available at https://kurata35.bio.kyutech.ac.jp/Cross-attention_PHV/ and https://github.com/kuratahiroyuki/Cross-Attention_PHV, respectively.
Keywords: 1D-CNN, One-dimensional-CNN; AC, Accuracy; AUC, Area under the curve; CNN, Convolutional neural network; Convolutional neural network; DT, Decision tree; F1, F1-score; HV-PPIs, Human-virus PPIs; HuV-PPI, Human–unknown virus PPI; Human; LR, Linear regression; MCC, Matthews correlation coefficient; PPIs, Protein-protein interactions; Protein–protein interaction; RF, Random forest; SARS-CoV-2; SARS-CoV-2, Severe acute respiratory syndrome coronavirus 2; SN, Sensitivity; SP, Specificity; SVM, Support vector machine; T-SNE, T-distributed stochastic neighbor embedding; Virus; W2V, Word2vec; Word2vec
Year: 2022 PMID: 36249566 PMCID: PMC9546503 DOI: 10.1016/j.csbj.2022.10.012
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Statistical features of the HuV-PPI dataset.
| Name | Virus in training data | Virus in test data | Training samples | Test samples |
|---|---|---|---|---|
| H1N1 | Viruses other than H1N1 | H1N1 | 64,934 | 18,136 |
| H3N2 | Viruses other than H3N2 | H3N2 | 81,554 | 1,516 |
| H5N1 | Viruses other than H5N1 | H5N1 | 81,868 | 1,170 |
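The table describes a hold-out scheme in which all pairs involving one virus are reserved for testing and pairs from every other virus are used for training. A minimal sketch of that split follows; the record layout `(human_id, virus_protein_id, virus_name, label)` and the function name are assumptions made for illustration, not the authors' data format.

```python
def split_by_held_out_virus(samples, held_out):
    """Send every pair involving `held_out` to the test set, the rest to training.

    Each sample is a hypothetical tuple:
    (human_id, virus_protein_id, virus_name, label).
    """
    train = [s for s in samples if s[2] != held_out]
    test = [s for s in samples if s[2] == held_out]
    return train, test


# Toy records; all identifiers and labels are made up for illustration.
samples = [
    ("humanA", "virusP1", "H1N1", 1),
    ("humanB", "virusP2", "H3N2", 0),
    ("humanC", "virusP3", "H5N1", 1),
    ("humanD", "virusP4", "H1N1", 0),
]
train, test = split_by_held_out_virus(samples, "H1N1")
```

With this split, the model never sees a pair from the held-out virus during training, which is what makes that virus "unknown" at test time.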
Statistical features of the SARS-CoV-2-PPI dataset.
| Dataset | All samples | Positive samples | Negative samples | Human proteins | Virus proteins |
|---|---|---|---|---|---|
| Balanced | 28,436 | 14,218 | 14,218 | 14,426 | 14 |
| Imbalanced | 85,308 | 14,218 | 71,090 | 20,192 | 14 |
Fig. 1. Workflow of word2vec-based encoding. (A) Amino acid sequences were converted into arrangements of consecutive 4-mers. (B) Amino acid sequences in the UniProtKB/Swiss-Prot database were converted as representations of 4-mers and used for training the word2vec model. (C) Each 4-mer in the amino acid sequence was converted into a feature vector using the trained word2vec model. The resultant feature vectors were concatenated into a feature matrix.
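The encoding in Fig. 1 can be sketched in a few lines. This is an illustration under stated assumptions: the 4-mers are taken as overlapping windows with a stride of one residue, and a plain dictionary (`toy_embedding`) stands in for the trained word2vec model. The function names and the two-dimensional toy vectors are hypothetical.

```python
def to_kmers(sequence, k=4):
    """Split a sequence into overlapping k-mers (stride of 1 is an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def encode(sequence, embedding, k=4):
    """Look up each k-mer's vector and stack the vectors as rows of a matrix."""
    return [embedding[kmer] for kmer in to_kmers(sequence, k)]


sequence = "MKTAYIA"  # toy sequence, not taken from the paper
kmers = to_kmers(sequence)

# Stand-in for a trained word2vec model: one made-up 2-d vector per 4-mer.
toy_embedding = {kmer: [float(i), 0.5 * i] for i, kmer in enumerate(kmers)}
feature_matrix = encode(sequence, toy_embedding)
```

A sequence of length n yields n - 3 rows for 4-mers, so the feature matrix grows linearly with sequence length.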
Fig. 2. Structure of cross-attention PHV. Cross-attention PHV is composed of three sub-networks. The word2vec (W2V)-based feature matrices of humans and viruses were input into the convolutional embedding module. To extract interaction features between two protein sequences, multi-head attention layers were employed in the cross-attention module. Finally, the feature vectors generated by the global max-pooling layer were concatenated to compute a final score through three linear layers.
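The idea behind cross-attention is that queries come from one protein while keys and values come from the other. The sketch below shows a single-head, scaled dot-product version with no learned projection weights. It is a simplification of the multi-head layers named in the caption, not the authors' implementation, and the toy matrices are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query row attends over the rows of the *other* sequence.

    Simplification: keys and values are the same matrix and there are
    no learned projections or multiple heads.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                         for j in range(dim)])
    return attended


# Toy hidden feature matrices (rows = positions, columns = features).
human = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
virus = [[1.0, 0.0], [0.0, 1.0]]

human_given_virus = cross_attention(human, virus)  # human queries, virus keys/values
virus_given_human = cross_attention(virus, human)  # virus queries, human keys/values
```

The output has one row per query position, so each protein keeps its own length while its features are rewritten in terms of the partner sequence.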
Fig. 3. Prediction performance of word2vec-based cross-attention PHV with respect to k-mer value. The models were evaluated via 5-fold cross-validation on Denovo's training dataset.
Fig. 4. Comparison of performance between the word2vec-based and binary encodings in cross-attention PHV. Models trained via 5-fold cross-validation were evaluated with Denovo's test dataset.
Fig. 5. Comparison of performance between cross-attention–based and self-attention–based neural networks. Models trained via 5-fold cross-validation were evaluated with Denovo's test dataset.
Comparison of the performance of cross-attention PHV with existing state-of-the-art models on Denovo's test dataset. Data regarding the performance of existing models were obtained from the respective papers. Bold values indicate the highest value for each measurement.
| Model | SN | SP | AC | MCC | AUC | F1 |
|---|---|---|---|---|---|---|
| Denovo [2015] | 0.807 | 0.831 | 0.819 | NA | NA | NA |
| Zhou et al.'s model [2018] | 0.800 | 0.889 | 0.845 | 0.692 | 0.897 | NA |
| Alguwaizani et al.'s model [2018] | 0.864 | 0.866 | 0.865 | 0.729 | 0.926 | NA |
| Yang et al.'s model (Doc2vec + RF) [2020] | 0.903 | 0.962 | 0.932 | 0.866 | 0.981 | 0.931 |
| DeepViral (seq) [2021] | 0.894 | 0.969 | 0.931 | 0.865 | 0.960 | 0.929 |
| DeepViral (joint) [2021] | 0.903 | | 0.939 | 0.881 | 0.976 | 0.937 |
| Yang et al.'s model (CNN) [2021] | 0.908 | 0.974 | 0.941 | NA | NA | 0.939 |
| Cross-attention PHV | | 0.967 | | | | |
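The threshold-based measures in the table (SN, SP, AC, MCC, F1) all derive from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name and the example counts are made up for illustration and are not results from the paper. It assumes every denominator is non-zero. AUC is omitted because it needs ranked scores rather than counts.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    ac = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * sn / (precision + sn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"SN": sn, "SP": sp, "AC": ac, "MCC": mcc, "F1": f1}


# Hypothetical counts on a balanced test set of 200 pairs.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

On a balanced test set, AC equals the mean of SN and SP, which gives a quick consistency check on any row of reported values.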
Fig. 6. Comparison of the performance of cross-attention PHV and LSTM-PHV in predicting PPIs for unknown viruses. (A) Performance on the H1N1 dataset, which regards H1N1 as an unknown virus. (B) Performance on the H3N2 dataset, which regards H3N2 as an unknown virus. (C) Performance on the H5N1 dataset, which regards H5N1 as an unknown virus.
Fig. 7. Comparison of the performance of cross-attention PHV with LSTM-PHV in predicting human–SARS-CoV-2 PPIs. (A) Performance on a balanced dataset (positive:negative = 1:1). (B) Performance on an imbalanced dataset (positive:negative = 1:5).
Fig. 8. t-SNE–based visualization of features generated during prediction of PPIs using the HuV-PPI test datasets. The word2vec-based feature matrices, hidden feature matrices, and feature vectors were retrieved from the neural networks. The feature matrices were transformed into vectors by sampling the maximum values of each feature. The human and virus feature vectors were then concatenated. The t-SNE maps for the H1N1, H3N2, and H5N1 datasets are shown at the left, center, and right, respectively. Blue, yellow, green, and red marks indicate false-positive, false-negative, true-negative, and true-positive samples, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
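The reduction described in the Fig. 8 caption, taking the maximum of each feature and then concatenating the human and virus vectors, can be sketched as follows. The sketch assumes rows are sequence positions and columns are features. The function names and toy matrices are hypothetical, and the t-SNE projection itself is left out.

```python
def max_pool_columns(matrix):
    """Collapse a (positions x features) matrix to one vector of per-feature maxima."""
    return [max(column) for column in zip(*matrix)]


def pair_vector(human_matrix, virus_matrix):
    """Concatenate the pooled human vector and the pooled virus vector."""
    return max_pool_columns(human_matrix) + max_pool_columns(virus_matrix)


# Toy feature matrices; the two proteins may differ in length.
human_matrix = [[1, 5], [3, 2]]
virus_matrix = [[0, 4], [7, 1], [2, 2]]
vector = pair_vector(human_matrix, virus_matrix)
```

Pooling over positions gives every protein pair a vector of the same length regardless of sequence length, which is what a method such as t-SNE needs as input.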
Fig. 9. t-SNE–visualized map of the respective human and virus feature vectors on the HuV-PPI datasets.