Li-Na Jia, Xin Yan, Zhu-Hong You, Xi Zhou, Li-Ping Li, Lei Wang, Ke-Jian Song.
Abstract
The study of self-interacting proteins (SIPs) not only reveals the function of proteins at the molecular level but is also crucial for understanding processes such as growth, development, differentiation, and apoptosis, providing an important theoretical basis for exploring the mechanisms of major diseases. With rapid advances in biotechnology, a large number of SIPs have been discovered. However, because biological experiments are slow and costly, the gap between the identification of SIPs and the accumulation of data is growing. Fast and accurate computational methods are therefore needed to predict SIPs. In this study, we designed a new method, NLPEI, for predicting SIPs based on natural language understanding theory and evolutionary information. Specifically, we first treat the protein sequence as natural language and use natural language processing algorithms to extract its features. Then, we use the Position-Specific Scoring Matrix (PSSM) to represent the evolutionary information of the protein and extract its features with the Stacked Auto-Encoder (SAE) deep learning algorithm. Finally, we fuse the natural language features of the proteins with the evolutionary features and make predictions with an Extreme Learning Machine (ELM) classifier. On the human and yeast SIP gold-standard data sets, NLPEI achieved prediction accuracies of 94.19% and 91.29%, respectively. Compared with different classifier models, different feature models, and other existing methods, NLPEI obtained the best results. These experimental results indicate that NLPEI is an effective tool for predicting SIPs and can provide reliable candidates for biological experiments.
Keywords: Self-interacting protein; evolutionary information; natural language processing; stacked auto-encoder
Year: 2020 PMID: 33488064 PMCID: PMC7768313 DOI: 10.1177/1176934320984171
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
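The abstract says the protein sequence is treated as natural language but does not state how the sequence is split into "words". A minimal sketch of one common choice, overlapping k-mers, is shown below. The k-mer tokenization, the value `k = 3`, the example sequence, and the function names `kmer_words` and `bag_of_kmers` are all assumptions for illustration, not the tokenization used by NLPEI.

```python
from collections import Counter


def kmer_words(sequence, k=3):
    """Split a protein sequence into overlapping k-mer "words".

    Assumption: k-mers stand in for words; the abstract does not
    specify the tokenization NLPEI actually uses.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kmers(sequence, k=3):
    """Count how often each k-mer occurs (a simple bag-of-words view)."""
    return Counter(kmer_words(sequence, k))


# Hypothetical 10-residue sequence, used only to show the shape of the output.
words = kmer_words("MKVLAAGIVK", k=3)
counts = bag_of_kmers("MKVLAAGIVK", k=3)
print(words)
```

A sequence of length n yields n - k + 1 overlapping words, so the 10-residue example gives 8 three-letter words. Counts like these could then feed any text-style feature extractor.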
Figure 1. The flowchart of the NLPEI model.
Figure 2. Structure of the auto-encoder.
Figure 3. Structure of the stacked auto-encoders.
Five-fold cross-validation results of NLPEI on the human data set.
| Metric | First fold (%) | Second fold (%) | Third fold (%) | Fourth fold (%) | Fifth fold (%) | Average (%) |
|---|---|---|---|---|---|---|
| Acc. | 94.25 | 94.02 | 94.42 | 94.45 | 93.82 | 94.19 ± 0.27 |
| Spe. | 100.00 | 99.97 | 97.20 | 99.94 | 100.00 | 99.42 ± 1.24 |
| NPV | 94.11 | 93.90 | 96.78 | 94.34 | 93.66 | 94.56 ± 1.27 |
| AUC | 63.22 | 65.39 | 74.63 | 71.51 | 62.55 | 67.46 ± 5.34 |
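The Average column can be reproduced from the per-fold values. The short check below, using only the accuracy row of the human table, suggests that the figure after "±" is the sample standard deviation (divisor n - 1); this is inferred from the numbers rather than stated in the source.

```python
from statistics import mean, stdev

# Per-fold accuracy (%) of NLPEI on the human data set, copied from the table.
acc_folds = [94.25, 94.02, 94.42, 94.45, 93.82]

avg = mean(acc_folds)
spread = stdev(acc_folds)  # sample standard deviation (divides by n - 1)

print(f"{avg:.2f} ± {spread:.2f}")  # prints "94.19 ± 0.27"
```

Using the population standard deviation (`statistics.pstdev`) on the same values gives 0.24, which does not match the table.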
Five-fold cross-validation results of NLPEI on the yeast data set.
| Metric | First fold (%) | Second fold (%) | Third fold (%) | Fourth fold (%) | Fifth fold (%) | Average (%) |
|---|---|---|---|---|---|---|
| Acc. | 91.24 | 91.64 | 91.48 | 91.16 | 90.92 | 91.29 ± 0.28 |
| Spe. | 99.64 | 99.44 | 98.12 | 99.02 | 99.73 | 99.19 ± 0.66 |
| NPV | 91.23 | 91.64 | 92.81 | 91.80 | 90.86 | 91.67 ± 0.74 |
| AUC | 68.44 | 75.27 | 65.15 | 64.39 | 59.49 | 66.55 ± 5.83 |
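The tables report accuracy (Acc.), specificity (Spe.), and negative predictive value (NPV). As a reminder of how these follow from a confusion matrix, here is a minimal sketch using the standard definitions. The function name `confusion_metrics` and the counts in the example are hypothetical and are not taken from the study.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return accuracy, specificity, and NPV as fractions in [0, 1]."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions over all samples
        "specificity": tn / (tn + fp),   # true negatives over actual negatives
        "npv": tn / (tn + fn),           # true negatives over predicted negatives
    }


# Hypothetical counts for an imbalanced test fold (100 positives, 900 negatives).
m = confusion_metrics(tp=20, tn=890, fp=10, fn=80)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these made-up counts, accuracy and specificity are both high even though most positives are missed, which is why such metrics are usually read together with a ranking measure such as AUC.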
Figure 4. ROC curves of the five-fold cross-validation performed by NLPEI on the human data set.
Figure 5. ROC curves of the five-fold cross-validation performed by NLPEI on the yeast data set.
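The area under a ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so an AUC of 50% corresponds to chance-level ranking. The sketch below computes AUC this way by direct pairwise comparison. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; the source does not say how its AUC values were computed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(P * N) time, which is
    fine for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: one of the four positive/negative pairs is ranked the wrong way.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(f"AUC = {100 * auc:.2f}%")  # prints "AUC = 75.00%"
```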
Five-fold cross-validation results of the KNN and RF classifier models on the human data set.
| Model | Metric | First fold (%) | Second fold (%) | Third fold (%) | Fourth fold (%) | Fifth fold (%) | Average (%) |
|---|---|---|---|---|---|---|---|
| KNN | Acc. | 91.26 | 91.55 | 91.12 | 91.75 | 91.09 | 91.35 ± 0.29 |
| | Spe. | 98.87 | 99.00 | 98.96 | 99.28 | 99.12 | 99.05 ± 0.16 |
| | NPV | 92.18 | 92.36 | 91.95 | 92.33 | 91.78 | 92.12 ± 0.25 |
| | AUC | 56.57 | 51.70 | 50.28 | 49.65 | 54.25 | 52.49 ± 2.89 |
| RF | Acc. | 89.65 | 89.91 | 89.34 | 89.62 | 89.40 | 89.58 ± 0.23 |
| | Spe. | 97.09 | 96.90 | 96.67 | 96.50 | 96.98 | 96.83 ± 0.24 |
| | NPV | 92.07 | 92.48 | 92.08 | 92.54 | 91.86 | 92.21 ± 0.29 |
| | AUC | 56.27 | 52.24 | 51.67 | 50.98 | 53.76 | 52.99 ± 2.10 |
Figure 6. Comparison of different classifier models on the human data set.
Five-fold cross-validation results of the KNN and RF classifier models on the yeast data set.
| Model | Metric | First fold (%) | Second fold (%) | Third fold (%) | Fourth fold (%) | Fifth fold (%) | Average (%) |
|---|---|---|---|---|---|---|---|
| KNN | Acc. | 88.02 | 88.26 | 87.06 | 86.74 | 87.79 | 87.57 ± 0.65 |
| | Spe. | 97.48 | 98.19 | 97.28 | 98.07 | 97.47 | 97.70 ± 0.41 |
| | NPV | 89.93 | 89.60 | 89.13 | 88.10 | 89.68 | 89.29 ± 0.73 |
| | AUC | 63.71 | 49.41 | 50.08 | 54.83 | 57.48 | 55.10 ± 5.86 |
| RF | Acc. | 85.85 | 85.77 | 86.09 | 84.24 | 85.22 | 85.44 ± 0.74 |
| | Spe. | 94.41 | 94.94 | 95.65 | 94.66 | 94.57 | 94.85 ± 0.49 |
| | NPV | 90.18 | 89.67 | 89.42 | 88.17 | 89.39 | 89.37 ± 0.74 |
| | AUC | 62.88 | 50.20 | 50.36 | 55.05 | 57.10 | 55.12 ± 5.27 |
Figure 7. Comparison of different classifier models on the yeast data set.
Five-fold cross-validation results of the AC, DCT, and NL feature descriptor models on the human data set.
| Model | Metric | First fold (%) | Second fold (%) | Third fold (%) | Fourth fold (%) | Fifth fold (%) | Average (%) |
|---|---|---|---|---|---|---|---|
| AC | Acc. | 91.86 | 91.22 | 92.17 | 91.68 | 92.12 | 91.81 ± 0.38 |
| | Spe. | 99.94 | 99.94 | 99.91 | 99.84 | 99.97 | 99.92 ± 0.05 |
| | NPV | 91.89 | 91.26 | 92.25 | 91.80 | 92.13 | 91.87 ± 0.38 |
| | AUC | 52.02 | 51.05 | 46.55 | 47.87 | 48.93 | 49.28 ± 2.25 |
| DCT | Acc. | 90.19 | 91.31 | 91.02 | 91.40 | 91.32 | 91.05 ± 0.50 |
| | Spe. | 98.70 | 99.02 | 98.78 | 98.14 | 98.75 | 98.68 ± 0.33 |
| | NPV | 91.22 | 92.07 | 92.02 | 92.97 | 92.33 | 92.12 ± 0.63 |
| | AUC | 49.69 | 52.79 | 50.51 | 49.88 | 49.78 | 50.53 ± 1.30 |
| NL | Acc. | 91.02 | 91.51 | 92.43 | 91.71 | 91.72 | 91.68 ± 0.51 |
| | Spe. | 99.97 | 99.94 | 100.00 | 99.97 | 99.97 | 99.97 ± 0.02 |
| | NPV | 91.05 | 91.56 | 92.43 | 91.74 | 91.75 | 91.71 ± 0.50 |
| | AUC | 48.13 | 52.56 | 51.69 | 49.42 | 48.88 | 50.14 ± 1.90 |
Five-fold cross-validation results of the AC, DCT, and NL feature descriptor models on the yeast data set.
| Model | Metric | First fold (%) | Second fold (%) | Third fold (%) | Fourth fold (%) | Fifth fold (%) | Average (%) |
|---|---|---|---|---|---|---|---|
| AC | Acc. | 87.06 | 88.50 | 87.70 | 87.78 | 87.15 | 87.64 ± 0.58 |
| | Spe. | 98.17 | 98.20 | 98.27 | 97.48 | 97.45 | 97.91 ± 0.41 |
| | NPV | 88.40 | 89.86 | 88.94 | 89.72 | 89.03 | 89.19 ± 0.60 |
| | AUC | 51.67 | 54.92 | 54.69 | 51.27 | 53.59 | 53.23 ± 1.69 |
| DCT | Acc. | 88.91 | 88.59 | 88.67 | 87.62 | 87.79 | 88.31 ± 0.57 |
| | Spe. | 99.19 | 99.46 | 99.19 | 99.09 | 99.45 | 99.27 ± 0.17 |
| | NPV | 89.52 | 88.98 | 89.27 | 88.26 | 88.19 | 88.84 ± 0.60 |
| | AUC | 53.34 | 52.17 | 52.21 | 55.97 | 52.33 | 53.20 ± 1.60 |
| NL | Acc. | 87.62 | 88.18 | 88.99 | 89.07 | 88.19 | 88.41 ± 0.61 |
| | Spe. | 99.63 | 99.82 | 99.91 | 99.91 | 99.73 | 99.80 ± 0.12 |
| | NPV | 87.90 | 88.33 | 89.06 | 89.14 | 88.41 | 88.57 ± 0.52 |
| | AUC | 55.35 | 50.19 | 55.54 | 52.49 | 51.45 | 53.00 ± 2.37 |
Figure 8. Comparison of different feature descriptor models on the human data set.
Figure 9. Comparison of different feature descriptor models on the yeast data set.
Comparison of accuracy between NLPEI and other existing methods.
| Data Set | NLPEI (%) | SPAR (%) | PSPEL (%) | SLIPPER (%) | LocFuse (%) | PPIevo (%) |
|---|---|---|---|---|---|---|
| human | 94.19 | 92.09 | 91.30 | 91.10 | 80.66 | 78.04 |
| yeast | 91.29 | 76.96 | 86.86 | 71.90 | 66.66 | 66.28 |
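To make the size of the improvement concrete, the snippet below computes NLPEI's accuracy margin over the strongest competing method in each row of the comparison table. The numbers are copied from the table; the helper name `margin_over_best_baseline` is hypothetical.

```python
# Accuracy (%) per method, copied from the comparison table.
accuracy = {
    "human": {"NLPEI": 94.19, "SPAR": 92.09, "PSPEL": 91.30,
              "SLIPPER": 91.10, "LocFuse": 80.66, "PPIevo": 78.04},
    "yeast": {"NLPEI": 91.29, "SPAR": 76.96, "PSPEL": 86.86,
              "SLIPPER": 71.90, "LocFuse": 66.66, "PPIevo": 66.28},
}


def margin_over_best_baseline(row, method="NLPEI"):
    """Return (best other method, margin in percentage points)."""
    others = [(name, acc) for name, acc in row.items() if name != method]
    best_name, best_acc = max(others, key=lambda pair: pair[1])
    return best_name, round(row[method] - best_acc, 2)


for data_set, row in accuracy.items():
    name, margin = margin_over_best_baseline(row)
    print(f"{data_set}: +{margin:.2f} points over {name}")
```

On these figures the margin is 2.10 percentage points over SPAR on the human data set and 4.43 points over PSPEL on the yeast data set.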
Performance of NLPEI on independent data sets.
| Data Set | Acc. (%) | Spe. (%) | NPV (%) | AUC (%) |
|---|---|---|---|---|
| human | 90.73 | 99.65 | 91.02 | 47.96 |
| yeast | 88.18 | 99.82 | 88.33 | 50.24 |