Li-Ping Li, Bo Zhang, Li Cheng.
Abstract
Identifying and characterizing plant protein-protein interactions (PPIs) is critical for elucidating protein functions and molecular mechanisms in plant cells. Although experimentally validated PPI data are increasingly available for diverse plant species, high-throughput experimental techniques are usually expensive and labor-intensive. As plant PPI data accumulate in public databases, computational approaches that facilitate the identification of possible PPIs are becoming increasingly important. In this article, we propose an effective framework, CPIELA, for predicting plant PPIs by combining the position-specific scoring matrix (PSSM), the local optimal-oriented pattern (LOOP), and an ensemble rotation forest (ROF) model. Specifically, each plant protein sequence is first transformed into a PSSM, which preserves the protein's evolutionary information. Then, the local textural descriptor LOOP is employed to extract texture variation features from the PSSM. Finally, the ROF classifier is adopted to infer potential plant PPIs. The performance of CPIELA is evaluated via cross-validation on three plant PPI datasets: Arabidopsis thaliana, Zea mays, and Oryza sativa. The experimental results show that CPIELA achieved high average prediction accuracies of 98.63%, 98.09%, and 94.02%, respectively. To further verify its performance, we also compared CPIELA with other state-of-the-art methods on three gold standard datasets. The experimental results illustrate that CPIELA is efficient and reliable for predicting plant PPIs. It is anticipated that CPIELA could become a useful tool for facilitating the identification of possible plant PPIs.
Keywords: evolutionary information; machine learning; plant; protein–protein interactions; sequence
Year: 2022 PMID: 35360876 PMCID: PMC8963800 DOI: 10.3389/fgene.2022.857839
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. The flowchart of the proposed CPIELA method.
The fivefold cross-validation results achieved on the A. thaliana dataset using the proposed CPIELA method.
| Testing set | Accu. (%) | Sen. (%) | Prec. (%) | Spec. (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| 1 | 98.43 | 97.23 | 99.56 | 99.58 | 96.90 | 0.9957 |
| 2 | 98.78 | 97.99 | 99.61 | 99.60 | 97.59 | 0.9961 |
| 3 | 98.39 | 97.04 | 99.76 | 99.77 | 96.83 | 0.9936 |
| 4 | 98.89 | 97.98 | 99.76 | 99.77 | 97.80 | 0.9957 |
| 5 | 98.67 | 97.58 | 99.76 | 99.77 | 97.37 | 0.9956 |
| Average | 98.63 | | | | | |
Bold values in these tables indicate the highest value in each column.
The fivefold cross-validation results achieved on the Zea mays dataset using the proposed CPIELA method.
| Testing set | Accu. (%) | Sen. (%) | Prec. (%) | Spec. (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| 1 | 97.82 | 96.59 | 99.07 | 99.08 | 95.74 | 0.9914 |
| 2 | 98.28 | 97.34 | 99.22 | 99.22 | 96.62 | 0.9920 |
| 3 | 97.98 | 97.05 | 98.89 | 98.91 | 96.04 | 0.9902 |
| 4 | 98.00 | 97.00 | 98.91 | 98.96 | 96.07 | 0.9893 |
| 5 | 98.37 | 97.65 | 99.07 | 99.09 | 96.79 | 0.9931 |
| Average | 98.09 | | | | | |
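Fivefold cross-validation splits the samples into five disjoint test sets and trains on the remaining four each time. The sketch below shows one way to build such folds while keeping the class ratio equal in every fold. Stratification, the fixed seed, and the function name `stratified_kfold` are assumptions made for illustration; the section does not say how the authors assigned samples to folds.

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal samples of this class round-robin
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Hypothetical balanced toy set: 1 = interacting pair, 0 = non-interacting pair
labels = [1] * 50 + [0] * 50
splits = list(stratified_kfold(labels, k=5))
for fold, (train, test) in enumerate(splits, start=1):
    print(fold, len(train), len(test))
```

Each sample appears in exactly one test fold, so the five "Testing set" rows in a table of this kind together cover the whole dataset once.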
The fivefold cross-validation results achieved on the Oryza sativa dataset using the proposed CPIELA method.
| Testing set | Accu. (%) | Sen. (%) | Prec. (%) | Spec. (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| 1 | 93.70 | 93.74 | 93.45 | 93.65 | 88.19 | 0.9558 |
| 2 | 93.59 | 92.17 | 95.17 | 95.09 | 88.00 | 0.9516 |
| 3 | 93.33 | 93.54 | 93.15 | 93.13 | 87.56 | 0.9520 |
| 4 | 96.56 | 95.21 | 97.86 | 97.91 | 93.36 | 0.9826 |
| 5 | 92.92 | 93.49 | 92.32 | 92.36 | 86.84 | 0.9484 |
| Average | 94.02 | | | | | |
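The per-fold accuracies listed in the three cross-validation tables can be averaged to cross-check the figures quoted in the abstract. The sketch below copies the accuracy column of each table; the use of the sample standard deviation is an assumption, since the section does not say which estimator the authors used.

```python
from statistics import mean, stdev

# Accuracy (%) per test fold, copied from the three cross-validation tables
fold_accuracy = {
    "A. thaliana": [98.43, 98.78, 98.39, 98.89, 98.67],
    "Zea mays": [97.82, 98.28, 97.98, 98.00, 98.37],
    "Oryza sativa": [93.70, 93.59, 93.33, 96.56, 92.92],
}

# Mean and sample standard deviation across the five folds
summary = {
    name: (round(mean(values), 2), round(stdev(values), 2))
    for name, values in fold_accuracy.items()
}
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.2f} ± {sd:.2f}")
```

The recomputed means are 98.63, 98.09, and 94.02, which match the average accuracies stated in the abstract.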
FIGURE 2. The predictive performance of the proposed CPIELA method via fivefold cross-validation. (A–C) The receiver operating characteristic (ROC) curves for the Arabidopsis thaliana, Zea mays, and Oryza sativa datasets. (D) The ROC curves obtained by the CPIELA method on the three plant PPI datasets.
The fivefold cross-validation results achieved by different classifiers on the three plant datasets.
| Dataset | Classifier | Acc. (%) | Sen. (%) | Prec. (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| A. thaliana | SVM | 89.37 ± 0.25 | 83.95 ± 0.51 | 94.16 ± 0.41 | 80.89 ± 0.39 | 0.9495 ± 0.0038 |
| | RF | 97.21 ± 0.12 | 96.15 ± 0.19 | 98.22 ± 0.33 | 94.58 ± 0.22 | 0.9720 ± 0.0011 |
| | Our method | 98.63 | | | | |
| Zea mays | SVM | 84.46 ± 0.20 | 77.55 ± 0.94 | 89.98 ± 0.47 | 73.50 ± 0.34 | 0.9179 ± 0.0048 |
| | RF | 94.65 ± 0.60 | 94.28 ± 0.66 | 94.98 ± 0.81 | 89.87 ± 1.07 | 0.9472 ± 0.0060 |
| | Our method | 98.09 | | | | |
| Oryza sativa | SVM | 88.95 ± 1.44 | 83.23 ± 2.52 | 94.00 ± 0.72 | 80.24 ± 2.28 | 0.9445 ± 0.0068 |
| | RF | 90.90 ± 1.30 | 90.45 ± 1.58 | 91.29 ± 2.10 | 83.47 ± 2.11 | 0.9113 ± 0.0122 |
| | Our method | 94.02 | | | | |
FIGURE 3. Comparison of the prediction performance of different classifiers, shown as ROC curves, in predicting plant protein–protein interactions. ROC curves for the (A) Arabidopsis thaliana, (B) Zea mays, and (C) Oryza sativa datasets using RF (blue line), ROF (green line), and SVM (red line). (D) ROC curves of different descriptors on the three plant PPI datasets.
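An ROC curve of this kind is traced by sweeping a threshold over the classifier scores, and the AUC column in the tables is the area under that curve. A minimal sketch using the trapezoidal rule is given below; the helper names `roc_curve_points` and `auc` and the four toy scores are illustrative assumptions, not the authors' code or data.

```python
def roc_curve_points(labels, scores):
    """Return ROC points (FPR, TPR) for thresholds swept from high to low scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        j = i
        # Handle tied scores together so a tie becomes one diagonal segment
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: 1 = interacting pair, 0 = non-interacting pair
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(roc_curve_points(labels, scores)))  # prints 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of interacting from non-interacting pairs.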
The fivefold cross-validation results achieved on the three plant PPI datasets with different descriptors using the proposed method.
| Dataset | Methods | Acc. (%) | Sen. (%) | Prec. (%) | Spec. (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|---|
| A. thaliana | LPQ + RoF | 73.17 ± 0.72 | 72.55 ± 0.86 | 73.46 ± 0.84 | 73.79 ± 0.64 | 60.74 ± 0.69 | 0.7873 ± 0.0090 |
| | LOOP + RoF | 98.63 | | | | | |
| Zea mays | LPQ + RoF | 94.17 ± 0.40 | 93.40 ± 0.64 | 94.86 ± 0.53 | 94.93 ± 0.50 | 89.02 ± 0.72 | 0.9639 ± 0.0031 |
| | LOOP + RoF | 98.09 | | | | | |
| Oryza sativa | LPQ + RoF | 91.89 ± 0.64 | 92.14 ± 1.57 | 91.70 ± 0.87 | 91.65 ± 1.01 | 85.09 ± 1.07 | 0.9474 ± 0.0041 |
| | LOOP + RoF | 94.02 | | | | | |
The predictive performance comparison of different methods on the Oryza sativa dataset.
| Methods | Accu. (%) | Sen. (%) | Prec. (%) | Spec. (%) | MCC (%) | AUC |
|---|---|---|---|---|---|---|
| | N/A | 89.28 ± 0.78 | 76.41 ± 1.55 | 72.44 ± 1.58 | 68.59 ± 1.17 | 0.8680 ± 0.8900 |
| | N/A | 88.00 ± 1.34 | 87.30 ± 1.35 | 87.22 ± 1.16 | 78.26 ± 1.28 | 0.9199 ± 0.5800 |
| | 82.60 ± 1.79 | 95.89 ± 0.91 | 75.79 ± 2.43 | 69.31 ± 3.53 | 67.65 ± 2.98 | 0.9440 ± 0.5800 |
| | 75.31 ± 1.37 | 93.34 ± 1.59 | 68.61 ± 1.03 | 57.23 ± 2.90 | 54.26 ± 2.81 | 0.8760 ± 0.0096 |
| | 81.54 ± 3.05 | 94.81 ± 0.65 | 75.10 ± 3.84 | 68.26 ± 6.61 | 65.50 ± 4.99 | 0.9309 ± 0.0052 |
| | 66.63 ± 4.48 | 88.42 ± 4.77 | 62.02 ± 4.91 | 45.02 ± 12.49 | 37.39 ± 5.39 | 0.7931 ± 0.0126 |
| | 80.95 ± 1.10 | 96.12 ± 1.15 | 73.70 ± 1.41 | 65.64 ± 2.40 | 64.99 ± 1.97 | 0.9360 ± 0.0017 |
| Our method | 94.02 | | | | | |
DHT: discrete Hilbert transform (Cizek, 1970); KNN: k-nearest neighbors; RF: random forest; FFT: fast Fourier transform; DWT: discrete wavelet transform; AC: auto covariance; DCT: discrete cosine transform.
Summary of plant PPIs and proteins in different species.
| Species name | Common name | Number of proteins | Number of PPIs |
|---|---|---|---|
| Arabidopsis thaliana | Thale cress | 7,437 | 56,220 |
| Zea mays | Maize | 4,841 | 28,460 |
| Oryza sativa | Rice | 1,834 | 9,600 |
FIGURE 4. The masks of Kirsch's edge detector, which are used to calculate responses in eight possible directions.
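The eight Kirsch masks are rotations of one 3×3 compass kernel, and LOOP uses the rank of their responses to decide which bit each neighbour contributes. The sketch below follows the commonly described form of LOOP and is not the authors' implementation. Ranking the signed responses, breaking ties by direction index, and summarising a matrix as a 256-bin histogram are assumptions, as are all function names.

```python
# Border cells of a 3x3 window in circular order, starting at the top-left corner
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def kirsch_masks():
    """Eight 3x3 Kirsch masks; mask d puts weight 5 on ring cells d-1, d, d+1."""
    masks = []
    for d in range(8):
        m = [[0] * 3 for _ in range(3)]
        for k, (r, c) in enumerate(RING):
            m[r][c] = 5 if (k - d) % 8 in (7, 0, 1) else -3
        masks.append(m)
    return masks

MASKS = kirsch_masks()

def loop_code(window):
    """LOOP code (0..255) of a 3x3 window given as nested lists."""
    center = window[1][1]
    responses = [sum(m[r][c] * window[r][c] for r in range(3) for c in range(3))
                 for m in MASKS]
    # Rank the directional responses: 0 for the weakest, 7 for the strongest
    order = sorted(range(8), key=lambda d: responses[d])
    rank = [0] * 8
    for position, d in enumerate(order):
        rank[d] = position
    # A neighbour at least as large as the centre sets the bit 2**rank
    return sum(2 ** rank[d] for d, (r, c) in enumerate(RING)
               if window[r][c] >= center)

def loop_histogram(matrix):
    """256-bin histogram of LOOP codes over all interior cells of a 2-D matrix."""
    hist = [0] * 256
    for i in range(1, len(matrix) - 1):
        for j in range(1, len(matrix[0]) - 1):
            window = [row[j - 1:j + 2] for row in matrix[i - 1:i + 2]]
            hist[loop_code(window)] += 1
    return hist

# A bright top row: only the three top neighbours exceed the centre
print(loop_code([[9, 9, 9], [0, 5, 0], [0, 0, 0]]))  # prints 224
```

Applied to a PSSM treated as a grey-level image, `loop_histogram` would give a fixed-length feature vector regardless of sequence length, which is what a classifier such as rotation forest needs as input.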