Renxiang Yan1, Xiaofeng Wang2, Lanqing Huang1, Feidi Yan1, Xiaoyu Xue1, Weiwen Cai1.
Abstract
Protein three-dimensional (3D) structures provide insightful information in many fields of biology. One-dimensional properties derived from 3D structures, such as secondary structure, residue solvent accessibility, residue depth and backbone torsion angles, are helpful for protein function prediction, fold recognition and ab initio folding. Here, we predict various structural features with the assistance of neural network learning. On an independent test dataset, protein secondary structure prediction achieves an overall Q3 accuracy of ~80%. Among the predicted real-valued properties, relative solvent accessibility has the highest mean absolute error (0.164) and residue depth has the lowest (0.062). We further improve outer membrane protein identification by including the predicted structural features in the scoring function of a simple profile-to-profile alignment. The results demonstrate that the accuracy of outer membrane protein identification can be improved by ~3% at a 1% false positive level when structural features are incorporated. Finally, our methods are available as two convenient and easy-to-use programs: PSSM-2-Features, for predicting secondary structure, relative solvent accessibility, residue depth and backbone torsion angles, and PPA-OMP, for identifying outer membrane proteins from proteomes.
Year: 2015 PMID: 26104144 PMCID: PMC4478468 DOI: 10.1038/srep11586
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. A pipeline of our methods.
The pipeline consists of three modules: prediction of structural features, identification of OMPs, and modeling of 3D structures for potential OMPs. First, a target protein is iteratively threaded through the local NCBI NR database for three iterations to generate sequence profiles. The profiles are then fed into the trained neural networks to predict structural features. Second, the target protein is searched against an OMP sequence database using PPA-OMP with a scoring function that incorporates sequence profiles and predicted structural features. Whether the target protein is an OMP is judged by the significance of the top alignment. Third, if the target protein is predicted to be an OMP, it is searched against a database of OMPs with known structures using PPA-OMP. The 3D structural models of the target are then built from the PPA-OMP alignment with the assistance of the MODELLER program [23]. Because PPA-OMP searches both a sequence database and a structure database in this pipeline, it appears twice in the flow chart.
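To make the idea of a scoring function that combines sequence profiles with predicted structural features more concrete, the following is a minimal sketch. The functional form (a dot product of a frequency profile with a log-odds profile, plus a secondary structure agreement term), the weight `w_ss`, and the function names are all assumptions made for illustration; the actual PPA-OMP scoring function is not given in this text.

```python
from typing import List, Sequence


def column_score(freq_q: Sequence[float], pssm_t: Sequence[float],
                 ss_q: str, ss_t: str, w_ss: float = 1.0) -> float:
    """Score for aligning one query column with one template column.

    Assumed form (illustrative only, not the published function):
    profile term + weighted secondary structure agreement term.
    """
    # Profile term: frequencies of the query column weighted by the
    # log-odds scores of the template column.
    profile_term = sum(f * s for f, s in zip(freq_q, pssm_t))
    # Structural term: reward matching predicted states, penalize mismatches.
    ss_term = w_ss if ss_q == ss_t else -w_ss
    return profile_term + ss_term


def score_matrix(freqs_q: List[Sequence[float]], pssms_t: List[Sequence[float]],
                 ss_q: str, ss_t: str, w_ss: float = 1.0) -> List[List[float]]:
    """All-against-all column scores, ready for a dynamic programming aligner."""
    return [
        [column_score(freqs_q[i], pssms_t[j], ss_q[i], ss_t[j], w_ss)
         for j in range(len(pssms_t))]
        for i in range(len(freqs_q))
    ]
```

Setting `w_ss` to zero in this sketch reduces the score to the profile term alone, which is one way to compare alignments with and without structural features.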
Figure 2. DSSP and STRIDE for assignment of protein secondary structure on SCOPe_TEST1073 dataset.
Comparison of protein secondary structure prediction performance.
| Method | Q3 | QH | QE | QC |
|---|---|---|---|---|
| PSSpred | 0.813 | 0.876 | 0.746 | 0.786 |
| Psipred | 0.800 | 0.813 | 0.711 | 0.835 |
| SPINE-X | 0.801 | 0.882 | 0.695 | 0.777 |
| SABLE | 0.783 | 0.823 | 0.662 | 0.809 |
| PSSM-2-Features | 0.798 | 0.869 | 0.728 | 0.764 |
| PSSpred | 0.804 | 0.836 | 0.727 | 0.818 |
| Psipred | 0.798 | 0.788 | 0.699 | 0.863 |
| SPINE-X | 0.800 | 0.873 | 0.679 | 0.801 |
| SABLE | 0.786 | 0.817 | 0.665 | 0.826 |
| PSSM-2-Features | 0.787 | 0.853 | 0.669 | 0.792 |
| Secondary structure assigned by DSSP | | | | |
| PSSpred | 0.801 | 0.877 | 0.759 | 0.751 |
| Psipred | 0.799 | 0.824 | 0.726 | 0.812 |
| SPINE-X | 0.788 | 0.881 | 0.707 | 0.743 |
| SABLE | 0.780 | 0.832 | 0.677 | 0.785 |
| PSSM-2-Features | 0.793 | 0.833 | 0.710 | 0.799 |
| PSSpred | 0.793 | 0.835 | 0.738 | 0.786 |
| Psipred | 0.799 | 0.799 | 0.714 | 0.843 |
| SPINE-X | 0.788 | 0.871 | 0.689 | 0.768 |
| SABLE | 0.786 | 0.826 | 0.678 | 0.805 |
| PSSM-2-Features | 0.780 | 0.831 | 0.663 | 0.796 |
a. The results here were obtained on an independent dataset (i.e., SCOPe_TEST1073).
b. The results here were obtained by cross-validation on the PDB_CS6001 dataset.
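The accuracy measures in the table above can be illustrated with a short sketch. It assumes the standard definitions: Q3 is the fraction of residues whose three-state label (H, E or C) is predicted correctly, and QH, QE and QC are the fractions of observed helix, strand and coil residues that are predicted correctly. The function name is hypothetical, and the exact conventions used in the paper may differ.

```python
from typing import Dict


def q3_scores(observed: str, predicted: str) -> Dict[str, float]:
    """Compute Q3 and per-state accuracies from three-state strings (H/E/C)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")

    # Q3: correctly predicted residues over all residues.
    correct = sum(o == p for o, p in zip(observed, predicted))
    scores = {"Q3": correct / len(observed)}

    # Per-state accuracy: correctly predicted residues of a state
    # over all residues observed in that state.
    for state in "HEC":
        total = observed.count(state)
        hits = sum(o == state and p == state
                   for o, p in zip(observed, predicted))
        scores["Q" + state] = hits / total if total else float("nan")
    return scores
```

For example, `q3_scores("HHHHEEECCC", "HHHCEECCCC")` gives a Q3 of 0.8, with one helix residue and one strand residue mispredicted as coil.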
Input features and optimized window sizes for the training of structural properties.
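| Property | Window size | Number of hidden layers (a) | | Input features (b) |
|---|---|---|---|---|
| SS | 15 | 2,1 | 2 | PSSM, PSFM, CS, FT |
| RD | 17 | 1,1 | 2 | PSSM, PSFM, CS |
| Phi | 17 | 1 | 1 | PSSM, PSFM, CS |
| RSA | 21 | 1,1 | 2 | SS, PSSM, PSFM, CS |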
a. The column for the number of hidden layers contains one or two numbers. When there are two, they refer to the first and second networks, respectively. Generally speaking, we use the second neural network to refine the prediction of the first neural network.
b. PSSM, PSFM, FT and CS stand for position-specific scoring matrix, position-specific frequency matrix, amino acid’s fitness score to secondary structure and conservation score, respectively.
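The role of the window size can be illustrated with a sketch of how per-residue inputs might be assembled from a profile. It assumes that the feature vector of a residue is the concatenation of the profile rows inside a window centered on that residue, and that positions beyond the termini are zero-padded. Both the padding scheme and the function name are assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def window_features(profile: Sequence[Sequence[float]],
                    window: int) -> List[List[float]]:
    """Build one flat feature vector per residue from a sliding window.

    profile: one row per residue (e.g. 20 PSSM columns).
    window:  odd window size, e.g. 15 for SS or 21 for RSA in the table.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    width = len(profile[0]) if profile else 0
    padding = [0.0] * width  # assumed zero-padding beyond the termini

    features = []
    for center in range(len(profile)):
        vector: List[float] = []
        for pos in range(center - half, center + half + 1):
            inside = 0 <= pos < len(profile)
            vector.extend(profile[pos] if inside else padding)
        features.append(vector)
    return features
```

With a 20-column profile and a window of 15, each residue in this sketch is described by 300 values before any additional features such as CS or FT are appended.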
The mean absolute error (MAE) and Pearson’s correlation coefficient (Pcc) of various structural properties.
| Property | MAE | Pcc |
|---|---|---|
| SPINE-X Phi | 0.072 | 0.550 |
| PSSM-2-Features Phi | 0.082 | 0.546 |
| SPINE-X RSA | 0.168 | 0.673 |
| PSSM-2-Features RSA | 0.164 | 0.690 |
| PSSM-2-Features RD | 0.062 | 0.597 |
| SPINE-X Phi | 0.074 | 0.549 |
| PSSM-2-Features Phi | 0.082 | 0.546 |
| SPINE-X RSA | 0.153 | 0.688 |
| PSSM-2-Features RSA | 0.164 | 0.690 |
| PSSM-2-Features RD | 0.083 | 0.553 |
a. The results here were obtained on the SCOPe_TEST1073 dataset.
b. The results here were obtained by cross-validation on the PDB_CS6001 dataset.
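The two measures reported in the table above can be computed as in the following sketch, which uses the textbook definitions of the mean absolute error and Pearson's correlation coefficient. How the paper scales each property before computing the MAE is not stated in this text, so the inputs are simply treated as paired real values; the function names are illustrative.

```python
import math
from typing import Sequence


def mean_absolute_error(observed: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Average absolute difference between paired values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def pearson_correlation(observed: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Pearson's correlation coefficient between paired values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_o * var_p)
```

Computing the correlation separately for each protein, rather than over all residues pooled together, would give the kind of per-protein distribution that a histogram of Pcc values summarizes.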
Figure 3. Number of proteins as a function of Pearson’s correlation coefficient (Pcc) for RSA on SCOPe_TEST1073 dataset.
Figure 4. Number of proteins as a function of Pearson’s correlation coefficient (Pcc) for RD on SCOPe_TEST1073 dataset.
Figure 5. Relationship between RSA and RD on SCOPe_TEST1073 dataset.
Figure 6. ROC curves of different OMP discrimination methods assessed on R-dataset.
Comparison of receiver operating characteristic (ROC) tables for different methods.
| HHomp | 1400 | 1435 | 1437 | 1441 | 1442 | 1445 | 1449 | 1454 | 1455 | 1459 |
| PPA-OMP | 1314 | 1389 | 1452 | 1541 | 1564 | 1634 | 1667 | 1706 | 1728 | 1741 |
| Control-PPA | 1166 | 1291 | 1310 | 1319 | 1323 | 1346 | 1362 | 1393 | 1404 | 1417 |
a. Here, false positives correspond to those non-OMPs that are predicted as OMPs.
b. The numbers in this line show various thresholds of false positives.
c. The numbers in these lines correspond to true positives that can be identified by methods tested here.
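An entry of a table like the one above, the number of true positives found at a given false positive threshold, can be derived from a ranked list of predictions as sketched below. The sketch assumes that each protein has a single score where higher means more OMP-like, and that the count stops just before the false positive budget would be exceeded. The function name and the handling of tied scores are assumptions for illustration.

```python
from typing import Iterable, Tuple


def true_positives_at_fp(scored: Iterable[Tuple[float, bool]],
                         max_false_positives: int) -> int:
    """Count true positives ranked above the point where the number of
    false positives would exceed max_false_positives.

    scored: (score, is_omp) pairs; higher scores are ranked first.
    Ties keep their input order because Python's sort is stable.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    false_pos = 0
    for _, is_omp in ranked:
        if is_omp:
            true_pos += 1
        else:
            if false_pos == max_false_positives:
                break  # accepting this hit would exceed the budget
            false_pos += 1
    return true_pos
```

Calling the function once per threshold, with the scores of one method, yields one row of such a table.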