Mohammad N Saqib, Justyna D Kryś, Dominik Gront.
Abstract
The assignment of secondary structure elements in protein conformations is necessary to interpret a protein model that has been established by computational methods. The process essentially involves labeling the amino acid residues with H (Helix), E (Strand), or C (Coil, also known as Loop). When particular atoms are absent from an input protein structure, the procedure becomes more complicated, especially when only the alpha carbon (Cα) locations are known. Various techniques have been tested and applied to this problem during the last forty years. The application of machine learning techniques is the most recent trend. This contribution presents the HECA classifier, which uses neural networks to assign protein secondary structure types. The technique exclusively employs Cα coordinates. The Keras (TensorFlow) library was used to implement and train the neural network model. The BioShell toolkit was used to calculate the neural network input features from raw coordinates. The study's findings show that neural network-based methods may be successfully used to take on structure assignment challenges when only a Cα trace is available. Thanks to the careful selection of input features, our approach's accuracy (above 97%) exceeded that of the existing methods.
Keywords: deep learning; machine learning; multi-class classifier; neural networks; protein secondary structure; protein secondary structure assignment; protein structure prediction
Year: 2022 PMID: 35740966 PMCID: PMC9220970 DOI: 10.3390/biom12060841
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. The network architecture. The neural network consists of an input layer, two hidden layers, and an output layer.
Figure 2. The spatial features computed from Cα positions that are used in the HECA method: (A) local distances between the i-th atom and the (i + 2)-th, (i + 3)-th and (i + 4)-th atoms, (B) the number of spatial neighbors and (C) the number of hydrogen bonds in the Cα-only definition.
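Features (A) and (B) can be illustrated with a minimal sketch in plain Python. This is not the BioShell implementation: the function names, the 8 Å cutoff and the minimum sequence separation are assumptions made for illustration only, and the Cα-only hydrogen bond count (C) is left out because its definition is not given in this record.

```python
import math

def local_distances(ca, i):
    """Distances from the i-th C-alpha to the (i+2)-th, (i+3)-th and (i+4)-th."""
    return [math.dist(ca[i], ca[i + k]) for k in (2, 3, 4)]

def neighbor_count(ca, i, cutoff=8.0, min_separation=3):
    """Count C-alpha atoms within `cutoff` angstroms of residue i.

    Both `cutoff` and `min_separation` are assumed values, not taken
    from the HECA paper.
    """
    return sum(
        1
        for j in range(len(ca))
        if abs(i - j) >= min_separation and math.dist(ca[i], ca[j]) <= cutoff
    )

# Toy trace: eight C-alpha atoms on a straight line, 3.8 angstroms apart.
ca = [(3.8 * k, 0.0, 0.0) for k in range(8)]
print(local_distances(ca, 0))
print(neighbor_count(ca, 0))
```

In a real structure the three local distances shrink for helical segments and stretch for strands, which is presumably why they carry most of the signal for short fragments.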
Figure 3. The overview of the HECA prediction. A vector of input values is computed from a given Cα-only structure for each N-residue fragment (here N = 5). Each input row is used to assign one of the H, E, and C classes to the middle residue of a segment. For example, the input row marked in bold font corresponds to the TEAVD segment of the 2gb1 deposit and predicts the secondary structure for the middle alanine.
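The fragment-to-middle-residue mapping described in this caption can be sketched as a sliding window. The helper name below is hypothetical, and the sketch only enumerates full-length windows; how HECA treats the terminal residues that lack a full window is not stated in this record.

```python
def middle_residue_windows(sequence, fragment_length=5):
    """Return (middle_index, fragment) pairs for every full window.

    `fragment_length` must be odd so that a unique middle residue exists.
    """
    if fragment_length % 2 == 0:
        raise ValueError("fragment_length must be odd")
    half = fragment_length // 2
    return [
        (mid, sequence[mid - half : mid + half + 1])
        for mid in range(half, len(sequence) - half)
    ]

# The TEAVD segment from the caption: the class is assigned to the middle 'A'.
for mid, fragment in middle_residue_windows("TEAVD"):
    print(mid, fragment, fragment[len(fragment) // 2])
```

Each window would then be turned into one input row of features for the network, with the predicted class attached to the residue at `middle_index`.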
The Q3 accuracy on the training and validation set for the HECA neural network with different input data sets (in percentages).
| Fragment length | Local (training) | Local (validation) | Local + Neighbors (training) | Local + Neighbors (validation) | Local + Neighbors + Hbonds (training) | Local + Neighbors + Hbonds (validation) |
|---|---|---|---|---|---|---|
| 5 | 83.58 | 83.60 | 88.93 | 88.75 | 91.91 | 91.98 |
| 7 | 89.89 | 89.99 | 96.20 | 96.39 | 95.40 | 95.48 |
| 9 | 91.84 | 91.88 | 94.03 | 94.10 | 96.85 | 96.91 |
| 11 | 92.37 | 92.46 | 94.51 | 94.57 | 97.29 | 97.33 |
| 13 | 92.53 | 92.61 | 94.68 | 94.89 | 97.39 | 97.48 |
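Q3 accuracy is the percentage of residues whose three-state label (H, E or C) matches the reference assignment. A minimal sketch, with a hypothetical function name and made-up label strings:

```python
def q3_accuracy(true_labels, predicted_labels):
    """Percentage of positions where the predicted H/E/C label is correct."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

# Illustrative labels only: 7 of 8 residues agree.
print(q3_accuracy("HHHEECCC", "HHHEECCH"))
```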
The summary of the HECA algorithm performance measured on a test data set.
| Fragment length | 5 | 7 | 9 | 11 | 13 |
|---|---|---|---|---|---|
| No. of ideally predicted proteins | 144 | 246 | 406 | 465 | 491 |
| Differences between predicted and true classes, H | 6.97% | 2.87% | 2.78% | 2.05% | 1.88% |
| Differences between predicted and true classes, E | 5.16% | 4.55% | 4.30% | 3.44% | 3.63% |
| Differences between predicted and true classes, C | 12.27% | 7.69% | 7.69% | 4.50% | 4.44% |
| Average differences | 8.13% | 5.03% | 3.70% | 3.33% | 3.31% |
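For the 11-residue fragments, the per-class differences in this table appear to equal 100% minus the diagonal of the 11-mer confusion matrix reported further down in this record, and the average appears to be their unweighted mean. This is an inference from the numbers, not a definition stated in the record. The check can be sketched as:

```python
# Diagonal of the 11-mer confusion matrix (percent of each true class
# that was predicted correctly), copied from this record.
diagonal_11mer = {"H": 97.953, "E": 96.559, "C": 95.498}

# Assumed definition: difference = 100 - correctly predicted percentage.
differences = {cls: 100.0 - pct for cls, pct in diagonal_11mer.items()}

# Assumed definition: unweighted mean over the three classes.
average = sum(differences.values()) / len(differences)

print({cls: round(d, 2) for cls, d in differences.items()})
print(round(average, 2))
```

The rounded values agree with the 11-residue column (H: 2.05%, E: 3.44%, C: 4.50%, average 3.33%).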
Figure 4. Q3 accuracy of the HECA method compared to the PCASSO approach (white and dark bars, respectively).
Confusion matrices for all (local + neighbors + hbonds) features and segments compared to local features and (bottom right).
| 5-mer | Predicted H | Predicted E | Predicted C | 11-mer | Predicted H | Predicted E | Predicted C |
|---|---|---|---|---|---|---|---|
| H | 93.026% | 0.035% | 6.938% | H | 97.953% | 0.036% | 2.009% |
| E | 0.180% | 94.840% | 4.978% | E | 0.123% | 96.559% | 3.317% |
| C | 11.089% | 1.179% | 87.731% | C | 2.771% | 1.730% | 95.498% |
| | H | E | C | | H | E | C |
| H | 98.11% | 0.028% | 1.859% | H | 97.680% | 0.205% | 2.114% |
| E | 0.134% | 96.37% | 3.49% | E | 0.684% | 85.680% | 13.634% |
| C | 2.80% | 1.634% | 95.563% | C | 3.398% | 7.043% | 89.557% |
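Each row of these matrices sums to roughly 100%, which suggests that rows hold the true class and that counts are normalized per row. A minimal sketch of that normalization follows; the function name and the label strings are illustrative, not taken from the paper.

```python
from collections import Counter

CLASSES = "HEC"

def confusion_percentages(true_labels, predicted_labels):
    """Row-normalized confusion matrix: matrix[true][predicted] in percent."""
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for t in CLASSES:
        row_total = sum(counts[(t, p)] for p in CLASSES)
        matrix[t] = {
            p: (100.0 * counts[(t, p)] / row_total if row_total else 0.0)
            for p in CLASSES
        }
    return matrix

# Made-up labels: one helix residue and one coil residue are misassigned.
m = confusion_percentages("HHHHEECC", "HHHCEECH")
for t in CLASSES:
    print(t, m[t])
```

Row normalization makes classes of different sizes comparable: an off-diagonal entry reads directly as the share of a true class that was assigned to another class.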
Q3 accuracy of the HECA method compared with PCASSO and results by Nasr et al. [16].
| PDB | Nres | Type | HECA | PCASSO | Nasr et al. | PDB | Nres | Type | HECA | PCASSO | Nasr et al. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1EAR_A | 135 | | 0.957 | 0.936 | 0.956 | 4MNC_A | 299 | | 0.950 | 0.849 | 0.930 |
| 1GQI_A | 702 | | 0.971 | 0.879 | 0.925 | 4MYD_A | 246 | | 0.972 | 0.857 | 0.947 |
| 1NUY_A | 322 | | 0.960 | 0.859 | 0.891 | 4OH7_A | 296 | | 0.953 | 0.427 | 0.909 |
| 1OK0_A | 68 | | 0.972 | 0.972 | 0.926 | 4P3H_A | 184 | | 0.921 | 0.473 | 0.918 |
| 1SDI_A | 207 | | 0.962 | 0.901 | 0.981 | 4WKA_A | 363 | | 0.962 | 0.853 | 0.909 |
| 1UJ8_A | 66 | | 0.917 | 0.794 | 0.939 | 4ZDS_A | 125 | | 0.939 | 0.825 | 0.960 |
| 1Z6N_A | 160 | | 0.945 | 0.867 | 0.950 | 5CKL_A | 175 | | 0.933 | 0.883 | 0.937 |
| 2FGQ_X | 324 | | 0.954 | 0.921 | 0.920 | 5CL8_A | 225 | | 0.948 | 0.852 | 0.951 |
| 2FP1_A | 159 | | 0.963 | 0.890 | 0.981 | 5CVW_A | 142 | | 0.946 | 0.773 | 0.831 |
| 2FVY_A | 298 | | 0.977 | 0.898 | 0.953 | 5GZK_A | 412 | | 0.935 | 0.861 | 0.876 |
| 2I5V_O | 239 | | 0.979 | 0.951 | 0.954 | 5JUH_A | 130 | | 0.948 | 0.627 | 0.862 |
| 2JDA_A | 131 | | 0.920 | 0.791 | 0.756 | 5LT5_A | 198 | | 0.960 | 0.877 | 0.909 |
| 2O1T_A | 428 | | 0.907 | 0.847 | 0.930 | 5T9Y_A | 312 | | 0.915 | 0.598 | 0.933 |
| 2OPC_A | 109 | | 0.956 | 0.930 | 0.908 | 5TIF_A | 176 | | 0.961 | 0.839 | 0.955 |
| 2QKV_A | 85 | | 0.923 | 0.945 | 0.976 | 5UEB_A | 136 | | 0.943 | 0.829 | 0.956 |
| 2RIN_A | 282 | | 0.895 | 0.788 | 0.901 | 5W53_A | 297 | | 0.963 | 0.947 | 0.970 |
| 2RIQ_A | 129 | | 0.888 | 0.903 | 0.938 | 5WEC_A | 104 | | 0.946 | 0.866 | 0.981 |
| 2Z6R_A | 256 | | 0.958 | 0.852 | 0.930 | 5YDE_A | 105 | | 0.936 | 0.891 | 0.924 |
| 2ZDP_A | 104 | | 0.972 | 0.798 | 0.904 | 5ZIM_A | 222 | | 0.942 | 0.912 | 0.959 |
| 3BQP_A | 74 | | 0.937 | 0.900 | 1.000 | 6A2W_A | 159 | | 0.957 | 0.831 | 0.975 |
| 3D2Y_A | 251 | | 0.968 | 0.914 | 0.952 | 6E7E_A | 163 | | 0.970 | 0.905 | 0.982 |
| 3DXY_A | 200 | | 0.932 | 0.869 | 0.970 | 6ER6_A | 82 | | 0.931 | 0.886 | 0.976 |
| 3KYJ_A | 123 | | 0.961 | 0.658 | 0.992 | 6GEH_A | 250 | | 0.968 | 0.910 | 0.956 |
| 3LFK_A | 115 | | 0.909 | 0.811 | 0.930 | 6I1A_A | 352 | | 0.946 | 0.731 | 0.932 |
| 3NJN_A | 108 | | 0.903 | 0.517 | 0.861 | 6IY4_I | 86 | | 0.881 | 0.838 | 0.907 |
| 3OBQ_A | 135 | | 0.950 | 0.907 | 0.956 | 6JH9_B | 22 | | 0.655 | 0.689 | 0.773 |
| 3Q40_A | 169 | | 0.964 | 0.946 | 0.975 | 6JM5_A | 114 | | 0.918 | 0.886 | 0.904 |
| 3R87_A | 125 | | 0.984 | 0.916 | 0.960 | 6JU1_A | 387 | | 0.961 | 0.785 | 0.925 |
| 3RT2_A | 165 | | 0.959 | 0.935 | 0.964 | 6JWF_A | 400 | | 0.973 | 0.899 | 0.915 |
| 3V4K_A | 180 | | 0.967 | 0.672 | 0.967 | 6KTK_A | 362 | | 0.956 | 0.495 | 0.945 |
| 3VK5_A | 247 | | 0.956 | 0.889 | 0.976 | 6NEY_A | 119 | | 0.936 | 0.880 | 0.933 |
| 3VMK_A | 363 | | 0.924 | 0.523 | 0.898 | 6NZS_A | 581 | | 0.950 | 0.873 | 0.880 |
| 3WDN_A | 119 | | 0.952 | 0.872 | 0.933 | 6P80_A | 312 | | 0.952 | 0.874 | 0.942 |
| 4AYO_A | 428 | | 0.965 | 0.875 | 0.914 | 6TM6_A | 90 | | 0.855 | 0.876 | 0.878 |
| 4B20_A | 264 | | 0.950 | 0.662 | 0.930 | 6TZX_A | 217 | | 0.950 | 0.865 | 0.926 |
| 4GMU_A | 604 | | 0.960 | 0.863 | 0.904 | 6ULO_A | 310 | | 0.940 | 0.340 | 0.923 |
| 4JUI_A | 463 | | 0.970 | 0.704 | 0.935 | 6YDR_A | 122 | | 0.976 | 0.860 | 0.992 |
| 4L9E_A | 108 | | 0.947 | 0.904 | 0.917 | | | | 0.942 | 0.820 | 0.931 |