Xu Zhang¹, Yiwei Liu², Yaming Wang³, Liang Zhang⁴, Lin Feng², Bo Jin², Hongzhe Zhang¹.
Abstract
In bioinformatics, understanding protein secondary structure is important for studying diseases and finding new treatments. Because determining protein secondary structure by physical experiment is time-consuming and expensive, pattern recognition and machine learning methods have been proposed to predict it. However, most of these methods achieve similar performance, which suggests that they have reached a model capacity bottleneck. Both model design and the learning process affect a model's learning capacity; we focus on the latter. To this end, we propose a framework called the Multistage Combination Classifier Augmented Model (MCCM) for protein secondary structure prediction. First, a feature extraction module extracts features with different levels of learning difficulty. Second, multistage combination classifiers learn decision boundaries for easy and hard samples separately; the classifier for hard samples penalizes their loss values, which improves prediction performance on them. Third, a sample difficulty discrimination module, based on the Dirichlet distribution and an information entropy measure, assigns samples of different learning difficulty to these classifiers. Experimental results on the publicly available CB513 benchmark dataset show that our method outperforms most state-of-the-art models.
Keywords: amino acid sequence; biology; combination classifier; deep learning; genetics; protein secondary structure
Year: 2022 PMID: 35677562 PMCID: PMC9170271 DOI: 10.3389/fgene.2022.769828
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. Prediction of the Dirichlet distribution of the three amino acid samples’ analysis.
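To make the Dirichlet-plus-entropy idea from the abstract concrete, here is a minimal Python sketch. The abstract does not give the formulas, so everything below is an assumption: difficulty is scored as the Shannon entropy of the Dirichlet mean, and the function names (`dirichlet_mean`, `entropy`, `is_hard`), the `alpha` values, and the threshold are hypothetical.

```python
import numpy as np

def dirichlet_mean(alpha):
    """Expected class probabilities of Dirichlet(alpha): alpha_k / sum(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of each probability vector in p."""
    p = np.asarray(p, dtype=float)
    return -(p * np.log(p + eps)).sum(axis=-1)

def is_hard(alpha, threshold):
    """Flag samples whose expected-probability entropy exceeds the threshold."""
    return entropy(dirichlet_mean(alpha)) > threshold

# Three hypothetical residues over 3 states (C, E, H); the alpha values are made up.
alphas = np.array([
    [1.0, 1.0, 20.0],  # mass concentrated on one state
    [5.0, 5.0, 5.0],   # mass spread evenly
    [1.0, 8.0, 9.0],   # mass split between two states
])
scores = entropy(dirichlet_mean(alphas))
hard = is_hard(alphas, threshold=0.8)  # 0.8 is an illustrative threshold only
```

With these made-up inputs the entropies are roughly 0.37, 1.10, and 0.87 nats, so only the first residue would be routed to the easy-sample classifier. Note that the entropy of the mean ignores the total concentration of the Dirichlet, so the measure actually used in the paper may differ from this one.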
FIGURE 2. The overall architecture of the Multistage Combination Classifier Augmented Model (MCCM) for protein secondary structure prediction in genetics and bioinformatics.
Q3 and Q8 accuracy of different algorithms on the public CB513 dataset.
| Algorithms | Q3 | Q8 |
|---|---|---|
| DeepCNF (2016) | 81.80 | 69.1 |
| DCRNN (2018) | - | 69.70 |
| eCRRNN (2018) | 81.20 | 70.2 |
| DNSS2 (2021) | 82.56 | 73.36 |
| BLSTM (2015) | - | 67.40 |
| GSN (2014) | - | 66.40 |
| SSpro, free (2014) | 78.50 | 63.50 |
| JPRED4 (2015) | 81.70 | - |
| SecNet (2020) | 84.30 | 72.30 |
| MCCMdir | 82.12 | 69.79 |
| MCCMeasy | 86.94 | 71.78 |
| MCCM | | |
The bold values denote the best values of performance metrics.
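For readers unfamiliar with the two metrics, the sketch below shows one common way Q8 and Q3 accuracy are computed: the percentage of residues whose predicted state matches the true state, over 8 or 3 states. The 8-to-3 state grouping in `Q8_TO_Q3` is an assumption (the usual DSSP reduction), and the helper names and toy sequences are hypothetical. Deriving Q3 by collapsing 8-state predictions is also an assumption here, since a model may instead predict the 3 states directly.

```python
# DSSP 8-state labels reduced to 3 states. This grouping is an assumption,
# not something stated in the record.
Q8_TO_Q3 = {"H": "H", "G": "H", "I": "H",
            "E": "E", "B": "E",
            "C": "C", "S": "C", "T": "C"}

def accuracy(true_labels, pred_labels):
    """Percentage of residues whose predicted label equals the true label."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have equal length")
    if len(true_labels) == 0:
        raise ValueError("label sequences must be non-empty")
    hits = sum(t == p for t, p in zip(true_labels, pred_labels))
    return 100.0 * hits / len(true_labels)

def q8_and_q3(true_q8, pred_q8):
    """Return (Q8, Q3) accuracy for two equal-length 8-state label strings."""
    q8 = accuracy(true_q8, pred_q8)
    q3 = accuracy([Q8_TO_Q3[c] for c in true_q8],
                  [Q8_TO_Q3[c] for c in pred_q8])
    return q8, q3

# Toy sequences of 10 residues each (not taken from CB513).
true_q8 = "HHHGEEBCST"
pred_q8 = "HHHHEEECCT"
q8, q3 = q8_and_q3(true_q8, pred_q8)  # 7 of 10 match in 8 states, 10 of 10 in 3
```

In this toy case the three 8-state errors (G as H, B as E, S as C) all stay within one 3-state group, which is why Q3 is typically higher than Q8 for the same predictions.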
Q3 and Q8 accuracy of variant models on the public CB513 dataset.
| Algorithms | Q3 | Q8 |
|---|---|---|
| MCCMc1 | 79.93 | 66.30 |
| MCCMc2 | 81.45 | 68.92 |
| MCCMconf | 81.00 | 66.42 |
| MCCMdir | **82.12** | **69.79** |
The bold values denote the best values of performance metrics.
FIGURE 3. The loss value computed on the training and test datasets.
Prediction accuracy of each label in the Q8 states based on the CB513 dataset.
| Label | Types | Frequency | MCCMdir | MCCMeasy | MCCM |
|---|---|---|---|---|---|
| H | α-Helix | 30.86 | 91.97 | 89.91 | |
| E | β-Strand | 21.25 | 83.67 | 80.08 | |
| C | Coil | 21.14 | 63.73 | 63.92 | |
| T | β-Turn | 11.81 | 53.96 | | 74.37 |
| S | Bend | 9.81 | 26.35 | 23.68 | |
| G | 3₁₀ Helix | 3.69 | 30.62 | 18.3 | |
| B | β-Bridge | 1.39 | 4.57 | 3.47 | |
| I | π-Helix | 0.04 | 0.00 | | |
The bold values denote the best values of performance metrics.
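The per-label accuracies and the overall Q8 accuracy are linked: weighting each label's accuracy by its frequency should give back the overall figure. The Python sketch below checks this for the MCCMdir column, using the frequencies and accuracies from the table above. The variable and function names are hypothetical, and a small mismatch is expected because the table values are rounded.

```python
# (frequency %, MCCMdir accuracy %) for each Q8 label, copied from the table.
per_label = {
    "H": (30.86, 91.97),
    "E": (21.25, 83.67),
    "C": (21.14, 63.73),
    "T": (11.81, 53.96),
    "S": (9.81, 26.35),
    "G": (3.69, 30.62),
    "B": (1.39, 4.57),
    "I": (0.04, 0.00),
}

def weighted_accuracy(rows):
    """Frequency-weighted mean of per-label accuracies, in percent."""
    total_freq = sum(freq for freq, _ in rows.values())
    return sum(freq * acc for freq, acc in rows.values()) / total_freq

q8_estimate = weighted_accuracy(per_label)
```

The estimate lands within rounding error of the 69.79 reported for MCCMdir. It also shows why the rare labels (B, I) barely move the overall score even when their accuracy is near zero.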
Q3 confusion matrix of 84,765 test labels (MCCMdir, MCCMeasy, and MCCM).
| Accuracy (MCCMdir): 82.12 | Pred freq. | True label: C | True label: E | True label: H |
|---|---|---|---|---|
| True freq. | 100% | 42.76 | 22.65 | 34.59 |
| Predicted label: C | 44.53 | | 3.76 | 3.85 |
| Predicted label: E | 21.07 | 5.27 | | 0.48 |
| Predicted label: H | 34.40 | 4.11 | 0.41 | |
The bold values denote the best values of performance metrics.
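The marginals in this table constrain its diagonal, which is blank here. The Python sketch below derives the diagonal under one assumption that is inferred from the numbers, not stated in the source: the cells in a label's row sum to its "True freq." value, and the cells in its column sum to its "Pred freq." value. All names are hypothetical, and the off-diagonal cells are copied as printed.

```python
labels = ["C", "E", "H"]
true_freq = {"C": 42.76, "E": 22.65, "H": 34.59}
pred_freq = {"C": 44.53, "E": 21.07, "H": 34.40}

# off_diag[row][col]: off-diagonal cells of the table, row by row, as printed.
off_diag = {
    "C": {"E": 3.76, "H": 3.85},
    "E": {"C": 5.27, "H": 0.48},
    "H": {"C": 4.11, "E": 0.41},
}

def diagonal_from_rows():
    """Diagonal implied by row sums (assumed to equal the 'True freq.' values)."""
    return {k: round(true_freq[k] - sum(off_diag[k].values()), 2) for k in labels}

def diagonal_from_columns():
    """Diagonal implied by column sums (assumed to equal the 'Pred freq.' values)."""
    return {
        k: round(pred_freq[k] - sum(off_diag[r][k] for r in labels if r != k), 2)
        for k in labels
    }

diag_rows = diagonal_from_rows()
diag_cols = diagonal_from_columns()
trace = sum(diag_rows.values())  # overall accuracy implied by the diagonal
```

Under that assumption both routes give the same diagonal (35.15 for C, 16.90 for E, 30.07 for H), and its sum matches the 82.12 accuracy reported for MCCMdir. The agreement supports this reading of the table, but it is a consistency check and not a value taken from the source.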
Q8 confusion matrix of 84,765 test labels (MCCMdir, MCCMeasy, and MCCM).
| Accuracy (MCCMdir): 69.79 | Pred freq. | True label: C | True label: B | True label: E | True label: G | True label: I | True label: H | True label: S | True label: T |
|---|---|---|---|---|---|---|---|---|---|
| True freq. | 100% | 21.14 | 1.39 | 21.25 | 3.69 | 0.04 | 30.86 | 9.81 | 11.81 |
| Predicted label: C | 23.33 | | 0.04 | 3.58 | 0.27 | 0.00 | 1.04 | 1.28 | 1.46 |
| Predicted label: B | 0.14 | 0.69 | | 0.33 | 0.02 | 0.00 | 0.11 | 0.09 | 0.09 |
| Predicted label: E | 23.81 | 2.29 | 0.02 | | 0.07 | 0.00 | 0.42 | 0.31 | 0.37 |
| Predicted label: G | 2.57 | 0.65 | 0.00 | 0.21 | | 0.00 | 0.91 | 0.12 | 0.68 |
| Predicted label: I | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | | 0.03 | 0.00 | 0.00 |
| Predicted label: H | 33.43 | 0.86 | 0.01 | 0.29 | 0.4 | 0.00 | | 0.1 | 0.82 |
| Predicted label: S | 5.12 | 3.48 | 0.01 | 1.07 | 0.18 | 0.00 | 0.69 | | 1.79 |
| Predicted label: T | 11.6 | 1.9 | 0. | 0.55 | 0.5 | 0.00 | 1.86 | 0.63 | |
The bold values denote the best values of performance metrics.