Krzysztof Kotowski, Tomasz Smolarczyk, Irena Roterman-Konieczna, Katarzyna Stapor.
Abstract
Predicting protein function and structure from sequence remains an unsolved problem in bioinformatics. The best-performing methods rely heavily on evolutionary information from multiple sequence alignments, which means that their accuracy deteriorates for sequences with few homologs and that, given the growing size of sequence databases, they require long computation times. Here, a single-sequence-based prediction method called ProteinUnet is presented, which leverages a U-Net convolutional network architecture. It is compared to the SPIDER3-Single model, which is based on a long short-term memory bidirectional recurrent neural network architecture. Both methods achieve similar results for prediction of secondary structures (both three- and eight-state), half-sphere exposure, and contact number, but ProteinUnet has half as many parameters, a 17 times shorter inference time, and can be trained 11 times faster. Moreover, ProteinUnet tends to be better for short sequences and for residues with a low number of local contacts. Additionally, loss weighting is presented as an effective way of increasing accuracy for rare secondary structures.
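The loss weighting mentioned in the abstract can be illustrated with a class-weighted cross-entropy. The sketch below is only an illustration: the abstract does not say how the weights are chosen, so the inverse-frequency scheme, the function names, and the toy three-state labels (H, E, C) are all assumptions rather than the authors' method.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Assumed scheme: weight = total / (n_classes * class_count).

    Rare classes get larger weights, and the weights average to 1
    over the residues they were computed from.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Mean of -w[y] * log p(y) over residues.

    probs is a list of dicts mapping each class to its predicted probability.
    """
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

# Toy three-state labels: helix (H) is common, strand (E) is rare.
labels = ["H", "H", "H", "H", "C", "C", "C", "E"]
weights = inverse_frequency_weights(labels)
unit_weights = {c: 1.0 for c in weights}

# A hypothetical predictor that is poor only on the rare class.
probs = [
    {"H": 0.8, "C": 0.1, "E": 0.1} if y == "H"
    else {"H": 0.1, "C": 0.8, "E": 0.1} if y == "C"
    else {"H": 0.45, "C": 0.45, "E": 0.1}
    for y in labels
]

plain_loss = weighted_cross_entropy(probs, labels, unit_weights)
weighted_loss = weighted_cross_entropy(probs, labels, weights)
print(weights)
print(plain_loss, weighted_loss)
```

With these weights, an error on the rare strand class contributes more to the loss than the same error on a helix residue, which is the general mechanism by which weighting can push a network toward rare states.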
Keywords: backbone angles estimation; deep learning; protein structure prediction; secondary structure prediction; solvent accessibility prediction
Year: 2020 PMID: 33058261 PMCID: PMC7756333 DOI: 10.1002/jcc.26432
Source DB: PubMed Journal: J Comput Chem ISSN: 0192-8651 Impact factor: 3.376
FIGURE 1 The architecture of the ProteinUnet secondary structure classification network. The regression network differs only in the number and activations of output layers [Color figure can be viewed at wileyonlinelibrary.com]
Comparison of mean training and prediction times for SPIDER3‐Single and ProteinUnet 10‐model ensembles
| | Classification: SPIDER3‐Single | Classification: ProteinUnet | Regression: SPIDER3‐Single | Regression: ProteinUnet |
|---|---|---|---|---|
| Mean training time per epoch (s) | 524.9 ± 1.7 | 42.0 ± 0.1 | 527.8 ± 1.7 | 45.9 ± 0.3 |
| Mean prediction time per chain in TS1197 (s) | 1.12 ± 0.54 | 0.062 ± 0.0025 | 1.13 ± 0.54 | 0.066 ± 0.0031 |
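The speed-up factors implied by the timing table can be checked with simple division. The snippet below is a small sketch that uses only the mean values from the table and ignores the ± spreads; the dictionary layout and names are illustrative.

```python
# Mean times copied from the timing table: (SPIDER3-Single, ProteinUnet).
times = {
    ("training per epoch", "classification"): (524.9, 42.0),
    ("training per epoch", "regression"): (527.8, 45.9),
    ("prediction per chain", "classification"): (1.12, 0.062),
    ("prediction per chain", "regression"): (1.13, 0.066),
}

# Speed-up = SPIDER3-Single time divided by ProteinUnet time.
ratios = {key: spider / unet for key, (spider, unet) in times.items()}

for (task, network), ratio in ratios.items():
    print(f"{task:22s} {network:15s} {ratio:5.1f}x faster")
```

The ratios come out at roughly 11.5 to 12.5 for training and 17 to 18 for prediction, consistent with the "11 times" and "17 times" figures quoted in the abstract.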
The comparison of performance for test sets between (a) original SPIDER3‐Single Iteration 2, (b) our reimplementation of SPIDER3‐Single, and (c) ProteinUnet according to fraction of residues in correctly predicted three and eight states (Q3 and Q8), Pearson CC, and MAE
| | (a) | (b) | (c) |
|---|---|---|---|
| | TS1199 | TS1197 | TS1197 |
| Q3 | 72.56% | 72.56% | 72.66% |
| Q8 | 60.11% | 59.88% | 60.06% |
| ASA (CC) | 0.671 | 0.669 | 0.667 |
| HSEα‐up (CC) | 0.612 | 0.608 | 0.602 |
| HSEα‐down (CC) | 0.568 | 0.566 | 0.567 |
| CN (CC) | 0.643 | 0.618 | 0.621 |
| | 24.5 | 23.5 | 23.7 |
| | 43.5 | 41.8 | 42.3 |
| | 11.3 | 10.1 | 10.2 |
| | 45.8 | 43.2 | 43.8 |
Abbreviations: ASA, accessible surface area; CC, correlation coefficients; HSE, half sphere exposure; MAE, mean absolute error.
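The three kinds of metric named in the table caption (Q3/Q8 accuracy, Pearson CC, and MAE) can be written out as follows. This is a minimal sketch with hypothetical function names and toy inputs, not the evaluation code used in the study; in particular, the plain MAE here does not wrap periodic quantities such as angles.

```python
import math

def q_accuracy(true_states, pred_states):
    """Fraction of residues whose predicted state equals the true state.

    Works for both three-state (Q3) and eight-state (Q8) label strings.
    """
    matches = sum(t == p for t, p in zip(true_states, pred_states))
    return matches / len(true_states)

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def mean_absolute_error(x, y):
    """Mean of |x_i - y_i| (no periodic wrapping)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Toy inputs for illustration only.
q3 = q_accuracy("HHHECC", "HHEECC")                      # 5 of 6 residues match
cc = pearson_cc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])        # perfectly linear
mae = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
print(q3, cc, mae)
```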
Performance in secondary structure prediction by ProteinUnet and SPIDER3‐Single on TS1197 and CASP13 according to mean accuracy and SD at the sequence level
| Metric | Method | TS1197: Mean (%) | TS1197: SD | TS1197: p-value | CASP13: Mean (%) | CASP13: SD | CASP13: p-value |
|---|---|---|---|---|---|---|---|
| Q3 | ProteinUnet | 73.53 | 8.70 | .0152 | 74.39 | 8.13 | .0128 |
| Q3 | SPIDER3‐Single | 73.18 | 9.04 | | 75.12 | 7.65 | |
| Q8 | ProteinUnet | 61.82 | 10.86 | <.0001 | 60.81 | 12.17 | .8961 |
| Q8 | SPIDER3‐Single | 61.34 | 11.15 | | 60.81 | 12.79 | |
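Sequence-level statistics average one accuracy value per chain, so they can differ from the residue-level Q3 and Q8 values, where long chains contribute more residues. The sketch below shows the difference on toy data; the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the table does not state which SD definition was used.

```python
import statistics

def per_sequence_accuracy(true_seqs, pred_seqs):
    """One accuracy value per chain: matching residues / chain length."""
    return [
        sum(t == p for t, p in zip(true, pred)) / len(true)
        for true, pred in zip(true_seqs, pred_seqs)
    ]

def residue_level_accuracy(true_seqs, pred_seqs):
    """Accuracy pooled over all residues of all chains."""
    matches = sum(
        t == p
        for true, pred in zip(true_seqs, pred_seqs)
        for t, p in zip(true, pred)
    )
    return matches / sum(len(true) for true in true_seqs)

# Toy data: a short chain predicted perfectly, a long chain half right.
true_seqs = ["HHHH", "C" * 16]
pred_seqs = ["HHHH", "C" * 8 + "E" * 8]

accuracies = per_sequence_accuracy(true_seqs, pred_seqs)
seq_mean = statistics.mean(accuracies)
seq_sd = statistics.stdev(accuracies)  # sample SD (assumed definition)
pooled = residue_level_accuracy(true_seqs, pred_seqs)
print(seq_mean, seq_sd, pooled)
```

In this toy case the sequence-level mean is 0.75 while the pooled residue-level accuracy is 0.6, because the short, well-predicted chain counts as much as the long one in the sequence-level average.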
FIGURE 2 Accuracy of the secondary structure prediction (Q3 and Q8) for individual amino acids for SPIDER3‐Single (red triangles) and ProteinUnet (green circles) on TS1197 dataset. Three‐letter codes were used for amino acid residues. The size of the bubble represents the frequency of the amino acids. The gray horizontal line marks the fraction of residues in correctly predicted three and eight states (Q3 and Q8)
FIGURE 3 The accuracy of secondary structure prediction (Q3) for individual sequences against the sequence length for ProteinUnet (green circles) and SPIDER3‐Single (red triangles) on TS1197 dataset
FIGURE 4 The comparison of mean accuracy at the sequence level for each Q8 state on TS1197 dataset between weighted and nonweighted ProteinUnet and nonweighted networks
FIGURE 5 The distribution of regression outputs for TS1197 dataset. True values are presented with a solid gray line, prediction values for ProteinUnet with a solid green line and SPIDER3‐Single with a dashed red line
FIGURE 6 Accuracy of predicted Q3 as a function of the number of local contacts on TS1197 dataset
FIGURE 7 Accuracy of predicted Q3 as a function of the number of nonlocal contacts on TS1197 dataset