Seonwoo Min, HyunGi Kim, Byunghan Lee, Sungroh Yoon.
Abstract
Heat shock proteins (HSPs) act as molecular chaperones that protect cells under unfavorable conditions. Despite their importance, identifying HSPs computationally remains a significant challenge. Previous studies have two major limitations. First, they relied heavily on amino acid composition features, which limited their prediction performance. Second, their reported performance was overestimated because the two stages were evaluated independently and the training and test data were redundant. To overcome these limitations, we introduce two novel deep learning algorithms: (1) the time-efficient DeepHSP and (2) the high-performance DeeperHSP. DeepHSP is a convolutional neural network (CNN)-based model that classifies non-HSPs and six HSP families simultaneously. It outperforms state-of-the-art algorithms while requiring 14-15 times less time for both training and inference. We further improve on DeepHSP through protein transfer learning: whereas DeepHSP is trained on raw protein sequences, DeeperHSP is trained on top of pre-trained protein representations. As a result, DeeperHSP substantially outperforms state-of-the-art algorithms, increasing F1 scores by 20% and 10% in cross-validation and independent test experiments, respectively. We envision that the proposed algorithms can provide proteome-wide prediction of HSPs and support downstream analyses in pathology and clinical research.
Year: 2021 PMID: 34003870 PMCID: PMC8130922 DOI: 10.1371/journal.pone.0251865
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of related works.
| Method | Feature | Model | Non-HSP | HSP Families |
|---|---|---|---|---|
| iHSP-PseRAAC | PAAC | SVM | X | O |
| Ahmad | DPC | SVM | X | O |
| PredHSP | DPC | SVM | O | O |
| ir-HSP | SDPC | SVM | O | O |
Pre-trained protein language models.
| Model | Architecture | Parameters (M) | Dimensions | Proteins (M) |
|---|---|---|---|---|
| UniRep | RNN | 18 | 1,900 | 24 |
| PLUS-RNN | RNN | 59 | 2,048 | 15 |
| SeqVec | RNN | 94 | 1,024 | 33 |
| ProtXLNet | TFM | 409 | 1,024 | 216 |
| ProtBERT | TFM | 421 | 1,024 | 2,122 |
| ESM | TFM | 669 | 1,280 | 27 |
Fig 1. Overview of DeepHSP and DeeperHSP.
Summary of cross-validation and independent test datasets.
| Class | Cross-Validation Dataset | Independent Test Dataset |
|---|---|---|
| Non-HSP | 9,965 | 500 |
| HSP20 | 354 | 12 |
| HSP40 | 1,257 | 52 |
| HSP60 | 159 | 8 |
| HSP70 | 278 | 53 |
| HSP90 | 52 | 35 |
| HSP100 | 81 | 20 |
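The cross-validation counts above are dominated by non-HSPs. As a minimal sketch that uses only the numbers in the table (the helper name `class_shares` is introduced here for illustration), the share of each class can be computed as follows:

```python
# Class counts copied from the cross-validation column of the table above.
cv_counts = {
    "Non-HSP": 9965,
    "HSP20": 354,
    "HSP40": 1257,
    "HSP60": 159,
    "HSP70": 278,
    "HSP90": 52,
    "HSP100": 81,
}


def class_shares(counts):
    """Return each class's fraction of the total number of sequences."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = class_shares(cv_counts)
for name, share in shares.items():
    # Print the class name and its share as a percentage.
    print(f"{name:8s} {share:7.2%}")
```

With a distribution this skewed, accuracy alone is dominated by the non-HSP class, which is one reason to read the class-wise and averaged F1 scores alongside it.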
Comparison of overall classification performance using 5-fold cross-validation.
| Model | Accuracy | F1 Score | Precision | Recall | Specificity | MCC | AUC-ROC | AUC-PR |
|---|---|---|---|---|---|---|---|---|
| PredHSP | 0.9128 | 0.6839 | 0.9044 | 0.5856 | 0.9409 | 0.6686 | 0.9496 | 0.7725 |
| ir-HSP | 0.9483 | 0.8276 | 0.9437 | 0.7611 | 0.9678 | 0.8147 | 0.9725 | 0.8651 |
| DeepHSP | 0.9682 | 0.8613 | 0.9617 | 0.7984 | 0.9778 | 0.8554 | 0.9779 | 0.8931 |
| DeeperHSP |
† Modified version for performance improvement.
Comparison of class-wise classification performance in terms of F1 score using 5-fold cross-validation.
| Model | Non-HSP | HSP20 | HSP40 | HSP60 | HSP70 | HSP90 | HSP100 | Average |
|---|---|---|---|---|---|---|---|---|
| PredHSP | 0.9502 | 0.6353 | 0.7573 | 0.4181 | 0.6159 | 0.6558 | 0.7544 | 0.6839 |
| ir-HSP | 0.9698 | 0.8190 | 0.8696 | 0.6175 | 0.8165 | 0.8315 | 0.8692 | 0.8276 |
| DeepHSP | 0.9812 | 0.8554 | 0.9540 | 0.7157 | 0.8149 | 0.8079 | 0.9001 | 0.8613 |
| DeeperHSP |
† Modified version for performance improvement.
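The "Average" column appears to be the unweighted (macro) mean of the seven class-wise F1 scores. The sketch below checks this against the complete rows of the table; the names `CLASS_F1`, `REPORTED_AVERAGE`, and `macro_f1` are introduced here for illustration only.

```python
from statistics import mean

# Class-wise F1 scores copied from the table above, in column order:
# Non-HSP, HSP20, HSP40, HSP60, HSP70, HSP90, HSP100.
CLASS_F1 = {
    "PredHSP": [0.9502, 0.6353, 0.7573, 0.4181, 0.6159, 0.6558, 0.7544],
    "ir-HSP": [0.9698, 0.8190, 0.8696, 0.6175, 0.8165, 0.8315, 0.8692],
    "DeepHSP": [0.9812, 0.8554, 0.9540, 0.7157, 0.8149, 0.8079, 0.9001],
}

# Values from the "Average" column of the same table.
REPORTED_AVERAGE = {"PredHSP": 0.6839, "ir-HSP": 0.8276, "DeepHSP": 0.8613}


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    return mean(per_class_f1)


for model, scores in CLASS_F1.items():
    # Show the recomputed macro average next to the reported value.
    print(model, round(macro_f1(scores), 4), REPORTED_AVERAGE[model])
```

Because every class carries equal weight, a small family such as HSP60 or HSP90 influences this average as much as the far larger non-HSP class.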
Fig 2. t-SNE plot of the latent representations of DeepHSP and DeeperHSP for the cross-validation dataset.
Comparison of overall classification performance using independent test.
| Model | Accuracy | F1 Score | Precision | Recall | Specificity | MCC | AUC-ROC | AUC-PR |
|---|---|---|---|---|---|---|---|---|
| PredHSP | 0.9324 | 0.8017 | | 0.7643 | 0.9698 | 0.7993 | 0.9515 | 0.8154 |
| ir-HSP | 0.9456 | 0.8302 | 0.9005 | 0.8092 | 0.9780 | 0.8240 | 0.9677 | 0.8400 |
| DeepHSP | 0.9500 | 0.8340 | 0.8870 | 0.8303 | 0.9805 | 0.8292 | 0.9431 | 0.8079 |
| DeeperHSP | 0.9230 |
† Modified version for performance improvement.
Fig 3. Confusion matrices of the modified ir-HSP and DeeperHSP predictions for the independent test dataset.
Comparison of running time required for each algorithm.
| Model | Training (seconds) | Inference (seconds) |
|---|---|---|
| PredHSP | 315 | 5 |
| ir-HSP | 1,265 | 14 |
| DeepHSP | 80 | 1 |
| DeeperHSP | 2,112 | 120 |
† Modified version for performance improvement.
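The relative cost of each algorithm can be read off the table as a simple ratio. The sketch below is illustrative: the names `RUNTIME` and `speedup` are introduced here, and because the table reports whole seconds, the resulting ratios are approximate and need not match figures quoted elsewhere exactly.

```python
# (training seconds, inference seconds) copied from the table above.
RUNTIME = {
    "PredHSP": (315, 5),
    "ir-HSP": (1265, 14),
    "DeepHSP": (80, 1),
    "DeeperHSP": (2112, 120),
}


def speedup(baseline, reference="DeepHSP"):
    """Return how many times longer `baseline` takes than `reference`.

    The result is a (training ratio, inference ratio) pair.
    """
    base_train, base_infer = RUNTIME[baseline]
    ref_train, ref_infer = RUNTIME[reference]
    return base_train / ref_train, base_infer / ref_infer


for model in RUNTIME:
    train_ratio, infer_ratio = speedup(model)
    # Ratios above 1 mean the model is slower than DeepHSP.
    print(f"{model:10s} training x{train_ratio:.1f}  inference x{infer_ratio:.1f}")
```

Read this way, DeeperHSP trades the speed of DeepHSP for the accuracy gained from pre-trained representations: it is the slowest of the four in both training and inference.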
Fig 4. Heatmaps of the F1 scores obtained from baseline algorithms using 5-fold cross-validation.
One-stage algorithms classify both non-HSPs and the six HSP families simultaneously. Two-stage algorithms use two models to filter out non-HSPs and classify the remaining HSPs into the six families.
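The difference between the two designs can be sketched in a few lines. Everything below is a hypothetical illustration rather than the authors' implementation: the function names are invented here, and the toy classifiers simply key off sequence length so that the example runs.

```python
from typing import Callable

CLASSES = ["Non-HSP", "HSP20", "HSP40", "HSP60", "HSP70", "HSP90", "HSP100"]


def one_stage_predict(sequence: str, seven_way: Callable[[str], str]) -> str:
    """One model chooses among all seven classes directly."""
    return seven_way(sequence)


def two_stage_predict(
    sequence: str,
    is_hsp: Callable[[str], bool],
    family_of: Callable[[str], str],
) -> str:
    """Stage 1 filters out non-HSPs; stage 2 assigns a family to the rest."""
    if not is_hsp(sequence):
        return "Non-HSP"
    return family_of(sequence)


# Toy stand-ins (hypothetical, NOT real predictors): decisions depend only
# on sequence length so that the control flow can be demonstrated.
def toy_is_hsp(sequence: str) -> bool:
    return len(sequence) >= 5


def toy_family(sequence: str) -> str:
    return "HSP20" if len(sequence) < 10 else "HSP70"


def toy_seven_way(sequence: str) -> str:
    return toy_family(sequence) if toy_is_hsp(sequence) else "Non-HSP"


for seq in ["MKT", "MKTAYIAK", "MKTAYIAKQRQISFVK"]:
    print(
        seq,
        one_stage_predict(seq, toy_seven_way),
        two_stage_predict(seq, toy_is_hsp, toy_family),
    )
```

In the two-stage design, any HSP rejected by the first model never reaches the second, so scoring each stage on its own can overstate end-to-end performance.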
Fig 5. Boxplot of the F1 scores obtained from DeeperHSP with different pre-trained protein representations using 5-fold cross-validation.