Huiling Zhang, Ying Huang, Zhendong Bei, Zhen Ju, Jintao Meng, Min Hao, Jingjing Zhang, Haiping Zhang, Wenhui Xi.
Abstract
Residue distance prediction from sequence is critical for many biological applications such as protein structure reconstruction, protein-protein interaction prediction, and protein design. However, predicting fine-grained distances between residues with long sequence separations remains challenging. In this study, we propose DuetDis, a method based on duet feature sets and a deep residual network with squeeze-and-excitation (SE), for protein inter-residue distance prediction. DuetDis learns and fuses features directly or indirectly extracted from whole-genome/metagenomic databases and therefore minimizes information loss by ensembling models trained on different feature sets. We evaluate DuetDis and 11 widely used peer methods on a large-scale test set (610 protein chains). The experimental results suggest that 1) prediction results from different feature sets show obvious differences; 2) ensembling different feature sets can improve prediction performance; 3) high-quality multiple sequence alignment (MSA) used for both training and testing can greatly improve prediction performance; and 4) DuetDis is more accurate than peer methods in overall prediction, more reliable in terms of model prediction score, and more robust against shallow MSAs.
Keywords: deep learning; multiple sequence alignment; protein structure reconstruction; residual network; residue distance prediction
Year: 2022 PMID: 35651930 PMCID: PMC9148999 DOI: 10.3389/fgene.2022.887491
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
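The abstract notes that DuetDis minimizes information loss by ensembling models trained on different feature sets. As a rough illustration (not the authors' implementation), ensembling distance predictors commonly amounts to averaging the per-pair distance-bin probability maps produced by each sub-model; `ensemble_distance_maps` below is a hypothetical helper sketching that idea:

```python
import numpy as np

def ensemble_distance_maps(prob_maps):
    """Average per-pair distance-bin probability maps from several sub-models.

    prob_maps: list of arrays of shape (L, L, n_bins), each summing to 1
    over the last (distance-bin) axis. The element-wise mean again sums
    to 1 over bins, so it remains a valid per-pair distribution.
    """
    stacked = np.stack(prob_maps, axis=0)  # (n_models, L, L, n_bins)
    return stacked.mean(axis=0)

# Toy example: two 2x2 maps with 3 distance bins each.
m1 = np.full((2, 2, 3), 1.0 / 3.0)      # uniform over bins
m2 = np.zeros((2, 2, 3))
m2[..., 0] = 1.0                         # all mass in the first bin
avg = ensemble_distance_maps([m1, m2])
```

Averaging in probability space (rather than, say, majority-voting binned labels) preserves the per-pair confidence information that Figure 5 later evaluates.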
FIGURE 1The flowchart of MSA generation for the training set.
FIGURE 2The network architecture used in this work. (A) The network used by DuetDis; (B) the reference network; (C) basic modules used in the networks, including dilated convolution.
The strategies used for training the sub-models (N1_M1/N1_M2/N1_M3/N1_M4/N1_M5 are used for DuetDis).
| Sub-models | Network | Feature set | MSA | MSA shuffle |
|---|---|---|---|---|
| N1_M1 | Net1 | FeatSet1 | MSA_All | Yes |
| N1_M2 | Net1 | FeatSet1 | MSA_Top | No |
| N1_M3 | Net1 | FeatSet2 | MSA_Top | No |
| N1_M4 | Net1 | FeatSet2 | MSA_1 | No |
| N1_M5 | Net1 | FeatSet2 | MSA_2 | No |
| N2_M1 | Net2 | FeatSet1 | MSA_All | Yes |
| N2_M2 | Net2 | FeatSet1 | MSA_Top | No |
| N2_M3 | Net2 | FeatSet2 | MSA_Top | No |
| N2_M4 | Net2 | FeatSet2 | MSA_1 | No |
| N2_M5 | Net2 | FeatSet2 | MSA_2 | No |
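The table above shows that the N1_M1 and N2_M1 sub-models are trained with MSA shuffling enabled. The paper's implementation details are not given in this record, but a minimal sketch of row-level MSA shuffling as a data-augmentation step, keeping the query sequence fixed in the first row (a common convention), could look like this; `shuffle_msa` is a hypothetical name:

```python
import random

def shuffle_msa(msa, seed=0):
    """Shuffle the non-query rows of an MSA.

    msa: list of aligned sequence strings; msa[0] is assumed to be the
    query. Returns a new list with the query first and the remaining
    rows in a pseudo-random order (deterministic for a given seed).
    """
    rng = random.Random(seed)
    rest = list(msa[1:])
    rng.shuffle(rest)
    return [msa[0]] + rest

# Toy alignment: query plus three homologs.
msa = ["QUERYSEQ", "seq_a", "seq_b", "seq_c"]
shuffled = shuffle_msa(msa, seed=1)
```

Shuffling row order perturbs order-sensitive feature extraction without changing the alignment content, which is why it can serve as cheap augmentation.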
FIGURE 3Prediction similarities between different sub-models for (A) all-range, (B) short-range, (C) mid-range, and (D) long-range contacts/distances.
The prediction precisions of N1_M1/N1_M2/N1_M3/N1_M4/N1_M5/N1_Ensemble for different sequence separations.
| Range | Method | Top-L | Top-L/2 | Top-L/5 |
|---|---|---|---|---|
| All | N1_M1 | 0.7769 | 0.8717 | 0.9206 |
| | N1_M2 | 0.7587 | 0.8475 | 0.8941 |
| | N1_M3 | 0.7491 | 0.8460 | 0.9027 |
| | N1_M4 | 0.7256 | 0.8266 | 0.8888 |
| | N1_M5 | 0.7319 | 0.8328 | 0.8942 |
| | N1_Ensemble | 0.7896 | 0.8786 | 0.9266 |
| Short | N1_M1 | 0.2955 | 0.4810 | 0.7389 |
| | N1_M2 | 0.2928 | 0.4754 | 0.7287 |
| | N1_M3 | 0.2948 | 0.4757 | 0.7374 |
| | N1_M4 | 0.2824 | 0.4588 | 0.7109 |
| | N1_M5 | 0.2947 | 0.4730 | 0.7219 |
| | N1_Ensemble | 0.2988 | 0.4918 | 0.7633 |
| Medium | N1_M1 | 0.3512 | 0.5477 | 0.7725 |
| | N1_M2 | 0.3422 | 0.5336 | 0.7514 |
| | N1_M3 | 0.3420 | 0.5329 | 0.7533 |
| | N1_M4 | 0.3306 | 0.5135 | 0.7275 |
| | N1_M5 | 0.3371 | 0.5209 | 0.7352 |
| | N1_Ensemble | 0.3537 | 0.5592 | 0.7895 |
| Long | N1_M1 | 0.6245 | 0.7696 | 0.8650 |
| | N1_M2 | 0.6062 | 0.7411 | 0.8273 |
| | N1_M3 | 0.5940 | 0.7308 | 0.8246 |
| | N1_M4 | 0.5695 | 0.7091 | 0.8088 |
| | N1_M5 | 0.5742 | 0.7120 | 0.8121 |
| | N1_Ensemble | 0.6416 | 0.7797 | 0.8626 |
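The precisions above are top-L/k precisions computed separately per sequence-separation range. A minimal sketch of that metric follows, assuming the commonly used convention that short, medium, and long range mean sequence separations of 6-11, 12-23, and >=24 residues respectively (the paper's exact cutoffs may differ):

```python
import numpy as np

def topk_precision(scores, contacts, k, sep_min, sep_max=None):
    """Precision of the k highest-scoring residue pairs whose sequence
    separation |i - j| lies within [sep_min, sep_max].

    scores:   (L, L) array of predicted contact scores (upper triangle used).
    contacts: (L, L) boolean array of true contacts.
    Common ranges: short 6-11, medium 12-23, long sep_min=24, sep_max=None.
    """
    L = scores.shape[0]
    pairs = []
    for i in range(L):
        for j in range(i + 1, L):
            sep = j - i
            if sep < sep_min or (sep_max is not None and sep > sep_max):
                continue
            pairs.append((scores[i, j], contacts[i, j]))
    pairs.sort(key=lambda p: -p[0])       # highest-scoring pairs first
    top = pairs[:k]
    return sum(c for _, c in top) / len(top)

# Toy example: 4 residues, two true contacts.
scores = np.zeros((4, 4))
scores[0, 1], scores[0, 3], scores[0, 2], scores[1, 2] = 0.9, 0.8, 0.1, 0.05
contacts = np.zeros((4, 4), dtype=bool)
contacts[0, 1] = contacts[0, 2] = True
p = topk_precision(scores, contacts, k=2, sep_min=1)
```

Here the top-2 pairs by score are (0,1) (a true contact) and (0,3) (not a contact), giving a precision of 0.5.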
The prediction precisions of N2_M1/N2_M2/N2_M3/N2_M4/N2_M5/N2_Ensemble for different sequence separations.
| Range | Method | Top-L | Top-L/2 | Top-L/5 |
|---|---|---|---|---|
| All | N2_M1 | 0.7532 | 0.8562 | 0.9103 |
| | N2_M2 | 0.7435 | 0.8390 | 0.8938 |
| | N2_M3 | 0.7148 | 0.8188 | 0.8828 |
| | N2_M4 | 0.7091 | 0.8119 | 0.8768 |
| | N2_M5 | 0.7071 | 0.8121 | 0.8790 |
| | N2_Ensemble | 0.7590 | 0.8579 | 0.9153 |
| Short | N2_M1 | 0.2864 | 0.4654 | 0.7172 |
| | N2_M2 | 0.2901 | 0.4647 | 0.7100 |
| | N2_M3 | 0.2852 | 0.4583 | 0.7014 |
| | N2_M4 | 0.2831 | 0.4547 | 0.6982 |
| | N2_M5 | 0.2825 | 0.4548 | 0.7002 |
| | N2_Ensemble | 0.3449 | 0.5396 | 0.7367 |
| Medium | N2_M1 | 0.3413 | 0.5325 | 0.7550 |
| | N2_M2 | 0.3428 | 0.5267 | 0.7395 |
| | N2_M3 | 0.3298 | 0.5082 | 0.7206 |
| | N2_M4 | 0.3281 | 0.5042 | 0.7152 |
| | N2_M5 | 0.3283 | 0.5057 | 0.7159 |
| | N2_Ensemble | 0.3449 | 0.5396 | 0.7602 |
| Long | N2_M1 | 0.6035 | 0.7460 | 0.8473 |
| | N2_M2 | 0.5997 | 0.7361 | 0.8280 |
| | N2_M3 | 0.5638 | 0.7022 | 0.8066 |
| | N2_M4 | 0.5525 | 0.6877 | 0.7913 |
| | N2_M5 | 0.5548 | 0.6917 | 0.7941 |
| | N2_Ensemble | 0.6136 | 0.7508 | 0.8473 |
FIGURE 4The overall prediction precisions for (A) all-range, (B) short-range, (C) medium-range, and (D) long-range contacts/distances.
FIGURE 5Prediction performance in terms of precision, coverage, MCC, and the corresponding standard deviation (the shaded area around the curves) with the increasing probability (score) threshold given by the predictors. The numbers under the precision curve (blue) are the numbers of proteins with predictions returned using the corresponding (score) threshold on the x-axis.
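Figure 5 tracks precision, coverage, and MCC as the prediction-score threshold increases. A minimal sketch of those three metrics at a single threshold, treating each residue pair as a binary contact prediction (`threshold_metrics` is a hypothetical helper, not the authors' code):

```python
import math

def threshold_metrics(scores, labels, thresh):
    """Precision, coverage (recall), and MCC for pairs predicted as
    contacts when score >= thresh.

    scores: flat list of per-pair prediction scores.
    labels: flat list of per-pair true contact labels (0/1).
    """
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        pred = s >= thresh
        if pred and y:
            tp += 1
        elif pred:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    coverage = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, coverage, mcc

# Toy example: four pairs, threshold 0.5.
p, c, m = threshold_metrics([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0], 0.5)
```

Sweeping `thresh` over a grid and recording the three values at each step reproduces the kind of threshold curves shown in the figure.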
FIGURE 6Prediction precisions of different methods for all-range, top-L, and top-L/5 predictions with the variation of N. The error bar is the standard deviation of all precisions (for top-L/5 predictions) in each sub-test set.