Jacob Stern, Bryce Hedelius, Olivia Fisher, Wendy M Billings, Dennis Della Corte.
Abstract
The field of protein structure prediction has recently been revolutionized by the introduction of deep learning. The current state-of-the-art tool, AlphaFold2, can predict highly accurate structures; however, its inference time is prohibitively long for applications that require the folding of hundreds of sequences. Protein structure annotations, such as amino acid distances, can be predicted at a higher speed with existing tools, such as the ProSPr network. Here, we report on important updates to the ProSPr network, its performance in the recent Critical Assessment of Techniques for Protein Structure Prediction (CASP14) competition, and an evaluation of how its accuracy depends on sequence length and multiple sequence alignment depth. We also provide a detailed description of the architecture and the training process, accompanied by reusable code. This work is anticipated to provide a solid foundation for the further development of protein distance prediction tools.
Keywords: CASP; ProSPr; alphafold; contact; dataset; deep learning; distance; prediction; protein; retrainable
Year: 2021 PMID: 34884640 PMCID: PMC8657919 DOI: 10.3390/ijms222312835
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Two example targets from the CASP14 test set. Left: experimental structures from which labels were derived. Middle: contact maps predicted with the ProSPr ensemble above the diagonal; labels below. Right: visualization of auxiliary loss predictions on top, with labels at the bottom. Accessible surface area (ASA), torsion angles (PHI, PSI), secondary structure (SS).
CASP14 contact accuracies (see text for definition).
| ProSPr Model | Short (%) | Mid (%) | Long (%) | Average (%) |
|---|---|---|---|---|
| A | 81.09 | 69.52 | 41.63 | 64.08 |
| B | 81.15 | 69.29 | 42.41 | 64.28 |
| C | 81.94 | 69.97 | 43.59 | 65.17 |
| Ensemble | | | | |
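The Average column in the table above matches the unweighted mean of the short-, mid-, and long-range values for models A, B, and C. The short check below reproduces it; the function and variable names are illustrative and not taken from the ProSPr code base.

```python
# Short-, mid-, and long-range contact accuracies (%) copied from the table above.
accuracies = {
    "A": (81.09, 69.52, 41.63),
    "B": (81.15, 69.29, 42.41),
    "C": (81.94, 69.97, 43.59),
}


def average_accuracy(short, mid, long_range):
    """Unweighted mean of the three range accuracies, rounded to 2 decimals."""
    return round((short + mid + long_range) / 3, 2)


# Hypothetical helper output: one average per model.
averages = {model: average_accuracy(*vals) for model, vals in accuracies.items()}

for model, avg in averages.items():
    print(f"Model {model}: {avg:.2f}%")
```

The same arithmetic could be applied to the per-target ensemble values, although whether the published ensemble row was computed that way is not stated in this record.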
ProSPr ensemble contact accuracies (see text for definition).
| Target | Short | Mid | Long |
|---|---|---|---|
| T1045s2 | 0.833 | 0.924 | 0.694 |
| T1046s1 | 1.000 | 1.000 | 0.536 |
| T1046s2 | 0.892 | 0.574 | 0.303 |
| T1047s1 | 0.907 | 0.985 | 0.639 |
| T1047s2 | 1.000 | 0.983 | 0.852 |
| T1060s2 | 0.857 | 0.575 | 0.282 |
| T1060s3 | 0.976 | 0.955 | 0.793 |
| T1065s1 | 1.000 | 0.973 | 0.518 |
| T1065s2 | 1.000 | 1.000 | 0.870 |
| T1024 | 1.000 | 1.000 | 0.809 |
| T1026 | 0.750 | 0.425 | 0.494 |
| T1027 | 0.485 | 0.278 | 0.054 |
| T1029 | 0.891 | 0.818 | 0.220 |
| T1030 | 0.804 | 0.792 | 0.333 |
| T1031 | 0.686 | 0.457 | 0.105 |
| T1032 | 0.889 | 0.851 | 0.580 |
| T1033 | 0.750 | 0.316 | 0.216 |
| T1034 | 0.988 | 0.874 | 0.885 |
| T1035 | 0.412 | 0.080 | 0.000 |
| T1037 | 0.690 | 0.455 | 0.030 |
| T1038 | 0.720 | 0.538 | 0.407 |
| T1039 | 0.269 | 0.000 | 0.007 |
| T1040 | 0.318 | 0.222 | 0.027 |
| T1041 | 0.644 | 0.357 | 0.021 |
| T1042 | 0.487 | 0.441 | 0.058 |
| T1043 | 0.431 | 0.216 | 0.014 |
| T1049 | 1.000 | 0.939 | 0.440 |
| T1050 | 0.964 | 0.821 | 0.705 |
| T1052 | 0.728 | 0.600 | 0.417 |
| T1053 | 0.796 | 0.521 | 0.093 |
| T1054 | 1.000 | 1.000 | 0.710 |
| T1055 | 0.932 | 0.860 | 0.200 |
| T1056 | 0.823 | 0.829 | 0.661 |
| T1057 | 1.000 | 0.987 | 0.815 |
| T1058 | 0.821 | 0.678 | 0.678 |
| T1061 | 0.807 | 0.687 | 0.511 |
| T1064 | 0.615 | 0.500 | 0.094 |
| T1067 | 0.865 | 0.824 | 0.466 |
| T1068 | 0.926 | 0.813 | 0.204 |
| T1070 | 0.941 | 0.707 | 0.579 |
| T1073 | 1.000 | 1.000 | 1.000 |
| T1074 | 0.845 | 0.700 | 0.328 |
| T1076 | 0.970 | 0.947 | 0.911 |
| T1078 | 0.984 | 0.892 | 0.587 |
| T1079 | 0.956 | 0.964 | 0.739 |
| T1082 | 0.615 | 0.636 | 0.164 |
| T1083 | 0.909 | 0.783 | 0.909 |
| T1084 | 1.000 | 1.000 | 1.000 |
| T1087 | 1.000 | 0.810 | 0.714 |
| T1088 | 0.954 | 1.000 | 0.778 |
| T1089 | 0.972 | 0.813 | 0.624 |
| T1090 | 0.977 | 0.870 | 0.399 |
| T1091 | 0.832 | 0.571 | 0.071 |
| T1092 | 0.704 | 0.782 | 0.382 |
| T1093 | 0.673 | 0.519 | 0.109 |
| T1094 | 0.649 | 0.580 | 0.144 |
| T1095 | 0.722 | 0.711 | 0.448 |
| T1096 | 0.766 | 0.421 | 0.098 |
| T1099 | 0.800 | 0.375 | 0.101 |
| T1100 | 0.883 | 0.820 | 0.258 |
| T1101 | 0.960 | 0.988 | 0.783 |
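The captions defer the definition of contact accuracy to the main text, which is not reproduced in this record. As a rough illustration only, the sketch below assumes the common CASP-style convention: two residues are in contact when their distance is below 8 Å, the ranges are sequence separations of 6 to 11 (short), 12 to 23 (mid), and 24 or more (long), and the score is the fraction of true contacts among the `top_k` most confident predictions in a range. The thresholds, the `top_k` parameter, and all names are assumptions, not the authors' definition.

```python
import math

# Assumed CASP-style sequence-separation ranges (inclusive bounds).
RANGES = {"short": (6, 11), "mid": (12, 23), "long": (24, math.inf)}


def contact_precision(pred_prob, true_dist, sep_range, top_k, threshold=8.0):
    """Fraction of the top_k highest-scoring pairs in a range that are true contacts.

    pred_prob and true_dist are square nested lists indexed as [i][j].
    """
    lo, hi = sep_range
    n = len(true_dist)
    # Collect residue pairs (i < j) whose sequence separation lies in the range.
    pairs = [(i, j) for i in range(n) for j in range(i + lo, n) if j - i <= hi]
    # Rank by predicted contact probability and keep the most confident pairs.
    ranked = sorted(pairs, key=lambda p: pred_prob[p[0]][p[1]], reverse=True)[:top_k]
    if not ranked:
        return math.nan
    hits = sum(true_dist[i][j] < threshold for i, j in ranked)
    return hits / len(ranked)


# Toy data: distance equals sequence separation, confidence decays with distance.
n = 30
toy_dist = [[float(abs(i - j)) for j in range(n)] for i in range(n)]
toy_prob = [[1.0 / (1.0 + d) for d in row] for row in toy_dist]

short_precision = contact_precision(toy_prob, toy_dist, RANGES["short"], top_k=5)
long_precision = contact_precision(toy_prob, toy_dist, RANGES["long"], top_k=5)
```

In the toy data every top-ranked short-range pair lies below the threshold and no long-range pair does, so the two scores sit at opposite extremes, mirroring the general pattern in the table where long-range accuracy is usually the lowest.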
Figure 2. Left: correlation analysis of average accuracy (see text for definition) for CASP14 targets with MSAs of fewer than 400 sequences. Middle: correlation analysis for MSAs deeper than 400 sequences. Right: correlation analysis of average accuracy and target amino acid sequence length.
Figure 3. ProSPr network architecture and model architecture.
Figure 4. Detailed view of the ProSPr data pipeline. For training, a protein structure in the PDB file format is used to create inputs and labels. For inference, a multiple sequence alignment in the a3m file format is expected.
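For readers unfamiliar with the a3m format mentioned in the caption: it is a FASTA-like alignment format in which lowercase letters mark insertions relative to the query and `-` marks gaps, so removing the lowercase letters yields rows of equal length. The sketch below is a minimal, hypothetical reader that also reports MSA depth, the quantity analyzed in Figure 2. It is not the ProSPr loader, and `read_a3m` is an illustrative name.

```python
def read_a3m(text):
    """Return a list of (header, aligned_sequence) pairs from a3m-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional comment lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Drop lowercase insertion columns so each row matches the query length.
            chunks.append("".join(c for c in line if not c.islower()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Tiny made-up alignment, used only to exercise the reader.
example = """>query
MKTAYIA
>hit1
MKtaTAY-A
>hit2
-KTAYIA
"""

msa = read_a3m(example)
depth = len(msa)            # number of sequences in the alignment
length = len(msa[0][1])     # query length after removing insertions
```

A depth computed this way could be compared against the 400-sequence split used in Figure 2, assuming depth there means the number of aligned sequences.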