Abstract
As deep learning algorithms drive the progress in protein structure prediction, much remains to be studied at this merging superhighway of deep learning and protein structure prediction. Recent findings show that inter-residue distance prediction, a more granular version of the well-known contact prediction problem, is key to predicting accurate models. However, deep learning methods that predict these distances are still in the early stages of their development. To advance these methods and develop other novel methods, a need exists for a small and representative dataset packaged for faster development and testing. In this work, we introduce protein distance net (PDNET), a framework that consists of one such representative dataset along with the scripts for training and testing deep learning methods. The framework also includes all the scripts that were used to curate the dataset and to generate the input features and distance maps. Deep learning models can also be trained and tested in a web browser using free platforms such as Google Colab. We discuss how PDNET can be used to predict contacts, distance intervals, and real-valued distances.
Year: 2020 PMID: 32770096 PMCID: PMC7414848 DOI: 10.1038/s41598-020-70181-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Comparison of the protein inter-residue distance prediction problem with the ‘depth prediction from a single image problem’ in computer vision. In both problems, the input to the deep learning model is a volume and the output is a 2D matrix. The depth predictions for this specific image (top right corner) were obtained with a pre-trained fully convolutional residual network (FCRN)[11].
Figure 2. Example of long-range, medium-range, short-range, and local distances in a protein distance map. The distances between residue pairs 42–47, 42–52, 42–62, and 42–67 are examples of local, short-range, medium-range, and long-range distances, respectively (left). In the heatmap plot (right), the sequence separation ranges for long-range, medium-range, short-range, and local distances are [24+], [12, 23], [6, 11], and [0, 5], respectively. The labels on the x and y axes refer to the residue index in the corresponding protein sequence. The diagonal line shows that each residue pair i-i has a zero distance.
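The sequence separation ranges in the Figure 2 caption can be written as a small classifier. This is a minimal sketch rather than code from the PDNET scripts; the function name `separation_class` is hypothetical, and only the four ranges and the four example pairs come from the caption.

```python
def separation_class(i: int, j: int) -> str:
    """Classify a residue pair by its sequence separation |i - j|.

    Ranges follow the Figure 2 caption:
    local [0, 5], short-range [6, 11], medium-range [12, 23], long-range [24+].
    """
    sep = abs(i - j)
    if sep >= 24:
        return "long-range"
    if sep >= 12:
        return "medium-range"
    if sep >= 6:
        return "short-range"
    return "local"


# The four example pairs named in the caption.
pairs = [(42, 47), (42, 52), (42, 62), (42, 67)]
labels = [separation_class(i, j) for i, j in pairs]
print(labels)
```

The classification depends only on the index difference, so it is symmetric in `i` and `j` and applies equally to either triangle of the distance map.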
Metrics used to evaluate predicted distances and/or contacts.
| Prediction | Metric | Description |
|---|---|---|
| Distances | MAE | MAE of the predicted long-range distances with corresponding true distances shorter than 8 Å |
| Distances | MAE | MAE of the predicted long-range distances with corresponding true distances shorter than 12 Å |
| Contacts | Precision | Precision of top L long-range predicted contacts |
| Contacts | Precision | Precision of top NC long-range predicted contacts |
L stands for the length of the protein sequence in the native (true) structure, and NC is the number of true contacts in the structure.
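The metrics in the table above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the evaluation code shipped with PDNET: the helper names are hypothetical, long-range is taken as a sequence separation of at least 24, contacts are taken as true distances below 8 Å, and pairs are ranked by ascending predicted distance (a contact predictor would instead rank by descending probability).

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
LONG_RANGE_SEP = 24  # assumed minimum |i - j| for a long-range pair


def long_range_pairs(n: int) -> List[Tuple[int, int]]:
    """All upper-triangle pairs (i, j) with j - i >= LONG_RANGE_SEP."""
    return [(i, j) for i in range(n) for j in range(i + LONG_RANGE_SEP, n)]


def mae_long_range(pred: Matrix, true: Matrix, max_true_dist: float) -> float:
    """MAE over long-range pairs whose true distance is below max_true_dist."""
    errors = [abs(pred[i][j] - true[i][j])
              for i, j in long_range_pairs(len(true))
              if true[i][j] < max_true_dist]
    return sum(errors) / len(errors) if errors else float("nan")


def precision_top_n(pred: Matrix, true: Matrix, top_n: int,
                    contact_cutoff: float = 8.0) -> float:
    """Fraction of the top_n shortest predicted long-range pairs that are
    true contacts. Pass top_n = L or top_n = NC as needed."""
    ranked = sorted(long_range_pairs(len(true)), key=lambda p: pred[p[0]][p[1]])
    selected = ranked[:top_n]
    if not selected:
        return float("nan")
    hits = sum(1 for i, j in selected if true[i][j] < contact_cutoff)
    return hits / len(selected)


# Toy example: 26 residues give exactly three long-range pairs.
n = 26
true = [[20.0] * n for _ in range(n)]
pred = [[20.0] * n for _ in range(n)]
true[0][24], true[0][25], true[1][25] = 6.0, 10.0, 7.0
pred[0][24], pred[0][25], pred[1][25] = 7.0, 9.0, 12.0

mae8 = mae_long_range(pred, true, 8.0)    # pairs (0,24) and (1,25)
mae12 = mae_long_range(pred, true, 12.0)  # all three pairs
prec2 = precision_top_n(pred, true, top_n=2)
print(mae8, mae12, prec2)
```

Only the upper triangle is read, so symmetric maps are not double counted.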
Figure 3. Reciprocating the distance matrix so that larger numbers represent shorter distances in the input distance matrix. The input distance matrix is shown on the left and the reciprocated distance matrix is on the right. To avoid division-by-zero errors, all the diagonal entries in the input distance matrix are replaced by the mean of their neighbors.
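The reciprocation step described in the Figure 3 caption might look like the following sketch. Several details are assumptions, since the caption does not spell them out: "neighbors" is read here as the entries immediately left and right of the diagonal in the same row, the reciprocal is taken as `scale / d` with a hypothetical `scale` parameter defaulting to 1, and all off-diagonal distances are assumed to be strictly positive.

```python
from typing import List, Sequence


def reciprocate(dist: Sequence[Sequence[float]], scale: float = 1.0) -> List[List[float]]:
    """Return scale / d for every entry, after filling the zero diagonal.

    Each diagonal entry is replaced by the mean of its in-row neighbors
    (assumed interpretation of "mean of their neighbors").
    """
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two residues to fill the diagonal")
    filled = [list(row) for row in dist]  # copy so the input is not modified
    for i in range(n):
        neighbors = []
        if i > 0:
            neighbors.append(dist[i][i - 1])
        if i < n - 1:
            neighbors.append(dist[i][i + 1])
        filled[i][i] = sum(neighbors) / len(neighbors)
    return [[scale / d for d in row] for row in filled]


# Toy 3-residue distance matrix with a zero diagonal.
dist = [
    [0.0, 4.0, 8.0],
    [4.0, 0.0, 5.0],
    [8.0, 5.0, 0.0],
]
rec = reciprocate(dist)
print(rec)
```

After the transform, shorter distances map to larger values, and the filled diagonal keeps every entry finite.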
MAE and precision of the contact prediction method (PDNET-Contact), the distance prediction method (PDNET-Distance), and the binned distance prediction method (PDNET-Binned) on the 150 proteins in the test set.
| Methods | MAE (Å) | Precision (%) | Precision (%) |
|---|---|---|---|
| PDNET-Contact | – | 69.5 | 61.1 |
| PDNET-Distance | 4.1 | 67.5 | 59.1 |
| PDNET-Binned | 3.8 | 67.8 | 60.5 |
| DeepCov | – | 55.1 | – |
| DEEPCON | – | 63.4 | 55.8 |
All MAEs are in Å, and the two precision values for the three PDNET methods are shown as percentages. The performance of the DeepCov and DEEPCON methods, which were trained on the same dataset, is also reported.
The DeepCov value is as reported in the DeepCov[16] paper.
Figure 4. The true distance map, true contact map, and native structure of ‘1a6m-A’ in the test set are shown in the top row. The outputs of PDNET-Distance, PDNET-Contact, and PDNET-Binned are shown in the bottom row.
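The three kinds of output in Figure 4 correspond to three encodings of the same underlying distance map. The sketch below shows one way to derive a contact map and a binned map from real-valued distances. It is illustrative only: the 8 Å contact cutoff is the conventional choice and is assumed here, the bin edges are hypothetical placeholders rather than the intervals used by PDNET-Binned, and both function names are made up for this example.

```python
import bisect
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def to_contact_map(dist: Matrix, cutoff: float = 8.0) -> List[List[int]]:
    """1 where the distance is below the cutoff (assumed 8 Å), else 0."""
    return [[1 if d < cutoff else 0 for d in row] for row in dist]


def to_binned_map(dist: Matrix, edges: Sequence[float]) -> List[List[int]]:
    """Replace each distance by the index of its interval.

    With sorted edges e0 < e1 < ..., bin 0 is [0, e0), bin k is
    [e(k-1), ek), and the last bin holds everything >= the last edge.
    """
    return [[bisect.bisect_right(edges, d) for d in row] for row in dist]


# Toy 3-residue distance map (in Å).
dist = [
    [0.0, 6.0, 15.0],
    [6.0, 0.0, 10.0],
    [15.0, 10.0, 0.0],
]
hypothetical_edges = [4.0, 8.0, 12.0]  # placeholder bin edges, not from the paper

contacts = to_contact_map(dist)
bins = to_binned_map(dist, hypothetical_edges)
print(contacts)
print(bins)
```

A contact map is the special case of a binned map with a single edge at the cutoff, with the labels inverted.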
Figure 5. Comparison of true long-range distances and the distances predicted by our PDNET-Distance method for two random examples from the test dataset: ‘1a6m-A’ (top row) and ‘1gmi-A’ (bottom row). True distance maps and predicted distance maps are shown in the first and second columns, respectively. The plots in the last column visualize the comparison of predicted long-range distances and the corresponding true distances. In these two plots, the top L long-range predictions are shown in red and all other long-range distances are shown in blue.
Summary of the performance of PDNET-Contact and PDNET-Distance on the 9 CASP13 free-modeling domains (for which the native structures are publicly available) and on the 131 CAMEO hard-targets.
| Dataset | Methods | Precision | Precision |
|---|---|---|---|
| CASP13 FM | PDNET-Contact | 38.8 | 38.5 |
| CASP13 FM | PDNET-Distance | 32.3 | 30.3 |
| CASP13 FM | Top CASP13 group | 45.0 | 45.7 |
| CAMEO-HARD | PDNET-Contact | 48.3 | 45.1 |
| CAMEO-HARD | PDNET-Distance | 46.7 | 43.7 |
| CAMEO-HARD | trRosetta | 48.0 | – |
The performance of CASP13’s top group and of the trRosetta method on the CAMEO-HARD dataset is provided as a reference. Results of the trRosetta method are taken from Yang et al.[18], since the predicted contacts are not publicly available.