Nasrin Akhter, Gopinath Chennupati, Kazi Lutful Kabir, Hristo Djidjev, Amarda Shehu.
Abstract
The energy landscape that organizes microstates of a molecular system and governs the underlying molecular dynamics exposes the relationship between molecular form/structure, changes to form, and biological activity or function in the cell. However, several challenges stand in the way of leveraging energy landscapes for relating structure and structural dynamics to function. Energy landscapes are high-dimensional, multi-modal, and often overly rugged. Deep wells or basins in them do not always correspond to stable structural states but are instead the result of inherent inaccuracies in semi-empirical molecular energy functions. Due to these challenges, energetics is typically ignored in computational approaches addressing long-standing central questions in computational biology, such as protein decoy selection. In the latter, the goal is to identify, among a possibly large number of computationally generated three-dimensional structures of a protein, those structures that are biologically active/native. In recent work, we have refocused our attention on the protein energy landscape and its role in helping us to advance decoy selection. Here, we summarize some of our successes so far in this direction via unsupervised learning. More importantly, we further advance the argument that the energy landscape holds valuable information to aid and advance the state of protein decoy selection via novel machine learning methodologies that leverage supervised learning. Our focus in this article is on decoy selection for the purpose of a rigorous, quantitative evaluation of how leveraging protein energy landscapes advances an important problem in protein modeling. However, the ideas and concepts presented here are generally useful to make discoveries in studies aiming to relate molecular structure and structural dynamics to function.
Keywords: basin; decoy selection; energy landscape; machine learning; model quality assessment; purity
Year: 2019 PMID: 31615116 PMCID: PMC6843838 DOI: 10.3390/biom9100607
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. Illustration of the pipeline of ML-Select followed by a weighted model, used to first select pure basins and then select an individual decoy for prediction.
Testing dataset (* denotes proteins with a predominant fold and a short helix). Protein Data Bank (PDB) entries of corresponding known native structures are shown in Column 3. Folds of native structures are shown in Column 4. The length of each sequence (number of amino acids) is shown in Column 5. The size of the Rosetta-generated decoy dataset (number of decoys) is shown in Column 6. The lowest root-mean-squared deviation (RMSD) to the known native structure in each dataset (min_dist) is shown in Column 7. As described above, based on this proximity, the datasets are categorized into easy, medium-difficulty, and hard.
| Difficulty | # | PDB Entry | Fold | Length | No. of decoys | min_dist (Å) |
|---|---|---|---|---|---|---|
| Easy | 1 | 1dtd(B) | | 61 | 58,745 | |
| | 2 | 1wap(A) | | 68 | 68,000 | |
| | 3 | 1hz6(A) | | 64 | 60,000 | |
| | 4 | 1tig | | 88 | 60,000 | |
| | 5 | 1dtj(A) | | 74 | 60,500 | |
| Medium | 6 | 1bq9 | | 53 | 61,000 | |
| | 7 | 1ail | | 70 | 58,491 | |
| | 8 | 1c8c(A) | | 64 | 65,000 | |
| | 9 | 1fwp | | 69 | 51,724 | |
| | 10 | 1sap | | 66 | 66,000 | |
| Hard | 11 | 1hhp | | 99 | 60,000 | |
| | 12 | 2ezk | | 93 | 54,626 | |
| | 13 | 1aoy | | 78 | 57,000 | |
| | 14 | 2h5n(D) | | 123 | 54,795 | |
| | 15 | 1isu(A) | | 62 | 60,000 | |
| | 16 | 1cc5 | | 83 | 55,000 | |
| | 17 | 1aly | | 146 | 53,000 | |
Dataset collected from the Critical Assessment of protein Structure Prediction (CASP) data archive. The target IDs are shown in Column 2. The length of each sequence (number of amino acids) is shown in Column 3. The size of the Rosetta-generated decoy dataset (number of decoys) is shown in Column 4. The lowest RMSD to the known native structure in each dataset (min_dist) is shown in Column 5.
| # | Target ID | Length | No. of decoys | min_dist (Å) |
|---|---|---|---|---|
| 1 | T1008-D1 | 77 | 55,000 | |
| 2 | T0886-D1 | 69 | 55,000 | |
| 3 | T0953s1D1 | 67 | 55,000 | |
| 4 | T0960-D2 | 84 | 55,000 | |
| 5 | T0898-D2 | 55 | 43,435 | |
| 6 | T0892D2 | 110 | 36,860 | |
| 7 | T0953s2D3 | 77 | 55,000 | |
Figure 2. Visualization of two supervised and two unsupervised decoy selection strategies. The y-axis tracks the difference between the purity of the top basin predicted by each method and the purity of the top (largest) cluster obtained by KMeans-Select. The x-axis tracks the PDB entry ID of each target protein. Color-coding is used to distinguish the different methods under comparison. The left, middle, and right panels show results for the easy, medium, and hard datasets, respectively.
Figure 3. Visualization of two supervised and two unsupervised decoy selection strategies on the Critical Assessment of protein Structure Prediction (CASP) targets. The y-axis tracks the difference between the purity of the top basin predicted by each method and the purity of the top (largest) cluster obtained by KMeans-Select. The x-axis tracks the Protein Data Bank entry ID of each target protein. Color-coding is used to distinguish the different methods under comparison.
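The quantity on the y-axis of Figures 2 and 3 can be sketched in a few lines. This excerpt does not define purity, so the sketch below assumes, purely for illustration, that the purity of a group of decoys is the fraction of its members whose RMSD to the native structure is at most a chosen threshold. The function names, the threshold, and the toy values are all hypothetical.

```python
from collections import Counter


def purity(rmsds, threshold):
    """Fraction of decoys with RMSD <= threshold (assumed definition of purity)."""
    if not rmsds:
        raise ValueError("rmsds must be non-empty")
    return sum(1 for r in rmsds if r <= threshold) / len(rmsds)


def largest_cluster_rmsds(all_rmsds, cluster_labels):
    """RMSDs of the decoys that belong to the most populated cluster."""
    top_label = Counter(cluster_labels).most_common(1)[0][0]
    return [r for r, lab in zip(all_rmsds, cluster_labels) if lab == top_label]


def purity_difference(basin_rmsds, all_rmsds, cluster_labels, threshold):
    """Purity of a method's top basin minus purity of the largest cluster."""
    baseline = purity(largest_cluster_rmsds(all_rmsds, cluster_labels), threshold)
    return purity(basin_rmsds, threshold) - baseline


# Toy values for illustration only (RMSDs in Å); not taken from the paper.
all_rmsds = [1.0, 2.0, 5.0, 6.0, 1.5, 7.0]
cluster_labels = [0, 0, 0, 0, 1, 1]      # cluster 0 is the largest
basin_rmsds = [1.0, 1.5, 2.0, 6.0]       # decoys in a method's top basin
diff = purity_difference(basin_rmsds, all_rmsds, cluster_labels, threshold=2.5)
```

A positive difference means the method's top basin is purer than the largest cluster under the assumed definition; a negative one means the clustering baseline did better.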
Columns 2–4 relate the loss obtained from Weighted-Decoy-Select, Weighted-Decoy-Select, and Random-Decoy-Select, respectively. Columns 5–7 show the median RMSD (over all decoys in a dataset) from the known native structure, the percentage of decoys in each dataset with RMSDs less than 3 Å from the known native structure, and the percentage of decoys with RMSDs less than Å from the best decoy (closest to the known native structure). The last seven rows show results for the CASP targets. Lowest loss per target is highlighted in bold.
| Targets | Weighted-Decoy-Select loss (Å) | Weighted-Decoy-Select loss (Å) | Random-Decoy-Select loss (Å) | Median RMSD (Å) | < 3 Å (%) | < min_dist + |
|---|---|---|---|---|---|---|
| 1hz6(A) | | | | | | |
| 1dtj(A) | | | | | | |
| 1tig | | | | | | |
| 1dtd(B) | | | | | | |
| 1wap(A) | | | | | | |
| 1ail | | | | | | |
| 1bq9 | | | | | | |
| 1sap | | | | | | |
| 1fwp | | | | | | |
| 1c8c(A) | | | | | | |
| 2ezk | | | | | | |
| 1aoy | | | | | | |
| 1cc5 | | | | | | |
| 1isu(A) | | | | | | |
| 2h5n(D) | | | | | | |
| 1hhp | | | | | | |
| 1aly | | | | | | |
| T1008-D1 | | | | | | |
| T0886-D1 | | | | | | |
| T0953s1D1 | | | | | | |
| T0960-D2 | | | | | | |
| T0898-D2 | | | | | | |
| T0892D2 | | | | | | |
| T0953s2D3 | | | | | | |
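Two of the per-dataset statistics described in the caption of the table above, the median RMSD over all decoys and the percentage of decoys with RMSD below 3 Å from the native structure, can be computed directly from a list of per-decoy RMSDs. The following is a minimal sketch; the function name and the example values are hypothetical and do not come from the paper.

```python
from statistics import median


def summarize_rmsds(rmsds, threshold=3.0):
    """Return (median RMSD, percentage of decoys with RMSD strictly below threshold)."""
    if not rmsds:
        raise ValueError("rmsds must be non-empty")
    n_below = sum(1 for r in rmsds if r < threshold)
    return median(rmsds), 100.0 * n_below / len(rmsds)


# Hypothetical per-decoy RMSDs to the native structure, in Å.
example_rmsds = [1.2, 2.8, 3.0, 4.5, 6.1]
median_rmsd, pct_below_3 = summarize_rmsds(example_rmsds)
```

The strict inequality mirrors the caption's "less than 3 Å", so a decoy at exactly 3.0 Å is not counted.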
Columns 2–6 relate the template modeling score (TM-score) loss and the global distance test total score (GDT-TS) loss obtained from Weighted-Decoy-Select, Weighted-Decoy-Select, MUFOLD-CL, Qprob, and SBROD, respectively. Lowest loss per CASP target is highlighted in bold.
Each entry: TM-Score Loss, GDT-TS Loss

| Target ID | Weighted-Decoy-Select | Weighted-Decoy-Select | MUFOLD-CL | Qprob | SBROD |
|---|---|---|---|---|---|
| T1008-D1 | | | | | |
| T0886-D1 | | | | | |
| T0953s1D1 | | | | | |
| T0960-D2 | | | | | |
| T0898-D2 | | | | | |
| T0892D2 | | | | | |
| T0953s2D3 | | | | | |
Figure 4. Distribution of global distance test total score (GDT-TS) values for one medium-difficulty protein (PDB entry 1bq9) and two hard proteins (PDB entries 1aoy and 1isu(A)) from the non-CASP targets, and for T0953s1D1 from the CASP targets. The x-axis tracks GDT-TS scores. The y-axis tracks the frequency of decoys with the corresponding GDT-TS scores. The frequency has been scaled to a range [1–10] by dividing all frequencies by the minimum frequency.
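The frequency scaling mentioned in the Figure 4 caption can be sketched as a histogram whose bin counts are divided by the smallest count. The sketch below makes several assumptions that the caption does not state: GDT-TS scores lie in [0, 1], ten equal-width bins are used, and the minimum is taken over non-empty bins only (to avoid dividing by zero). The function name and the toy scores are hypothetical.

```python
def scaled_histogram(scores, n_bins=10, lo=0.0, hi=1.0):
    """Bin scores into n_bins equal-width bins and divide counts by the smallest non-zero count."""
    if not scores:
        raise ValueError("scores must be non-empty")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        if not lo <= s <= hi:
            raise ValueError(f"score {s} is outside [{lo}, {hi}]")
        # A score equal to hi is placed in the last bin.
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Assumption: the minimum is taken over non-empty bins only.
    min_count = min(c for c in counts if c > 0)
    return [c / min_count for c in counts]


# Toy GDT-TS scores for illustration only; not taken from the paper.
toy_scores = [0.05, 0.15, 0.15, 0.25, 0.25, 0.25, 0.25]
scaled = scaled_histogram(toy_scores)
```

With this scaling the least populated non-empty bin has height 1, and every other bin is expressed as a multiple of it; whether the result stays within [1–10] depends on the data.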