Chen Keasar1, Liam J McGuffin2, Björn Wallner3, Gaurav Chopra4,5,6,7,8, Badri Adhikari9, Debswapna Bhattacharya9,10, Lauren Blake11, Leandro Oliveira Bortot12, Renzhi Cao9, B K Dhanasekaran13, Itzhel Dimas11, Rodrigo Antonio Faccioli14, Eshel Faraggi15,16,17, Robert Ganzynkowicz18, Sambit Ghosh13, Soma Ghosh13, Artur Giełdoń18, Lukasz Golon18, Yi He19, Lim Heo20, Jie Hou9, Main Khan21, Firas Khatib21, George A Khoury22, Chris Kieslich23, David E Kim24,25, Pawel Krupa18, Gyu Rie Lee20, Hongbo Li9,26,27, Jilong Li9, Agnieszka Lipska18, Adam Liwo18, Ali Hassan A Maghrabi2, Milot Mirdita28, Shokoufeh Mirzaei11,29, Magdalena A Mozolewska18, Melis Onel30, Sergey Ovchinnikov24,31, Anand Shah21, Utkarsh Shah30, Tomer Sidi1, Adam K Sieradzan18, Magdalena Ślusarz18, Rafal Ślusarz18, James Smadbeck22, Phanourios Tamamis23,30, Nicholas Trieber21, Tomasz Wirecki18, Yanping Yin32, Yang Zhang33, Jaume Bacardit34, Maciej Baranowski35, Nicholas Chapman36, Seth Cooper37, Alexandre Defelicibus14, Jeff Flatten36, Brian Koepnick24, Zoran Popović36, Bartlomiej Zaborowski18, David Baker24,25,36, Jianlin Cheng9, Cezary Czaplewski18, Alexandre Cláudio Botazzo Delbem14, Christodoulos Floudas23, Andrzej Kloczkowski18, Stanislaw Ołdziej35, Michael Levitt38, Harold Scheraga32, Chaok Seok20, Johannes Söding28, Saraswathi Vishveshwara13, Dong Xu9,27, Silvia N Crivelli39,40.
Abstract
Every two years, groups worldwide participate in the Critical Assessment of Protein Structure Prediction (CASP) experiment to blindly test the strengths and weaknesses of their computational methods. CASP has significantly advanced the field, but many hurdles remain that may require new ideas and collaborations. In 2012, a web-based effort called WeFold was initiated to promote collaboration within the CASP community and to attract researchers from other fields to contribute new ideas to CASP. Members of the WeFold coopetition (cooperation and competition) participated in CASP as individual teams, but also shared components of their methods to create hybrid pipelines and actively contributed to this effort. We assert that the scale and diversity of these integrative prediction pipelines could not have been achieved by any individual lab, or even by a collaboration among a few partners. The models contributed by the participating groups and generated by the pipelines are publicly available at the WeFold website, providing a wealth of data that remains to be tapped. Here, we analyze the results of the 2014 and 2016 pipelines, showing improvements according to the CASP assessment as well as areas that require further adjustment and research.
Year: 2018 PMID: 29967418 PMCID: PMC6028396 DOI: 10.1038/s41598-018-26812-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. A schematic depiction of the multi-step, multi-path information flow of protein structure prediction. Rounded rectangles represent information and plain rectangles represent basic tasks, each of which is an open computational problem. A prediction process starts with a protein sequence, passes at least once through a set of decoys (structural models of proteins), and ends with a short list, ideally a single one, of high-scoring decoys. The paths in this graph are not mutually exclusive.
Figure 2. An illustration of the WeFold pipeline concept: a schematic depiction of five WeFold3 pipelines, which share their first components and differ in their final stages. The graph representation and colors follow Fig. 1. A complete list of all WeFold2 and WeFold3 pipelines is given in Table 1 and in the main text.
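In software terms, each WeFold pipeline is a composition of stages, where every stage maps a set of decoys to a (usually smaller) set, ending ideally with a short ranked list. Below is a minimal sketch of this concept; the decoy representation and the stage names (`energy_filter`, `top_k`) are illustrative placeholders, not actual WeFold components:

```python
from typing import Callable, Dict, List

# Illustrative decoy representation: an id plus a score where lower is better
# (e.g. an energy-like value). Real decoys would also carry coordinates.
Decoy = Dict[str, object]
Stage = Callable[[List[Decoy]], List[Decoy]]

def run_pipeline(decoys: List[Decoy], stages: List[Stage]) -> List[Decoy]:
    """Apply each stage in order; every stage maps a decoy set to a
    (usually smaller) decoy set."""
    for stage in stages:
        decoys = stage(decoys)
    return decoys

def energy_filter(cutoff: float) -> Stage:
    """Keep only decoys whose score is at or below the cutoff."""
    return lambda ds: [d for d in ds if d["score"] <= cutoff]

def top_k(k: int) -> Stage:
    """Keep the k lowest-scoring decoys, best first."""
    return lambda ds: sorted(ds, key=lambda d: d["score"])[:k]
```

Different pipelines then amount to different stage lists over the same shared early components, which is exactly the branching shown in the figure.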
Pipeline components in WeFold2 and WeFold3, and the groups that contributed them.
| Contribution | WeFold2 | WeFold3 | Group |
|---|---|---|---|
| Alignment | HHPred | | Söding |
| Sampling | Foldit | | Baker&Khatib Groups |
| | | RosettaServer | Baker Group |
| | UNRES | UNRES | Scheraga&Gdansk Groups |
| | Zhang | | Zhang Group |
| Contact Predictions | GREMLIN | GREMLIN | Baker Group |
| | Floudas | Floudas | Floudas Group |
| | ICOS | | Jaume Bacardit |
| Secondary Structure Pred. | conSSert | conSSert | Floudas Group |
| Clustering | Wallner | | Björn Wallner |
| | Minimum Variance | Minimum Variance | Scheraga&Gdansk Groups |
| Filtering | Wallner | ProQ2 | Björn Wallner |
| Refinement | Delbem | | Delbem Group |
| | KoBaMIN | | Levitt Group |
| | GalaxyRefine | GalaxyRefine | Seok Group |
| | PTIGRESS | TIGRESS | Floudas Group |
| | 3Drefine | 3Drefine | Cheng Group |
| QA/Selection | APOLLO | APOLLO | Cheng Group |
| | Delbem | | Delbem Group |
| | Kloczkowski/Pawlowski | | Kloczkowski Group |
| | Kloczkowski/Seder | Kloczkowski/Seder | Kloczkowski Group |
| | MESHI-score | MESHI-score | Keasar Group |
| | | MESHI-MSC | Mirzaei&Crivelli Group |
| | ModFOLD5 | ModFOLD6 | McGuffin Group |
| | | MUfold | Xu Group |
| | ProQ2 | ProQ2 | Björn Wallner |
| | SVLab | SVLab | SVLab |
Pipelines formed in WeFold2 and WeFold3, with their corresponding group number (assigned by the prediction center upon registration), category (tertiary structure prediction or refinement), number of targets attempted and groups involved.
| WeFold | Pipeline Name | Group # | Category | Attempted Targets | Groups Involved |
|---|---|---|---|---|---|
| WeFold2 | wf-Baker-UNRES | 128 | TSP | 13 | Baker, Scheraga, Gdansk |
| | wfCPUNK | 442 | TSP | 55 | Floudas, Scheraga, Gdansk, Levitt |
| | wfKsrFdit-BW-Sk-BW | 336 | TSP | 25 | Keasar, Baker/Foldit, Wallner, Seok |
| | wfKsrFdit-BW-Sk-McG | 120 | TSP | 27 | Keasar, Baker/Foldit, Wallner, Seok, McGuffin |
| | wfZhng-Ksr | 173 | TSP | 25 | Zhang, Keasar |
| | wfZhng-Sk-BW | 260 | TSP | 27 | Zhang, Seok, Wallner |
| | wfAll-Cheng | 403 | TSP | 45 | All WeFold Groups, Cheng |
| | wfAll-MD-RFLB | 153 | TSP | 46 | All WeFold Groups, Delbem |
| | wfMix-KFa | 118 | TSP | 55 | Baker/Foldit, Kloczkowski/Faraggi |
| | wfMix-KFb | 197 | TSP | 55 | Baker/Foldit, Kloczkowski/Faraggi |
| | wfMix-KPa | 482 | TSP | 49 | Baker/Foldit, Kloczkowski/Pawlowski |
| | wfMix-KPb | 056 | TSP | 49 | Baker/Foldit, Kloczkowski/Pawlowski |
| | wfHHpred-PTIGRESS | 034 | TSP | 55 | Söding, Floudas |
| | wfKeasar-PTIGRESS | 457 | TSP | 43 | Keasar, Floudas |
| | wf-AnthropicDreams | 203 | TSP | 27 | Keasar, Baker/Foldit |
| | WeFold-Contenders | 014 | TSP | 24 | Keasar, Baker/Foldit |
| | WeFold-GoScience | 433 | TSP | 27 | Keasar, Baker/Foldit |
| | WeFold-Wiskers | 281 | TSP | 7 | Keasar, Baker/Foldit |
| | wf-Void_Crushers | 258 | TSP | 27 | Keasar, Baker/Foldit |
| | wfFdit-BW-KB-BW | 208 | Refinement | 22 | Baker/Foldit, Wallner, Levitt |
| | wfFdit-K-McG | 180 | Refinement | 23 | Baker/Foldit, Wallner, Levitt, McGuffin |
| | wfFdit_BW_K_SVGroup | 154 | Refinement | 15 | Baker/Foldit, Wallner, Levitt, SVLab |
| | wfFdit_BW_SVGroup | 334 | Refinement | 17 | Baker/Foldit, Wallner, SVLab |
| WeFold3 | wf-BAKER-UNRES | 300 | TSP | 16 | Baker, Scheraga, Gdansk |
| | wfCPUNK | 182 | TSP | 47 | Floudas, Scheraga, Gdansk, Levitt |
| | wfDB_BW_SVGroup | 475 | TSP | 46 | Baker, Wallner, SVLab |
| | wfRosetta-MUfold | 325 | TSP | 64 | Baker, Wallner, Xu |
| | wfRosetta-ProQ-MESHI | 173 | TSP | 59 | Baker, Wallner, Keasar |
| | wfRosetta-ProQ-ModF6 | 252 | TSP | 58 | Baker, Wallner, McGuffin |
| | wfRosetta-Wallner | 456 | TSP | 56 | Baker, Wallner |
| | wfRstta-PQ2-Seder | 067 | TSP | 85 | Baker, Wallner, Kloczkowski/Faraggi |
| | wfRstta-PQ-MESHI-MSC | 441 | TSP | 55 | Baker, Wallner, Keasar, Mirzaei |
| | wfAll-Cheng | 239 | TSP | 77 | All WeFold Groups, Cheng |
| | wfMESHI-Seok | 384 | TSP | 65 | Keasar, Seok |
| | wfMESHI-TIGRESS | 303 | TSP | 61 | Keasar, Floudas |
TSP is Tertiary Structure Prediction.
Figure 3. Aggregated best models, WeFold vs. all CASP groups. In each panel, targets are sorted in descending order of the quality of the best decoy submitted (blue line). The best WeFold decoy for each target is marked by a red dot or, when it coincides with the overall best, by a red asterisk. The inset histograms depict the distributions of quality differences (Δ) between the best decoys and their corresponding best WeFold decoys. (A and B) CASP11; (C and D) CASP12; (A and C) best out of five; (B and D) first model.
Figure 4. Average z-scores (>−2.0) of the top 20 CASP12 groups. WeFold pipelines are marked with asterisks (black = wfAll-Cheng; red = wfMESHI-TIGRESS; orange = wfMESHI-Seok; light green = wfRstta-PQ2-Seder; dark green = wfRstta-PQ-ModF6; light blue = wfRosetta-MUFOLD; dark blue = wfRstta-PQ-MESHI-MSC; purple = wfRosetta-PQ-MESHI). The results of MESHI and BAKER-ROSETTASERVER are marked by a black circle and a black triangle, respectively. Only groups that submitted models for at least half of the targets are considered. The chart on the left shows the top 20 groups/servers when considering the best model submitted by each group for each target; the chart on the right shows the top 20 when considering Model 1 only. The CASP assessors used GDT_HA + ASE only for TBM targets, hence that category is depicted twice. Source: http://www.predictioncenter.org/casp12/zscores_final.cgi.
Figure 5. Pairwise comparison of WeFold and related (underlined) CASP11 groups. Each cell represents a comparison between the row and column groups, based on the subset of targets they both predicted. Cell colors depict the difference in average z-scores (GDT_TS); blue indicates better performance of the row group. Asterisks indicate statistical significance (p < 0.05; two-sided Wilcoxon paired test). Dots indicate that the two groups shared no more than ten targets. Rows are ordered by decreasing number of significant cells, and then by number of blue cells. Source: http://www.predictioncenter.org/casp12/zscores_final.cgi.
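The group-vs-group comparison in these figures (average z-score difference over shared targets plus a two-sided Wilcoxon paired test, skipping pairs with too few shared targets) can be sketched as follows. This is a minimal illustration under our own assumptions about the input layout, not the assessors' code: per-target GDT_TS values are held in a nested dict, and z-scores are computed over all groups that predicted each target.

```python
import numpy as np
from scipy.stats import wilcoxon

def zscores_per_target(gdt):
    """gdt: dict target -> dict group -> GDT_TS.
    Returns dict target -> dict group -> z-score of that group's GDT_TS
    relative to all groups that predicted the target."""
    z = {}
    for target, scores in gdt.items():
        vals = np.array(list(scores.values()), dtype=float)
        mu, sd = vals.mean(), vals.std()
        z[target] = {g: (s - mu) / sd if sd > 0 else 0.0
                     for g, s in scores.items()}
    return z

def compare_pair(z, g1, g2, min_shared=10):
    """Mean z-score difference (g1 minus g2) over shared targets and a
    two-sided Wilcoxon signed-rank p-value; None if the groups share no
    more than min_shared targets (the dotted cells in the figure)."""
    shared = [t for t in z if g1 in z[t] and g2 in z[t]]
    if len(shared) <= min_shared:
        return None
    z1 = [z[t][g1] for t in shared]
    z2 = [z[t][g2] for t in shared]
    _, p = wilcoxon(z1, z2, alternative="two-sided")
    return float(np.mean(z1) - np.mean(z2)), float(p)
```

A positive difference with p < 0.05 corresponds to a blue cell with an asterisk in the row of `g1`.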
The GDT_TS loss at each step of the complete clustering process for the targets listed below, measured as the difference between the best GDT_TS before and after each stage: energy is the loss after applying the Rosetta energy filter cutoff; rmsd1 is the loss after applying the filter that excludes models too different from the lowest-Rosetta-energy model; energy + rmsd1 is the cumulative loss from applying both filters; clustering is the additional loss after clustering; and Total loss is the complete cumulative loss after both filtering and clustering.
| Target | Energy | rmsd1 | Energy + rmsd1 | Clustering | Total loss |
|---|---|---|---|---|---|
| T0759 | −1.0 | 0.0 | −1.0 | −1.8 | −2.9 |
| T0763 | −1.9 | −7.9 | −8.2 | −0.1 | −8.3 |
| T0765 | 0.0 | −3.9 | −3.9 | 0.0 | −3.9 |
| T0769 | −2.1 | 0.0 | −2.1 | −1.0 | −3.1 |
| T0773 | −2.2 | 0.0 | −2.2 | −0.7 | −3.0 |
| T0785 | −2.5 | 0.0 | −2.5 | −0.5 | −3.0 |
| T0787 | −1.3 | −1.5 | −2.5 | −0.5 | −3.0 |
| T0797 | −0.1 | 0.0 | −0.1 | −0.1 | −0.2 |
| T0803 | −0.2 | −17.0 | −17.0 | −0.7 | −17.7 |
| T0816 | −8.8 | −8.8 | −8.8 | −17.6 | −26.5 |
| T0818 | −2.4 | 0.0 | −2.4 | −1.5 | −3.9 |
| T0820 | −1.7 | −2.1 | −2.6 | −1.1 | −3.8 |
| T0822 | −16.4 | −25.0 | −28.5 | −0.2 | −28.7 |
| T0824 | −3.9 | −3.0 | −4.4 | −2.1 | −6.5 |
| T0837 | −8.3 | −5.0 | −8.5 | −0.4 | −8.9 |
| T0838 | −0.8 | 0.0 | −0.8 | −0.4 | −1.2 |
| T0848 | 0.0 | 0.0 | 0.0 | −1.8 | −1.8 |
| T0853 | −1.6 | −5.8 | −7.4 | −0.5 | −7.9 |
| T0855 | −1.3 | −1.3 | −1.3 | −1.7 | −2.9 |
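The per-stage losses in the table above follow directly from tracking the best GDT_TS remaining in the decoy pool after each operation: the filter losses are measured against the unfiltered pool, while the clustering loss is the additional drop relative to the post-filter pool, so that the filter and clustering losses sum (up to rounding) to the total. A minimal sketch, assuming the post-stage best scores are already known (the key names are illustrative):

```python
def stage_losses(best_gdt):
    """best_gdt: best GDT_TS remaining in the decoy pool at each point,
    keyed by illustrative stage names, with 'start' for the unfiltered pool.
    Losses are <= 0 whenever a stage discards the best-scoring decoys."""
    start = best_gdt["start"]
    after_filters = best_gdt["after_both_filters"]
    return {
        "energy": best_gdt["after_energy"] - start,          # energy filter alone
        "rmsd1": best_gdt["after_rmsd1"] - start,            # rmsd1 filter alone
        "energy+rmsd1": after_filters - start,               # both filters, cumulative
        "clustering": best_gdt["after_clustering"] - after_filters,
        "total": best_gdt["after_clustering"] - start,
    }
```

By construction `total == energy+rmsd1 + clustering`, which is the consistency check visible (to rounding) in each row of the table.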
Figure 6. Box-and-whisker plots represent the steps of the Keasar-Foldit-based pipelines for target T0822-D1. The first column represents the 20 models created by the servers at stage 1. The second column represents the 151 server models made available by the CASP organizers (stage 2). Keasar selects a subset of 10 server models using MESHI; these models are marked as dots in the third column. Khatib then selects 5 of those models (marked with triangles), and these selected models (the starting points) are given to the Foldit players. The Foldit players created a wide range of models, some of which were substantially better than the starting points, as shown in column 4. However, column 5 shows that the clustering and filtering algorithm did not select those best models. Column 6 shows the clusters after refinement by Seok's lab. Columns 7-13 represent the final selections by different WeFold groups, which selected either exclusively from the clusters in column 6, from a combination of these and Zhang's clusters, or from a combination of all the models shared by the various WeFold groups and servers. The green line is the best model submitted to CASP11 for this target, considering all CASP11 groups. Note that the tick labels along the x-axis also show the number of models at each step of the pipeline. Box-and-whisker plots for all the other targets attempted by the Keasar-Foldit and Zhang pipelines are in the Supplementary Materials.
Figure 7. Comparison of the GDT_HA differences between the top model at each step of the refinement pipeline and the original model provided by the CASP11 organizers for each target. The steps are identified by color bars representing the difference between the GDT_HA of the starting model and the GDT_HA of (1) the best model among those generated by Foldit players (Foldit-All), (2) the best model among the clusters (Foldit-Cluster), (3) the best model among the clusters refined by KoBaMIN (Foldit-Koba), (4) the best selection by McGuffin (K-McG), (5) the best selection by Wallner/ProQ2 (BW-Kb-BW), (6) the best selection by SVLab from the KoBaMIN-refined clusters (Koba-SVLab), and (7) the best selection by SVLab from the unrefined clusters (Clusters-SVLab).
Figure 8. Percentage of models at each step of the refinement pipeline that improved on the GDT_HA of the original model provided by the CASP organizers. The steps are identified as follows: (1) models generated by Foldit players (Foldit-All), (2) clusters (Foldit Clusters), (3) clusters refined by KoBaMIN (Foldit Koba), (4) selection by McGuffin (K-McG), (5) selection by Wallner/ProQ2 (BW-Kb-BW), (6) selection by SVLab from the KoBaMIN-refined clusters (SVLab-Koba), and (7) selection by SVLab from the unrefined clusters (SVLab-Clusters).
Figure 9. Pairwise comparison of WeFold and related (underlined) CASP12 groups. Each cell represents a comparison between the row and column groups, based on the subset of targets they both predicted. Cell colors depict the difference in average z-scores (GDT_TS); blue indicates better performance of the row group. Asterisks indicate statistical significance (p < 0.05; two-sided Wilcoxon paired test). Rows are ordered by decreasing number of significant cells, and then by number of blue cells. Source: http://www.predictioncenter.org/casp12/zscores_final.cgi.
Figure 10. Bar plots showing the down-selection process across the Rosetta-based pipelines for 6 targets, using GDT_HA and GDT_MM. In each row, red bars represent the best GDT_HA and blue bars the best GDT_MM. GDT_MM is a Baker-lab-specific metric in which the MAMMOTH alignment algorithm (MM = MAMMOTH) is used for the superposition (slight variations with respect to GDT_TS arise from the alignment). The top row shows the best GDT_HA (or GDT_MM) among the hundreds of thousands of models generated by Rosetta for each target; the next row shows the best among the 5 models selected by BAKER-ROSETTASERVER; the next row shows the best among the one thousand models selected by ProQ2; and the remaining rows show the best among the 5 models selected by each Rosetta-based WeFold group (one set of bars each).