Julia Koehler Leman, Sergey Lyskov, Steven M Lewis, Jared Adolf-Bryfogle, Rebecca F Alford, Kyle Barlow, Ziv Ben-Aharon, Daniel Farrell, Jason Fell, William A Hansen, Ameya Harmalkar, Jeliazko Jeliazkov, Georg Kuenze, Justyna D Krys, Ajasja Ljubetič, Amanda L Loshbaugh, Jack Maguire, Rocco Moretti, Vikram Khipple Mulligan, Morgan L Nance, Phuong T Nguyen, Shane Ó Conchúir, Shourya S Roy Burman, Rituparna Samanta, Shannon T Smith, Frank Teets, Johanna K S Tiemann, Andrew Watkins, Hope Woods, Brahm J Yachnin, Christopher D Bahl, Chris Bailey-Kellogg, David Baker, Rhiju Das, Frank DiMaio, Sagar D Khare, Tanja Kortemme, Jason W Labonte, Kresten Lindorff-Larsen, Jens Meiler, William Schief, Ora Schueler-Furman, Justin B Siegel, Amelie Stein, Vladimir Yarov-Yarovoy, Brian Kuhlman, Andrew Leaver-Fay, Dominik Gront, Jeffrey J Gray, Richard Bonneau.
Abstract
Each year vast international resources are wasted on irreproducible research. The scientific community has been slow to adopt standard software engineering practices, despite the increases in high-dimensional data, complexities of workflows, and computational environments. Here we show how scientific software applications can be created in a reproducible manner when simple design goals for reproducibility are met. We describe the implementation of a test server framework and 40 scientific benchmarks, covering numerous applications in Rosetta bio-macromolecular modeling. High performance computing cluster integration allows these benchmarks to run continuously and automatically. Detailed protocol captures are useful for developers and users of Rosetta and other macromolecular modeling tools. The framework and design concepts presented here are valuable for developers and users of any type of scientific software and for the scientific community to create reproducible methods. Specific examples highlight the utility of this framework, and the comprehensive documentation illustrates the ease of adding new tests in a matter of hours.
Year: 2021 PMID: 34845212 PMCID: PMC8630030 DOI: 10.1038/s41467-021-27222-7
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Guidelines for reproducible research and for the development of high-quality methods.
| General guidelines for reproducibility | Guidelines for high-quality benchmarks |
|---|---|
| 1. Document artifacts | 1. Define scientific questions for the benchmark |
| 2. Share input, output, and exact workflow in detail under an open license in public repositories | 2. Define quality metrics that are practically relevant |
| 3. Cite the data, software, and workflows | 3. Diversify examples in the benchmark set to cover easy and difficult targets |
| 4. Use persistent links in the publication | 4. Separate benchmark set from the developed method |
| 5. Journals should check for reproducibility | 5. Pick cutting-edge methods to compare your method against |
| 6. Funding agencies should fund reproducibility research | 6. Use benchmarked methods that are freely available |
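As a concrete illustration of the left-hand guidelines 1–4 (document artifacts; share exact inputs, outputs, and workflow; cite data and software; use persistent links), a protocol run can emit a machine-readable manifest alongside its results. The sketch below is hypothetical; the file names, fields, and identifiers are our assumptions, not part of the Rosetta framework.

```python
# Hypothetical protocol-capture manifest; field names and paths are assumptions.
import hashlib
import json
import pathlib

# Exact inputs that will be shared openly (guideline 2); assumes an inputs/ folder
inputs = sorted(str(p) for p in pathlib.Path("inputs").glob("*"))

manifest = {
    "protocol": "loop_modeling_NGK",               # an example test from the tables below
    "software": {"rosetta_commit": "abc1234"},     # placeholder version pin (guideline 1)
    "inputs": inputs,
    "citation_doi": "10.1038/s41467-021-27222-7",  # cite software and data (guideline 3)
    "license": "CC-BY-4.0",
    # Checksums let anyone verify the shared artifacts byte-for-byte
    "checksums": {f: hashlib.sha256(pathlib.Path(f).read_bytes()).hexdigest()
                  for f in inputs},
}

# Published alongside the results under a persistent link (guideline 4)
pathlib.Path("protocol_capture.json").write_text(json.dumps(manifest, indent=2))
```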
Fig. 1 Goals and setup for the scientific tests.
A Test server setup with the web browser as the user interface, the frontend in bright green, and the backend in light green. The code is stored in GitHub, shown in dark gray. B Specific goals for our scientific tests, driven by flaws in a previous iteration of these tests. Each point is described in detail in the text. C Basic infrastructure of the scientific test framework, motivated by simplicity. Each box represents a file, folder, or script that is either provided in the template folder or generated throughout the protocol run. The basic workflow is highlighted in green, with components that facilitate documentation and maintenance shown in white. [Icons in Fig. 1B were created by Ana Teixeira, Aman, Ben Davis, Gregor Cresnar, Anna Sophie, and Joel Avery from Noun Project.] SQL, structured query language; HPC cluster, high-performance computing cluster.
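Concretely, each scientific test lives in its own folder of numbered scripts plus documentation, generated from the template folder shown in Fig. 1C. The layout below is our reading of that figure; the exact file names are assumptions and may differ in the repository.

```
scientific/tests/<test_name>/   # one folder per test, copied from the template
├── 1.submit.py                 # build command lines and submit jobs to the HPC cluster
├── 2.analyze.py                # gather models and compute the quality measures
├── 3.plot.py                   # render score-vs-rmsd (or similar) plots
├── 9.finalize.py               # compare to cutoffs, write pass/fail and the results page
├── cutoffs                     # per-target thresholds for the pass/fail decision
└── readme.md                   # protocol capture: purpose, inputs, how to read the results
```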
Fig. 2 Webpages for the main dashboard and documentation of the tests.
A Dashboard of our benchmark server testing infrastructure. Each test is colored according to its result: red denotes breakage, magenta denotes a script failure, green denotes a passing test, yellow denotes a test that is currently running, and white denotes a test that has yet to be run. All broken tests are shown prominently at the top of the page. All scientific tests are shown in the blue tab below (also encircled in bold black). Tests of the latest revision merged into the main branch are shown below, with information about the committer, the pull request ID, a link to the code difference, and the commit message. B The results page shows the results of the run, the documentation, and a description of whether the test passes or fails. Results pages are automatically generated at the end of the run for each test.
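The pass/fail decision behind such a results page reduces to comparing each computed quality measure against a stored cutoff. Below is a minimal sketch of that logic; the function and field names are hypothetical, not the framework's actual API.

```python
import json

def evaluate(quality_metrics: dict, cutoffs: dict) -> tuple[str, dict]:
    """Return 'passed' or 'failed' plus any metrics that missed their cutoffs."""
    failures = {name: value
                for name, value in quality_metrics.items()
                if value < cutoffs[name]}
    return ("failed" if failures else "passed"), failures

# Example: a fraction-of-good-models measure, as in the antibody_grafting test
metrics = {"fraction_within_rmsd": 0.92}   # computed by the analysis step
cutoffs = {"fraction_within_rmsd": 0.90}   # stored alongside the test
state, failures = evaluate(metrics, cutoffs)
print(json.dumps({"state": state, "failures": failures}))
```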
Scientific tests for bio-macromolecular modeling, continuously running on our testing server framework.
| Test suite | Test | Test author | Quality measures | Targets | nstruct | Runtime in CPUh |
|---|---|---|---|---|---|---|
| Antibodies | antibody_grafting | Jeliazko Jeliazkov | Fraction of residues within rmsd to native | 48 | 1 | 3 |
| Antibodies | antibody_h3_modeling | Jeliazko Jeliazkov | Score vs. rmsd | 6 | 500 | 3000 |
| Antibodies | antibody_snugdock | Jeliazko Jeliazkov | I_sc vs. I_rmsd | 6 | 500 | 3000 |
| Carbohydrates | glycan_dock (dock_glycans)* | Jason Labonte, Morgan Nance | I_sc vs. L_rmsd | 6 | 1000 | 1100 |
| Carbohydrates | glycan_structure_prediction | Jared Adolf-Bryfogle | Score vs. rmsd | 4 | 500 | 950 |
| Comparative modeling | RosettaCM | Jason Fell | GDT-MM | 16 | 200 | 1800 |
| Design | ddg_alanine_scan | Ajasja Ljubetič | R, MAE, fraction correctly classified | 19: 381 | 1 | 3 |
| Design | SEWING | Frank Teets | MotifScorer, InterModelMotifScorer | 1 | 100 | 75 |
| Design | enzyme_design | Rocco Moretti | Various sequence recoveries | 50 | 1 | 50 |
| Design | design_fast | Jack Maguire, Chris Bahl | Score vs. seqrec | 48 | 100 | 2600 |
| Design, interfaces | cofactor_binding_sites | Amanda Loshbaugh | Rank top, position profile similarity | 7 | 200 | 170 |
| Design, immune system | mhc_epitope_energy | Brahm Yachnin | Degree of de-immunization, among others | 50 | 100 | 2000 |
| Docking | protein_protein_docking | Shourya SR Burman | I_sc vs. I_rmsd | 10 | 5000 | 833 |
| Docking | ensemble docking | Ameya Harmalkar | I_sc vs. I_rmsd | 3 | 5000 | 3000 |
| FlexPepDock | FlexPepDock | Ziv Ben-Aharon | Reweighted I_sc vs. backbone I_rmsd | 2 | 200 | 70 |
| Fragments | fragment_picking | Justyna Krys, Dominik Gront | rmsd | 10 | 400 | 2000 |
| Fragments | make fragments pipeline | Daniel Farrell | Coverage, precision | 65 | 1 | 3000 |
| Ligand docking | ligand_docking | Shannon Smith | Delta_I_sc vs. ligand_rmsd | 50 | 200 | 2000 |
| Ligand docking | ligand_scoring_ranking | Shannon Smith | Spearman and Pearson correlation coefficients | 57: 285 | 1 | 2 |
| Loop modeling | loop_modeling_CCD | Phuong Tran, Shane Ó Conchúir | Score vs. loop_rmsd | 7 | 500 | 500 |
| Loop modeling | loop_modeling_KIC | Phuong Tran, Shane Ó Conchúir | Score vs. loop_rmsd | 7 | 500 | 620 |
| Loop modeling | loop_modeling_KIC_fragments | Phuong Tran, Shane Ó Conchúir | Score vs. loop_rmsd | 7 | 500 | 760 |
| Loop modeling | loop_modeling_NGK | Phuong Tran, Shane Ó Conchúir | Score vs. loop_rmsd | 7 | 500 | 570 |
| Membrane protein energy function | mp_f19_energy_landscape# | Rituparna Samanta, Rebecca Alford | ddG, depth and tilt angle | 4 | 1 | 10 |
| Membrane protein energy function | mp_f19_decoy_discrimination | Rituparna Samanta, Rebecca Alford | Score vs. rmsd, Wrms | 4×100 | 1 | 2000 |
| Membrane protein energy function | mp_f19_sequence_recovery | Rituparna Samanta, Rebecca Alford | Sequence recovery, Kullback–Leibler divergence | 130 | 1 | 500 |
| Membrane protein energy function | mp_f19_ddG_of_mutation | Rituparna Samanta, Rebecca Alford | Pearson correlation coefficient | 3 | 1 | 1 |
| Membrane proteins | mp_dock | Julia Koehler Leman, Rebecca Alford | I_sc vs. I_rmsd | 10 | 1000 | 200 |
| Membrane proteins | mp_domain_assembly | Julia Koehler Leman, Rebecca Alford | Score vs. rmsd | 5 | 5000 | 700 |
| Membrane proteins | mp_lipid_acc | Julia Koehler Leman, Rebecca Alford | Accuracy | 223 | 1 | 2 |
| Membrane proteins | mp_relax | Julia Koehler Leman, Rebecca Alford | Score vs. rmsd | 4 | 100 | 40 |
| Membrane proteins | mp_symdock | Julia Koehler Leman, Rebecca Alford | I_sc vs. rmsd | 5 | 1000 | 140 |
| PDB diagnostic | PDB_diagnostic | Steven Lewis, William Hansen, Sergey Lyskov | Read-in error type | entire PDB | 1 | 1000 |
| Peptide structure prediction | simple_cycpep_predict | Vikram K. Mulligan | Score vs. rmsd, PNear | 1 | ~800,000 | 320 |
| Peptide structure prediction | peptide_pnear_vs_ic50 | Vikram K. Mulligan | IC50 vs. folding energy | 7 | 80,000 | 400 |
| Refinement | relax_cartesian | Julia Koehler Leman | Score vs. rmsd | 12 | 100 | 120 |
| Refinement | relax_fast | Julia Koehler Leman | Score vs. rmsd | 12 | 100 | 120 |
| Refinement | relax_fast_5iter | Julia Koehler Leman | Score vs. rmsd | 12 | 100 | 120 |
| RNA | rna_denovo_favorites | Andy Watkins | Score vs. rmsd | 12 | 200 | 120 |
| RNA | stepwise_rna_favorites | Andy Watkins | Score vs. rmsd | 12 | 200 | 240 |
| RosettaNMR | abinitio_RosettaNMR_rdc | Georg Kuenze, Julia Koehler Leman | Score vs. rmsd | 3 | 2000 | 170 |
| RosettaNMR | abinitio_RosettaNMR_pcs | Georg Kuenze, Julia Koehler Leman | Score vs. rmsd | 3 | 2000 | 1400 |
The number of tests is constantly being expanded. The test suite names the overall application; each test is a specific benchmark implemented by the test author(s). The quality measures are evaluated to set a pass/fail criterion. Targets is the number of different proteins (or other biomolecules) tested, nstruct is the number of models built for each target, and the runtime in CPU hours is the total runtime over all targets.
*The dock_glycans test has been superseded by glycan_dock.
#The mp_f19_energy_landscape test has been renamed to mp_f19_tilt_angle.
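The runtime column can be sanity-checked from the other two: the total number of models is targets × nstruct, so the cost per model is the runtime divided by that product. A quick check using the antibody_h3_modeling row:

```python
# antibody_h3_modeling: 6 targets, nstruct = 500, 3000 CPUh in total
targets, nstruct, cpu_hours = 6, 500, 3000
models = targets * nstruct     # 6 * 500 = 3000 models overall
print(cpu_hours / models)      # 1.0 CPUh per model
```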
Tests for which we compare different score functions (score12, talaris2013, talaris2014, ref2015, ligand, betaNov16, mpframework, ref2015mem, and franklin2019), complete with quality measures, the number of targets in each benchmark, the number of models created (nstruct), and the runtime in CPU hours per score function.
| Test suite | Tests | score12 | ligand | mpframework | talaris13 | talaris14 | ref2015 | ref2015mem | betaNov16 | franklin2019 | Quality measures | Targets | nstruct | Runtime in CPUh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Docking | docking | x | x | x | x | I_sc vs. I_rmsd | 10 | 1000 | 150 | |||||
| Design | design_fast | x | x | x | x | Score vs. seqrec | 48 | 100 | 2600 | |||||
| Loop modeling | loop_modeling_CCD | x | x | x | x | Score vs. loop_rmsd | 7 | 500 | 500 | |||||
| loop_modeling_KIC | x | x | x | x | Score vs. loop_rmsd | 7 | 500 | 620 | ||||||
| loop_modeling_KIC_fragments | x | x | x | x | Score vs. loop_rmsd | 7 | 500 | 760 | ||||||
| loop_modeling_NGK | x | x | x | x | Score vs. loop_rmsd | 7 | 500 | 570 | ||||||
| Refinement | relax_fast | x | x | x | x | Score vs. rmsd | 12 | 100 | 120 | |||||
| relax_fast5 | x | x | x | x | Score vs. rmsd | 12 | 100 | 120 | ||||||
| relax_cart | x | x | x | x | Score vs. rmsd | 12 | 100 | 120 | ||||||
| Ligand docking | ligand_docking | x | x | x | x | Delta_Isc vs. ligand_rmsd | 50 | 200 | 2000 | |||||
| Membrane proteins | mp_ddg (ddG of mutation) | x | x | x | x | Pearson correlation | 3 | 50 | 1800 |
The ligand docking and membrane ddG applications require specialized score functions.
Fig. 3 Score function comparison for specific proteins for protein–protein docking and ligand docking.
Comparison of different score functions for different applications, using the PNear metric as an indication of "funnel quality". PNear falls between 0 (no funnel or incorrect global minimum) and 1 (a perfect funnel). The lambda parameter sets the spread on the x-axis and is fixed at 4.0. Score functions are sorted from oldest to newest (left to right); the native (PDB) structure is shown in gray, and models are colored by score function, in order: yellow, green, cyan, and teal. A, B Comparison for protein–protein docking on the target with PDB ID 3eo1. The starting model is shown in dark blue; the docking partner of the starting model is too far away to be shown in the picture. The quality of the prediction improves with newer score functions, as indicated by a tightening of the energy funnel. C, D Comparison for ligand docking on target 4bqh. The native ligand pose is shown in dark blue. The quality of the prediction again improves with newer score functions. E–H Ligand docking comparison on targets 3tll and 4uwc, respectively. Here, newer score functions lower the energy of an incorrect, alternative docking conformation, leading to a worse prediction.
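For reference, PNear is the Boltzmann-weighted funnel metric used elsewhere in Rosetta (e.g., in simple_cycpep_predict): PNear = Σᵢ exp(−rᵢ²/λ²)·exp(−Eᵢ/kBT) / Σᵢ exp(−Eᵢ/kBT), where rᵢ is the rmsd of model i to the native structure and Eᵢ its score. A minimal sketch follows; the kBT energy scale is an assumption, since the caption fixes only λ = 4.0.

```python
import numpy as np

def pnear(rmsd, score, lam=4.0, kbt=1.0):
    """Funnel quality in [0, 1]: 1 is a perfect funnel, 0 no (or a misplaced) funnel.

    rmsd  : per-model rmsd to native (Å)
    score : per-model Rosetta score
    lam   : spread on the x-axis (4.0 in this comparison)
    kbt   : Boltzmann energy scale in score units (assumed value)
    """
    r = np.asarray(rmsd, dtype=float)
    e = np.asarray(score, dtype=float)
    w = np.exp(-(e - e.min()) / kbt)   # Boltzmann weights; the shift cancels in the ratio
    return float(np.sum(np.exp(-(r / lam) ** 2) * w) / np.sum(w))
```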
Fig. 5 Summary of score function comparisons.
Comparison of different score functions (one per column) for different applications and protocols, using the PNear metric as an indication of "funnel quality". PNear falls between 0 (no funnel or incorrect global minimum) and 1 (a perfect funnel). The lambda parameter sets the spread on the x-axis and is fixed at 4.0 in our comparison. Cells are colored according to the color bar on the right; teal is better. Unavailable data is indicated in gray. A "Winner-takes-all" comparison: for each protein, the score function with the highest (i.e., best) PNear value (see Methods) gets a point. Points are then summed by column, so the score function with the most points produced the best funnels for the most proteins. B Average PNear value for each score function, computed over each column. Higher values are better.
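Both panels are simple reductions of a proteins-by-score-functions matrix of PNear values. The sketch below illustrates the two summaries; the example data, the NaN handling for the gray "unavailable" cells, and the tie-breaking rule are our assumptions.

```python
import numpy as np

# pnear[i, j] = PNear of protein i under score function j; NaN marks unavailable data
pnear = np.array([[0.85, 0.91, np.nan],
                  [0.40, 0.55, 0.62]])

# Panel A: winner-takes-all -- each protein awards one point to its best score function
points = np.zeros(pnear.shape[1])
for row in pnear:
    if not np.all(np.isnan(row)):
        points[np.nanargmax(row)] += 1   # ties go to the earliest column (our choice)

# Panel B: per-column averages of PNear, ignoring missing entries
averages = np.nanmean(pnear, axis=0)
print(points, averages)
```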