
Reproducibility and Transparency by Design.

Vladislav A Petyuk, Laurent Gatto, Samuel H Payne.

Keywords:  Algorithms; Bioinformatics; Computational Biology; Data Evaluation; Data Standards; Open Science; Reproducibility; Transparency

Year:  2019        PMID: 31273047      PMCID: PMC6692781          DOI: 10.1074/mcp.IP119.001567

Source DB:  PubMed          Journal:  Mol Cell Proteomics        ISSN: 1535-9476            Impact factor:   5.911


To achieve truly reproducible research, reproducible analytics must be a principal research goal. Biological discovery is not the only deliverable; reproducibility is an essential part of our research. Public trust in scientific research depends on both the clarity of published conclusions and the perceived transparency of the methods. Although irreproducibility is not exclusive to biology, strong public interest in environmental and biomedical discoveries seems to have focused the spotlight here following a number of high-profile studies that failed to be reproduced (1–6). In this report, we focus specifically on the linked issues of reproducibility and transparency in the integration and analysis of multi-omics data. Unlike data generation, where biological variability is expected, computational analyses should be completely and exactly reproducible. Unfortunately, the documentation of data processing, analysis, and statistical algorithms in publications is usually not sufficiently detailed. This lack of detail is especially problematic for multi-omics characterizations, where complex statistical integration is essential to merging disparate data types (e.g., clinical, proteomics, and genomics).

Making Reproducibility a Priority. Where Are the Gaps?

There are many stages in a multi-omics project, and recent efforts have significantly improved the transparency of data files and selected analysis steps. MCP and other journals have been leaders in requiring the complete sharing of raw data and preliminary processing (7–9). In a multi-omics project, it is now common to require that the mass spectrometry instrument files are freely shared via public repositories, which exist for genomics (10), proteomics (11), and metabolomics (12). Spectral identification must also be reported with the software and associated parameters. Pipelines run through BioContainers (13) facilitate this recording. Although some popular tools with a graphical interface may not currently store this workflow metadata, we feel that recording it is rapidly becoming a demand of both users and publishers. After obtaining quantitative molecular data, there is still a lot of work before publication. This includes merging genomic and proteomic data tables, binning samples into phenotypic groups based on multi-omic clustering, functional enrichment analysis, metabolic network modeling, and so on. Unfortunately, the current efforts to mandate data sharing have focused on just the data. Data interpretation and statistical analyses that support scientific conclusions are an equally essential component of our work and must also be openly shared. We write this commentary to highlight the need for greater efforts in the open sharing of analyses. Although it is a narrow topic, we feel it is important to discuss: mandated data sharing resolves only a portion of the overall transparency/reproducibility challenge, and the sharing of analyses remains unaddressed. Moreover, our solution is not difficult to implement for the new generation of data-savvy researchers. It does not require large grants to fund computational or storage infrastructure; it can be done by individual researchers with a modicum of effort.
Thus, without delay, journals can start to encourage or enforce the open sharing of computational and statistical data interpretation. As its central feature, our solution encapsulates the entire data analysis in software, including the creation of publication-quality figures. We want to make it easy for peers to repeat exactly the analyses in a publication, specifically the critical final steps where data interpretation happens. For example, when discussing an assertion in the results section, it is common to parenthetically list the p value and a specific test. To increase the transparency and reproducibility of this assertion, we should share the actual software code that produced this p value. Multiple modern software platforms have made this level of transparency achievable with modest effort, including Jupyter notebooks and R markdown (14, 15). Our support for these technologies is not meant to be exclusive but merely convenient, as many publications already utilize Python and R/Bioconductor (16). We strongly advocate for the following three steps: (1) code for analysis and figures should be posted to an open version-controlled repository such as GitHub (17); (2) data tables used by the analysis should be posted in the same repository, or linked to a password-free download if they are too large; and (3) the URL to specific scripts in the repository should be prominently listed in figure legends and methods sections. The effect of these three steps would be that anyone interested in a specific figure or conclusion of the paper could easily find the exact analysis method and fully repeat the computation. Indeed, this approach to reproducibility has already been used in a few exemplary publications (18–21).
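
As a minimal sketch of what sharing "the actual software code that produced this p value" could look like, the Python script below reads a small data table, runs one named statistical test, and reports the p value together with a checksum of the input. Everything specific in it is an assumption made for illustration and is not taken from any publication discussed here: the data values, the function names, the choice of a permutation test, and the seed. In a real repository the table would be a file committed next to the script rather than an inline string.

```python
"""Hypothetical analysis script of the kind a figure legend could link to."""
import csv
import hashlib
import io
import random
import statistics

# Illustrative data table (made-up values). In practice this would be a CSV
# file stored in the same repository as the script.
DATA_CSV = """sample,group,abundance
s1,control,10.1
s2,control,9.8
s3,control,10.4
s4,control,10.0
s5,case,11.2
s6,case,11.5
s7,case,10.9
s8,case,11.4
"""


def load_groups(text):
    """Parse the table and return abundance values keyed by group name."""
    groups = {}
    for row in csv.DictReader(io.StringIO(text)):
        groups.setdefault(row["group"], []).append(float(row["abundance"]))
    return groups


def permutation_p_value(a, b, n_permutations=5000, seed=2019):
    """Two-sided permutation test on the difference of group means.

    A fixed seed makes the random shuffles, and therefore the reported
    p value, identical on every run.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


def run_analysis(text=DATA_CSV):
    """Return the test name, the p value, and a checksum of the input table."""
    groups = load_groups(text)
    return {
        "data_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "test": "two-sided permutation test on difference of means",
        "p_value": permutation_p_value(groups["control"], groups["case"]),
    }


if __name__ == "__main__":
    print(run_analysis())
```

Because the seed is fixed and the input is identified by its checksum, a reader who runs the script on the same table should obtain the same p value, which is the kind of exact repeatability argued for above. The same idea carries over to a Jupyter notebook or an R markdown document.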

Looking Forward

The benefits of true transparency have been previously noted (22, 23), and we reiterate that our proposed solution has lasting positive effects for the principal investigator, funding agencies, peer review, collaborators, and the general public. The solution is flexible and applicable to the broad needs of multi-omics integration for climate research, clinical proteogenomics, systems biology, computational neuroscience, and so on. As multi-omics measurements continue to revolutionize environmental and biomedical research, biology more explicitly becomes a data science. Most graduate programs now require statistics courses, where students learn tools like R and Python. Given the enormous societal impact that comes from scientific discoveries, the transparency of our data and methodology is a critical component of the scientific venture. As large data repositories have begun to capture much of the raw data generated for experiments, we have suggested a companion method to disseminate and expose data analysis methods. Ultimately, the transparency of full disclosure will expose any actual problems underlying irreproducibility in a manner where other researchers can help to correct and advance science.
References (21 in total; 10 listed)

1.  Recommendations for mass spectrometry data quality metrics for open access data (corollary to the Amsterdam Principles).

Authors:  Christopher R Kinsinger; James Apffel; Mark Baker; Xiaopeng Bian; Christoph H Borchers; Ralph Bradshaw; Mi-Youn Brusniak; Daniel W Chan; Eric W Deutsch; Bruno Domon; Jeff Gorman; Rudolf Grimm; William Hancock; Henning Hermjakob; David Horn; Christie Hunter; Patrik Kolar; Hans-Joachim Kraus; Hanno Langen; Rune Linding; Robert L Moritz; Gilbert S Omenn; Ron Orlando; Akhilesh Pandey; Peipei Ping; Amir Rahbar; Robert Rivers; Sean L Seymour; Richard J Simpson; Douglas Slotta; Richard D Smith; Stephen E Stein; David L Tabb; Danilo Tagle; John R Yates; Henry Rodriguez
Journal:  Mol Cell Proteomics       Date:  2011-11-03       Impact factor: 5.911

2.  Protein sequences from mastodon and Tyrannosaurus rex revealed by mass spectrometry.

Authors:  John M Asara; Mary H Schweitzer; Lisa M Freimark; Matthew Phillips; Lewis C Cantley
Journal:  Science       Date:  2007-04-13       Impact factor: 47.728

3.  New Guidelines for Publication of Manuscripts Describing Development and Application of Targeted Mass Spectrometry Measurements of Peptides and Proteins.

Authors:  Susan Abbatiello; Bradley L Ackermann; Christoph Borchers; Ralph A Bradshaw; Steven A Carr; Robert Chalkley; Meena Choi; Eric Deutsch; Bruno Domon; Andrew N Hoofnagle; Hasmik Keshishian; Eric Kuhn; Daniel C Liebler; Michael MacCoss; Brendan MacLean; D R Mani; Hendrik Neubert; Derek Smith; Olga Vitek; Lisa Zimmerman
Journal:  Mol Cell Proteomics       Date:  2017-02-09       Impact factor: 5.911

4.  Orchestrating high-throughput genomic analysis with Bioconductor.

Authors:  Wolfgang Huber; Vincent J Carey; Robert Gentleman; Simon Anders; Marc Carlson; Benilton S Carvalho; Hector Corrada Bravo; Sean Davis; Laurent Gatto; Thomas Girke; Raphael Gottardo; Florian Hahne; Kasper D Hansen; Rafael A Irizarry; Michael Lawrence; Michael I Love; James MacDonald; Valerie Obenchain; Andrzej K Oleś; Hervé Pagès; Alejandro Reyes; Paul Shannon; Gordon K Smyth; Dan Tenenbaum; Levi Waldron; Martin Morgan
Journal:  Nat Methods       Date:  2015-02       Impact factor: 28.547

5.  Cancer biomarkers: can we turn recent failures into success?

Authors:  Eleftherios P Diamandis
Journal:  J Natl Cancer Inst       Date:  2010-08-12       Impact factor: 13.506

6.  Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking.

Authors:  Mingxun Wang; Jeremy J Carver; Vanessa V Phelan; Laura M Sanchez; Neha Garg; Yao Peng; Don Duy Nguyen; Jeramie Watrous; Clifford A Kapono; Tal Luzzatto-Knaan; Carla Porto; Amina Bouslimani; Alexey V Melnik; Michael J Meehan; Wei-Ting Liu; Max Crüsemann; Paul D Boudreau; Eduardo Esquenazi; Mario Sandoval-Calderón; Roland D Kersten; Laura A Pace; Robert A Quinn; Katherine R Duncan; Cheng-Chih Hsu; Dimitrios J Floros; Ronnie G Gavilan; Karin Kleigrewe; Trent Northen; Rachel J Dutton; Delphine Parrot; Erin E Carlson; Bertrand Aigle; Charlotte F Michelsen; Lars Jelsbak; Christian Sohlenkamp; Pavel Pevzner; Anna Edlund; Jeffrey McLean; Jörn Piel; Brian T Murphy; Lena Gerwick; Chih-Chuang Liaw; Yu-Liang Yang; Hans-Ulrich Humpf; Maria Maansson; Robert A Keyzers; Amy C Sims; Andrew R Johnson; Ashley M Sidebottom; Brian E Sedio; Andreas Klitgaard; Charles B Larson; Cristopher A Boya P; Daniel Torres-Mendoza; David J Gonzalez; Denise B Silva; Lucas M Marques; Daniel P Demarque; Egle Pociute; Ellis C O'Neill; Enora Briand; Eric J N Helfrich; Eve A Granatosky; Evgenia Glukhov; Florian Ryffel; Hailey Houson; Hosein Mohimani; Jenan J Kharbush; Yi Zeng; Julia A Vorholt; Kenji L Kurita; Pep Charusanti; Kerry L McPhail; Kristian Fog Nielsen; Lisa Vuong; Maryam Elfeki; Matthew F Traxler; Niclas Engene; Nobuhiro Koyama; Oliver B Vining; Ralph Baric; Ricardo R Silva; Samantha J Mascuch; Sophie Tomasi; Stefan Jenkins; Venkat Macherla; Thomas Hoffman; Vinayak Agarwal; Philip G Williams; Jingqui Dai; Ram Neupane; Joshua Gurr; Andrés M C Rodríguez; Anne Lamsa; Chen Zhang; Kathleen Dorrestein; Brendan M Duggan; Jehad Almaliti; Pierre-Marie Allard; Prasad Phapale; Louis-Felix Nothias; Theodore Alexandrov; Marc Litaudon; Jean-Luc Wolfender; Jennifer E Kyle; Thomas O Metz; Tyler Peryea; Dac-Trung Nguyen; Danielle VanLeer; Paul Shinn; Ajit Jadhav; Rolf Müller; Katrina M Waters; Wenyuan Shi; Xueting Liu; Lixin Zhang; Rob Knight; Paul R Jensen; Bernhard O Palsson; Kit Pogliano; Roger G Linington; Marcelino Gutiérrez; Norberto P Lopes; William H Gerwick; Bradley S Moore; Pieter C Dorrestein; Nuno Bandeira
Journal:  Nat Biotechnol       Date:  2016-08-09       Impact factor: 54.908

7.  Absence of detectable arsenate in DNA from arsenate-grown GFAJ-1 cells.

Authors:  Marshall Louis Reaves; Sunita Sinha; Joshua D Rabinowitz; Leonid Kruglyak; Rosemary J Redfield
Journal:  Science       Date:  2012-07-08       Impact factor: 47.728

8.  The ProteomeXchange consortium in 2017: supporting the cultural change in proteomics public data deposition.

Authors:  Eric W Deutsch; Attila Csordas; Zhi Sun; Andrew Jarnuczak; Yasset Perez-Riverol; Tobias Ternent; David S Campbell; Manuel Bernal-Llinares; Shujiro Okuda; Shin Kawano; Robert L Moritz; Jeremy J Carver; Mingxun Wang; Yasushi Ishihama; Nuno Bandeira; Henning Hermjakob; Juan Antonio Vizcaíno
Journal:  Nucleic Acids Res       Date:  2016-10-18       Impact factor: 16.971

9.  Ten Simple Rules for Taking Advantage of Git and GitHub.

Authors:  Yasset Perez-Riverol; Laurent Gatto; Rui Wang; Timo Sachsenberg; Julian Uszkoreit; Felipe da Veiga Leprevost; Christian Fufezan; Tobias Ternent; Stephen J Eglen; Daniel S Katz; Tom J Pollard; Alexander Konovalov; Robert M Flight; Kai Blin; Juan Antonio Vizcaíno
Journal:  PLoS Comput Biol       Date:  2016-07-14       Impact factor: 4.475

10.  BioContainers: an open-source and community-driven framework for software standardization.

Authors:  Felipe da Veiga Leprevost; Björn A Grüning; Saulo Alves Aflitos; Hannes L Röst; Julian Uszkoreit; Harald Barsnes; Marc Vaudel; Pablo Moreno; Laurent Gatto; Jonas Weber; Mingze Bai; Rafael C Jimenez; Timo Sachsenberg; Julianus Pfeuffer; Roberto Vera Alvarez; Johannes Griss; Alexey I Nesvizhskii; Yasset Perez-Riverol
Journal:  Bioinformatics       Date:  2017-08-15       Impact factor: 6.937

