Yunda Huang, Raphael Gottardo.
Abstract
With the development of novel assay technologies, biomedical experiments and analyses have gone through substantial evolution. Today, a typical experiment can simultaneously measure hundreds to thousands of individual features (e.g. genes) in dozens of biological conditions, resulting in gigabytes of data that need to be processed and analyzed. Because of the multiple steps involved in the data generation and analysis and the lack of details provided, it can be difficult for independent researchers to try to reproduce a published study. With the recent outrage following the halt of a cancer clinical trial due to the lack of reproducibility of the published study, researchers are now facing heavy pressure to ensure that their results are reproducible. Despite the global demand, too many published studies remain non-reproducible mainly due to the lack of availability of experimental protocol, data and/or computer code. Scientific discovery is an iterative process, where a published study generates new knowledge and data, resulting in new follow-up studies or clinical trials based on these results. As such, it is important for the results of a study to be quickly confirmed or discarded to avoid wasting time and money on novel projects. The availability of high-quality, reproducible data will also lead to more powerful analyses (or meta-analyses) where multiple data sets are combined to generate new knowledge. In this article, we review some of the recent developments regarding biomedical reproducibility and comparability and discuss some of the areas where the overall field could be improved.
Keywords: Analysis pipeline; accuracy; open science; precision; protocol; standardization
Year: 2012 PMID: 23193203 PMCID: PMC3713713 DOI: 10.1093/bib/bbs078
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Life cycle of scientific discoveries. The overall cycle is broken down into five different steps. After completion of all steps according to the reproducible guidelines (Table 1), the results would rapidly lead to confirmed (or discarded) discoveries. The confirmed discoveries would then be translated into new knowledge and data supporting novel studies.
Checklist for a comparable and reproducible experiment following stages in the life cycle of scientific discoveries as shown in Figure 1
| Scientific discovery stage | Recommendations | Check |
|---|---|---|
| Step 1: Biological samples for measurement | Store and share source of samples and/or samples if possible | □ |
| | Store and share extra samples for reproducibility (when possible/applicable) and future studies | □ |
| Step 2: Raw instrument data | Standardized experimental protocol | □ |
| | Store and share measuring system (technology and platform) | □ |
| | Store and share Standard Operating Procedure (SOP) | □ |
| | Store and share experiment conditions not specified in SOP (e.g. technician and time) | □ |
| Step 3: Primary data | Perform quality control | □ |
| | Store and share primary data and metadata | □ |
| | Store and share code and software for algorithms used during summary (e.g. image analysis) | □ |
| | Use open-source software and avoid point-and-click analysis interfaces | □ |
| | Use data standards and databases | □ |
| Step 4: Data analysis results | Store and share analysis results and derived data | □ |
| | Store and share code and software (with versions) | □ |
| | Use open-source software and repository for sharing code and data | □ |
| | Validate results using independent data or experiment(s) (when possible) | □ |
| Step 5: Publication or report | Publish results with link to code, data and software | □ |
| | Use dynamic reporting when possible (e.g. Sweave) | □ |
| | Publish in open access journals | □ |
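The dynamic-reporting recommendation in Step 5 can be illustrated with a minimal sketch. Sweave and knitr weave R code into LaTeX documents so that reported numbers are computed, not pasted in; the same idea is sketched below in Python with Markdown output (the function name and data values are illustrative, not from the article):

```python
# Minimal sketch of dynamic reporting: report text and computed results
# live in one script, so re-running the analysis automatically updates
# the report -- the idea behind Sweave/knitr, shown here in Python.
import statistics


def build_report(measurements):
    """Compute summary statistics and embed them in a Markdown report."""
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)
    return (
        "# Analysis report\n\n"
        f"n = {len(measurements)} samples; "
        f"mean = {mean:.2f}, sd = {sd:.2f}.\n"
    )


if __name__ == "__main__":
    # Hypothetical expression values; in practice these would be loaded
    # from the shared primary data (Step 3 of the checklist).
    data = [4.1, 3.9, 4.4, 4.0, 4.2]
    print(build_report(data))
```

If the underlying data change, regenerating the report keeps every reported figure in sync with the analysis, removing a common source of copy-paste errors in publications.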
Figure 2. Precision–accuracy trade-off. Four different protocols are compared. Protocol B exhibits large variance (wide box) with small bias (close to the true value on average), while protocol C has small variance but large bias. Overall, protocol D exhibits a good variance–bias trade-off and should be preferred.
List of tools and resources for reproducible biomedical data analysis mentioned in this review
| Name | Description/usage | URL |
|---|---|---|
| Online protocol storing and sharing | | |
| elabprotocols | Web-based Laboratory Protocol & SOP Management | |
| figshare | Web-based tool for storing and sharing all sorts of research output | |
| Databases and data management tools | | |
| LabKey Server | Biomedical research data management with powerful programming interfaces for analysis | |
| ImmPort | The Immunology Database and Analysis Portal | |
| Analysis tools | | |
| Bioconductor | Collection of R packages for high-throughput biological data analysis | |
| Biopython | Python tools for computational molecular biology | |
| BioPerl | Perl tools for bioinformatics, genomics and life science research | |
| Analysis platforms with graphical user interface | | |
| RStudio | Integrated development environment (IDE) for R | |
| GenePattern | Genomic analysis platform with web-based interface | |
| GenomeSpace | Genomic analysis platform linked with multiple tools including GenePattern, Galaxy and Cytoscape | |
| Code sharing and versioning tools | | |
| GitHub | Web-based tool for software development and collaboration based on the Git version control system | |
| Authoring tools | | |
| GenePattern Word Plugin | Microsoft Word add-in for the GenePattern Reproducible Research Document | |
| Sweave | Integration of R code into LaTeX documents | |
| knitr | Elegant, flexible and fast dynamic report generation with R; knitr is integrated into RStudio for ease of use | |
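Steps 3 and 4 of the checklist recommend storing and sharing primary data, code and software versions together. One common way to make shared inputs verifiable is a provenance manifest pairing each data file with a checksum plus a record of the software environment. The sketch below is illustrative only (the file name and manifest layout are assumptions, not from the article):

```python
# Sketch of a provenance manifest for a shared analysis: record a
# SHA-256 checksum for each data file and the software versions used,
# so an independent group can verify they start from identical inputs.
import hashlib
import json
import platform


def sha256sum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files):
    """Assemble a JSON-serializable record of inputs and environment."""
    return {
        "data": {path: sha256sum(path) for path in data_files},
        "software": {"python": platform.python_version()},
    }


if __name__ == "__main__":
    # Hypothetical primary-data file, created here only for demonstration.
    with open("expression_matrix.csv", "w") as fh:
        fh.write("gene,sample1,sample2\nTP53,5.1,4.8\n")
    print(json.dumps(build_manifest(["expression_matrix.csv"]), indent=2))
```

Publishing such a manifest alongside the data and code (e.g. in a GitHub repository or on figshare, both listed above) lets reviewers confirm that their downloaded copies match the files the analysis actually used.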