Abstract
BACKGROUND: Reproducibility in Data Analysis research has long been a significant concern, particularly in the areas of Bioinformatics and Computational Biology. To support the development of reproducible and reusable processes, Data Analysis management tools can help give structure and coherence to complex data flows. Nonetheless, improved software quality comes at the cost of additional design and planning effort, which may become impractical in rapidly changing development environments. I propose that shifting the focus from processes to data in the management of Bioinformatic pipelines may help improve reproducibility with minimal impact on preexisting development practices.
Keywords: Data flows; Data pipelines; R language; Reproducible research
Year: 2017 PMID: 28209127 PMCID: PMC5314482 DOI: 10.1186/s12859-017-1510-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Process-centered and data-centered approaches. (a) In a process-centered approach (PCA) to the development of an analysis pipeline, a well established process P1 is selected to build a new resource from input data. In the data-centered approach (DCA), the resource R1 has been created by prototypical code which needs to be properly structured and polished. (b) In the PCA, the resource R1 has been produced and a well established procedure P2 has been selected to further process it. In the DCA, the process P1 is now properly structured and the resource R2 has been created using prototypical code. (c) Both approaches finally yield equivalent pipelines and annotations.
A summary of commands available in the latest development version of the repo package
| Command | Description |
|---|---|
| attach | Store a generic file into the repository. |
| attr | Retrieves item attributes. |
| build | Runs the code chunk associated with an item, and with dependent items if needed. |
| bulkedit | Saves repository meta data to a text file for offline editing or loads the file after editing. |
| check | Checks MD5-consistency of stored items. |
| chunk | Displays the code chunk associated with an item. |
| copy | Copies items between repositories. |
| cpanel | Runs visual interface. |
| dependencies | Returns and/or plots item dependencies. |
| export | Saves the contents of a repository item to a file in RDS format. |
| find | Searches all metadata for a partial string match. |
| get | Loads an item into the current workspace. |
| handlers | Returns a list of functions to be used as an alternative interface to the repository. |
| has | Checks whether an item is present in the repository. |
| info | Displays a summary of information about a regular item, a project item, or the repository. |
| lazydo | Evaluates specified code caching results in the repository. Loads results if already cached. |
| options | Sets default parameters to be used by subsequent calls to the put command. |
| pies | Shows statistics about disk space used by each item in the repository. |
| print | Summarizes information about items. |
| project | Creates a special “project” item. |
| pull | Overwrites item contents by downloading data from the associated URL. |
| put | Stores new data into the repository. |
| related | Lists items related to a given item according to dependencies. |
| rm | Removes items from the repository. |
| root | Returns repository root position on the file system. |
| set | Updates an existing item. |
| stash | Stores an item with unspecified meta information. |
| stashclear | Removes stash-ed items. |
| sys | Runs a system command on a given item. |
| tag | Sets tags for an item. |
| tags | Retrieves tags for an item. |
| untag | Removes specified tags from an item. |
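
To make two of the commands in the table more concrete, the following is a minimal, language-neutral sketch written in Python. It is not the repo package's R API or its implementation: the class `TinyRepo`, its in-memory layout, and the cache-key scheme (a hash of the code text) are assumptions made for illustration only. It shows how an MD5 consistency check (`check`) and cached evaluation (`lazydo`) could work in principle.

```python
import hashlib


class TinyRepo:
    """Toy in-memory store; names and layout are hypothetical, not repo's."""

    def __init__(self):
        self.items = {}  # name -> {"data": bytes, "md5": str}

    def put(self, name, data):
        # Record the checksum at storage time as the consistency reference.
        self.items[name] = {"data": data, "md5": hashlib.md5(data).hexdigest()}

    def get(self, name):
        return self.items[name]["data"]

    def has(self, name):
        return name in self.items

    def check(self):
        # Recompute each checksum and report whether it still matches.
        return {
            name: hashlib.md5(item["data"]).hexdigest() == item["md5"]
            for name, item in self.items.items()
        }

    def lazydo(self, code, compute):
        # Assumed cache key: a hash of the code text (illustrative choice).
        key = "lazydo_" + hashlib.md5(code.encode("utf-8")).hexdigest()
        if not self.has(key):
            self.put(key, compute())  # first call: evaluate and store
        return self.get(key)          # later calls: load the cached result
```

With this sketch, calling `lazydo` twice with the same code text runs `compute` only once, and `check` reports `False` for any item whose stored bytes no longer match the recorded checksum.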
Fig. 2 Example of repository statistics. Pie chart visualization of the repository items according to their disk space usage, as produced by the pies function
Fig. 3 The dependency graph summarizing relations between items in the repository. Three types of relations are supported by repo: attached to, depends on, generated by. When items are properly annotated, this visualization also represents the analysis data flow
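
As an illustration of how annotated relations can double as a description of the data flow, here is a minimal Python sketch (not repo's R code). The item names are hypothetical, and treating all three relation types as "the target must exist before the item" is a simplifying assumption of this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# (item, relation, target) triples; all item names are hypothetical.
RELATIONS = [
    ("normalized", "generated by", "normalize.R"),
    ("normalized", "depends on", "raw_counts"),
    ("clusters", "depends on", "normalized"),
    ("clusters_plot", "attached to", "clusters"),
]


def build_order(relations):
    """Return one valid ordering in which every target precedes its item."""
    graph = {}
    for item, _relation, target in relations:
        graph.setdefault(target, set())
        # Assumption: every relation type implies the target comes first.
        graph.setdefault(item, set()).add(target)
    return list(TopologicalSorter(graph).static_order())


order = build_order(RELATIONS)
```

Any ordering returned by `build_order` places `raw_counts` and `normalize.R` before `normalized`, which in turn precedes `clusters` and its attached plot.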
Fig. 4 Selective plot of dependencies within the repository. In this case all the items annotated with the tag “visualization” are excluded
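
The tag-based selection described in this caption can be sketched as a simple filter over the graph. The Python below is an illustration under assumptions, not repo's implementation: the item names, tags, and the helper `exclude_tagged` are hypothetical.

```python
# Hypothetical items with their tags, and (item, target) dependency edges.
ITEM_TAGS = {
    "raw_counts": {"dataset"},
    "normalized": {"dataset"},
    "clusters": {"analysis"},
    "clusters_plot": {"visualization"},
}
EDGES = [
    ("normalized", "raw_counts"),
    ("clusters", "normalized"),
    ("clusters_plot", "clusters"),
]


def exclude_tagged(item_tags, edges, excluded_tag):
    """Drop items carrying excluded_tag, plus any edge touching them."""
    kept = {name for name, tags in item_tags.items() if excluded_tag not in tags}
    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return kept, kept_edges


kept_items, kept_edges = exclude_tagged(ITEM_TAGS, EDGES, "visualization")
```

Here the item tagged "visualization" and its single edge are dropped, while the rest of the graph is left unchanged.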
Fig. 5 The repository control panel. It consists of a Shiny [20] application running in a web browser. The user can browse through repository items and load them into the current workspace