Kevin M Mendez, Leighton Pritchard, Stacey N Reinke, David I Broadhurst.
Abstract
BACKGROUND: A lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns relating to the reproducibility and integrity of results. As an omics science, which generates vast amounts of data and relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats, data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however, they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully integrated with the published work. To ensure broad use within the community, such a framework also needs to be inclusive and intuitive for computational novices and experts alike.
AIM OF REVIEW: To encourage metabolomics researchers from all backgrounds to take control of their own data science, mould it to their personal requirements, and enthusiastically share resources through open science.
KEY SCIENTIFIC CONCEPTS OF REVIEW: This tutorial introduces the concept of interactive web-based computational laboratory notebooks. The reader is guided through a set of experiential tutorials specifically targeted at metabolomics researchers, based around the Jupyter Notebook web application, the GitHub data repository, and the Binder cloud computing platform.
Keywords: Cloud computing; Data science; Jupyter; Open access; Reproducibility; Statistics
Year: 2019 PMID: 31522294 PMCID: PMC6745024 DOI: 10.1007/s11306-019-1588-0
Source DB: PubMed Journal: Metabolomics ISSN: 1573-3882 Impact factor: 4.290
Fig. 1 Applications for Jupyter Notebooks in the postgenomic community. Open virtual notebooks have three main, non-mutually exclusive, applications. First, they provide an efficient means for transparent dissemination of methods and results, thereby enabling alignment with FAIR data principles. Second, they provide a central and interactive platform that facilitates open collaboration to develop methodology and perform data analysis. Finally, their interactive and easily deployable framework can drive experiential learning opportunities for computational novices to develop their own skills and better understand metabolomics data analysis
Glossary of terms
| Paper section | Term | Definition |
|---|---|---|
| 1 | Data repository | A platform (such as MetaboLights or Metabolomics Workbench) used to store metadata and experimental data |
| 2.1 | Command line interface (CLI) | A user interface that is used to execute operating system functions using text |
| 2.1 | Graphical user interface (GUI) | A user interface that is used to execute operating system functions using graphical icons or other visual indicators |
| 2.1 | Integrated development environment (IDE) | A software application that provides an interface to write and test code (such as RStudio, PyCharm and Visual Studio Code). It typically includes basic tools such as a code editor, compiler, and a debugger |
| 2.1 | Containers | Self-contained units of software that package code, dependencies, system tools and system libraries. The purpose is to be reliably transferred between, and deployed on, various operating systems and infrastructures |
| 2.1 | JavaScript object notation (JSON) format | A lightweight data-interchange format commonly used for communication between a browser and server. Internally, Jupyter Notebooks are JSON files with the .ipynb extension |
| 2.1 | Packages | Units of shareable code that can be imported and used to provide additional functionality (such as matplotlib and scikit-learn) |
| 2.1 | Application programming interface (API) | A set of defined functions and protocols for interacting with the software or package |
| 2.1 | Kernel | The “computational engine” that runs and introspects the code contained in a notebook document. Jupyter supports a kernel for Python, as well as kernels for many other languages (such as R, Julia, Kotlin, etc.) |
| 2.2 | Version control | A documented history of changes made to a file, enabling step-by-step reproduction and reconstruction of its development |
| 2.2 | Code repository | A hosted archive (such as those at GitHub and Bitbucket) of source code and supporting files |
| 3 | Virtual environment | An isolated environment that contains a specific version of Python and dependencies |
| 3.1.1 | Distribution (Software) | A collection of software bundled together |
| 3.1.1 | Markdown | A lightweight markup language used to add and format plain text. It is used in Jupyter Notebooks within “Markdown” cells |
| 3.1.3 | Configuration file | A file used to set the initial settings and parameters for computer applications. It is used in Binder to build the virtual environment with specific dependencies |
| 3.2.1 | Text cell (Markdown cell) | A cell in the Jupyter Notebook used to write text (using the Markdown language) |
| 3.2.1 | Code cell | A cell in the Jupyter Notebook used to run code (such as Python code) |
| 3.2.3 | Sandbox (Software development) | A software environment typically used to run or test experimental code in isolation from the rest of the system |
| 3.2.5 | Dependencies | The packages (and versions) that are required to be installed to use the software. For Python, these are the packages that need to be imported at the start of the file |
| 3.2.5 | Channels (Specific to Anaconda) | The location where packages that are installed using conda are stored (such as conda-forge and bioconda) |
| 3.2.5 | README | A file (commonly markdown or text) used to communicate information to visitors about the repository (such as purpose, usage, and contributors) |
| 3.2.5 | Root directory | The directory (or folder) that is the highest level in a hierarchy |
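Several of the glossary entries above (the JSON .ipynb format, Markdown cells, code cells, and kernel metadata) come together in the structure of a notebook file itself. The following is a minimal illustrative skeleton only; real notebooks saved by Jupyter carry additional metadata and output fields:

```python
import json

# Minimal sketch of the JSON that sits inside a .ipynb file
# (illustrative only; not a complete notebook document).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "language": "python"}},
    "cells": [
        {   # A Markdown cell holds formatted text
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Data cleaning\n", "Filter metabolites by QC-RSD."],
        },
        {   # A code cell holds executable source plus captured outputs
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "source": ["print('hello, metabolomics')"],
            "outputs": [],
        },
    ],
}

text = json.dumps(notebook, indent=1)  # what gets written to disk
print(sorted(c["cell_type"] for c in notebook["cells"]))  # ['code', 'markdown']
```

Because the on-disk format is plain JSON, notebooks can be inspected, diffed, and version-controlled like any other text file, which is what makes the GitHub-based sharing described in this tutorial possible.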
Fig. 2 Metabolomics data analysis workflow. The workflow implemented in Tutorials 1 and 2 represents a typical metabolomics data science workflow for a binary classification outcome. The following steps are included: data import, data cleaning based on pooled QC relative standard deviation, PCA to visually inspect data reproducibility, univariate statistics, and multivariate machine learning (PLS-DA, including cross-validation, feature selection, and permutation testing). The flow diagram is coloured by primary operation type (yellow = data import/export; green = data visualisation; blue = data processing)
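The first two computational steps of this workflow (QC-based data cleaning and PCA) can be sketched with NumPy alone. The peak table below is synthetic, and the 20% RSD cut-off is an assumed illustrative threshold, not necessarily the value used in the tutorials:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 30, 50
is_qc = np.zeros(n_samples, dtype=bool)
is_qc[::5] = True  # every 5th injection is a pooled QC sample

# Toy peak table (rows = samples, columns = metabolite features):
# technical variation is small for most features and large for the
# first 10 "unreliable" ones, so the QC-RSD filter has work to do.
mu = rng.uniform(50, 500, n_features)
cv = np.full(n_features, 0.05)
cv[:10] = 0.60
X = mu * (1 + rng.normal(0, cv, size=(n_samples, n_features)))
X = np.clip(X, 1e-6, None)  # keep intensities positive for the log step

# Step 1 - data cleaning: drop features whose pooled-QC relative
# standard deviation exceeds 20% (assumed threshold).
qc = X[is_qc]
rsd = 100 * qc.std(axis=0, ddof=1) / qc.mean(axis=0)
X_clean = X[:, rsd <= 20]

# Step 2 - PCA via SVD on log-transformed, mean-centred data,
# to visually inspect reproducibility (QCs should cluster tightly).
Z = np.log(X_clean)
Z -= Z.mean(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U[:, :2] * s[:2]  # PC1/PC2 scores for a QC-overlay scatter plot

print(X_clean.shape, scores.shape)
```

In the tutorials themselves these steps are performed with dedicated packages and interactive plots; the sketch only shows the underlying arithmetic of the cleaning and projection stages.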
Fig. 3 Key elements required for FAIR data analysis, using Jupyter Notebooks and Binder deployment. A fishbone diagram describing the detailed requirements for FAIR data analysis in metabolomics. Experimental data are derived from typical metabolomics workflows and formatted appropriately for analysis. Data need to be shared, either privately (for pre-publication collaboration) or publicly (for open dissemination). The Jupyter Notebook contains all code, markdown comments, outputs, and visualisations corresponding to the study. The Jupyter Notebook and other required files (such as Readme and configuration files) are compiled into a public GitHub repository. Finally, Binder is used to easily deploy and share the Jupyter Notebook
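One common Binder convention for the configuration file mentioned above is a conda `environment.yml` placed in the repository's root directory; Binder reads it when building the virtual environment for the notebook. The package list below is purely illustrative, not the tutorials' actual dependency set:

```yaml
# environment.yml — read by Binder when building the container
name: metabolomics-tutorial     # illustrative environment name
channels:
  - conda-forge                 # community channel hosting most scientific packages
dependencies:
  - python=3.8                  # pin the interpreter version for reproducibility
  - numpy
  - pandas
  - scikit-learn
  - matplotlib
```

Pinning versions in this file is what allows a stranger's click on a Binder badge to reproduce the exact environment in which the published analysis was run.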
Fig. 4 Example Jupyter Notebook screenshot. At the top of the page are the Jupyter menu bar and ribbon of action buttons. The main body of the notebook then displays text and code cells, and any outputs from code execution. This screenshot was taken near the end of Tutorial 1, when the partial least squares discriminant analysis model is being evaluated. Three plots are generated, comparing the performance of the model on the training and holdout test datasets: a violin plot showing the distribution of scores for known positives and negatives in both training and test sets, together with the class cut-off (dotted line); probability density functions for the positive and negative classes in the training and test sets (the training set datapoints are rendered as more opaque); and ROC curves of model performance on the training set (with 95% CI) and test set
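The train-versus-test comparison underlying these evaluation plots can be mimicked on synthetic scores. The rank-based AUC helper below is a generic sketch, not the tutorials' code (real ROC curves with confidence intervals would normally come from a dedicated library), and all labels and predicted scores are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)

def roc_auc(labels, scores):
    """Rank-based AUC: the probability that a random positive
    outranks a random negative (Mann-Whitney formulation)."""
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Simulated binary outcomes and predicted probabilities: informative
# but imperfect, with slightly noisier scores on the holdout set.
y_train = rng.integers(0, 2, 100)
y_test = rng.integers(0, 2, 40)
s_train = np.clip(y_train * 0.4 + rng.normal(0.3, 0.20, 100), 0, 1)
s_test = np.clip(y_test * 0.4 + rng.normal(0.3, 0.25, 40), 0, 1)

print(f"train AUC = {roc_auc(y_train, s_train):.2f}")
print(f"test  AUC = {roc_auc(y_test, s_test):.2f}")
```

A gap between the two AUC values is the kind of optimism the tutorial's cross-validation and permutation-testing steps are designed to expose.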