maplet: An extensible R toolbox for modular and reproducible metabolomics pipelines.

Kelsey Chetnik¹, Elisa Benedetti¹, Daniel P Gomari², Annalise Schweickart¹, Richa Batra¹, Mustafa Buyukozkan¹, Zeyu Wang¹, Matthias Arnold², Jonas Zierer¹, Karsten Suhre³, Jan Krumsiek¹.

Abstract

This paper presents maplet, an open-source R package for the creation of highly customizable, fully reproducible statistical pipelines for metabolomics data analysis. It builds on the SummarizedExperiment data structure to create a centralized pipeline framework for storing data, analysis steps, results, and visualizations. maplet's key design feature is its modularity, which offers several advantages, such as ensuring code quality through the maintenance of individual functions and promoting collaborative development by removing technical barriers to code contribution. With over 90 functions, the package includes a wide range of functionalities, covering many widely used statistical approaches and data visualization techniques. AVAILABILITY: The maplet package is implemented in R and freely available at https://github.com/krumsieklab/maplet.

Year: 2021 PMID: 34694386 PMCID: PMC8796365 DOI: 10.1093/bioinformatics/btab741

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

A major shift in the biomedical community in recent years has been a push to promote reproducibility in research (Baker, 2016a; Brito et al.; Sandve et al.; Winchester, 2018). This has led to substantial changes in scientific publishing, including new rules for the mandatory sharing of source code and accompanying data for publication in peer-reviewed journals (Baker, 2016b). Adapting data analysis workflows to these new guidelines requires the development of code that is easily readable and maintainable, which often demands substantial time and effort. An effective way to mitigate this burden is to use computational toolboxes that have the following features. (i) To allow users to quickly start developing their own customized workflows, the toolbox must be straightforward to use, with easily readable code pipelines. (ii) The source code should be modular to enable rapid development and efficient maintenance of its functionality. (iii) In order to provide full transparency of the workflow, the code should allow users to retrace and inspect the intermediate results generated at every step of the pipeline; this can be achieved using a container object that records all functions, parameters and results in a central location. (iv) The toolbox should be extensible: adding new functionalities should be straightforward for core developers and members of the open-source community alike.

In recent years, several packages for the generation of metabolomics analysis pipelines have been published (Stanstrup et al.). To the best of our knowledge, only two of these packages, MetaboAnalystR (Chong and Xia, 2018) and structToolbox (Lloyd and Weber, 2021), partially address the requirements listed above; however, each lacks key features from our list, which results in several limitations. structToolbox does not store results and parameters in a single location, forcing the user to keep track of numerous result variables. Neither structToolbox nor MetaboAnalystR records intermediate steps, making it difficult to determine afterwards how results were generated. On the developer’s side, neither package appears to be designed with community development in mind, as structToolbox depends on a complex system of interconnected classes and MetaboAnalystR must be integrated with the MetaboAnalyst web service.

Here, we present maplet, an open-source R package that combines modular design with a container that automatically records all steps, parameters and results, creating a framework for flexible and reproducible metabolomics data analysis (Fig. 1). maplet contains a diverse collection of functions, covering preprocessing, differential analysis, pathway analysis, visualization and various other functionalities. The toolbox is under active development by an international team, and its simple template-based design makes contributing new functionalities intuitive for external users. maplet code is readable and easy to use, making it convenient for new users to integrate into their workflows.

Fig. 1.

maplet pipeline. All data and annotations are stored in a central SummarizedExperiment container, which is passed between functions. Each function generates a result entry containing all function-specific information as well as the results the function generated (such as statistics tables and plots).

2 Toolbox

2.1 Pipeline design

The maplet package allows for the creation of fully transparent analytical pipelines in which each intermediate result can be retraced and inspected by the user. This is achieved by using a container that is passed from function to function and records all data, results, plots, R and package versions, analysis steps and their parameters. The pipeline container builds on SummarizedExperiment (Morgan et al.), a container class provided by Bioconductor (Huber et al.) that stores datasets and all corresponding annotations in a single object. maplet is designed to be used with a pipe operator, either the popular %>% operator from the magrittr package (Bache and Wickham, 2020) or the recently introduced |> operator from base R. Pipe operators connect the processing steps of a maplet pipeline by passing the container from function to function. This makes code more readable and eliminates the need for intermediate result variables. Figure 1 presents a subsection of a pipeline and a diagram representing how each step in the pipeline is stored in the container.
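The container-and-pipe idea can be illustrated with a minimal base-R sketch. It does not use maplet or SummarizedExperiment: a plain list stands in for the container, and `new_container`, `record_step`, `step_log_transform` and `step_feature_means` are hypothetical names invented for this illustration, not functions of the package.

```r
# Hypothetical stand-in for the pipeline container: a plain list holding the
# data matrix plus a running log of executed steps (not maplet's real class).
new_container <- function(data) {
  list(data = data, results = list())
}

# Append one result entry (function name, arguments, output) to the container.
record_step <- function(D, fun, args = list(), output = NULL) {
  entry <- list(fun = fun, args = args, output = output)
  D$results[[length(D$results) + 1]] <- entry
  D
}

# Two illustrative steps; each takes the container and returns it updated.
step_log_transform <- function(D, base = 2) {
  D$data <- log(D$data, base = base)
  record_step(D, "step_log_transform", args = list(base = base))
}

step_feature_means <- function(D) {
  record_step(D, "step_feature_means", output = colMeans(D$data))
}

# Toy data: 3 samples (rows) x 2 metabolites (columns).
raw <- matrix(c(1, 2, 4, 8, 16, 32), nrow = 3,
              dimnames = list(paste0("s", 1:3), c("m1", "m2")))

# The native pipe (R >= 4.1) hands the container from step to step,
# so no intermediate result variables are needed.
D <- new_container(raw) |>
  step_log_transform(base = 2) |>
  step_feature_means()

# Every executed step can be retraced from the container itself.
executed <- vapply(D$results, function(r) r$fun, character(1))
print(executed)
```

Because each step both transforms the data and logs itself, the order of operations and their parameters can be recovered from `D$results` after the pipeline has run.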

2.2 Modularity

maplet follows a modular ‘one function, one operation’ design. Each task is encapsulated in a single function, which enables the rapid development of pipelines in which any step can be flexibly inserted, removed or rearranged. Another key advantage of this modular design is that it helps maintain high-quality code: since functions have no interdependencies, they can be rigorously evaluated and maintained separately. Finally, the modular structure promotes a culture of open-source development by removing the technical barriers to code contribution for developers unfamiliar with the package. Any interested user can add a desired functionality based on a simple function template, with only minimal knowledge of the inner workings of the package.
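A template-based, one-operation-per-function design might look roughly like the following base-R sketch. The template shown here is an assumption made for illustration, not maplet's actual function template; `step_template`, `step_scale` and `step_count_missing` are hypothetical names, and a plain list again stands in for the container.

```r
# Hypothetical step template (not maplet's actual template): validate the
# container, perform exactly one operation, append exactly one result entry.
step_template <- function(D, operation, fun_name, args = list()) {
  stopifnot(is.list(D), is.matrix(D$data))
  out <- operation(D$data)  # must return list(data = ..., output = ...)
  D$data <- out$data
  D$results[[length(D$results) + 1]] <-
    list(fun = fun_name, args = args, output = out$output)
  D
}

# A contributor adds a new step by supplying only the operation itself.
step_scale <- function(D, center = TRUE) {
  step_template(
    D, fun_name = "step_scale", args = list(center = center),
    operation = function(x) {
      list(data = scale(x, center = center, scale = TRUE), output = NULL)
    }
  )
}

step_count_missing <- function(D) {
  step_template(
    D, fun_name = "step_count_missing",
    operation = function(x) list(data = x, output = colSums(is.na(x)))
  )
}

# Toy container: 3 samples x 2 metabolites, one missing value in m1.
D0 <- list(
  data = matrix(c(1, NA, 3, 4, 5, 6), nrow = 3,
                dimnames = list(NULL, c("m1", "m2"))),
  results = list()
)

# The steps do not depend on each other, so they can be rearranged freely.
D1 <- D0 |> step_count_missing() |> step_scale()
D2 <- D0 |> step_scale() |> step_count_missing()

print(D1$results[[1]]$output)
print(D2$results[[2]]$output)
```

In this sketch the bookkeeping lives in one place, so a new step only has to describe its own operation and can be tested in isolation.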

2.3 Functionality

Currently, maplet contains a growing set of over 90 functions organized into groups such as data loading, annotation, data modification, preprocessing, statistical analysis, visualization, reporting results, exporting data and pipeline maintenance. This covers many of the analytical methods needed for standard data analysis in everyday research projects. There are specialized loading functions for data from various popular metabolomics platforms, including loaders for the online metabolomics data repositories ‘Metabolomics Workbench’ (Sud et al.) and ‘MetaboLights’ (Haug et al.). The package provides a wide variety of functionalities commonly used by bioinformatics researchers, including linear models, missing-value imputation, principal component analysis (PCA) and heatmaps, as well as more advanced functionalities such as pathway analysis and network inference. Notably, maplet can be used for other types of omics data and already contains loaders for the Olink proteomics platform. The toolbox comes with several extensive example pipelines and documentation to aid new users in the design of new workflows, and with a specialized testing framework to ensure stable functionality as the package is further developed.

2.4 Report generation and result access

Once a maplet pipeline has been executed, results can be visualized through comprehensive reports automatically assembled by maplet using R markdown/knitr. These reports lay out all functions in the pipeline in the order they were executed, including the name of the function, arguments and any plots or statistics tables produced by the function. The report is compiled into a single HTML, PDF or Word document, which stores all results in a single location and can be easily shared. Moreover, maplet comes with a series of accessor functions, which allow the user to extract processed data, statistical results or plots from the pipeline container and further analyze them using their own R code.
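The two ideas in this section, accessors and an ordered report of executed steps, can be sketched in base R as follows. This is only an illustration under stated assumptions: the list-based container, `get_outputs` and `report_lines` are hypothetical, and the sketch emits plain text rather than the R markdown/knitr documents that maplet assembles.

```r
# Hypothetical container with two recorded steps, standing in for the state
# of a pipeline after execution (not maplet's real container class).
D <- list(
  data = matrix(1:4, nrow = 2),
  results = list(
    list(fun = "step_log_transform", args = list(base = 2), output = NULL),
    list(fun = "step_feature_means", args = list(),
         output = c(m1 = 1, m2 = 4))
  )
)

# Accessor: return the outputs of all steps recorded under a function name.
get_outputs <- function(D, fun) {
  hits <- Filter(function(r) identical(r$fun, fun), D$results)
  lapply(hits, function(r) r$output)
}

# Report: one line per step, in execution order, with its arguments.
report_lines <- function(D) {
  vapply(seq_along(D$results), function(i) {
    r <- D$results[[i]]
    arg_txt <- if (length(r$args) == 0) {
      ""
    } else {
      paste(names(r$args), unlist(r$args), sep = " = ", collapse = ", ")
    }
    sprintf("%d. %s(%s)", i, r$fun, arg_txt)
  }, character(1))
}

# Print the step listing and pull one result out for further analysis.
writeLines(report_lines(D))
means <- get_outputs(D, "step_feature_means")[[1]]
print(means)
```

Extracted results such as `means` are ordinary R objects, so they can be passed on to any downstream code outside the pipeline.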

3 Conclusion

The maplet R package facilitates the fast development of reproducible analysis pipelines for metabolomics data. Its modular design allows for highly customizable, fully reproducible metabolomics pipelines, while also improving readability, ensuring code quality and promoting open-source development.

Funding

This work was supported by the ‘Biomedical Research Program’ funds at Weill Cornell Medical College in Qatar, a program funded by the Qatar Foundation, and multiple grants from the Qatar National Research Fund (QNRF) to K.S.; the National Institute on Aging of the National Institutes of Health [U19AG063744 and 1R01AG069901-01A1 to J.K. and R.B.]; and the National Institute on Aging [U19AG063744, U01AG061359, RF1AG058942 and RF1AG059093 to M.A.].

Conflict of Interest: none declared.

9 in total

1. Orchestrating high-throughput genomic analysis with Bioconductor.

Authors: Wolfgang Huber; Vincent J Carey; Robert Gentleman; Simon Anders; Marc Carlson; Benilton S Carvalho; Hector Corrada Bravo; Sean Davis; Laurent Gatto; Thomas Girke; Raphael Gottardo; Florian Hahne; Kasper D Hansen; Rafael A Irizarry; Michael Lawrence; Michael I Love; James MacDonald; Valerie Obenchain; Andrzej K Oleś; Hervé Pagès; Alejandro Reyes; Paul Shannon; Gordon K Smyth; Dan Tenenbaum; Levi Waldron; Martin Morgan
Journal: Nat Methods Date: 2015-02 Impact factor: 28.547

2. 1,500 scientists lift the lid on reproducibility.

Authors: Monya Baker
Journal: Nature Date: 2016-05-26 Impact factor: 49.962

3. Give every paper a read for reproducibility.

Authors: Catherine Winchester
Journal: Nature Date: 2018-05 Impact factor: 49.962

4. Ten simple rules for reproducible computational research.

Authors: Geir Kjetil Sandve; Anton Nekrutenko; James Taylor; Eivind Hovig
Journal: PLoS Comput Biol Date: 2013-10-24 Impact factor: 4.475

5. Metabolomics Workbench: An international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools.

Authors: Manish Sud; Eoin Fahy; Dawn Cotter; Kenan Azam; Ilango Vadivelu; Charles Burant; Arthur Edison; Oliver Fiehn; Richard Higashi; K Sreekumaran Nair; Susan Sumner; Shankar Subramaniam
Journal: Nucleic Acids Res Date: 2015-10-13 Impact factor: 16.971

6. MetaboAnalystR: an R package for flexible and reproducible analysis of metabolomics data.

Authors: Jasmine Chong; Jianguo Xia
Journal: Bioinformatics Date: 2018-12-15 Impact factor: 6.937

7. MetaboLights: a resource evolving in response to the needs of its scientific community.

Authors: Kenneth Haug; Keeva Cochrane; Venkata Chandrasekhar Nainala; Mark Williams; Jiakang Chang; Kalai Vanii Jayaseelan; Claire O'Donovan
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

8. Recommendations to enhance rigor and reproducibility in biomedical research.

Authors: Jaqueline J Brito; Jun Li; Jason H Moore; Casey S Greene; Nicole A Nogoy; Lana X Garmire; Serghei Mangul
Journal: Gigascience Date: 2020-06-01 Impact factor: 6.524

9. The metaRbolomics Toolbox in Bioconductor and beyond.

Authors: Jan Stanstrup; Corey D Broeckling; Rick Helmus; Nils Hoffmann; Ewy Mathé; Thomas Naake; Luca Nicolotti; Kristian Peters; Johannes Rainer; Reza M Salek; Tobias Schulze; Emma L Schymanski; Michael A Stravs; Etienne A Thévenot; Hendrik Treutler; Ralf J M Weber; Egon Willighagen; Michael Witting; Steffen Neumann
Journal: Metabolites Date: 2019-09-23
