
EpiMetal: an open-source graphical web browser tool for easy statistical analyses in epidemiology and metabolomics.

Jussi Ekholm, Pauli Ohukainen, Antti J Kangas, Johannes Kettunen, Qin Wang, Mari Karsikas, Anmar A Khan, Bronwyn A Kingwell, Mika Kähönen, Terho Lehtimäki, Olli T Raitakari, Marjo-Riitta Järvelin, Peter J Meikle, Mika Ala-Korpela.

Abstract

MOTIVATION: An intuitive graphical interface that allows statistical analyses and visualizations of extensive data without any knowledge of dedicated statistical software or programming. IMPLEMENTATION: EpiMetal is a single-page web application written in JavaScript, to be used via a modern desktop web browser. GENERAL FEATURES: Standard epidemiological analyses and self-organizing maps for data-driven metabolic profiling are included. Multiple extensive datasets with an arbitrary number of continuous and category variables can be integrated with the software. Any snapshot of the analyses can be saved and shared with others via a www-link. We demonstrate the usage of EpiMetal using pilot data with over 500 quantitative molecular measures for each sample as well as in two large-scale epidemiological cohorts (N >10 000). AVAILABILITY: The software usage exemplar and the pilot data are open access online at [http://EpiMetal.computationalmedicine.fi]. MIT licensed source code is available at the Github repository at [https://github.com/amergin/epimetal].


Year: 2020 PMID: 31943015 PMCID: PMC7660139 DOI: 10.1093/ije/dyz244

Source DB: PubMed Journal: Int J Epidemiol ISSN: 0300-5771 Impact factor: 9.685

Introduction

We are living in a multi-omics era of systems epidemiology. Quantitative high-throughput metabolomics and lipidomics have resulted in hundreds of molecular measures for up to tens of thousands of people in multiple cohorts and biobanks. Such extensive and complex data create significant challenges for statistical analyses. It would therefore be beneficial, not only for omics beginners but for all epidemiologists, to have a simple visual tool for rapid exploratory analyses of these kinds of modern datasets without the immediate need for bioinformaticians fluent in currently available professional statistical analysis tools such as R. To this end, we developed a web browser-based graphical software, EpiMetal, for standard statistical epidemiological analyses as well as for multivariate self-organizing maps (SOMs) for data-driven analyses, metabolic profiling and potentially clinical subgrouping. EpiMetal is versatile: any dataset with an arbitrary number of continuous and categorical variables can be easily integrated with the software. Data from multiple cohorts can be imported for comparative analyses. The original datasets can be divided into subgroups in multiple ways, for example based on SOMs, histograms or scatterplots; the created subgroups can be saved and analysed separately or in comparison with any other dataset. Regression analyses with covariate adjustments are available with graphical visualization of the results. Publication-quality visualizations can be made and exported. Any snapshot of the analysis pipeline can be saved and shared with others via a www-link. Although EpiMetal might not be the optimal choice for producing final publishable results, it can also serve as a benchmarking tool for newly written scripts and functions in other software.

As a usage exemplar, we present explorative analyses in a pilot cohort of 190 samples for which serum nuclear magnetic resonance (NMR) metabolomics and mass spectrometry (MS) lipidomics data are available. The data include over 500 quantitative molecular measures for each individual from these complementary methodologies, which are becoming increasingly popular in epidemiological applications. This is apparently the first time these comprehensive data are combined in an epidemiological setting. The data are made public along with the software (https://github.com/amergin/epimetal/blob/master/python-api/api-docker/samples.tsv). The exemplar demonstrates how the graphical interface of EpiMetal can be used to visualize extensive data, select subgroups and ultimately gain epidemiological insights via a combination of various statistical analysis options. In the supplement we also illustrate comparative analyses for two large-scale epidemiological cohorts.

Implementation

EpiMetal consists of three major components: (i) the database (MongoDB), which is the long-term store for dataset samples, computational results and stored sessions; (ii) a single-page web application written in JavaScript (JS) that is accessed by users via a web browser; and (iii) back-end software written in Python that serves as an intermediary between the web application and the database to retrieve data and record user sessions. The application uses third-party open-source libraries (version and licensing information is available on GitHub). The software is encapsulated inside Docker containers to facilitate easy deployment across server platforms and to isolate the software from the host system. Several Plotter instances can be run in parallel with differing configurations and data. The overall architecture of the software is presented in Supplementary Figure S1, available as Supplementary data at IJE online. Key data handling, visualization and statistical analysis features are summarized in Figure 1.

Figure 1.

Key data handling, visualization and statistical analysis features of the EpiMetal software illustrated using real epidemiological data (Northern Finland Birth Cohort 1966; N = 5713). A generalized flow of analysis begins by choosing one or more datasets from the user-uploaded options. This can be, for example, one population cohort but also a combination of many. The main analysis options are located at the top of the graphical interface and divided into three categories: ‘Explore and filter’, ‘Regression analysis’ and ‘SOM’. Under ‘Explore and filter’, the user can quickly generate basic plots to gain an overview of the data structure. Variables can be plotted and compared using histograms, scatterplots and boxplots. Heatmaps can also be created for an overall visualization of Spearman’s rank correlations between variables. Active filters can also be applied to select subsets of the data. For example, one can choose to analyse only individuals with HDL-C <1.0 mmol/L in a given population cohort. The main category ‘Regression analysis’ allows the user to choose an outcome and exposure variables with an optional number of covariates and to generate a forest plot displaying the point estimates and 95% confidence intervals. Under ‘SOM’, the user can calculate a self-organizing map trained according to selected variables. The map can then be used to choose a subset of the entire dataset on the basis of this metabolic profiling. It should be noted that the analyses made in the ‘Explore and filter’ and ‘SOM’ sections are fully compatible with each other, enabling, for example, the SOM-based subgroups to be analysed via histograms and vice versa.
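The filtering and correlation steps named in this caption can be sketched in a few lines of Python. This is only an illustration of the underlying statistics, not the EpiMetal front-end code (which is written in JS); the function names and the sample layout (a list of dictionaries keyed by variable name) are assumptions made for the example.

```python
def filter_below(samples, variable, threshold):
    """Keep samples whose value for `variable` is below `threshold`
    (e.g. HDL-C < 1.0 mmol/L)."""
    return [s for s in samples if s[variable] < threshold]


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(samples, variables):
    """Nested dict of pairwise Spearman correlations (the heatmap values)."""
    columns = {v: [s[v] for s in samples] for v in variables}
    return {a: {b: spearman(columns[a], columns[b]) for b in variables}
            for a in variables}
```

A heatmap then only needs to colour each cell of the returned matrix; a filtered subgroup from `filter_below` can be passed to `correlation_matrix` in the same way as the full dataset.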

Installation

A step-by-step installation guide is provided in the EpiMetal software user guide. The source data are imported from a machine-readable file (usually a tab-separated .tsv or comma-separated .csv file) in which each row corresponds to an individual sample. These samples have a unique identifier and usually belong to a single dataset. An import configuration file defines the column names and the column separator. A metadata file is needed for the source data to indicate the variable name, a free description, the unit of measurement (e.g. mmol/L) and the group name for each variable. For variables that follow a common naming pattern, a regular expression can be employed. Imported variable types are either numerical or categorical. Example files for the sample data are provided in the GitHub repository. There is no hard limit on the number of samples or variables that can be imported to the software, but the larger the dataset, the longer the download and processing times. The software has been tested with over 500 variables and some 30 000 samples, which represent a realistic upper bound for current usage. During the installation phase, a Docker container is set up that initializes the database from the source data file, along with the metadata descriptions. A second container is used for compiling the front-end application from the JS source files, forming a bundle. This container contains an HTTP server to serve the bundle to the user’s web browser. A third container runs the back end and its application programming interface (API). Using the Docker system, the EpiMetal software can be set up to serve an individual user locally, or to allow the software to be accessed by the public. An example configuration showing how to limit access to the software instance with a user name and a password is provided.
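To make the import step more concrete, the following Python sketch reads a tab-separated sample file according to a small import configuration and looks up variable metadata via regular expressions. The configuration keys, column names, variable names and metadata entries are hypothetical and chosen for illustration only; the actual file formats are described in the EpiMetal user guide and the example files in the repository.

```python
import csv
import io
import re

# Hypothetical import configuration: separator and bookkeeping columns.
IMPORT_CONFIG = {"separator": "\t", "id_column": "sampleid", "dataset_column": "dataset"}

# Hypothetical metadata: one entry per variable or per regular expression.
METADATA = [
    {"pattern": r"HDL-C", "description": "HDL cholesterol",
     "unit": "mmol/L", "group": "Cholesterol"},
    {"pattern": r"[A-Z]+-VLDL-TG", "description": "Triglycerides in a VLDL subclass",
     "unit": "mmol/L", "group": "Lipoprotein subclasses"},
]

MISSING = {"", "NA"}  # assumed tokens for missing values


def describe_variable(name):
    """Return the first metadata entry whose pattern matches the whole name."""
    for entry in METADATA:
        if re.fullmatch(entry["pattern"], name):
            return entry
    return None


def parse_value(text):
    """Numerical if it parses as a float, otherwise categorical text."""
    if text in MISSING:
        return None
    try:
        return float(text)
    except ValueError:
        return text


def load_samples(handle, config):
    """Read one row per sample, keyed by (dataset, sample identifier)."""
    reader = csv.DictReader(handle, delimiter=config["separator"])
    samples = {}
    for row in reader:
        key = (row.pop(config["dataset_column"]), row.pop(config["id_column"]))
        if key in samples:
            raise ValueError(f"duplicate sample identifier: {key}")
        samples[key] = {name: parse_value(value) for name, value in row.items()}
    return samples


if __name__ == "__main__":
    text = "sampleid\tdataset\tHDL-C\tsex\nA1\tpilot\t1.25\tF\n"
    print(load_samples(io.StringIO(text), IMPORT_CONFIG))
```

Matching metadata by pattern keeps the metadata file short when many variables share a naming scheme, as is typical for lipoprotein subclass measures.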

Architecture

The separation of concerns in EpiMetal is achieved by the common distinction between the presentation layer (front end) and the data access layer (back end). Modern desktop computers have considerable computing resources available, and thanks to the developments in web browser JS engines and web technologies, those resources can be fully utilized in web applications. The philosophy of EpiMetal is to perform calculations on the client side and store the results in the database for later retrieval. This is achieved by asynchronously downloading chunks of the sample data and then performing the computations as requested by the user. Computationally heavy actions are processed in parallel using Web Workers, if supported by the browser. AngularJS was chosen as the front-end web framework because it was popular at the time the project started, it had an active user base and several useful libraries, and its two-way data binding feature suited the interactive nature of the application.
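The pattern of downloading sample data in chunks and computing on the client can be illustrated with a small Python analogy. EpiMetal itself does this in the browser with JS and Web Workers; the sketch below only mirrors the idea using `asyncio`, and `fetch_chunk` is a stand-in for a real HTTP request with made-up return values.

```python
import asyncio


def chunked(items, size):
    """Split a list of variable names into request-sized chunks."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def fetch_chunk(variables):
    """Stand-in for one asynchronous request to the back end (hypothetical).

    Returns dummy values so that the example is self-contained.
    """
    await asyncio.sleep(0)  # yield control, as a network call would
    return {name: [float(len(name))] * 3 for name in variables}


async def fetch_variables(variables, chunk_size=2):
    """Request all chunks concurrently and merge the responses."""
    tasks = [fetch_chunk(chunk) for chunk in chunked(variables, chunk_size)]
    merged = {}
    for part in await asyncio.gather(*tasks):
        merged.update(part)
    return merged


def column_means(columns):
    """A client-side computation performed once the data have arrived."""
    return {name: sum(values) / len(values) for name, values in columns.items()}


if __name__ == "__main__":
    data = asyncio.run(fetch_variables(["HDL-C", "TG", "LDL-C"]))
    print(column_means(data))
```

Requesting only the variables needed for the current view keeps the initial load small, at the cost of additional requests as the user explores further variables.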

Back and front ends

The back end is a Python application developed with the Flask framework, using MongoEngine for object-document mapping and served with the Gunicorn HTTP server. The back end defines actions for retrieving the settings for the software, the metadata for variables, and samples for the requested variables. In addition, the back end is called to request previously stored SOM computations and SOM planes, and to store new ones. The front end is a single-page application written in JS using the AngularJS framework. Several open-source auxiliary JS libraries are employed, most notably DC.js for interactive charting throughout the application and Data-Driven Documents (D3.js) as a dependency for DC.js and for SOM planes and other chart types. Visual appearance and user interface (UI) styling depend on the Bootstrap framework and the Angular-strap library. The UI allows the user to freely create, resize, move and close window-like objects containing figures. The front end fetches the necessary data samples by querying the back end asynchronously as the user navigates on the page. Figures produced with EpiMetal can be exported in either SVG or PNG format. A particular state of the application can always be saved by creating a link to it and sharing the link with collaborators.
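One way to picture the shareable state link is as a stored snapshot addressed by a short token. The Python sketch below is an assumption-laden illustration: the in-memory dictionary stands in for the MongoDB store, and the URL layout, token length and state fields are hypothetical rather than taken from the EpiMetal source.

```python
import hashlib
import json

# In-memory stand-in for the database collection of stored sessions.
STATE_STORE = {}


def save_state(state, base_url="https://example.org/epimetal"):
    """Serialize the application state and return a link that identifies it.

    The token is derived from the content, so saving the same state twice
    yields the same link. The URL layout is hypothetical.
    """
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    token = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    STATE_STORE[token] = payload
    return f"{base_url}/#/state/{token}"


def load_state(link):
    """Restore the state that a previously created link points to."""
    token = link.rsplit("/", 1)[-1]
    return json.loads(STATE_STORE[token])


if __name__ == "__main__":
    snapshot = {
        "datasets": ["pilot"],
        "filters": [{"variable": "HDL-C", "max": 1.0}],
        "windows": [{"type": "histogram", "variable": "HDL-C"}],
    }
    link = save_state(snapshot)
    print(link, load_state(link) == snapshot)
```

Storing the snapshot on the server and sharing only a token keeps links short even when the state includes many windows, filters and subgroup definitions.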

Usage exemplar: combined comprehensive metabolomics and lipidomics data

We present here explorative analyses in a unique pilot cohort of 190 blood samples for which serum NMR metabolomics and MS lipidomics data are available. All these molecular data, together with basic clinical characteristics (age, sex, systolic and diastolic blood pressure, body mass index and height), have been made open access along with the EpiMetal software. The NMR metabolomics data comprise over 200 metabolic measures, including standard lipids, lipoprotein subclass and composition data, fatty acids, amino acids, ketones, glycolysis and gluconeogenesis-related substrates and an inflammatory marker, glycoprotein acetyls. The MS lipidomics data consist of over 350 individual lipid concentrations in 20 lipid classes including, for example, ceramides, sphingomyelins, phosphatidylcholines, phosphatidylinositols, cholesteryl esters and triacylglycerols.

We used EpiMetal to conduct a multifaceted exploratory analysis of this pilot dataset. We sought to demonstrate some commonly known epidemiological and molecular features of these types of data. First, we plotted the distribution of high-density lipoprotein cholesterol (HDL-C) and the correlation of HDL-C with triglycerides (TG) in the entire dataset (Figure 2B). As expected, the histogram roughly follows a normal distribution. The scatterplot of HDL-C against TG reveals the well-known negative population-level correlation.

Figure 2.

Explorative analysis of a cohort of 190 samples with serum NMR metabolomics and mass spectrometry lipidomics measures available. A: The control panel of EpiMetal that contains clickable buttons for generating graphs and selecting, naming and generating subgroups. Colours indicate the entire cohort (cyan) and selected subgroups based on the self-organizing map (SOM) analysis. B: The histograms of HDL-C in the entire cohort and in the subgroups and the scatterplot of HDL-C vs triglycerides. C: The SOM component planes for serum triglycerides, HDL-C and LDL-C (note that the individuals in the entire cohort are identically distributed in each plane). Colours indicate high (red) and low (blue) concentration values of the variable in each plane. Individuals with similar metabolic profiles cluster close to each other in the SOM component planes. The user can specify and select different subgroups via the circular selection tools. D: A box plot for LDL-C in the entire cohort and in the two subgroups. E: Regression analyses with a forest plot showing standardized regression coefficients. Standardization means that, prior to analyses, all continuous, non-binary variables are normalized to zero mean and unit standard deviation. Point estimates are indicated by a dot surrounded by the 95% confidence interval (CI). Plotting HDL-C as the outcome and triglycerides as an exposure illustrates the same negative association as already indicated via the scatterplot in B.

We then applied the SOM analysis to organize the samples in the dataset via their systemic metabolic profiles. Readers interested in the comparison of the SOM methodology with other subgrouping methods in epidemiology are referred to a recent Software Application Profile in the IJE. Additional details of the statistical issues in SOM analyses can be found in references.
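For readers unfamiliar with SOMs, the following Python sketch shows the textbook online training algorithm together with a circular subgroup selection on the map grid. It is a minimal illustration under stated assumptions: the learning-rate and neighbourhood schedules, the map size and all function names are choices made for this example and are not necessarily those used in EpiMetal.

```python
import math
import random


def best_matching_unit(weights, sample):
    """Grid position of the unit whose weight vector is closest to the sample."""
    return min(weights, key=lambda unit: sum(
        (w - s) ** 2 for w, s in zip(weights[unit], sample)))


def train_som(samples, rows=4, cols=4, epochs=40, seed=1):
    """Train a rows x cols map; returns {(row, col): weight vector}.

    Assumed schedules: the learning rate and the neighbourhood radius
    both shrink linearly over the course of training.
    """
    rng = random.Random(seed)
    # Initialize every unit with a copy of a randomly chosen sample.
    weights = {(r, c): list(rng.choice(samples))
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)
        for sample in order:
            frac = step / total_steps
            rate = 0.5 * (1 - frac)
            radius = max(rows, cols) / 2 * (1 - frac) + 0.5
            br, bc = best_matching_unit(weights, sample)
            for (r, c), w in weights.items():
                dist2 = (r - br) ** 2 + (c - bc) ** 2
                influence = math.exp(-dist2 / (2 * radius ** 2))
                # Pull the unit towards the sample, more strongly near the BMU.
                for d in range(len(w)):
                    w[d] += rate * influence * (sample[d] - w[d])
            step += 1
    return weights


def select_subgroup(weights, samples, center, radius):
    """Indices of samples whose best-matching unit lies inside a circle."""
    chosen = []
    for index, sample in enumerate(samples):
        r, c = best_matching_unit(weights, sample)
        if (r - center[0]) ** 2 + (c - center[1]) ** 2 <= radius ** 2:
            chosen.append(index)
    return chosen
```

In practice the input variables would be standardized before training so that no single measure dominates the distance calculation; a component plane for one variable is then simply that variable's weight plotted across the grid.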
We based the SOM profiling on 26 metabolic measures including multiple amino acids, 14 lipoprotein subclasses, standard cholesterol measures, glycoprotein acetyls and glycolysis-related measures. It should be noted that users can freely modify the initial SOM training data according to their preferences and data characteristics. The SOM planes for low-density lipoprotein cholesterol (LDL-C), HDL-C and TG are shown in Figure 2C. These planes reveal that, on average, people with high circulating HDL-C (circle A in the SOM; subgroup marked lowest TG) are indeed those who have low TG and vice versa (area marked B in the SOM; subgroup marked lowest HDL), as expected from the previous scatterplot. The SOM analysis also reveals that circulating LDL-C concentrations are largely independent of HDL-C and TG in this pilot dataset; this is also emphasized in Figure 2D by the box plot for LDL-C. These associations can also be illustrated via formal regression analyses; we considered HDL-C as an outcome variable and LDL-C or TG as an exposure with age and sex as covariates. The results are given in Figure 2E for the entire cohort and the above-mentioned SOM-derived subgroups. The negative association between HDL-C and TG, depicted in Figure 2B, is well replicated in the formal regression analysis. These demonstrations indicate the internal consistency of various software functions and illustrate that the pilot cohort represents well-known features of lipoprotein metabolism with respect to lipoprotein lipid measures. Exploration of associations between the NMR metabolomics and MS lipidomics data can be found in Supplementary Figure S2, available as Supplementary data at IJE online, which shows a heatmap of Spearman’s rank correlation coefficients between selected lipoprotein (NMR) and lipid variables (MS). Overall, the correlations are very well in line with the known molecular characteristics of lipoprotein subclasses and their lipid compositions and demonstrate robust agreement between the NMR metabolomics and MS lipidomics platforms. To further demonstrate the properties of EpiMetal, we performed an additional set of analyses using data from two large-scale population-based epidemiological cohorts including over 10 000 individuals (see the Supplement, available as Supplementary data at IJE online).
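The standardized regression coefficients shown in the forest plot can be illustrated for the simplest case of one exposure and no covariates. The Python sketch below is a simplification under stated assumptions: it omits the age and sex adjustment used in the exemplar, uses a normal approximation (z = 1.96) for the 95% confidence interval, and the numbers in the usage example are invented for illustration rather than taken from the pilot data.

```python
from statistics import mean, stdev


def standardize(values):
    """Scale to zero mean and unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def standardized_regression(outcome, exposure, z=1.96):
    """Return (beta, lower, upper) for outcome ~ exposure.

    Both variables are standardized first, so beta is expressed in
    standard deviation units and the intercept is zero. The interval is
    beta +/- z * SE (normal approximation; covariates are not handled).
    """
    y, x = standardize(outcome), standardize(exposure)
    n = len(x)
    sxx = sum(a * a for a in x)
    beta = sum(a * b for a, b in zip(x, y)) / sxx
    residual_ss = sum((b - beta * a) ** 2 for a, b in zip(x, y))
    se = (residual_ss / (n - 2) / sxx) ** 0.5
    return beta, beta - z * se, beta + z * se


if __name__ == "__main__":
    # Invented illustrative values (mmol/L), not data from the pilot cohort.
    hdl_c = [1.6, 1.2, 1.4, 0.9, 1.0, 0.8]
    tg = [0.8, 1.5, 1.1, 2.2, 1.9, 2.6]
    beta, lower, upper = standardized_regression(hdl_c, tg)
    print(f"beta = {beta:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

With a single standardized exposure, the coefficient equals the Pearson correlation between the two variables, which is why the regression result mirrors the scatterplot in Figure 2B.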

Conclusion

The new EpiMetal software is used via a modern web browser and provides an intuitive, easy-to-use graphical interface for multiple statistical methods relevant to epidemiological analyses. It easily handles data for tens of thousands of people and hundreds of measures, numbers that are a reality nowadays in many metabolomics applications. It provides instant data visualizations and allows convenient sharing of results and data via snapshots accessible through an automatically created www-link. The datasets can be fully customized by the users. We illustrated the usage and opportunities of EpiMetal in real large-scale epidemiological datasets (Figure 1; and the Supplement, available as Supplementary data at IJE online). In addition, we provide an open access usage exemplar of EpiMetal for a pilot cohort in which over 500 quantitative molecular measures are available for each sample. With increasing amounts of complex molecular data in epidemiology, sophisticated software is required for both convenient data handling and statistical analyses. Without statistical or programming expertise, the learning curve for typical modern data analysis software, for example R, can be steep. From the epidemiology perspective, extensive molecular data may challenge traditional hypothesis-driven data analyses. These are common situations in which the EpiMetal software can help researchers: first, by enabling instant graphical exploration and analysis of a (new) dataset without the hurdles of programming-based data analysis; and second, by allowing data-driven options to find unknown relations in the data without pre-existing hypotheses. As far as we are aware, the EpiMetal software is a first-of-its-kind versatile tool for both traditional and data-driven analyses of extensive large-scale epidemiological datasets.

Funding

M.A.K. is supported by a Senior Research Fellowship from the National Health and Medical Research Council (NHMRC) of Australia (APP1158958). He also works in a unit that is supported by the University of Bristol and UK Medical Research Council (MC_UU_12013/1). Q.W. is supported by the Novo Nordisk Foundation (NNF17OC0027034). B.A.K. is supported by a Senior Principal Research Fellowship from the NHMRC of Australia (APP1154331). The Sigrid Juselius Foundation and the Academy of Finland have also funded this work. The Northern Finland Birth Cohort 1966 and the Young Finns Study have been financially supported by multiple funding bodies (see the Supplement, available as Supplementary data at IJE online). The Baker Institute is supported in part by the Victorian Government’s Operational Infrastructure Support Program.

Conflict of interest: A.J.K. is an employee and shareholder of Nightingale Health Ltd, a company offering NMR-based metabolic profiling. Other authors declare no conflict of interest.

References

1. High-Throughput Plasma Lipidomics: Detailed Mapping of the Associations with Cardiometabolic Risk Factors.

Authors: Kevin Huynh; Christopher K Barlow; Kaushala S Jayawardana; Jacquelyn M Weir; Natalie A Mellett; Michelle Cinel; Dianna J Magliano; Jonathan E Shaw; Brian G Drew; Peter J Meikle
Journal: Cell Chem Biol Date: 2018-11-08 Impact factor: 8.116

2. Numero: a statistical framework to define multivariable subgroups in complex population-based datasets.

Authors: Song Gao; Stefan Mutter; Aaron Casey; Ville-Petteri Mäkinen
Journal: Int J Epidemiol Date: 2019-04-01 Impact factor: 7.196

3. Data-driven metabolic subtypes predict future adverse events in individuals with type 1 diabetes.

Authors: Raija Lithovius; Iiro Toppila; Valma Harjutsalo; Carol Forsblom; Per-Henrik Groop; Ville-Petteri Mäkinen
Journal: Diabetologia Date: 2017-04-24 Impact factor: 10.122

4. Plasma lipid profiling in a large population-based cohort.

Authors: Jacquelyn M Weir; Gerard Wong; Christopher K Barlow; Melissa A Greeve; Adam Kowalczyk; Laura Almasy; Anthony G Comuzzie; Michael C Mahaney; Jeremy B M Jowett; Jonathan Shaw; Joanne E Curran; John Blangero; Peter J Meikle
Journal: J Lipid Res Date: 2013-07-18 Impact factor: 5.922

5. Metabolic phenotyping of diabetic nephropathy.

Authors: V-P Mäkinen; A J Kangas; P Soininen; P Würtz; P-H Groop; M Ala-Korpela
Journal: Clin Pharmacol Ther Date: 2013-08-09 Impact factor: 6.875

6. Plasma-triglycerides in regulation of H.D.L.-cholesterol levels.

Authors: E J Schaefer; R I Levy; D W Anderson; R N Danner; H B Brewer; W C Blackwelder
Journal: Lancet Date: 1978-08-19 Impact factor: 79.321

7. Metabolomic Profiling of Statin Use and Genetic Inhibition of HMG-CoA Reductase.

Authors: Peter Würtz; Qin Wang; Pasi Soininen; Antti J Kangas; Ghazaleh Fatemifar; Tuulia Tynkkynen; Mika Tiainen; Markus Perola; Therese Tillin; Alun D Hughes; Pekka Mäntyselkä; Mika Kähönen; Terho Lehtimäki; Naveed Sattar; Aroon D Hingorani; Juan-Pablo Casas; Veikko Salomaa; Mika Kivimäki; Marjo-Riitta Järvelin; George Davey Smith; Mauno Vanhala; Debbie A Lawlor; Olli T Raitakari; Nish Chaturvedi; Johannes Kettunen; Mika Ala-Korpela
Journal: J Am Coll Cardiol Date: 2016-03-15 Impact factor: 24.094

8. Sympathetic neural adaptation to hypocaloric diet with or without exercise training in obese metabolic syndrome subjects.

Authors: Nora E Straznicky; Elisabeth A Lambert; Paul J Nestel; Mariee T McGrane; Tye Dawood; Markus P Schlaich; Kazuko Masuo; Nina Eikelis; Barbora de Courten; Justin A Mariani; Murray D Esler; Florentia Socratous; Reena Chopra; Carolina I Sari; Eldho Paul; Gavin W Lambert
Journal: Diabetes Date: 2009-10-15 Impact factor: 9.461

9. Metabolic phenotypes, vascular complications, and premature deaths in a population of 4,197 patients with type 1 diabetes.

Authors: Ville-Petteri Mäkinen; Carol Forsblom; Lena M Thorn; Johan Wadén; Daniel Gordin; Outi Heikkilä; Kustaa Hietala; Laura Kyllönen; Janne Kytö; Milla Rosengård-Bärlund; Markku Saraheimo; Nina Tolonen; Maija Parkkonen; Kimmo Kaski; Mika Ala-Korpela; Per-Henrik Groop
Journal: Diabetes Date: 2008-06-10 Impact factor: 9.461

10. Quantitative Serum Nuclear Magnetic Resonance Metabolomics in Large-Scale Epidemiology: A Primer on -Omic Technologies.

Authors: Peter Würtz; Antti J Kangas; Pasi Soininen; Debbie A Lawlor; George Davey Smith; Mika Ala-Korpela
Journal: Am J Epidemiol Date: 2017-11-01 Impact factor: 4.897
