MetMSLine: an automated and fully integrated pipeline for rapid processing of high-resolution LC-MS metabolomic datasets.

William M B Edmands¹, Dinesh K Barupal¹, Augustin Scalbert¹.

Abstract

MetMSLine is a complete collection of functions in the R programming language, accessible through a GUI, for biomarker discovery in large-scale liquid chromatography high-resolution mass spectral datasets, from acquisition through to final metabolite identification. It forms a backend to the output of any peak-picking software, such as XCMS. MetMSLine automatically creates subdirectories, data tables and relevant figures at the following steps: (i) signal smoothing, normalization, filtration and noise transformation (PreProc.QC.LSC.R); (ii) PCA and automatic outlier removal (Auto.PCA.R); (iii) automatic regression, biomarker selection, hierarchical clustering and cluster ion/artefact identification (Auto.MV.Regress.R); (iv) biomarker-MS/MS fragmentation spectra matching and fragment/neutral loss annotation (Auto.MS.MS.match.R) and (v) semi-targeted metabolite identification based on a list of theoretical masses obtained from public databases (DBAnnotate.R).
AVAILABILITY AND IMPLEMENTATION: All source code and suggested parameters are available in an un-encapsulated layout on http://wmbedmands.github.io/MetMSLine/. Readme files and a synthetic dataset comprising X-variables (simulated LC-MS data), Y-variables (simulated continuous variables) and metabolite theoretical masses are also available on our GitHub repository.

Year: 2014 PMID: 25348215 PMCID: PMC4341062 DOI: 10.1093/bioinformatics/btu705

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Untargeted metabolite profiling is a promising approach to discover novel risk factors for chronic diseases and biomarkers for disease diagnosis (Wang et al.; Ritchie et al.). Liquid chromatography coupled to high-resolution mass spectrometry (LC–hrMS) is increasingly used for data acquisition in large-scale metabolomic studies (e.g. >300 samples). Raw data must be processed with computational tools to extract relevant information and reach meaningful biological conclusions. Processing of LC–hrMS raw data is currently facilitated by software such as XCMS, MZmine, MetaboAnalyst and MAVEN (Smith et al.; Pluskal et al.; Clasquin et al.; Xia et al.). However, an automated software pipeline that needs minimal manual interaction is also desirable for efficient, reproducible and objective large-scale metabolomic data analysis. Such a pipeline can systematically perform all downstream aspects of metabolomic analysis following peak-picking. Developing the workflow in the R language offers several advantages, such as the availability of modular packages for functional programming and graphics and the accessibility of a GUI for non-specialists. We have developed novel software, MetMSLine, coded in the R language, which automates untargeted metabolomic data analysis of large datasets, from acquisition of data on LC–hrMS platforms through to identification of unknown biomarkers.

2 Results

An overview of data processing steps that are integrated by MetMSLine is shown in Figure 1. Use of the software via GUI requires a clear understanding of data processing steps in metabolomics.

Fig. 1.

Data acquisition and MetMSLine data processing workflow (Steps 1–4). Sample preparation (e.g. urine dilution) is followed by untargeted MS and MS/MS data acquisition in sequence and by peak-picking software. MetMSLine then performs sequentially: signal drift correction and pre-processing (Step 1), automatic PCA-based outlier removal (Step 2) (samples = black, QCs = red, outliers = green), automatic iterative regression based on the continuous Y-variables supplied and cluster ion identification (Step 3) and final identification by data-dependent MS/MS and database matching (Step 4).

2.1 Step 1: Pre-processing of raw data matrix from XCMS

Large LC–hrMS metabolomics datasets contain unwanted variation introduced by MS signal drift/attenuation and multiplicative noise across the dynamic range. These effects can be detrimental to biomarker discovery, so MS features require rigorous quality assurance. PreProc.QC.LSC first zero-fills the data; if sample normalization is required, the median fold change method can then be applied (Veselkov et al.). PreProc.QC.LSC then uses the QC-based locally weighted scatter-plot smoothing method to alleviate the effects of signal drift (Dunn et al.). The degree of smoothing is controlled by the smoother span (e.g. f = 1/5), which sets the proportion of points used to smooth at each point. Data are then log-transformed, and finally only features that are analytically stable across the regularly injected pooled QCs (one every 5–10 true samples) are retained (e.g. RSD = 30, i.e. <30% relative standard deviation).
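Two of the operations named above, median fold change normalization and the QC-based RSD filter, can be sketched in a few lines. The following is a minimal Python illustration and not MetMSLine's R code: the function names, the use of the per-feature median as the reference profile, and the calculation of the RSD on untransformed intensities are all assumptions made for the example. The QC-based smoothing step is omitted.

```python
import statistics


def median_fold_change_normalize(samples):
    """Scale each sample by its median fold change to a reference profile.

    `samples` is a list of rows (one row per sample) of intensities.
    Assumption: the reference profile is the per-feature median across samples,
    and every sample has at least one non-zero feature.
    """
    n_features = len(samples[0])
    reference = [statistics.median(row[j] for row in samples)
                 for j in range(n_features)]
    normalized = []
    for row in samples:
        # Zero-filled entries are skipped so they do not bias the median.
        fold_changes = [x / ref for x, ref in zip(row, reference)
                        if ref > 0 and x > 0]
        factor = statistics.median(fold_changes)
        normalized.append([x / factor for x in row])
    return normalized


def rsd_percent(values):
    """Relative standard deviation in percent (sample SD divided by mean)."""
    mean = statistics.fmean(values)
    if mean == 0:
        return float("inf")
    return 100.0 * statistics.stdev(values) / mean


def stable_features(qc_rows, rsd_max=30.0):
    """Indices of features whose RSD across the pooled QCs is below rsd_max."""
    n_features = len(qc_rows[0])
    return [j for j in range(n_features)
            if rsd_percent([row[j] for row in qc_rows]) < rsd_max]


if __name__ == "__main__":
    # Second sample is a two-fold more concentrated copy of the first.
    print(median_fold_change_normalize([[100.0, 200.0, 300.0],
                                        [200.0, 400.0, 600.0]]))
    # Feature 0 is stable across QCs (RSD 2%), feature 1 is not.
    print(stable_features([[100.0, 10.0], [102.0, 30.0], [98.0, 50.0]]))
```

In this sketch a uniformly diluted or concentrated sample is brought back onto the same scale as the others, and only features reproducible across the QC injections survive the filter.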

2.2 Step 2: Removal of outliers

The next function performs automated removal of outliers in the pre-processed data based on expansion of the Hotelling's T² distribution ellipse. The argument ‘out.tol’ (outlier tolerance) controls the proportional expansion of the ellipse (e.g. 1.1 for a 10% proportional expansion). Any samples that fall beyond this expanded ellipse in the score plot of the first and second principal components are removed, and the PCA model is recalculated. If outliers are detected, Auto.PCA performs two rounds of outlier removal and saves, in .csv format, details of the outliers removed along with the corresponding samples from the Y-variable data table supplied in the parent directory.
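The role of the outlier tolerance can be illustrated with a small Python sketch. It is not the Auto.PCA implementation. It assumes that the PCA scores are already mean-centred, that the semi-axes of the Hotelling's T² ellipse have been computed elsewhere, and that the expansion scales both semi-axes by `out_tol`; the function and argument names are hypothetical.

```python
def outside_expanded_ellipse(scores, semi_axis_pc1, semi_axis_pc2, out_tol=1.1):
    """Return indices of samples lying outside the expanded ellipse.

    `scores` is a list of (PC1, PC2) score pairs centred on the origin.
    Assumption: the expansion multiplies both semi-axes by `out_tol`.
    """
    a = semi_axis_pc1 * out_tol
    b = semi_axis_pc2 * out_tol
    return [i for i, (t1, t2) in enumerate(scores)
            if (t1 / a) ** 2 + (t2 / b) ** 2 > 1.0]


if __name__ == "__main__":
    scores = [(0.0, 0.0), (10.5, 0.0), (12.0, 0.0), (0.0, 5.6)]
    # With a 10% expansion the sample at (10.5, 0) is tolerated.
    print(outside_expanded_ellipse(scores, 10.0, 5.0, out_tol=1.1))  # -> [2, 3]
    # Without expansion it would also be flagged.
    print(outside_expanded_ellipse(scores, 10.0, 5.0, out_tol=1.0))  # -> [1, 2, 3]
```

A larger tolerance therefore removes fewer samples; in a full workflow the PCA model and ellipse would be recomputed after each round of removal.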

2.3 Step 3: Multivariate regression

Auto.MV.Regress uses the continuous Y-variables supplied for regression against the pre-processed MS dataset. It creates a subfolder for each Y-variable, then identifies potential biomarkers that exceed a user-defined correlation threshold (e.g. Corr.thresh = 0.3) and fall below a multiple-testing-corrected P-value (e.g. P = 0.01). Both scatterplots and box-and-whisker plots are generated. Potential biomarkers above the threshold are hierarchically clustered, and ‘X–Y’ and ‘X–X’ heatmaps are generated. Inter-feature clustering (X–X) is used to identify cluster ions from a list of 88 isotope, adduct, fragment and co-metabolite mass shifts.
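The selection rule described above (a correlation threshold combined with a corrected P-value cut-off) can be sketched as follows. This Python example is illustrative only. The source does not state the correlation coefficient, the P-value calculation or the correction method, so the Pearson coefficient, the Fisher z approximation for the P-value, the Benjamini-Hochberg adjustment and the use of the absolute correlation are all assumptions, as are the function names.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided P-value from the Fisher z approximation (requires n > 3)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_features(features, y, corr_thresh=0.3, p_max=0.01):
    """Indices of features with |r| >= corr_thresh and adjusted P < p_max."""
    n = len(y)
    rs = [pearson_r(x, y) for x in features]
    adj = benjamini_hochberg([approx_p_value(r, n) for r in rs])
    return [j for j, (r, p) in enumerate(zip(rs, adj))
            if abs(r) >= corr_thresh and p < p_max]


if __name__ == "__main__":
    y = [float(i) for i in range(20)]
    linear = [2.0 * v + 1.0 for v in y]                  # tracks y exactly
    noise = [1.0 if i % 2 == 0 else -1.0 for i in range(20)]
    print(select_features([linear, noise], y))
```

Requiring both criteria guards against features with a large but unreliable correlation in small sample sets and against statistically significant but weak correlations in large ones.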

2.4 Step 4 (i): MS/MS matching for biomarker structure elucidation

As an LC–hrMS platform can acquire MS/MS data with precise masses, we coded a function to use the MS/MS data for the identification of metabolites. This function matches potential biomarkers identified by Auto.MV.Regress to MS/MS fragmentation spectra by a retention time window (ret = 10 s) and mass tolerance (Frag.ppm = 20). Auto.MS.MS.match calculates the precursor (in blue on plot) to fragment (in red on plot) and inter-fragment mass differences, and labels where available the neutral losses/fragments commonly encountered in MS/MS spectra.
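The matching and annotation logic can be sketched in Python as below. This is a simplified illustration, not the Auto.MS.MS.match code. The data layout (dictionaries with `rt`, `mz` and `precursor_mz` keys), the short neutral loss table and the use of a single ppm tolerance for both the precursor match and the mass difference comparison are assumptions; the source does not specify how Frag.ppm is applied to each comparison.

```python
# Monoisotopic masses (Da) of a few commonly observed neutral losses.
NEUTRAL_LOSSES = {
    "H2O": 18.010565,
    "NH3": 17.026549,
    "CO2": 43.989829,
    "glucuronide (C6H8O6)": 176.032088,
}


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_msms(feature, scans, ret=10.0, frag_ppm=20.0):
    """MS/MS scans whose precursor matches the feature in RT (s) and m/z."""
    return [s for s in scans
            if abs(s["rt"] - feature["rt"]) <= ret
            and abs(ppm_error(s["precursor_mz"], feature["mz"])) <= frag_ppm]


def annotate_neutral_losses(precursor_mz, fragment_mzs, frag_ppm=20.0):
    """Label precursor-to-fragment differences that match a known loss.

    Assumption: the tolerance in Da is frag_ppm relative to the precursor m/z.
    """
    tol = precursor_mz * frag_ppm / 1e6
    labels = []
    for frag in fragment_mzs:
        diff = precursor_mz - frag
        for name, mass in NEUTRAL_LOSSES.items():
            if abs(diff - mass) <= tol:
                labels.append((frag, name))
    return labels


if __name__ == "__main__":
    feature = {"mz": 371.1000, "rt": 300.0}
    scans = [
        {"rt": 305.0, "precursor_mz": 371.1005},  # within 10 s and 20 ppm
        {"rt": 320.0, "precursor_mz": 371.1000},  # outside the RT window
        {"rt": 302.0, "precursor_mz": 371.2000},  # outside the mass tolerance
    ]
    print(match_msms(feature, scans))
    print(annotate_neutral_losses(371.1000, [195.0679, 353.0894, 200.0000]))
```

The same comparison could be extended to inter-fragment differences by applying it to every pair of fragment masses rather than only to precursor and fragment pairs.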

2.5 Step 4 (ii): Compound annotation using exact mass matching

The final function uses targeted lists of experiment-specific anticipated metabolites (in .csv format) provided by the user to annotate the unknown biomarkers. From these lists, DBAnnotate optionally calculates the expected m/z of both typical phase II conjugates and electrospray adducts. DBAnnotate matches the biomarkers against all combinations of these potential theoretical masses within a user-defined mass tolerance (e.g. MassAcc = 10 ppm) and returns an aggregated result table.
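As a worked illustration of exact mass matching, the Python sketch below expands each anticipated metabolite into conjugate and adduct m/z values and reports those within the mass tolerance. It is not the DBAnnotate implementation: the short adduct and conjugate tables, the function names and the ferulic acid example are assumptions chosen for the illustration.

```python
# Mass shifts (Da) added to the neutral monoisotopic mass M.
ADDUCTS = {"[M+H]+": 1.007276, "[M+Na]+": 22.989218, "[M-H]-": -1.007276}
# Phase II conjugation shifts: glucuronidation (+C6H8O6) and sulfation (+SO3).
CONJUGATES = {"parent": 0.0, "glucuronide": 176.032088, "sulfate": 79.956815}


def theoretical_mzs(monoisotopic_mass):
    """Yield (conjugate, adduct, m/z) for every conjugate/adduct combination."""
    for conj, conj_shift in CONJUGATES.items():
        for adduct, adduct_shift in ADDUCTS.items():
            yield conj, adduct, monoisotopic_mass + conj_shift + adduct_shift


def annotate(observed_mz, targets, mass_acc=10.0):
    """Return (name, conjugate, adduct, ppm error) for matches within mass_acc.

    `targets` maps metabolite names to neutral monoisotopic masses.
    """
    hits = []
    for name, mass in targets.items():
        for conj, adduct, mz in theoretical_mzs(mass):
            ppm = (observed_mz - mz) / mz * 1e6
            if abs(ppm) <= mass_acc:
                hits.append((name, conj, adduct, round(ppm, 2)))
    return hits


if __name__ == "__main__":
    targets = {"ferulic acid": 194.057909}  # C10H10O4
    print(annotate(193.0507, targets))  # deprotonated parent compound
    print(annotate(369.0830, targets))  # deprotonated glucuronide
    print(annotate(250.0000, targets))  # no candidate within 10 ppm
```

Because the number of candidate masses grows with every adduct and conjugate added, the chance of a coincidental match within the tolerance also grows, which is one reason to keep the target list experiment-specific.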

3 Conclusion

MetMSLine presents a complete data processing method; it is easy to use through its GUI and should help researchers to rapidly process large-scale LC–hrMS datasets. It potentially requires minimal manual interaction, in contrast to the extensive manual interaction required by commonly used software for LC–hrMS datasets such as MetaboAnalyst, MAVEN, apLCMS, MZmine and IDEOM (Yu et al.; Pluskal et al.; Clasquin et al.; Creek et al.; Xia et al.). The speed of the process leaves considerable scope for parameter optimization and allows more time to be dedicated to interpreting results.

Funding

This work was supported by the European Union (NutriTech and the European Cancer Platform (EUROCAN) [grant No. 289511, 260791]). Conflict of interest: none declared.

10 in total

1. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification.

Authors: Colin A Smith; Elizabeth J Want; Grace O'Maille; Ruben Abagyan; Gary Siuzdak
Journal: Anal Chem Date: 2006-02-01 Impact factor: 6.986

2. apLCMS--adaptive processing of high-resolution LC/MS data.

Authors: Tianwei Yu; Youngja Park; Jennifer M Johnson; Dean P Jones
Journal: Bioinformatics Date: 2009-05-04 Impact factor: 6.937

3. Optimized preprocessing of ultra-performance liquid chromatography/mass spectrometry urinary metabolic profiles for improved information recovery.

Authors: Kirill A Veselkov; Lisa K Vingara; Perrine Masson; Steven L Robinette; Elizabeth Want; Jia V Li; Richard H Barton; Claire Boursier-Neyret; Bernard Walther; Timothy M Ebbels; István Pelczer; Elaine Holmes; John C Lindon; Jeremy K Nicholson
Journal: Anal Chem Date: 2011-07-05 Impact factor: 6.986

4. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry.

Authors: Warwick B Dunn; David Broadhurst; Paul Begley; Eva Zelena; Sue Francis-McIntyre; Nadine Anderson; Marie Brown; Joshau D Knowles; Antony Halsall; John N Haselden; Andrew W Nicholls; Ian D Wilson; Douglas B Kell; Royston Goodacre
Journal: Nat Protoc Date: 2011-06-30 Impact factor: 13.491

5. LC-MS data processing with MAVEN: a metabolomic analysis and visualization engine.

Authors: Michelle F Clasquin; Eugene Melamud; Joshua D Rabinowitz
Journal: Curr Protoc Bioinformatics Date: 2012-03

6. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data.

Authors: Tomás Pluskal; Sandra Castillo; Alejandro Villar-Briones; Matej Oresic
Journal: BMC Bioinformatics Date: 2010-07-23 Impact factor: 3.169

7. IDEOM: an Excel interface for analysis of LC-MS-based metabolomics data.

Authors: Darren J Creek; Andris Jankevics; Karl E V Burgess; Rainer Breitling; Michael P Barrett
Journal: Bioinformatics Date: 2012-02-04 Impact factor: 6.937

8. Low-serum GTA-446 anti-inflammatory fatty acid levels as a new risk factor for colon cancer.

Authors: Shawn A Ritchie; Jon Tonita; Riaz Alvi; Denis Lehotay; Hoda Elshoni; Su- Myat; James McHattie; Dayan B Goodenowe
Journal: Int J Cancer Date: 2012-06-26 Impact factor: 7.396

9. Gut flora metabolism of phosphatidylcholine promotes cardiovascular disease.

Authors: Zeneng Wang; Elizabeth Klipfell; Brian J Bennett; Robert Koeth; Bruce S Levison; Brandon Dugar; Ariel E Feldstein; Earl B Britt; Xiaoming Fu; Yoon-Mi Chung; Yuping Wu; Phil Schauer; Jonathan D Smith; Hooman Allayee; W H Wilson Tang; Joseph A DiDonato; Aldons J Lusis; Stanley L Hazen
Journal: Nature Date: 2011-04-07 Impact factor: 49.962

10. MetaboAnalyst 2.0--a comprehensive server for metabolomic data analysis.

Authors: Jianguo Xia; Rupasri Mandal; Igor V Sinelnikov; David Broadhurst; David S Wishart
Journal: Nucleic Acids Res Date: 2012-05-02 Impact factor: 16.971
