Kelsey Chetnik (1), Lauren Petrick (2, 3), Gaurav Pandey (4, 5)

1. Department of Genetics and Genomic Sciences and Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
2. Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA. lauren.petrick@mssm.edu
3. Institute for Exposomics Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
4. Department of Genetics and Genomic Sciences and Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA. gaurav.pandey@mssm.edu
5. Institute for Exposomics Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Abstract
INTRODUCTION: Despite the availability of several pre-processing software packages, poor peak integration remains a prevalent problem in untargeted metabolomics data generated using liquid chromatography high-resolution mass spectrometry (LC-MS). As a result, the output of these packages may retain incorrectly calculated metabolite abundances that propagate into downstream analyses.

OBJECTIVES: To address this problem, we propose a computational methodology that combines machine learning and peak quality metrics to filter out low-quality peaks.

METHODS: We systematically compared the performance of 24 classifiers, generated by combining eight classification algorithms with three sets of peak quality metrics, on the task of distinguishing reliably integrated peaks from poorly integrated ones. These classifiers were also compared with a relative standard deviation (RSD) cut-off in pooled quality-control (QC) samples, which aims to remove peaks affected by analytical error.

RESULTS: The best-performing classifier combined the AdaBoost algorithm with a set of 11 peak quality metrics previously explored in untargeted metabolomics and proteomics studies. As a complementary approach, applying our framework to peaks retained after filtering at 30% RSD across pooled QC samples identified additional poorly integrated peaks that RSD filtering alone had not removed. An R implementation of these classifiers and the overall computational approach is available as the MetaClean package (https://CRAN.R-project.org/package=MetaClean).

CONCLUSION: Our work represents an important step forward in developing an automated tool for filtering out unreliable peak integrations in untargeted LC-MS metabolomics data.
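The RSD cut-off used as the baseline in the abstract can be illustrated with a minimal Python sketch. This is not the MetaClean implementation (which is an R package). The function names, the peak identifiers, and the intensity values are hypothetical. The sketch also assumes that RSD is computed from the sample standard deviation and that peaks exactly at the cut-off are kept; the abstract specifies neither.

```python
from statistics import mean, stdev


def rsd_percent(intensities):
    """Relative standard deviation (sample SD / mean), as a percentage."""
    m = mean(intensities)
    if m == 0:
        # Assumption: treat a zero mean as unbounded variability.
        return float("inf")
    return 100.0 * stdev(intensities) / m


def filter_by_rsd(qc_peaks, cutoff=30.0):
    """Keep peaks whose RSD across pooled QC injections is <= cutoff.

    Assumption: the comparison is inclusive; the abstract does not say.
    """
    return {
        peak_id: values
        for peak_id, values in qc_peaks.items()
        if rsd_percent(values) <= cutoff
    }


# Hypothetical integrated peak areas from four pooled QC injections.
qc_peaks = {
    "peak_A": [100.0, 110.0, 90.0, 100.0],  # stable across injections
    "peak_B": [100.0, 300.0, 50.0, 150.0],  # highly variable
}

kept = filter_by_rsd(qc_peaks)
print(sorted(kept))  # prints ['peak_A']
```

A filter like this only measures reproducibility across QC injections. A peak can pass it and still be poorly integrated, which is why the abstract applies the classifier to peaks that survive the 30% RSD filter.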