
Deep Learning-Assisted Peak Curation for Large-Scale LC-MS Metabolomics.

Yoann Gloaguen1,2,3, Jennifer A Kirwan1,3, Dieter Beule2,3.   

Abstract

Available automated methods for peak detection in untargeted metabolomics suffer from poor precision. We present NeatMS, which uses machine learning based on a convolutional neural network to reduce the number and fraction of false peaks. NeatMS comes with a pre-trained model representing expert knowledge in the differentiation of true chemical signal from noise. Furthermore, it provides all necessary functions to easily train new models or improve existing ones by transfer learning. Thus, the tool improves peak curation and contributes to the robust and scalable analysis of large-scale experiments. We show how to integrate it into different liquid chromatography-mass spectrometry (LC-MS) analysis workflows, quantify its performance, and compare it to various other approaches. NeatMS is available as open source on GitHub under a permissive MIT license and is also provided as easy-to-install PyPI and Bioconda packages.


Year:  2022        PMID: 35290737      PMCID: PMC8969107          DOI: 10.1021/acs.analchem.1c02220

Source DB:  PubMed          Journal:  Anal Chem        ISSN: 0003-2700            Impact factor:   6.986


Introduction

Liquid chromatography–mass spectrometry (LC-MS) is a widely used method in untargeted metabolomics. The postacquisition raw data processing, which aims to detect compound-related peaks and distinguish them from noise signals, is still a major challenge. Noise is defined as the background level of signal, which surrounds the chemical signals of interest. It can be caused by random fluctuations in the electrical environment, although its definition is often expanded to include chemical signals, which are of low intensity and quality and poorly reproducible. Many algorithms and tools have been developed to automate the process of peak picking (e.g., XCMS,[1] MZmine2,[2] Optimus,[3] MS-DIAL[4]). Pipelines for the automatic LC-MS raw data processing usually consist of the following steps: definition of regions of interest (ROI), detection of chromatographic peaks, quantification of these peaks, peak matching or grouping for samples within the batch or analysis, and clustering of peaks belonging to the same compound. XCMS and MZmine2 (called MZmine in the rest of the manuscript) are the most widely used open-source software that perform all of these steps and provide the user with a table of peaks found in the spectra and their integral intensities for each sample. 
However, there is a tendency for peak picking software to overpick, i.e., to accept a large number of peaks that are either (i) noise, (ii) low-intensity, poorly reproducible peaks, or (iii) real peaks with incorrectly defined boundaries, thus creating a high number of false-positive peaks in the final dataset.[5] A previous detailed comparison of XCMS and MZmine demonstrated that the majority of peaks picked by both software packages were false positives.[6] Poor consistency between software tools is another major issue, with individual datasets showing overlaps of as little as 36% of the peaks picked by a single software tool.[6] Poor peak picking may obstruct or impede downstream analysis and biomedical interpretation of metabolomics studies, and thus some form of manual peak curation is still the norm. This also makes the analysis of large-scale studies extremely laborious and limits the reproducibility of analyses. Recent progress in machine learning (ML) algorithms[7] and the availability of affordable parallel processing hardware (GPUs) have sparked the application of deep learning methods to peak detection in both gas chromatography–mass spectrometry (GC-MS)[8,9] and LC-MS.[10,11] ML has also been used for intra- and interbatch corrections[12] in LC-MS. A recent review[13] discusses the latest advances in machine learning tools for LC-MS-based metabolomics. Peakonly[10] is a deep neural network-based peak picking, segmentation, and integration algorithm that attempts to achieve a higher precision than conventional peak picking algorithms. However, it falls short of these tools in sensitivity. DNN[11] introduces the idea of postprocessing results from conventional algorithms with a neural network-based peak quality classification for the special case of prealigned peaks. This is an important proof of concept, but it is not applicable to all study designs, and the required run time seems prohibitive for large-scale studies and routine use.

Methods

Here, we introduce NeatMS, which is designed to serve as an independent deep learning-based peak filter tool in existing analysis pipelines. It addresses the overpicking issue by automatically evaluating and classifying peaks based on quality. To achieve this, we introduce three peak quality classes (high, acceptable, and "poor quality or noise"—henceforth called noise; see Figure S1 for some examples) and provide a pre-trained neural network to allow for out-of-the-box usage. Transfer learning and complete re-training are also supported. NeatMS can be easily integrated into existing workflows (see Figure 1a); it uses the convolutional network architecture shown in Figure 1b.
Figure 1

(a) Integration of NeatMS (red) into an existing workflow (blue): samples are run through LC-MS data acquisition, raw data are processed using standard automated tools to extract peaks, and both raw data and extracted peaks are used as input for NeatMS, which assigns them to three classes: high quality, acceptable quality, or noise. The classification information can be used in the downstream analysis. NeatMS comes with a pre-trained network and includes all components for re-training and transfer learning. (b) Architecture of the neural network: NeatMS includes a two-dimensional (2D) convolutional base for feature extraction and a classifier made of two fully connected layers. This architecture was chosen due to its high performance in object detection and pattern recognition.[13] The max pooling layer between the convolutional layers reduces the data size in the retention time dimension. This enables a higher abstraction of the data, reduces the number of learned parameters, and thus helps to prevent overfitting as well as to reduce the computational training effort. The classifier uses a softmax function to produce three output values, which correspond to the peak classes.

NeatMS is designed to be easily integrated into existing workflows and is adaptable to different measurement protocols, instruments, and preprocessing tools. Therefore, NeatMS can be used in different ways. Most simply, it is used to classify the quality of an input dataset using an existing model. NeatMS also allows the creation of new classification models or the improvement of existing ones using training data. The training data can be generated with an integrated labeling tool. For all usages, the input data formats are the same, and processing always starts with a data preparation step.

LC-MS Data Acquisition

To evaluate the performance of NeatMS, we use two datasets with known chemical standards (CS). Dataset 1 consists of 20 quality control samples "QC 2" from the Biocrates P400 kit[14] containing 80 CSs at proprietary concentrations in a lyophilized plasma matrix. Dataset 2 is based on the Biocrates kit calibration sample "Cal 1", which is matrix-free and contains 41 of the CSs also present in dataset 1. Note, however, that the concentration of the CSs in "Cal 1" is substantially lower than that in "QC 2"; the exact concentration ratio is a trade secret of Biocrates. We created a serial dilution in water (1:1.2, 1:1.4, 1:1.6, 1:1.8, 1:2, 1:3, 1:5, 1:7.5, 1:10, 1:15, 1:20, 1:30, 1:50, and 1:100). Following dilution, we added 39 compounds (Biocrates internal standard mix) at the same concentration to each sample to act as chemical background. Each dilution was measured in triplicate. The objective of this dilution series was to quantitatively assess how NeatMS performs over a range of low peak intensities and near the limit of detection. All data were acquired following the Biocrates P400 kit standard protocol on an Agilent 1290 coupled with a Thermo Q Exactive instrument. Datasets 1 and 2 were preprocessed with MZmine and XCMS using the versions and parameters detailed in Supporting Information Table S1. For XCMS centWave, we applied IPO[15] for parameter optimization individually for each dataset; furthermore, we used the XCMS peak merging feature. For MZmine, we used the parameters recommended in the user documentation.

Data Preparation

Although the peak detection is performed by an external tool, workflow, or pipeline, the full signal is used for classification and is directly retrieved from the raw data. This prevents biases originating from data transformations applied by the different peak detection tools (baseline correction, smoothing, etc.). Therefore, the input data of the module consist of .csv formatted files describing the detected peaks and the raw sample files in mzML format. Other vendor-specific raw file formats can be converted into mzML format using the msconvert tool available in ProteoWizard.[16] The csv input files can be generated using standard preprocessing tools such as MZmine or XCMS. The position of the module within standard data processing workflows is illustrated in Figure 1a. The output of NeatMS is again in csv format and contains the information from the input csv as well as the peak classification and labeling generated by the software. Before any transformation, NeatMS excludes unacceptably narrow peaks from further processing by requiring a minimum scan number of 5 (configurable minimum scan number input filter). As the model evaluates the peak shapes and the quality of their extraction from the raw signal (e.g., peak boundaries), it is important to provide contextual information. This is performed by extracting a larger retention time (RT) window containing both the peak itself, as defined by the preprocessing tool, and the signal surrounding the peak (called the peak margin hereafter). With peak boundaries RT_start and RT_end, peak width w = RT_end − RT_start, and a margin of n peak widths on each side (n = 1 by default), the RT window of the signal to be extracted is [RT_start − n·w, RT_end + n·w]. A min–max normalization is then applied to the extracted signal, I′ = (I − I_min)/(I_max − I_min). The resulting signal is linearly interpolated to obtain a vector of size s (120 by default). The central s/(2n + 1) values (by default, the central 40 values) represent the peak signal, and the n × s/(2n + 1) values on each side (by default, 40 values on each side) represent the peak margins, as shown in Figure 2.
A second (binary) vector of length s is then created to describe whether an intensity value (single point) is part of the peak window (1) or the margin (0). The resulting data structure is a two-dimensional tensor, or matrix, of size 2 × s, as shown in Figure 2. Although n and s can be adjusted by the user, the pre-trained model provided with NeatMS has been trained using the default parameters and requires no adjustment when using this model.
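The extraction, normalization, and interpolation steps described above can be sketched as follows. This is a minimal NumPy illustration; the function name and the raw-signal inputs are hypothetical, and the actual NeatMS implementation may differ in detail.

```python
import numpy as np

def prepare_peak_tensor(rt, intensity, peak_start, peak_end, n=1, s=120):
    """Build a 2 x s input tensor for one detected peak (illustrative sketch).

    rt, intensity: raw chromatogram arrays (hypothetical inputs);
    peak_start / peak_end: peak boundaries in RT from the peak picker.
    """
    width = peak_end - peak_start
    # extend the window by n peak widths on each side (the peak margins)
    win_start = peak_start - n * width
    win_end = peak_end + n * width
    mask = (rt >= win_start) & (rt <= win_end)
    seg_rt, seg_i = rt[mask], intensity[mask]
    # min-max normalization of the extracted signal
    seg_i = (seg_i - seg_i.min()) / (seg_i.max() - seg_i.min())
    # linear interpolation onto a fixed-size grid of length s
    grid = np.linspace(win_start, win_end, s)
    I = np.interp(grid, seg_rt, seg_i)
    # binary window vector: 1 inside the peak, 0 in the margins
    W = ((grid >= peak_start) & (grid <= peak_end)).astype(float)
    return np.stack([I, W])  # shape (2, s)
```

With the defaults (n = 1, s = 120), the peak occupies the central 40 positions of the grid and each margin the 40 positions on either side, matching the s/(2n + 1) partition described above.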
Figure 2

Data structure of a single peak after preprocessing for neural network feeding; s: size of scaled intensity vector, n: margin width (as a fraction of peak width), I: scaled intensity vector of size s, W: binary window vector of size s (0 = margin, 1 = peak).

Convolutional Neural Network Architecture

The neural network (see Figure 1b) follows a generic convolutional network architecture comprising a convolutional base for feature extraction and a classifier made of fully connected layers. The convolutional architecture was selected due to its high performance in object detection and pattern recognition.[17] The convolutional base is composed of two convolutional layers with a max pooling layer between them. This operation halves the size of the data in the retention time dimension and enables a higher abstraction of the data to classify, which helps to prevent potential overfitting. This layer also reduces the number of parameters to learn and the computational cost. The classifier part of the network is made of two dense layers and produces three output values through a softmax activation function, corresponding to the three peak quality classes. The rectified linear unit (ReLU) activation function is used throughout the rest of the network. The convolutional layers use a stride of 1 with "same" padding. The kernel size and channel number for every layer are detailed in the figure.
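A minimal Keras sketch of such an architecture is shown below. The kernel sizes, channel counts, and dense-layer width are illustrative assumptions, since the exact values are given only in the original figure; the overall shape (two conv layers with a max pooling layer that halves the RT dimension, ReLU activations, "same" padding, stride 1, and a two-layer softmax classifier) follows the description above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(s=120, channels=8, dense_units=64):
    """Sketch of a NeatMS-like CNN; layer sizes are assumptions."""
    # input: the 2 x s tensor (intensity vector + binary window vector)
    inputs = keras.Input(shape=(2, s, 1))
    # convolutional base: two conv layers with max pooling between them
    x = layers.Conv2D(channels, (2, 3), strides=1, padding="same",
                      activation="relu")(inputs)
    # pooling halves the data in the retention time dimension only
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    x = layers.Conv2D(channels * 2, (2, 3), strides=1, padding="same",
                      activation="relu")(x)
    # classifier: two fully connected layers, softmax over three classes
    x = layers.Flatten()(x)
    x = layers.Dense(dense_units, activation="relu")(x)
    outputs = layers.Dense(3, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```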

Training

Our NeatMS analysis first evaluates the pre-trained (PT) model. This initial model was trained by the authors on a wide range of peak shapes from different datasets. Additionally, we used the transfer learning approach to optimize the generic PT model to the specifics of dataset 1; we call the resulting second model the transfer learning (TL) model. The PT and TL models used in this work are provided in the Supporting Information. All components needed to create new models or use transfer learning to improve existing ones are part of the open-source NeatMS software. Thus, users can also retrain the system for their specific column/instrument/peak detection workflow combination if the provided PT model does not suffice. The creation of a new training dataset is facilitated by an interactive visualization and labeling tool that can be run within a Jupyter notebook.[18] Preparing a training dataset requires the user to be experienced with mass spectrometry data analysis and chromatographic peak evaluation. The decisions made by the user for the training data will be learned by the algorithm and thus must be consistent and trustworthy. This tool requires the same input as the main NeatMS tool and presents the user with randomly selected peaks for manual assignment to the three labeling classes. Typically, a few hundred peaks can be labeled within an hour. The PT model is based on about 5000 peaks, and the TL model was adjusted using about 2500 peaks. Once the desired number of peaks has been labeled, the model can be trained using one of two approaches (full training or transfer learning, see below). The labeled dataset is divided into an 80/10/10 training/test/validation split by default. Model testing and validation are always performed the same way regardless of the training approach chosen. The test set is used during the actual training process to prevent overfitting.
The validation set remains untouched during the entire training process and is subsequently used for hyperparameter optimization. This optimization can be performed automatically or manually; performing it manually allows more control over the specificity vs sensitivity trade-off of the model. Instructions for the manual process are given in the documentation. The training tool enables the freezing of any network layer, making it possible to select the specific layers in which weights should be adjusted. It is, however, considered better practice to adjust only the classifier part, by freezing the entire convolutional base, when the training set is very small. Guidance on layer selection is given in the advanced section of the documentation. This approach is especially important for transfer learning and enables fine-tuning of pre-trained models by further training specific layers.[19] The advantage of this approach over full training is that it requires a much smaller training dataset and thus less manual labeling effort. However, the tool also supports full training of entirely new models, i.e., using only the network architecture and training the model from scratch. Full training will produce the best results for the data being analyzed but requires a large training dataset. Instructions on how to import models are available in the documentation.
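In Keras terms, freezing the convolutional base while leaving the classifier trainable can be sketched as follows. The helper name, the layer-split convention (last two layers form the classifier), the optimizer, and the learning rate are assumptions for illustration, not the NeatMS API.

```python
from tensorflow import keras

def prepare_for_transfer_learning(model, n_classifier_layers=2):
    """Freeze the feature extractor so only the dense classifier is retrained.

    `model` is any Keras model shaped like the network above; the split
    index (last two layers = classifier) is an illustrative assumption.
    """
    for layer in model.layers[:-n_classifier_layers]:
        layer.trainable = False  # keep the learned feature extractor fixed
    for layer in model.layers[-n_classifier_layers:]:
        layer.trainable = True   # adjust only the fully connected classifier
    model.compile(optimizer=keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Fitting such a model on a few thousand labeled peaks then only updates the classifier weights, which is why transfer learning needs far less labeling effort than full training.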

Implementation

NeatMS is written in Python 3.6 and is available as a Python package through the PyPI package installer and Bioconda. The data handling and operations are performed using NumPy[20] and scikit-learn,[21] and the neural network is constructed using Keras[22] and Tensorflow.[23] As a Python package, the intended use of the module is to be embedded as an extra step within a data processing pipeline. The module can be integrated and automatically executed by any pipeline or workflow management tool capable of running Python code. However, it can also be used as a standalone application through a dedicated Python script or within a Jupyter notebook. Several Jupyter notebooks are provided for tutorial purposes and can serve as templates and examples. The generated results are reported in standard .csv format and can also be exported as pandas[24] dataframes for direct integration into Python-supported pipelines. The structure of the output can be controlled through a dedicated method to ensure smooth integration into the majority of data processing pipelines. Optional filters can also be turned on and parameterized. Details about the full usage of the export method are provided in the documentation.

Results and Discussion

Table 1 summarizes the number of peaks and class assignments from NeatMS for dataset 1 using both the PT and TL models. It shows that even without transfer learning, NeatMS delivers a useful improvement for existing MZmine workflows: it substantially reduces the number of peaks that need to be considered for downstream analysis. This facilitates, e.g., differential analysis either on only high-quality data or on high- and acceptable-quality data. For XCMS, the separation between the acceptable- and high-quality classes is less clean because the training set was exclusively based on peaks reported by MZmine. Both MZmine and XCMS users can start working with the PT model and will immediately improve their workflow performance, and they can create further improvements incrementally by training better models. Further discussion will focus on NeatMS using the MZmine-trained TL model. Results with XCMS could likely be improved further by creating an XCMS-specific trained model.
Table 1

NeatMS and Peakonly Comparison

                              NeatMS        NeatMS        NeatMS        NeatMS
                              MZmine        MZmine        XCMS          XCMS
                              TL model      PT model      TL model      PT model      peakonly

peak number
    input                     6977          6977          5994          5994          1907
    classified                5505          5505          4513          4513          not reported
    high quality              1069          2127          2280          2635          not reported
    acceptable quality        1945          1817          714           1088          not reported
    noise                     2491          1560          1519          791           not reported

CS found
    input and classified      94.25%        94.25%        97.13%        97.13%        79.44%
    high quality              79.31%        88.19%        91.44%        92.94%        not reported
    acceptable quality        11.25%        4.94%         2.13%         3.69%         not reported
    noise                     3.69%         1.13%         3.56%         0.50%         not reported

Tool and model comparison using dataset 1 showing average number of peaks found across 20 samples and average percentages of detected CS. The input row shows the results returned by the original peak detection tools (MZmine, XCMS), and other rows show the details of the three peak classes given by NeatMS. The total number of peaks after classification is smaller than the input due to the application of a minimum scan number filter that NeatMS uses (a default value of 5 is used).

Median relative standard deviations (RSDs) increase substantially from high-quality to noise peaks (Figure 3b), with acceptable-quality peaks falling in between. NeatMS quality class assignment differs strongly from conventional QC-RSD filtering methods because we also observe many noise peaks with low RSDs. This effect is present with XCMS and MZmine for both models (not shown). Figure 3d,e shows that the CSs are consistently found by NeatMS across various low dilution samples and generally tend to move from high to acceptable quality as dilution increases. Eventually, some fall into the NeatMS noise category, while most can no longer be detected by MZmine as the signal decreases with increasing dilution. Figure S2 (peak width distribution) shows that the noise class is dominated by rather broad peaks, while the high-quality class shows a consistent peak width distribution independent of the peak area. This indicates that our three classes represent a sensible quality classification for peaks. Figure 3a (ROC curves) shows that the learning itself was very successful: the model closely reproduces the expert knowledge of the trainer. Thus, NeatMS evaluation is comparable to, but much faster and much more reproducible and consistent than, human expert evaluation. NeatMS must be considered superior for large-scale studies with hundreds or thousands of samples and potentially several million peaks.
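The RSD and majority-class assignment used here can be sketched as follows (hypothetical helper names; RSD is computed as the sample standard deviation divided by the mean, in percent, and a feature is assigned its most frequently reported quality class only when detected in at least four samples):

```python
from collections import Counter

import numpy as np

def feature_rsd(intensities):
    """Relative standard deviation (%) of one feature across samples."""
    x = np.asarray(intensities, dtype=float)
    return 100.0 * x.std(ddof=1) / x.mean()

def majority_class(labels, min_samples=4):
    """Assign a feature to its most frequently reported quality class.

    Features detected in fewer than `min_samples` samples are dropped
    (returns None), mirroring the four-sample filter described above.
    """
    if len(labels) < min_samples:
        return None
    return Counter(labels).most_common(1)[0][0]
```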
By including training and transfer learning functionality into our solution, we empower researchers to adapt and optimize the learned classification and filtering to their specific data, needs, and preferences.
Figure 3

NeatMS results for dataset 1 (a–c) and dataset 2 (d, e) based on the TL model and MZmine output. (a) ROC curves showing learning efficacy for three different group separations. The curves are created using the validation set of the TL model learning; the high-quality class probability returned by the model is used as the varying parameter. (b) Box plots of RSDs: high-quality peaks show lower individual RSDs and reduced within-class RSD variability. Each feature for which peaks were present in at least four samples was assigned a quality based on the most frequently reported class. (c) Venn diagram comparing peakonly and NeatMS. Numbers show averages over the 20 samples: total number of detected peaks (black), percent of recovered CSs (red). On average, NeatMS reports 63.5 CSs as high quality, i.e., recovers 79% of the original; another 11% are considered acceptable quality. Note that matching peaks derived by different algorithms is challenging in itself and not completely unambiguous; cf. the Data Analysis Details section. (d) Classification of individual CSs over different dilutions and for different tools; the NeatMS high-quality class outperforms peakonly for most dilutions. A substantial number of additional CSs are categorized in the "acceptable quality" class. (e) Sankey diagram showing the distribution of the 41 diluted CSs for all three replicates in different quality classes and their change between dilution steps. Each dilution step is represented by a stacked barplot; the widths of the flows between bar plots (dilutions) represent the fraction of CSs going from one class to another.

Data Analysis Details

Our validation with dataset 1 focuses on the number of known spiked chemical standards found by the different tools. Peak identification was performed using compound-specific RT and m/z tolerance windows provided with the Biocrates kit. To compute the final results, percentages were calculated on the basis that all 80 compounds are detectable in all 20 samples. The same approach was used to analyze the recovery of CSs in dataset 2. These samples contain the same 80 CSs as dataset 1, but only 41 of the CSs were diluted, while 39 were used as internal standards and thus were at the same concentration throughout (Figure S3b). The Sankey diagram (Figure 3e) was created by comparing the peak classes of the CSs for all three replicates at consecutive dilution points to generate migration flows. A class corresponding to nondetected CSs was added to conserve an even CS population size throughout. To create the Venn diagrams (Figures 3c and S5), peaks reported by one tool were compared to the list of peaks reported by the other tool and considered the same when the two peaks presented a mutual overlap higher than 50% in both the RT and m/z dimensions. However, tools can differ widely in the peak boundary assignment for the same peak. Therefore, any peak matching method will remain imperfect and ambiguous. This explains the incomplete overlap of peaks found by NeatMS compared to peakonly. To generate the RSD results in Figure 3b, we used peak alignment (using the MZmine join aligner algorithm) and assigned the features to the most frequently found quality class across the 20 samples. Features were retained only when present in a minimum of four samples.
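The 50% mutual-overlap criterion can be sketched as follows; the helper names and peak dictionary keys are hypothetical, and "mutual" overlap is taken as the smaller of the two coverage fractions so that both peaks must be covered above the threshold.

```python
def mutual_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two fractions of each interval covered by their
    intersection; 0.0 when the intervals do not overlap."""
    inter = min(a_end, b_end) - max(a_start, b_start)
    if inter <= 0:
        return 0.0
    return min(inter / (a_end - a_start), inter / (b_end - b_start))

def same_peak(p, q, threshold=0.5):
    """Two peaks match when the mutual overlap exceeds the threshold in
    both the RT and m/z dimensions (sketch; dict keys are hypothetical)."""
    return (mutual_overlap(p["rt_start"], p["rt_end"],
                           q["rt_start"], q["rt_end"]) > threshold
            and mutual_overlap(p["mz_start"], p["mz_end"],
                               q["mz_start"], q["mz_end"]) > threshold)
```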

Comparison with Peakonly

Table 1 also shows a comparison of NeatMS to results obtained with the recently published peakonly tool,[10] which also applies machine learning to raw data to detect high-quality peaks. We chose the peakonly parameters as suggested by the authors and also tried to optimize them further; see Figure S4. Unfortunately, it was not possible to apply transfer learning or any re-training for peakonly because the software does not provide the necessary components. Figure 3d shows that the NeatMS high-quality class equals or outperforms peakonly in CS peak recognition for all dilutions. Additional CSs are classified as acceptable. A more detailed comparison for dataset 1 is shown in Figure 3c (for XCMS, we find comparable results; see Figure S5). The NeatMS high- and acceptable-quality classes together contain on average more than 90% of the CSs. Peakonly reports an average of 1907 peaks, containing on average 79% of the CSs. Approximately the same percentage of CSs is contained in the 1069 high-quality peaks reported by NeatMS. The concordance is, however, not perfect. Only rarely do we miss CSs that peakonly reports (0.7%). A small portion of the CS-matched signals are considered noise by NeatMS. Upon visual inspection, our expert usually agrees with the NeatMS algorithm (see Figure S6).

Comparison with DNN and MetaClean

In addition to the comparison with peakonly,[10] we also compared NeatMS with a deep neural network-based peak filtering tool developed by Kantz et al.[11] (referred to as the DNN tool hereafter) using their default parameters. In contrast to peakonly and NeatMS, DNN uses a peak position imputation/gap filling approach, i.e., positions that contain peaks in a certain minimal number of samples are evaluated and quantified in all samples. Similar functionality is also optionally provided by MZmine; thus, NeatMS can also support this approach, although it makes an unbiased evaluation of tool performance more difficult. This becomes very obvious in dataset 2, where we do not want to use information from the highly concentrated samples to infer algorithm performance on the lowly concentrated samples. Figure 4a visualizes the comparison of NeatMS and DNN for dataset 1, and the full confusion matrices are shown in Supporting Information Tables S4 and S5. Note that gap filling does not substantially alter NeatMS results; cf. Table S3. Overall, DNN finds about 85% of the CSs in the "Good" category compared to about 91% by NeatMS. The results of the two tools are consistent, but NeatMS still outperforms DNN. Furthermore, processing with DNN required about 24 h, while NeatMS took less than 10 min on the same hardware. Similar differences applied in terms of storage needs. This is in line with the DNN authors' claim to provide an "important proof of concept" and our ambition to provide easy-to-use software. Users of all three tools—DNN, peakonly, or NeatMS—should be aware that, just as any other compound, contaminants may produce high-quality or low-quality peaks. Addressing contaminant-related issues is an important task but is outside the scope of this work and of tools like DNN and peakonly.
Figure 4

Venn diagrams comparing (a) DNN and NeatMS and (b) MetaClean and NeatMS, analogous to the peakonly comparison in Figure 3c, using dataset 1. Numbers shown are averages over 20 samples: total number of detected peaks (black), percent of recovered CSs (red). The total compounds percentage refers to the total recovered CSs. For clarity, the figure is limited to the MetaClean "Pass" and DNN "Good" classes; the full confusion matrices are given in Supporting Information Tables S4–S7.

Finally, we compared the performance of NeatMS with that of MetaClean,[25] which uses a machine learning-based classifier to reduce false-positive peak detection by assigning "Pass" or "Fail" labels to peaks aligned and grouped across samples. The tool is designed to be used with XCMS, and peaks must be aligned and gap-filled. Figure 4b visualizes the comparison of NeatMS and MetaClean for dataset 1, and the full confusion matrices are given in Supporting Information Tables S6 and S7. At least for this dataset, NeatMS recovers many more CSs (about 94%) compared to 36% by MetaClean. MetaClean's compute and storage resource requirements are comparable to those of NeatMS.

Conclusions

NeatMS outperforms existing tools for peak curation and requires neither large computing power nor long computing time: all data analysis described in this manuscript can be done on a standard laptop within minutes, and all described training can be done on a modern PC within a few hours. NeatMS is available as open-source software on GitHub under the permissive MIT license and is also provided as easy-to-install PyPI and Bioconda[26] packages. NeatMS comes with comprehensive user documentation and tutorials and, importantly, also contains an easy-to-use training tool, so users can create their own models or improve existing ones according to their specific needs. NeatMS supports standard input and output formats and is therefore easily added to existing workflows. It is thus compatible with many use cases and may help enable improved and reproducible data analysis for large LC-MS studies.
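Because NeatMS emits standard tabular output, downstream filtering can stay in plain Python. A hypothetical sketch assuming a CSV export; the column names and the label value used here are illustrative placeholders, not NeatMS's documented schema:

```python
import csv
import io

# Toy stand-in for a curated peak table; in practice this would be
# a file exported by the curation step. Columns are assumptions.
raw = io.StringIO(
    "sample,mz,rt,height,label\n"
    "QC01,180.0634,2.00,153000,High_quality\n"
    "QC01,522.3559,10.67,4100,Noise\n"
)

# Keep only peaks labelled as high quality for downstream analysis.
kept = [row for row in csv.DictReader(raw) if row["label"] == "High_quality"]
print(len(kept))  # 1
```

The same one-line filter slots into any workflow that already consumes CSV peak tables, which is the sense in which a standard output format makes the tool easy to integrate.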
References (17 in total; first 10 shown)

1.  XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification.

Authors:  Colin A Smith; Elizabeth J Want; Grace O'Maille; Ruben Abagyan; Gary Siuzdak
Journal:  Anal Chem       Date:  2006-02-01       Impact factor: 6.986

2.  Detailed Investigation and Comparison of the XCMS and MZmine 2 Chromatogram Construction and Chromatographic Peak Detection Methods for Preprocessing Mass Spectrometry Metabolomics Data.

Authors:  Owen D Myers; Susan J Sumner; Shuzhao Li; Stephen Barnes; Xiuxia Du
Journal:  Anal Chem       Date:  2017-08-17       Impact factor: 6.986

3.  One Step Forward for Reducing False Positive and False Negative Compound Identifications from Mass Spectrometry Metabolomics Data: New Algorithms for Constructing Extracted Ion Chromatograms and Detecting Chromatographic Peaks.

Authors:  Owen D Myers; Susan J Sumner; Shuzhao Li; Stephen Barnes; Xiuxia Du
Journal:  Anal Chem       Date:  2017-08-17       Impact factor: 6.986

4.  Bioconda: sustainable and comprehensive software distribution for the life sciences.

Authors:  Björn Grüning; Ryan Dale; Andreas Sjödin; Brad A Chapman; Jillian Rowe; Christopher H Tomkins-Tinch; Renan Valieris; Johannes Köster
Journal:  Nat Methods       Date:  2018-07       Impact factor: 28.547

5.  NormAE: Deep Adversarial Learning Model to Remove Batch Effects in Liquid Chromatography Mass Spectrometry-Based Metabolomics Data.

Authors:  Zhiwei Rong; Qilong Tan; Lei Cao; Liuchao Zhang; Kui Deng; Yue Huang; Zheng-Jiang Zhu; Zhenzi Li; Kang Li
Journal:  Anal Chem       Date:  2020-03-24       Impact factor: 6.986

6.  MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC-MS metabolomics data.

Authors:  Kelsey Chetnik; Lauren Petrick; Gaurav Pandey
Journal:  Metabolomics       Date:  2020-10-21       Impact factor: 4.290

7.  IPO: a tool for automated optimization of XCMS parameters.

Authors:  Gunnar Libiseller; Michaela Dvorzak; Ulrike Kleb; Edgar Gander; Tobias Eisenberg; Frank Madeo; Steffen Neumann; Gert Trausinger; Frank Sinner; Thomas Pieber; Christoph Magnes
Journal:  BMC Bioinformatics       Date:  2015-04-16       Impact factor: 3.169

8.  MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis.

Authors:  Hiroshi Tsugawa; Tomas Cajka; Tobias Kind; Yan Ma; Brendan Higgins; Kazutaka Ikeda; Mitsuhiro Kanazawa; Jean VanderGheynst; Oliver Fiehn; Masanori Arita
Journal:  Nat Methods       Date:  2015-05-04       Impact factor: 28.547

9.  WiPP: Workflow for Improved Peak Picking for Gas Chromatography-Mass Spectrometry (GC-MS) Data.

Authors:  Nico Borgsmüller; Yoann Gloaguen; Tobias Opialla; Eric Blanc; Emilie Sicard; Anne-Lise Royer; Bruno Le Bizec; Stéphanie Durand; Carole Migné; Mélanie Pétéra; Estelle Pujos-Guillot; Franck Giacomoni; Yann Guitton; Dieter Beule; Jennifer Kirwan
Journal:  Metabolites       Date:  2019-08-21

10.  Recurrent Topics in Mass Spectrometry-Based Metabolomics and Lipidomics-Standardization, Coverage, and Throughput.

Authors:  Evelyn Rampler; Yasin El Abiead; Harald Schoeny; Mate Rusz; Felina Hildebrand; Veronika Fitz; Gunda Koellensperger
Journal:  Anal Chem       Date:  2020-11-28       Impact factor: 6.986

