Francis Brochu, Pier-Luc Plante, Alexandre Drouin, Dominic Gagnon, Dave Richard, Francine Durocher, Caroline Diorio, Mario Marchand, Jacques Corbeil, François Laviolette.
Abstract
Mass spectrometry is a valued method for evaluating the metabolomic content of a biological sample. The recent advent of rapid ionization technologies such as Laser Diode Thermal Desorption (LDTD) and Direct Analysis in Real Time (DART) has made high-throughput mass spectrometry possible, enabling large-scale comparative analysis of populations of samples. In practice, many factors arising from the environment, the protocol, and even the instrument itself can lead to minor discrepancies between spectra, rendering automated comparative analysis difficult. In this work, a pipeline of algorithms to correct variations between spectra is proposed. The algorithms correct multiple spectra by identifying peaks that are common to all and, from those, computing a spectrum-specific correction. We show that these algorithms increase comparability within large datasets of spectra, facilitating comparative analyses such as machine learning.
Year: 2019 PMID: 31186508 PMCID: PMC6560045 DOI: 10.1038/s41598-019-44923-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Definition of the window size for the detection of VLM peaks. The peaks labelled 1, 2, and 3 are presumed to originate from three different spectra. Window size w1 correctly detects four VLM groups. Window size w2, however, is too wide and will detect ambiguous and erroneous groups; moreover, it will detect several overlapping VLM groups.
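The window criterion in the caption can be sketched in a few lines: a candidate VLM group is accepted only if a window of width w contains exactly one peak from every spectrum. The following is an illustrative sketch with our own naming and simplifications, not the authors' implementation:

```python
from bisect import bisect_left, bisect_right

def vlm_groups(spectra, w):
    """Naive VLM candidate detection: a window of width w centred on a
    peak must contain exactly one peak from every spectrum.
    `spectra` is a list of sorted m/z lists, one per spectrum."""
    candidates = []
    for mz in spectra[0]:                      # seed windows on the first spectrum
        lo, hi = mz - w / 2, mz + w / 2
        group = []
        for peaks in spectra:
            i, j = bisect_left(peaks, lo), bisect_right(peaks, hi)
            if j - i != 1:                     # not exactly one peak in the window
                group = None
                break
            group.append(peaks[i])
        if group is not None:
            candidates.append(group)
    return candidates
```

With three spectra of two tightly clustered peaks each and w = 0.01 Da, both clusters would be accepted as VLM candidates; widening w until neighbouring clusters fall into the same window reproduces the ambiguity illustrated by w2.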
Figure 2. Error in ppm versus mass units. Subfigure (A) shows the error on left-out VLMs in ppm, while Subfigure (B) shows the error in Daltons. These data were acquired on the Days Dataset.
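The two units in the figure are related by a fixed scaling: the ppm error is the Dalton error divided by the reference mass, multiplied by 10^6, so a given ppm tolerance corresponds to a wider absolute window at higher m/z. A minimal helper (hypothetical names):

```python
def error_ppm(observed_mz, reference_mz):
    """Mass error in parts per million (ppm) relative to a reference m/z."""
    return 1e6 * (observed_mz - reference_mz) / reference_mz

def error_da(observed_mz, reference_mz):
    """Absolute mass error in Daltons."""
    return observed_mz - reference_mz
```

For example, an absolute error of 0.001 Da at m/z 500 corresponds to 2 ppm, while the same 0.001 Da at m/z 100 corresponds to 10 ppm.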
Figure 3. Workflow of the VLM and alignment algorithms. First, VLM points are detected in the original spectra of the dataset and the VLM correction is applied. The alignment algorithm is then applied to the corrected spectra in order to obtain the alignment points. The representation of a given spectrum is the subset of its peaks that fall within a mass window of an alignment point, with intensities unmodified.
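One way to realize the correction step of this workflow, sketched under our own assumption of a piecewise-linear warp between each spectrum's VLM peaks and the group consensus masses (the authors' exact correction may differ):

```python
import numpy as np

def correct_spectrum(peaks, vlm_observed, vlm_consensus):
    """Shift every peak of one spectrum by interpolating the offsets
    observed at that spectrum's VLM peaks (piecewise-linear sketch).

    peaks         : sorted array of this spectrum's peak m/z values
    vlm_observed  : this spectrum's peak m/z at each VLM group
    vlm_consensus : consensus (e.g. mean) m/z of each VLM group
    """
    offsets = np.asarray(vlm_consensus) - np.asarray(vlm_observed)
    # np.interp holds the first/last offset constant outside the VLM range
    return np.asarray(peaks) + np.interp(peaks, vlm_observed, offsets)
```

Peaks located exactly at a VLM are mapped onto the consensus mass, while peaks between two VLMs receive a blended offset, so the correction is spectrum-specific, as described in the abstract.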
Figure 4. Learning curves of Virtual Lock Mass detection and correction. Subfigures (A–C) show the learning curves for three different datasets ((A) Days, (B) Clomiphene-Acetaminophen and (C) Malaria). Subfigure (D) shows the Root Mean Square Error (RMSE) of the VLM correction for these datasets on an unseen test set. This test set consisted of 25 randomly selected samples from the datasets, which were kept separate. The experiments were replicated 50 times and the results averaged.
Figure 5. Loss per peak in different m/z ranges of the spectra. Each boxplot represents the RMSE of the peaks in a given region (50–150 in (A), 150–250 in (B), 250–350 in (C) and greater than 350 in (D)). Shown here are the results for the Days Dataset, in increasing order of training spectra, from 10 to 150. Outliers are shown as ticks above each box.
Figure 6. Transductive and inductive workflows. (A) The transductive workflow, in which all spectra are corrected at once, before the data are partitioned into training and testing sets. (B) The inductive workflow, in which the data are first partitioned and only the spectra in the training set are used to learn a transformation that is then applied to all spectra. The dotted blue arrows show where the algorithms are applied to unseen data, while the solid black arrows show the flow of the training data. Thus, in the inductive workflow, the test set consists of unseen data that is used only for the final evaluation of the model. In the transductive case, some information is drawn from every sample, even though the learning itself still sets aside a test set on which the algorithm does not train.
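The difference between the two workflows can be illustrated with any fit/transform-style preprocessor standing in for the VLM and alignment algorithms (toy example, our own naming):

```python
import numpy as np

class MeanCenterer:
    """Toy fit/transform preprocessor standing in for VLM correction."""
    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        return self

    def transform(self, X):
        return X - self.mean_

X = np.array([[1.0], [2.0], [3.0], [10.0]])
train, test = X[:3], X[3:]

# Transductive: the preprocessor sees every sample, test set included.
transductive = MeanCenterer().fit(X).transform(test)

# Inductive: the preprocessor is fit on the training split only,
# then applied unchanged to the unseen test samples.
inductive = MeanCenterer().fit(train).transform(test)
```

The two settings generally produce different representations of the same test spectra, which is why the paper evaluates both: the inductive protocol is the stricter estimate of performance on genuinely unseen data.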
Machine learning results in the transductive setting.
| Condition | AdaBoost | Decision Tree | SCM | L1-SVM |
|---|---|---|---|---|
| | | | | |
| Binning only | 98.0% (4.7) | 98.6% (1.8) | 95.2% (1.1) | 89.6% (52.0) |
| VLM + Binning | 98.2% (4.9) | 97.0% (2.3) | 97.0% (1.2) | |
| VLM + Alignment | 92.8% (138.6) | |||
| | | | | |
| Binning only | 99.2% (1.0) | 99.2% ( | 99.2% (1.2) | 97.6% (97.5) |
| VLM + Binning | 99.2% ( | 99.2% ( | 99.0% (121.0) | |
| VLM + Alignment | ||||
| | | | | |
| Binning only | 92.4% (51.8) | 82.5% ( | 84.6% (2.2) | 92.6% (150.1) |
| VLM + Binning | 93.3% ( | |||
| VLM + Alignment | 86.1% (4.8) | 85.4% (2.3) | 95.2% ( | |
| | | | | |
| Binning only | 55.6% ( | 56.8% ( | | |
| VLM + Binning | 70.2% (43.9) | 61.6% (4.8) | 53.6% (2.2) | 69.4% (138.6) |
| VLM + Alignment | 67.4% ( | 62.6% ( | ||
The percentage in each column is the average accuracy of the classifiers over 10 repeats of the experiment. The number shown in parentheses is the average number of features used by the classifiers. The algorithms tested were AdaBoost, the Decision Tree algorithm, the Set Covering Machine (SCM) and an L1-norm Support Vector Machine (L1-SVM).
Comparison of transductive and inductive learning of the VLM and Alignment algorithms.
| Condition | AdaBoost | Decision Tree | SCM | L1-SVM |
|---|---|---|---|---|
| | | | | |
| Transductive | 98.8% (2.3) | 99.4% (1.0) | 92.8% (138.6) | |
| Inductive | ||||
| | | | | |
| Transductive | 99.8% ( | 99.4% ( | ||
| Inductive | 99.2% ( | 98.6% ( | ||
| | | | | |
| Transductive | 86.1% (4.8) | |||
| Inductive | 92.9% ( | 84.2% ( | 95.1% (151.0) | |
| | | | | |
| Transductive | 67.4% ( | |||
| Inductive | 61.2% (6.7) | 57.4% ( | 68.2% (145.4) | |
The algorithms tested were AdaBoost, the Decision Tree algorithm, the Set Covering Machine (SCM) and an L1-norm Support Vector Machine (L1-SVM).