Céline Brouard, Antoine Bassé, Florence d'Alché-Buc, Juho Rousu.
Abstract
In small molecule identification from tandem mass (MS/MS) spectra, input-output kernel regression (IOKR) currently provides the state-of-the-art combination of fast training and prediction and high identification rates. The IOKR approach can be understood as predicting a fingerprint vector from the MS/MS spectrum of the unknown molecule and then solving a pre-image problem to find the molecule with the most similar fingerprint. In this paper, we bring forward two improvements to the IOKR framework. First, we formulate the IOKRreverse model, which can be understood as mapping molecular structures into the MS/MS feature space and solving a pre-image problem to find the molecule whose predicted spectrum is closest to the input MS/MS spectrum. Second, we introduce IOKRfusion, an approach that combines several IOKR and IOKRreverse models computed from different input and output kernels. The method minimizes the structured hinge loss of the combined model using mini-batch stochastic subgradient optimization. Our experiments show a consistent improvement of top-k accuracy in both positive and negative ionization mode data.
Keywords: kernel methods; machine learning; metabolite identification; structured prediction
Year: 2019 PMID: 31374904 PMCID: PMC6724104 DOI: 10.3390/metabo9080160
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Figure 1. Schematic illustration of the IOKR and IOKRreverse approaches. IOKR learns a function h to map MS/MS spectra to a molecular feature space, whereas IOKRreverse learns a function g to map the molecular structures to an MS/MS feature space.
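The IOKR direction described above (predict a point in the molecular feature space, then pick the closest candidate) can be sketched with kernel ridge regression. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_iokr` and `rank_candidates`, the regularization value `lam`, the linear kernels, and the toy arrays are all hypothetical. The candidate distance uses the identity ||h(x) - psi(c)||^2 = const - 2<h(x), psi(c)> + k_y(c, c), so only kernel values are needed.

```python
import numpy as np

def fit_iokr(K_x, lam):
    """Return C = (K_x + lam * I)^{-1} for an n x n input Gram matrix K_x."""
    n = K_x.shape[0]
    return np.linalg.inv(K_x + lam * np.eye(n))

def rank_candidates(C, k_x_new, K_y_cand_train, k_y_cand_self):
    """Rank candidates by distance to the predicted output feature vector.

    k_x_new        : (n,)   input kernel between training spectra and the new spectrum
    K_y_cand_train : (m, n) output kernel between candidates and training molecules
    k_y_cand_self  : (m,)   output kernel of each candidate with itself
    """
    # <h(x), psi(c)> for every candidate c, where h(x) = sum_i psi(y_i) [C k_x]_i
    inner = K_y_cand_train @ (C @ k_x_new)
    # Squared distance up to a constant that does not depend on the candidate
    dist = k_y_cand_self - 2.0 * inner
    return np.argsort(dist)  # best candidate first

# Toy data (hypothetical): 3 training spectra with one-hot features,
# 3 training fingerprints, and 3 candidate fingerprints.
X_train = np.eye(3)
Y_train = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]], dtype=float)
candidates = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 1, 1]], dtype=float)

C = fit_iokr(X_train @ X_train.T, lam=1e-3)
k_x_new = X_train @ X_train[1]  # the new spectrum equals training spectrum 1
ranking = rank_candidates(
    C,
    k_x_new,
    candidates @ Y_train.T,
    np.sum(candidates ** 2, axis=1),
)
print(ranking)
```

In this toy setting the candidate whose fingerprint matches training molecule 1 is ranked first. Under the same assumptions, an IOKRreverse-style sketch would swap the roles of the two kernels and compare predicted spectrum features of each candidate with the observed spectrum.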
Description of the input kernels (see [34] for further details).
| Abbreviation | Name | Description |
|---|---|---|
| LI | Loss intensity | counts the number of common losses weighted by the intensity |
| RLB | Root loss binary | counts the number of common losses from the root to some node |
| RLI | Root loss intensity | weighted variant of RLB that uses the intensity of terminal nodes |
| JLB | Joined loss binary | counts the number of common joined losses |
| LPC | Loss pair counter | counts the number of two consecutive losses within the tree |
| MLIP | Maximum loss in path | counts the maximum frequencies of each molecular formula in any path |
| NB | Node binary | counts the number of nodes with the same molecular formula |
| NI | Node intensity | weighted variant of NB that uses the intensity of nodes |
| NLI | Node loss interaction | counts common paths and weights them by comparing the molecular formula of their terminal fragments |
| SLL | Substructure in losses and leafs | counts for different molecular formula in how many paths they are conserved (part of all nodes) or cleaved off intact (part of a loss) |
| NSF | Node subformula | considers a set of molecular formula |
| NSF3 | | takes the value of NSF to the power of three |
| GJLSF | Generalized joined loss subformula | counts how often each molecular formula from |
| RDBE | Ring double-bond equivalent | compares the distribution of ring double-bond equivalent values between two trees |
| PPKr | Recalibrated probability product kernel | computes the probability product kernel on preprocessed spectra |
Figure 2. Heatmap of the top-1 accuracy obtained with IOKR and IOKRreverse for different input and output kernels in the negative ionization mode (a) and the positive ionization mode (b). The rows correspond to the different input kernels while the columns indicate the output kernels.
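The output kernels compared here include Tanimoto and Gaussian Tanimoto kernels on molecular fingerprints. As a rough sketch, assuming binary fingerprints, the Tanimoto kernel is the ratio of shared to total set bits, and a Gaussian Tanimoto kernel can be taken as a Gaussian of the distance induced by the Tanimoto kernel. The function names, the `gamma` parameter, and the convention for two all-zero fingerprints are assumptions made for illustration.

```python
import math

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints of equal length."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two all-zero fingerprints are treated as identical
    return inter / union if union else 1.0

def gaussian_tanimoto(a, b, gamma=1.0):
    """Gaussian kernel on the distance induced by the Tanimoto kernel.

    With k(a, a) = k(b, b) = 1, the squared distance is 2 - 2 * k(a, b).
    """
    sq_dist = 2.0 - 2.0 * tanimoto(a, b)
    return math.exp(-gamma * sq_dist)

fp_a = [1, 1, 0, 0]
fp_b = [1, 0, 1, 0]
print(tanimoto(fp_a, fp_b), gaussian_tanimoto(fp_a, fp_b, gamma=0.5))
```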
Figure 3. Heatmap of the weights learned during the score aggregation in the negative ionization mode (a) and the positive ionization mode (b). The weights have been averaged over the five cross-validation folds.
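The score aggregation can be illustrated with a small sketch of mini-batch stochastic subgradient descent on a structured hinge loss, where each candidate has one score per model and the combined score is a weighted sum. The details below are assumptions rather than the paper's settings: a 0/1 margin between the true and wrong candidates, an L2 penalty `reg`, a fixed learning rate `lr`, uniform initial weights, and the hypothetical function names and toy scores.

```python
import numpy as np

def hinge_subgradient(w, S, t):
    """Structured hinge loss and a subgradient for one example.

    S : (m, P) scores of m candidates under P models; t : index of the true candidate.
    """
    augmented = S @ w + 1.0   # combined score plus a margin of 1 (assumed 0/1 margin)
    augmented[t] -= 1.0       # no margin for the true candidate itself
    c_star = int(np.argmax(augmented))
    loss = augmented[c_star] - S[t] @ w
    return loss, S[c_star] - S[t]

def fit_fusion_weights(examples, n_models, epochs=50, batch_size=2,
                       lr=0.1, reg=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    w = np.full(n_models, 1.0 / n_models)  # start from a uniform combination
    for _ in range(epochs):
        order = rng.permutation(len(examples))
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad = reg * w  # gradient of the assumed L2 penalty
            for i in batch:
                S, t = examples[i]
                _, g = hinge_subgradient(w, S, t)
                grad = grad + g / len(batch)
            w = w - lr * grad
    return w

# Toy scores (hypothetical): model 0 favors the true candidate, model 1 does not.
examples = [
    (np.array([[0.9, 0.2], [0.1, 0.8], [0.3, 0.5]]), 0),
    (np.array([[0.2, 0.9], [0.8, 0.1], [0.4, 0.6]]), 1),
    (np.array([[0.1, 0.7], [0.3, 0.6], [0.95, 0.1]]), 2),
    (np.array([[0.85, 0.3], [0.2, 0.9], [0.1, 0.4]]), 0),
]
w = fit_fusion_weights(examples, n_models=2)
print(w)
```

On this toy data the informative model ends up with the larger weight, which is the kind of pattern a weight heatmap would make visible across many kernel combinations.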
Comparison of the top-k accuracy between CSI:FingerID, IOKR Unimkl and IOKRfusion in the negative and positive ionization modes. The highest top-k accuracies are shown in boldface.
| Method | Negative mode, Top-1 | Negative mode, Top-5 | Negative mode, Top-10 | Positive mode, Top-1 | Positive mode, Top-5 | Positive mode, Top-10 |
|---|---|---|---|---|---|---|
| CSI:FingerID | 31.9 | 60.2 | 69.9 | 36.0 | 67.5 | 76.5 |
| IOKR Unimkl - Linear | 30.1 | 58.8 | 68.6 | 34.9 | 66.9 | 76.0 |
| IOKR Unimkl - Tanimoto | 31.0 | 60.0 | 69.7 | 35.2 | 67.6 | 76.5 |
| IOKR Unimkl - Gaussian | 31.0 | 60.3 | 69.6 | 35.0 | 67.7 | 76.3 |
| IOKR Unimkl - Gaussian Tanimoto | 30.9 | 61.0 | 70.5 | 33.9 | 66.5 | 75.2 |
| IOKRfusion - only IOKR scores | 28.4 | 57.0 | 67.2 | 33.5 | 64.4 | 73.4 |
| IOKRfusion - only IOKRreverse scores | 30.1 | 60.4 | 71.4 | 37.6 | 69.2 | 77.9 |
| IOKRfusion - all scores | | | | | | |
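For reference, a top-k accuracy of the kind reported in the table above can be computed from per-spectrum candidate scores as sketched below. The function name, the percentage scaling, and the tie handling (ties are resolved in favor of the true candidate) are assumptions made for illustration; the evaluation protocol behind the reported numbers may differ.

```python
def top_k_accuracy(score_lists, true_indices, k):
    """Percentage of spectra whose true candidate is ranked within the top k.

    score_lists  : one list of candidate scores per spectrum (higher is better)
    true_indices : index of the correct candidate in each list
    """
    hits = 0
    for scores, t in zip(score_lists, true_indices):
        # Rank of the true candidate: 1 + number of strictly better candidates
        rank = 1 + sum(1 for s in scores if s > scores[t])
        if rank <= k:
            hits += 1
    return 100.0 * hits / len(score_lists)

# Toy scores (hypothetical) for three spectra
score_lists = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3], [0.1, 0.2, 0.7, 0.6]]
true_indices = [0, 2, 0]
for k in (1, 2, 5):
    print(k, top_k_accuracy(score_lists, true_indices, k))
```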
Figure 4. Plot of the top-k accuracy for IOKR Unimkl, CSI:FingerID and IOKRfusion in the negative ionization mode.
Figure 5. Plot of the top-k accuracy for IOKR Unimkl, CSI:FingerID and IOKRfusion in the positive ionization mode.
Running times for the training and the test steps.
| Method | Training Time | Test Time |
|---|---|---|
| IOKR - Linear | 0.85 s | 1 min 15 s |
| IOKR - Tanimoto | 3.9 s | 7 min 40 s |
| IOKR - Gaussian | 7.2 s | 8 min 38 s |
| IOKR - Gaussian Tanimoto | 7.6 s | 8 min 44 s |
| IOKRreverse - Linear | 3.9 s | 28 min 20 s |
| IOKRreverse - Tanimoto | 4.1 s | 33 min 57 s |
| IOKRreverse - Gaussian | 7.4 s | 34 min 49 s |
| IOKRreverse - Gaussian Tanimoto | 7.5 s | 35 min 4 s |
| IOKR Unimkl - Linear | 4.3 s | 1 min 10 s |
| IOKR Unimkl - Tanimoto | 8.7 s | 7 min 52 s |
| IOKR Unimkl - Gaussian | 11.7 s | 8 min 28 s |
| IOKR Unimkl - Gaussian Tanimoto | 11.9 s | 8 min 42 s |
| IOKRfusion | 3 min 3 s | 0.1 s |