Céline Brouard, Huibin Shen, Kai Dührkop, Florence d'Alché-Buc, Sebastian Böcker, Juho Rousu.
Abstract
MOTIVATION: An important problem in metabolomics is identifying metabolites from tandem mass spectrometry data. Machine learning methods have recently been proposed to solve this problem by predicting molecular fingerprint vectors and matching these fingerprints against existing molecular structure databases. In this work, we propose to address the metabolite identification problem using a structured output prediction approach. This type of approach is not limited to vector output spaces and can handle structured output spaces such as the space of molecules.
Year: 2016 PMID: 27307628 PMCID: PMC4908330 DOI: 10.1093/bioinformatics/btw246
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Notations used in the article
| Symbol | Explanation |
|---|---|
| | input, output sets |
| | elements of |
| | output scalar kernel |
| | output feature space |
| | output feature map |
| | input operator-valued kernel |
| | reproducing kernel Hilbert space of |
| | Gram matrix on training set |
| | input scalar kernel |
| | input feature space |
| | input feature map |
| | Gram matrix on training set |
Fig. 1. Overview of the IOKR framework for solving the metabolite identification problem. The mapping f between MS/MS spectra and 2D molecular structures is learnt by approximating the output feature map with a function h and then solving a preimage problem.
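The two phases named in the caption of Fig. 1 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the regression step is kernel ridge regression with a scalar input kernel, that the output kernel is normalized so the preimage step reduces to picking the candidate with the largest inner product with h(x), and that Gaussian kernels on toy vectors stand in for the spectrum and fingerprint kernels. The names `gaussian_kernel`, `iokr_scores`, `lam` and the toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def iokr_scores(K_x, k_x_test, K_y_cand, lam):
    """Score candidates for one test input (assumed ridge-regression form).

    K_x      : (n, n) input Gram matrix on the training set
    k_x_test : (n,)   input kernel values between training inputs and the test input
    K_y_cand : (m, n) output kernel values between m candidates and training outputs
    lam      : ridge regularization parameter (assumption)
    """
    n = K_x.shape[0]
    # Regression step: coefficients of h(x) in the span of the training output features.
    coeffs = np.linalg.solve(K_x + lam * np.eye(n), k_x_test)
    # Preimage step: inner product <psi(y), h(x)> for every candidate y.
    return K_y_cand @ coeffs

# Toy data (made up for illustration): 40 "spectra" as 5-d vectors, 16-bit "fingerprints".
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))
Y_train = rng.integers(0, 2, size=(40, 16)).astype(float)

# Sanity check on a training example: the candidate set holds the true fingerprint
# (index 2) and three decoys that each differ from it in a block of 4 bits.
x_test = X_train[0:1]
candidates = np.tile(Y_train[0], (4, 1))
for j in (0, 1, 3):
    block = slice(4 * j, 4 * j + 4)
    candidates[j, block] = 1.0 - candidates[j, block]

K_x = gaussian_kernel(X_train, X_train, gamma=1.0)
k_x_test = gaussian_kernel(X_train, x_test, gamma=1.0)[:, 0]
K_y_cand = gaussian_kernel(candidates, Y_train, gamma=0.5)

scores = iokr_scores(K_x, k_x_test, K_y_cand, lam=1e-4)
ranking = np.argsort(-scores)  # candidate indices, best first
```

In this sketch the candidates are ranked by score, so the first entry of `ranking` is the predicted structure; a real evaluation would use held-out spectra and candidate sets retrieved from a molecular structure database.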
Description of the input kernels used in this article
| Category | Name | Description | Reference |
|---|---|---|---|
| Loss-based kernels | Loss binary (LB) | counts the number of common losses | |
| | Loss intensity (LI) | weighted variant of LB that uses the intensity of terminal nodes | |
| | Loss count (LC) | counts the number of occurrences of the losses | |
| | Weighted loss count (LW) | weighted variant of LC using the inverse frequency of training losses | |
| | Root loss binary (RLB) | counts the number of common losses from the root to some node | |
| | Root loss intensity (RLI) | weighted variant of RLB that uses the intensity of terminal nodes | |
| | Loss intensity PP (LIPP) | probability product (PP) of shared losses | |
| Node-based kernels | Node binary (NB) | counts the number of nodes with the same molecular formula | |
| | Node intensity (NI) | weighted variant of NB that uses the intensity of nodes | |
| | Node subformula (NSF) | counts the number of common substructures | |
| | Fragment intensity PP (FIPP) | PP of shared fragments (nodes) | |
| Path-based kernels | Common paths counting (CPC) | counts the number of common paths (identical sequences of losses) | |
| | Common paths of length 2 (CP2) | counts the number of common paths of length 2 | |
| | Common paths of length at least 2 (CP2+) | counts the number of common paths of length at least 2 | |
| | Common paths with | the PPK | |
| | Common paths with | same as CPK1 with a different parameter | |
| | Common path joined binary (CPJB) | counts the number of paths for which the union of losses is equal | |
| | Common path joined (CPJ) | counts paths of length 2 that have the same loss | |
| | Weighted paths counting (WPC) | weighted variant of CPC that uses the inverse frequency of the losses | |
| Subtree kernel | Common subtree counting (CSC) | counts the number of subtrees with common structures and losses | |
| Fragmentation tree alignment kernels | TALIGN | Pearson correlation of alignment scores between fragmentation trees | |
| | TALIGND | variant of TALIGN that modifies the scoring function | |
| Probability product kernel | Recalibrated PPK (PPKr) | PPK computed on preprocessed spectra | |
| Other | Chemical element counting (CEC) | weighted counts of chemical elements | |
Fig. 2. An example of an MS/MS spectrum and its fragmentation tree. Each node of the fragmentation tree corresponds to a peak and is labeled with the molecular formula of the corresponding fragment. The root of the tree is labeled with the molecular formula of the unfragmented molecule. Edges represent the losses. Two nodes and one edge are colored to show the correspondence between the MS/MS spectrum and the fragmentation tree.
Comparison of the percentage of correctly identified structures in the top 1, top 10 and top 20 using FingerID, CSI:FingerID and IOKR
| Method | MKL | Top 1 | Top 10 | Top 20 |
|---|---|---|---|---|
| FingerID | none | 17.74 | 49.59 | 58.17 |
| CSI:FingerID unit | ALIGNF | 24.82 | 60.47 | 68.2 |
| CSI:FingerID mod Platt | ALIGNF | 28.84 | 66.07 | 73.07 |
| IOKR linear | ALIGNF | 28.54 | 65.77 | 73.19 |
| IOKR linear | UNIMKL | 30.02 | 66.05 | 73.66 |
| IOKR Gaussian | ALIGNF | 29.78 | 67.84 | 74.79 |
| IOKR Gaussian | UNIMKL | | | |
The highest values are shown in boldface.
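The top-k percentages reported in this kind of table can be computed from the rank that each method assigns to the correct structure within its candidate set. The sketch below assumes 1-based ranks and ignores ties; the function name `top_k_accuracy` is hypothetical and the ranks are made-up illustrative values, not results from the article.

```python
def top_k_accuracy(ranks, ks=(1, 10, 20)):
    """Percentage of test compounds whose correct structure is ranked within the top k.

    ranks : 1-based rank of the correct structure for each test compound
    ks    : cut-offs to report
    """
    n = len(ranks)
    return {k: 100.0 * sum(r <= k for r in ranks) / n for k in ks}

# Made-up ranks for eight hypothetical test compounds.
example_ranks = [1, 3, 12, 1, 25, 7, 10, 40]
acc = top_k_accuracy(example_ranks)
print(acc)  # {1: 25.0, 10: 62.5, 20: 75.0}
```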
Fig. 3. Difference, in percentage points, relative to the percentage of metabolites ranked lower than k with CSI:FingerID using the modified Platt scoring function.
Running time evaluation
| Method | Training time | Test time |
|---|---|---|
| CSI:FingerID | 82 h 28 min 23 s | 1 h 11 min 31 s |
| IOKR linear | ||
| IOKR polynomial | 21 min 58 s | |
| IOKR Gaussian | 33 min 15 s |
These running times were obtained by training the methods on the 4138 GNPS spectra and using 625 spectra from MassBank as the test set.
Fig. 4. Heatmap of the percentage of correctly identified metabolites (Top 1) with IOKR. The rows correspond to the different output kernels built on fingerprints (linear, polynomial and Gaussian) and the columns to the 24 input kernels derived from spectra and fragmentation trees, as well as the two multiple kernel combination schemes ALIGNF and UNIMKL.
Fig. 5. Heatmap of kernel weights learned by ALIGNF for all pairs of input and output kernels on the GNPS dataset. The weights have been averaged over the 10 CV folds.
Fig. 6. Metabolites identified with IOKR as a function of the size of the candidate sets. We considered the candidate sets of size smaller than 8000, which corresponds to 98.8% of the sets, and divided them into 30 bins according to their sizes. (a) indicates the number of test metabolites that have a candidate set size in the corresponding size bin. (b) shows the percentage of metabolites that are ranked in the top 1 position, top 10 or above for the test metabolites falling in each size bin.
Fig. 7. Scatter plot of classes in the ChEBI ontology with shortest paths of length 7 from the class chemical entity. The x-axis corresponds to the median number of candidates associated with the compounds in each class and the y-axis to the proportion of correct compounds with rank less than or equal to 10 for each class. The size of each point is proportional to the number of compounds in the GNPS dataset that belong to that class, and we only show classes with at least 10 compounds. The classes we can identify well are shown in red and the classes we cannot are shown in blue, with the ChEBI id and name next to them.