
Fast metabolite identification with Input Output Kernel Regression.

Céline Brouard¹, Huibin Shen¹, Kai Dührkop², Florence d'Alché-Buc³, Sebastian Böcker², Juho Rousu¹.

Abstract

MOTIVATION: An important problem in metabolomics is the identification of metabolites from tandem mass spectrometry data. Machine learning methods have recently been proposed to solve this problem by predicting molecular fingerprint vectors and matching these fingerprints against existing molecular structure databases. In this work we propose to address the metabolite identification problem using a structured output prediction approach. This type of approach is not limited to vector output spaces and can handle structured output spaces such as the molecule space.
RESULTS: We use the Input Output Kernel Regression method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode the similarities in the input (spectra) space and the similarities in the output (molecule) space using two kernel functions. This method approximates the spectra-molecule mapping in two phases. The first phase corresponds to a regression problem from the input space to the feature space associated with the output kernel. The second phase is a preimage problem, which consists of mapping the predicted output feature vectors back to the molecule space. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method decreases the running times of the training step and the test step by several orders of magnitude compared with the preceding methods.
CONTACT: celine.brouard@aalto.fi
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2016 PMID: 27307628 PMCID: PMC4908330 DOI: 10.1093/bioinformatics/btw246

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Metabolomics is the study of small molecules, called metabolites, and their interactions in the cell. An important problem in metabolomics is the identification of the metabolites present in a sample. Information on metabolites can be obtained using tandem mass spectrometry, which produces a tandem mass spectrum, also called an MS/MS spectrum, by fragmenting a compound. An MS/MS spectrum is a plot containing a set of peaks, where each peak corresponds to a fragment. These peaks represent the relative abundance of the different fragments, also called intensity, as a function of their mass-to-charge ratio. The metabolite must then be identified from its mass spectrum to allow a more detailed biological interpretation. In general, this step consists of searching for the obtained spectrum in databases of reference spectra, followed by an analysis by domain experts.

Computational approaches for interpreting and predicting MS/MS data of small molecules date back to the 1960s (Lindsay et al.). However, the early approaches were hampered by the unavailability of large-scale data on molecular structures as well as reference spectra. The introduction of molecular structure databases such as PubChem (Bolton et al.) as well as open mass spectral reference databases (da Silva et al.; Horai et al.) has in recent years fuelled the development of novel methods. Several novel strategies have been proposed, including simulation of mass spectra from molecular structure (Allen et al., 2015), combinatorial fragmentation (Heinonen et al.; Hill and Mortishire-Smith, 2005; Ridder et al.; Wang et al.; Wolf et al.) and prediction of molecular fingerprints (Heinonen et al.; Shen et al.). Methods based on machine learning (Allen et al., 2015; Dührkop et al.; Heinonen et al.; Shen et al., 2014) have been proposed very recently for learning a mapping between tandem mass spectra and metabolites. These methods fall into two general approaches.

The first group of methods (Dührkop et al.; Heinonen et al.; Shen et al., 2014) introduces an intermediate step in which molecular fingerprints of the metabolites are predicted from their mass spectra using Support Vector Machines (SVMs). Molecular fingerprints are a standard representation for molecules, used in cheminformatics and drug discovery. They are typically represented as binary vectors whose values indicate the presence or absence of some molecular properties, e.g. the existence of particular substructures in the metabolite or some physicochemical properties. If two molecules share a large number of molecular properties they are likely to be similar in structure, which is the rationale for using fingerprints in metabolite identification. To identify a metabolite, the fingerprint predicted from its tandem mass spectrum is matched against a large molecular database such as PubChem. In Shen et al. and Dührkop et al., fragmentation trees are computed to model the fragmentation process of the molecules and are then used for predicting the molecular fingerprints.

The other machine learning approach for metabolite identification, used by CFM-ID (Allen et al., 2015), also relies on a two-step scheme, where the first step consists of predicting the mass spectra of the candidate molecules by modeling their fragmentation processes. In the second step, the simulated spectra of the candidate molecules are compared with the spectrum of the test metabolite.

The goal of this work is to solve the metabolite identification problem in a single step, using a structured prediction method. Structured prediction methods exploit the dependencies that exist among complex outputs (e.g. the fingerprints of a molecule) to improve accuracy and to make predictions efficiently. In numerous applications, they have achieved better prediction performance than methods that predict parts of a structure independently. In the literature, two main structured prediction approaches can be distinguished. The first one models the dependencies between structured inputs and outputs using a joint feature map (Marchand et al.; Rousu et al.; Su and Rousu, 2015; Taskar et al.; Tsochantaridis et al.), and learns to discriminate the correct structure y for an input x from all incorrect output structures. The second one, called Output Kernel Regression, consists of learning a mapping between the input set and the feature space associated with some output kernel. A preimage problem, which consists of mapping the predicted output feature vectors back to the output space, is then solved. Existing Output Kernel Regression methods are Kernel Dependency Estimation (Cortes et al.; Kadri et al.; Weston et al.), Output Kernel Trees (Geurts et al.) and Input Output Kernel Regression (IOKR) (Brouard et al., 2015).

In this work, we show how to apply the IOKR framework to the metabolite identification problem. Our method reaches improved identification rates compared with the previous state of the art of Dührkop et al. More importantly, though, the IOKR framework results in vast improvements in running times: the method is one to two orders of magnitude faster in the prediction phase, and four orders of magnitude faster during training.

2 Methods

The main notations used in this article are summarized in Table 1. In the following, we denote by $\mathcal{X}$ the set of input tandem mass spectra, also known as MS/MS spectra, and by $\mathcal{Y}$ the set containing the 2D molecular structures corresponding to the spectra. We want to learn a function $f: \mathcal{X} \to \mathcal{Y}$ that maps an MS/MS spectrum $x \in \mathcal{X}$ to its corresponding molecular structure $y \in \mathcal{Y}$. In this problem both input and output data are structured. Structured data are data that have an internal structure, for example a graph or a tree, or data that are interdependent. To solve this problem we use the IOKR framework, which can learn a mapping between structured inputs and structured outputs. This framework was introduced by Brouard et al. to solve link prediction in the semi-supervised setting. In Brouard et al., this approach was extended to address general structured output prediction problems. In this section we describe this method and explain how it can be applied to metabolite identification.

Table 1.

Notations used in the article

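Symbol	Explanation
$\mathcal{X}$, $\mathcal{Y}$	input, output sets
$x$, $y$	elements of $\mathcal{X}$, $\mathcal{Y}$
$\kappa_y: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$	output scalar kernel
$\mathcal{F}_y$	output feature space
$\phi_y: \mathcal{Y} \to \mathcal{F}_y$	output feature map
$\mathcal{K}_x: \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{F}_y)$	input operator-valued kernel
$\mathcal{H}$	reproducing kernel Hilbert space of $\mathcal{K}_x$
$\mathbf{K}_{X_\ell}$	Gram matrix of $\mathcal{K}_x$ on the training set
$\kappa_x: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$	input scalar kernel
$\mathcal{F}_x$	input feature space
$\phi_x: \mathcal{X} \to \mathcal{F}_x$	input feature map
$K_{X_\ell}$	Gram matrix of $\kappa_x$ on the training set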

In the IOKR approach the internal structure of the output data is encoded using a kernel function $\kappa_y: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. A kernel function is a positive semi-definite function that measures the similarity between two elements. Its values can be evaluated by computing scalar products in a high-dimensional space, called the feature space. In the case of the output kernel $\kappa_y$, this is written as follows:

$$\kappa_y(y, y') = \langle \phi_y(y), \phi_y(y') \rangle_{\mathcal{F}_y}, \quad \forall (y, y') \in \mathcal{Y} \times \mathcal{Y},$$

where the Hilbert space $\mathcal{F}_y$ is the feature space associated with $\kappa_y$ and $\phi_y: \mathcal{Y} \to \mathcal{F}_y$ is a feature map that maps the outputs to the output feature space. Depending on the kernel used, for example when using a Gaussian kernel, the feature map might not be explicitly known. We will see later that we only need to evaluate inner products between feature vectors for computing the solution, which is possible using the kernel trick in the output space. This means that the scalar products in the feature space are replaced by the kernel values.

The spectra-metabolite mapping problem can then be decomposed into two tasks (see Figure 1). The first task consists of learning a function h between the input set $\mathcal{X}$ and the Hilbert space $\mathcal{F}_y$ that approximates the feature map $\phi_y$. This task is called Output Kernel Regression. The second task is a pre-image problem that requires learning or defining a function g from $\mathcal{F}_y$ to the output set $\mathcal{Y}$. We detail these two steps in the following subsections.
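The kernel trick mentioned above can be illustrated with a minimal sketch. It assumes toy vectors in place of molecules and a Gaussian kernel with an arbitrary width; the function names and data are hypothetical and are not taken from the article. The squared distance between two feature vectors is expanded into three kernel evaluations, so the feature map itself is never built.

```python
import numpy as np

def linear_kernel(u, v):
    """Linear kernel: here the feature map is the identity."""
    return float(np.dot(u, v))

def gaussian_kernel(u, v, gamma=0.5):
    """Gaussian kernel: its feature map is infinite-dimensional and never formed."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-gamma * (diff @ diff)))

def feature_space_sq_distance(kernel, u, v):
    """||phi(u) - phi(v)||^2 = k(u, u) - 2 k(u, v) + k(v, v)."""
    return kernel(u, u) - 2.0 * kernel(u, v) + kernel(v, v)

# Toy vectors standing in for two output objects (hypothetical data).
y1 = np.array([1.0, 0.0, 1.0])
y2 = np.array([1.0, 1.0, 0.0])

d_lin = feature_space_sq_distance(linear_kernel, y1, y2)      # equals ||y1 - y2||^2
d_gauss = feature_space_sq_distance(gaussian_kernel, y1, y2)  # no explicit phi needed
```

For the linear kernel the result coincides with the ordinary squared Euclidean distance, which gives a quick sanity check of the expansion.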

Fig. 1.

Overview of the IOKR framework for solving the metabolite identification problem. The mapping f between MS/MS spectra and 2D molecular structures is learnt by approximating the output feature map with a function h and solving a preimage problem

2.1 Output Kernel Regression

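The values of the function h that we want to learn in the Output Kernel Regression step are vectors belonging to the Hilbert space $\mathcal{F}_y$ and not scalars. IOKR uses the Reproducing Kernel Hilbert Space (RKHS) theory devoted to vector-valued functions (Micchelli and Pontil, 2005; Senkene and Tempel’man, 1973) in order to find an appropriate functional space in which to search for the function h. This theory extends kernel methods naturally to the problem of learning vector-valued functions. It has been used in the literature to solve different learning problems such as multi-task learning (Evgeniou et al.), functional regression (Kadri et al.), link prediction (Brouard et al.) and vector autoregression (Lim et al.).

In this theory, a kernel $\mathcal{K}_x: \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{F}_y)$ is a function whose values are linear operators from $\mathcal{F}_y$ to $\mathcal{F}_y$, where $\mathcal{F}_y$ is a general Hilbert space. This theory does not require any assumption on the existence of an output kernel $\kappa_y$. $\mathcal{K}_x$ is called an operator-valued kernel if it verifies the two following properties. First, it is Hermitian:

$$\forall (x, x') \in \mathcal{X} \times \mathcal{X}, \quad \mathcal{K}_x(x, x') = \mathcal{K}_x(x', x)^*,$$

where $^*$ denotes the adjoint. $\mathcal{K}_x(x', x)^*$ is defined as the linear operator satisfying $\langle \mathcal{K}_x(x', x) u, v \rangle_{\mathcal{F}_y} = \langle u, \mathcal{K}_x(x', x)^* v \rangle_{\mathcal{F}_y}$ for all $u, v \in \mathcal{F}_y$. Second, it is positive semi-definite:

$$\forall m \in \mathbb{N}, \; \forall \{(x_i, u_i)\}_{i=1}^{m} \subseteq \mathcal{X} \times \mathcal{F}_y, \quad \sum_{i,j=1}^{m} \langle u_i, \mathcal{K}_x(x_i, x_j) u_j \rangle_{\mathcal{F}_y} \geq 0.$$

In the case where the dimension d of $\mathcal{F}_y$ is finite, the kernel is a function whose values are matrices of size d × d and the kernel matrix is a block matrix.

In the IOKR approach, the function h is searched in the RKHS with reproducing kernel $\mathcal{K}_x$. We denote this space $\mathcal{H}$. This means that we are searching models of the following form:

$$h(\cdot) = \sum_{i} \mathcal{K}_x(\cdot, x_i)\, c_i, \quad c_i \in \mathcal{F}_y.$$

Let $\{(x_i, y_i)\}_{i=1}^{\ell} \subseteq \mathcal{X} \times \mathcal{Y}$ be the set of training examples. The function h is found by solving a regularized optimization problem. In this article, we chose to use the regularized least-squares loss function in the supervised setting:

$$\min_{h \in \mathcal{H}} \; \sum_{i=1}^{\ell} \| h(x_i) - \phi_y(y_i) \|_{\mathcal{F}_y}^2 + \lambda \| h \|_{\mathcal{H}}^2, \qquad (1)$$

where $\lambda > 0$ is a regularization parameter. A sufficiently high value of λ prevents overfitting. According to the Representer Theorem (Micchelli and Pontil, 2005), the solution of this optimization problem can be written as a linear combination of the operator-valued kernel evaluated on the training examples:

$$\hat{h}(x) = \sum_{i=1}^{\ell} \mathcal{K}_x(x, x_i)\, c_i,$$

where $c_i$, $i = 1, \dots, \ell$, are vectors in $\mathcal{F}_y$.

By substituting this expression into the optimization problem (1) and setting its derivative to zero, Micchelli and Pontil (2005) showed that the vectors $c_i$ verify the following equation:

$$\sum_{i=1}^{\ell} \left( \mathcal{K}_x(x_j, x_i) + \lambda \delta_{ij} \right) c_i = \phi_y(y_j), \quad j = 1, \dots, \ell,$$

where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ for all $i \neq j$. If the dimension d of the output feature space is finite, this solution can be rewritten in closed form as follows:

$$\mathrm{vec}(C_\ell) = \left( \lambda I_{\ell d} + \mathbf{K}_{X_\ell} \right)^{-1} \mathrm{vec}(\Phi_{Y_\ell}), \qquad (2)$$

where $C_\ell = (c_1, \dots, c_\ell)$ and $\Phi_{Y_\ell} = (\phi_y(y_1), \dots, \phi_y(y_\ell))$ are two matrices of size d × ℓ; $I_{\ell d}$ denotes the identity matrix of size ℓd × ℓd; and $\mathbf{K}_{X_\ell}$ is the Gram matrix of the operator-valued kernel $\mathcal{K}_x$ on the training set. This is a block matrix, each block being of size d × d. $\mathrm{vec}(C_\ell)$ is a column vector of length ℓd obtained by stacking the columns of the matrix $C_\ell$ on top of each other. Equation (2) generalizes the solution obtained with kernel ridge regression to the case of vector-valued functions.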
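The closed-form solution of Equation (2) can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are random, the function names are hypothetical, and the operator-valued kernel is taken to be a scalar kernel times the identity (the form the article adopts in Section 2.2). In that special case the ℓd × ℓd system collapses to an ℓ × ℓ one, and the sketch checks that both routes give the same coefficients.

```python
import numpy as np

def fit_general(K_block, Phi_Y, lam):
    """Solve vec(C) = (lam * I + K_block)^{-1} vec(Phi_Y).

    K_block: (l*d, l*d) block Gram matrix; Phi_Y: (d, l) output features.
    """
    d, l = Phi_Y.shape
    vec_phi = Phi_Y.reshape(-1, order="F")          # stack the columns
    vec_c = np.linalg.solve(lam * np.eye(l * d) + K_block, vec_phi)
    return vec_c.reshape((d, l), order="F")

def fit_identity_case(K_X, Phi_Y, lam):
    """Special case K_x(x, x') = kappa_x(x, x') * I: C = Phi_Y (lam * I + K_X)^{-1}."""
    l = K_X.shape[0]
    # (lam * I + K_X) is symmetric, so solve for C^T and transpose back.
    return np.linalg.solve(lam * np.eye(l) + K_X, Phi_Y.T).T

# Random toy problem (hypothetical sizes and data).
rng = np.random.default_rng(0)
l, d, lam = 6, 3, 0.1
X = rng.normal(size=(l, 4))
K_X = X @ X.T                                # scalar input Gram matrix
Phi_Y = rng.normal(size=(d, l))              # explicit output feature vectors
K_block = np.kron(K_X, np.eye(d))            # block (i, j) is K_X[i, j] * I_d

C_general = fit_general(K_block, Phi_Y, lam)
C_fast = fit_identity_case(K_X, Phi_Y, lam)
```

The second route only factorizes an ℓ × ℓ matrix regardless of d, which is one way to see why training can be cheap when the operator-valued kernel has this simple form.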

2.2 Preimage step

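To predict the output metabolite f(x) associated with the spectrum $x \in \mathcal{X}$, we must determine the pre-image of h(x) by $\phi_y$. For this, we search for the metabolite y in a set of candidates $\mathcal{Y}^* \subseteq \mathcal{Y}$ that minimizes the following criterion:

$$f(x) = \arg\min_{y \in \mathcal{Y}^*} \| h(x) - \phi_y(y) \|_{\mathcal{F}_y}^2. \qquad (3)$$

As we consider that the output kernel is normalized, i.e. $\kappa_y(y, y) = 1$ for all y, expanding the squared norm shows that Equation (3) becomes:

$$f(x) = \arg\max_{y \in \mathcal{Y}^*} \langle h(x), \phi_y(y) \rangle_{\mathcal{F}_y}.$$

In this work, we consider operator-valued kernels of the following form:

$$\mathcal{K}_x(x, x') = \kappa_x(x, x')\, I_{\mathcal{F}_y},$$

where $\kappa_x: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a scalar input kernel and $I_{\mathcal{F}_y}$ is the identity operator on $\mathcal{F}_y$. We denote by $\mathcal{F}_x$ the Hilbert space associated with this kernel and by $\phi_x: \mathcal{X} \to \mathcal{F}_x$ a feature map of $\kappa_x$. By using this operator-valued kernel and replacing h by the solution given in the previous subsection, we obtain the following solution for metabolite identification with IOKR:

$$f(x) = \arg\max_{y \in \mathcal{Y}^*} \phi_y(y)^T \Phi_{Y_\ell} \left( \lambda I_\ell + K_{X_\ell} \right)^{-1} \Phi_{X_\ell}^T \phi_x(x),$$

where $\Phi_{X_\ell} = (\phi_x(x_1), \dots, \phi_x(x_\ell))$ and $K_{X_\ell} = \Phi_{X_\ell}^T \Phi_{X_\ell}$ is the Gram matrix of the scalar kernel $\kappa_x$ on the training set. Using the kernel trick in the output space allows us to evaluate f(x) even in the case where the output feature map $\phi_y$ is not known explicitly. The solution can be rewritten as follows:

$$f(x) = \arg\max_{y \in \mathcal{Y}^*} (k_Y^y)^T \left( \lambda I_\ell + K_{X_\ell} \right)^{-1} k_X^x,$$

where $k_Y^y = (\kappa_y(y_1, y), \dots, \kappa_y(y_\ell, y))^T$ and $k_X^x = (\kappa_x(x_1, x), \dots, \kappa_x(x_\ell, x))^T$ are two column vectors.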
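The final scoring rule can be sketched as below. Everything in the example is an assumption made for illustration: the toy spectra and fingerprints, the use of a cosine (normalized linear) kernel on both sides, the value of the regularization parameter, and the function names. The test spectrum is chosen identical to the first training spectrum, so the candidate that matches the first training molecule is expected to obtain the highest score.

```python
import numpy as np

def cosine_kernel(A, B):
    """Normalized linear kernel between the rows of A and the rows of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    norms = np.linalg.norm(A, axis=1)[:, None] * np.linalg.norm(B, axis=1)[None, :]
    return (A @ B.T) / norms

def iokr_scores(K_X, k_x, K_Y_cand, lam):
    """score(y) = (k_Y^y)^T (lam * I + K_X)^{-1} k_X^x for every candidate y.

    K_X: (l, l) input Gram matrix; k_x: (l,) kernel values between the
    training spectra and the test spectrum; K_Y_cand: (l, m) output kernel
    values between the training molecules and the m candidates.
    """
    l = K_X.shape[0]
    weights = np.linalg.solve(lam * np.eye(l) + K_X, k_x)
    return K_Y_cand.T @ weights

# Hypothetical toy data: 3 training pairs, 1 test spectrum, 3 candidates.
X_train = np.eye(3)
Y_train = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]])
x_test = X_train[0]
candidates = np.array([[1, 0, 0, 1], [1, 1, 0, 0], [0, 1, 1, 1]])
lam = 0.01

K_X = cosine_kernel(X_train, X_train)
k_x = cosine_kernel(X_train, x_test[None, :])[:, 0]
K_Y_cand = cosine_kernel(Y_train, candidates)

scores = iokr_scores(K_X, k_x, K_Y_cand, lam)
best = int(np.argmax(scores))   # index of the predicted candidate
```

Only one linear solve is needed per test spectrum; the candidates enter through a single matrix-vector product, so large candidate sets mainly cost output kernel evaluations.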

2.3 Kernels

In the following, we describe the pairs of kernels that we used for solving the metabolite identification problem with IOKR.

2.3.1 Input kernels

We considered several existing mass spectral kernels for the scalar input kernel (Dührkop et al.; Heinonen et al.; Shen et al.). The kernels we used in this article are listed in Table 2. Most of them are defined on fragmentation trees (Dührkop et al.; Shen et al.). Introduced by Böcker and Rasche (2008), fragmentation trees model the fragmentation process of a molecule as a tree: the nodes of this tree are molecular formulas that correspond to the unfragmented molecule and its fragments. An edge between two nodes indicates the existence of a fragmentation reaction between two fragments or between the unfragmented molecule and one of its fragments. These edges are directed and correspond to losses. An example of a fragmentation tree is given in Figure 2. Based on fragmentation trees, different categories of kernels have been proposed, such as loss-based kernels, node-based kernels, path-based kernels and fragmentation tree alignment kernels.

Table 2.

Description of the input kernels used in this article

Category	Name	Description	Reference
Loss-based kernels	Loss binary (LB)	counts the number of common losses	Shen et al. (2014)
	Loss intensity (LI)	weighted variant of LB that uses the intensity of terminal nodes	Shen et al. (2014)
	Loss count (LC)	counts the number of occurrences of the losses	Shen et al. (2014)
	Weighted loss count (LW)	weighted variant of LC using the inverse frequency of training losses
	Root loss binary (RLB)	counts the number of common losses from the root to some node	Shen et al. (2014)
	Root loss intensity (RLI)	weighted variant of RLB that uses the intensity of terminal nodes	Shen et al. (2014)
	Loss intensity PP (LIPP)	probability product (PP) of shared losses	Dührkop et al. (2015)
Node-based kernels	Node binary (NB)	counts the number of nodes with the same molecular formula	Shen et al. (2014)
	Node intensity (NI)	weighted variant of NB that uses the intensity of nodes	Shen et al. (2014)
	Node subformula (NSF)	counts the number of common substructures	Dührkop et al. (2015)
	Fragment intensity PP (FIPP)	PP of shared fragments (nodes)	Dührkop et al. (2015)
Path-based kernels	Common paths counting (CPC)	counts the number of common paths (identical sequences of losses)	Shen et al. (2014)
	Common paths of length 2 (CP2)	counts the number of common paths of length 2	Shen et al. (2014)
	Common paths of length at least 2 (CP2+)	counts the number of common paths of length at least 2	Dührkop et al. (2015)
	Common paths with Kpeaks (CPK1)	the PPK Kpeaks are used to score the terminal peaks	Shen et al. (2014)
	Common paths with Kpeaks (CPK2)	same as CPK1 with a different parameter	Shen et al. (2014)
	Common path joined binary (CPJB)	counts the number of paths for which the union of losses is equal	Dührkop et al. (2015)
	Common path joined (CPJ)	counts paths of length 2 that have the same loss
	Weighted paths counting (WPC)	weighted variant of CPC that uses the inverse frequency of the losses
Subtree kernel	Common subtree counting (CSC)	counts the number of subtrees with common structures and losses	Shen et al. (2014)
Fragmentation tree alignment kernels	TALIGN	Pearson correlation of alignment scores between fragmentation trees	Dührkop et al. (2015)
	TALIGND	variant of TALIGN that modifies the scoring function	Dührkop et al. (2015)
Probability product kernel	Recalibrated PPK (PPKr)	PPK computed on preprocessed spectra	Dührkop et al. (2015); Heinonen et al. (2012)
Other	Chemical element counting (CEC)	weighted counts of chemical elements	Dührkop et al. (2015)
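To make the simplest entries of Table 2 concrete, the sketch below reads "counts the number of common losses" (LB) and "counts the number of nodes with the same molecular formula" (NB) as set intersections. This reading, the edge-list representation of a tree, the two toy trees and all function names are assumptions made for illustration; the published kernels may differ in detail.

```python
# A fragmentation tree is stored as a list of directed edges
# (parent formula, child formula, loss formula). Both trees are hypothetical.
tree_a = [
    ("C6H12O6", "C6H10O5", "H2O"),
    ("C6H10O5", "C5H10O4", "CO"),
    ("C6H10O5", "C6H8O4", "H2O"),
]
tree_b = [
    ("C5H10O5", "C5H8O4", "H2O"),
    ("C5H8O4", "C4H8O3", "CO"),
    ("C5H10O5", "C4H8O4", "CH2O"),
]

def loss_set(tree):
    """Distinct losses (edge labels) occurring in the tree."""
    return {loss for _, _, loss in tree}

def node_set(tree):
    """Distinct molecular formulas (node labels) occurring in the tree."""
    return {formula for parent, child, _ in tree for formula in (parent, child)}

def loss_binary_kernel(t1, t2):
    """Number of losses that the two trees have in common."""
    return len(loss_set(t1) & loss_set(t2))

def node_binary_kernel(t1, t2):
    """Number of node formulas that the two trees have in common."""
    return len(node_set(t1) & node_set(t2))

lb_ab = loss_binary_kernel(tree_a, tree_b)   # shared losses: H2O and CO
nb_ab = node_binary_kernel(tree_a, tree_b)   # no shared formulas in this toy case
```

Both functions are inner products of binary indicator vectors, which is why they are valid kernels; the toy pair also shows how two trees can share losses without sharing any fragment formula.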

Fig. 2.

An example of an MS/MS spectrum and its fragmentation tree. Each node of the fragmentation tree corresponds to a peak and is labeled by the molecular formula of the corresponding fragment. The root of the tree is labeled with the molecular formula of the unfragmented molecule. Edges represent the losses. Two nodes and one edge are colored to show the correspondence between the MS/MS spectrum and the fragmentation tree

We also used the recalibrated probability product kernel (PPKr), which is computed on preprocessed spectra. The PPK kernel, introduced by Heinonen et al., is computed from MS/MS spectra by modeling each peak in a spectrum by a normal distribution with two dimensions: the mass-to-charge ratio and the intensity. A spectrum is then modeled as a mixture of normal distributions. The PPK kernel between two spectra is evaluated by integrating the product of the two corresponding mixture distributions.

We learned a linear combination of these 24 input kernels using multiple kernel learning (MKL). We used uniform MKL (UNIMKL), which assigns the same weight to each kernel. We also applied the ALIGNF approach (Cortes et al.), which obtained the best performance for metabolite identification in the comparison performed by Shen et al. ALIGNF seeks to maximize the centered kernel alignment between the combined kernel matrix $K_\mu = \sum_{k} \mu_k K_k$ and an ideal target kernel matrix $K_Y$:

$$\max_{\mu} \frac{\langle K_\mu^{c}, K_Y \rangle_F}{\| K_\mu^{c} \|_F \, \| K_Y \|_F} \quad \text{s.t.} \quad \| \mu \|_2 = 1, \; \mu \geq 0,$$

where $K_\mu^{c} = \sum_{k} \mu_k K_k^{c}$. $K_k^{c}$ denotes the centered Gram matrices of the input kernels:

$$K_k^{c} = \left[ I_\ell - \frac{\mathbf{1}\mathbf{1}^T}{\ell} \right] K_k \left[ I_\ell - \frac{\mathbf{1}\mathbf{1}^T}{\ell} \right],$$

where $\mathbf{1}$ is a column vector of ones of length ℓ. In Cortes et al., the target kernel was defined as $K_Y = y y^T$ in the case of single label classification. Here we used the Gram matrix of the output kernel on the training set. The combination of kernels learned with ALIGNF was then used for the scalar input kernel in IOKR.
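The building blocks of this step (centering a Gram matrix, measuring centered alignment with a target kernel, and forming the uniform combination) can be sketched as follows. The sketch deliberately does not implement the ALIGNF weight optimization itself, which requires a constrained solver; the random features, the function names and the choice to center both matrices in the alignment are assumptions made for illustration.

```python
import numpy as np

def center_gram(K):
    """Apply [I - 11^T / l] K [I - 11^T / l] to a square Gram matrix."""
    l = K.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l
    return H @ K @ H

def centered_alignment(K1, K2):
    """Frobenius inner product of the centered matrices over their norms."""
    K1c, K2c = center_gram(K1), center_gram(K2)
    return float(np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c)))

def unimkl(kernels):
    """Uniform combination: every Gram matrix receives the same weight."""
    return sum(kernels) / len(kernels)

# Hypothetical toy data: two input kernels and one target (output) kernel.
rng = np.random.default_rng(1)
Y = rng.normal(size=(8, 3))
K_Y = Y @ Y.T                                   # target kernel
F_good = Y + 0.1 * rng.normal(size=Y.shape)     # features related to the target
F_noise = rng.normal(size=(8, 3))               # unrelated features
K_good, K_noise = F_good @ F_good.T, F_noise @ F_noise.T

K_uniform = unimkl([K_good, K_noise])
align_good = centered_alignment(K_good, K_Y)
align_noise = centered_alignment(K_noise, K_Y)
align_self = centered_alignment(K_Y, K_Y)       # equals 1 up to rounding
```

Comparing `align_good` with `align_noise` gives an intuition for why an alignment-based scheme tends to put weight on kernels that agree with the output kernel, whereas the uniform scheme ignores this information.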

2.3.2 Output kernels

For the output kernel, we have to define a similarity that takes into account the inherent structure of the metabolites. We compared the results obtained using different graph kernels (path, shortest-path and graphlet kernels) as well as kernels defined on molecular fingerprints. A molecular fingerprint is a vector encoding the structure of a molecule. Generally the values of this vector are binary values that indicate the presence or absence of certain molecular properties. A bit can indicate, for example, the presence of a particular atom, a type of ring, an atom pair or a common functional group in the structure of the molecule. We consider here the kernels that obtained the best performance, which are the kernels based on fingerprints. We used the set of 2,765 binary molecular properties described in Dührkop et al. More details about these molecular properties are given in the Supplementary Materials. In the experiments, we considered different types of output kernels:

linear kernel: $\kappa_y(y, y') = c(y)^T c(y')$,
polynomial kernel: $\kappa_y(y, y') = (c(y)^T c(y') + a)^b$,
Gaussian kernel: $\kappa_y(y, y') = \exp(-\gamma \| c(y) - c(y') \|^2)$,

where c(y) and c(y') denote the molecular fingerprints of y and y', a and b are the offset and the degree of the polynomial kernel, and $\gamma > 0$ is the width parameter of the Gaussian kernel.
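The three output kernels can be sketched on binary fingerprints as follows. The tiny fingerprint matrix, the default parameter values and the function names are hypothetical; the normalization helper corresponds to the normalized kernels that the article assumes elsewhere.

```python
import numpy as np

def linear_kernel(C1, C2):
    """Inner products between the fingerprint rows of C1 and C2."""
    return C1 @ C2.T

def polynomial_kernel(C1, C2, offset=1.0, degree=2):
    """(c(y)^T c(y') + offset) ** degree."""
    return (C1 @ C2.T + offset) ** degree

def gaussian_kernel(C1, C2, gamma=0.1):
    """exp(-gamma * ||c(y) - c(y')||^2), using the expanded squared norm."""
    sq = (C1 ** 2).sum(axis=1)[:, None] - 2.0 * (C1 @ C2.T) + (C2 ** 2).sum(axis=1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))

def normalize_gram(K):
    """Rescale a square Gram matrix so that its diagonal equals 1."""
    diag = np.sqrt(np.diag(K))
    return K / np.outer(diag, diag)

# Hypothetical fingerprints of three molecules, four properties each.
F = np.array([[1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

K_lin = linear_kernel(F, F)
K_poly_as_linear = polynomial_kernel(F, F, offset=0.0, degree=1)
K_gauss = gaussian_kernel(F, F)
K_lin_normalized = normalize_gram(K_lin)
```

With offset 0 and degree 1 the polynomial kernel reduces to the linear kernel, which is the situation the article reports after parameter selection in Section 3.3.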

3 Results

We evaluated and compared our approach on a subset of 4138 MS/MS spectra extracted from the GNPS (Global Natural Products Social) public spectral library (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp), as in Dührkop et al.

3.1 Protocol

The evaluation was performed using a 10-fold cross-validation (10-CV) procedure such that all compounds having the same structure are contained in the same fold. The input and output kernels were centered and normalized. The regularization parameter λ and the parameter(s) of the output kernel were selected using leave-one-out CV on each training fold. We used the averaged mean squared error (MSE) as the error measure for tuning these parameters. The leave-one-out estimate of the averaged MSE was computed using the closed-form solution proved in Brouard et al. In the prediction step, the method was evaluated on 3,868 compounds. For solving the pre-image step, following Dührkop et al., we assumed that the molecular formula of every spectrum had already been predicted in a preprocessing step, and we searched among the PubChem (Bolton et al.) structures having the same molecular formula as the current target. We computed the distance between the predicted output feature vector $h(x)$ (see Equation 3) and the output feature vectors of all the candidates. After the pre-image step, we ranked the candidates according to their distances to $h(x)$ (from the smallest distance to the largest). For the evaluation, we determined the rank obtained by the true molecular structure among the candidate set for each test example, and then computed the percentage of structures whose rank is at most k, for varying values of k. A test compound is said to be correctly identified if its correct structure is ranked first in the list.
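The ranking and top-k evaluation described above can be sketched as below. The distances are invented, the function names are hypothetical, and two conventions are assumptions of this sketch: ties are resolved in favor of the true structure, and "top k" is read as rank at most k.

```python
import numpy as np

def rank_of_true(distances, true_index):
    """1-based rank of the true candidate when sorting by increasing distance.

    Candidates that are strictly closer to the prediction are counted, so a
    tie is resolved in favor of the true structure (assumption of this sketch).
    """
    distances = np.asarray(distances, dtype=float)
    return int(np.sum(distances < distances[true_index])) + 1

def top_k_percentage(ranks, k):
    """Percentage of test examples whose true structure has rank <= k."""
    ranks = np.asarray(ranks)
    return 100.0 * float(np.mean(ranks <= k))

# Hypothetical distances to h(x) for three test spectra and their candidates.
examples = [
    ([0.9, 0.2, 0.5], 1),        # true structure is the closest candidate
    ([0.1, 0.4, 0.3, 0.8], 3),   # true structure is the farthest candidate
    ([0.6, 0.5], 0),             # true structure is second
]
ranks = [rank_of_true(d, t) for d, t in examples]
top1 = top_k_percentage(ranks, 1)
top10 = top_k_percentage(ranks, 10)
```

Because each candidate set has its own size, ranks are computed per test example first and only then aggregated into percentages.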

3.2 Comparison with competing methods

We compared the performance of our method with two competing methods: FingerID (Heinonen et al.) and CSI:FingerID. Dührkop et al. showed that CSI:FingerID significantly improved the metabolite identification rate compared with competing methods including CFM-ID (Allen et al.), MetFrag (Wolf et al.), MAGMa (Ridder et al.), MIDAS (Wang et al.) as well as FingerID, the second most accurate method in their comparison. Both FingerID and CSI:FingerID train an SVM classifier for each molecular property. A scoring function is then used to compare the predicted fingerprint with the candidate fingerprints, and the candidates are sorted accordingly. FingerID uses the PPK kernel as input, whereas CSI:FingerID uses ALIGNF to learn a combination of this kernel with different kernels defined on fragmentation trees. In our experiment, we evaluated the performance of CSI:FingerID with unit scoring and with the modified Platt score, which was shown to perform best among the different scores compared by Dührkop et al.

3.2.1 Identification performance

CSI:FingerID and FingerID were retrained on the 4138 GNPS spectra. For all methods, the parameter(s) were tuned on the training set using an internal 10-CV procedure. For the SVM-based approaches, the soft margin parameter C was tuned independently for each SVM. Table 3 shows the results obtained with IOKR, FingerID and CSI:FingerID, and the differences with respect to the identification percentage of CSI:FingerID modified Platt are visualized in Figure 3. We observe that IOKR with the UNIMKL combined kernel and a Gaussian output kernel reaches the first position, with 30.66% of the correct structures ranked first. It is followed by IOKR linear UNIMKL, IOKR Gaussian ALIGNF and then by CSI:FingerID modified Platt with 28.84% of correctly identified metabolites. When considering the identification percentage between top 1 and top 20, we observe that IOKR outperforms CSI:FingerID unit in all cases. When using a Gaussian output kernel, IOKR improves upon CSI:FingerID modified Platt by around 2 percentage points. We performed statistical significance tests of the identification performance for the different approaches. These tests show that the difference between CSI:FingerID modified Platt and IOKR using a Gaussian output kernel is highly significant. The corresponding P-values are 1.8e-16 with the UNIMKL combination and 8e-14 with the ALIGNF combination. A table containing all the P-values is given in the Supplementary Materials.

Table 3.

Comparison of the percentage of correctly identified structures for top 1, 10 and 20 using FingerID, CSI:FingerID and IOKR

Method	MKL	Top 1	Top 10	Top 20
FingerID	none	17.74	49.59	58.17
CSI:FingerID unit	ALIGNF	24.82	60.47	68.2
CSI:FingerID mod Platt	ALIGNF	28.84	66.07	73.07
IOKR linear	ALIGNF	28.54	65.77	73.19
	UNIMKL	30.02	66.05	73.66
IOKR Gaussian	ALIGNF	29.78	67.84	74.79
	UNIMKL	30.66	67.94	75.00

The highest values are shown in boldface.

Fig. 3.

Difference in percentage points to the percentage of metabolites ranked lower than k with CSI:FingerID using the modified Platt scoring function


3.2.2 Running times

We computed the running times of CSI:FingerID and IOKR using the 4138 spectra from GNPS as training set and 625 spectra from the Massbank dataset (Horai et al.) as test set (see Table 4). The running times correspond to the times that would have been obtained on a single core. The training times were computed using fixed values for the parameters (regularization and kernel parameters). The computation of the fragmentation trees, input kernels and fingerprints was not taken into account here. In Table 4, we observe a substantial difference between the training times obtained with the two approaches: the IOKR method is approximately 7000 times faster to train. This can be explained by the fact that CSI:FingerID needs to train an SVM classifier for each molecular property, which means 2765 SVMs in this experiment. For the same reason, IOKR also has a smaller test time than CSI:FingerID. With the linear kernel, the test running time of IOKR is smaller than with a Gaussian or polynomial kernel. This is because, for the linear kernel, we can avoid kernel computations in the pre-image step by computing the output feature vectors explicitly.

Table 4.

Running time evaluation

	Training time	Test time
CSI:FingerID	82 h 28 min 23 s	1 h 11 min 31 s
IOKR linear	42 s	1 min 15 s
IOKR polynomial	38 s	21 min 58 s
IOKR Gaussian	41 s	33 min 15 s

These running times were obtained by training the methods on the 4138 GNPS spectra and using 625 spectra from Massbank as test set.
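The speed-up factors implied by Table 4 can be recomputed with a few lines of arithmetic. The sketch below only converts the reported times to seconds and divides them; the helper name and dictionary layout are hypothetical.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a duration given as h / min / s into seconds."""
    return 3600 * hours + 60 * minutes + seconds

# Running times copied from Table 4.
train = {
    "CSI:FingerID": to_seconds(82, 28, 23),
    "IOKR linear": to_seconds(seconds=42),
    "IOKR polynomial": to_seconds(seconds=38),
    "IOKR Gaussian": to_seconds(seconds=41),
}
test = {
    "CSI:FingerID": to_seconds(1, 11, 31),
    "IOKR linear": to_seconds(0, 1, 15),
    "IOKR polynomial": to_seconds(0, 21, 58),
    "IOKR Gaussian": to_seconds(0, 33, 15),
}

# Ratio of the CSI:FingerID time to each IOKR time.
train_speedup = {name: train["CSI:FingerID"] / t
                 for name, t in train.items() if name != "CSI:FingerID"}
test_speedup = {name: test["CSI:FingerID"] / t
                for name, t in test.items() if name != "CSI:FingerID"}
```

With these figures the training ratios fall between roughly 7,000 and 7,800, while the test ratios are about 57 for the linear output kernel and between 2 and 4 for the polynomial and Gaussian output kernels.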


3.3 Detailed evaluation of identification with IOKR

We now analyze in more detail the results obtained with our method on the GNPS dataset. We begin by presenting the results obtained for the different input and output kernels introduced in Section 2. Figure 4 shows the percentage of correctly identified structures (i.e. correct structures ranked first among all candidates) obtained with IOKR for the different pairs of input and output kernels. The last two columns correspond to the linear kernel combinations with UNIMKL and ALIGNF. We observe that the two MKL approaches clearly improve the results compared with the single kernels. The best performance is obtained with the UNIMKL approach, which performs slightly better than ALIGNF: 30.74% of the metabolites are correctly identified with the UNIMKL combined kernel. Among the individual input kernels, tree alignment-based kernels, node-based kernels [except Node subformula (NSF)] and the PPKr kernel obtain the best results. At the opposite end, the loss-based kernels and chemical element counting (CEC) are associated with a low percentage of correctly identified metabolites. Regarding output kernels, we notice that the performance obtained with the linear and polynomial kernels is the same. This is because the optimal parameters selected for the polynomial kernel are 0 for the offset parameter and 1 for the degree, which makes it equal to the linear kernel. Using a Gaussian kernel seems to slightly improve the percentage of correctly identified structures for some input kernels, except for the root loss binary (RLB) kernel.

Fig. 4.

Heatmap of the percentage of correctly identified metabolites (Top 1) with IOKR. The rows correspond to the different output kernels built on fingerprints (linear, polynomial and Gaussian) and the columns to the 24 input kernels derived from spectra and fragmentation trees, as well as the two multiple kernel combination schemes ALIGNF and UNIMKL

The averaged kernel weights learned with the ALIGNF algorithm on the training folds are visualized in Figure 5 for the three output kernels. The PPKr kernel is selected with the highest weight by ALIGNF for the three output kernels. Consistent with Figure 4, the weights obtained for the linear and polynomial kernels are effectively the same. We observe that the weights are quite sparse: 14 kernels out of 24 are associated with a weight that is lower than $10^{-6}$. In order to analyze why these 10 particular kernels are selected by ALIGNF, we plotted the pairwise kernel alignment scores between the input kernels, as well as the alignment scores between the input and output kernels (see the Supplementary Materials). The first plot shows which input kernels are similar to each other. Nine groups of kernels can be distinguished, and we notice that at least one kernel in each group is selected by ALIGNF. The only exception is the group containing the subtree kernel CSC, but this might be because this input kernel is the one having the lowest alignment score with the output kernel. The sparsity of the kernel weights can therefore be explained by the fact that some kernels are very similar to each other and thus contain redundant information.

Fig. 5.

Heatmap of kernel weights learned by ALIGNF for all pairs of input and output kernels on GNPS dataset. The weights have been averaged over the 10 CV folds

3.4 Prediction analysis

In the following, we detail the performance of IOKR on the test metabolites as a function of the size of their candidate sets. For this, we consider the best pair of kernels: the UNIMKL combined kernel in input and the Gaussian kernel in output. Figure 6a shows the distribution of the sizes of the candidate sets, and Figure 6b shows the percentage of correctly identified metabolites in top 1, top 10 and above. We observe that the majority of the candidate sets contain <1000 candidates in our dataset. For these candidate sets, 32.8% of metabolites are identified correctly in the first position (magenta bars) and 71.7% are within the top 10 (cyan bars). The sizes of the candidate sets do not seem to have a strong influence on the identification accuracy. Even for large candidate sets our method is able to identify significant proportions of molecules within top 1 and top 10.

Fig. 6.

Identified metabolites with IOKR as a function of the size of the candidate sets. We considered the candidate sets of size smaller than 8000, which corresponds to 98.8% of the sets, and divided them into 30 bins according to their sizes. (a) indicates the number of test metabolites that have a candidate set size in the corresponding size bin. (b) shows the percentage of metabolites that are ranked in top 1 position, top 10 or above for the test metabolites falling in each size bin

We found 1203 compounds in the GNPS dataset that can be linked to the ontological classification database ChEBI (Hastings et al.). We are interested in evaluating whether there are some classes of compounds we can identify very well and some for which we cannot. Due to the hierarchical nature of the ontological classification, the classes far away from the root are very specific and contain very few compounds, while the classes close to the root are very generic and contain too many compounds. As a result, we restrict our attention to the classes with shortest paths of length 7 from the root node chemical entity (ChEBI id 24431). For those classes, we count how many compounds in the GNPS dataset belong to them and represent the counts as the size of the points in Figure 7. For each compound, the number of candidates and the rank of the correct compound are known, so we plot the median number of candidates associated with the compounds in each class on the x-axis and the proportion of cases in which the correct compound has rank ≤10 on the y-axis. Note that we only show the classes containing at least 10 compounds.

Fig. 7.

Scatter plot of the classes in the ChEBI ontology whose shortest path from the class chemical entity has length 7. The x-axis corresponds to the median number of candidates associated with the compounds in each class and the y-axis to the proportion of correct compounds with rank less than or equal to 10 for each class. The size of each point is proportional to the number of compounds in the GNPS dataset that belong to that class, and we only show classes with at least 10 compounds. The classes we can identify well are shown in red and the classes we cannot are shown in blue, with the ChEBI id and name next to them.

From Figure 7, it is clear that the number of candidates associated with the compounds is not a major factor in the identification results. For many of the classes with a larger number of compounds, shown as larger points, the correct compound lies within the top 10 in around 60% of the cases. There are some classes we can identify very well, such as 3-aryl-1-benzopyrans (ChEBI id: 50753), also called isoflavonoids, and heterocyclic antibiotics (ChEBI id: 24531), while some classes, shown at the bottom of the figure, contain compounds that are more difficult to identify with our method. Among the difficult cases are the compounds belonging to the cyclic amide (ChEBI id: 23443) class and to the cyanides (ChEBI id: 23424) class. The compounds in the cyanides class contain a cyanide-anion side group, which corresponds to a carbon atom connected to a nitrogen atom via a triple bond.

We also studied the differences in prediction performance between CSI:FingerID and IOKR for the different compound classes. A detailed plot showing the differences between the numbers of compounds better ranked by the two methods is given in the Supplementary Materials. This plot shows that IOKR performs better than CSI:FingerID in 74% of the classes. Interestingly, IOKR shows the largest improvement for the cyanides class and one of its child classes. Conversely, CSI:FingerID considerably improves the performance for the compounds belonging to the heterocyclic antibiotics class and two of its child classes.
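A per-class comparison of two methods of this kind could be computed as in the following Python sketch. The exact definition used for the Supplementary plot is not given here, so the scoring is an assumption: for each class, the number of compounds ranked strictly better by method A minus the number ranked strictly better by method B, with ties counting for neither. The function names and input format are hypothetical.

```python
from collections import defaultdict

def per_class_win_difference(compounds):
    """compounds: iterable of (class_ids, rank_a, rank_b) tuples, where rank_a
    and rank_b are the ranks of the correct compound under methods A and B.
    Returns {class_id: wins_a - wins_b}; a lower rank counts as a win."""
    diff = defaultdict(int)
    for class_ids, rank_a, rank_b in compounds:
        # +1 if A ranks the compound better, -1 if B does, 0 on a tie
        step = (rank_a < rank_b) - (rank_a > rank_b)
        for cid in class_ids:
            diff[cid] += step
    return dict(diff)

def fraction_of_classes_favouring_a(diff):
    """Fraction of classes in which method A wins more often than method B."""
    return sum(d > 0 for d in diff.values()) / len(diff)
```

With method A standing for IOKR and method B for CSI:FingerID, the second function would give a figure comparable in spirit to the 74% reported above, under the stated assumptions.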

4 Discussion

In this article, we have proposed for the first time to solve the metabolite identification problem using a structured output prediction method, namely IOKR. We have shown that our method improves the metabolite identification rate compared to competing methods, with considerably shorter running times, in practice allowing the models to be trained on a single computer instead of a large computing cluster. In addition, the structured output approach provides a more streamlined, and thus easier to maintain, one-step prediction pipeline, as opposed to the two-step pipelines of CSI:FingerID and FingerID, which require predicting and scoring fingerprints as an intermediate step.

For future work, the most important direction is to address the prediction of the ‘dark matter’ in metabolomics (da Silva ): the metabolites that fall outside the compounds in molecular structure databases. There, we need to design better kernels and preimage algorithms for molecular structures.

Finally, it is important to note that the recent breakthroughs in machine learning methodologies for metabolite identification rely heavily on the existence of community efforts building open reference databases such as GNPS and MassBank. At the same time, the reference databases still cover only a small fraction of the relevant metabolite space. Although machine learning can generalize and extrapolate beyond the training data, as also shown in this article, the scarcity of training data still limits how accurate the models can be. To really push metabolomics forward, we should widen, and make more systematic, the community efforts in building and utilizing reference databases.

14 in total

1. Illuminating the dark matter in metabolomics.

Authors: Ricardo R da Silva; Pieter C Dorrestein; Robert A Quinn
Journal: Proc Natl Acad Sci U S A Date: 2015-10-01 Impact factor: 11.205

2. Searching molecular structure databases with tandem mass spectra using CSI:FingerID.

Authors: Kai Dührkop; Huibin Shen; Marvin Meusel; Juho Rousu; Sebastian Böcker
Journal: Proc Natl Acad Sci U S A Date: 2015-09-21 Impact factor: 11.205

3. On learning vector-valued functions.

Authors: Charles A Micchelli; Massimiliano Pontil
Journal: Neural Comput Date: 2005-01 Impact factor: 2.026

4. FiD: a software for ab initio structural identification of product ions from tandem mass spectrometric data.

Authors: Markus Heinonen; Ari Rantanen; Taneli Mielikäinen; Juha Kokkonen; Jari Kiuru; Raimo A Ketola; Juho Rousu
Journal: Rapid Commun Mass Spectrom Date: 2008-10 Impact factor: 2.419

5. Automatic chemical structure annotation of an LC-MS(n) based metabolic profile from green tea.

Authors: Lars Ridder; Justin J J van der Hooft; Stefan Verhoeven; Ric C H de Vos; Raoul J Bino; Jacques Vervoort
Journal: Anal Chem Date: 2013-05-31 Impact factor: 6.986

6. MIDAS: a database-searching algorithm for metabolite identification in metabolomics.

Authors: Yingfeng Wang; Guruprasad Kora; Benjamin P Bowen; Chongle Pan
Journal: Anal Chem Date: 2014-09-11 Impact factor: 6.986

7. In silico fragmentation for computer assisted identification of metabolite mass spectra.

Authors: Sebastian Wolf; Stephan Schmidt; Matthias Müller-Hannemann; Steffen Neumann
Journal: BMC Bioinformatics Date: 2010-03-22 Impact factor: 3.169

8. Metabolite Identification through Machine Learning- Tackling CASMI Challenge Using FingerID.

Authors: Huibin Shen; Nicola Zamboni; Markus Heinonen; Juho Rousu
Journal: Metabolites Date: 2013-06-06

9. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013.

Authors: Janna Hastings; Paula de Matos; Adriano Dekker; Marcus Ennis; Bhavana Harsha; Namrata Kale; Venkatesh Muthukrishnan; Gareth Owen; Steve Turner; Mark Williams; Christoph Steinbeck
Journal: Nucleic Acids Res Date: 2012-11-24 Impact factor: 16.971

10. CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra.

Authors: Felicity Allen; Allison Pon; Michael Wilson; Russ Greiner; David Wishart
Journal: Nucleic Acids Res Date: 2014-06-03 Impact factor: 16.971
