CliqueMS: a computational tool for annotating in-source metabolite ions from LC-MS untargeted metabolomics data based on a coelution similarity network.

Oriol Senan^1, Antoni Aguilar-Mogas^1, Miriam Navarro^2,3, Jordi Capellades^2,3, Luke Noon^3,4, Deborah Burks^3,4, Oscar Yanes^2,3, Roger Guimerà^1,5, Marta Sales-Pardo^1.

Abstract

MOTIVATION: The analysis of biological samples in untargeted metabolomic studies using LC-MS yields tens of thousands of ion signals. Annotating these features is of the utmost importance for answering questions as fundamental as, e.g. how many metabolites are there in a given sample.
RESULTS: Here, we introduce CliqueMS, a new algorithm for annotating in-source LC-MS1 data. CliqueMS is based on the similarity between coelution profiles and therefore, as opposed to most methods, allows for the annotation of a single spectrum. Furthermore, CliqueMS improves upon the state of the art in several dimensions: (i) it uses a more discriminatory feature similarity metric; (ii) it treats the similarities between features in a transparent way by means of a simple generative model; (iii) it uses a well-grounded maximum likelihood inference approach to group features; (iv) it uses empirical adduct frequencies to identify the parental mass and (v) it deals more flexibly with the identification of the parental mass by proposing and ranking alternative annotations. We validate our approach with simple mixtures of standards and with real complex biological samples. CliqueMS reduces the thousands of features typically obtained in complex samples to hundreds of metabolites, and it is able to correctly annotate more metabolites and adducts from a single spectrum than available tools.
AVAILABILITY AND IMPLEMENTATION: https://CRAN.R-project.org/package=cliqueMS and https://github.com/osenan/cliqueMS.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30903689 PMCID: PMC6792096 DOI: 10.1093/bioinformatics/btz207

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The analysis of biological samples in untargeted metabolomic studies using liquid chromatography coupled to electrospray mass spectrometry results in tens of thousands of ion signals or features. It is now well accepted that this large number of features is an overestimation of the real number of different compounds in the sample, mainly because single metabolites can be detected as multiple ions of different mass in either positive or negative ionization mode. This redundancy of features is mostly due to in-source phenomena including cation adduction, multimerization and in-source fragmentation, plus contaminants. However, few studies have attempted to estimate what percentage of the detected features corresponds to unique metabolites in an untargeted metabolomic experiment. These studies, in addition, have reported very disparate numbers (Brown et al., 2011; Jankevics et al.; Mahieu and Patti, 2017), ranging from as low as 3% to more than 50% unique endogenous metabolites. This scenario reflects that the annotation of features, understood as establishing the relationships between features in MS1 mode, is a challenging task and represents a serious obstacle for truly high-throughput analysis of metabolomics data. While computational solutions for the structural identification of metabolites from MS2 data (Aguilar-Mogas et al.; Allen et al.; Dührkop et al.; Heinonen et al.; Ridder et al.; Ruttkies et al.; Tsugawa et al.) have recently demonstrated substantial progress (Nishioka et al.; Schymanski et al.; Schymanski and Neumann, 2013), automated tools aimed at the complete exploitation of LC-MS1 data through the successful annotation of metabolite features have not reached the same maturity level. The two main grouping principles for detecting and annotating features related to a metabolite are chromatographic peak-shape similarity (i.e. coeluting features) and peak-abundance correlation, or a combination thereof.
Pairwise intensity correlation analysis across multiple samples is the basis of computational tools such as AStream (Alonso et al.), MSClust (Tikunov et al.), RAMClust (Broeckling et al.), MS-FLO (DeFelice et al.), compMS2Miner (Edmands et al.), xMSannotator (Uppal et al.) or findMAIN (Jaeger et al.), among other similar approaches (Lee et al.; Zeng et al.). On the other hand, peak-shape similarity is used by CAMERA (Kuhl et al.) and MZmine2 (Pluskal et al.). MetAssign (Daly et al.) and xMSannotator (Uppal et al.) also include a probabilistic score to measure the confidence in particular assignments based on statistical clustering. More recently, knowledge-driven annotation tools have also been proposed by de la Fuente and collaborators (Gil de la Fuente et al.). To aid in the automation of LC-MS1 data processing, we have developed CliqueMS, a computational tool that annotates redundant LC-MS1 features using the similarity between coelution profiles and a calculated natural frequency of adduct formation observed in real complex biological samples and pure compounds. As a result, in contrast to the majority of existing tools, CliqueMS can produce accurate annotations for a single LC-MS1 spectrum. To do so, CliqueMS implements a novel mathematical approach to obtain the most plausible groupings of features according to a similarity network. Next, CliqueMS annotates features and ranks annotations using an estimated frequency of dominant adducts and in-source fragments in complex biological samples and from all available compounds in the National Institute of Standards and Technology 14 MS/MS library (Fig. 1 and Supplementary Material). CliqueMS correctly identifies and annotates a larger number of adducts, leading to more correct parental ion neutral masses than existing approaches such as CAMERA (Kuhl et al.), xMSannotator (Uppal et al.) and MS-FLO (DeFelice et al.).

Fig. 1.

Schematic representation of CliqueMS. CliqueMS identifies the features belonging to the same metabolite. CliqueMS uses as input LC-MS1 data in any format that can be converted into either an ‘xcmsSet’ or an ‘XCMSnExp’ object in R, such as mzML, mzXML, mzData and NetCDF. First, CliqueMS determines peak-shape (i.e. coelution) similarities between all pairs of features in the LC-MS1 spectrum. Then CliqueMS finds groups of features based on the network of similarities. The assumption is that the more similar a pair of features, the more likely they are to belong to the same group. Following a maximum likelihood procedure, CliqueMS finds the best division into fully connected groups of features (or cliques). Then, for each clique, CliqueMS proceeds to annotate each feature by establishing the parental ion neutral mass. Annotations are scored based on a table of empirically observed frequencies for each adduct. The final output is, for each feature, the five annotations with the highest score, specifying the adduct/in-source fragment and its corresponding parental mass. See Supplementary Figure S1 for a detailed description of the installation process, input and output formats as well as the parameters and modules within CliqueMS.

2 Materials and methods

2.1 Description of CliqueMS

Formally, CliqueMS addresses the following problem. Our spectral data comprise a set of features $\{f_i\}$, each characterized by an m/z value and an intensity vector $\vec{I}_i$ [note that the list of features is obtained by running the peak picking algorithm available in XCMS (see Supplementary Material S2)]. For each feature i, we obtain the intensity vector by discretizing the feature into K equal bins so that $\vec{I}_i = (I_i(t_1), \ldots, I_i(t_K))$, where $I_i(t_k)$ is the measured intensity at retention time $t_k$, with $t_k = t_1 + (k-1)\,\Delta t$ and $k = 1, \ldots, K$ (in our analysis, the bin width $\Delta t$ depends on the mass detector operational parameters and the spectral data processing program). Given these data, CliqueMS aims at providing a set of plausible annotations for complex samples based on the assumptions that: (i) features of the same metabolite corresponding to in-source phenomena, including adducts (e.g. Na, K) and fragments (e.g. loss of water), display similar chromatographic elution profiles and (ii) in-source phenomena (such as adducts or fragments) occur with a probability equal to the frequency with which they are observed in experiments. To achieve this goal, we have identified three main steps for annotating features (Fig. 1): (i) the construction of a similarity network, where each node represents a feature and edges are weighted according to the similarity between features; (ii) the identification of the most plausible division of the similarity network into cliques (fully connected groups of features) and (iii) the annotation of the features corresponding to the same parental mass within each clique.

Step 1: construction of a similarity network between features

In order to provide meaningful annotations, we first need to group features so that features corresponding to the same metabolite are grouped together. CliqueMS is based on the expectation that all features resulting from in-source transformations of the same metabolite have a similar chromatographic retention pattern.
A critical step is thus to select an appropriate measure of similarity between features that reflects our expectations and allows the construction of a similarity network to obtain reliable groups of features. A possible choice of similarity function is the Pearson correlation between intensity vectors, as considered in CAMERA (Kuhl et al.). However, the Pearson correlation coefficient is only suited to detect similarity between features whose intensities grow or decrease linearly, and it is therefore a priori not an optimal option when features are non-linear, as in the spectral data we consider. To overcome this caveat, we propose to use the cosine similarity, a simple measure that assesses the alignment between intensity vectors:

$$c_{ij} = \frac{\vec{I}_i \cdot \vec{I}_j}{\|\vec{I}_i\|\,\|\vec{I}_j\|} = \frac{\sum_{k=1}^{K} I_i(t_k)\, I_j(t_k)}{\|\vec{I}_i\|\,\|\vec{I}_j\|}, \qquad (1)$$

where $\|\vec{I}_i\| = \sqrt{\sum_{k=1}^{K} I_i(t_k)^2}$. Note that the sum runs over all time bins, and therefore we are not restricting the comparison between features to a specific retention time window; we are considering the full window of retention times. To compare the ability of Pearson and cosine similarities to discriminate features corresponding to the same metabolite from coeluting features corresponding to different metabolites, we performed a validation experiment in which we manually simulated the coelution of features (see Supplementary Fig. S2). The results from this experiment show that the cosine similarity has greater discriminatory power than the Pearson correlation, which justifies our choice of similarity metric. We then construct a weighted undirected similarity network in which each node corresponds to a feature and the weight of each edge between nodes (i, j) corresponds to $c_{ij}$. Note that this is not a fully connected network, because features that have non-overlapping intensity vectors are not connected ($c_{ij} = 0$).

Step 2: principled identification of groups of features (cliques) in the similarity network

Our next step is to identify groups of features that are similar.
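The similarity network on which this grouping operates can be illustrated with a minimal Python sketch. It is not the CliqueMS implementation: the function names and the toy intensity profiles are invented for illustration, and the intensity vectors are assumed to be already binned on a common retention time grid.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally binned intensity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty profile overlaps with nothing
    return dot / (norm_u * norm_v)

def similarity_network(intensities):
    """Return {(i, j): c_ij} for i < j, keeping only edges with c_ij > 0."""
    edges = {}
    n = len(intensities)
    for i in range(n):
        for j in range(i + 1, n):
            c = cosine_similarity(intensities[i], intensities[j])
            if c > 0.0:
                edges[(i, j)] = c
    return edges

# Toy profiles over K = 6 retention time bins (invented numbers).
peak_a = [0, 2, 8, 4, 0, 0]
peak_b = [0, 1, 4, 2, 0, 0]  # same shape as peak_a at half the intensity
peak_c = [0, 0, 0, 0, 3, 5]  # elutes later, no overlap with peak_a or peak_b

network = similarity_network([peak_a, peak_b, peak_c])
```

In this toy case only the pair (0, 1) is connected, with a similarity numerically equal to one because the two profiles are proportional; the third feature remains isolated because its profile does not overlap with the others.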
Because CliqueMS assumes that all features of the same metabolite must have $c_{ij} > 0$, we aim at identifying cliques of features in the network, that is, groups of features that are fully connected so that $c_{ij} > 0$ for any pair of features within a clique. Formally, the task of finding these groups is equivalent to a label assignment problem in which we want to assign a group label $\sigma_i$ to each feature i, with the constraint that $c_{ij} > 0$ for any pair of nodes with the same label. Note that this problem is different from the problem of community detection in complex networks, in which nodes with the same group label do not necessarily have to be connected (Guimerà and Amaral, 2005). For this reason, we cannot use community detection algorithms for this purpose. Instead, we follow a probabilistic approach to identify the most plausible label assignments (groupings) of features. To that end, we propose a generative model for node label assignments by noting that the cosine similarity between two features is a good proxy for how likely two features are to be adducts of the same metabolite. A plausible assumption is thus that the probability of two features (i, j) having the same group label (i.e. belonging to the same clique) given a certain similarity $c_{ij}$ between intensity vectors is precisely a function of that similarity:

$$p(\sigma_i = \sigma_j \mid c_{ij}) = f(c_{ij}). \qquad (2)$$

Conversely, the probability that two nodes (i, k) have different labels given their similarity $c_{ik}$ is $1 - f(c_{ik})$. To specify the precise dependency of $f$ on $c$, we note that $f$ needs to fulfill the following conditions: (i) it has to be equal to zero if c = 0 (i.e. two nodes whose intensity vectors do not overlap cannot belong to the same clique) and (ii) it has to be equal to one if c = 1 (i.e. features with proportional intensity vectors, e.g. similar peak shapes, have to belong to the same clique). Because in our sample $c_{ij} \in [0, 1]$, any positive power of the cosine similarity will satisfy these two conditions.
Hence, we assume that

$$p(\sigma_i = \sigma_j \mid c_{ij}) \propto c_{ij}^{\alpha}, \qquad (3)$$

where the proportionality is due to a necessary but irrelevant normalization constant and $\alpha > 0$. Under these assumptions, we can express the probability of an assignment of group labels $\{\sigma_i\}$ conditioned on the observed network of similarities $\{c_{ij}\}$ as

$$P(\{\sigma_i\} \mid \{c_{ij}\}) \propto \prod_{i<j} \left(c_{ij}^{\alpha}\right)^{\delta_{\sigma_i \sigma_j}} \left(1 - c_{ij}^{\alpha}\right)^{1 - \delta_{\sigma_i \sigma_j}}, \qquad (4)$$

where $\delta_{\sigma_i \sigma_j}$ is the Kronecker delta function. This conditional probability is the likelihood of the model. Assuming that we have no prior information about how labels are assigned to nodes, the most plausible group label assignment is the one that maximizes Equation (4) or, equivalently, the log-likelihood $\mathcal{L} = \log P(\{\sigma_i\} \mid \{c_{ij}\})$. To obtain this label assignment, we use the following algorithm:

(i) Start from a configuration in which each node has a different label.
(ii) Propose a new label assignment.
(iii) Accept the new label assignment if $\mathcal{L}$ increases.
(iv) Return to step (ii) and iterate until no more changes are accepted.

In step (ii), in order to propose a new label assignment we use a combined strategy that alternates between: (a) merging existing cliques and (b) moving nodes from one clique to another clique. In our implementation, we alternate these two possible configuration changes with a ratio of 10:1. To merge existing cliques, we follow the heuristic approach in Blondel et al., which is computationally fast. Specifically, we compute the mean similarity between nodes within each pair of cliques. We then propose to merge the pair of cliques with the largest mean similarity. To move a node (i.e. to change the label of that node to that of a different clique), we select the label assignment that produces the largest increase in $\mathcal{L}$. As a last step, when $\mathcal{L}$ cannot be increased by merging any pair of cliques in the network, we use the Kernighan–Lin algorithm (Kernighan and Lin, 1970) to propose single-node moves between cliques. The algorithm stops when we cannot further increase the log-likelihood with single-node switches.
We set a default lower bound for the relative change in $\mathcal{L}$ that is necessary to consider that a change in node label assignments results in a significant increase in the log-likelihood. In order to estimate the best value for the parameter α in Equation (3), we measure the accuracy of our algorithm at assigning group labels to features that have similar retention time patterns. Specifically, starting from the spectral data for the mixture of 9 standards as in the validation of Step 1 (see Section 2.2 and Supplementary Fig. S2), we simulated differences in the coelution of metabolites by manually displacing all the features of the same metabolite along the retention time axis. We then simulated the coelution of two, three and four compounds for different time shifts, and evaluated the accuracy of our algorithm at correctly labeling features using the adjusted mutual information (AMI) (Vinh et al.). The AMI measures the accuracy of the labeling by comparing the ‘true’ and the proposed assignments while taking into account the number of features associated to each metabolite. The AMI value is scaled so that AMI = 0 for a random assignment of features to groups. In Supplementary Figure S3, we show the accuracy of our algorithm for several values of α, including α = 2. For reference, we also show the results obtained with the feature grouping algorithm in CAMERA, which is also network based [see Kuhl et al. for details]. We find that for any choice of α, our algorithm outperforms the feature grouping algorithm in CAMERA. This is because the feature grouping algorithm in CAMERA tends to produce a large number of groups of features, and therefore the AMI is close to zero independently of the time shift. We also find that for our algorithm, larger values of α result in too many cliques when the coelution is not as accentuated, slightly decreasing the algorithm’s accuracy. Therefore, we use a value of α = 1 as a default in CliqueMS and in what follows.
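The likelihood maximization of Step 2 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the CliqueMS code: it only performs clique merges (no single-node moves and no Kernighan–Lin refinement), it tries candidate merges in order of decreasing mean similarity until one increases the log-likelihood, and the exponent `alpha`, the numerical cap on similarities and the toy network are all choices made for this example.

```python
import math
from itertools import combinations

def log_likelihood(labels, sim, alpha, cap=1.0 - 1e-12):
    """Log of the label-assignment probability, up to an additive constant.

    labels: one group label per node.
    sim: {(i, j): c_ij} for i < j; missing pairs mean c_ij = 0.
    """
    total = 0.0
    for i, j in combinations(range(len(labels)), 2):
        p = min(sim.get((i, j), 0.0), cap) ** alpha
        if labels[i] == labels[j]:
            if p == 0.0:
                return -math.inf  # unconnected nodes cannot share a clique
            total += math.log(p)
        else:
            total += math.log(max(1.0 - p, 1e-300))
    return total

def mean_similarity(group_a, group_b, sim):
    """Average c_ij over all pairs with one node in each group."""
    pairs = [(min(i, j), max(i, j)) for i in group_a for j in group_b]
    return sum(sim.get(p, 0.0) for p in pairs) / len(pairs)

def greedy_cliques(n_nodes, sim, alpha):
    """Merge cliques greedily while the log-likelihood keeps increasing."""
    labels = list(range(n_nodes))  # every node starts in its own clique
    best = log_likelihood(labels, sim, alpha)
    while True:
        groups = {}
        for node, lab in enumerate(labels):
            groups.setdefault(lab, []).append(node)
        candidates = sorted(
            combinations(groups, 2),
            key=lambda ab: mean_similarity(groups[ab[0]], groups[ab[1]], sim),
            reverse=True,
        )
        for a, b in candidates:
            trial = [a if lab == b else lab for lab in labels]
            score = log_likelihood(trial, sim, alpha)
            if score > best:
                labels, best = trial, score
                break
        else:
            return labels, best  # no merge improves the log-likelihood

# Toy network: nodes 0-2 coelute, nodes 3-4 coelute, weak overlap between 2 and 3.
toy_sim = {(0, 1): 0.95, (0, 2): 0.90, (1, 2): 0.92, (3, 4): 0.97, (2, 3): 0.20}
toy_labels, toy_score = greedy_cliques(5, toy_sim, alpha=1.0)
```

On this toy network the sketch recovers two cliques, {0, 1, 2} and {3, 4}. Merging them further is rejected because some pairs across the two groups have zero similarity, which makes the log-likelihood of the merged assignment minus infinity.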
Step 3: Annotation of adducts by isotope and parental mass identification

After obtaining the maximum likelihood configuration of label assignments to nodes $\{\sigma_i^*\}$, we use the differences in m/z values for all the features within a clique to identify isotopes, and putative adducts and in-source fragments associated to the parental neutral mass of the metabolite. Consider a clique γ comprising Γ features $\{f_1, \ldots, f_\Gamma\}$. The first step is to identify features corresponding to isotopic variants of the same metabolite, as they can be determined by the exact mass difference between features and their relative intensities. Whenever the mass difference between two features corresponds to $1.003355 \pm \epsilon$ (Da), ϵ being the relative error of the isotope search, the two features are candidates for being isotopes. If their intensity ratios also correspond to the relative abundance of such isotopes, then these two features are considered to be two isotopic variants of the same metabolite. Note that we take into account other differences in m/z between isotopes with z > 1 (see details in the Supplementary Material). For the remaining features $\{f_1, \ldots, f_{\Gamma - N}\}$, N being the number of isotopes in the clique, our goal is to associate each one of the features to an adduct or fragment, and therefore to establish the mass of the corresponding neutral compound. In order to do that, CliqueMS considers a list of possible adducts and fragments and their associated mass differences taken from the NIST database (National Institute of Standards and Technology, 2014) spectra with positive and negative ionization (see Supplementary Tables S1 and S2). First, we determine the possible annotations for each feature that are compatible with the observed mass differences. In what follows, for simplicity we refer to all possible annotations of features as adducts, but bear in mind that annotations can also correspond to metabolite in-source fragments considered in the previously mentioned list of mass differences.
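As an illustration of the isotope test described above, the following Python sketch pairs features of a clique whose m/z values differ by the 1.003355 Da spacing. It is a simplification under explicit assumptions: only charge z = 1 is handled, the tolerance `rel_tol` of 10 ppm is a hypothetical default, and the intensity check is reduced to a hypothetical cap `max_ratio` on the ratio between the heavier and the lighter peak rather than a comparison with natural isotope abundances. The function name and the example values are invented.

```python
C13_C12_DIFF = 1.003355  # Da, mass difference quoted in the text

def isotope_candidates(features, rel_tol=10e-6, max_ratio=1.0):
    """Pair features separated by one isotope spacing (charge z = 1 only).

    features: list of (mz, intensity) tuples belonging to one clique.
    rel_tol: relative mass tolerance (hypothetical default of 10 ppm).
    max_ratio: hypothetical upper bound on intensity(heavier) / intensity(lighter).
    Returns a list of (lighter_index, heavier_index) pairs.
    """
    pairs = []
    for i, (mz_i, int_i) in enumerate(features):
        for j, (mz_j, int_j) in enumerate(features):
            if mz_j <= mz_i:
                continue  # only look at heavier partners
            spacing = mz_j - mz_i
            mass_ok = abs(spacing - C13_C12_DIFF) <= rel_tol * mz_j
            ratio_ok = int_i > 0 and 0 < int_j <= max_ratio * int_i
            if mass_ok and ratio_ok:
                pairs.append((i, j))
    return pairs

# Invented clique: a monoisotopic peak, its first isotope and an unrelated adduct.
clique = [(245.0954, 1000.0), (246.0988, 120.0), (267.0774, 300.0)]
isotope_pairs = isotope_candidates(clique)
```

Here only the first two features are paired; the third one is left for the adduct annotation step.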
Specifically, for feature $f_i$, we obtain all the possible parental masses M that are compatible with feature i being adduct A, i.e. those for which the mass m of the feature matches, within a relative error ϵ, the mass expected for adduct A of a compound with neutral mass M. Note that in order to get m from the measured m/z value, we consider z to be the charge of adduct A. In our analysis, ϵ is specified in ppm, and this parameter can be tuned by the user. For the remaining features $f_j$, we establish that $f_j$ is compatible with being adduct $A'$ with parental neutral mass M if the same condition holds for $f_j$, $A'$ and M. Following this procedure for all the features $f_i$, we obtain for clique γ all possible parental masses that are compatible with at least two features being annotated. For each such parental mass M, we construct an adduct vector $\vec{A}^M = (A^M_1, \ldots, A^M_{\Gamma - N})$ in which each component $A^M_i$ corresponds to the adduct annotation of feature i compatible with parental mass M. If there is no compatible annotation for feature i, then $A^M_i = \emptyset$. The second step is to assess the plausibility of each one of these annotations. In order to do this, we note two facts. First, we note that in manual annotation the observation of some adducts such as [M+H]+ or [M+Na]+ is more frequent than that of other adducts such as [M+H-OH]+ or [2M+Na]+. As a result, the former pair of adducts is more commonly sought in manual annotations than the latter pair. To formalize this intuition and quantify the plausibility of a specific annotation, CliqueMS uses observed frequencies of adducts and fragments from available LC-MS1 spectra for pure compounds available in the NIST database and biological in-house samples (see Supplementary Tables S1 and S2). Specifically, for each M the log-plausibility s of annotation $\vec{A}^M$ is then:

$$s(\vec{A}^M) = \sum_{i=1}^{\Gamma - N} \log \nu(A^M_i),$$

where $\nu(x)$ is the frequency of observation of adduct x and $\nu(\emptyset)$ is the frequency assigned to a non-annotated feature. In our analysis, we choose $\nu(\emptyset)$ so that the frequency of a non-annotated feature is lower than that of the least common adduct or fragment in our database. Note that since available LC-MS1 spectra are likely to increase in the future, these parameters can be changed by the user as needed.
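The search for compatible parental masses and their scoring can be sketched as follows. This is an illustrative Python example rather than the CliqueMS procedure. The adduct table is hypothetical: its mass shifts are standard values, but the frequencies, the frequency of a non-annotated feature (`EMPTY_FREQ`), the 10 ppm tolerance and the relation M = (z * m/z - shift) / n used to obtain the neutral mass are assumptions made for this sketch, and the score is taken to be the sum of log-frequencies over the features of the clique.

```python
import math

# Hypothetical adduct table: name -> (mass shift in Da, molecules n, charge z, frequency).
# The frequencies are invented for illustration only.
ADDUCTS = {
    "[M+H]+": (1.007276, 1, 1, 0.50),
    "[M+Na]+": (22.989218, 1, 1, 0.20),
    "[M+K]+": (38.963158, 1, 1, 0.05),
    "[2M+H]+": (1.007276, 2, 1, 0.02),
}
EMPTY_FREQ = 1e-6  # hypothetical frequency of a non-annotated feature

def neutral_mass(mz, adduct):
    """Neutral mass implied by observing `adduct` at the given m/z (assumed relation)."""
    shift, n, z, _ = ADDUCTS[adduct]
    return (mz * z - shift) / n

def annotations_for_mass(mzs, mass, ppm=10.0):
    """Adduct vector: first adduct compatible with `mass` for each feature, else None."""
    vector = []
    for mz in mzs:
        match = None
        for name in ADDUCTS:
            if abs(neutral_mass(mz, name) - mass) <= ppm * 1e-6 * mass:
                match = name
                break
        vector.append(match)
    return vector

def log_plausibility(vector):
    """Sum of log-frequencies; non-annotated features contribute log(EMPTY_FREQ)."""
    return sum(math.log(ADDUCTS[a][3] if a else EMPTY_FREQ) for a in vector)

def candidate_masses(mzs, ppm=10.0):
    """Score every parental mass that annotates at least two features."""
    seen = {}
    for mz in mzs:
        for name in ADDUCTS:
            mass = neutral_mass(mz, name)
            vector = annotations_for_mass(mzs, mass, ppm)
            annotated = sum(a is not None for a in vector)
            if annotated >= 2 and tuple(vector) not in seen:
                seen[tuple(vector)] = (log_plausibility(vector), round(mass, 4), vector)
    # highest log-plausibility first
    return sorted(seen.values(), key=lambda t: t[0], reverse=True)

# Invented clique: three ions of one compound plus an unrelated feature.
clique_mzs = [245.0955, 267.0774, 489.1837, 300.1234]
ranked = candidate_masses(clique_mzs)
```

For this invented clique the sketch returns a single candidate parental mass near 244.088 Da, with the first three features annotated as [M+H]+, [M+Na]+ and [2M+H]+ and the fourth left without annotation.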
Second, we note that, in the clique identification procedure, features corresponding to different metabolites that coelute are sometimes assigned to the same clique. Taking this into consideration, CliqueMS allows for the annotation of adducts corresponding to more than one parental neutral mass in the same clique. Therefore, given the set of parental masses $\{M\}$ and their associated annotations $\{\vec{A}^M\}$, we can in principle obtain complex annotations with multiple compatible parental masses, so that feature i is annotated as $A^M_i$ and feature j as $A^{M'}_j$, with M not necessarily equal to $M'$. These annotations are also subject to the constraint that we have at least two annotated features for each parental mass. Nonetheless, because we expect the number of metabolites in coelution to be low, we assume the plausibility of annotations with a large number of parental masses N to be low. To formalize this idea, the log-plausibility s of such complex annotations combines the log-plausibilities of the annotations for each parental mass with an exponential penalty, controlled by a parameter a, that applies if the number of parental masses is larger than one; a = 10 in our analysis. While this may seem a rather large penalty, it has to be compared with the gap between the log-frequencies of the most common adducts and those of the rarest adducts. Therefore, because our priority is to annotate large numbers of adducts or fragments (including rare ones) associated to the same parental mass rather than annotating the same features with two different parental masses and more common adducts, we need to introduce exponentially large penalties. On the other hand, the penalty has to be low enough to enable the use of more than one parental mass when no other annotations are possible. Using a value of a = 10 strikes a balance between both undesirable situations. Unfortunately, the number of possible annotations grows very fast with the number of features in a clique, so that even for moderately small cliques (30 features) it is unfeasible to produce and score all annotations exhaustively.
Because of this, CliqueMS focuses on producing only a few annotations with the largest plausibilities. To that end, we follow a greedy procedure to produce complex annotations. Specifically, we limit the list of parental masses to include: (i) those masses that have the largest overall plausibilities s and (ii) the top scoring masses for annotating each feature $f_i$. In our analysis, we use the 15 masses M with the highest overall scores s and the most plausible M for each feature; these parameter choices offer a good compromise between the speed of the calculations for large cliques and the retrieval of the most plausible annotations obtained from exhaustive annotation searches. Finally, we rank annotations according to their plausibility and produce for each clique the five most plausible annotations. In this way, unlike other methods, which produce a unique annotation, CliqueMS allows researchers to compare alternative annotations. Note that annotation scores depend on the size of the clique/group of features; therefore, annotation scores for different groups of features are not comparable.

2.2 Spectral data acquisition

Mixture of standards: LC-MS1 spectrum of a mixture of the following standards in solution: (-)riboflavin, 1,2-distearoyl-sn-glycero-3-phosphocholine, biotin, cholic acid, deoxycholic acid, L-methionine sulfoxide, thymine and uracil (see Supplementary Material for details on preparation and acquisition of LC-MS1 spectra). The mzXML file is available at Zenodo with id 1480659, doi: 10.5281/zenodo.1480659.

Complex sample 1: IRS2 KO. LC-MS1 spectra for retina samples from Irs-2-deficient mice [see Hennige et al., Withers et al. and Supplementary Materials for preparation, metabolite extraction and LC-MS1 spectra acquisition]. The mzXML file is available at Zenodo with id 1480659, doi: 10.5281/zenodo.1480659.

Complex sample 2: MTBLS103. LC-MS1 spectra of a subset of serum samples of young females within the control group in the study by Samino et al., which are available at https://www.ebi.ac.uk/metabolights/MTBLS103. We consider a subset of samples of both positive ionization C18 (18 samples) and HILIC (13 samples) found in the MTBLS103 dataset; all samples belong to the control group.

3 Results and discussion

To validate the accuracy of CliqueMS we perform three kinds of experiments. First, we look at the accuracy at annotating a relatively simple sample corresponding to a mixture of standards for which we have a manual MS1 annotation and for which the identity of the eluting standards was confirmed via MS2 fragmentation patterns. Second, we use CliqueMS to annotate a complex sample for which we also have manual MS1 annotations for metabolites confirmed via MS2 fragmentation (for the retina samples we provide the manual annotation of metabolites whose concentrations were significantly different from those of wild-type animals). We look at the accuracy of CliqueMS at correctly annotating the identified compounds in the sample for LC-MS1 spectra obtained in positive and negative ionization modes separately. In these two cases, because we have a single LC-MS1 spectrum, we compare the accuracy of CliqueMS at annotating metabolites with that of CAMERA (Kuhl et al.), which is the only available tool that can annotate single LC-MS1 spectra. As a general result, we find that CliqueMS groups features into a smaller number of groups (cliques) than CAMERA (see Table 1); this makes CliqueMS able to annotate a larger number of adducts than CAMERA.

Table 1.

Summary of the full set of annotations for each sample

Sample	Tool	Features	Number of cliques/ groups/clusters	Annotated unique parental masses	Annotated features (%)
Standards	CliqueMS	275	69	49 (48)	64 (55)
Standards	CAMERA	275	164	25	32
Retina IRS2 KO(+ ionization)	CliqueMS	8489	606	1231 (1512)	70 (57)
Retina IRS2 KO(+ ionization)	CAMERA	8489	2836	1303	43
Retina IRS2 KO (− ionization)	CliqueMS	3893	349	334 (494)	44 (36)
Retina IRS2 KO (− ionization)	CAMERA	3893	1083	552	32
MTBLS103 HILIC	CliqueMS	16 160^a	387^a	3186 (3703)^a	84 (68)^a
	CAMERA	13 048	488	2947	62
	xMSannotator		230	5314	46
	MS-FLO		NA	2875	57
MTBLS103 C18	CliqueMS	24 620^a	927^a	3980 (4769)^a	74 (61)^a
	CAMERA	19 871	1332	13131	58
	xMSannotator		540	6226	48
	MS-FLO		NA	3283	41

Note: For single spectrum datasets, we show the total number of features in the LC-MS1 spectrum. For MTBLS103 datasets, we report the number of features after sample alignment with XCMS. For single spectrum datasets, we report results for CliqueMS and CAMERA. For multiple spectrum datasets, we report the results for CliqueMS, CAMERA, xMSannotator and MS-FLO. We report the total number of groups identified by the algorithm, the number of unique parental neutral masses identified and the percentage of features each algorithm associated to a parental neutral mass. For single spectrum datasets, we consider the five annotations with the highest scores produced by CliqueMS and report: (i) the average number of unique parental masses over annotations, and, in parentheses, the number of unique parental masses in the annotation with the best score; (ii) the percentage of features with at least one annotation within the five annotations with best scores, and, in parentheses, the percentage of features annotated within the best ranked annotation.

^a For MTBLS103 datasets, we run CliqueMS for each individual sample. For each dataset, we report the average number of features and the average number of cliques obtained across samples. We also report: (i) the average number of unique parental neutral masses and, in parentheses, the average number of unique parental masses in the top annotation and (ii) the average of the percentage of features with at least one annotation within the five top annotations and, in parentheses, only considering the top annotation.

Third, because other available methods need more than one spectrum to produce annotations, we also consider another two datasets with 13 and 18 LC-MS1 spectra from Samino et al.
We compare the performance of CliqueMS with that of CAMERA, xMSannotator (Uppal et al.) and MS-FLO (DeFelice et al.) (all other tools mentioned in the Introduction were not in working condition at the time of our analysis). We find that CliqueMS is able to consistently provide better, more complete annotations than the other methods for the identified metabolites in the samples.

Mixture of standards: Table 1 and Figure 2 show that overall CliqueMS produces better annotations than CAMERA. CliqueMS is able to correctly identify more manually annotated metabolites, and to correctly annotate more features associated to these metabolites by correctly identifying both adducts/in-source fragments and their isotopes. The reason for this superior performance is that CliqueMS identifies a smaller number of feature groups, so that features associated to the same metabolite are in the same group (Fig. 2a–b). By contrast, CAMERA generates a larger number of groups, which results in assigning features corresponding to the same metabolite to different groups, usually annotating them as different metabolites or leaving them non-annotated (Fig. 2c).

Fig. 2.

Feature annotation for a mixture of standards. (a) Extracted ion chromatogram. The nine ionized metabolites were annotated with CliqueMS. We show features that are adducts of each metabolite in different colors (shades of grey), as annotated by CliqueMS in (c). (b) Cliques identified by CliqueMS in the same experiment, after computing the cosine similarity and maximizing the clique likelihood. The intensity of each link is proportional to the similarity, and the area of each node is proportional to feature intensity. The colors are the same as in (a). For each feature, we show the annotation given by CliqueMS as shown in (c). We denote isotopes by adding a subindex to M, so that M0 corresponds to the monoisotopic mass and M1 to the first isotope. (c) Feature annotation by CliqueMS and CAMERA. For each metabolite, we show the different adducts annotated and the total number of isotopic variants of that particular adduct. Correctly annotated features are shown in green; incorrectly annotated features are shown in red (darker shade of grey), with a marker indicating that the associated parental neutral mass was incorrect; non-annotated features are shown in white. For CliqueMS, we also show the ranking of the feature annotation that matches manual annotation. For CAMERA the * indicates those features for which the algorithm returned two possible annotations. DSPC stands for 1,2-distearoyl-sn-glycero-3-phosphocholine. See Supplementary Material for CliqueMS annotations.

Overall, CliqueMS is able to correctly annotate all nine metabolites within the two most plausible annotations for each clique (since for each metabolite, CliqueMS provides the correct annotation for at least one adduct/in-source fragment and its corresponding isotopes within the two highest ranked annotations). The total number of annotated features corresponding to the standard compounds is 42 (of which 29 correspond to adducts/in-source fragments and 13 to isotopes).
Instead CAMERA annotates correctly 5 metabolites and a total of 27 features. Note that even if we only considered the highest ranked annotation provided by CliqueMS, the number of correctly annotated metabolites (8) would be higher than for CAMERA (5). Note that overall CliqueMS identifies a number of unique parental neutral masses that is substantially larger than 9 (48 if we consider the best ranked annotation—see Table 1). The main reason for this is that during the process to obtain the LC-MS1 data, metabolites can break down into smaller fragments that can also become ionized. Because the fragments that one might expect are different for each metabolite in the annotation step, CliqueMS is not considering these effects in the annotation step, therefore these fragments are assigned different parental neutral masses. Despite this issue, the difference in the percentage of features for which a parental neutral mass is reported—64% (or 55% if we consider exclusively the annotation with the largest score) versus 32%—is substantial and is a direct effect of the aforementioned high quality feature grouping which leads to a more accurate adduct annotation. Single spectrum from complex samples: To evaluate the capacity of CliqueMS to identify adducts in complex LC-MS1 data from a single spectrum, we analyze real retina samples from a mouse model in which the gene irs2 has been knocked out (see Supplementary Material). We analyze spectral data with both positive and negative ionizations. The positive ionization spectrum contained 8489 features reduced to 606 cliques by CliqueMS, whereas the negative ionization spectrum comprised 3893 features reduced to 349 cliques. Instead, as for the previously studied sample, CAMERA identifies a much larger number of groups: 2836 for positive ionization spectra, and 1083 for the negative ionization spectra. CliqueMS groups the features into a smaller number of groups than CAMERA does. 
However, in contrast to the results for the mixture of standards, each clique is not necessarily associated to a single metabolite. In fact, because metabolite coelution is so frequent in samples with a large number of features, CliqueMS can group features corresponding to different metabolites within the same clique (see Supplementary Fig. S3). In Tables 2 and 3 and in Supplementary Figure S4, we show that, overall, CliqueMS provides a better annotation than CAMERA; specifically, CliqueMS is able to annotate a larger number of the identified (via MS/MS) metabolites than CAMERA. Furthermore, CliqueMS is able to correctly annotate a larger number of features, including adducts, in-source fragments and isotopes.
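To make the difference in grouping granularity concrete, the following minimal sketch computes the average number of features per group from the feature and group counts reported above for the retina samples. The function and variable names are illustrative only and are not part of CliqueMS or CAMERA.

```python
# Feature and group counts as reported in the text for the retina samples.
counts = {
    "positive": {"features": 8489, "CliqueMS": 606, "CAMERA": 2836},
    "negative": {"features": 3893, "CliqueMS": 349, "CAMERA": 1083},
}


def features_per_group(n_features: int, n_groups: int) -> float:
    """Average number of features assigned to each group."""
    return n_features / n_groups


# ratios[mode][tool] -> average group size for that ionization mode and tool.
ratios = {
    mode: {
        tool: features_per_group(c["features"], c[tool])
        for tool in ("CliqueMS", "CAMERA")
    }
    for mode, c in counts.items()
}

for mode, by_tool in ratios.items():
    for tool, ratio in by_tool.items():
        print(f"{mode:8s} {tool:8s} {ratio:5.1f} features per group")
```

Under these counts, CliqueMS groups hold roughly 11 to 14 features on average, against roughly 3 to 4 for CAMERA. An average says nothing about whether a given group corresponds to a single metabolite, which is the caveat raised in the paragraph above.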

Table 2.

Summary of the performance of different algorithms for complex samples

Sample	Identified and annotated metabolites	Tool	Annotated metabolites (multiple adducts)	Annotated metabolites (single adduct)	Adducts/mass fragments	Annotated features
Retina IRS2 KO (+ ionization)	20	CliqueMS	15	—	50	95
Retina IRS2 KO (+ ionization)	20	CAMERA	8	—	25	45
Retina IRS2 KO (− ionization)	18	CliqueMS	6	—	16	35
Retina IRS2 KO (− ionization)	18	CAMERA	5	—	14	33
MTBLS103 HILIC	6 (78)^a	CliqueMS	5/6/56^b	—	18/26/213^b	44/72/318^b
	6	CAMERA	3	—	13	21
		xMSannotator	1	4	10	10
		MS-FLO	1	—	2	3
MTBLS103 C18	9 (162)^a	CliqueMS	6/8/104^b	—	17/29/304^b	46/66/524^b
	9	CAMERA	3	—	11	20
		xMSannotator	3	6	13	13
		MS-FLO	0	—	0	0

Note: For single spectrum samples (Retina IRS2 KO in positive and negative ionization mode), we report results for CAMERA and CliqueMS. For the datasets in MTBLS103 (Samino ), we report results for the chromatographic column operating in two different conditions: RP-C18 and HILIC. For the MTBLS103 datasets, we show results for CliqueMS, CAMERA, xMSannotator and MS-FLO. The multiple adduct and single adduct columns indicate, respectively, the number of metabolites correctly annotated through the identification of at least two adducts with the same parental neutral mass, and the number of metabolites annotated through a single adduct [annotated single adducts are assigned to (M + H)+ by xMSannotator].

^a CliqueMS analyzes individual samples; therefore, in parentheses we show the total number of annotated metabolites in all samples.

^b Because CliqueMS produces an individual annotation for each sample (13 for HILIC and 18 for RP-C18), we report three results: r1 shows the number of unique metabolites/adducts/features that are correctly annotated in the majority of the samples; r2 shows the number of unique metabolites/adducts/features that are correctly annotated in at least one sample; and r3 shows the aggregate numbers over samples.
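The three per-sample summaries described in the note can be sketched as follows. This is an illustration rather than the authors' code: it assumes that r1 counts items found in a strict majority of samples (the phrasing used later in the text for the HILIC dataset), and the sample contents are made up.

```python
from collections import Counter


def summarize(per_sample: list[set[str]]) -> tuple[int, int, int]:
    """Return (r1, r2, r3) for a list of per-sample sets of correct annotations.

    r1: unique items correctly annotated in a strict majority of samples
        (assumed threshold).
    r2: unique items correctly annotated in at least one sample.
    r3: aggregate count over all samples (items counted once per sample).
    """
    occurrences = Counter(item for sample in per_sample for item in sample)
    n_samples = len(per_sample)
    r1 = sum(1 for c in occurrences.values() if c > n_samples / 2)
    r2 = len(occurrences)
    r3 = sum(occurrences.values())
    return r1, r2, r3


# Hypothetical example: three samples with their correctly annotated metabolites.
samples = [
    {"uracil", "taurine"},
    {"uracil", "adenine"},
    {"uracil", "taurine", "inosine"},
]
r1, r2, r3 = summarize(samples)
print(f"{r1}/{r2}/{r3}")
```

The output uses the same r1/r2/r3 layout as the CliqueMS rows of Table 2.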

Table 3.

Feature annotation for complex samples

Metabolite	CliqueMS annotation	CliqueMS isotopes	CliqueMS rank	CAMERA annotation	CAMERA isotopes
Uracil	(M+H)+	2	1	(M+H)+	2
	(M+H-H2O)+	1	1	(M+H-H2O)+	1
	(M+H-NH3)+	1	1	(M+H-NH3)+	1
Taurine	(M+H)+	2	2	(M+H)+	2
	(M+Na)+	2	3	(M+Na)+	2
	(M+H-H2O)+	1	2	(M+H-H2O)+	1
	(2M+H)+	1	2	(2M+H)+	1
	(M₂+Na)+	3	1	(M-H+2Na)+	3
Adenine	(M+H)+	2	1	(M₂+NH4)+	2
Adenine	(M+H-NH3)+	2	1	(M₂+H)+	2
L-glutamic acid	(M+H)+	3	1	(M₂+NH4)+	3
	(M+H-H2O)+	2	1	—	—
	(M+Na-H2O)+	1	3	(M₂+H-H2O)+	1
	(M+Na)+	3	3	(M₂+H)+	3
	(M-H+2Na)+	3	3	(M₂+Na)+	3
	(M-2H+3Na)+	2	3	(M₂-H+2Na)+	3
Guanine	(M+H-H2O)+	1	1	(M+H-H2O)+	1
	(M+H-NH3)+	2	1	(M+H-NH3)+	2
	(M+H)+	2	1	(M+H)+	3
Xanthine	(M+Na)+	1	1	—	—
	(M+H-NH3)+	1	1	—	—
	(M+H)+	2	1	—	—
L-2-aminoadipic acid	(M+H-H2O)+	1	2	*(M+H-H2O)+	1
L-2-aminoadipic acid	(M+H)+	1	2	*(M+H)+	1
L-ascorbic acid	(M+Na)+	1	1	(M+Na)+	1
L-ascorbic acid	(M+H)+	1	1	(M+H)+	1
PC	(M+K)+	1	1	(M₂+K-H2O)+	1
	(M+Na)+	2	1	(M₂+Na-H2O)+	2
	(M+H)+	3	1	—	—
Inosine	(M+K)+	2	1	(M+K)+	1
	(2M+H)+	2	1	(2M+H)+	3
	(2M+Na)+	3	1	(2M+Na)+	3
	(M+H)+	3	1	(M+H)+	3
	(M+Na)+	2	1	(M+Na)+	2
Guanosine	(2M+H)+	2	1	(2M+H)+	2
	(M+Na)+	1	1	(M+Na)+	1
	(M+H)+	4	1	(M+H)+	4
Glutathione	(M+Na)+	1	1	(M+Na)+	1
	(M+H)+	2	1	(M+H)+	3
	(M+H-H2O)+	3	1	(2M₂+H)+	3
Oxiglutathione	(M+Na)+	2	1	(2M₂+Na)+	2
	(M+K)+	3	1	(2M₂+K)+	3
	(M+H)+	3	1	(2M₂+H)+	4
NAD	(M+Na)+	1	1	(2M₂+Na)+	1
	(M+2H)2+	3	1	(M₂+H)+	1
	(M+H)+	4	1	(2M₂+H)+	4

Note: Detail of the adducts and in-source fragments annotated by CliqueMS and CAMERA for the retina samples of IRS2-deficient mice (+ ionization). For each molecule, we show the different adducts and in-source fragments annotated, together with the total number of isotopic variants of that particular adduct/in-source fragment. Correctly annotated features are shown in green (light grey); incorrectly annotated features are shown in red (darker grey), with indicating that the associated parental mass was incorrect; non-annotated features are shown in white. For CliqueMS, we also show the ranking of the feature annotation that matches manual annotation. For CAMERA, the * indicates those features for which the algorithm returned two possible annotations (see Supplementary Material for the complete results obtained for this sample using CliqueMS and for the complete list of manually annotated metabolites).
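The adduct labels in Table 3 each imply a parental neutral mass, and the "multiple adducts" criterion in Table 2 requires at least two adducts to agree on it. The sketch below illustrates that arithmetic only; it does not reproduce the CliqueMS scoring, which also uses empirical adduct frequencies and ranks alternative annotations. The mass constants are standard monoisotopic values, while the adduct dictionary, the 10 ppm tolerance and the example mass are assumptions made for illustration.

```python
PROTON = 1.007276       # mass of H+ in Da
SODIUM_ION = 22.989218  # mass of Na+ in Da
WATER = 18.010565       # monoisotopic mass of H2O in Da

# Adduct label -> (number of molecules n, charge z, mass shift in Da).
ADDUCTS = {
    "(M+H)+": (1, 1, PROTON),
    "(M+Na)+": (1, 1, SODIUM_ION),
    "(2M+H)+": (2, 1, PROTON),
    "(M+H-H2O)+": (1, 1, PROTON - WATER),
    "(M+2H)2+": (1, 2, 2 * PROTON),
}


def mz_from_neutral(mass: float, adduct: str) -> float:
    """Expected m/z of an adduct given the neutral mass M."""
    n, z, shift = ADDUCTS[adduct]
    return (n * mass + shift) / z


def neutral_from_mz(mz: float, adduct: str) -> float:
    """Neutral mass M implied by an observed m/z under a given adduct."""
    n, z, shift = ADDUCTS[adduct]
    return (mz * z - shift) / n


def same_parent(mass_a: float, mass_b: float, ppm: float = 10.0) -> bool:
    """True if two inferred neutral masses agree within a ppm tolerance."""
    return abs(mass_a - mass_b) <= ppm * 1e-6 * max(mass_a, mass_b)


M = 268.0808  # hypothetical neutral mass, used only for illustration
observed = {a: mz_from_neutral(M, a) for a in ("(M+H)+", "(M+Na)+", "(2M+H)+")}
inferred = {a: neutral_from_mz(mz, a) for a, mz in observed.items()}
supported = all(same_parent(M, m) for m in inferred.values())
print(observed, supported)
```

In a real annotation the adduct label of each feature is unknown, so several candidate labels have to be tried for each feature, which is why alternative annotations arise and need to be ranked.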

The differences in the number of metabolites and features annotated are especially remarkable for the positive ionization mode spectrum, in which the number of adducts is larger, mainly due to the influence of mobile phase additives and organic solvents (Kruve and Kaupmees, 2017), and therefore more features can coelute. In this case, CliqueMS is able to assign a parental neutral mass to 70% of the features overall (57% if we only consider the top-ranked annotation), whereas CAMERA only assigns a parental neutral mass to 43% of the features. In the negative ionization mode, the number of adducts is much smaller and therefore the differences between both algorithms are not as stark.

Multiple spectra from complex samples: In contrast to CAMERA, xMSannotator and MS-FLO, CliqueMS only produces annotations for each individual spectrum. Our results show that there is in fact an advantage to analyzing individual spectra, since overall the performance of CliqueMS is consistently better than that of the other methods. CliqueMS is able to correctly annotate more metabolites than the other methods (see Table 2 and Supplementary Material). The only exception is xMSannotator, which correctly annotates more metabolites for the C18 dataset because it annotates single features as (M + H)+ by default, without requiring another adduct for the same parental neutral mass (DeFelice ). Remarkably, CliqueMS is consistently able to correctly annotate substantially more adducts and identify more isotopic variants than the other methods. As an illustration (see Table 2), CliqueMS correctly annotates 17 adducts in the majority of samples of the HILIC dataset (29 if we consider all correct unique annotations across samples), whereas CAMERA, xMSannotator and MS-FLO identify 13, 10 and 2 different adducts/in-source fragments, respectively.

4 Conclusions

Annotating features in LC-MS1 metabolomic experiments is of the utmost importance: without reliable annotation, questions as fundamental as how many metabolites there are in a given sample, or what the best adduct for MS/MS experiments is, cannot be properly addressed. Here, we have shown that CliqueMS provides high-quality annotations for biological samples from LC-MS1 single spectra. With simple and synthetic datasets, we have provided evidence that explains the performance of CliqueMS: (i) it uses a highly discriminatory feature similarity metric; (ii) it treats the similarities between peaks in a transparent way by means of a simple generative model; (iii) it uses a well-grounded maximum likelihood inference approach to group features; (iv) it uses empirical adduct frequencies to identify the parental neutral mass; and (v) it deals flexibly with the identification of the parental neutral mass by proposing and ranking alternative annotations. With real complex biological samples, we have demonstrated that annotating single spectra produces correct annotations for a larger number of features and metabolites than currently available tools for annotating both single and aligned spectra.
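As a reminder of what point (i) refers to, the Fig. 2 caption describes the similarity between features as a cosine correlation of their coelution profiles. The following minimal sketch shows that measure on toy data. It assumes that both profiles are intensity vectors sampled on the same set of scans; any preprocessing applied by CliqueMS before this step is not reproduced here, and the profiles are invented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equally sampled elution profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must be sampled on the same scans")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy profiles (intensity per scan), purely illustrative.
profile_a = [0, 10, 50, 100, 50, 10, 0]
profile_b = [0, 2, 10, 20, 10, 2, 0]   # same peak shape, lower intensity
profile_c = [100, 50, 10, 0, 0, 0, 0]  # elutes earlier, little overlap

sim_ab = cosine_similarity(profile_a, profile_b)
sim_ac = cosine_similarity(profile_a, profile_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Because the measure is insensitive to overall intensity, a weak adduct that coelutes exactly with a strong one still scores close to 1, whereas a feature with a shifted peak scores low.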

Funding

We acknowledge the support of Generalitat de Catalunya [program FI-DGR 2014 to O.S.]; the Ministry of Economy and Competitiveness of Spain [grant numbers FIS2013-47532-C3-1-P, FIS2016-78904-C3-1-P to R.G. and M.S.-P., BFU2014-57466-P to O.Y. and BES-2012-052585 (SAF2011-30578) to M.N.]; and the Ministerio de Ciencia e Innovación [grant SAF2011-28331 to D.B. and L.N.]. O.Y., D.B. and L.N. also acknowledge the support of the Spanish Biomedical Research Centre in Diabetes and Associated Metabolic Disorders (CIBERDEM), an initiative of Instituto de Investigacion Carlos III (ISCIII).

Conflict of Interest: none declared.
