Rui Climaco Pinto1,2, Ibrahim Karaman1,2, Matthew R Lewis3,4, Jenny Hällqvist5,6, Manuja Kaluarachchi2,7, Gonçalo Graça7, Elena Chekmeneva3,4, Brenan Durainayagam1,2, Mohsen Ghanbari8, M Arfan Ikram8, Henrik Zetterberg9,10,11,12, Julian Griffin2,7, Paul Elliott1,2, Ioanna Tzoulaki1,13, Abbas Dehghan1,2,8, David Herrington14, Timothy Ebbels7.
Abstract
Integration of multiple datasets can greatly enhance bioanalytical studies, for example, by increasing power to discover and validate biomarkers. In liquid chromatography-mass spectrometry (LC-MS) metabolomics, it is especially hard to combine untargeted datasets since the majority of metabolomic features are not annotated and thus cannot be matched by chemical identity. Typically, the information available for each feature is retention time (RT), mass-to-charge ratio (m/z), and feature intensity (FI). Pairs of features from the same metabolite in separate datasets can exhibit small but significant differences, making matching very challenging. Current methods to address this issue are too simple or rely on assumptions that cannot be met in all cases. We present a method to find feature correspondence between two similar LC-MS metabolomics experiments or batches using only the features' RT, m/z, and FI. We demonstrate the method on both real and synthetic datasets, using six orthogonal validation strategies to gauge the matching quality. In our main example, 4953 features were uniquely matched, of which 585 (96.8%) of 604 manually annotated features were correct. In a second example, 2324 features could be uniquely matched, with 79 (90.8%) out of 87 annotated features correctly matched. Most of the missed annotated matches are between features that behave very differently from modeled inter-dataset shifts of RT, MZ, and FI. In a third example with simulated data with 4755 features per dataset, 99.6% of the matches were correct. Finally, the results of matching three other dataset pairs using our method are compared with a published alternative method, metabCombiner, showing the advantages of our approach. The method can be applied using M2S (Match 2 Sets), a free, open-source MATLAB toolbox, available at https://github.com/rjdossan/M2S.
Year: 2022 PMID: 35360896 PMCID: PMC9008693 DOI: 10.1021/acs.analchem.1c03592
Source DB: PubMed Journal: Anal Chem ISSN: 0003-2700 Impact factor: 6.986
Figure 1. Method workflow: (top) overview of the approach; (bottom) matrices calculated at each step. Step 1: Distances between all features are calculated (RTdist, MZdist, log10FIdist) and linear thresholds set in all dimensions, finding “M” candidate matches between feature sets. Step 2: Find one-to-one feature correspondence. 2a: The expected inter-dataset shifts are modeled using neighbor consensus; 2b: residuals can then be obtained for each candidate match and normalized; 2c and 2d: these are transformed into single-value penalization scores; 2e: the scores are used to define feature-pair matrices containing only “U” unique matches. Step 3: A nonlinear tightening of thresholds is applied to filter out poor matches far from the inter-dataset shifts, yielding “U*” unique matches.
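Step 1 of the workflow can be illustrated with a minimal Python sketch (the M2S toolbox itself is written in MATLAB). The function name `candidate_matches`, the tuple layout, and the use of constant windows in place of the linear thresholds described in the caption are assumptions made for illustration, not the toolbox's API.

```python
import math

def candidate_matches(ref, tgt, rt_lim, mz_lim, logfi_lim):
    """Return (i, j, d_rt, d_mz, d_logfi) for every pair inside all windows.

    ref, tgt: lists of (rt, mz, fi) tuples.
    *_lim: (low, high) window on the target-minus-reference distance.
    Constant windows are a simplification (assumption); the caption
    describes linear thresholds.
    """
    out = []
    for i, (rt_r, mz_r, fi_r) in enumerate(ref):
        for j, (rt_t, mz_t, fi_t) in enumerate(tgt):
            d_rt = rt_t - rt_r
            d_mz = mz_t - mz_r
            d_fi = math.log10(fi_t) - math.log10(fi_r)
            if (rt_lim[0] <= d_rt <= rt_lim[1]
                    and mz_lim[0] <= d_mz <= mz_lim[1]
                    and logfi_lim[0] <= d_fi <= logfi_lim[1]):
                out.append((i, j, d_rt, d_mz, d_fi))
    return out

# Made-up features: two reference features compete for target 0.
ref = [(1.00, 200.000, 1e5), (1.05, 200.001, 2e5), (5.00, 350.200, 5e4)]
tgt = [(1.02, 200.0005, 1.2e5), (5.10, 350.201, 4e4)]
matches = candidate_matches(ref, tgt, (-0.2, 0.2), (-0.005, 0.005), (-1.0, 1.0))
pairs = [(i, j) for i, j, *_ in matches]
print(pairs)
```

In this toy input, target feature 0 is a candidate for two reference features, which is the kind of multiple-match cluster that Step 2 has to resolve.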
Figure 2. Selection of best matches from multiple candidates, showing decomposition of a cluster with three reference (R) and two target (T) features, as well as connecting lines representing six candidate matches. Red matches (edges) have the lowest penalization score for each cluster at each iteration and are selected. Dashed lines are conflicting matches that also contain a best-matched feature and thus are discarded. Blue lines are matches that initially are not the best but do not conflict with the best match; thus, they can still be chosen in later iterations. In this case, two matches are formed from the original cluster after two iterations (R1–T1 and R3–T2).
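The cluster decomposition in Figure 2 can be sketched as a greedy selection: take the lowest-scoring match, discard every match that shares a feature with it, and repeat. The Python below is an illustrative sketch, not the M2S implementation; the function name and the penalization scores are hypothetical. Sorting all candidates once gives the same result as iterating per cluster (ties aside), because clusters share no features.

```python
def select_unique_matches(candidates):
    """candidates: list of (ref_id, tgt_id, score). Returns one-to-one pairs."""
    chosen = []
    used_ref, used_tgt = set(), set()
    # Visit candidates from lowest (best) to highest penalization score.
    for ref_id, tgt_id, _score in sorted(candidates, key=lambda c: c[2]):
        if ref_id in used_ref or tgt_id in used_tgt:
            continue  # conflicts with a better match that was already selected
        chosen.append((ref_id, tgt_id))
        used_ref.add(ref_id)
        used_tgt.add(tgt_id)
    return chosen

# Hypothetical scores for a cluster of three reference and two target features.
cluster = [
    ("R1", "T1", 0.10), ("R1", "T2", 0.90),
    ("R2", "T1", 0.40), ("R2", "T2", 0.70),
    ("R3", "T1", 0.80), ("R3", "T2", 0.30),
]
result = select_unique_matches(cluster)
print(result)
```

With these scores the sketch selects R1–T1 first and R3–T2 second, and R2 remains unmatched, mirroring the outcome described in the caption.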
Figure 3. Summary of the data at each step of the workflow. Row 1: (Step 1) Inter-dataset distances for matched features in the (RT, MZ, log10FI) domains. Black dots are unique matches, blue circles are matches in clusters, and orange dots are matches outside the log10FI threshold limits. Row 2: (Step 2a) Black dots are the same as in Row 1; red circles are expected values at the (RT, MZ, log10FI) of the reference feature in the match. Row 3: (Step 2b) Residuals of the expected values. Row 4: (additional Step 2b) Normalized residuals obtained by dividing by the threshold point at their median + 3 × MAD. Row 5: (Step 2d) After defining weights W = [1,1,0.2] (Step 2c, not shown), penalization scores are obtained and used to color the same plots as in Row 1 (RT and MZ) and the comparison of log10FI of target and reference. Penalization scores are used (Step 2e, not shown) to decide the best match in clusters with multiple matches. Row 6: (Step 3) Tightening of thresholds used to define poor matches using the method “scores” at the threshold limit of median + 3 × MAD. Matches (part of clusters) previously discarded are in blue, poor matches in red, and good matches in black.
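Rows 4 to 6 can be sketched in Python as follows. The captions do not say how the three normalized residuals are combined into one score, so this sketch assumes a weighted Euclidean norm. It also assumes that the median + 3 × MAD limit is computed on absolute residuals with an unscaled MAD. All function names and residual values are hypothetical.

```python
import math
from statistics import median

def mad(values):
    """Unscaled median absolute deviation (assumption: no consistency factor)."""
    m = median(values)
    return median([abs(v - m) for v in values])

def threshold_point(values, n_mad=3.0):
    """median + n_mad * MAD, the limit named in the caption."""
    return median(values) + n_mad * mad(values)

def normalize(residuals):
    """Divide residuals by the threshold point of their absolute values."""
    limit = threshold_point([abs(r) for r in residuals])  # assumed nonzero here
    return [r / limit for r in residuals]

def penalization_scores(rt_res, mz_res, fi_res, weights=(1.0, 1.0, 0.2)):
    """Weighted Euclidean norm of normalized residuals (assumed combination rule)."""
    cols = [normalize(rt_res), normalize(mz_res), normalize(fi_res)]
    return [
        math.sqrt(sum(w * col[k] ** 2 for w, col in zip(weights, cols)))
        for k in range(len(rt_res))
    ]

# Made-up residuals for four matches; the last one is far from the modeled shift.
rt_res = [0.01, -0.02, 0.00, 0.30]
mz_res = [0.0001, -0.0002, 0.0001, 0.0030]
fi_res = [0.10, -0.10, 0.05, 0.90]

scores = penalization_scores(rt_res, mz_res, fi_res)
# Step 3 with the "scores" method: flag matches above median + 3 * MAD of the scores.
limit = threshold_point(scores)
poor = [s > limit for s in scores]
print(poor)
```

With W = [1, 1, 0.2], a large intensity residual contributes less to the score than an equally large normalized RT or MZ residual.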
Figure 4. Inter-dataset distances after initial candidate matching (Step 1), with results of validation using annotated features. Black dots are matches within thresholds; blue dots are matches with identical annotations; nine red crosses are annotated matches outside of the initial thresholds; 10 red circles are annotated matches wrongly considered poor matches.
Matches and Annotations at Each Step s1–s3
| stage and results | annotations | matches^a |
|---|---|---|
| initial data | 604 | (10 427/14 097) |
| matches outside thresholds | 9 | |
| after initial matches (s1) | 595 | 5426 |
| after unique matches (s2) | 595 | 5365 |
| correct ID matches | 595 | |
| wrong ID matches | 0 | |
| after poor matches (s3) | | 4953 |
| final correct ID matches | 585 | |
| final wrong^b | 19 | |
| poor matches | 10 | 412 |
| with correct ID | 10 | |
| with wrong ID | 0 | |

^a Numbers refer to matches; numbers in parentheses refer to features.
^b Wrong ID or outside threshold.
Figure 5. (Left) Number of common features^a highly correlated^b with each matched feature vs penalty scores used in the matching method. The lower the penalty score, the higher the number of common correlated features. (Center) Number of features highly associated (not necessarily common) with each matched feature in target vs reference. (Right) Number of common correlated features vs the minimum number of correlated features (not necessarily common) between the reference or target datasets. All plots are colored by “patternScore”, obtained as the ratio common/(minimum + 1).
^a Only features surviving removal of poor matches.
^b Spearman correlation >0.7 and ΔRT <0.25 s.
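The “patternScore” ratio can be shown with a small Python sketch. The data layout is an assumption: each matched feature carries the set of features it is highly correlated with in its own dataset, with matched partners sharing an ID across datasets and unmatched partners keeping dataset-specific IDs. The function name and the IDs are hypothetical.

```python
def pattern_score(ref_partners, tgt_partners):
    """Ratio common / (minimum + 1), as stated in the Figure 5 caption."""
    ref_set, tgt_set = set(ref_partners), set(tgt_partners)
    common = len(ref_set & tgt_set)            # correlated partners found in both datasets
    minimum = min(len(ref_set), len(tgt_set))  # smaller partner count, not necessarily common
    return common / (minimum + 1)

# Integer IDs stand for matched feature pairs; string IDs are unmatched features.
score = pattern_score({1, 2, 3, "ref_only_9"}, {1, 2, 4})
print(score)  # 2 common / (3 + 1) = 0.5
```

The +1 in the denominator keeps the ratio defined when a feature has no correlated partners and keeps the score strictly below 1.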