Seongho Kim1, Ikuko Kato2, Xiang Zhang3.
Abstract
Compound identification is a critical step in untargeted metabolomics. Its most important procedure is to calculate the similarity between experimental mass spectra and either predicted mass spectra or mass spectra in a mass spectral library. Unlike continuous similarity measures, binary similarity measures have not been evaluated for compound identification, even though the well-known Jaccard similarity measure has been widely used without proper evaluation. The objective of this study is thus to evaluate the performance of binary similarity measures for compound identification in untargeted metabolomics. Fifteen binary similarity measures, including the well-known Jaccard, Dice, Sokal-Sneath, Cosine, and Simpson measures, were selected to assess their performance in compound identification using both electron ionization (EI) and electrospray ionization (ESI) mass spectra. Our theoretical evaluations show that the accuracy of compound identification is exactly the same among the Jaccard, Dice, 3W-Jaccard, Sokal-Sneath, and Kulczynski measures, between the Cosine and Hellinger measures, and between the McConnaughey and Driver-Kroeber measures, which was confirmed in practice using mass spectral libraries. From the mass spectrum-based evaluation, we observed that the best-performing similarity measures were the McConnaughey and Driver-Kroeber measures for EI mass spectra and the Cosine and Hellinger measures for ESI mass spectra. The most robust similarity measure was the Fager-McGowan measure, the second-best performing similarity measure in both EI and ESI mass spectra.
Keywords: EI; ESI; binary similarity measure; compound identification; mass spectrometry; untargeted metabolomics
Year: 2022 PMID: 35893261 PMCID: PMC9394311 DOI: 10.3390/metabo12080694
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Figure 1. (a) Densities and (b) heatmap of the correlation matrix of scores among 15 binary similarity measures for EI mass spectra-based compound identification. The correlation was calculated using Pearson’s correlation coefficients. The horizontal red solid line indicates the four clusters generated by hierarchical clustering. The numbers in row and column represent the indices of binary similarity measures corresponding to Table 4 of Section 4.
Accuracy and 95% CI of the 15 binary similarity measures for the EI mass spectra-based compound identification up to the top three highest similarity scores.
| Similarity | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| 1 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
| 2 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
| 3 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
| 4 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
| 5 | 29.11 (28.50,29.71) | 37.27 (36.61,37.92) | 42.03 (41.38,42.68) |
| 6 | 27.51 (26.92,28.10) | 35.63 (34.99,36.28) | 40.19 (39.55,40.86) |
| 7 | 31.24 (30.62,31.86) | 40.24 (39.59,40.90) | 45.36 (44.68,46.02) |
| 8 | 31.24 (30.62,31.86) | 40.24 (39.59,40.90) | 45.36 (44.68,46.02) |
| 9 | 20.71 (20.17,21.25) | 20.80 (20.25,21.34) | 20.90 (20.34,21.44) |
| 10 | 18.32 (17.81,18.83) | 23.78 (23.21,24.36) | 26.65 (26.07,27.24) |
| 11 | 29.78 (29.17,30.39) | 38.09 (37.43,38.74) | 42.88 (42.22,43.54) |
| 12 | 27.49 (26.89,28.07) | 35.08 (34.43,35.72) | 39.41 (38.76,40.07) |
| 13 | 15.21 (14.75,15.69) | 15.40 (14.91,15.89) | 15.60 (15.11,16.09) |
| 14 | 26.16 (25.57,26.76) | 33.25 (32.63,33.89) | 37.34 (36.71,38.00) |
| 15 | 29.11 (28.50,29.71) | 37.27 (36.62,37.93) | 42.03 (41.38,42.68) |
The numbers in parentheses are 95% CI.
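The rank-wise accuracies in the table can be read as cumulative top-k accuracies: a query counts as identified at rank k if its true compound is among the k reference compounds with the highest similarity scores. The following is a minimal sketch under that reading. The function name `top_k_accuracy`, the toy scores, and the tie handling (Python's stable sort) are assumptions for illustration, not the authors' implementation, and the confidence interval computation is not reproduced here.

```python
def top_k_accuracy(score_lists, true_ids, k):
    """Percent of queries whose true compound is among the k best-scoring candidates.

    score_lists: one dict per query, mapping candidate id -> similarity score.
    true_ids:    the correct candidate id for each query, in the same order.
    """
    hits = 0
    for scores, truth in zip(score_lists, true_ids):
        # Candidates ordered from highest to lowest similarity score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(true_ids)


# Hypothetical scores for four queries against three reference compounds.
toy_scores = [
    {"x": 0.9, "y": 0.5, "z": 0.1},
    {"x": 0.2, "y": 0.7, "z": 0.6},
    {"x": 0.3, "y": 0.4, "z": 0.8},
    {"x": 0.5, "y": 0.1, "z": 0.3},
]
toy_truth = ["x", "z", "x", "y"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(toy_scores, toy_truth, k))
```

Because the top-k sets are nested, the accuracy can only stay the same or increase with k, which matches the pattern across the three rank columns.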
Figure 2. Accuracy of all similarity measures for EI mass spectra-based compound identification by rank. The x-axis represents the ranks and the y-axis the identification accuracy. The numbers in legend and plot are the indices of binary similarity measures corresponding to Table 4 of Section 4.
Figure 3. (a) Heatmap of the correlation matrix of the identification results among 15 binary similarity measures for Rank 1 and (b,c) Venn diagrams of consensus analysis among 4 selected binary similarity measures for EI mass spectra-based compound identification. In (a), the correlation was calculated using Pearson’s correlation coefficients. The horizontal red solid line indicates the four clusters generated by hierarchical clustering. The numbers in row and column represent the indices of binary similarity measures corresponding to Table 4 of Section 4. In (b), the Venn diagram was constructed based on all reference compounds with the highest corresponding similarity scores. In (c), the Venn diagram was constructed based on all reference compounds that were correctly identified.
Figure 4. (a) Densities and (b) heatmap of the correlation matrix of scores among all 15 binary similarity measures for ESI mass spectra-based compound identification. The correlation was calculated using Pearson’s correlation coefficients. The horizontal red solid line indicates the four clusters generated by hierarchical clustering. The numbers in row and column represent the indices of binary similarity measures corresponding to Table 4 of Section 4.
Accuracy and 95% CI of all 15 binary similarity measures for ESI mass spectra-based compound identification up to top three highest similarity scores.
| Similarity | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| 1 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
| 2 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
| 3 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
| 4 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
| 5 | 53.37 (51.36,55.38) | 60.32 (58.39,62.24) | 64.13 (62.20,66.05) |
| 6 | 48.01 (46.00,49.98) | 54.37 (52.45,56.38) | 56.72 (54.75,58.73) |
| 7 | 51.15 (49.14,53.16) | 58.23 (56.26,60.23) | 61.87 (59.94,63.79) |
| 8 | 51.15 (49.14,53.16) | 58.23 (56.26,60.23) | 61.87 (59.94,63.79) |
| 9 | 42.78 (40.85,44.70) | 45.21 (43.20,47.22) | 47.59 (45.63,49.60) |
| 10 | 50.31 (48.26,52.28) | 57.22 (55.25,59.19) | 61.07 (59.15,63.04) |
| 11 | 53.33 (51.32,55.34) | 60.36 (58.39,62.29) | 63.83 (61.95,65.80) |
| 12 | 52.24 (50.23,54.29) | 59.56 (57.64,61.49) | 62.83 (60.90,64.76) |
| 13 | 36.12 (34.24,38.09) | 39.18 (37.21,41.15) | 41.23 (39.22,43.24) |
| 14 | 47.34 (45.33,49.31) | 52.32 (50.36,54.37) | 54.46 (52.45,56.43) |
| 15 | 53.37 (51.36,55.38) | 60.32 (58.39,62.24) | 64.13 (62.20,66.05) |
The numbers in parentheses are 95% CI.
Figure 5. Accuracy of all similarity measures for ESI mass spectra-based compound identification by rank. The x-axis represents the ranks and the y-axis the identification accuracy. The numbers in legend and plot are the indices of binary similarity measures corresponding to Table 4 of Section 4.
Figure 6. (a) Heatmap of the correlation matrix of the identification results among all 15 binary similarity measures for Rank 1 and (b,c) Venn diagrams of consensus analysis among 4 selected binary similarity measures for ESI mass spectra-based compound identification. In (a), the correlation was calculated using Pearson’s correlation coefficients. The horizontal red solid line indicates the four clusters generated by hierarchical clustering. The numbers in row and column represent the indices of binary similarity measures corresponding to Table 4 of Section 4. In (b), the Venn diagram was constructed based on all reference compounds with the highest corresponding similarity scores. In (c), the Venn diagram was constructed based on all reference compounds that were correctly identified.
A confusion matrix between binary query and reference mass spectra.
| | Reference mass spectra: 0 | Reference mass spectra: 1 |
|---|---|---|
| Query mass spectra: 0 | d | b |
| Query mass spectra: 1 | a | c |
‘0’ indicates that a peak intensity is zero, while ‘1’ represents a nonzero intensity.
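As a concrete illustration of the confusion matrix, the sketch below binarizes two intensity vectors and counts a (peak in the query only), b (peak in the reference only), c (peak in both), and d (peak in neither). It assumes both spectra have already been aligned onto the same m/z grid; the function name `confusion_counts` and the example intensities are hypothetical.

```python
import numpy as np


def confusion_counts(query, reference):
    """Return (a, b, c, d) for two intensity vectors on a shared m/z grid."""
    q = np.asarray(query) != 0      # True where the query has a nonzero peak
    r = np.asarray(reference) != 0  # True where the reference has a nonzero peak
    if q.shape != r.shape:
        raise ValueError("query and reference must have the same length")
    a = int(np.sum(q & ~r))   # query 1, reference 0
    b = int(np.sum(~q & r))   # query 0, reference 1
    c = int(np.sum(q & r))    # query 1, reference 1
    d = int(np.sum(~q & ~r))  # query 0, reference 0
    return a, b, c, d


# Hypothetical intensities at six aligned m/z positions.
example_query = [0, 120.0, 0, 35.5, 999.0, 0]
example_reference = [0, 80.0, 10.0, 0, 500.0, 0]
print(confusion_counts(example_query, example_reference))  # (1, 1, 2, 2)
```

Only the presence or absence of a peak matters here; the intensity values themselves are discarded once the spectra are binarized.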
List of all 15 binary asymmetric similarity measures.
| Index | Name | Expression | Range |
|---|---|---|---|
| 1 | Jaccard | c/(a+b+c) | [0, 1) |
| 2 | Dice | 2c/(a+b+2c) | [0, 1) |
| 3 | 3W-Jaccard | 3c/(a+b+3c) | [0, 1) |
| 4 | Sokal–Sneath | c/(2a+2b+c) | [0, 1) |
| 5 | Cosine | c/√((a+c)·(b+c)) | [0, 1) |
| 6 | Mountford | 2c/(c(a+b)+2ab) | [0, 2] |
| 7 | McConnaughey | (c²−ab)/((a+c)·(b+c)) | [−1, 1) |
| 8 | Driver–Kroeber | c(a+b+2c)/(2(a+c)·(b+c)) | [0, 1) |
| 9 | Simpson | c/min(a+c,b+c) | [0, 1) |
| 10 | Braun–Banquet | c/max(a+c,b+c) | [0, 1) |
| 11 | Fager–McGowan | c/√((a+c)·(b+c)) − 1/(2·√(max(a+c,b+c))) | (−1/2, 1) |
| 12 | Kulczynski | c/(a+b) | [0, ∞) |
| 13 | Intersection | c | [0, ∞) |
| 14 | Hamming | 1/(a+b) | (0, 1] |
| 15 | Hellinger | 1 − √(1 − c/√((a+c)·(b+c))) | [0, 1) |
1, Jaccard is also known as Tanimoto; 2, Dice is also known as Hodgkin index, Sorenson, Czekanowski, Nei–Li, and F1-score; 5, Cosine is equal to the square root of Sorgenfrei, and is also known as Carbo index, Ochiai, Otsuka, and Fowlkes–Mallows index; 8, Driver–Kroeber is equal to 0.5 times Johnson, and is also known as Kulczynski; 14, Hamming is also known as squared-Euclidean, Canberra, Manhattan, Cityblock, and Minkowski; a,b,c ≥ 0; a+b+c > 0.
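The expressions in the table make the equal-accuracy groups easy to check numerically. Jaccard, Dice, 3W-Jaccard, and Sokal–Sneath are all strictly increasing functions of the Kulczynski ratio c/(a+b); Hellinger is a strictly increasing function of Cosine; and McConnaughey equals 2 × Driver–Kroeber − 1. Measures related this way always rank candidates in the same order. The sketch below implements only these nine measures and compares rankings on hypothetical (a, b, c) counts; it assumes a+c > 0, b+c > 0, and a+b > 0 so that no denominator is zero, and the names `MEASURES`, `ranking`, and `CANDIDATES` are illustrative.

```python
import math

# Each measure takes the counts (a, b, c) from the confusion matrix.
# Assumption for this sketch: a + c > 0, b + c > 0 and a + b > 0.
MEASURES = {
    "Jaccard": lambda a, b, c: c / (a + b + c),
    "Dice": lambda a, b, c: 2 * c / (a + b + 2 * c),
    "3W-Jaccard": lambda a, b, c: 3 * c / (a + b + 3 * c),
    "Sokal-Sneath": lambda a, b, c: c / (2 * a + 2 * b + c),
    "Kulczynski": lambda a, b, c: c / (a + b),
    "Cosine": lambda a, b, c: c / math.sqrt((a + c) * (b + c)),
    # max(0.0, ...) guards against a tiny negative value from rounding.
    "Hellinger": lambda a, b, c: 1
    - math.sqrt(max(0.0, 1 - c / math.sqrt((a + c) * (b + c)))),
    "McConnaughey": lambda a, b, c: (c * c - a * b) / ((a + c) * (b + c)),
    "Driver-Kroeber": lambda a, b, c: c * (a + b + 2 * c) / (2 * (a + c) * (b + c)),
}


def ranking(name, candidates):
    """Candidate ids ordered from highest to lowest score under one measure."""
    score = MEASURES[name]
    return sorted(candidates, key=lambda k: score(*candidates[k]), reverse=True)


# Hypothetical (a, b, c) counts for one query with 10 peaks (a + c = 10)
# compared against five reference spectra.
CANDIDATES = {
    "A": (2, 3, 8),
    "B": (5, 1, 5),
    "C": (2, 12, 8),
    "D": (8, 1, 2),
    "E": (4, 5, 6),
}

GROUPS = [
    ["Jaccard", "Dice", "3W-Jaccard", "Sokal-Sneath", "Kulczynski"],
    ["Cosine", "Hellinger"],
    ["McConnaughey", "Driver-Kroeber"],
]

for group in GROUPS:
    orders = {name: ranking(name, CANDIDATES) for name in group}
    # Every measure in a group should produce the same ordering.
    assert len({tuple(order) for order in orders.values()}) == 1
    print(group, "->", orders[group[0]])
```

On these toy counts the Jaccard and Cosine groups happen to agree with each other (A, B, E, C, D), while the McConnaughey group swaps C and E. That is the sense in which measures within a group are interchangeable for identification but measures across groups are not.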