Dai Hai Nguyen, Canh Hao Nguyen, Hiroshi Mamitsuka.
Abstract
Motivation: Recent success in metabolite identification from tandem mass spectra has been led by machine learning approaches that work in two stages: mapping mass spectra to molecular fingerprint vectors and then retrieving candidate molecules from the database. In the first stage, i.e. fingerprint prediction, spectrum peaks are the features, and considering their interactions is a reasonable way to identify unknown metabolites more accurately. Existing approaches to fingerprint prediction are based only on individual peaks in the spectra, without explicitly considering peak interactions. Also, the current cutting-edge method is based on kernels, which are computationally heavy and difficult to interpret.
Year: 2018 PMID: 29950009 PMCID: PMC6022642 DOI: 10.1093/bioinformatics/bty252
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Example MS spectrum from the Human Metabolome Database (Wishart ) for 1-Methylhistidine (HMDB00001), with its corresponding chemical structure (top-left) and peak list (top-right)
Fig. 2. A general scheme to identify unknown metabolites based on molecular fingerprint vectors. There are two main stages: 1) fingerprint prediction: learning a mapping from the MS/MS spectrum of a molecule to the corresponding binary molecular fingerprint vector by classification methods, given a set of MS/MS spectra and fingerprints; 2) candidate retrieval: using the predicted fingerprints to retrieve candidate molecules from the databases of known metabolites. Note that the step of constructing an affinity matrix is optional and is used in L-SIMPLE only
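The candidate retrieval stage in Fig. 2 can be illustrated with a short sketch. This is not the authors' implementation: the Tanimoto similarity used for scoring, the function names `tanimoto` and `rank_candidates`, and the toy 6-bit fingerprints are all assumptions made for this example.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary vectors (assumed scoring choice)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both vectors are all-zero: treat them as identical
    return np.logical_and(a, b).sum() / union

def rank_candidates(predicted_fp, candidate_fps):
    """Return candidate names sorted by decreasing similarity to the predicted fingerprint."""
    scores = {name: tanimoto(predicted_fp, fp) for name, fp in candidate_fps.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: one predicted fingerprint and three database candidates.
predicted = [1, 0, 1, 1, 0, 0]
candidates = {
    "cand_A": [1, 0, 1, 1, 0, 0],
    "cand_B": [1, 0, 0, 1, 0, 1],
    "cand_C": [0, 1, 0, 0, 1, 0],
}
ranking = rank_candidates(predicted, candidates)
```

In this toy case the candidate whose fingerprint matches the prediction exactly is ranked first. A real pipeline would likely restrict the candidate set first (for example by precursor mass), which is omitted here.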
Fig. 3. Illustration of constructing affinity matrix A from the set of fragmentation trees. The constructed matrix A is used as prior information for regularizing interaction matrix W
Micro-average performance of kernels: PPK (Heinonen ) is used to compute
| Method | Acc (%) | F1-score (%) |
|---|---|---|
| | 75.74 (±8.13) | 60.59 (±13.75) |
| | 78.41 (±6.82) | 65.05 (±12.16) |
| | 78.57 (±6.24) | 65.34 (±11.99) |
Note: ComUNIMKL, ALIGN, ALIGNF are combinations of and by algorithms UNIMKL, ALIGN, ALIGNF, respectively.
Performance comparison between SIMPLE and L-SIMPLE
| Task Id/Name | SIMPLE Acc (%) | SIMPLE F1 (%) | L-SIMPLE Acc (%) | L-SIMPLE F1 (%) |
|---|---|---|---|---|
| 3 (Aldehyde) | 71.16 | 69.24 | 73.14 | 70.75 |
| 27 (Hydroxy) | 91.24 | 95.19 | 90.29 | 94.82 |
| 29 (Primary alcohol) | 79.36 | 53.74 | 79.85 | 53.98 |
| 30 (Secondary alcohol) | 80.35 | 54.18 | 81.84 | 56.17 |
| 37 (Ether) | 80.35 | 71.39 | 80.61 | 72.03 |
| 38 (Dialkyl ether) | 82.35 | 70.98 | 82.6 | 71.16 |
| 45 (Aryl) | 83.83 | 81.67 | 83.34 | 82.16 |
| 50 (Carboxylic acid) | 69.38 | 62.00 | 69.65 | 62.15 |
| 56 (Primary Carbon) | 73.12 | 40.42 | 73.88 | 44.46 |
| 57 (Secondary Carbon) | 71.88 | 67.15 | 72.39 | 67.28 |
| 60 (Alkene) | 81.35 | 23.49 | 84.08 | 27.75 |
| Micro-average | 78.33 ± 6.05 | 66.69 ± 13.03 | 78.86 ± 5.87 | 67.59 ± 12.35 |
Micro-average performance and computation time (for prediction) of kernel-based methods in Shen and proposed methods in this paper
| Method | Acc (%) | F1 score (%) | Run. time (ms) |
|---|---|---|---|
| | 75.74 (±6.72) | 60.59 (±14.54) | 52.37 |
| | 76.63 (±7.03) | 61.64 (±15.48) | 1501.02 |
| | 75.33 (±5.4) | 61.25 (±13.99) | 1501.02 |
| | 74.54 (±8.49) | 58.46 (±16.01) | 1501.02 |
| | 79.11 (±5.02) | 67.34 (±11.75) | 1501.09 |
| | 78.41 (±4.99) | 66.87 (±12.11) | 1501.01 |
| | 79.02 (±7.4) | 67.55 (±12.93) | 1501.11 |
| | 80.98 (±6.05) | 69.04 (±11.98) | 1559.20 |
| | 79.03 (±7.89) | 65.67 (±13.02) | 471.71 |
| SIMPLE | 78.33 (±6.05) | 66.70 (±13.03) | 4.57 |
| L-SIMPLE | 78.86 (±5.87) | 67.59 (±12.35) | 4.32 |
Fig. 4. (a) Weight vectors (w) of the main effect terms and (b) smooth heat map of weight matrices (W) of the interaction terms learned by L-SIMPLE for properties or tasks: 29 (Primary alcohol), 37 (Ether), 56 (Primary Carbon), 70 (Alkylarylether), 139 (Thioenol), 192 (Carbonic acid monoester), 236 (Heteroaromatic), 356 (1,5-Tautomerizable) and 366 (Actinide)
Case studies of weight vector w and interaction matrix W learned by L-SIMPLE over a set of randomly selected tasks
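| Tasks/Name | (42, 85) w1 | (42, 85) w2 | (42, 85) W | (42, 163) w1 | (42, 163) w2 | (42, 163) W | (85, 227) w1 | (85, 227) w2 | (85, 227) W | (130, 201) w1 | (130, 201) w2 | (130, 201) W |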
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 (Primary Alcohol) | 0.0 | 0.0 | 0.0016 | −0.0545 | 0.0 | 0.0442 | −0.4085 | −0.0545 | 0.0765 | −0.4085 | 0.0 | 0.0218 |
| 37 (Ether) | 0.0 | 0.0 | 0.0120 | 0.0260 | 0.0 | 0.0471 | 0.0 | 0.0260 | 0.1264 | 0.0 | 0.0 | 0.3389 |
| 56 (Primary Carbon) | 0.0 | 0.0 | 0.0271 | −0.5657 | 0.0 | 0.0047 | −0.9833 | −0.5657 | 0.0271 | −0.9833 | 0.0 | 0.0104 |
| 70 (Alkylarylether) | 0.0 | 0.0 | 0.0159 | 0.0 | 0.0 | 0.0551 | −0.1238 | 0.0 | 0.0972 | −0.1238 | 0.0 | 0.0265 |
| 139 (Thioenol) | 0.1575 | 0.3939 | 0.0 | −0.2308 | −0.4094 | 0.0 | −0.3589 | −0.2308 | 0.0 | −0.3589 | −0.1126 | 0.0 |
| 192 (Carbonic acid monoester) | −0.1542 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | −0.6945 | 0.0 | 0.0 | −0.6945 | 0.0 | 0.0 |
| 236 (Heteroaromatic) | 0.0 | 0.0 | 0.0107 | 0.0 | 0.2499 | 0.0537 | −0.3069 | 0.0 | 0.1062 | −0.3069 | 0.1401 | 0.0201 |
| 356 (1,5-Tautomerizable) | 0.0 | 0.0 | 0.0170 | 0.0 | 0.0 | 0.0301 | 0.5539 | 0.0 | 0.0607 | 0.5539 | 0.0 | 0.0107 |
| 366 (Actinide) | 0.0 | 0.0 | 0.0245 | −0.1891 | 0.6065 | 0.0282 | −0.5373 | −0.1891 | 0.0399 | −0.5373 | 0.0 | 0.0153 |
Note: w1 and w2 denote the weights corresponding to the two mass positions of a pair, and W denotes the weight of their interaction. Four pairs of mass positions that are frequently present in these tasks are shown: (42, 85), (42, 163), (85, 227) and (130, 201).
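To make the roles of the main-effect weights w and the interaction weights W concrete, the sketch below scores a spectrum with both kinds of terms. It assumes a decision function of the form f(x) = b + wᵀx + xᵀWx with a symmetric W and one feature per mass position; that form, the bias b, the name `interaction_score`, and all numeric values are assumptions for illustration, not values or code taken from the paper.

```python
import numpy as np

def interaction_score(x, w, W, b=0.0):
    """Score = bias + main effects (w . x) + pairwise interactions (x^T W x)."""
    x = np.asarray(x, dtype=float)
    return b + w @ x + x @ W @ x

# Hypothetical example with three mass positions.
w = np.array([0.5, -1.0, 0.0])      # main-effect weight per mass position
W = np.zeros((3, 3))
W[0, 2] = W[2, 0] = 0.25            # symmetric interaction between positions 0 and 2

x = np.array([1.0, 0.0, 2.0])       # peak intensities at the three positions
score = interaction_score(x, w, W)  # 0.5 from main effects + 1.0 from the interaction
has_property = score > 0            # assumed rule: predict the fingerprint bit as 1 when the score is positive
```

Because W is stored symmetrically, each pair contributes 2·W[i, j]·x[i]·x[j] to the score. A zero entry in W means the two peaks do not interact for that task, which is what makes sparse weights like those in the table above easy to read.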