Shijinqiu Gao, Hoi Yan Katharine Chau, Kuijun Wang, Hongyu Ao, Rency S Varghese, Habtom W Ressom.
Abstract
Metabolite annotation remains a challenge, especially in untargeted metabolomics studies by liquid chromatography coupled with mass spectrometry (LC-MS). This is due in part to the limitations of publicly available spectral libraries, which contain tandem mass spectrometry (MS/MS) data acquired from only a fraction of known metabolites. Machine learning provides the opportunity to predict molecular fingerprints from MS/MS data. The predicted molecular fingerprints can then be used to help rank putative metabolite IDs obtained using either the precursor mass or the formula of the unknown metabolite. This approach is particularly useful for annotating metabolites whose MS/MS spectra are missing from, or cannot be matched against, accessible spectral libraries. We investigated a convolutional neural network (CNN) for molecular fingerprint prediction from MS/MS data. We used more than 680,000 MS/MS spectra from the MoNA repository and NIST 20, representing about 36,000 compounds, to train and test our CNN model. The trained CNN model is implemented as a Python package, MetFID, which is available on GitHub; users enter their MS/MS spectra and corresponding putative metabolite IDs to obtain ranked lists of metabolites. On the CASMI 2016 benchmark dataset, MetFID ranks putative metabolite IDs better than two other machine learning-based tools (CSI:FingerID and ChemDistiller).
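The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes fingerprints are represented as sets of "on" bit indices and that candidates are scored by Tanimoto similarity to the predicted fingerprint. The function names, candidate names, and bit values are hypothetical and are not taken from the MetFID package.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def rank_candidates(
    predicted: Set[int], candidates: Dict[str, Set[int]]
) -> List[Tuple[str, float]]:
    """Sort candidates by similarity to the predicted fingerprint, best first."""
    scored = [(name, tanimoto(predicted, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical predicted fingerprint and candidate list (illustrative values only).
predicted_fp = {1, 4, 7, 9}
candidate_fps = {
    "candidate_A": {1, 4, 7, 9, 12},  # 4 shared bits, 5 in the union
    "candidate_B": {2, 4, 8},         # 1 shared bit, 6 in the union
    "candidate_C": {1, 4, 7},         # 3 shared bits, 4 in the union
}
ranked = rank_candidates(predicted_fp, candidate_fps)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In this toy case the candidate sharing the most bits relative to the union is ranked first; in practice the candidate list would come from a mass-based or formula-based database search.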
Keywords: deep learning; metabolite identification; metabolomics; molecular fingerprint
Year: 2022 PMID: 35888729 PMCID: PMC9316655 DOI: 10.3390/metabo12070605
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Comparison of CNN with other machine learning models based on F1 score, Tanimoto similarity score, and the top-k ranking of the candidates selected by mass-based and formula-based search against compound databases for 29,588 training and 6290 testing compounds. The result of the best performing model under each category is shown in bold.
| 61% | 59% | 66% | 67% |
| 45% | 43% | 52% | 53% |
| 32% | 35% | 39% | 40% |
| 45% | 47% | 49% | 48% |
| 59% | 61% | 66% | 66% |
| 68% | 68% | 70% |
| 71% | 72% | 75% | 75% |
| 75% | 75% | 77% | 77% |
| 81% | 82% | 83% | 82% |
| 81% |
| 81% |
1 LR: Logistic Regression; SLP: Single-Layer Perceptron; SVM: Support Vector Machine; MLP: Multilayer Perceptron; CNN: Convolutional Neural Network.
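As a rough illustration of two of the metrics named in the caption, the sketch below computes an F1 score between a true and a predicted binary fingerprint, and a top-k rate from the rank of the true compound in each candidate list. How the paper aggregates these scores across bits or compounds is not stated here, so the per-fingerprint F1 and the helper names are assumptions; all values are made up for the example.

```python
from typing import Optional, Sequence


def f1_score(true_bits: Sequence[int], pred_bits: Sequence[int]) -> float:
    """F1 score between a true and a predicted binary fingerprint."""
    pairs = list(zip(true_bits, pred_bits))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def top_k_rate(true_ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of spectra whose true compound is ranked within the top k.

    A rank of None means the true compound was not in the candidate list.
    """
    hits = sum(1 for r in true_ranks if r is not None and r <= k)
    return hits / len(true_ranks)


# Illustrative values only.
true_fp = [1, 0, 1, 1, 0, 0, 1, 0]
pred_fp = [1, 0, 1, 0, 0, 1, 1, 0]
f1 = f1_score(true_fp, pred_fp)  # 3 true positives, 1 false positive, 1 false negative

ranks = [1, 3, 2, None, 7, 1, 12, 4]
top1 = top_k_rate(ranks, 1)
top3 = top_k_rate(ranks, 3)
top10 = top_k_rate(ranks, 10)
print(f1, top1, top3, top10)
```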
Performance comparison of SLP, MLP, SVM, and CNN models that were trained using MS/MS spectra acquired from NIST 20 and MoNA and tested on the CASMI 2016 dataset. F1 score, Tanimoto similarity score, and top-k ranking of metabolite candidates are calculated for mass-based and formula-based approaches. The result of the best performing model under each category is shown in bold.
| 49% | 52% | 53% | 56% |
| 33% | 36% | 38% | 41% |
| 35% | 32% | 48% |
| 60% | 59% | 68% |
| 61% | 59% | 71% |
| 81% | 81% | 87% |
| 78% | 75% | 83% |
| 88% | 86% | 90% |
| 93% | 89% | 93% |
| 94% | 92% | 94% |
Performance comparison of ChemDistiller and CSI:FingerID with MetFID using the CASMI 2016 dataset as a testing dataset. The percentages outside the parentheses are computed over all 208 spectra, including peak lists left unannotated in the ranked result. The percentages inside the parentheses exclude peak lists whose candidate lists do not contain the true compound, to account for cases where the target compound cannot be found by searching against compound databases. The result of the best performing tool under each category is shown in bold.
| CASMI 2016 Testing | | | | |
|---|---|---|---|---|
| | Mass-Based | | Formula-Based | |
| Rank | ChemDistiller | MetFID | CSI:FingerID | MetFID |
| | 34% (44%) | | 67% (73%) | |
| | 47% (59%) | | 71% (78%) | |
| | 58% (73%) | | 72% (79%) | |
| | 63% (80%) | | 72% (79%) | |
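The two percentages reported in each cell differ only in their denominator. The sketch below works through that arithmetic, using the 208 spectra stated in the caption; the hit count and the number of spectra without the true compound in their candidate list are hypothetical values chosen for illustration, and the function name is not from any of the tools compared.

```python
from typing import Tuple


def top_k_percentages(
    hits: int, total_spectra: int, spectra_without_true_candidate: int
) -> Tuple[float, float]:
    """Return (overall %, % among spectra whose candidate list holds the true compound)."""
    overall = 100.0 * hits / total_spectra
    covered = total_spectra - spectra_without_true_candidate
    conditional = 100.0 * hits / covered
    return overall, conditional


TOTAL_SPECTRA = 208   # from the caption
hits_at_k = 104       # hypothetical: true compound ranked within the top k
missing_truth = 48    # hypothetical: candidate list lacks the true compound

outside, inside = top_k_percentages(hits_at_k, TOTAL_SPECTRA, missing_truth)
print(f"{outside:.0f}% ({inside:.0f}%)")
```

Because the second denominator is never larger than the first, the value inside the parentheses is always at least as high as the value outside.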
Figure 1. Machine learning-based compound fingerprint prediction for metabolite annotation.
Figure 2. Architecture of CNN.
Figure 3. Three different strategies of the CNN model. (A) One CNN model with only ion intensity values as input. (B) One CNN model with ion intensity values and three additional inputs. (C) Eight CNN models each trained with a subset of the MS/MS spectra.
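To make "ion intensity values as input" concrete, the sketch below shows one common way to turn an MS/MS peak list into a fixed-length intensity vector by m/z binning. The bin width, m/z range, summing of peaks within a bin, and normalization to the base peak are all assumptions made for illustration; the preprocessing actually used for the CNN is not described in this record.

```python
from typing import List, Sequence, Tuple


def bin_spectrum(
    peaks: Sequence[Tuple[float, float]],
    max_mz: float = 1000.0,   # assumed upper m/z limit
    bin_width: float = 1.0,   # assumed bin width
) -> List[float]:
    """Convert (m/z, intensity) peaks into a fixed-length, base-peak-normalized vector."""
    n_bins = int(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if 0.0 <= mz < max_mz:
            idx = min(int(mz / bin_width), n_bins - 1)
            vector[idx] += intensity  # peaks falling in the same bin are summed
    base_peak = max(vector)
    if base_peak > 0:
        vector = [v / base_peak for v in vector]
    return vector


# Hypothetical peak list; the last peak lies outside the assumed m/z range and is dropped.
peaks = [(85.03, 120.0), (85.4, 30.0), (127.04, 300.0), (1500.0, 50.0)]
vec = bin_spectrum(peaks)
print(len(vec), vec[85], vec[127])
```

A vector of this kind could serve as the intensity input in strategy (A); the three additional inputs in strategy (B) and the spectrum subsets in strategy (C) are not specified here and are left out of the sketch.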