Alex Chao, Hussein Al-Ghoul, Andrew D McEachran, Ilya Balabin, Tom Transue, Tommy Cathey, Jarod N Grossman, Randolph R Singh, Elin M Ulrich, Antony J Williams, Jon R Sobus.
Abstract
High-resolution mass spectrometry (HRMS) enables rapid chemical annotation via accurate mass measurements and matching of experimentally derived spectra with reference spectra. Reference libraries are generated from chemical standards and are therefore limited in size relative to known chemical space. To address this limitation, in silico spectra (i.e., MS/MS or MS2 spectra), predicted via Competitive Fragmentation Modeling-ID (CFM-ID) algorithms, were generated for compounds within the U.S. Environmental Protection Agency's (EPA) Distributed Structure-Searchable Toxicity (DSSTox) database (totaling, at the time of analysis, ~765,000 substances). Experimental spectra from EPA's Non-Targeted Analysis Collaborative Trial (ENTACT) mixtures (n = 10) were then used to evaluate the performance of the in silico spectra. Overall, MS2 spectra were acquired for 377 unique compounds from the ENTACT mixtures. Approximately 53% of these compounds were correctly identified using a commercial reference library, whereas up to 50% were correctly identified as the top hit using the in silico library. Together, the reference and in silico libraries were able to correctly identify 73% of the 377 ENTACT substances. When using the in silico spectra for candidate filtering, an examination of binary classifiers showed a true positive rate (TPR) of 0.90 associated with false positive rates (FPRs) of 0.10 to 0.85, depending on the sample and method of candidate filtering. Taken together, these findings show the abilities of in silico spectra to correctly identify true positives in complex samples (at rates comparable to those observed with reference spectra), and efficiently filter large numbers of potential false positives from further consideration.
Keywords: CFM-ID; DSSTox; ENTACT; High-resolution mass spectrometry; Non-targeted analysis; ToxCast
Year: 2020; PMID: 31965249; PMCID: PMC7021669; DOI: 10.1007/s00216-019-02351-7
Source DB: PubMed; Journal: Anal Bioanal Chem; ISSN: 1618-2642; Impact factor: 4.142
Fig. 1 Overall workflow for data acquisition and compound identification. Sections outlined in blue show aspects of the workflow previously implemented for the analysis of ENTACT mixtures. The section outlined in purple shows additions to the workflow that involve matching experimental MS2 spectra with CFM-ID predicted spectra. Identification confidence levels [2] for each match of experimental data to a corresponding database/library entry are shown alongside the specified match in the workflow.
Fig. 2 Three approaches for utilizing CFM-ID scores. Each combination of experimental spectrum vs. CFM-ID predicted spectrum generates a unique score via the dot-product algorithm, designated by a unique letter assignment. In approach 1, only one score is generated at the designated collision energy (CE, where the experimental CE equals the in silico CE). In approach 2, scores from all three in silico CE levels are summed. In approach 3, scores are summed across all three in silico CE levels, and then across all three experimental CE levels.
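The three scoring approaches in Fig. 2 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the record does not give the exact dot-product formula, so `dot_product_score` uses a plain cosine similarity with a hypothetical m/z tolerance (`tol`), spectra are represented as lists of `(mz, intensity)` pairs keyed by collision energy, and approach 3 simply skips experimental CE levels that are missing for a compound.

```python
import math

CES = (10, 20, 40)  # collision energies in V, as listed in the tables


def dot_product_score(spec_a, spec_b, tol=0.02):
    """Cosine-style similarity between two spectra of (mz, intensity) pairs.

    Assumption: peaks are paired greedily within +/- tol Da; the study's
    actual dot-product weighting is not described in this record.
    """
    matched, used = 0.0, set()
    for mz_a, int_a in spec_a:
        best_j, best_diff = None, tol
        for j, (mz_b, _) in enumerate(spec_b):
            diff = abs(mz_a - mz_b)
            if j not in used and diff <= best_diff:
                best_j, best_diff = j, diff
        if best_j is not None:
            used.add(best_j)
            matched += int_a * spec_b[best_j][1]
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return matched / (norm_a * norm_b)


def approach_1(exp, pred, ce):
    """One score: experimental CE equals in silico CE."""
    return dot_product_score(exp[ce], pred[ce])


def approach_2(exp, pred, ce_exp):
    """Sum of scores over all three in silico CE levels for one experimental CE."""
    return sum(dot_product_score(exp[ce_exp], pred[ce]) for ce in CES)


def approach_3(exp, pred):
    """Sum over in silico CE levels, then over available experimental CE levels."""
    return sum(approach_2(exp, pred, ce_exp) for ce_exp in CES if ce_exp in exp)
```

With this layout, `exp` and `pred` are dictionaries mapping a CE value to a spectrum, so a candidate is scored by calling, for example, `approach_3(exp, pred)` once per candidate structure.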
Numbers of spiked ENTACT substances meeting specific research criteria
| Mixture | Spiked substances | Passes | Passes in PCDL¹ | Passes w/ MS2 | Passes in PCDL and w/ MS2 | Passes matched by PCDL |
|---|---|---|---|---|---|---|
| 499 | 95 | 46 | 28 | 37 | 23 | 18 |
| 500 | 95 | 19 | 14 | 14 | 11 | 7 |
| 501 | 95 | 47 | 28 | 34 | 25 | 23 |
| 502 | 95 | 58 | 42 | 22 | 17 | 15 |
| 503 | 185 | 103 | 59 | 67 | 43 | 34 |
| 504 | 185 | 103 | 55 | 68 | 41 | 34 |
| 505 | 365 | 224 | 128 | 64 | 44 | 40 |
| 506 | 365 | 195 | 114 | 113 | 74 | 61 |
| 507 | 95 | 19 | 13 | 14 | 9 | 7 |
| 508 | 364 | 31 | 19 | 20 | 13 | 7 |
| Total | 1939 | 845 | 500 | 453 | 300 | 246 |
| % of total | NA | 44% | 26% | 23% | 15% | 13% |
| % of passes | NA | NA | 59% | 54% | 36% | 29% |
¹ Composite “Personal Compound Database and Library” (PCDL) containing compounds from six individual Agilent PCDLs (i.e., Environmental water screening, Pesticides, Forensic toxicology, Veterinary drugs, Metlin, and Extractable and leachables).
CFM-ID results for ENTACT mixture compounds across three scoring approaches (Fig. 2). Candidate compounds from the CFM-ID database were limited to those having an MS-Ready monoisotopic mass matching (within 10 ppm) that of the known (spiked) substance
| | Approach 1 | Approach 1 | Approach 1 | Approach 2 | Approach 2 | Approach 2 | Approach 3 |
|---|---|---|---|---|---|---|---|
| Experimental CE (V) | 10 | 20 | 40 | 10 | 20 | 40 | Σᵃ |
| In silico CE (V) | 10 | 20 | 40 | Σ | Σ | Σ | Σ |
| No. of compounds scored | 363 | 368 | 360 | 363 | 368 | 360 | 377 |
| Number of true positives | | | | | | | |
| Top hit | 102 | 129 | 93 | 100 | 139 | 100 | 129 |
| Within top 5 | 187 | 219 | 162 | 188 | 221 | 162 | 224 |
| Within top 20 | 267 | 279 | 215 | 275 | 283 | 213 | 298 |
| Percentage of true positives | | | | | | | |
| Top hit | 28% | 35% | 26% | 28% | 38% | 28% | 34% |
| Within top 5 | 52% | 60% | 45% | 52% | 60% | 45% | 59% |
| Within top 20 | 74% | 76% | 60% | 76% | 77% | 59% | 79% |
| Average percentile for true positives | 77th | 81st | 72nd | 78th | 82nd | 73rd | 81st |
| Average quotient for true positives | 0.67 | 0.62 | 0.45 | 0.64 | 0.65 | 0.47 | 0.69 |

ᵃ Sum of three CEs
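The mass-based candidate filter (within 10 ppm) and the "top hit / top 5 / top 20" counts in the table above can be sketched as follows. This is an illustrative sketch, not the study's code: the function names and candidate identifiers are hypothetical, and tied scores are assumed to share the better rank, which the record does not specify.

```python
def within_ppm(candidate_mass, target_mass, ppm=10.0):
    """True if the candidate mass lies within +/- ppm of the target mass."""
    return abs(candidate_mass - target_mass) / target_mass * 1e6 <= ppm


def filter_by_mass(candidates, target_mass, ppm=10.0):
    """Keep candidate ids whose monoisotopic mass matches the target.

    candidates: dict mapping a (hypothetical) candidate id to its mass.
    """
    return [cid for cid, mass in candidates.items()
            if within_ppm(mass, target_mass, ppm)]


def rank_of(scores, true_id):
    """1-based rank of the true compound by descending score (None if absent).

    Assumption: candidates with tied scores share the better rank.
    """
    if true_id not in scores:
        return None
    return 1 + sum(1 for s in scores.values() if s > scores[true_id])


def in_top_k(scores, true_id, k):
    """True if the true compound ranks within the top k candidates."""
    rank = rank_of(scores, true_id)
    return rank is not None and rank <= k


# Usage with made-up values: filter first, then rank the surviving candidates.
candidates = {"cand_A": 300.0000, "cand_B": 300.0020, "cand_C": 300.0040}
kept = filter_by_mass(candidates, target_mass=300.0)
scores = {"cand_A": 0.4, "cand_B": 0.9}
is_top_hit = in_top_k(scores, "cand_A", k=1)
```

Counting how many spiked compounds satisfy `in_top_k` for k = 1, 5, and 20, and dividing by the number of compounds scored, gives the kind of percentages reported in the table.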
CFM-ID results for ENTACT mixture compounds across three scoring approaches (Fig. 2). Candidate compounds from the CFM-ID database were limited to those having an MS-Ready formula matching that of the known (spiked) substance
| | Approach 1 | Approach 1 | Approach 1 | Approach 2 | Approach 2 | Approach 2 | Approach 3 |
|---|---|---|---|---|---|---|---|
| Experimental CE (V) | 10 | 20 | 40 | 10 | 20 | 40 | Σᵃ |
| In silico CE (V) | 10 | 20 | 40 | Σ | Σ | Σ | Σ |
| No. of compounds scored | 363 | 368 | 360 | 363 | 368 | 360 | 377 |
| Number of true positives | | | | | | | |
| Top hit | 159 | 178 | 123 | 171 | 180 | 128 | 188 |
| Within top 5 | 239 | 250 | 194 | 243 | 252 | 194 | 268 |
| Within top 20 | 284 | 291 | 232 | 295 | 292 | 232 | 321 |
| Percentage of true positives | | | | | | | |
| Top hit | 44% | 48% | 34% | 47% | 49% | 36% | 50% |
| Within top 5 | 66% | 68% | 54% | 67% | 68% | 54% | 71% |
| Within top 20 | 78% | 79% | 64% | 81% | 79% | 64% | 85% |
| Average percentile for true positives | 82nd | 83rd | 76th | 83rd | 84th | 77th | 84th |
| Average quotient for true positives | 0.77 | 0.73 | 0.57 | 0.77 | 0.75 | 0.59 | 0.79 |

ᵃ Sum of three CEs
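The "percentile" and "quotient" rows are not defined in this record, so the sketch below rests on assumed definitions: the quotient is taken as the true compound's score divided by the highest score among its candidates, and the percentile as the share of candidates scoring at or below the true compound. Both function names are hypothetical, and the study's exact definitions may differ.

```python
def quotient(scores, true_id):
    """Assumed definition: true compound's score relative to the top score."""
    top = max(scores.values())
    return scores[true_id] / top if top > 0 else 0.0


def percentile(scores, true_id):
    """Assumed definition: percent of candidates scoring at or below the true one."""
    s_true = scores[true_id]
    at_or_below = sum(1 for s in scores.values() if s <= s_true)
    return 100.0 * at_or_below / len(scores)


# Usage with made-up scores for four hypothetical candidates.
scores = {"cand_A": 0.8, "cand_B": 0.4, "cand_C": 0.2, "cand_D": 0.1}
q = quotient(scores, "cand_B")
p = percentile(scores, "cand_B")
```

Under these assumptions, a true compound that is also the top hit has a quotient of 1.0 and a percentile of 100, so averages closer to those limits indicate better ranking of the spiked substances.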
Fig. 3 Number of “pass” compounds within the top 20 CFM-ID hits using approach 1 at CE = 10 V vs. 20 V vs. 40 V (a). Number of “pass” compounds within the top 20 CFM-ID hits using approach 3 vs. approach 1 at CE = 10, 20, or 40 V (b).
Fig. 4 ROC curves (a) for ENTACT mixture data (all “pass” compounds from all ten mixtures) when using percentile and quotient cutoff values, and when filtering the CFM-ID database matches by mass or molecular formula. A global TPR of 0.90 (horizontal gray dashed line) results in percentile-based FPR values (green vertical dotted lines) of 0.67 (by mass) and 0.36 (by formula), and quotient-based FPR values (pink vertical dotted lines) of 0.57 (by mass) and 0.32 (by formula). Distributions (b) of true positive rates (TPRs) and false positive rates (FPRs) across individual ENTACT mixtures (n = 10) when selecting cutoff values based on a global TPR of 0.90 (from a).
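The cutoff selection described for Fig. 4 (pick the cutoff that yields a global TPR of 0.90, then read off the FPR) can be sketched as below. This is a generic binary-classifier calculation rather than the authors' code: the function names are hypothetical, candidates are assumed to be retained when their percentile or quotient value is at or above the cutoff, and the cutoff search simply scans the observed values.

```python
def rates_at_cutoff(values, labels, cutoff):
    """Return (TPR, FPR) when candidates with value >= cutoff are retained.

    values: percentile or quotient per candidate.
    labels: True where the candidate is the spiked (true) compound.
    """
    tp = sum(1 for v, y in zip(values, labels) if y and v >= cutoff)
    fn = sum(1 for v, y in zip(values, labels) if y and v < cutoff)
    fp = sum(1 for v, y in zip(values, labels) if not y and v >= cutoff)
    tn = sum(1 for v, y in zip(values, labels) if not y and v < cutoff)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def cutoff_for_tpr(values, labels, target_tpr=0.90):
    """Highest observed cutoff whose TPR reaches the target, with its rates."""
    for cutoff in sorted(set(values), reverse=True):
        tpr, fpr = rates_at_cutoff(values, labels, cutoff)
        if tpr >= target_tpr:
            return cutoff, tpr, fpr
    return None


# Usage with made-up quotient values for six hypothetical candidates.
values = [1.0, 0.9, 0.5, 0.8, 0.3, 0.2]
labels = [True, True, True, False, False, False]
result = cutoff_for_tpr(values, labels, target_tpr=0.90)
```

Applying a globally chosen cutoff to each mixture separately, via `rates_at_cutoff`, would give per-mixture TPR and FPR values of the kind summarized in panel (b).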
Fig. 5 Comparison of “pass” compounds (n = 377) correctly identified by reference library matching (using a composite Agilent PCDL) vs. CFM-ID database matching (when filtering by molecular formula).