Estela de Oliveira Lima, Luiz Claudio Navarro, Karen Noda Morishita, Camila Mika Kamikawa, Rafael Gustavo Martins Rodrigues, Mohamed Ziad Dabaja, Diogo Noin de Oliveira, Jeany Delafiori, Flávia Luísa Dias-Audibert, Marta da Silva Ribeiro, Adriana Pardini Vicentini, Anderson Rocha, Rodrigo Ramos Catharino.
Abstract
Brazil and many other Latin American countries are areas of endemicity for different neglected diseases, and the fungal infection paracoccidioidomycosis (PCM) is one of them. Among the clinical manifestations, pneumopathy associated with skin and mucosal lesions is the most frequent. PCM definitive diagnosis depends on yeast microscopic visualization and immunological tests, but both present ambiguous results and difficulty in differentiating PCM from other fungal infections. This research has employed metabolomics analysis through high-resolution mass spectrometry to identify PCM biomarkers in serum samples in order to improve diagnosis for this debilitating disease. To upgrade the biomarker selection, machine learning approaches, using Random Forest classifiers, were combined with metabolomics data analysis. The proposed combination of these two analytical methods resulted in the identification of a set of 19 PCM biomarkers that show accuracy of 97.1%, specificity of 100%, and sensitivity of 94.1%. The obtained results are promising and present great potential to improve PCM definitive diagnosis and adequate pharmacological treatment, reducing the incidence of PCM sequelae and resulting in a better quality of life.

IMPORTANCE Paracoccidioidomycosis (PCM) is a fungal infection typically found in Latin American countries, especially in Brazil. The identification of this disease is based on techniques that may fail sometimes. Intending to improve PCM detection in patient samples, this study used the combination of two of the newest technologies, artificial intelligence and metabolomics. This combination allowed PCM detection, independently of disease form, through identification of a set of molecules present in patients' blood. The great difference in this research was the ability to detect disease with better confidence than the routine methods employed today. Another important point is that among the molecules, it was possible to identify some indicators of contamination and other infection that might worsen patients' condition. Thus, the present work shows a great potential to improve PCM diagnosis and even disease management, considering the possibility to identify concomitant harmful factors.
Keywords: artificial intelligence; diagnosis; metabolomics; paracoccidioidomycosis
Year: 2020 PMID: 32606026 PMCID: PMC7329323 DOI: 10.1128/mSystems.00258-20
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
FIG 1. Optimization process to determine the most important features. The sign length indicates the number of variables which are the most discriminant ones. (Top left) Results of the iteration process as a function of the ranked feature length, determining 28 features for the best F1 score achieved. (Top right) Approximate derivative of the F1 score curve showing that results start becoming almost stable above the 10 first-ranked features. (Bottom left) Initial ranked features versus final ranked features, demonstrating the rank refinement convergence and that most discriminant features are in the first positions from the beginning of the process.
Definitions of statistical metrics to evaluate classification results
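| Metric | Abbreviation | Formula |
|---|---|---|
| Sensitivity | STV or TPR | $\mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$ |
| Specificity | SPC | $\mathrm{TN}/(\mathrm{TN} + \mathrm{FP})$ |
| Precision | PRC | $\mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$ |
| F1 score | F1S | $2 \times \mathrm{PRC} \times \mathrm{STV}/(\mathrm{PRC} + \mathrm{STV})$ |
| Accuracy | ACC | $(\mathrm{TP} + \mathrm{TN})/(\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN})$ |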
Abbreviations: TP, true positives; TN, true negatives; FP, false positives; FN, false negatives.
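The definitions above can be sketched in a few lines of Python. This is only an illustration of the standard formulas: the function name `classification_metrics` and the confusion-matrix counts in the example are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five metrics from confusion-matrix counts (as fractions).

    No guard against empty classes: a zero denominator raises ZeroDivisionError.
    """
    sensitivity = tp / (tp + fn)            # STV or TPR
    specificity = tn / (tn + fp)            # SPC
    precision = tp / (tp + fp)              # PRC
    f1_score = 2 * precision * sensitivity / (precision + sensitivity)  # F1S
    accuracy = (tp + tn) / (tp + tn + fp + fn)                          # ACC
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1_score": f1_score,
        "accuracy": accuracy,
    }


# Hypothetical counts chosen for illustration only (not reported in the source).
metrics = classification_metrics(tp=16, tn=17, fp=0, fn=1)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

With these illustrative counts, a single false negative lowers sensitivity and accuracy while specificity and precision stay at 100%, which is the same pattern seen in the final-test columns reported below.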
The 28 most discriminant features which, together, achieved the best prediction performance
| Rank | Marker | m/z | ΔJ |
|---|---|---|---|
| 1 | Yes | 1,274.6 | 47.0 |
| 2 | Yes | 912.7 | 48.1 |
| 3 | Yes | 760.3 | 50.0 |
| 4 | No | 808.5 | 0.0 |
| 5 | Yes | 977.9 | 49.1 |
| 6 | Yes | 1,275.6 | 46.1 |
| 7 | No | 909.8 | 0.0 |
| 8 | Yes | 935.7 | 46.1 |
| 9 | Yes | 814.7 | 41.2 |
| 10 | No | 822.4 | 0.0 |
| 11 | Yes | 1,273.6 | 46.7 |
| 12 | Yes | 977.4 | 49.1 |
| 13 | Yes | 758.6 | 50.0 |
| 14 | Yes | 936.7 | 47.8 |
| 15 | No | 1,276.6 | 0.0 |
| 16 | Yes | 801.6 | 45.6 |
| 17 | Yes | 911.7 | 47.5 |
| 18 | Yes | 757.6 | 49.5 |
| 19 | Yes | 1,296.6 | 47.7 |
| 20 | No | 860.3 | 0.0 |
| 21 | Yes | 761.6 | 33.9 |
| 22 | No | 839.6 | 24.8 |
| 23 | Yes | 768.2 | 49.4 |
| 24 | No | 1,046.4 | 0.0 |
| 25 | No | 1,045.4 | 0.0 |
| 26 | No | 760.5 | 0.0 |
| 27 | Yes | 978.4 | 49.1 |
| 28 | Yes | 933.7 | 47.3 |
Classification results of the validation tests and the final test using the 28 most discriminant features and the 19 PCM biomarkers according to ΔJ rank
| Metric | Signature (best length): validation mean | Signature (best length): validation SD | Signature (best length): final test | Only markers: validation mean | Only markers: validation SD | Only markers: final test |
|---|---|---|---|---|---|---|
| Vector length | 28 | | 28 | 19 | | 19 |
| No. of trees | 76 | | 76 | 73 | | 73 |
| Accuracy (%) | 99.0 | 1.0 | 97.1 | 97.4 | 2.3 | 97.1 |
| Sensitivity (%) | 99.3 | 2.1 | 94.1 | 96.3 | 4.9 | 94.1 |
| Specificity (%) | 98.8 | 1.3 | 100.0 | 98.6 | 1.3 | 100.0 |
| Precision (%) | 96.4 | 3.7 | 100.0 | 95.6 | 3.8 | 100.0 |
| F1 score (%) | 97.8 | 1.9 | 97.0 | 96.0 | 2.7 | 97.0 |
FIG 2. Heatmap of the 19 most discriminant features. The color scale is from dark green (0%) to dark red (100%), corresponding to the minimum and maximum intensity values of the marker m/z on all samples, respectively.
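The 0% to 100% color scale described in the caption corresponds to a per-feature min-max rescaling of intensities across samples. A minimal sketch of that rescaling follows, assuming one list of intensities per marker; the function name `min_max_percent` and the intensity values are hypothetical, and the source does not state how constant features were handled.

```python
def min_max_percent(intensities):
    """Rescale one marker's intensities so min maps to 0% and max to 100%."""
    lowest, highest = min(intensities), max(intensities)
    if highest == lowest:
        # Assumption: a constant feature carries no contrast, so map it to 0%.
        return [0.0 for _ in intensities]
    return [100.0 * (x - lowest) / (highest - lowest) for x in intensities]


# Hypothetical intensities of a single marker m/z across four samples.
example = [1200.0, 4800.0, 3000.0, 1200.0]
print(min_max_percent(example))
```

Because each marker is scaled independently, colors are comparable across samples within a row of the heatmap but not across different markers.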
Molecular features selected by machine learning analysis in PCM patients’ serum samples
| Molecule | ID | Theoretical mass (Da) | Exptl mass (Da) | Adduct | Error (ppm) | MS/MS (m/z) |
|---|---|---|---|---|---|---|
| Fumonisin B1 and/or isofumonisin B1 | MID 53922, MID 88649 | 760.3516 | 760.3506 | [M + K]+ | 1.31 | 508, 714, 506, 572 |
| Cerebroside D | MID 477 | 756.5984 | 756.5997 | [M + H]+ | −1.71 | 710, 738, 568 |
| GlcCer (d36:1) | LMSP05010059 | 758.6141 | 758.6151 | [M + H]+ | −1.31 | 686, 570, 519 |
| PE-Cer (40:1) | MID 103125 | 761.6167 | 761.6152 | [M + H]+ | 1.96 | 508, 536, 729 |
| TG (53:2) | MID 36808 | 911.7464 | 911.7482 | [M + K]+ | −1.97 | 655, 629, 335, 865 |
| TG (55:4) | MID 100332 | 935.7464 | 935.7478 | [M + K]+ | −1.49 | 653, 679, 639, 903 |
| TG (55:5) | MID 101029 | 933.7308 | 933.7317 | [M + K]+ | −0.96 | 651, 677, 637, 683 |
| TG (48:3) | MID 99740 | 801.6967 | 801.6981 | [M + H]+ | −1.74 | 551, 729, 545, 567 |
| TG (49:4) | MID 100470 | 813.6967 | 813.6983 | [M + H]+ | −1.96 | 795, 557, 781, 681 |
| Rhamnosyl-galactosyl-diphosphoundecaprenol | MID 71958 | 1,273.7057 | 1,273.7033 | [M + K]+ | 1.88 | 1,258, 1,253, 1,231 |
| Unknown | | | 768.2646 | | | |
| Unknown | | | 977.4555 | | | |
| Unknown | | | 977.9566 | | | |
| Unknown | | | 978.4537 | | | |
| Unknown | | | 1,296.6092 | | | |
In lipid names, the numbers in parentheses denote carbon number:number of double bonds.
Abbreviations: M, molecule mass without adduct; ppm, parts per million; MS/MS, tandem mass spectrometry; MID, Metlin ID; LMSP, Lipid Maps data bank ID; TG, triacylglyceride; GlcCer, glucosylceramide; PE-Cer, phosphoethanolamine-ceramide.
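The mass error column can be checked with a short calculation. The sketch below assumes the sign convention (theoretical − experimental) / theoretical × 10^6, which is inferred from the signs in the table rather than stated in the source; the function name `ppm_error` is hypothetical. Values recomputed from the four-decimal masses may differ from the tabulated ones in the last digit.

```python
def ppm_error(theoretical_mass, experimental_mass):
    """Mass error in parts per million.

    Assumed convention: positive when the experimental mass is below
    the theoretical mass (inferred from the table, not stated in the source).
    """
    return (theoretical_mass - experimental_mass) / theoretical_mass * 1e6


# Rows taken from the table: (molecule, theoretical mass, experimental mass).
rows = [
    ("Fumonisin B1 and/or isofumonisin B1", 760.3516, 760.3506),
    ("Cerebroside D", 756.5984, 756.5997),
    ("Rhamnosyl-galactosyl-diphosphoundecaprenol", 1273.7057, 1273.7033),
]
for molecule, theoretical, experimental in rows:
    print(f"{molecule}: {ppm_error(theoretical, experimental):.2f} ppm")
```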
Classification results of the validation tests and the final test using the features 756.6 and 758.6 as the most representative of PCM condition
| Metric | 756.6, 758.6: validation mean | 756.6, 758.6: validation SD | 756.6, 758.6: final test |
|---|---|---|---|
| Vector length | 2 | | 2 |
| No. of trees | 64 | | 64 |
| Accuracy (%) | 86.2 | 6.9 | 92.2 |
| Sensitivity (%) | 79.4 | 14.4 | 88.2 |
| Specificity (%) | 93.0 | 3.3 | 96.1 |
| Precision (%) | 78.8 | 7.4 | 88.2 |
| F1 score (%) | 79.1 | 8.5 | 88.2 |