Márcia H C Nascimento, Wena D Marcarini, Gabriely S Folli, Walter G da Silva Filho, Leonardo L Barbosa, Ellisson Henrique de Paulo, Paula F Vassallo, José G Mill, Valério G Barauna, Francis L Martin, Eustáquio V R de Castro, Wanderson Romão, Paulo R Filgueiras.
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused the worst global health crisis in living memory. Reverse transcription quantitative polymerase chain reaction (RT-qPCR) is considered the gold standard diagnostic method, but it exhibits limitations in the face of enormous demand. We evaluated a mid-infrared (MIR) data set of 237 saliva samples obtained from symptomatic patients (138 COVID-19 infections diagnosed via RT-qPCR). MIR spectra were evaluated via unsupervised random forest (URF) and classification models. Linear discriminant analysis (LDA) was applied following variable selection by a genetic algorithm (GA-LDA) or the successive projections algorithm (SPA-LDA), alongside partial least squares discriminant analysis (PLS-DA) and a combination of dimension reduction and variable selection by particle swarm optimization (PSO-PLS-DA). Additionally, a consensus class was used. URF models can identify structures even in highly complex data. Individual models performed well, but the consensus class improved the validation performance to 85% accuracy, 93% sensitivity, 83% specificity, and a Matthews correlation coefficient of 0.69, with information at different spectral regions. Therefore, through this combined unsupervised and supervised methodology, it is possible to better highlight the spectral regions associated with positive samples, including lipid (∼1700 cm⁻¹), protein (∼1400 cm⁻¹), and nucleic acid (∼1200–950 cm⁻¹) regions. This methodology presents an important tool for a fast, noninvasive diagnostic technique, reducing costs and allowing for risk reduction strategies.
Year: 2022 PMID: 35076208 PMCID: PMC8805707 DOI: 10.1021/acs.analchem.1c04162
Source DB: PubMed Journal: Anal Chem ISSN: 0003-2700 Impact factor: 6.986
Figure 1. Diagram outlining methodologies for collection of saliva from patients, spectral acquisition, and multivariate statistical data analyses.
Health Data for Participants
| | total, number (%) | positive, number (%) | negative, number (%) |
|---|---|---|---|
| female | 162 (68%) | 92 (57%) | 70 (43%) |
| male | 75 (32%) | 46 (61%) | 29 (39%) |
| hypertension | 58 (24%) | 33 (57%) | 25 (43%) |
| diabetes | 20 (8%) | 13 (65%) | 7 (35%) |
| chronic obstructive pulmonary disease (COPD) | 15 (6%) | 10 (67%) | 5 (33%) |
| obesity | 13 (5%) | 8 (62%) | 5 (38%) |
Figure 2. Mid-infrared (MIR) spectral data set from saliva samples of n = 237 patients with RT-qPCR diagnoses for COVID-19 infection with an average spectrum (red line). (A) Positive (n = 138 samples) and (B) negative (n = 99 samples).
Principal Mid-Infrared (MIR) Bands of the Data Set and Chemical Assignmentsᵃ [6,8]
| band | tentative assignment |
|---|---|
| ∼3275 cm⁻¹ | stretching O–H symmetric |
| ∼3200–3550 cm⁻¹ | symmetric and asymmetric vibrations attributed to water |
| ∼2930 cm⁻¹ | stretching C–H |
| ∼2800–3000 cm⁻¹ | C–H lipid region |
| ∼2100 cm⁻¹ | combination of hindered rotation and O–H bending (water) |
| ∼1750 cm⁻¹ | lipids: |
| ∼1650 cm⁻¹ | amide I: |
| ∼1550 cm⁻¹ | amide II: δ(N–H) coupled to |
| ∼1450 cm⁻¹ | methyl groups of proteins: δ[(CH₃)] asymmetric |
| ∼1400 cm⁻¹ | methyl groups of proteins: δ[(CH₃)] symmetric |
| ∼1250–1260 cm⁻¹ | amide III: |
| ∼1155 cm⁻¹ | carbohydrates: |
| ∼1225 cm⁻¹ | DNA and RNA: |
| ∼1080 cm⁻¹ | DNA and RNA: |
| ∼1030 cm⁻¹ | glycogen vibration: |
| ∼971 cm⁻¹ | nucleic acids and proteins: |
| ∼960–966 cm⁻¹ | C–O, C–C, deoxyribose |
ᵃ νs = symmetric stretching; νas = asymmetric stretching; and δ = bending.
Figure 3. Principal coordinates analysis (PCoA) scores plot from the unsupervised random forest (URF) model from the mid-infrared (MIR) saliva data set (n = 246).
Confusion Matrix of the Consensus Class of Training and Test Data Setsᵃ
| actual class | TP | TN | FP | FN |
|---|---|---|---|---|
| training data set | ||||
| positive | 82 | 48 | 20 | 15 |
| negative | 48 | 82 | 15 | 20 |
| test data set | ||||
| positive | 38 | 22 | 9 | 3 |
| negative | 22 | 38 | 3 | 9 |
ᵃ TP = true positive; TN = true negative; FP = false positive; and FN = false negative.
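The consensus class combines the class calls of the individual models (GA-LDA, SPA-LDA, PLS-DA, and PSO-PLS-DA) into a single call per sample. The record does not state the voting rule, so the following is only a minimal sketch that assumes a simple majority vote; the function name `consensus_class`, the `tie_label` parameter, and the example predictions are all hypothetical and not taken from the study.

```python
from collections import Counter


def consensus_class(votes, tie_label="positive"):
    """Return the majority label among the per-model votes for one sample.

    Assumption: simple majority voting. With an even number of models a tie
    is possible, so the tie-breaking label is an explicit (hypothetical) choice.
    """
    ranked = Counter(votes).most_common()
    # A tie exists when the two most frequent labels have equal counts.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


# Hypothetical predictions: one label per model for each of three samples.
predictions = {
    "GA-LDA":     ["positive", "positive", "negative"],
    "SPA-LDA":    ["positive", "negative", "negative"],
    "PLS-DA":     ["negative", "positive", "negative"],
    "PSO-PLS-DA": ["positive", "negative", "positive"],
}

# Regroup the per-model lists into per-sample vote tuples, then vote.
per_sample_votes = list(zip(*predictions.values()))
consensus = [consensus_class(votes) for votes in per_sample_votes]
print(consensus)
```

Resolving ties toward the positive class favors sensitivity over specificity, which may or may not match the choice made in the study.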
Quality Parameters of Classification Modelsᵃ
| model | preprocessing | set | POS (samples) | NEG (samples) | outlier (samples) | SENS (%) | SPEC (%) | PREC CL.1 (%) | PREC CL.2 (%) | ACC (%) | MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SPA-LDA | second derivative | train | 97 | 68 | 7 | 69.1 | 63.2 | 72.8 | 58.9 | 66.7 | 0.32 |
| SPA-LDA | second derivative | test | 41 | 31 | 2 | 87.8 | 67.7 | 78.3 | 80.8 | 79.8 | 0.57 |
| GA-LDA | second derivative | train | 97 | 68 | 7 | 86.6 | 67.6 | 79.2 | 77.9 | 78.8 | 0.56 |
| GA-LDA | second derivative | test | 41 | 31 | 2 | 95.1 | 70.9 | 81.2 | 91.7 | 84.7 | 0.69 |
| PLS-DA (7 LV) | second derivative | train | 97 | 68 | 7 | 70.0 | 76.5 | 80.9 | 64.2 | 72.7 | 0.46 |
| PLS-DA (7 LV) | second derivative | test | 41 | 31 | 2 | 75.6 | 74.2 | 79.5 | 69.7 | 75.0 | 0.49 |
| PSO-PLS-DA (9 LV) | second derivative/mean-centered | train | 97 | 68 | 7 | 79.4 | 76.5 | 82.8 | 72.2 | 78.2 | 0.55 |
| PSO-PLS-DA (9 LV) | second derivative/mean-centered | test | 41 | 31 | 2 | 82.9 | 74.2 | 80.9 | 76.7 | 79.2 | 0.57 |
| consensus class | | train | 97 | 68 | 7 | 82.5 | 75.0 | 82.0 | 75.0 | 79.0 | 0.57 |
| consensus class | | test | 41 | 31 | 2 | 93.0 | 74.0 | 83.0 | 88.0 | 85.0 | 0.69 |
ᵃ SENS = sensitivity; SPEC = specificity; PREC CL.1 = precision of class 1, or positive predictive value (PPV); PREC CL.2 = precision of class 2, or negative predictive value (NPV); ACC = accuracy; and MCC = Matthews correlation coefficient.
Figure 4. Most important and selected variables through selection methods and classification models. (A) Genetic algorithm linear discriminant analysis (GA-LDA) selected variables; (B) successive projection algorithm LDA (SPA-LDA) selected variables; (C) particle swarm optimization (PSO) selected variables; (D) most important variables for class 1 (red +) and class 2 (blue +) through partial least squares discriminant analysis (PLS-DA) coefficient values; and (E) most important variables for class 1 (red +) and class 2 (blue +) through PSO-PLS-DA coefficient values.