Alexandre de Fátima Cobre (1), Monica Surek (2), Dile Pontarolo Stremel (3), Mariana Millan Fachi (4), Helena Hiemisch Lobo Borba (5), Fernanda Stumpf Tonin (6), Roberto Pontarolo (7).
1. Pharmaceutical Sciences Postgraduate Program, Universidade Federal Do Paraná, Curitiba, Brazil. Electronic address: alexandrecobre@gmail.com.
2. Pharmaceutical Sciences Postgraduate Program, Universidade Federal Do Paraná, Curitiba, Brazil. Electronic address: monicasurek13@gmail.com.
3. Department of Forest Engineering and Technology, Universidade Federal Do Paraná, Curitiba, Brazil. Electronic address: dile.stremel@gmail.com.
4. Pharmaceutical Sciences Postgraduate Program, Universidade Federal Do Paraná, Curitiba, Brazil. Electronic address: marianamfachi@gmail.com.
5. Department of Pharmacy, Universidade Federal Do Paraná, Curitiba, Brazil. Electronic address: helena.hlb@gmail.com.
6. Pharmaceutical Sciences Postgraduate Program, Universidade Federal Do Paraná, Curitiba, Brazil; H&TRC - Health & Technology Research Center, ESTeSL, Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisbon, Portugal. Electronic address: stumpf.tonin@ufpr.br.
7. Department of Pharmacy, Universidade Federal Do Paraná, Curitiba, Brazil. Electronic address: pontarolo@ufpr.br.
Abstract
OBJECTIVE: To implement and evaluate machine learning (ML) algorithms for the prediction of COVID-19 diagnosis, severity, and fatality and to assess biomarkers potentially associated with these outcomes. MATERIAL AND METHODS: Serum (n = 96) and plasma (n = 96) samples from patients with COVID-19 (acute, severe and fatal illness) from two independent hospitals in China were analyzed by LC-MS. Samples from healthy volunteers and from patients with pneumonia caused by other viruses (i.e. negative RT-PCR for COVID-19) were used as controls. Seven different ML-based models were built: PLS-DA, ANNDA, XGBoostDA, SIMCA, SVM, LREG and KNN. RESULTS: The PLS-DA model presented the best performance for both datasets, with accuracy rates to predict the diagnosis, severity and fatality of COVID-19 of 93%, 94% and 97%, respectively. Low levels of the metabolites ribothymidine, 4-hydroxyphenylacetoylcarnitine and uridine were associated with COVID-19 positivity, whereas high levels of N-acetyl-glucosamine-1-phosphate, cysteinylglycine, methyl isobutyrate, l-ornithine and 5,6-dihydro-5-methyluracil were significantly related to greater severity and fatality from COVID-19. CONCLUSION: The PLS-DA model can help to predict SARS-CoV-2 diagnosis, severity and fatality in daily practice. Some biomarkers typically increased in COVID-19 patients' serum or plasma (i.e. ribothymidine, N-acetyl-glucosamine-1-phosphate, l-ornithine, 5,6-dihydro-5-methyluracil) should be further evaluated as prognostic indicators of the disease.
Abbreviations: LC-MS, liquid chromatography coupled with mass spectrometry; PLS-DA, discriminant analysis by partial least squares; ANNDA, artificial neural networks discriminant analysis; XGBoostDA, gradient boosted tree discriminant analysis; SIMCA, soft independent modelling of class analogy; SVM, support vector machine; KNN, k-nearest neighbours; LREG, logistic regression discriminant analysis.
Introduction
The COVID-19 outbreak has been met by variable responses across countries, especially regarding the adoption of prevention measures. Although science has uncovered much about SARS-CoV-2 and made unprecedented progress in the development of vaccines, there is still great uncertainty as the pandemic continues to evolve (i.e. new active cases and deaths reported worldwide), and no globally recognized effective treatment is available [1,2]. In this scenario, the implementation of further sensitive, accurate, low-cost screening and early diagnostic approaches is paramount for preventing infections and guiding disease stage monitoring [3,4].
In recent years, artificial intelligence (AI), including deep learning (DL) and machine learning (ML) algorithms, has emerged as a useful tool to support the decision-making process in healthcare, as well as drug discovery and disease diagnosis and monitoring [5–7]. Some studies published in the past two years have already employed these methods to guide COVID-19 diagnosis (e.g. using chest computed tomography scans and X-ray images), to characterize biomarkers of disease stage, to identify risk factors of disease severity and mortality and to forecast future outbreaks [8–12]. A recent systematic review conducted by Wang (2021) highlights that AI has the potential to improve the efficiency of existing medical and healthcare systems during the COVID-19 pandemic by additionally assisting with surveillance and public health decision-making [27].
However, although several of these ML-based models for the prediction of COVID-19 diagnosis and severity are available in the literature, limitations to their use in practice still exist. Biomarkers identified by these algorithms as potentially associated with the disease vary widely due to, among other factors, high heterogeneity among patients' clinical profiles (e.g. populations from different regions or countries) and small sample sizes (e.g. retrospective single-centre studies), which reduce external validity and data generalization [13,14]. Additionally, differences in sample preparation and analytical techniques, as well as the increasing number of SARS-CoV-2 variants worldwide, hinder the integration of more streamlined and effective predictive modelling in this field [15–17].
Considering the increase in COVID-19 cases in the past months, mainly caused by the rapid spread of the omicron variant [18], alongside the limited availability of real-time PCR test kits as a consequence of the growing worldwide demand for these products, the aim of this study is to evaluate different ML-based algorithms for the prediction of COVID-19 diagnosis, severity and fatality, as well as to identify new biomarkers associated with these outcomes, using two databases with over 1300 water-soluble and fat-soluble metabolites.
Material and methods
Study design and databases
We evaluated public datasets from two cohorts of patients diagnosed with COVID-19 by RT-PCR in China (Wuhan), including mild and severe cases and deaths associated with the disease (https://drive.google.com/drive/folders/1R_I_gu5D3SkD_9q_J93HOA9GuKxZiGNG) [19]. Blood samples from all diagnosed patients were assessed by ultra-performance liquid chromatography coupled to mass spectrometry (LC-MS).
The first cohort (Dataset I) refers to samples from COVID-19 patients diagnosed in the Wuhan Jinyin Tan Hospital (China) (file name: C2_metaboanalyst_input_full.csv) [19]. A series of samples was registered during the disease course: samples collected at two time points from 14 patients with mild symptoms (14 × 2 = 28 samples) and from 11 patients with severe symptoms (11 × 2 = 22 samples), and samples collected at four time points from 9 patients who died during the study (9 × 4 = 36 samples). Blood samples from 10 healthy volunteers with negative RT-PCR tests were used as controls. Plasma samples from all these patients were also collected and analyzed by ultra-performance liquid chromatography (C18 column) coupled to quadrupole mass spectrometry with an electrospray source (UPLC-ESI-MS/MS). A total of 431 metabolites (both fat-soluble and water-soluble substances) were identified and quantified using an in-house hospital database and the molecular ion fragmentation profile in MS/MS mode. Fragments were compared to data from the international public literature/database. This first database thus included 96 samples and 431 variables (metabolites).
The second cohort (Dataset II) refers to a sample of 46 patients diagnosed with COVID-19 in the Taizhou Hospital (China) (file name: C3_metaboanalyst_input_full.csv) [19]. Blood samples from 25 healthy volunteers (negative RT-PCR) and from 25 patients with pneumonia syndrome but with negative RT-PCR for SARS-CoV-2 were respectively used as negative and positive control groups. Serum samples of all patients were analyzed by UPLC-ESI-MS/MS. This second database accounted for 96 samples and 941 metabolites (identified and quantified using both ionization modes, ESI- and ESI+).
Data preprocessing in ML
Data preprocessing is an important step for metabolomic data analysis and refers to the technique of preparing (i.e. cleaning and organizing) the raw data to make it suitable (i.e. readable) for building and training ML-based models [20,21].
In this study, both COVID-19 datasets (i.e. diagnosis and disease severity) went through different preprocessing methods with the aim of selecting the one that best fit the data: (i) Imputation: missing data were replaced by the median values; (ii) Transformations: absolute value, Log10; (iii) Filtering: baseline (specified points), baseline (weighted least squares), derivative (Savitzky-Golay), smoothing (Savitzky-Golay), detrend, generalized least squares weighting (GLSW), orthogonal signal correction (OSC) and external parameter orthogonalization (EPO); (iv) Normalization: normalize, standard normal variate (SNV) and multiplicative scatter correction (MSC-mean); (v) Scaling and centering: autoscale, group scale, Log decay scaling, mean center, median center, multiway center, multiway scale and sqrt mean scale. All analyses were performed in SOLO software (Eigenvector Research).
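
To make two of the listed steps concrete, the following is a minimal sketch of median imputation followed by autoscaling (mean centering and scaling to unit standard deviation) applied to each metabolite column. It is an illustration only and not the SOLO implementation: the function names, the use of `None` for missing values, the choice of the sample standard deviation and the toy intensities are all assumptions.

```python
from statistics import mean, median, stdev


def impute_median(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def autoscale(column):
    """Mean-center a column and scale it to unit sample standard deviation."""
    mu = mean(column)
    sd = stdev(column)
    if sd == 0:
        # A constant column carries no information; return zeros.
        return [0.0 for _ in column]
    return [(v - mu) / sd for v in column]


def preprocess(matrix):
    """Apply median imputation, then autoscaling, to every metabolite column.

    `matrix` is a list of samples (rows), each a list of metabolite values.
    """
    columns = [list(col) for col in zip(*matrix)]
    cleaned = [autoscale(impute_median(col)) for col in columns]
    return [list(row) for row in zip(*cleaned)]


# Hypothetical toy intensities: 4 samples x 2 metabolites, one missing value.
toy = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [4.0, 20.0]]
scaled = preprocess(toy)
```

In a predictive workflow the imputation and scaling parameters would normally be estimated on the training (calibration) samples only and then reused for the test samples; the sketch omits that distinction for brevity.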
Development of ML-based and prediction models
In this study, an unsupervised ML-based model (principal component analysis, PCA) was initially developed with both datasets for exploratory analysis, aiming to identify the structure of the data and to detect possible anomalous samples [22,23].
For the prediction of COVID-19 diagnosis (Dataset II) and disease severity and fatality (Dataset I), several supervised ML-based models were used: support vector machine (SVM), k-nearest neighbours (KNN), artificial neural networks discriminant analysis (ANNDA), discriminant analysis by partial least squares (PLS-DA), soft independent modelling of class analogy (SIMCA), gradient boosted tree discriminant analysis (XGBoostDA) and logistic regression discriminant analysis (LREG). For the implementation of these classification models, 70% of the data was used for the training set (calibration) and the remaining 30% for the test set.
Sample selection for the training and test sets was performed using the Kennard-Stone algorithm [24]. The classes (groups) of samples from each dataset were individually divided into two subsets (training and test samples), with the test samples used to predict each specific class. Samples from Dataset I (Wuhan, Jinyin Tan Hospital, n = 96 samples) were grouped as follows: class 1 – healthy (10 samples, of which 5 were used for training the model and the remaining 5 for prediction); class 2 – death (36 samples, of which 25 were used for training and 11 for prediction); class 3 – severe COVID-19 (22 samples, of which 15 were used for training and 7 for prediction); and class 4 – mild COVID-19 (28 samples, of which 20 were used for training and 8 for prediction). 
The samples from Dataset II (Taizhou Hospital, n = 96) were categorized into the following classes: class 1 – COVID-19 (46 samples, of which 32 were used for training the models and the remaining 14 for prediction); class 2 – healthy (25 samples, of which 15 were used for training and 10 for prediction); and class 3 – non-COVID-19 (25 samples, of which 15 were used for training and 10 for prediction).
The Venetian blinds cross-validation method was used to select the number of latent variables (LVs) of the ML-based models [25]. LVs yielding lower values of cross-validation error, root mean square error of cross-validation (RMSECV) and root mean square error of calibration (RMSEC) were selected. The root mean square error of prediction (RMSEP) was the metric used to assess the predictive capacity of the ML-based models; models with smaller RMSEP performed better.
Model performance was evaluated considering the metrics of accuracy, sensitivity and specificity. These metrics were calculated using the following figures of merit: false positive (FP), false negative (FN), true positive (TP) and true negative (TN), according to equations (1), (2) and (3):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad (2)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \quad (3)$$
where FP = false positive; FN = false negative; TP = true positive; TN = true negative.
The accuracy of the models was also evaluated considering the area under the receiver operating characteristic (ROC) curve (AUC). The AUC ROC values were calculated considering both sets of samples (i.e. training and test sets).
A VIP (variable importance in projection) graph was built from the ML-based model presenting the best performance, aiming to identify the 'top 10' most important biomarkers for predicting the diagnosis of COVID-19 and the 'top 10' most important biomarkers for predicting disease severity and fatality. The VIP score of an original variable (e.g. a metabolite) is calculated as a weighted sum of the squared correlations between the LVs of the PLS-DA model and that variable. The number of terms in the sum depends on the number of LVs of the PLS-DA model that were considered significant for distinguishing the groups (classes) of samples. The weights correspond to the percentage of variance explained by each LV in the PLS-DA model. An original variable with a VIP score greater than 1 is considered statistically significant for classifying groups (e.g. COVID-19 group vs. healthy individuals). The VIP score is calculated according to equation (4) [26,27]:
$$\mathrm{VIP}_j = \sqrt{\frac{J \sum_{f=1}^{F} w_{jf}^{2}\, SSY_f}{\sum_{f=1}^{F} SSY_f}} \quad (4)$$
where: w_jf = PLS-weight value of variable j in latent variable f; SSY_f = percentage of Y variance explained by each specific latent variable; F = number of latent variables of the PLS-DA model; J = number of X variables.
Analyses were carried out using SOLO (Eigenvector Research) software [28] and the Metaboanalyst 5.0 web server [29]; results obtained with these different tools were qualitatively compared.
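
As an illustration of the sample selection step described earlier in this section, the sketch below implements the Kennard-Stone rule in its usual form: start from the two most distant samples, then repeatedly add the candidate whose nearest already-selected sample is farthest away. This is a generic, unoptimized sketch and not the SOLO implementation; the function name, the Euclidean distance, and the toy coordinates are assumptions. In the study the selection was applied within each class separately.

```python
from math import dist


def kennard_stone(samples, n_select):
    """Return the indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(samples[pair[0]], samples[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Add the candidate whose nearest selected sample is farthest away.
        best = max(
            remaining,
            key=lambda i: min(dist(samples[i], samples[s]) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy class with 10 samples described by 2 variables each.
toy_class = [(float(i), 0.0) for i in range(9)] + [(20.0, 0.0)]
train_idx = kennard_stone(toy_class, 7)  # 70% for training
test_idx = [i for i in range(len(toy_class)) if i not in train_idx]
```

Because the rule always picks the most distant candidates first, the training set spans the extremes of the measured space and the test samples fall inside that range.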
Results
Exploratory analyses
Fig. 1 shows the PCA model for both datasets (Dataset I and II). The preprocessing methods that best suited the model were a combination of imputation using median values, autoscale and GLSW. In both cases, the PCA model was able to discriminate all classes of samples. No outlier sample was identified.
Fig. 1
Exploratory data analysis. (A) PCA model of the Taizhou hospital patient dataset (blood samples from 46 patients with COVID-19 diagnosed by RT-PCR are represented by red triangles; blood samples from 25 patients with pneumonia syndrome but with negative RT-PCR for SARS-CoV-2 are depicted as blue squares; blood samples from 25 healthy volunteers with negative RT-PCR are represented by green circles). (B) PCA model of the Wuhan hospital patient dataset (blood samples from 28 patients with mild COVID-19 are represented by blue squares; blood samples from 22 patients with severe COVID-19 are represented by pink stars; blood samples from 36 deaths by COVID-19 are depicted as red triangles; blood samples from 10 healthy volunteers negative by RT-PCR are depicted as green circles). (C) Graph of leverage versus studentized residuals for the detection of outlier samples in the Taizhou hospital patient dataset (sample n. 55 had high leverage values but was not considered an outlier as it is within ±2.5 standard deviations of studentized residuals). (D) Graph of leverage versus studentized residuals for the detection of outlier samples in the Wuhan hospital patient dataset (sample n. 25 had high leverage values but was not considered an outlier as it is within ±2.5 standard deviations of studentized residuals).
Classification models
Table 1 shows the performance of the ML-based models built using SOLO software. The PLS-DA model showed the best performance for predicting the diagnosis, severity and fatality of COVID-19, with the highest figures of accuracy, sensitivity and specificity (accuracy of 93%, 94% and 97%, respectively; see the ROC curves built with samples from both training and test sets in Fig. 2). The ROC curves using only the training set samples are available in the supplementary material (Fig. S1). Supplementary Material Figs. S2–S7 show the PLS-DA models for predicting each class of samples. The remaining ML-based models (ANNDA, XGBoostDA, SIMCA, SVM, LREG and KNN) showed poor performance in this study.
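
The accuracy, sensitivity and specificity reported for each class follow from confusion-matrix counts obtained by treating that class as positive and all other classes as negative. A minimal sketch of this one-vs-rest calculation is given below; the function names and the toy labels and predictions are hypothetical and are not taken from Table 1.

```python
def one_vs_rest_counts(true_labels, predicted_labels, target):
    """Count TP, TN, FP and FN for one class against all other classes."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == target and guess == target:
            tp += 1
        elif truth != target and guess != target:
            tn += 1
        elif truth != target and guess == target:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def figures_of_merit(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    Assumes at least one positive and one negative sample are present.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test-set labels and predictions (not the study's data).
truth = ["covid"] * 4 + ["healthy"] * 3 + ["non-covid"] * 3
predicted = ["covid", "covid", "covid", "healthy",
             "healthy", "healthy", "covid",
             "non-covid", "non-covid", "non-covid"]

counts = one_vs_rest_counts(truth, predicted, "covid")
merit = figures_of_merit(*counts)
```

Repeating the call with `"healthy"` or `"non-covid"` as the target gives the per-class figures for the remaining groups.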
Fig. 2
Area under the ROC curve of PLS-DA model performance. The areas under the curves (AUC) reflect the accuracy of the PLS-DA models in predicting patients of different COVID-19 classes and healthy volunteers. Curves include both sets of samples (training and test samples). (A) Taizhou hospital dataset (Dataset II): results represent the accuracy in predicting the class of patients with COVID-19 diagnosed by RT-PCR (AUC = 0.93). (B) Taizhou hospital dataset (Dataset II): results represent the accuracy in predicting the class of patients with pneumonia syndrome but with negative RT-PCR for SARS-CoV-2 (AUC = 0.91). (C) Taizhou hospital dataset (Dataset II): results represent the accuracy in predicting the class of healthy volunteers with negative RT-PCR (AUC = 0.94). (D) Wuhan hospital dataset (Dataset I): results represent the accuracy in predicting the class of patients with acute COVID-19 (AUC = 0.95). (E) Wuhan hospital dataset (Dataset I): results represent the accuracy in predicting patients with severe COVID-19 (AUC = 0.94). (F) Wuhan hospital dataset (Dataset I): results represent the accuracy in the classification of deaths by COVID-19 (AUC = 0.97).
Table 1
Performance of the seven ML-based models. Note: TP = true positive; TN = true negative; FP = false positive; FN = false negative.
Fig. 3 and Fig. 4 (VIP graphs of the PLS-DA models) depict the most promising biomarkers for predicting the diagnosis and severity/fatality of COVID-19, respectively. The calculation of the VIP scores of these biomarkers (see equation (4) in the material and methods section) included two parameters: (i) the four LVs selected for the PLS-DA model, as they presented lower calibration (RMSEC) and cross-validation (RMSECV) error values (see Figs. S8–S9 in supplementary material), and (ii) the total variance explained by these four selected LVs, which was 42.84% for block X and 83.53% for block Y.
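
A minimal sketch of how VIP scores can be computed from the two ingredients named above (the PLS weights of the selected LVs and the Y variance explained by each LV) is shown below. It uses the commonly cited VIP definition with each weight vector normalized to unit length, which is an assumption about the exact variant used by the software; the function name and the toy weights and variances are hypothetical and do not reproduce the study's model.

```python
from math import sqrt


def vip_scores(weights, ssy):
    """Compute VIP scores for J variables from F latent variables.

    weights: list of F weight vectors, each with J PLS weights.
    ssy: list of F values, the Y variance explained by each latent variable.
    """
    n_vars = len(weights[0])
    total_ssy = sum(ssy)
    scores = []
    for j in range(n_vars):
        acc = 0.0
        for w_f, ssy_f in zip(weights, ssy):
            # Normalize each weight vector so its squared entries sum to 1.
            norm_sq = sum(w * w for w in w_f)
            acc += ssy_f * (w_f[j] ** 2) / norm_sq
        scores.append(sqrt(n_vars * acc / total_ssy))
    return scores


# Hypothetical toy model: 2 latent variables, 3 metabolites.
toy_weights = [[0.8, 0.6, 0.0], [0.0, 0.6, 0.8]]
toy_ssy = [60.0, 20.0]

vip = vip_scores(toy_weights, toy_ssy)
# Indices of variables above the usual threshold of 1.
important = [j for j, score in enumerate(vip) if score > 1.0]
```

With this definition the squared VIP scores average to 1 across variables, which is why a score above 1 is used as the cut-off for above-average importance.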
Fig. 3
Variable Importance in Projection graph of the most important biomarkers for COVID-19 diagnosis (top 10). X axis represents all analyzed metabolites; Y axis represents the VIP score that reflects the importance of each metabolite in the prediction of the different classes of the samples (COVID-19 represented by the red color, non-COVID-19 by blue, and healthy volunteers in green). The black dashed line parallel to the X axis represents the VIP score threshold (VIP score threshold = 1). Metabolites significantly contributing to the prediction of the different classes of the samples are above the threshold (VIP score> 1); the top 10 biomarkers were highlighted in the figure.
Fig. 4
Variable Importance in Projection graph of the most important biomarkers for COVID-19 severity and fatality. X axis represents all analyzed metabolites; Y axis represents the VIP score that reflects the importance of each metabolite in the prediction of the different classes of the samples (healthy individuals are depicted in blue, mild COVID-19 is green, severe COVID-19 is in red and death is colored in black). The black dashed line parallel to the X axis represents the VIP score threshold (VIP score threshold = 1). Metabolites significantly contributing to the prediction of the different classes of the samples are above the threshold (VIP score> 1); the top 10 biomarkers were highlighted in the figure.
The VIP graph from the PLS-DA model revealed that the most important biomarkers for predicting COVID-19 diagnosis were dibutyl sulfosuccinate, ortho-cresol sulphate, beta-alanine, 4-vinylguaiacol sulphate, 4-hydroxyphenylacetoylcarnitine, ribothymidine, glycerophosphoserine and uridine (see Fig. 5). The last three biomarkers were found in extremely low concentrations (decreased by factors of two, three and four) in patients diagnosed with COVID-19 when compared to negative control samples.
Fig. 5
Profile of the top 10 blood biomarkers associated with the diagnosis of COVID-19. Results are grouped according to the classes: healthy (n = 25), non-COVID-19 (n = 25) and COVID-19 (n = 46). Boxes indicate the interquartile ranges (median); horizontal lines indicate minimum and maximum values.
As for the prediction of COVID-19 severity and fatality, six biomarkers were highlighted in the model as most probably associated with these outcomes: cyclohexylamine, methyl isobutyrate, 2-undecanone, cysteinylglycine, N-acetyl-glucosamine-1-phosphate and 5,6-dihydro-5-methyluracil, which were increased by factors of three, four, five, six, seven and seven, respectively, in patients with severe COVID-19 when compared to those with acute disease (see Fig. 6).
Fig. 6
Profile of the top 10 blood biomarkers associated with the COVID-19 severity and fatality. Results are grouped according to the classes: healthy (n = 10), mild COVID-19 (n = 28), severe COVID-19 (n = 22) and death (n = 36). Boxes indicate interquartile ranges (median); horizontal lines indicate minimum and maximum values.
The above-mentioned analyses were also re-run using Metaboanalyst 5.0; results are available in the supplementary material (Figs. S10–S13). In this case, the PCA model was not able to distinguish the samples from the three classes (COVID-19, non-COVID-19, healthy individuals) using the diagnostic data; the accuracy was below 80% (i.e. lower than the one obtained in our study [93–94%] using SOLO software). Similarly, although the PCA model of the severity data was able to distinguish healthy patients from deaths, it was not able to separate the other two groups (mild COVID-19 vs. severe COVID-19). The accuracy of this model was around 80%, lower than the one obtained using SOLO (94%–97%).
We also found differences in the identification of biomarkers from the PLS-DA models obtained using SOLO vs. Metaboanalyst 5.0 software (see Table 2). The analyses performed in Metaboanalyst 5.0 showed the following metabolites at extremely low levels in patients with COVID-19 compared with those without the disease or healthy volunteers: linoleate, palmitate, urea, lactate, carnitine, proline, glycerophosphoethanolamine, stearate and phenylalanine (see Fig. 7). On the other hand, the metabolites 6-methylmercaptopurine, 4-dihydroxybenzeneacetic acid, l-phenylalanine, formylanthranilic acid, terephthalic acid and phthalic acid were found in high concentrations in patients who died from COVID-19 (see Fig. 8).
Table 2
Biomarkers for predicting the diagnosis and severity/fatality of COVID-19 according to the PLS-DA models from SOLO software vs. Metaboanalyst 5.0. Diagnosis columns refer to Dataset II (Taizhou Hospital); severity/fatality columns refer to Dataset I (Wuhan Hospital).
Rank | Diagnosis: SOLO | Diagnosis: Metaboanalyst | Severity/fatality: SOLO | Severity/fatality: Metaboanalyst
1 | Dibutyl sulfosuccinate | Oleate | N-acetyl-glucosamine-1-phosphate | 5-Triiodo-l-thyronine
2 | O-cresol sulfate | Linoleate | Cys-glycine | l-Thyroxine
3 | Beta-alanine | Palmitate | 5,6-Dihydro-5-methyluracil | 2-Furanmethanol
4 | Sphingosine 1-phosphate | Urea | Methyl isobutyrate | 4-Nitrophenol
5 | 4-Vinylguaiacol sulfate | Lactate | 2-Undecanone | 6-Methylmercaptopurine
6 | 4-Hydroxyphenylacetoylcarnitine | Carnitine | Pantothenate | 4-Dihydroxybenzeneacetic acid
7 | 1,2-Dilinoleoyl-GPC (18:2/18:2) | Proline | l-Ornithine | l-Phenylalanine
8 | 5-Methyluridine | Glycerophosphoethanolamine | 2,6-Dihydroxypurine | l-Citrulline
9 | Glycerophosphoserine | Stearate | Cyclohexylamine | Formylanthranilic acid
10 | Uridine | Phenylalanine | 7-Methyluric acid | Phthalic acid
Fig. 7
Variable Importance in Projection (VIP) graph of the most important biomarkers for COVID-19 diagnosis (web server Metaboanalyst 5.0). The Y axis represents the top 10 most important metabolites in predicting COVID-19 diagnosis and the X axis represents the VIP score that reflects the importance of each metabolite in the prediction of the different classes of the samples (COVID-19, non-COVID-19, and healthy volunteers). The change from blue to red color is proportional to the increase in the intensity of the biomarker signal.
Fig. 8
Variable Importance in Projection (VIP) graph of the most important biomarkers for COVID-19 severity/fatality (web server Metaboanalyst 5.0). The Y axis represents the top 10 most important metabolites in predicting the severity of COVID-19 and the X axis represents the VIP score that reflects the importance of each metabolite in the prediction of the different classes of the samples (death, severe COVID-19, mild COVID-19, healthy individuals). The change from blue to red color is proportional to the increase in the intensity of the biomarker signal.
Discussion
We developed and evaluated seven different ML-based models (ANNDA, PLS-DA, XGBoostDA, SIMCA, SVM, LREG and KNN) to predict the diagnosis of COVID-19, as well as disease severity and fatality, using plasma and serum samples of patients from two reference hospitals in China. Over 1300 water-soluble and fat-soluble metabolites were assessed. As the course of COVID-19 is extremely variable among patients (especially due to emerging SARS-CoV-2 mutations and biological variability), the metabolomic profile of these cases is also uncertain, thus requiring more robust ML-based models [[30], [31], [32]].
In our study, the PLS-DA model presented the best performance (AUC ROC 87%–97%), with performance figures similar to those of other ML-based models available in the literature in this field (AUC ROC 70%–99%) [33]. The PLS-DA model is currently one of the most commonly used ML-based algorithms for analyzing data from metabolomics and other omics sciences (e.g. genomics, transcriptomics and proteomics), being recommended by experts in the field [34,35]. In fact, a systematic review assessing the number of citations of studies published between 1990 and 2018 and available in Web of Science showed an increase in publications citing PLS-DA (n = 2242), while other algorithms (e.g. ANN, SVM, RF, logistic regression, deep learning) were less commonly mentioned (n = 500) [36]. The main reasons for this include the intrinsic characteristics of PLS-DA, which is considered a versatile algorithm with predictive and descriptive advantages over other models. A recent study performed by Mendes et al. 
(2019) [35] compared the predictive performance of eight ML algorithms (PLS-DA, ANN, non-ANN, random forest (RF), radial basis function kernel support vector machines (SVM), logistic regression and principal components regression (PCR)), using 10 clinical metabolomics datasets available in the Metabolights and Metabolomics Workbench repositories, and reported PLS-DA as the model with the best performance [35].
PLS-DA projects highly multivariate data into a lower-dimensional space whose coordinates are called latent variables (LV, analogous to principal components). These LV describe the covariance between input data (e.g. metabolites) and output data (e.g. sample class) before regressing to a dependent variable. This allows datasets with more variables than samples to be modeled without resorting to pre-screening variables (essential for hypothesis-generating studies). Additionally, as it considers LV, problems regarding multicollinearity between the different metabolites in any biological system can be avoided (i.e. LV do not correlate with each other) [35,37]. Once optimized, the PLS-DA model can be reduced to a common linear regression model, making it possible to predict the value of each metabolite/biomarker in the dataset [34,35]. Other ML-based models, including multilayer ANN, usually require larger sample sizes to achieve a high predictive performance. As a consequence, the number of variables included in these models is usually smaller than the number of samples [5,[38], [39], [40], [41]], which is not the scenario of most metabolomic datasets [[42], [43], [44]]. In our study, we were able to use a dataset including 180 samples and 1300 variables.
Given the great complexity of metabolomics data and the intrinsic properties of this information, missing values, heteroscedasticity, poorly informative parameters, and biological variability are commonplace. Data preprocessing is thus paramount to improve the quality of information by transforming the raw data matrix into a ‘cleaner’ set [29,45,46]. 
Several preprocessing strategies are available, including missing data imputation, filtering, transformations, sample-based normalization, metabolite-based normalization, combined sample- and metabolite-based normalization, and internal standard-based normalization [[46], [47], [48]]. This process may be conducted in free online tools such as MetaboAnalyst, NOREVA, ANPELA, NormalizeMets, MMEASE and Data Analysis [29,[46], [47], [48], [49], [50], [51], [52], [53], [54], [55]]. Another challenge in metabolomic analysis is the integration of data from different experiments and the simultaneous removal of unwanted biological and experimental variations [55]. MMEASE is an online tool that allows merging these data and removing the effect of unwanted variations between samples, which increases the efficiency of statistical analyses and leads to more robust and reliable results [55]. Data are merged according to the alignment ID for retention time (RT) and exact mass (m/z) of a given metabolite considered as a reference. If both RT and m/z of the reference metabolite fall within the tolerable range, this procedure is automatically applied to the metabolites in the chromatograms of the remaining samples from other experiments [55]. An alternative way to reduce batch effects is the Z-score method, which transforms the data to a mean of zero and a standard deviation of 1, normalizing the distribution of analytical signals [49]. In our study, as only spectral data were available (RT of metabolites were absent), the MMEASE method was not employed. However, the datasets were standardized using the GLSW preprocessing method, which calculates a filter matrix based on the differences between groups of samples that should otherwise be ‘similar’ [28,56]. According to this method, in the case of classification problems, ‘similar’ samples would be the same samples analyzed on different instruments or in different periods [28,56]. 
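As a hedged illustration of the Z-score approach mentioned above (not the GLSW filtering actually applied in this study), the following Python sketch standardizes the signal of one metabolite separately within each batch. The function name `zscore_by_batch`, the batch labels and the intensity values are hypothetical and serve only as an example; the use of the sample standard deviation is also an assumption.

```python
from statistics import mean, stdev

def zscore_by_batch(values, batches):
    """Z-score one metabolite's intensities separately within each batch."""
    scaled = [0.0] * len(values)
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        batch_values = [values[i] for i in idx]
        mu = mean(batch_values)
        # Sample standard deviation (assumption); needs at least two samples.
        sd = stdev(batch_values) if len(batch_values) > 1 else 0.0
        for i in idx:
            # A constant signal carries no information, so it is mapped to 0.
            scaled[i] = (values[i] - mu) / sd if sd > 0 else 0.0
    return scaled

# Hypothetical intensities of one metabolite acquired on two instruments.
intensities = [10.0, 12.0, 14.0, 100.0, 120.0, 140.0]
batches = ["A", "A", "A", "B", "B", "B"]
scaled = zscore_by_batch(intensities, batches)
print(scaled)
```

After this transformation both hypothetical batches share the same scale, so a tenfold difference in raw instrument response no longer dominates the comparison between samples. Note that per-batch scaling also removes any genuine biological difference between batch means, which is one reason why filter-based methods may be preferred.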
In our study, we used two databases of patients with COVID-19 from two different experiments, whose samples were analyzed on two different models of LC-MS instruments and in different time periods [28,56].
Methodological procedures for optimizing the processing of metabolomics data are better described in the guidelines of NOREVA (Normalization and Evaluation of MS-based Metabolomics Data), NormalizeMets, MMEASE, MetaboAnalyst and ANPELA [29,[46], [47], [48], [49], [50], [51], [52], [53], [54], [55]]. Data preprocessing is usually performed in five steps: data filtering and missing value imputation (S1), quality control samples correction (S2), data transformation (S3), data normalization (S4) and performance assessment (S5). During S1, filtering focuses on removing uninformative features considered as intrinsic properties of the metabolomic data, while imputation seeks to replace missing or invalid values arising from technical/biological reasons with specific values based on available information, thus preserving the structure of the dataset and reducing the imprecision or limitation of the analyses. The correction based on quality control samples (S2) aims to reduce interference from harmful or uncontrollable signals in the metabolomic data to guarantee the stability and consistency of the data. This makes it possible to correct problems related to the variation in signal strength, intra- and inter-sample variability, and deviations in quality accuracy. In stages S3 and S4, the transformation and normalization of the metabolomic data aim to correct problems of heteroscedasticity and unwanted variations, transforming the distribution of asymmetric data into symmetrical ones, while preserving the existing variables. 
Finally, S5 consists of evaluating the performance of the preprocessing based on five criteria: (i) ability to reduce intragroup variation among samples (metric: pooled median absolute deviation); (ii) effect on differential metabolic analysis (metric: purity); (iii) method's consistency in markers discovered from different datasets (metric: relative weighted consistency); (iv) method's influence on classification accuracy (metric: area under the curve); (v) level of correspondence between normalized and reference data (metric: log fold changes of the concentrations) [29,[46], [47], [48], [49], [50], [51], [52], [53], [54], [55]]. In our study, the above-mentioned steps of data preprocessing were followed (i.e. imputation of missing values and filtering of data by means of GLSW were employed [28,57]; data were normalized using autoscale [52]). Median imputation is a widely used method in metabolomics because, unlike the mean, the median is not affected by extreme values (outliers), which preserves the structure of the data and provides a more reliable value for the dataset [28,52]. Autoscale is an approach based on mean-centering followed by the division of each column or variable (e.g. protein or any other metabolite) by the column standard deviation, assuming that all metabolites are equally important [28,57].
Recent studies using blood or urine samples from patients diagnosed with COVID-19 highlighted that some biomarkers predict the severity and fatality of the disease. Yao et al. (2020), by using the SVM model, found that high levels of neutrophils were associated with more severe cases [58], while Patterson et al. (2020), through the random forest model, highlighted that an increase in interleukin 6 (IL-6) and interferon-gamma (IFN-γ) is related to a worse prognosis [59]. Conversely, using SOLO (Eigenvector Research) software we found different and new biomarkers potentially associated with the disease course. 
High levels of ribothymidine, 4-hydroxyphenylacetoylcarnitine and uridine were associated with COVID-19 positivity, whereas high levels of N-acetyl-glucosamine-1-phosphate, cysteinylglycine, methyl isobutyrate, ornithine and 5,6-dihydro-5-methyluracil were related to COVID-19 severity and fatality. Differences among study findings may be due to the different samples (i.e. type of sample, origin), the multifactorial pathophysiological course of the disease [60] that has not yet been fully elucidated [61,62], as well as the different analytical methods/models employed by the authors. Regarding the latter, we also found that the analyses conducted in SOLO software resulted in models with higher predictive performance compared to those from Metaboanalyst 5.0 and identified different biomarkers for COVID-19 diagnosis and severity/fatality prediction (see qualitative comparison in Table 2). This may be due to differences in the preprocessing methods. While SOLO enables the combination of autoscale and GLSW, Metaboanalyst 5.0 applies only the former, suggesting that GLSW was, in this case, a determinant factor for obtaining more robust models. Although SOLO is not free software, it allows the selection of different preprocessing strategies, providing further autonomy to analysts, which should be considered when developing ML-based studies.
Currently, COVID-19 is broadly considered a viral respiratory and vascular illness. Yet, it can affect other major organs such as those of the gastrointestinal tract and the hepatobiliary, cardiovascular, renal, and central nervous systems. Recent evidence shows that SARS-CoV-2 can cause dysbiosis in the faecal microbiota and modify the oral and respiratory tract microbiome, leading to changes in the levels of several microbial metabolites in the blood or in their metabolic pathways [[63], [64], [65]]. 
Although evidence on the matter is still scarce, it has been reported that the microbiota is responsible for around 50% of all blood metabolites [65], which raises questions about its role in multifactorial diseases, such as COVID-19 [63].
Li et al. (2019), by evaluating the nasopharyngeal microbiota profile of patients with COVID-19, found that positive samples were significantly enriched with the signature of two bacterial taxa (Cutibacterium and Lentimonas) and had a lower abundance of other bacterial taxa, including Prevotellaceae. The latter is a family of the phylum Bacteroidetes commonly found in the oral and faecal microbiota, recently associated with the metabolite ribothymidine (methylated nucleoside), which was increased in COVID-19-positive samples in our study. When overexpressed, these proteins actively contribute to the severity of pneumonia and pneumonia-like symptoms and are thus potential biomarkers for disease diagnosis and severity [66,67]. Similarly, high levels of 2-undecanone, a long-chain volatile organic compound usually produced during hospital-acquired bacterial infections caused by Pseudomonas aeruginosa [68,69], may be associated with severe cases of respiratory infections, including COVID-19. This substance can be found in patients with cystic fibrosis [70]. In fact, pulmonary fibrosis is a serious complication of some viral pneumonias, often leading to dyspnoea and impaired lung function. Patients with confirmed COVID-19 were found to have different degrees of pulmonary fibrosis at and after hospital discharge [71]. Sphingosine 1-phosphate (a product of membrane sphingolipid metabolism or secreted from cells) acts through G protein-coupled receptors and regulates immune cell trafficking, diverse immunological processes and fibrosis [72]. The pathway of this metabolite is implicated in normal pulmonary vasculature function; it appears to be impaired in acute lung dysfunction, while it is induced during chronic fibrosis. 
Further studies on the alteration of levels of this compound in COVID-19 are needed to elucidate its role in infection.
Another microbial metabolite, now associated with oral bacteria causing caries and periodontitis (e.g. Porphyromonas gingivalis, Prevotella sp. and Tannerella forsythia), is methyl isobutyrate [73]. Metagenomic analyses of patients infected with SARS-CoV-2 demonstrated high reads of cariogenic and periodontopathic bacteria, endorsing the notion of a connection between the oral microbiome and COVID-19 complications [73]. We also found high levels of cyclohexylamine (a potentially carcinogenic compound eliminated in the urine) in patients with severe COVID-19. This probably occurs due to another dysbiosis caused by SARS-CoV-2, which allows the hyperproliferation of intestinal bacteria that metabolize cyclamate (an artificial sweetener still used in some food categories in China) [74,75]. Other compounds that are commonly found in foods and manufactured products (e.g. tobacco smoke) are the cresols (xenobiotics). O-cresol and 4-vinylguaiacol are converted to sulphates through phase II metabolism (i.e. a joint process between the microbiome and the host), and eliminated through the urine [76]. Previous studies demonstrated low levels of o-cresol sulphate and 4-vinylguaiacol sulphate in COVID-19 patients, which may be due to the high rates of urinary elimination of these metabolites (e.g. possible kidney damage caused by the disease) [77].
COVID-19 might also negatively impact body weight and nutritional status [78]. This may occur due to loss of appetite and reduced nutrient intake, patients’ fear and stress regarding the disease, and metabolic alterations caused by the infection. For instance, the metabolite 4-hydroxyphenylacetoylcarnitine, found to be increased in patients with COVID-19 in our study, belongs to tyrosine metabolism and has been previously associated with overweight in patients with metabolic syndrome. 
Other studies also reported an increase in inflammation and serum levels of leptin in COVID-19 patients, as in other infectious diseases, which can contribute to anorexia [[79], [80], [81]]. These metabolites should be further investigated as potential biomarkers of viral infection severity.
Another important metabolite is uridine, a pyrimidine nucleoside used in RNA synthesis that is associated with glucose homeostasis, lipid and amino acid metabolism, regulation of glycogen synthesis and lipid deposition [82]. During its catabolism, uridine is converted into β-alanine, followed by secretion to the brain and muscle tissues. β-Alanine and histidine are components of carnosine, a molecule with proven anti-inflammatory, antioxidant and anti-glycating effects [83]. In our first model, levels of β-alanine were found to be low in COVID-19 patients, while those of uridine were high. This may indicate inhibition of uridine catabolism during the course of the infection. A recent study found a significantly low ratio of arginine/ornithine among adults and children infected with SARS-CoV-2. Ornithine and citrulline are amino acids resulting from the breakdown of arginine by the arginase enzyme. The depletion of these substances may contribute to endothelial dysfunction, T-cell dysregulation and coagulopathies that are commonly observed in COVID-19 [84]. The high level of ornithine in COVID-19 patients that was reported in our study may indicate increased activity of the arginase enzyme.
N-acetyl-glucosamine-1-phosphate (GlcNAc-1-P) is a substrate of the biosynthetic pathway of hexosamines, converted by the enzyme UDP-GlcNAc pyrophosphorylase into UDP-GlcNAc (this metabolite can use the O-glycosylation route). This conversion is an important step in the production of cytokines during influenza virus infection, as demonstrated in in vivo murine models [85]. 
Researchers believe that inhibition of the hexosamine pathway is a mechanism used by respiratory viruses, including SARS-CoV-2, to infect host cells [86,87]. The elevated level of GlcNAc-1-P in patients with severe COVID-19 suggests a potential modification of the hexosamine biosynthetic pathway. Additionally, as GlcNAc-1-P is an intracellular component, its presence in the plasma indicates the existence of cellular damage. SARS-CoV-2 infection leads to pyroptosis, which is usually more prevalent in severe cases. More than half of hospitalized COVID-19 patients present high levels of lactate dehydrogenase, another marker of cell damage [88,89]. Regulators of oxidative stress such as cysteinylglycine, an intermediate metabolite in the glutathione metabolic pathway, have also been associated with cell damage in viral diseases. High levels of oxidized cysteinylglycine were reported in HIV-infected individuals and were also related to a higher risk of lung damage in COVID-19, probably due to increased oxidative stress [90,91]. Other metabolites such as 5,6-dihydro-5-methyluracil (dihydrothymine), an intermediate breakdown product of thymine, may act as markers of DNA damage [92]. We found that the levels of this substance were high in patients with severe COVID-19, and recent studies demonstrated that the spike protein from SARS-CoV-2 can inhibit repair of damaged DNA [93]. Other metabolites identified at extremely low levels in patients with COVID-19 (e.g. linoleate, palmitate, urea, lactate, carnitine) when using the Metaboanalyst 5.0 software, or those found at high levels in patients who died from the disease (e.g. 6-methylmercaptopurine, l-phenylalanine, terephthalic acid), should also be further evaluated.
Our study has some limitations. Although we used approximately 1300 different biomarkers for model training and validation, these may not accurately represent the universe of metabolites available in the blood. 
Yet, it was possible to obtain models with high performance (accuracy >90%) for the prediction of diagnosis, severity and fatality of COVID-19 that can be used in daily practice. Seven different ML-based models grounded in data from two different sets from China were built in our study; however, other datasets and algorithms may lead to different findings.
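To make the latent-variable idea behind PLS-DA discussed above more concrete, the following Python sketch fits a two-class model with a single latent variable on a toy dataset that has more variables than samples. It is a minimal illustration under stated assumptions and not the SOLO or Metaboanalyst implementation used in this study: the function names `pls_da_one_component` and `predict`, the 0/1 class coding, the 0.5 decision threshold and all numeric values are hypothetical, and no cross-validation or selection of the number of latent variables is performed.

```python
def pls_da_one_component(X, y):
    """Fit a one-latent-variable PLS-DA model (X: samples x features, y: 0/1)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # mean-centre X
    yc = [v - y_mean for v in y]                                # mean-centre y
    # Weights: proportional to the covariance of each feature with the class.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: projection of each sample onto the latent variable.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress the centred class label on the scores.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_mean, y_mean, w, q

def predict(model, row):
    """Assign class 1 when the predicted response exceeds 0.5 (assumed cut-off)."""
    x_mean, y_mean, w, q = model
    score = sum((row[j] - x_mean[j]) * w[j] for j in range(len(w)))
    return 1 if y_mean + q * score > 0.5 else 0

# Hypothetical toy data: 4 samples, 6 metabolite signals (more variables than samples).
X = [
    [1.0, 2.0, 0.5, 3.0, 1.0, 0.2],
    [1.2, 1.8, 0.4, 3.1, 0.9, 0.3],
    [3.0, 0.5, 2.0, 1.0, 2.5, 1.5],
    [2.8, 0.7, 2.2, 0.9, 2.4, 1.4],
]
y = [0, 0, 1, 1]
model = pls_da_one_component(X, y)
fitted = [predict(model, row) for row in X]
new_sample = predict(model, [2.9, 0.6, 2.1, 1.0, 2.3, 1.6])
print(fitted, new_sample)
```

Because the six variables are summarized by one score per sample before the regression step, the model can be fitted even though the variables outnumber the samples. In practice, additional latent variables and proper validation on independent samples would be needed.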
Conclusion
In this study, seven different ML-based algorithms (PLS-DA, KNN, XGBoost, SVM, ANN, SIMCA and LREG) were built to predict the diagnosis, severity and fatality of COVID-19 using two different databases. The PLS-DA model presented the best performance, with an accuracy of approximately 93%. This model can aid in the early diagnosis of COVID-19 and guide disease management with additional interventions tailored to daily practice. Finally, some of the biomarkers associated with the diagnosis and prognosis of COVID-19 found in the sample set of our study (i.e. 5,6-dihydro-5-methyluracil, cysteinylglycine, ribothymidine, sphingosine 1-phosphate, cyclohexylamine, uridine and ornithine) have previously been mentioned in the scientific literature, which reinforces their role in infection. Conversely, we reported for the first time additional biomarkers (i.e. N-acetyl-glucosamine-1-phosphate and 4-hydroxyphenylacetoylcarnitine) that should be evaluated further as prognostic indicators of COVID-19.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.