| Literature DB >> 35891394 |
Athar Khalil, Khalil Al Handawi, Zeina Mohsen, Afif Abdel Nour, Rita Feghali, Ibrahim Chamseddine, Michael Kokkolaras.
Abstract
The rapid spread of the coronavirus disease COVID-19 has imposed clinical and financial burdens on hospitals and governments attempting to provide patients with medical care and implement disease-controlling policies. The transmissibility of the disease was shown to be correlated with the patient's viral load, which can be measured during testing using the cycle threshold (Ct). Previous models have utilized Ct to forecast the trajectory of the spread, which can provide valuable information to better allocate resources and change policies. However, these models combined other variables specific to medical institutions or came in the form of compartmental models that rely on epidemiological assumptions, all of which could impose prediction uncertainties. In this study, we overcome these limitations using data-driven modeling that utilizes Ct and the previous number of cases, two institution-independent variables. We collected three groups of patients (n = 6296, n = 3228, and n = 12,096) from different time periods to train, validate, and independently validate the models. We used three machine learning algorithms and three deep learning algorithms that can model the temporal dynamic behavior of the number of cases. The endpoint was the 7-week forward number of cases, and predictions were evaluated using mean square error (MSE). The sequence-to-sequence model showed the best prediction during validation (MSE = 0.025), while polynomial regression (OLS) and support vector machine regression (SVR) performed better during independent validation (MSE = 0.1596 and MSE = 0.16754, respectively), indicating the better generalizability of the latter models. The OLS and SVR models were then applied to a dataset from an external institution and showed promise in predicting COVID-19 incidences across institutions. These models may support clinical and logistic decision-making after prospective validation.
Keywords: COVID-19; Ct values; deep neural networks; machine learning; now-casting; predictive modeling; statistical analysis; viral load
Year: 2022 PMID: 35891394 PMCID: PMC9317659 DOI: 10.3390/v14071414
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.818
Figure 1. Scatter plot of biweekly mean Ct values and the observed number of cases nationwide, showing a clear negative correlation that is statistically significant (p-value < 0.05).
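The negative Ct–incidence relationship in Figure 1 is a plain Pearson correlation between the two series; a minimal sketch (with made-up numbers, not the paper's data) is:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative values only: higher mean Ct (lower viral load) paired with
# fewer subsequent cases yields a strongly negative r.
ct = [20.1, 22.4, 24.0, 26.3, 28.7, 30.2]
cases = [900, 700, 520, 400, 260, 150]
r = pearson_r(ct, cases)  # strongly negative, close to -1
```

A significance test on r (as in the figure's p-value) would additionally require the sample size, e.g. via a t-statistic with n − 2 degrees of freedom.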
Figure 2. Structure of the sequence-to-sequence (S2S) model used for now-casting the weekly number of cases. The left side of the network is the encoder that uses past information on Ct and the number of cases to create context vectors used to initialize the hidden and cell states of the decoder LSTM cells.
Optimal hyperparameters of different models.
| Hyperparameter | Value | Possible Values |
|---|---|---|
| **Sequence-to-sequence model (S2S)** | | |
| Sliding window size | 6 | 1–40 |
| Number of hidden neurons | 1500 | 1–2500 |
| Probability of dropout | 0.8 | 0.0–0.9 |
| Number of hidden layers | 2 | 1–5 |
| Teacher forcing probability | 0.3 | 0.0–0.9 |
| Learning rate | | |
| Batch size | 32 | 4–128 |
| Best epoch | 31 | 1– |
| **Sequence completion model (SEQ)** | | |
| Number of hidden neurons | 2500 | 1–2500 |
| Probability of dropout | 0.8 | 0.0–0.9 |
| Number of hidden layers | 3 | 1–5 |
| Learning rate | | |
| Batch size | 64 | 4–128 |
| Best epoch | 1 | 1– |
| **Deep neural network (DNN)** | | |
| Sliding window size | 6 | 1–40 |
| Number of hidden neurons | 1000 | 1–2500 |
| Probability of dropout | 0.9 | 0.0–0.9 |
| Number of hidden layers | 1 | 1–5 |
| Learning rate | | |
| Batch size | 4 | 4–128 |
| Best epoch | 4 | 1– |
| **Support vector machine regression (SVR)** | | |
| Sliding window size | 6 | 1–40 |
| Ridge factor | | |
| Margin of tolerance | | |
| Stopping criteria tolerance | 0.1 | 1–5 |
| Learning rate | | |
| **Gradient boosting machine (GBM)** | | |
| Sliding window size | 36 | 1–40 |
| Subsample fraction | 0.8 | 0.1–1.0 |
| Maximum portion of features | 0.1 | 0.1–1.0 |
| Decision tree maximum depth | 7 | 1–5 |
| Learning rate | 0.01 | |
| Maximum number of boosting stages | 5000 | 50–5000 |
| **Polynomial regression (OLS)** | | |
| Sliding window size | 6 | 1–40 |
| Ridge factor | 1.0 | |
| Degree | 1 | 1–5 |
| **Common fixed parameters** | | |
| Output window size (all models) | 7 | 1–40 |
| Maximum number of epochs (all models) | 5000 | |
| Kernel (SVR) | linear | |
| Early stopping patience (S2S, SEQ, DNN) | 200 | |
| Optimizer (S2S, SEQ, DNN) | Adam | |
The tuned hyperparameters of each model are reported underneath it. The fixed hyperparameters are reported at the bottom of the table.
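The sliding-window setup in the table (an input window of 6 past weeks and an output window of 7 weeks for most models) amounts to slicing the weekly series into supervised input/target pairs. A minimal sketch with a placeholder series (not the paper's data):

```python
def make_windows(series, in_size=6, out_size=7):
    """Slide a window over a weekly series: each sample pairs the last
    `in_size` observations with the next `out_size` targets."""
    X, y = [], []
    for i in range(len(series) - in_size - out_size + 1):
        X.append(series[i:i + in_size])
        y.append(series[i + in_size:i + in_size + out_size])
    return X, y

weekly_cases = list(range(20))  # placeholder weekly series
X, y = make_windows(weekly_cases)
# 20 weeks yield 8 samples: each with 6 inputs and 7 targets
```

In the paper both Ct values and past case counts feed the models, so the real input windows would stack both series per time step.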
Figure 3. Cross-validation and hyperparameter determination scheme for model development. Using the discovery group (Group 1), the inner loop tunes the model's hyperparameters by minimizing the average k-fold cross-validation error with a stochastic direct search algorithm or a grid search. The second loop (following tuning) generates several models randomly and bins them by training error. The model with the lowest training error is then evaluated on the test group to obtain the testing error.
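The inner-loop idea (choose the hyperparameter value with the smallest average k-fold validation error) can be sketched in miniature. The one-parameter ridge model, the data, and the grid below are hypothetical stand-ins for the paper's models and search spaces:

```python
def kfold(n, k):
    """Contiguous index folds for k-fold cross-validation."""
    size = n // k
    return [list(range(i * size, (i + 1) * size if i < k - 1 else n))
            for i in range(k)]

def fit_ridge_slope(pairs, lam):
    """Closed-form 1-D ridge regression: slope = sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, y in pairs)
    return sxy / (sxx + lam)

def cv_mse(data, lam, k=3):
    """Average held-out MSE across folds for ridge factor `lam`."""
    folds = kfold(len(data), k)
    total = 0.0
    for f in folds:
        held = set(f)
        train = [data[i] for i in range(len(data)) if i not in held]
        val = [data[i] for i in f]
        w = fit_ridge_slope(train, lam)
        total += sum((y - w * x) ** 2 for x, y in val) / len(val)
    return total / k

data = [(float(x), 2.0 * x) for x in range(1, 13)]  # noiseless y = 2x
grid = [0.0, 0.1, 1.0, 10.0]                        # candidate ridge factors
best = min(grid, key=lambda lam: cv_mse(data, lam))
```

On this noiseless toy data the unregularized fit wins (best = 0.0); with noisy data a positive ridge factor typically minimizes the cross-validation error, which is the point of the inner loop.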
Figure 4. (A) Bi-weekly mean Ct values of RHUH patients. The solid line represents the median bi-weekly Ct values, and the gray shaded area represents the inter-quartile range (25–75 percentile) of the observed Ct values. (B) The gray bars show the weekly running average of the number of cases observed nationwide in Lebanon between 1 March 2020 and 7 December 2020 (the running average can be computed until 23 November). The solid black line represents the growth rate in the weekly number of cases.
Figure 5. Predicted 7-day rolling average of the daily number of cases on the unseen data set using (A) the sequence-to-sequence (S2S) model, (B) the stacked LSTM (SEQ), (C) the feedforward neural network (DNN), (D) the support vector machine regression (SVR) model, (E) the gradient boosting machine (GBM), and (F) the polynomial regression (OLS) model. All models were tuned using the cross-validation error of the discovery set. The gray shaded region represents the test data set (Group 2) used to test the models' performance.
Training and testing errors given by mean squared error (MSE) of different models constructed using different feature sets.
| Model | Train Error (Group 1) | Test Error (Group 2) | Train Error (Groups 1, 2) | Unseen Error (Group 3) |
|---|---|---|---|---|
| Sequence-to-sequence (S2S) | 0.02462 | 0.02504 | 0.01309 | 0.57112 |
| Stacked LSTM (SEQ) | 0.38373 | 0.02724 | 0.78142 | 0.32584 |
| Feedforward neural network (DNN) | 0.02223 | 0.04179 | 0.00919 | 0.25547 |
| Support vector machine regression (SVR) | 0.01362 | 0.08347 | 0.00518 | 0.16754 |
| Gradient boosting machine (GBM) | 2.316 × 10 | 0.32589 | 2.316 × 10 | 1.44463 |
| Polynomial regression (OLS) | 0.01335 | 0.08954 | 0.00459 | 0.15954 |
The MSE in Equation (3) is computed on standardized predictions, obtained by normalizing with the mean (463.8) and standard deviation (597.0) of all the daily numbers of cases (ncases).
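Using the reported mean (463.8) and standard deviation (597.0) of the daily case counts, the standardized MSE can be sketched as follows; the prediction and observation values below are hypothetical, not the paper's:

```python
MEAN, STD = 463.8, 597.0  # reported mean/std of the daily number of cases

def standardized_mse(pred, obs, mean=MEAN, std=STD):
    """MSE computed on z-scored values, as described for Equation (3)."""
    z = lambda v: (v - mean) / std
    return sum((z(p) - z(o)) ** 2 for p, o in zip(pred, obs)) / len(pred)

# Hypothetical example: predictions off by one standard deviation every day
obs = [300.0, 500.0, 800.0]
pred = [v + STD for v in obs]
err = standardized_mse(pred, obs)  # approx. 1.0
```

Standardizing this way makes the MSE values in the table comparable across periods with very different case magnitudes.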