Literature DB >> 35891394

Weekly Nowcasting of New COVID-19 Cases Using Past Viral Load Measurements.

Athar Khalil1,2, Khalil Al Handawi3,4, Zeina Mohsen5, Afif Abdel Nour6, Rita Feghali5, Ibrahim Chamseddine7, Michael Kokkolaras3,4.   

Abstract

The rapid spread of the coronavirus disease COVID-19 has imposed clinical and financial burdens on hospitals and governments attempting to provide patients with medical care and implement disease-controlling policies. The transmissibility of the disease was shown to be correlated with the patient's viral load, which can be measured during testing using the cycle threshold (Ct). Previous models have utilized Ct to forecast the trajectory of the spread, which can provide valuable information to better allocate resources and change policies. However, these models combined other variables specific to medical institutions or came in the form of compartmental models that rely on epidemiological assumptions, all of which could impose prediction uncertainties. In this study, we overcome these limitations using data-driven modeling that utilizes Ct and the previous number of cases, two institution-independent variables. We collected three groups of patients (n = 6296, n = 3228, and n = 12,096) from different time periods to train, validate, and independently validate the models. We used three machine learning algorithms and three deep learning algorithms that can model the temporal dynamic behavior of the number of cases. The endpoint was the 7-day-forward number of cases, and the prediction was evaluated using mean squared error (MSE). The sequence-to-sequence model showed the best prediction during validation (MSE = 0.025), while polynomial regression (OLS) and support vector machine regression (SVR) had better performance during independent validation (MSE = 0.1596 and MSE = 0.16754, respectively), indicating better generalizability of the latter. The OLS and SVR models were used on a dataset from an external institution and showed promise in predicting COVID-19 incidences across institutions. These models may support clinical and logistic decision-making after prospective validation.

Entities:  

Keywords:  COVID-19; Ct values; deep neural networks; machine learning; now-casting; predictive modeling; statistical analysis; viral load

Mesh:

Year:  2022        PMID: 35891394      PMCID: PMC9317659          DOI: 10.3390/v14071414

Source DB:  PubMed          Journal:  Viruses        ISSN: 1999-4915            Impact factor:   5.818


1. Introduction

Coronavirus disease (COVID-19) was declared a pandemic by the World Health Organization (WHO) in March 2020 following the global spread of the underlying severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) [1,2]. The clinical outcomes of patients can range from an asymptomatic state to acute respiratory distress syndrome and multi-organ dysfunction that can lead to death. The most commonly identified risk factors are age, sex, ethnicity, smoking status, and comorbidities such as cardiovascular disorders, chronic kidney diseases, and diabetes [2,3]. SARS-CoV-2 can be transmitted via direct contact; within a distance of one meter through coughing, talking, or sneezing; or indirectly via infectious secretions from infected patients [3]. COVID-19 put a strain on the economy and diminished the general well-being of the population due to the public health and social measures (PHSMs) employed to control it [4]. Now-casting models are used to infer the epidemic trajectory and make informed decisions about its severity and the actions needed to bring the epidemic under control. Information regarding the origin of the pathogen, serological assays, and social behavior, among other aspects, is used to inform now-casting models and provide situational awareness to policymakers [5]. Several works in the literature have used epidemic size indicators such as the frequency of tests, fatalities, and new confirmed cases to infer the pandemic trajectory [6,7]. This paper uses both the epidemic size and viral load measurements from a cross-sectional sample of patients to develop a now-casting framework. Predictive modeling and now-casting of epidemic trajectories can alert policymakers and health institutions to an increase in incidence rates. This allows sufficient time to use other detailed scenario models to proactively test and deploy various PHSMs [8,9,10,11]. 
The transmission rate depends on the patient's contagious stage, viral load, and the time of exposure between individuals [4]. Reverse-transcription quantitative polymerase chain reaction (RT-qPCR) remains the gold standard for COVID-19 diagnosis [12]. It measures the first PCR cycle, denoted as the cycle threshold (Ct), at which a detectable signal of the targeted genetic material appears [13]. The Ct value is inversely related to the viral load; a 3-point increase in Ct value corresponds to a 10-fold decrease in the quantity of the virus's genetic material [14]. Ct values have been proposed to have prognostic value in predicting severity, infectiousness, and mortality among patients [15] and have also been used to determine how long an infected patient needs to quarantine [16,17]. A high Ct value (indicating a low viral load) is detected at the early stages of the infection, before the person becomes contagious, and at the late stages, when the risk of transmission is low [18]. The lowest Ct value is usually reported within three days of the onset of symptoms and coincides with peak detection of cultivable virus and infectivity, which implies an increase in transmissibility by up to 8-fold [19]. Using viral load measurements, individuals with high viral load and mild symptoms can be identified as potential superspreaders [20]. Thus, early testing is highly recommended alongside isolation practices to interrupt SARS-CoV-2 transmission [21]. 
We believe that the use of Ct for now-casting has its merits, since it is a commonly available parameter irrespective of demographics and is highly correlated with transmissibility and incidence rates [22,23]. A popular approach for now-casting the pandemic trajectory is to use Bayesian inference frameworks to inform the posterior distributions of susceptible-exposed-infectious-recovered (SEIR) models and the corresponding time-varying incidence rate [6,24]. These approaches are limited by the assumptions of the underlying SEIR models (homogeneous distribution of population traits and contacts). Machine learning approaches, on the other hand, make little to no assumptions about the underlying transmission mechanics and can potentially generalize better when viral transmission is not completely understood and sufficient data are available. We demonstrate the merits of this approach using a robust framework that leverages observed viral load measurements for time-series now-casting of new COVID-19 cases over an upcoming 7-day time frame. 
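The Ct-viral load rule of thumb cited above (a 3-point Ct increase corresponds to a 10-fold decrease in genetic material [14]) can be written as a one-line conversion; the function below is our own illustrative helper, not part of the paper's models.

```python
def ct_to_fold_change(delta_ct: float) -> float:
    """Approximate fold change in viral genetic material implied by a shift
    in Ct, using the rule of thumb that +3 Ct means ~10-fold less virus."""
    return 10.0 ** (-delta_ct / 3.0)

# A patient whose Ct rises by 3 between tests carries ~10x less viral material:
print(ct_to_fold_change(3.0))   # 0.1
print(ct_to_fold_change(-6.0))  # 100.0 (Ct dropped 6 points -> ~100x more virus)
```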
The models are developed using a large cohort from a single cross-sectional virologic test center in Lebanon, with a hold-out cohort for independent testing after the model is finalized. The Lebanese patient cohort used in this study is the largest and most consistent one in terms of serological assessment, making the retrieved Ct values representative of the whole country. Now-casting the pandemic trajectory can facilitate its containment and improve healthcare providers' preparedness against new SARS-CoV-2 variants and the surge in new cases caused by them. Furthermore, now-casting the pandemic trajectory can support policymakers during the decline phase of the pandemic (e.g., when vaccination rates are high and herd immunity is beginning to take hold) by suggesting the best time frame for relaxing current PHSMs without the risk of the pandemic relapsing.

2. Materials and Methods

2.1. Patient Population

We retrospectively collected de-identified data for all COVID-19 patients diagnosed at Rafik Hariri University Hospital (RHUH) in Lebanon between 1 March 2020 and 31 March 2021. RHUH is the country's leading institution for COVID-19 testing and treatment, and the collected cohort represents the nation's COVID-19 trajectory well [25]. Ct values were retrieved from the electronic medical database of the hospital, considering the date of the first positive RT-qPCR test for each patient while disregarding any subsequent positive tests that may have resulted during follow-up visits. RNA extraction and RT-qPCR processing protocols were consistent over time, and the PCR machines used had similar calibrations. The daily confirmed COVID-19 case counts in Lebanon were obtained from the Lebanese Ministry of Public Health and the Worldometers website [26,27]. This study was approved by the Ethical Committee of RHUH. Written informed consent was waived since the study is retrospective and the patients' information was de-identified.

2.2. Study Design

We created three cohorts (discovery, testing, and independent validation) using a longitudinal data split. The discovery group (Group 1) was used for training and cross-validation [28] to tune the hyperparameters and calibrate the model weights. The testing group (Group 2) was reserved for testing the models' performance and calculating the test error. This approach complies with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement [29], which classifies predictive-modeling studies into four types of increasing reliability. Since the data were split into discovery (Group 1) and test (Group 2) groups at the beginning of the study, the model is TRIPOD type 2b. We used a third portion of the data (Group 3), called the 'unseen data group', for further independent validation of the models developed using the discovery group (Group 1). Such a longitudinal split of the data strengthens model validation, placing it between internal and external validation [29].
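The longitudinal (chronological) three-way split described above can be sketched as a simple date-based partition; the cut dates below are those reported in Section 3.1, while the function name and toy records are ours.

```python
from datetime import date

def longitudinal_split(records, cut1, cut2):
    """Split (date, value) records chronologically: records up to cut1 form
    the discovery group (Group 1), records up to cut2 the test group
    (Group 2), and the remainder the unseen group (Group 3)."""
    records = sorted(records, key=lambda r: r[0])
    g1 = [r for r in records if r[0] <= cut1]
    g2 = [r for r in records if cut1 < r[0] <= cut2]
    g3 = [r for r in records if r[0] > cut2]
    return g1, g2, g3

# Cut dates from the study: Group 1 ends 17 Oct 2020, Group 2 ends 30 Nov 2020.
demo = [(date(2020, 5, 1), 25.1), (date(2020, 11, 2), 22.4), (date(2021, 1, 5), 24.0)]
g1, g2, g3 = longitudinal_split(demo, date(2020, 10, 17), date(2020, 11, 30))
```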

2.3. Predictive Modeling

We first identified the relevant input features needed by the models to predict epidemic trajectories in Lebanon using Spearman's correlation test. We analyzed the association of the patients' Ct values (Figure 1) and age (Figure S3a) with the epidemic trajectory and only selected features with p < 0.05. Recent studies pointed out that case ascertainment rates may change over time (due to changes in PHSMs), resulting in biased Ct values [30]. The daily number of confirmed positive patients was plotted alongside the incidence rates in Lebanon to verify that this was not the case for the cohort used in this paper (see Section S1 and Figure S3b).
Figure 1

Scatter plot of biweekly mean Ct values and the observed number of cases nationwide, showing a clear negative correlation that is significant (p-value < 0.05).

In addition to the previously mentioned features, the epidemic trajectory also depends on the past number of confirmed COVID-19 cases, which is therefore aggregated with the input features during now-casting [5,7]. The period of time over which the input features and confirmed case counts are aggregated is defined as the sliding window T1. The input to all models is therefore a sequence of data over the past T1 days. The epidemic trajectory is given by a sequence of predicted case counts over the upcoming T2 days; T2 is fixed to 7 days throughout this study. The window size of 7 days on the epidemiological calendar was chosen due to its clinical relevance to health providers. Furthermore, other studies based on cross-sectional virological data have used the 7-day window size for now-casting pandemic trajectories [24]. We developed six different machine learning models for now-casting the epidemic trajectory, which are described below.
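The sliding-window setup above (inputs over the past T1 days, targets over the next T2 days) can be sketched as follows; this is a generic windowing helper of our own, not the paper's code, and it uses a plain numeric series as a stand-in for the daily Ct/case features.

```python
def make_windows(series, t1=6, t2=7):
    """Turn a daily series into (input, target) pairs: each input covers the
    previous t1 days, each target the following t2 days."""
    pairs = []
    for i in range(len(series) - t1 - t2 + 1):
        x = series[i : i + t1]            # past T1 days of features
        y = series[i + t1 : i + t1 + t2]  # next T2 days of case counts
        pairs.append((x, y))
    return pairs

# 20 days of data with T1 = 6 and T2 = 7 yield 20 - 6 - 7 + 1 = 8 pairs.
windows = make_windows(list(range(20)), t1=6, t2=7)
```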

2.3.1. Recurrent Neural Network (RNN) Models

The first two models are built around recurrent neural networks (RNNs), which accommodate time-series data that are often temporally correlated (i.e., the independent and identically distributed (i.i.d.) assumption does not hold for time-series data). This type of neural network can capture the temporal relationship between a decrease in Ct value and a subsequent (possibly delayed) rise in the number of cases. The RNN unit used in the models is the long short-term memory (LSTM) cell, which can capture long-term temporal effects and trends encoded by a long sequence of inputs while avoiding the problem of vanishing gradients during backpropagation [31]. The LSTM has a cell for storing temporal data and gates to control data flow and capture long-term dependencies. Each gate is composed of a multilayer perceptron with nhidden neurons [32]. We used stacked LSTM cells with several layers (given by nlayers) in the RNN models to learn high-level feature representations (the interaction of Ct values with the past number of cases) and used a dropout probability Pdropout on all but the first layer to generalize better and avoid overfitting. Dropout arbitrarily excludes a number of hidden neurons from weight and bias updates during backpropagation to improve generalization performance [33]. Temporal information at time step t of the n-th-layer LSTM cell is represented by its hidden and cell states. The first model is a sequence-to-sequence (S2S) model commonly used in natural language processing (NLP) translation tasks. The model consists of an encoder RNN that accepts an input sequence of features of length T1 and yields a context vector formed from the hidden and cell states at the final time step of the input series. The context vector is fed to a decoder that outputs a predicted sequence of length T2 corresponding to the projected number of cases. During training, the decoder uses its prediction at time step t as an input for the next time step t + 1. 
To speed up training, teacher forcing can be used to provide the actual value at time step t instead of the decoder's prediction, with a probability Pteacher [34]. The architecture of the S2S model used in this paper is shown in Figure 2.
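Teacher forcing reduces to a coin flip at each decoding step: with probability Pteacher the decoder is fed the ground truth for the previous step, otherwise its own prediction. A framework-free sketch (the stand-in "model" and names are ours, not the paper's LSTM decoder):

```python
import random

def decode_with_teacher_forcing(step_fn, first_input, targets, p_teacher, rng=random):
    """Greedy decoding loop: at each step, feed the decoder either the ground
    truth for the previous step (with probability p_teacher) or its own last
    prediction."""
    preds, prev = [], first_input
    for target in targets:
        pred = step_fn(prev)
        preds.append(pred)
        prev = target if rng.random() < p_teacher else pred
    return preds

# Stand-in "model": predicts previous value + 1 (purely illustrative).
always_forced = decode_with_teacher_forcing(lambda x: x + 1, 0, [5, 6, 7], p_teacher=1.0)
never_forced = decode_with_teacher_forcing(lambda x: x + 1, 0, [5, 6, 7], p_teacher=0.0)
# always_forced == [1, 6, 7]; never_forced == [1, 2, 3]
```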
Figure 2

Structure of the sequence-to-sequence (S2S) model used for now-casting the weekly number of cases. The left side of the network is the encoder that uses past information on Ct and the number of cases to create context vectors used to initialize the hidden and cell states of the decoder LSTM cells.

We also developed a second RNN model based on the stacked LSTM cells alone (i.e., the size of the input sequence T1 must be equal to the size of the output sequence T2) (Figure S4a). This model is called the stacked LSTM (SEQ) model. The average predicted and actual numbers of cases over the next T2 days are given by Equations (1) and (2), respectively.

2.3.2. Feedforward Neural Network (DNN) Model

We then developed a third model based on deep learning using feedforward neural networks. The feedforward neural network (DNN) model has several hidden layers (nlayers) with several hidden neurons (nhidden) each. All layers had a dropout probability Pdropout and a rectified linear unit (ReLU) activation function (Figure S4b). All deep learning models were trained using the stochastic gradient descent algorithm Adam with a learning rate lrate and batch size bsize [35]. Early stopping was used on all deep learning models to avoid overfitting if no improvement in the validation error occurred after a certain number of epochs (given by the patience parameter npatience) [36].
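The patience-based early-stopping rule can be sketched independently of any framework: track the best validation error and stop once `patience` epochs pass without improvement (the paper uses npatience = 200 per Table 1; the function and toy error curve below are ours).

```python
def train_with_early_stopping(val_errors, patience=200):
    """Return the index of the best epoch, stopping once `patience` epochs
    pass with no improvement in validation error. `val_errors` stands in
    for one validation-error value per training epoch."""
    best_err, best_epoch, waited = float("inf"), -1, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Error improves until epoch 2, then plateaus; with patience=3 we stop early.
best = train_with_early_stopping([0.9, 0.5, 0.3, 0.31, 0.32, 0.33, 0.4], patience=3)
# best == 2
```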

2.3.3. Regression Models

We developed three additional models that are not based on deep learning, namely a support vector machine regression (SVR) model [37], a gradient boosting machine (GBM) regression model [38], and a polynomial regression (OLS) model. Unlike the deep learning models, these models do not yield a sequence of predictions for the next T2 days. Instead, they compute a single value predicting the average number of confirmed COVID-19 cases over the next T2 days. This is because such models are primarily used for regression of univariate functions. Comparing averages allows for a fair comparison with the models described in Section 2.3.1 and Section 2.3.2.
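For these regressors, the windowed inputs collapse to one flat feature vector per sample and the target to a single averaged case count. A minimal sketch under those assumptions (helper name and toy data are ours):

```python
def make_regression_pairs(ct, cases, t1=6, t2=7):
    """Build (features, target) pairs for the non-sequence models: features
    are the past t1 days of mean Ct and case counts flattened into one
    vector; the target is the average case count over the next t2 days."""
    pairs = []
    for i in range(len(cases) - t1 - t2 + 1):
        x = ct[i : i + t1] + cases[i : i + t1]      # flattened T1-day window
        y = sum(cases[i + t1 : i + t1 + t2]) / t2   # single averaged target
        pairs.append((x, y))
    return pairs

ct = [25.0] * 20
cases = list(range(20))
X, y = make_regression_pairs(ct, cases, t1=6, t2=7)[0]
# First target: mean of the case counts on days 6..12, i.e., 9.0
```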

2.3.4. Hyperparameter Tuning

The hyperparameters of each model (listed and described in Table 1) were optimized using cross-validation on the discovery group (Group 1) only. The cross-validation consists of outer and inner loops (Figure 3).
Table 1

Optimal hyperparameters of different models.

Hyperparameter | Symbol | Value | Possible Values

Sequence-to-sequence model (S2S)
Sliding window size | T1 | 6 | 1–40
Number of hidden neurons | nhidden | 1500 | 1–2500
Probability of dropout | Pdropout | 0.8 | 0.0–0.9
Number of hidden layers | nlayers | 2 | 1–5
Teacher forcing probability | Pteacher | 0.3 | 0.0–0.9
Learning rate | lrate | 1×10⁻⁴ | 1×10⁻⁵–1×10⁻²
Batch size | bsize | 32 | 4–128
Best epoch | nepochs,best | 31 | 1–nepochs

Sequence completion model (SEQ)
Number of hidden neurons | nhidden | 2500 | 1–2500
Probability of dropout | Pdropout | 0.8 | 0.0–0.9
Number of hidden layers | nlayers | 3 | 1–5
Learning rate | lrate | 1×10⁻⁴ | 1×10⁻⁵–1×10⁻²
Batch size | bsize | 64 | 4–128
Best epoch | nepochs,best | 1 | 1–nepochs

Deep neural network (DNN)
Sliding window size | T1 | 6 | 1–40
Number of hidden neurons | nhidden | 1000 | 1–2500
Probability of dropout | Pdropout | 0.9 | 0.0–0.9
Number of hidden layers | nlayers | 1 | 1–5
Learning rate | lrate | 1×10⁻³ | 1×10⁻⁵–1×10⁻²
Batch size | bsize | 4 | 4–128
Best epoch | nepochs,best | 4 | 1–nepochs

Support vector machine regression (SVR)
Sliding window size | T1 | 6 | 1–40
Ridge factor | λ | 1×10⁻⁴ | 1×10⁻³–1.0
Margin of tolerance | ε | 1×10⁻² | 1×10⁻³–1.0
Stopping criterion tolerance | εtol | 0.1 | 1–5
Learning rate | lrate | 1×10⁻⁵ | 1×10⁻⁵–1×10⁻²

Gradient boosting machine (GBM)
Sliding window size | T1 | 36 | 1–40
Subsample fraction | fsample | 0.8 | 0.1–1.0
Maximum portion of features | ffeatures | 0.1 | 0.1–1.0
Decision tree maximum depth | D | 7 | 1–5
Learning rate | lrate | 0.01 | 1×10⁻⁵–1×10⁻²
Maximum number of boosting stages | nstages | 5000 | 50–5000

Polynomial regression (OLS)
Sliding window size | T1 | 6 | 1–40
Ridge factor | λ | 1.0 | 1×10⁻³–1.0
Degree | ndegree | 1 | 1–5

Common fixed parameters
Output window size (all models) | T2 | 7 | 1–40
Maximum number of epochs (all models) | nepochs | 5000 |
Kernel (SVR) | | linear |
Early stopping patience (S2S, SEQ, DNN) | npatience | 200 |
Optimizer (S2S, SEQ, DNN) | | Adam |

The tuned hyperparameters of each model are reported underneath it. The fixed hyperparameters are reported at the bottom of the table.

Figure 3

Cross-validation and hyperparameter determination scheme for model development. Using the discovery group (Group 1), the inner loop tunes the model's hyperparameters by minimizing the average k-fold cross-validation error with a stochastic direct search algorithm or a grid search. A second stage (following tuning) generates several models randomly and bins them by training error. The model with the lowest training error is tested on the test group to obtain the testing error.

The outer loop split the discovery group (Group 1) into five folds and sent four into the inner loop for training the models and subsequent hyperparameter optimization with respect to the average k-fold cross-validation error [39]. The cross-validation error of each fold was calculated using the mean squared error (MSE) criterion on the predicted and actual average number of cases for the following T2 days, as shown in Equation (3), where the predicted and actual averages are defined by Equations (1) and (2), respectively. Several models used in this paper (S2S, SEQ, DNN, and GBM) involve random variables associated with the training algorithm (backpropagation and gradient boosting), which are often ignored in the applied machine-learning literature. Examples of these random variables include the initial values of learnable parameters (weights, biases, and decision tree parameters), dropout, and gradient descent step sizes. Fixing the random seed of these random variables could result in model bias. We addressed this issue by randomly sampling different training runs during hyperparameter optimization and optimizing the mean cross-validation error of all the sampled runs. We applied this approach to a grid search of the hyperparameter space to discern the sensitivity of the cross-validation error to the hyperparameters. We then used a stochastic derivative-free optimization (DFO) algorithm (stochastic mesh adaptive direct search (StoMADS)) to fine-tune the hyperparameters [40]. StoMADS is an extension of the mesh adaptive direct search (MADS) algorithm that automatically updates its estimates of a stochastic objective function (here, the cross-validation error) depending on the level of uncertainty in the current incumbent solution. After obtaining the optimal model in the inner loop, we scored it using the outer-loop data. We then performed a random draw to obtain 30 models using the tuned hyperparameters. 
These models were binned by training error and the top-performing model was stored and used to make predictions for the test group (Group 2). We note that binning and sampling of the cross-validation error are unnecessary for the OLS and SVR models since their training is deterministic and does not involve random variables.
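The scoring behind Equations (1)-(3) amounts to averaging predictions and observations over the next T2 days, standardizing by the dataset's case-count statistics (463.8 and 597.0, as reported in the Table 2 footnote), and taking the MSE. A sketch under those assumptions (function name ours):

```python
def standardized_mse(pred_seqs, true_seqs, mean=463.8, std=597.0, t2=7):
    """MSE of the t2-day averages (cf. Eqs. (1)-(2)) after z-scoring case
    counts by the dataset mean and standard deviation (cf. Eq. (3))."""
    errs = []
    for pred, true in zip(pred_seqs, true_seqs):
        p_avg = sum(pred[:t2]) / t2   # predicted average over next T2 days
        a_avg = sum(true[:t2]) / t2   # actual average over next T2 days
        errs.append(((p_avg - mean) / std - (a_avg - mean) / std) ** 2)
    return sum(errs) / len(errs)

# Perfect predictions give zero error regardless of the scale of the counts:
assert standardized_mse([[500] * 7], [[500] * 7]) == 0.0
```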

3. Results

3.1. Patient Population

The entire dataset included 23,185 patients with a median age of 37 years. We aggregated the individuals' Ct values into a sequence of daily mean Ct values. Group 1 contained 6296 patients admitted to RHUH between 2 March 2020 and 17 October 2020; Group 2 contained 3228 patients from 18 October 2020 to 30 November 2020; and the unseen group contained 12,097 patients from 1 December 2020 to 16 March 2021. All three groups have comparable median ages (34.0, 37.0, and 37.25 years, respectively). Group 1 was further split into five folds during model development for cross-validation: four for training and one for validation, rotated across folds. Figure 4 shows the observed bi-weekly average Ct values and the corresponding nationwide number of cases in Lebanon for the period spanning Groups 1 and 2, used in the model development phase. The entire dataset is provided in the Supplementary Material, Figure S1.
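The aggregation step described above (collapsing individual test results into daily mean Ct values) can be sketched as a simple group-by; the helper name and toy dates are ours.

```python
from collections import defaultdict

def daily_mean_ct(tests):
    """Collapse individual (date, ct) test results into one mean Ct per day,
    producing the daily series fed to the now-casting models."""
    by_day = defaultdict(list)
    for day, ct in tests:
        by_day[day].append(ct)
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}

means = daily_mean_ct([("2020-03-02", 24.0), ("2020-03-02", 26.0), ("2020-03-03", 30.0)])
# {'2020-03-02': 25.0, '2020-03-03': 30.0}
```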
Figure 4

(A) Bi-weekly mean Ct values of RHUH patients. The solid line represents the median bi-weekly Ct values, and the gray shaded area represents the inter-quartile range (25–75 percentile) of the observed Ct values. (B) The grey bars show the weekly running average of the number of cases observed nationwide in Lebanon between 1 March 2020, and 7 December 2020 (the running average can be computed until 23 November). The solid black line represents the growth rate in the weekly number of cases.

3.2. Correlation between the National Daily Number of COVID-19 Cases and Mean Ct

We observed a temporal delay between the incidence rate and the observed Ct values. For example, the trough in mean Ct values on 8 October 2020 (Trough 3 in Figure 4A) was followed by an increase in the number of cases on 29 October 2020, with more than 1640 cases per day (Peak 3 in Figure 4B). This delay could be due to the time needed for the population dynamics of disease transmission to take hold: low Ct values indicate nascent infections circulating in the population that need time to reach the rest of the population. This observation is consistent with Hay et al. [24], who used compartmental SEIR models to show that cross-sectional Ct observations with a low median value signal the growth phase of a pandemic (when case counts are still typically low). A similar trend was observed for case count peaks 1 and 2, which were preceded by median Ct troughs 1 and 2, respectively. This visual analysis of the data indicates that the median Ct value is temporally related to incidence. We also investigated the relationship between the mean Ct value and case counts using correlation analysis. We observed a clear inverse correlation between mean Ct and the number of cases (p < 0.001), quantified by the Spearman correlation test (Figure 1). This indicates that the mean cross-sectional Ct value is an important feature for now-casting the pandemic trajectory.
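The Spearman test used here is simply the Pearson correlation of the rank-transformed series. A dependency-free sketch (in practice one would call `scipy.stats.spearmanr`, which also supplies the p-value; this version ignores tie correction):

```python
def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks of the
    two series (ties get arbitrary order in this sketch)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A monotone inverse relation (higher Ct, fewer cases) gives rho ≈ -1:
print(spearman_rho([30, 28, 25, 22], [100, 400, 900, 1600]))  # ≈ -1.0
```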

3.3. Now-Casting the Epidemic Trajectories

We developed six types of predictive models for now-casting the COVID-19 epidemic trajectory in Lebanon using the data in the discovery group (Group 1). The optimal hyperparameters for each model are listed in Table 1. Early stopping terminated the backpropagation algorithm at 31, 1, and 4 epochs for the S2S, SEQ, and DNN models, respectively. All models except the GBM had an optimal input window size of 6 days. This implies that an aggregate measure of cross-sectional Ct values and past incidence rates over the last 6 days could be used to now-cast the expected number of positive COVID-19 cases over the following 7 days. The models developed using Group 1 were used to now-cast the trajectory from 18 October 2020 to 30 November 2020 (Group 2) (Figure 5). The models were then retrained on Groups 1 and 2 using the hyperparameters in Table 1 and used to now-cast the epidemic trajectory after 1 December 2020 (Group 3). Table 2 lists the MSE for the predicted trajectories on Groups 2 and 3 (see Table 2 footnote).
Figure 5

Predicted 7-day rolling average of the daily number of cases on the unseen data set using (A) the sequence-to-sequence (S2S) model, (B) the stacked LSTM (SEQ) model, (C) the feedforward neural network (DNN), (D) the support vector machine regression (SVR) model, (E) the gradient boosting machine (GBM), and (F) the polynomial regression (OLS) model. All models were tuned using the cross-validation error of the discovery set. The grey shaded region represents the test data set (Group 2) used to test the models' performance.

Table 2

Training and testing errors given by mean squared error (MSE) of different models constructed using different feature sets.

Model | Train Error (Group 1; Figure 5) | Test Error (Group 2; Figure 5) | Train Error (Groups 1,2; Figure 6) | Unseen Error (Group 3; Figure 6)
Sequence-to-sequence (S2S) | 0.02462 | 0.02504 | 0.01309 | 0.57112
Stacked LSTM (SEQ) | 0.38373 | 0.02724 | 0.78142 | 0.32584
Feedforward neural network (DNN) | 0.02223 | 0.04179 | 0.00919 | 0.25547
Support vector machine regression (SVR) | 0.01362 | 0.08347 | 0.00518 | 0.16754
Gradient boosting machine (GBM) | 2.316×10⁻⁶ | 0.32589 | 2.316×10⁻⁶ | 1.44463
Polynomial regression (OLS) | 0.01335 | 0.08954 | 0.00459 | 0.15954

The MSE in Equation (3) is computed on standardized predictions, normalized using the mean and standard deviation of all the daily numbers of cases (ncases), given by 463.8 and 597.0, respectively.

The RNN models (S2S and SEQ) performed well on Group 2 (MSE of 0.025 and 0.027, respectively), followed by the DNN model with an MSE that is two-fold larger (0.042). The OLS and SVR had an MSE that is roughly four-fold larger than that of the RNN models (0.090 and 0.083, respectively). The GBM overfitted the training data and did not generalize well to Group 2 (MSE of 0.326). The training error for the RNN models was higher than that of the parametric models (OLS and SVR) due to the regularization performed by the early-stopping criterion to avoid overfitting. Movie S1 shows an example training run of the S2S model with arbitrary hyperparameters, where early stopping helped avoid overfitting. The MSEs of the SVR, OLS, and DNN models were comparable on the unseen data group (0.168, 0.160, and 0.255, respectively). The SEQ and S2S models performed worse on the unseen group, implying that simpler models perform better when the number of data points available for training and hyperparameter tuning is limited. Deep learning models generally excel when a large dataset is available for model development, as has been reported by several studies in the literature [41,42]. To verify this, the RNN models were re-developed using both Groups 1 and 2 for training, hyperparameter tuning, and validation. The generalization performance improved significantly, bringing the MSE down from 0.571 to 0.106 for the S2S model (Table S1). This implies that the RNN models generalize better when more training data are available (see Supplementary Material Section S3). If limited data are available (as at the start of a pandemic), simpler models can provide better generalization performance. We deployed the models developed on the combined dataset (Groups 1 and 2) as a web application to facilitate prospective validation in the future [43].

4. Discussion

Host viral load and the resulting Ct values have been widely proposed for evaluating the progression of SARS-CoV-2 infection and assessing patients' contagiousness [44]. Mathematical modeling has been widely used for predicting the course of the COVID-19 pandemic. These prediction models were developed based on the applied intervention measures and population behavioral fluctuations, including social distancing and mask-wearing [45]. The COVID-19 reproduction number (R0), defined as the average number of naive individuals a patient can infect, has a mean estimate of 3.28 and could range from 1.4 to 6.49 [46]. Although R0 can vary widely by country, culture, and stage of the outbreak, it has been used to justify the need for community mitigation strategies and political interventions [47]. So far, only a few recent, more advanced models have evaluated the disease spread based on viral kinetics and serological assays (such as RT-qPCR tests) [22,24]. Furthermore, these studies focused on such assays or pandemic size indicators (such as R0 and incidence rates) in isolation, without combining the two. This paper utilizes both past incidence rates and viral load measurements to now-cast the pandemic trajectory. Hay et al. [24] used Bayesian inference to predict the growth rate in the daily number of COVID-19 cases as a function of Ct values. They showed that the population-level Ct distribution is strongly correlated with growth rate estimates of new infections in Massachusetts, USA. They estimated R0 and the growth rate by using observations of Ct values to inform priors on key viral kinetics parameters (such as the viral-load waning rate and the Ct at peak viral load) and on the pandemic trajectory (the daily probability of infection is used as a proxy for the trajectory). The prior on the pandemic trajectory is assumed to come from a Gaussian process, which makes no assumptions regarding the evolution of the trajectory as more Ct observations are made. 
We used the Gaussian process regression model to predict the pandemic trajectory using the RHUH cross-sectional patient cohort (see Supplementary Material Section S4). The advantage of such models is that they are highly interpretable, as they estimate the viral kinetics model parameters that are most likely to give rise to the observed Ct values [48]. This provides useful information about the virulence and severity of the pathogen. However, such models make assumptions about the likelihood used to update the priors. These assumptions limit the predictive capability of the model if any of them (such as the viral kinetics models) do not hold in reality, potentially resulting in poor generalization performance, for example, when a different clade of viruses takes hold. Another dataset from Bahrain demonstrated the effectiveness of Ct in predicting the epidemiological dynamics of COVID-19 [23]. However, that study considered neither the interaction between different features (i.e., the number of positive cases and Ct) nor the temporal effects observed in epidemics. In comparison, the presented data-driven approach of inferring the epidemic trajectory using past case counts and Ct values makes very few assumptions about the pandemic trajectory and the viral kinetics models that gave rise to the observed Ct values. This has the benefit of potentially generalizing to a wide range of scenarios. To test this, we used all the models developed in this paper using Group 1 (Figure 5) to infer the case counts in the state of Massachusetts using the patient cohort of Brigham and Women's Hospital (BWH) provided by Hay et al. [24] (Figure S9A). Most models captured the underlying trend, except for the GBM and stacked LSTM (SEQ) models (Figure S10). SVR performed the best on this dataset (Figure S9B). 
However, further prospective validation is needed to ensure that these models can generalize to different testing centers and reject disturbances in Ct values due to sample collection and handling methods. The inferred trajectories for the state of Massachusetts (15 April 2020–15 December 2020) and Lebanon (1 December 2020–31 March 2021) show that simple machine learning models (such as OLS and SVR) perform well with limited training data (when the models are developed using data from Group 1 only). Deep learning models begin to outperform them when more data are included in the development set (Groups 1 and 2) to infer the trajectory in Lebanon (Figure S5). Although the outcomes of this study favored simpler regression models, their simplicity provides an advantage in terms of interpretability [48]. The dataset used in this study contained fluctuations that allowed us to extract the temporal effect of Ct on the trajectory of the pandemic. Since the data came from a single institution, these fluctuations are likely to be signals rather than noise. The significant changes in the Ct values mirrored the well-recognized political, economic, and social turning points that occurred in Lebanon during the pandemic. These incidents impacted the population’s behavior towards COVID-19 in a consistent and well-defined manner, allowing us to track and correlate these changes with the variation in the mean Ct values and, subsequently, the disease spread. The early reported high mean Ct values and the low number of COVID-19 cases in Lebanon between March 2020 and June 2020 co-occurred with a strictly imposed lockdown and an intense awareness campaign executed by local media platforms [25].
In comparison, the sharp rise in COVID-19 cases and the decrease in mean Ct values upon diagnosis were detected after the first national lockdown was lifted in July, which coincided with a significant shift of local media attention towards the economic crisis peaking in the country. Yet, the highest jump in the number of national COVID-19 patients and the sharpest drop in Ct values were reported after the explosion of Beirut’s port in August 2020, which was classified among the most significant chemical explosions in history [49]. The devastating effects of the explosion amplified the country’s pre-existing social, economic, and health challenges, causing a significant increase in the COVID-19 positivity rate in September and November 2020, which reached 13.9% [49,50]. The consequences of this explosion shifted residents’ attention away from proper precautions. This was reflected by the sharp decrease in the mean Ct values, indicating less responsible behavior and a delay in diagnosis time among suspected patients, which resulted in a subsequent increase in SARS-CoV-2 spread among individuals. These events caused three significant peaks in the number of cases and three drops in mean Ct. We trained the models on two of these peaks and tested their ability to detect the third peak using the unseen data. Thus, the training and validation errors reflect the models’ robustness against unexpected events. The detected inverse relationship between Ct values and the number of national COVID-19 positive cases reflects the population dynamics of transmission and demonstrates the temporal significance of Ct values. The results emphasize the importance of early testing, when the patient’s viral load and infectivity are low, to prompt isolation practices and thus suppress the national spread of the virus. The models were able to predict the upcoming one-week expected number of national COVID-19 cases based on a commonly available diagnostic measurement, the Ct value.
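The setup described above, using lagged case counts and mean Ct values to predict the next week's cases, with a chronological split so that a held-out peak tests robustness, can be sketched as follows. This is a simplified stand-in with synthetic data, not the paper's actual pipeline; the single simulated wave, lag length, and SVR hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Hypothetical weekly series imitating one epidemic wave: case counts rise
# and fall, while mean Ct moves inversely (lower Ct at higher viral loads).
weeks = np.arange(60)
cases = 200 + 150 * np.exp(-((weeks - 25) ** 2) / 50) + rng.normal(0, 5, 60)
mean_ct = 28 - 0.01 * (cases - 200) + rng.normal(0, 0.3, 60)

# Feature construction: the past `lag` weeks of case counts and mean Ct
# jointly predict the next week's case count (the now-casting endpoint).
lag = 4
X = np.array([np.r_[cases[i - lag:i], mean_ct[i - lag:i]]
              for i in range(lag, len(weeks))])
y = cases[lag:]

# Chronological split: train on the earlier weeks (containing the wave),
# evaluate on unseen later weeks, mirroring the held-out-peak validation.
split = 40
model = make_pipeline(StandardScaler(), SVR(C=100.0, epsilon=1.0))
model.fit(X[:split], y[:split])
mse = float(np.mean((model.predict(X[split:]) - y[split:]) ** 2))
print(f"held-out MSE: {mse:.1f}")
```

The chronological split (rather than a random shuffle) is the key design choice: it forces the model to extrapolate to a time period it never saw, which is what deployment would require.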
This shows that viral load measurements are a robust input that can enhance the outcomes of disease forecasting models. Interestingly, this model remains valuable among vaccinated patients, as these patients were shown to have a viral load pattern similar to that of unvaccinated patients and thus can transmit the disease in the same manner upon infection [51]. Ultimately, the data support incorporating Ct values with other epidemiological variables and patient demographics to predict new COVID-19 waves and to study epidemic behaviors. The models in this paper could be extended to now-cast other contagious viral diseases that are diagnosed by qPCR, provided that sufficient training data are available (at least one wave of the viral disease has been observed). This study is limited to a single-institution cohort. Although the cohort represents the national number of cases, and the model’s variable (Ct) is country-independent, a prospective validation on multi-institutional data is needed before translation. To facilitate this process, we have hosted the models on a web interface to be used in future studies that compare the predicted and observed numbers of cases [43]. Another limitation is the inability of the model to compare the effects of preventative policies such as lockdowns and quarantining. The model provides an alert when the number of cases is about to rise significantly, allowing more informed triage decisions and better allocation of medical resources during the pandemic. However, it does not provide guidance on which measures could best control an upcoming peak. Mechanistic models, on the other hand, such as individual-based models (IBMs), can provide such insights, but their application is limited to much smaller population sizes due to computational cost [52,53,54,55]. A future study could focus on combining IBMs with viral load models, such as those developed by Hay et al. [24], to estimate Ct values for a cross-section of the population and use them to retrain the models developed in this paper to now-cast the trajectory under different intervention policies.

5. Conclusions

This study is a first attempt at combining serologic assays from a representative cross-sectional patient cohort with epidemiological indicators, such as incidence rates and infection size, to now-cast the number of nationwide positive COVID-19 cases in a specific region [5]. This was motivated by the premise that SARS-CoV-2 spread is highly dependent on individual viral dynamics. The models used in this paper showed the merits of this approach using observations of Ct values and historical infection data from Lebanon in a now-casting framework. The modeling framework relied on multiple machine learning algorithms that make few assumptions about population and transmission dynamics. The patient cohort revealed that the evolution of the viral load mirrored the growth of positive national cases in the country. Low mean Ct values were followed by a large number of national positive COVID-19 cases and vice versa, in line with similar observations in the literature [22,24]. This finding is also supported by applying the machine learning models in this paper to the BWH dataset provided by Hay et al. [24]. To account for the effect of social interactions that could occur a few days before and after testing, we used a sequence of daily mean Ct values across multiple machine learning algorithms. We trained the models on a training dataset and independently validated them on unseen data, forming TRIPOD type 2b models [29]. The training process utilized a cross-validation approach combined with a state-of-the-art stochastic direct search for hyperparameter tuning to prevent model over-fitting [56]. The sequence-to-sequence (S2S) model had the best accuracy when a large amount of data was used for its development, while the support vector machine regression (SVR) model provided better accuracy with limited development data, as given by the MSE criterion.
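The cross-validation-with-hyperparameter-tuning scheme described above can be sketched minimally as follows. As an assumption for illustration, a simple grid search over synthetic data stands in for the paper's stochastic direct search; both tune hyperparameters against held-out folds to prevent over-fitting, and the time-ordered folds reflect the now-casting setting.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)

# Hypothetical development set: 8 lagged features (e.g., past case counts
# and mean Ct values) predicting the next (scaled) case count.
X = rng.normal(size=(80, 8))
y = X @ rng.normal(size=8) + rng.normal(0, 0.1, size=80)

# Time-ordered folds keep each validation fold strictly after its training
# fold, so tuning never peeks at future data.
cv = TimeSeriesSplit(n_splits=4)

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
search = GridSearchCV(
    pipe,
    param_grid={"svr__C": [0.1, 1.0, 10.0], "svr__epsilon": [0.01, 0.1]},
    scoring="neg_mean_squared_error",
    cv=cv,
)
search.fit(X, y)
print(search.best_params_)
```

Selecting hyperparameters on held-out folds, rather than on the training fit, is what guards against the over-fitting the conclusion refers to; a direct-search optimizer simply explores the same objective more efficiently than an exhaustive grid.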
Since the models were trained and validated on datasets from different time periods, they have the potential to generalize to future data. In addition, since the variables used for prediction (Ct values) are not specific to the institution from which the data were acquired, the models are ready to undergo prospective and external validation in the future. This would form a TRIPOD type 4 study, which is recommended before translating the model to practice for now-casting the 7-day forward number of cases based on recently reported Ct values.