Sina Fathi-Kazerooni1, Roberto Rojas-Cessa1, Ziqian Dong2, Vatcharapan Umpaichitra3. 1. Department of Electrical and Computer Engineering, Newark College of Engineering, New Jersey Institute of Technology, Newark, NJ, 07102, USA. 2. Department of Electrical and Computer Engineering, College of Engineering and Computing Sciences, New York Institute of Technology, New York, NY, 10023, USA. 3. Department of Pediatrics, SUNY Downstate Health Sciences University, Brooklyn, NY, 11203, USA.
Abstract
In this paper, we show a strong correlation between turnstile entries data of the New York City (NYC) subway provided by NYC Metropolitan Transport Authority and COVID-19 deaths and cases reported by the NYC Department of Health from March to May 2020. This correlation is obtained through linear regression and confirmed by the prediction of the number of deaths by a Long Short-Term Memory neural network. The correlation is more significant after considering incubation and symptomatic phases of this disease as experienced by people who died from it. We extend the analysis to each individual NYC borough. We also estimate the dates when the number of COVID-19 deaths and cases would approach zero by using the Auto-Regressive Integrated Moving Average model on the reported deaths and cases. We also backward forecast the dates when the first cases and deaths might have occurred.
In this paper, we show a strong correlation between turnstile entries data of the New York City (NYC) subway provided by NYC Metropolitan Transport Authority and COVID-19 deaths and cases reported by the NYC Department of Health from March to May 2020. This correlation is obtained through linear regression and confirmed by the prediction of the number of deaths by a Long Short-Term Memory neural network. The correlation is more significant after considering incubation and symptomatic phases of this disease as experienced by people who died from it. We extend the analysis to each individual NYC borough. We also estimate the dates when the number of COVID-19 deaths and cases would approach zero by using the Auto-Regressive Integrated Moving Average model on the reported deaths and cases. We also backward forecast the dates when the first cases and deaths might have occurred.
New York City (NYC) was the epicenter of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in early 2020 (Levenson, 2020). Many people have been hospitalized and also many others have lost their lives to this disease. It is unknown when the first COVID-19 cases occurred in the city, but the first case was reported on Feb. 29, 2020. The COVID-19 cases and deaths in NYC increased rapidly after that date, peaked in mid-April 2020 (NYC Health Department, 2020a), and started to decline after the city implemented a series of containment measures, including closing schools, social distancing, business shut-downs, and shelter-in-place (Nicola et al., 2020). As a result, the COVID-19 pandemic has brought and continues to bring economic hardship to NYC residents and people worldwide, and especially those in the lower economical strata (Wadhera et al., 2020). As city dwellers mainly rely on public transportation and more notably the subway to move around, it is expected that a highly contagious virus such as SARS-CoV-2 would easily spread through the NYC population. It is expected that the NYC subway, being an enclosed environment and also a major indicator of people’s social activity, would be directly correlated with the spreading of SARS-CoV-2.Although this correlation has been described by reported cases throughout neighborhoods, the correlation between the subway ridership and COVID-19 incidence and deaths has not been shown. In this paper, we show this correlation between the public turnstile entry data of NYC subway provided by the NYC Metropolitan Transportation Authority (MTA) (NYC MTA, 2020; Whong, 2020) and the statistics of the confirmed COVID-19 deaths provided by the NYC Department of Health (DOH) (NYC Health Department, 2020b). We used linear regression and tested the turnstile data from a few days before the day of the reported deaths and cases to support this correlation. There are many prediction models for forecasting the spread of COVID-19 (Al-Qaness et al., 2020; Anastassopoulou et al., 2020; Ceylan, 2020, p. 138817; Chakraborty & Ghosh, 2020, p. 109850; Fanelli & Piazza, 2020; Goscé & Johansson, 2018; Kucharski et al., 2020; Kuniya, 2020; Li et al., 2020; Mo et al., 2020; Perc et al., 2020; Petropoulos & Makridakis, 2020; Qin et al., 2020; Ribeiro et al., 2020, p. 109853; Tosepu et al., 2020; Yang et al., 2020; Zhang et al., 2020). Much of the literature focuses on modeling the incidence of COVID-19 in different countries and partially enclosed environments, such as cruise ships (Ceylan, 2020, p. 138817; Fanelli & Piazza, 2020; Kucharski et al., 2020; Li et al., 2020; Ribeiro et al., 2020, p. 109853; Tosepu et al., 2020; Zhang et al., 2020). These models use local and regional data to examine the correlation of COVID-19 cases with environmental or social factors. Many of the approaches use the Susceptible-Exposed-Infective-Recovered epidemiological model to analyze data on this outbreak. These models rely on the accuracy of the reported cases. However, the accuracy of case reports is expected to be low at the early stage of this new outbreak and that may add significant errors in their analysis and forecast. To avoid such errors, some models forecast the number of COVID-19 cases according to the population size (Kermack & McKendrick, 1927; Petropoulos & Makridakis, 2020; Qin et al., 2020). There is an increasing interest in models that estimate the impact of public transit on incidence and the possibilities of contagion by airborne transmissions (Goscé & Johansson, 2018; Mo et al., 2020).Our analysis leverages on the reported number of deaths because their count has higher accuracy than the reported number of cases. However, the data on cases is larger than the one on deaths, so we also analyze the reported cases. We employ linear regression for analyzing the correlation of the reported number of COVID-19 deaths and the NYC subways turnstile entries and Long Short-Term Memory (LSTM) neural network (Hochreiter & Schmidhuber, 1997) for predicting the number of deaths and cases through time-series data analysis to confirm the correlation between these two data sets. We extend this analysis to each of the five NYC boroughs: Bronx, Brooklyn, Manhattan, and Staten Island.We also use the Auto-Regressive Integrated Moving Average (ARIMA) model to 1) predict the dates when COVID-19 deaths and cases would approach zero according to the reported data, and to 2) estimate the time when the contagion might have started in NYC by identifying the dates when the first case and death might have occurred through analysis of the provided cases and deaths data.The analysis performed in this paper exclusively uses the mentioned public datasets, and so it leaves out factors that may have affected the actual counts of cases and deaths. For example, the data analyzed in this paper was recorded while no face coverings were widely considered by NYC dwellers.The remainder of this paper is organized as follows. Section 2 describes the models and methods used in our data analysis. Section 3 describes the obtained results. Section 4 presents a discussion and remaining questions. Section 5 presents our conclusions.
Model description
In this section, we present the data sources and analysis tools used in this paper.
Datasets
The NYC DOH dataset reports the first COVID-19 case on Feb. 29, 2020 and the first death on Mar. 11, 2020. The MTA turnstile entry data includes the number of subway entries and exits per station in NYC, recorded every 4 h each day from a total of 379 subway stations (NYC MTA, 2020; Whong, 2020). We calculate the average daily entries of NYC subway stations per day for our analysis. We also use the number of daily cases and the number of daily deaths of the COVID-19 dataset published by the NYC DOH (NYC Health Department, 2020b).We consider the reported COVID-19 incubation and symptomatic periods (Centers for Disease Control and Prevention (CDC), 2020; Guan et al., 2020), as a combined period, in the reported subway entry data as features in the analysis. We call these features shifted days. For example, a shifted day is a day prior to the day under prediction. The number of daily subway entries is denoted as for a given day t0, and the number of turnstile entries of m days before t0 is denoted as . We use and as features to predict the number of deaths and cases for day t0, where 1 ≤ m ≤ 25. We use 20% of the NYC DOH data as test data for validation and the rest for training.
Forecasting models
In our analysis, we use linear regression with L1 regularization to find the correlation between the two datasets and the precedence of the different shifted days. We use LSTM, a recurrent neural network (RNN) (Hochreiter & Schmidhuber, 1997) that learns the time evolution of data sequences (Hochreiter & Schmidhuber, 1997; Ramsundar & Zadeh, 2018), to predict the number of daily deaths and the number of daily cases and evaluate their accuracy as verification of the correlation in NYC and its five boroughs.We use ARIMA, a tool to forecast time-series data (Box et al., 2011; Flandoli, 2020), to predict the number of daily deaths and the number of daily cases, evaluate their accuracy, and estimate the dates of when the first case and death may have occurred. For the latter use, we analyze the data in a backward (time) sequence.We adopt multiple linear regression with x independent variables (Montgomery et al., 2012). The estimated value in the linear regression model follows:where β represents the model weights or regression coefficients and ϵ is the error. L1 regularization is the absolute sum of the model weights multiplied by a shrinkage value (λ). It is formulated as and the regression model goal is to minimize:where y is the actual data point and x is the value of the independent variables for each data point.We evaluate the accuracy of the prediction performed through LSTM as an indicator of the correlation extent between the two datasets. We use a Savitzky–Golay filter to smooth out the prediction results of LSTM (Schafer, 2011).ARIMA consists of three models: auto-regressive (AR), integration, and moving average (MA). It uses three hyper parameters: p, d, and q, where p is the order of auto-regressive model, d is the degree of difference, and q is the order of moving average (Box et al., 2011). The ARIMA(p, d, q) is defined aswhere α is the coefficient of discrete time linear equation of AR, L is a time lag operator defined as LX = X, X is the observed value at time t, β is the coefficient for the noise term, ϵ, in MA (Flandoli, 2020).
Performance metrics
The performance of the prediction models is evaluated by the R2 score, mean absolute error (MAE), and root mean square error (RMSE). The R2 score indicates how close the data are from the obtained regression (Montgomery et al., 2012). R2 score is calculated aswhere SS is the residual sum of squares and SS is the total sum of squares. SS is calculated as , where y is the actual values and is the predicted values. SS is calculated as , where is the mean of the actual values. R2 shows the variance of residuals (or prediction errors) in the predicted data points divided by the variance of data points from the average of all data points. R2 represents the performance of a model as compared to randomly guessed predictions that are equal to the average of all data. The closer R2 is to 1, the predictions are closer to the actual values.MAE is defined as the average of sum of residuals, orwhere n is the number of observations. An MAE that approaches 0 means that the model predicts data points with high accuracy. We use MAE to measure the training and validation losses of LSTM and, in turn, to show whether the prediction is a good fit and not overfitting or underfitting occurs. If the training loss is significantly lower than the validation loss, the model is overfitting the training data, which means the model is learning complex patterns of training data that may not generalize in predicting unseen test data and it results in poor performance. If the validation loss is significantly lower than training loss, the model is underfitting the training data, which means that the model is unable to learn important patterns of the training data, and it results in poor model performance. The training is stopped and the model is improved to avoid overfitting and underfitting if the losses grow (although in different directions) as the training progresses (Abadi et al., 2015).RMSE is defined asIn other words, RMSE is the standard deviation of residuals. With normally distributed data points, one can expect 68% of predicted points to be within one RMSE from the mean and 95% to be within two RMSE from the mean. In LSTM predictions, we consider the combination of R2 scores and RMSE to assess accuracy.Fig. 1 shows the number of subway turnstile entries and the reported daily COVID-19 deaths and cases from Mar. 1, 2020 to May 15, 2020. The trend of deaths trails the cases by a few days. The turnstile data include variations of ridership before the city executed a lockdown. We tested the correlation between NYC subway turnstile entry data and reported daily COVID-19 deaths with turnstile entry data of the predicted day t0 and m shifted days (t0 − m) for 1 ≤ m ≤ 40, using linear regression.
Fig. 1
NYC COVID-19 deaths, COVID-19 cases, and MTA turnstile entries data from Mar. 1, 2020 to May 15, 2020.
NYC COVID-19 deaths, COVID-19 cases, and MTA turnstile entries data from Mar. 1, 2020 to May 15, 2020.Fig. 2 shows the calculated R2 score for t0 and each shifted day. The score of each considered shifted day is about or larger than 0.2, showing a strong correlation between these two data sets. Next, we predict the cases and deaths using LSTM to verify the correlation of the data sets. From the different number of shifted days shown in Fig. 2, we selected those that correspond to the possible reported incubation period of SARS-CoV-2 in a person plus possible symptomatic days. We used three feature sets (Feature sets A, B, and C, as sshown in Table 1). Feature set A uses the entries of the predicted day and the 14th prior day (, ; where m = 14). Feature set B uses entries of the predicted day and five prior days: 10th, 12th, 13th, 14th, and 17th prior days (, , where m = 10, 12, 13, 14, 17). Feature set C includes the entries of the predicted day and prior-day entries of each of the considered 25 days (, ; where m = 1, …, 25). This set is selected for completeness and as a reference.
Fig. 2
R2 score for multiple regression prediction of COVID-19 deaths with predicted day turnstile entries () and shifted-day turnstile entries () for (m) 1–40 days.
Table 1
Feature sets of training data derived from average daily entries of NYC subway stations.
Feature set A
Predicted day and 14 days ago
Feature set B
Predicted day, 10, 12, 13, 14, and 17 days ago
Feature set C
Predicted day, 1, 2, 3, 4, 5, …, and 25 days ago
R2 score for multiple regression prediction of COVID-19 deaths with predicted day turnstile entries () and shifted-day turnstile entries () for (m) 1–40 days.Feature sets of training data derived from average daily entries of NYC subway stations.
Results of data analysis
Correlation between NYC subway turnstile entries and COVID-19 deaths and cases
We used the feature sets described in Section 2 on LSTM to predict the number of deaths and the number of cases. We compared the predicted numbers using the different features with the actual reported numbers.Fig. 3(a) and (b) show the MAE of LSTM training and validation losses for the prediction of the number of deaths using Feature sets A and C, respectively. Feature set A produces an R2 score of 0.61 and an RMSE of 68.15. Feature set C produces an R2 score of 0.98 and an RMSE of 17.29. Fig. 3(c) shows the prediction of the number of deaths using Feature sets A and C, respectively. Feature set C is the top performing feature set and Feature set B is the bottom performing one for the prediction of deaths, as shown in Table 2. The results show that the correlation is strong with one shifted day and the correlation strengthens by adding more shifted days to the feature set as the scores show for the three considered feature sets.We also tested the use of the three feature sets on LSTM to predict the number of cases. Fig. 4(a) and (b) show the MAE of training and validation losses in the prediction of cases for Feature sets A and C, respectively. The R2 score on the test data is 0.74 and the RMSE is 1,510.43 for Feature set A, and the R2 score and the test data is 0.99 and the RMSE is 389.71 for Feature set C. The LSTM predicted cases using Feature set A (Fig. 4(c)) follows the actual data with, however; small discrepancies. The prediction using Feature set C also follows the actual data; however, it is closer than that predicted with Feature set A. Table 2 summarizes the R2 scores and RMSE for each feature set in the prediction of cases. The RMSE of the predictions of the number of cases is much higher than that of the prediction of the number of deaths because the number of cases each day is larger than deaths. These losses further confirm the lower accuracy of the reported cases despite the high R2 scores.
Fig. 3
MAE of LSTM training and validation losses in prediction of deaths using (a) Feature set A and (b) Feature set C. (c) Prediction of COVID-19 deaths using Feature sets A and C on LSTM.
Table 2
RMSE of all data points and R2 scores of test data for LSTM predictions of deaths and cases in NYC.
Features - Deaths
R2
RMSE
Set A
0.61
68.15
Set B
0.58
60.28
Set C
0.98
17.29
Features - Cases
Set A
0.74
1510.43
Set B
0.84
616.63
Set C
0.99
389.71
Fig. 4
MAE of LSTM training and validation losses in prediction of cases using (a) Feature set A and (b) Feature set C. (c) Prediction of COVID-19 cases using Feature sets A and C on LSTM.
MAE of LSTM training and validation losses in prediction of deaths using (a) Feature set A and (b) Feature set C. (c) Prediction of COVID-19 deaths using Feature sets A and C on LSTM.RMSE of all data points and R2 scores of test data for LSTM predictions of deaths and cases in NYC.MAE of LSTM training and validation losses in prediction of cases using (a) Feature set A and (b) Feature set C. (c) Prediction of COVID-19 cases using Feature sets A and C on LSTM.
Correlation of data set per NYC borough
We analyzed the LSTM predictions of turnstile data and COVID-19 cases and deaths for each NYC borough to show the correlation for the different boroughs of NYC. Each borough may have different socioeconomic status. Table 3 lists the R2 scores and RMSE for each feature set and borough.
Table 3
RMSE and R2 scores of test data for LSTM predicting deaths and cases in NYC Boroughs.
Features - Deaths
Set A
Set B
Set C
R2
RMSE
R2
RMSE
R2
RMSE
Bronx
0.47
20.38
0.50
15.98
0.89
7.89
Brooklyn
0.35
30.35
0.43
19.61
0.84
11.63
Manhattan
0.16
12.3
0.27
12.3
0.58
6.17
Queens
0.73
22.99
0.67
21.24
0.96
8.12
Staten Island
0.18
6.17
0.61
3.88
0.13
4.02
Features - Cases
Set A
Set B
Set C
R2
RMSE
R2
RMSE
R2
RMSE
Bronx
0.86
154.31
0.64
159.53
0.96
71.46
Brooklyn
0.74
176.34
0.78
129.34
0.93
68.36
Manhattan
0.73
81
0.82
48.34
0.95
26.67
Queens
0.94
209.32
0.86
158.75
0.97
76.87
Staten Island
0.75
96.23
0.44
84.53
0.81
64.20
RMSE and R2 scores of test data for LSTM predicting deaths and cases in NYC Boroughs.
LSTM prediction of the number of deaths and cases per borough data
We also evaluated a prediction of the number of deaths and cases per borough considering the turnstile entry data of the boroughs.
Prediction of the number of deaths per borough
Table 3 summarizes the R2 and RMSE of the prediction of deaths per borough. Each of these scores was estimated using NYC-subway turnstile entry data per borough for each of the three feature sets used; A, B, and C. The table shows that R2 is the highest for Queens for all feature sets.The RMSE is the smallest for Manhattan and Staten Island for all feature sets. These values combined with their R2 shows that the correlation is high. The RMSE is highest for Brooklyn and Queens in general. During the analyzed period of time, these two boroughs reported the highest number of deaths.
Predictions of the number of cases per borough
Table 3 also shows R2 and RMSE for the number of cases per borough. Different from deaths, cases shows high R2 for the five boroughs; and reaching 0.97 for Queens with Feature set C. However, their RMSE is also higher than those for deaths. These scores present close similarities to those obtained at NYC level (i.e., all boroughs together).
Predictions of the number of deaths by LSTM per borough
Fig. 5 shows the predicted number of deaths for each of the five NYC boroughs. This figure shows the bottom and top performing feature sets; Feature sets A and C, respectively. The figure shows that for Feature set A, Queens shows the highest similarity to the actual values, and Manhattan and Staten Island the lowest similarity to the actual values. As for Feature set C, the prediction of deaths is very similar to the actual data for the five boroughs. Manhattan and Staten Island reported the smallest number of cases during this period. Therefore, Feature set C confirms the high correlation of deaths and turnstile data for these boroughs.
Fig. 5
LSTM predictions of deaths per borough: (a) Bronx, (b) Brooklyn, (c) Manhattan, (d) Queens, and (e) Staten Island.
LSTM predictions of deaths per borough: (a) Bronx, (b) Brooklyn, (c) Manhattan, (d) Queens, and (e) Staten Island.
Predictions of the number of cases by LSTM per borough
Fig. 6 shows the predicted number of cases for each of the five NYC boroughs. Bronx, Brooklyn, and Queens reported the larger number of cases during this period. Different from the predicted number of deaths, Feature sets A and C closely follow the actual data on reported cases for the five boroughs. Such predictions are very similar to the one obtained for NYC as a whole.
Fig. 6
LSTM results of cases predictions in (a) Bronx, (b) Brooklyn, (c) Manhattan, (d) Queens, and (e) Staten Island.
LSTM results of cases predictions in (a) Bronx, (b) Brooklyn, (c) Manhattan, (d) Queens, and (e) Staten Island.These results show that the LSTM predictions can be transported to smaller segments of the city. They are consistent with those obtained for the entire city.
Estimation of trends and dates of minimum cases and deaths
Fig. 7(a) shows that considering the time series of the number of deaths on May 16, 2020, the lower bound (with 95% confidence interval) of ARIMA forecast zero deaths on that date. On the other hand, the upper bound does not cross zero, indicating that the number of deaths could also increase. Fig. 7(b) shows the ARIMA backward forecast of the number of deaths to estimate the date when the first death may have occurred in NYC. The graph shows that the forecast date of the first death could have occurred on Mar. 9, 2020, two days earlier than the first reported death. Fig. 7(c) shows the ARIMA forecast of the progression of the number of COVID-19 cases in NYC after May 15, 2020 (last day of reported data). The forecast follows the decreasing trend of the number of cases, reaching zero on Jun. 20, 2020. However, this is a very early date as the number of cases has not reached zero by the time this paper was written. Fig. 7(d) shows the backward forecast of the number of cases to estimate the date of the first case in NYC. The lower and upper bounds of 95% confidence interval of ARIMA forecast shows that the first case of COVID-19 in NYC might have occurred between Jan. 28 and Mar. 20, 2020. The first case was reported on Feb. 29, 2020, but the actual first case is unknown. Table 4 shows the first reported COVID-19 cases (Feb. 29, 2020) and deaths (Mar. 11, 2020) in NYC DOH dataset.
Fig. 7
ARIMA (a) forecast of the number of COVID-19 deaths, (b) backward forecast to find the approximate dates of the occurrence of the possible first death, (c) forecast of the number of cases, and (d) backward forecast to find the approximate dates of the occurrence of the possible first case.
Table 4
The first reported COVID-19 case and the first death in NYC DOH dataset and the projected dates with ARIMA model.
Date
Reported
Projected
First death
Mar. 11, 2020
Mar. 9, 2020–Mar. 21, 2020
First case
Feb. 29, 2020
Jan. 28, 2020–Mar. 20, 2020
Date with 0 deaths
___
May 16, 2020 - ___
Date with 0 cases
___
___ - Jun. 20, 2020
ARIMA (a) forecast of the number of COVID-19 deaths, (b) backward forecast to find the approximate dates of the occurrence of the possible first death, (c) forecast of the number of cases, and (d) backward forecast to find the approximate dates of the occurrence of the possible first case.The first reported COVID-19 case and the first death in NYC DOH dataset and the projected dates with ARIMA model.
Discussion
The reporting of cases has been affected by the deployment and execution of testing for SARS-CoV-2. The effective and widespread testing offers better identification of the cases and their locations. But because of the time needed to detect a likely case, the limitation of testing capabilities at the beginning of the pandemic, and the possible occurrence of asymptomatic cases, the reported COVID-19 cases are likely inaccurate.The estimations performed with ARIMA are based on the provided data, with no additional information. The discrepancies on the estimated dates and the reported dates are the product of the model using solely past data from the two analyzed datasets. Public health suggestions on the use of face covering and social distancing are not considered in the models to reflect the multi-variant effects in linear regression. The use of face coverings adopted at some point in time or neglecting social distancing may affect the actual trends that past data may not be able to forecast.Looking ahead, comparisons of the incidence of COVID-19 in NYC with cities that experienced an early face-covering measure is of interest for forecasting future incidence or containment of the spread of this virus.
Conclusions
We have shown that the NYC turnstile entry data and the COVID-19 deaths and cases data in NYC from March 1, 2020 to May 15, 2020 are highly correlated. We have also shown that the correlation is also detectable for the data of each individual borough of NYC. However, each borough shows a different correlation strength that is probably associated with their geographical location and economical status. The consideration of different incubation-symptomatic phases for subway users as features through Long Short-Term Memory analysis confirms and enhances this correlation. Based on the dataset on COVID-19 deaths and cases in NYC, we have also shown the predicted progression of deaths and cases beyond the dates of the reported data using the Auto-Regressive Integrated Moving Average forecast model. This progression shows the possible decline of COVID-19 incidence but also a probable increase. We have also shown a backward forecast of the possible dates when the first case and death might have occurred at the beginning of the pandemic in NYC. The backward forecast of the first death shows high accuracy.
Declaration of competing interest
The authors declare having no conflict of interest.