Pei-Sheng Lin1,2. 1. Institute of Population Health Sciences, National Health Research Institutes, Miaoli, Taiwan. 2. Department of Mathematics, National Chung Cheng University, Minxiong, Taiwan.
Abstract
Identifying transmission of hot spots with temporal trends is important for reducing infectious disease propagation. Cluster analysis is a particularly useful tool to explore underlying stochastic processes between observations by grouping items into categories by their similarity. In a study of epidemic propagation, clustering geographic regions that have similar time series could help researchers track diffusion routes from a common source of an infectious disease. In this article, we propose a two-stage scan statistic to classify regions into various geographic clusters by their temporal heterogeneity. The proposed scan statistic is more flexible than traditional methods in that contiguous and nonproximate regions with similar temporal patterns can be identified simultaneously. A simulation study and data analysis for a dengue fever infection are also presented for illustration.
Identifying transmission of hot spots with temporal trends is important for reducing infectious disease propagation. Cluster analysis is a particularly useful tool to explore underlying stochastic processes between observations by grouping items into categories by their similarity. In a study of epidemic propagation, clustering geographic regions that have similar time series could help researchers track diffusion routes from a common source of an infectious disease. In this article, we propose a two-stage scan statistic to classify regions into various geographic clusters by their temporal heterogeneity. The proposed scan statistic is more flexible than traditional methods in that contiguous and nonproximate regions with similar temporal patterns can be identified simultaneously. A simulation study and data analysis for a dengue fever infection are also presented for illustration.
Cluster analysis is a useful tool to explore underlying structures of a stochastic process and relations between observations by grouping items into categories according to their similarity.
This analytical approach has long been of great importance in applications of epidemiology, including contagious disease surveillance. We are here particularly interested in clustering geographic units with similar time series of elevated risks while accounting for spatial‐temporal correlation. A motivation for this article is the study and surveillance of dengue fever infection in Taiwan. Dengue fever is a disease caused by the flavivirus, which is transmitted by certain species of mosquito of the Aedes type, whose incubation period has a seasonal cycle in Taiwan. Due to climate change, dengue is now a global problem, and it has had outbreaks in more than 110 countries since the Second World War. Dengue fever infection also had consecutive outbreaks in Taiwan in 2014 and 2015, which caused 15,732 and 43,784 confirmed cases, respectively. Since there is no effective vaccine or specific medicine to treat dengue infection, dengue prevention usually puts more focus on vector control, such as chemical spraying and physical removal of breeding sites. A surveillance system that can map clusters of cases to explore epidemic propagation is thus essential to preventing a dengue outbreak.Epidemic propagation is a process by which an infectious disease spreads from its origin. Taiwan, an island straddling the Tropic of Cancer, is an excellent place to investigate the epidemic propagation of dengue fever. Geographically, Taiwan is surrounded by China, Japan, and some southeast Asian counties. While dengue pandemics occurred often in southeast Asian counties, dengue outbreaks have been found in East Asian countries just in recent years. Since Taiwan forms an isolated geographic environment and most dengue epidemics initiated from imported cases, a better understanding of the epidemic propagation could help other countries prevent their own dengue outbreaks. For dengue progression, the incubation period of mosquitos is a pivot factor. Since the incubation period depends on climate and environmental situations,
the epidemic diffusion would usually be related to an evolution of disease clusters that have space and time interaction.
An exploratory spatial analysis such as kernel density estimation may thus not be suitable to explore complex diffusion processes.
On the other hand, a disease mapping model that can characterize different temporal trends into their corresponding geographic clusters would be useful to evaluate epidemic progression in time and space for disease surveillance. For this purpose, we aim to develop an identification method to group regions into various geographic clusters by their temporal heterogeneity of incidence rates.To cluster geographic regions that have similar time series, we note that these regions should also have similar annual incidences. So, in the first stage of the identification procedure, we modify the quasi‐likelihood (QL) scan statistic by Lin et al
for the dengue data to find geographic clusters that are in proximity and have similar elevated (annual) disease rates. Then, proceeding from the geographic clusters identified by the spatial scan statistic, in the second stage we use a combination procedure based on a chi‐squared test for comparison of temporal risks to regroup the clusters. Instead of considering candidate clusters with sizes up to the same order of the sampling regions, the proposed identification procedure uses a concept similar to local regression with restriction for sizes of candidate clusters in the first stage. By knitting the identified geographic clusters into temporal‐heterogeneity (TH) clusters with suitable selection procedures in the second stage, the proposed method can sense regions that are distant from each other but have similar temporal trends. For convenience, we refer to the proposed method as a TH scan statistic.When an infectious disease can be transmitted by persons in transportation, identification of nonproximate geographic clusters that share temporal similarity would provide useful information to track multiple routes of propagation that come from a common source, for example, as shown in the data analysis of Section 5. However, traditional scan statistics are mainly designed to cluster contiguous regions with similar mean infection rates.
On the other hand, although some classification methods, such as the k‐means method, can be used to group nonproximate spatial‐temporal units with similar event rates, there is no guarantee that temporal patterns can be observed from the grouped units since the k‐means method does not supervise the known spatial structure. Additionally, the above method generally does not take correlation between observations into account. The TH scan statistic, on the other hand, can bridge clustering and classification methods. For example, in the first stage of the identification procedure, our method bears a similarity to the traditional clustering approaches by using the spatial scan statistic to identify clusters with contiguous regions. And, in the second stage of the TH scan statistic, we merge geographic clusters that may not be in proximity but have the same temporal patterns into TH clusters, which results in an identification result similar to the classification method. The TH scan statistic can thus not only identify geographic clusters in arbitrary shapes but also summarize temporal waves of outbreaks over the whole time period.The remainder of this article is organized as follows. In Section 2, we establish cluster models and QL estimation methods for the proposed models. In Section 3, the QL function is adapted to develop a spatial scan statistic for geographic clusters, and then a chi‐squared test is built to regroup geographic clusters. A simulation study is conducted to compare the TH scan statistic with other existing approaches in Section 4. Data analysis and discussion are given in Sections 5 and 6, respectively.
CLUSTER MODELS AND ESTIMATES
Model settings
Assume that we have a spatial‐temporal data set with n geographic regions and T time periods. Let , , denote a spatial‐temporal coordinate for region i at time t, where denotes a geographic location for the centroid of region i. Let denote the total number of spatial‐temporal observations. Also, let and denote numbers of observed cases and people at risk, respectively, in region at time t. Assume that is affected by a spatial‐temporal noise from a zero‐mean Gaussian random field with variance and correlation for and . With consideration of regional noises, we are interested to know whether TH clusters for elevated infection rates exist. Let H denote a null hypothesis that the observations do not have a geographic cluster (but could have spatial‐temporal correlation). Under H, an expected number of cases in region at time t is given by . A spatial‐temporal (standardized) incidence rate is therefore given by .We now fit a spatial‐temporal cluster model for . In this article, each TH cluster is assumed to have its own temporal pattern, which is determined mainly by a risk coefficient within the given cluster in each time t. Let , denote disjoint TH clusters, and let denote an indicator variable such that if . Let denote a log risk associated with cluster at time t. Conditional on , , and , an integrated model is given by
where denotes an intercept parameter. Note that model (1) is different from a traditional generalized linear model (GLM) since both and are unknown. Thus, a traditional GLM method cannot be directly applied for estimation, and therefore the concept of scan statistics is later used to address the related estimation problem in Section 2.2.To find geographic clusters by scan statistics, we next develop a spatially marginal model associated with from (1). Let and denote sums of observed and expected cases, respectively, in region over all time periods. A spatial (standardized) incidence rate in region can thus be defined as . When for all , the spatial incidence rate can be expressed as . Let denote a yearly log‐risk associated with cluster and let denote a spatial noise associated with region . It then follows from (1) that , where denotes an intercept and , as derivation can be seen in the Appendix. So, conditional on and , we can model by
where is from a Gaussian process with mean zero, variance , and correlation for and .Finally, we set a covariance structure for the responses by assuming that is spatial‐temporally separable. That is, , where and denote the spatial and temporal correlations, respectively, with a Euclidean distance between and and a time lag between t
and . Also, let denote a little o function. We make the following assumptions for the dependence structure.(a) is a positive‐definite correlation function satisfying as . (b) The temporal correlation decays exponentially in lag .We note that in model (1), is associated with time t, and therefore the temporal structure of within a TH cluster is almost included in . Thus, it is reasonable to assume that the temporal correlation of is negligible in the sense that as , as described in Assumption 1(b). It then follows from the definition of that . This implies that and . This relationship will be used to connect spatial and spatial‐temporal variations in later sections.
QL estimating equations for scan statistics
For estimation of unknown parameters, we rely on QL estimating equations, which involve first‐order and second‐order moments. To develop a spatial estimating equation for model (2), let denote the corresponding marginal mean. By moment generating functions for a normal distribution, we can get and . Let denote an matrix with the element to be
Let , let , and let denote a diagonal matrix formed by . The covariance matrix of can then be expressed as . Let denote a derivative matrix of with respect to . The QL estimating equation for the spatial model (2) is given by
where denotes an inverse matrix of . We refer to (4) as the spatial estimating equation, and to the estimates and as the QL estimates for and , , respectively. It is easy to see that the derivative matrix of with respect to is symmetric, and therefore the QL estimating equation has a unique root in this setting.Similarly, we develop a spatial‐temporal estimating equation for model (1). Let denote the temporally marginal data and let denote the ST data. Also, let . Then, it follows from the moment generating function that . Let and . Let denote an matrix with the element to be . By Assumption 1, the covariance matrix of can be approximated by
where , is an identity matrix of size T, and denotes a Kronecker product. Let denote the time series in cluster , and let . Also, let denote a derivative matrix of with respect to . The spatial‐temporal estimating equation,
is then used to estimate the parameters , where denotes an inverse matrix of . We refer to and as the QL estimates for and , respectively.In practice, locations of the true clusters are usually unknown; therefore, the scan statistic is commonly used to identify the clusters. Let denote a candidate geographic cluster, and let denote a collection of the candidate geographic clusters, where denotes the total number of candidate clusters. Let denote a cardinality for set A. To ensure asymptotic properties for the QL estimating equation on , the size of each candidate cluster should be restricted.
We make an assumption for the size of .For each candidate cluster , , we require for some .The idea of Assumption 2 for the spatial scan statistic is similar to the concept of local regression. In spatial epidemiology, identification of localized clusters is important for exploration of disease transmission. However, in some areas with uneven terrains, close regions may still have different disease transmission patterns. For this reason, restriction for the sizes of candidate clusters with local models may not only be useful to obtain statistical inference for the proposed scan statistic, but also provide ability to detect regions that have abrupt changes in disease rates. Thus, by applying the concept of local regressions in the identification process, the proposed scan statistic could more accurately identify temporal trends by piecewisely regrouping clusters with suitable criteria. In Section 3.2, we propose a test statistic to combine geographic clusters with similar patterns into a larger cluster. On the other hand, although the size of candidate clusters is limited, the scan statistic used in this article can allow the candidate clusters to be arbitrary shapes.By Assumptions 1 and 2, it is reasonable to assume that converges to a positive‐definite matrix as . Let denote evaluated at the estimated parameters. Thus, under suitable conditions, we have
in distribution as , where denotes an inverse matrix of . Similarly, we can assume that converges to a positive‐definite matrix as . Under Assumption 2 and suitable conditions, we have asymptotic properties for the spatial‐temporal estimating Equation (6). Details can be seen in the Appendix.
IDENTIFICATION FOR TH CLUSTERS
Scan statistics for geographic clusters
To identify local spatial clusters associated with , we first use an independent scan statistic to search a collection of single clusters. Given the single cluster , we fit a regression model for by
Under an independence assumption (ie, ), we estimate and for model (8) by the QL estimating Equation (4), . Since we have estimated cluster coefficients to be evaluated under the null hypothesis H, the Benjamini‐Hochberg procedure
is used to control a false discovery rate (FDR) at level . However, after the Benjamini‐Hochberg procedure, some of the initial estimated clusters could have overlapping regions, which would cause some problems in statistical inference for a multiple‐cluster model. To address this issue, a partition procedure for the initial estimated clusters is proposed to make these clusters disjoint. Details of the Benjamini‐Hochberg and partition procedures can be seen in Section 5.2.Let denote the estimated clusters by the independent scan statistic after the partition procedure. We then evaluate whether the independence assumption is suitable. Based on , a multiple‐cluster model for spatial data is given by . The estimating equation (4) under the independence assumption is used again to obtain estimates for . Residuals can then be computed by . For spatially correlated data, a variogram, , is commonly used to measure spatial dependence. Under suitable conditions, can be estimated by an empirical variogram , where denotes a collection of pairs with distance h
apart. We obtain estimates for by a weighted least squares estimation
where denotes a specific variogram model. In (9), the used weights are the numbers of pairs .If is not significantly away from zero, then are the estimated clusters for the spatial data , and the identification procedure jumps to the combination procedure shown in Section 3.2. When the correlation estimate from (9) is significantly away from zero, it provides evidence for the existence of spatial correlation. In this situation, a spatial scan statistic by adding empirical estimates and into the estimating equation is developed below to calibrate estimated clusters from the independent scan statistic. Specifically, let and . We then use the estimating equation (4) to estimate . However, in practice, the true covariance parameters in are usually unknown. We therefore implement the estimates and from (9) into and , respectively, for of (3). Let denote the estimate of from the variogram method, and let . The spatial estimating equation for the multiple‐cluster model associated with now becomes , where is an inverse matrix of .Let denote the corresponding QL estimates for . By the limiting distribution (7), we can obtain the P‐values of . The significance of each estimated cluster , , is evaluated again by the corresponding P‐value. For those estimated clusters with nonsignificant P‐values, we remove them from the collection of estimated clusters. Let , , denote the significant identified clusters by the spatial scan statistic.
A combination procedure for geographic clusters
To regroup geographic clusters such that temporal patterns are different between TH clusters but similar within the clusters, we first estimate the temporal pattern in each geographic cluster , . For the given cluster , a spatial‐temporal model from (1) for the marginal mean is given as
where denotes a cluster coefficient associated with at time t. Note that in (10), the intercept parameter is set to be the same for all k, , so that the temporal patterns can be compared under the same baseline. The spatial‐temporal estimating equation (6) is then applied to estimate the parameters . Since the true covariance matrix of (5) is unknown, we use to replace in the spatial‐temporal estimating equation, where is the estimate of with and being estimated by the variogram method. Let denote the estimated temporal pattern by the QL estimating equation for , .Based on the estimated temporal patterns, we now propose a chi‐squared test to evaluate whether two geographic clusters and (that may not be adjacent) should be combined. Let and . Let and denote derivative matrices of and with respect to and , respectively. By Assumptions 1 and 2, it is reasonable to assume that as . The chi‐squared test statistic,
is proposed to evaluate temporal heterogeneity between the two geographic clusters and , where denotes evaluated at the estimated parameters. When the two geographic clusters and have the same temporal pattern, follows a chi‐square distribution with degrees of freedom T. Thus, and are combined if , where denotes a ‐percentile of the chi‐squared distribution with degrees of freedom T. In the data analysis, we present a regrouping procedure based on a forward selection procedure associated with the chi‐squared statistic . Details can be seen in Section 5.3. Note that this procedure can also be applied to further partition larger clusters into smaller ones.Let denote the final TH cluster from the regrouping procedure. For the spatial‐temporal model (1) associated with , we apply the spatial‐temporal estimating Equation (6) again to estimate for . The expected standardized incidence rate can thus be estimated from the TH cluster model by . The whole identification procedure shown in Section 3 is therefore called the TH scan statistic method.
SIMULATION
In this section, we conduct a simulation study by using the geographic structure of Kaohsiung City, which consists of 891 villages. The purpose of the simulation study is to evaluate whether the proposed method can cluster hot‐spot villages whose temporal patterns of elevated incidence rates are the same. Let denote a collection of the 891 villages, and let denote the administrative centroid for the ith village. Since Kaohsiung is partitioned by rivers and hills, using nearest neighbors to construct candidate clusters would probably be better than traditional distance methods. We thus use the concept of nearest neighbors to define candidate clusters. Specifically, for a given village , we define its lth‐order neighbors to be a collection of villages that share a common border with , , with . Similar to the procedure used in Section 5, candidate clusters associated with the centroid are unions of its neighbors up to the third order. That is, the candidate clusters associated with are given by . More details on how to choose centroids to make candidate clusters can be seen in the data analysis of Section 5. In total, there are 855 candidate clusters. Besides using the TH scan statistic, we also employ the SaTScan method with two different settings to analyze the simulated data for comparisons.In the simulation study, we consider two scenarios. The first mainly follows the dengue data analysis result shown in Section 5. Five geographic clusters identified by the spatial scan statistic in the data analysis, , are chosen as hot‐spot regions. Figure 1A shows the locations of on a map of downtown Kaohsiung with , , , , and . (Downtown Kaohsiung consists of 528 villages with most dengue cases happening there.) Four of the five geographic clusters, , are set to have the same temporal pattern, while geographic cluster has its own temporal pattern (Figure 1B). Let and denote the TH clusters. The total numbers of villages in and are 34 and 77, respectively. As can be seen in Figure 1B, some villages in the TH cluster are in proximity, but others are not contiguous. Also, some villages (cluster ) in the TH cluster are next to the TH cluster , while clusters and have similar annual incidence rates but different temporal patterns. These characteristics may make traditional scan statistics difficult to accurately identify the TH clusters and .
FIGURE 1
In the dengue data analysis, locations of the estimated clusters by the (A) spatial scan statistic and (B) regrouping procedure for temporal heterogeneity
In the dengue data analysis, locations of the estimated clusters by the (A) spatial scan statistic and (B) regrouping procedure for temporal heterogeneityIn the second simulation scenario, we choose two proximate geographic clusters and (representing ) with various sizes and , or 1. The temporal patterns set in and are also the same as those set in the first simulation scenario. Note that the geographic clusters and belong to the TH clusters and , respectively, and therefore the temporal patterns in and are different. Figure A1 in the Supplementary Material depicts locations of the half‐size geographic clusters, say, and . We use the second simulation scenario to evaluate finite sample properties for the TH scan statistic. To simulate responses for both scenarios, we first generate a spatial‐temporal noise from a Gaussian random field with mean zero, variance , 0.02, or 0.05, and correlation . In the simulation, the spatial correlation function is set to be , which is the same as that given in (13) from the data analysis. Also, due to seasonal dengue infection in Taiwan, we set if t and are both in the interval [30, 51], and 0, otherwise. In the simulation setting, we consider , 0.1, or 0.3. Let denote the simulated spatial‐temporal noise for the rth simulation run, , where R denotes a total number of simulation runs. Given , a temporal pattern of responses for village , , is simulated by , , where denotes an indicator variable for whether is in , and values of , , are the same as those given in model (16). (Figure 2B also depicts , , .) In the simulation, we simulate replicates for each setting.
FIGURE 2
Temporal patterns of dengue infections for the 2014 Kaohsiung data. (A) Averages of observed incidence rates in the identified clusters (clusters and ) and whole study area (Kaohsiung City). (B) Estimated incidence rates by the spatial‐temporal estimating equation in the identified clusters
Temporal patterns of dengue infections for the 2014 Kaohsiung data. (A) Averages of observed incidence rates in the identified clusters (clusters and ) and whole study area (Kaohsiung City). (B) Estimated incidence rates by the spatial‐temporal estimating equation in the identified clustersIn applying the TH scan statistic for the simulated data, we conduct an identification procedure similar to the one used for the data analysis of Section 5. Recall that the collection of candidate clusters used for the TH scan statistic is the same as that in the data analysis (Section 5.1), which has 855 candidate clusters. The level of significance for the FDR is controlled at . In use of the SaTScan software, the first setting for SaTScan, say , is to use the default setting, which means each candidate cluster including up to 50% of the population. On the other hand, in the second setting for SaTScan, say , the radius of each candidate cluster is restricted to a maximum value of 2.5 km, which is twice the range of the estimated spherical correlation function shown in model (13). For each scan statistic method, we use to denote the estimated cluster that has the greatest number of overlapping regions with , . Additionally, we use to denote a union of estimated clusters other than and . For convenience, we also use , , to denote the corresponding in the rth simulation, .In the first simulation scenario, the TH scan statistic has very close identification results in all the simulation settings (, 0.02, 0.05, and temporal correlation , 0.1, 0.3). Also, the and scan statistics also present similar identification results in all the simulation settings. The simulation results may thus indicate that the proposed method is quite robust when the sample size and correlation satisfy certain conditions, as we discuss further in Section 6. Therefore, in the main context, we present only the simulation result for and , while the other identification results are shown in the Supplementary Material. Table 1 lists average numbers of villages in , , which are classified into , , by each scan statistic. That is, in Table 1, we compute for each and . We also use Figure 3 to show the location of identified by each scan statistic in part of the Kaohsiung map. (The identified clusters are very similar in all simulation runs.) For each , let denote an average temporal pattern over the simulation runs, where , . We use as an estimated temporal pattern for cluster . Figure 4 depicts the estimated temporal pattern for each identified cluster by each scan statistic.
TABLE 1
The average number of regions in the true cluster , that are classified into , , by the TH scan statistic and two SaTScan methods ( and ) for and temporal correlation
C1
C2
C‾
|C1|=34
|C2|=77
|C‾|=780
TH
Ĉ1
30.0
0.0
0.0
Ĉ2
4.0
73.8
4.7
Ĉ3
0.0
3.2
4.5
LR1
Ĉ1
34.0
74.7
167.5
Ĉ2
–
–
–
Ĉ3
–
–
–
LR2
Ĉ1
29.0
22.0
26.0
Ĉ2
0.0
44.0
35.0
Ĉ3
5.0
1.0
65.0
Note: denotes a union of clusters other than and , and denotes regions that are not in the true clusters. The simulation result is based on 200 replicates.
FIGURE 3
In the simulation, locations of the estimated clusters identified by the (A) TH scan statistic, (B) LR scan statistic, and (C) LR scan statistic
FIGURE 4
In the simulation, average temporal patterns within the identified clusters by the (A) TH scan statistic, (B) LR scan statistic, and (C) LR scan statistic
In the simulation, locations of the estimated clusters identified by the (A) TH scan statistic, (B) LR scan statistic, and (C) LR scan statisticIn the simulation, average temporal patterns within the identified clusters by the (A) TH scan statistic, (B) LR scan statistic, and (C) LR scan statisticThe average number of regions in the true cluster , that are classified into , , by the TH scan statistic and two SaTScan methods ( and ) for and temporal correlationNote: denotes a union of clusters other than and , and denotes regions that are not in the true clusters. The simulation result is based on 200 replicates.To ensure no ambiguity, in the following discussion, we also use , , and to denote the identified clusters , , by the TH, LR, and LR scan statistics, respectively. First, for the LR scan statistic, Table 1 shows that it identifies only one big cluster, , which includes all the villages of , 97% of the villages in , and many villages not in the true clusters. Note that the number of villages in that do not belong to any true clusters is about 168. Figure 4B maps the identified cluster by the LR scan statistic, indicating that about 61% of the villages in are not in any true clusters. Also, we find from Figure 3B that only one peak in the estimated temporal pattern has been identified by the LR scan statistic, while the true simulation setting has two waves of outbreaks (Figure 2B). Since is mixed with almost all villages of and , it is hard to compare the LR scan statistic with the other scan statistics. So, only the TH and LR scan statistics will be evaluated next for identification performance.We use two criteria, the true positive rate (TPR) and precision, to compare the identification performance of for , , between the TH and LR scan statistics. Specifically, for , we define the TPR and precision by
, respectively. We also use and to denote the corresponding value by the TH and LR scan statistics, respectively, and likewise and for . For the identification results of , Table 1
shows that the TPRs of by the TH and LR scan statistics are and , respectively. A t‐test then shows no significant difference in the TPR of between both scan statistics. Nevertheless, for the precision of , Table 1
gives and for the TH and LR scan statistics, respectively. By using the t‐test to evaluate the difference between the precisions of , we find that the TH scan statistic performs significantly better than the LR scan statistic (P‐value ) in the simulation. As can be seen from Figure 2C, the LR scan statistic is also unable to identify two outbreaks of epidemics in , while Figure 3A indicates that the TH scan statistic can accurately predict the true temporal pattern in .For the identification result of , Table 1
shows that for the TH and LR scan statistics, the respective TPRs are and , and the respective precisions are and . These values indicate that, in the simulation for , the TH scan statistic has significantly better performance than the LR scan statistic in the TPR (P‐value ) and precision (P‐value ). To see the reason, we note from Figure 4C that the geographic cluster , which belongs to but is next to , is misclassified into the circular cluster by the LR scan statistic, while Figure 4A depicts that the TH scan statistic can identify almost all villages in and into and , respectively. Additionally, in the simulation result for , Figure 4C also shows the LR scan statistic groups the geographic clusters to and some villages not in the true cluster into a “big” circular cluster . These results may reveal that the LR scan statistic has a tendency to overestimate cluster regions in circular shapes, while the TH scan statistic could identify localized clusters more precisely. So, by the simulation study, we learn that, under suitable conditions for cluster sizes, the TH scan statistic could be more flexible than the LR scan statistics in detecting TH clusters with arbitrary shapes.In our study of how cluster size could affect scan statistics, from the second simulation scenario, we find that when , simulation results are close to those in the first simulation scenario. That is, the TH scan statistic can still identify and very well when . However, when , the TH scan statistic could fail to identify and as two different TH clusters. Specifically, the TH scan statistic would combine regions in and together as an estimated cluster. This simulation result may thus indicate that, for two different TH clusters in proximity, each cluster should contain at least a number of geographic units so that the TH scan statistic can work well. Note that and , which are close to the value of . For the SatScan
method, both the and scan statistics fail to classify and separately in the second simulation scenario. That is, the and scan statistics identify only one big cluster in all the settings of the second simulation scenario. Finally, for the TH scan statistic, we also conduct another simulation study by using a collection of candidate clusters that include all villages as centroids. Corresponding simulation results are very similar to those shown in Table 1.
ANALYSIS FOR DENGUE DATA
Background
To illustrate the TH scan statistic, we use the 2014 Kaohsiung data collected by the Taiwan Centers for Disease Control (https://od.cdc.gov.tw/eic/DengueDailyEN.csv) to analyze the propagation pattern of the dengue infection. The Kaohsiung city consists of 891 villages with a total of 2,778,992 persons. In 2014, Kaohsiung experienced one of its largest dengue outbreaks with 15,043 confirmed cases and 20 deaths. Let denote the number of population at risk in village . Under the null hypothesis H, the expected number of dengue cases in village at week t, , can be expressed by , where denotes the average weekly infection rate. Let denote the (weekly) number of cases in village at week t. The spatial‐temporal incidence rate (or, weekly incidence rate in ) for the dengue infection in village at week t is thus given by . Also, let denote the (yearly) number of dengue cases in village . We then define the spatial incidence rate (or, yearly incidence rate at ) of the dengue infection in village by .As have been explained in Section 4, we use the nearest neighborhood structure to define candidate clusters because of the geographic surface in Kaohsiung. To construct candidate clusters for the QL scan statistic, we first make two remarks for characteristics of the dengue data. First, in the Kaohsiung data, most dengue cases (87%) happened in downtown Kaohsiung (with 528 villages), while other dengue cases seemed to randomly scatter in remote villages. Figure 5 shows an image map for spatial incidence rates of the dengue infection in downtown Kaohsiung. So, candidate clusters are set only in downtown Kaohsiung. Second, in the dengue data, we find that some villages had yearly incidence rates lower than the expected value (ie, ), but for some . This phenomenon is probably due to dengue‐control intervention that was immediately undertaken after dengue cases were found in these villages. To explore how the epidemics “naturally” propagated, the village with less than the expected value would not be considered as a “center” of candidate clusters. (Nevertheless, these villages with could still be included in some candidate clusters if their neighbors had yearly incidence rates greater than one.) Finally, recall that in Assumption 2, the number of villages in each candidate cluster should be restricted. A candidate cluster with a center at is thus limited to include villages up to . Let denote a set of villages with greater than one. The collection of candidate clusters is thus given by . The number of villages in the largest cluster among is 41, which is just little higher than . In total, 855 candidate clusters are used in the data analysis. For convenience, we use , , to denote a given candidate cluster.
FIGURE 5
A heat map for the spatial incidence rates of the dengue infection in downtown Kaohsiung in 2014
A heat map for the spatial incidence rates of the dengue infection in downtown Kaohsiung in 2014
Cluster identification for spatial data
For the candidate clusters in , we first use the independent scan statistic to find localized clusters. In the first step of the independent scan statistic, we test significance for each single cluster . Given , we fit a scan model by , and then estimate parameters and by the estimating Equation (4) under the independence assumption. We thus have 855 estimated cluster coefficients . To address the multiple‐testing issue, we use the BH procedure to control the FDR at . Specifically, let denote a P‐value associated with , and let denote the mth order statistic for . (That is, .) Also, let denote the cluster associated with . The BH procedure will select a candidate cluster as an estimated cluster if . In the data analysis, 22 single clusters, , are selected as (initial) estimated clusters after the BH procedure. The total number of villages in is 137.However, some of had overlapping regions, which would cause a collinearity problem in the multiple regression model. To avoid collinearity, we use a forward selection procedure to partition into disjoint clusters. Specifically, for , we sequentially updated each initial estimated cluster by the following criteria:After the partition procedure, we have 13 disjoint clusters, . Note that the total number of villages in is still 137.To investigate whether the independence assumption is suitable for the cluster model associated with , we fit a multiple regression model by , and obtained residuals under the independence assumption. By applying the variogram method (9) for the residuals , we get an estimate for the spatial correlation function by
where the value of 1.23 (km) denotes the range for the spherical correlation. Also, an estimate for the spatial variance is given by .A permutation test for the variogram estimates and suggests the existence of spatial correlation for the cluster model. We hence implement and into the spatial estimating Equation (4) to reestimate parameters in the multiple cluster model. The limiting distribution (7) is used to evaluate P‐values for estimated parameters. We find that only five geographic clusters are significant with associated P‐values less than .05. Let denote the resulting estimated (geographic) clusters by the spatial scan statistic. The total number of villages in is 111 with , , , , and . Figure 1A shows the locations of the five geographic clusters, which include most historic hot spots of the dengue infection.
Clusters for temporal heterogeneity
We next propose a combination procedure based on the test given in Section 3.2 to regroup the estimated clusters by their temporal heterogeneity. For each , we fit a spatial‐temporal model by
where follows a Gaussian process with mean 0, variance , and the covariance structure given by (5) with being implemented by of (13). The value of 0.07 in (14) comes from a logarithm of an average of all spatial‐temporal responses. Since the dengue infection was seasonal in Taiwan in 2014 (Figure 2A), to reduce estimation variation caused by the log‐transformation of small disease rates, we directly set if . That is, we set if , where .Let denote a vector of the log‐spatial‐temporal risks. We use the spatial‐temporal estimating Equation (6) to estimate for each cluster . The test statistic in (11) is then computed to measure heterogeneity between and . Let denote a 95th percentile of the chi‐squared distribution with 52 degrees of freedom. If some values of are less than , then the pair of geographic clusters and with the smallest value of will be combined. When and are evaluated to be combined as , we then update the corresponding risk to . Table 2 shows the regrouping process for temporal heterogeneity between the geographic clusters. After the regrouping process for the dengue data, the previous five geographic clusters, , are combined into two TH clusters, say and , with and . The total number of villages in is still 111 with and . Figure 1B maps the locations of the TH clusters and .
TABLE 2
A flow chart for the regrouping process based on for geographic clusters
G1
G2
G3
G4
G5
G1
—
—
—
—
—
G2
258.7
—
—
—
—
G3
189.2
16.4
—
—
—
G4
161.7
25.8
22.4
—
—
G5
89.5
31.6
16.5
19.4
—
Note: The bold number indicates the smallest value of in each step.
A flow chart for the regrouping process based on for geographic clustersNote: The bold number indicates the smallest value of in each step.Based on the TH clusters and , a final model associated with becomes
Again, we set if . Applying the estimating Equation (6) for (15) gives an estimate for the expected disease rate by
Figure 2A shows temporal patterns for averaged incidence rates in , , and , while Figure 2B depicts estimated incidence rates for the TH clusters and . Details for the estimated log‐risks and can be seen in the Supplementary Material.
Analysis result
By comparing Figure 2A,B, we find that the estimated disease rates by the TH scan statistic are quite close to the observed values, although for cluster , the TH scan statistic would slightly underestimate the average disease rate in week 34. Also, as can be seen from Figure 2A, the identified clusters, and , had very different temporal patterns for dengue epidemics. On the other hand, an exploratory analysis for weekly disease rates across the study area shows a relatively flat AR(1) model for dengue infection. Figure 2A also indicates three waves of dengue outbreaks in the year 2014: the first was in cluster , which was transmitted to cluster and finally went back to cluster . Note that part of cluster was next to cluster , as can be seen from Figure 1B.When exploring why the first strike of the dengue epidemics happened in cluster , we find that cluster included several industrial parks, where guest workers from Southeast Asia were often employed. In Taiwan, some scientists believed that no local dengue virus existed, due to cold currents in winter.
Since dengue epidemics occur very frequently in Southeast Asian countries, in cluster , the imported cases could play a crucial role to initiate the dengue outbreaks. On the other hand, cluster , which included two major areas of Kaohsiung, had more local people and merchant activities, which could speed up dengue transmission. Specifically, the geographic cluster includes a station for the high‐speed train system, and has the main train station for Kaohsiung. Also, as shown in Figure 1A, the geographic cluster is close to a lake, which could provide a suitable environmental situation for the growth of mosquitos. This could be why cluster appears to have more severe and widespread dengue outbreaks than cluster , as shown in Figure 2A.Another interesting finding from Figure 2A is that the dengue epidemic in cluster started in week 38, about 6 weeks later than that in cluster . A cross‐correlation map (not shown here) between the number of cases and temperature also indicates that, for the 2014 Kaohsiung dengue infection, higher temperatures were associated with a larger number of cases in a 6‐week lag. Since an extrinsic incubation period of dengue is normally considered to be positively correlated with temperature, the 6‐week lag between the outbreaks in clusters and may show that the proposed method can reflect the extrinsic incubation period.We also use the SaTScan method with two different settings, the and statistics, to analyze the dengue data. However, the scan statistic identifies only one big circular cluster, which is very similar to the one identified by the LR scan statistic in the simulation. The big circular cluster can be seen in Figure 4B. So, in the data analysis, we compare the identification performance between only the TH and scan statistics. The scan statistic identifies nine geographic clusters, say, , in various time intervals. Information about the spatial‐temporal clusters identified by the scan statistic can be seen in Table A4 of the Supplementary Material. Let denote the corresponding time interval of elevated disease rates associated with , . We compute a sum of squared errors (SSE) for each scan statistic to compare the performance. In the computation of SSEs, only responses in the identified spatial‐temporal clusters with are considered. For the scan statistic, the sum of squared errors is given by , where . The total number of spatial‐temporal units identified by the scan statistic is 3778. For the TH scan statistic, let denote the residual, and let denote a vector formed by . The sum of squared errors for the TH scan statistic is , where is an estimated covariance matrix by the variogram estimation. The total number of spatial‐temporal units identified by the TH scan statistic is 1196. So, the mean squared errors (MSEs) are for the scan statistic, and for the TH scan statistic. A comparison between the value of 1.27 (=) and an F‐distribution with degrees of freedom 3378 and 1196 gives a P‐value around , which may provide evidence that the TH scan statistic could have better identification results than the scan statistic in this data analysis.
DISCUSSION
Identifying transmission of hot spots and characterizing the temporal trends are important for understanding infectious disease propagation. In this article, we propose a novel scan statistic by combining the spatial scan statistic for geographic clusters and chi‐squared test for temporal heterogeneity to identify clusters whose temporal patterns are similar within clusters but different between clusters. The proposed scan statistic could be more flexible than traditional methods in the sense that contiguous and nonproximate regions with similar temporal patterns can be identified simultaneously. Although some scan statistics, such as those produced by SaTScan, can also be used to search for geographic clusters with similar temporal trends, they are mainly designed to group contiguous hot‐spot regions as clusters in certain types of shapes. The simulation indicates that such scan statistics may thus have difficulty in distinguishing proximate clusters that have different temporal patterns. The proposed approach, on the other hand, can be used to explore different routes of epidemic propagation from a common source of an infectious disease, as we illustrated in the data analysis for the dengue infection. Although Lin and Zhu
developed a method that also can connect spatial clustering and classification approaches, their work focuses on dealing with spatial heterogeneity.To see when the proposed method could work well, three remarks summarized from the asymptotic property and simulation study are listed below. We recall that in the limiting distribution (7), the error rate for the spatial scan statistic is an order of . So, first, if one geographic cluster with a disease rate p% higher than the expected value does exist, then the true cluster should consist of at least geographic units to make the statistical inference valid. For example, if we would like all geographic clusters that are not in proximity with disease rates at least 40% higher than the expected value to be identified, then each of the true clusters should contain at least nine geographic units. In the simulation and data analysis, we find that the TH scan statistic is able to identify geographic cluster , which consists of 10 villages. Second, if two clusters are in proximity but have different temporal patterns, then by the simulation result in the second scenario, each of the true clusters should contain at least a number of geographic units such that the TH scan statistic can work well. On the other hand, if a study area consists of n geographic regions, then only TH clusters with disease rates higher than the expected value could be identified by the proposed method. Third, in the identification process of the TH scan statistic, we characterize the temporal pattern of each cluster by estimating its risk coefficient at each time t. This approach could work well for data with separable spatial‐temporal correlations, as can be seen from the simulation result. Nevertheless, the number of temporal coefficients in the corresponding model would then be an order of T, and this could make estimation less efficient, unless the study area consists of geographic units many enough. In the data analysis and simulation, the largest geographic cluster consists of villages, while the whole Kaohsiung city has villages. We find that in such a situation, the temporal pattern in each TH cluster can still be estimated well.While traditional scan statistics can generally allow larger sizes of candidate clusters, the TH scan statistic is required to restrict the size of candidate clusters up to for valid statistical inference. To evaluate whether the size limitation on candidate clusters would make the TH scan statistic infeasible to identify “big” clusters, in the simulation, the number of villages in the true clusters is set to be proportional to the number of sampling villages n. The simulation shows that, with a suitable regrouping procedure, the TH scan statistic can still identify the true clusters very accurately. Finally, because the dengue virus is transmitted mainly by the mosquito, whose range of activities and life period are usually short, the dengue infection may have different epidemic circles in different TH clusters. In this article, we consider that each cluster has its own risk coefficient at each time period. Nevertheless, an interesting extension to this article would be to combine the TH scan statistic and generalized linear dynamic model for temporal coefficients. We leave this topic for future research.Data S1 Details for simulations and data analysis with tables and figures referenced in Sections 4 and 5, and the computer code for analysis of the dengue data referenced in Section 5 are available in the Supporting informationClick here for additional data file.