Optimizing the maximum reported cluster size in the spatial scan statistic for survival data.

Abstract

BACKGROUND: The spatial scan statistic is a useful tool for cluster detection analysis in geographical disease surveillance. The method requires users to specify the maximum scanning window size or the maximum reported cluster size (MRCS), which is often set to 50% of the total population. To obtain valid and meaningful results, it is important to optimize the MRCS while keeping the maximum scanning window size as large as 50% of the total population.
RESULTS: We developed a measure, the Gini coefficient, to optimize the MRCS for the exponential-based spatial scan statistic. The simulation study showed that the proposed method mostly selected an optimal MRCS similar to the true cluster size. The detection accuracy was higher at the optimally chosen MRCS than at the default setting. The application of the method to the Korea Community Health Survey data supported the use of the proposed method to optimize the MRCS in spatial cluster detection analysis for survival data.
CONCLUSIONS: Using the Gini coefficient in the exponential-based spatial scan statistic can be very helpful for reporting more refined and informative clusters for survival data.

Keywords: Exponential model; Gini coefficient; SaTScan; Spatial cluster detection

Year: 2021 PMID: 34238302 PMCID: PMC8265152 DOI: 10.1186/s12942-021-00286-w

Source DB: PubMed Journal: Int J Health Geogr ISSN: 1476-072X Impact factor: 3.918

Background

The spatial scan statistic is a useful and widely used tool for detecting spatial or space–time clusters in disease surveillance. The method has been developed for different types of data such as count [1], ordinal [2, 3], survival [4], continuous [5-7], and multinomial [8]. The freely available software SaTScan™ [9] makes the method easily accessible to researchers. The spatial scan statistic is formulated based on the likelihood ratio test statistic. A large number of scanning windows of various sizes across all locations are first constructed on the entire study area. Each scanning window is a candidate for the most likely cluster. In SaTScan™, circular or elliptical scanning windows are considered. The likelihood ratio test statistic is calculated for each window to compare its inside and outside. The scanning window with the maximum value of the likelihood ratio test statistic is defined as the most likely cluster. Secondary clusters with high test statistic values are also reported. Cluster detection results can be sensitive to the maximum scanning window size (MSWS), as studied by Ribeiro and Costa [10]. In SaTScan™, users can specify the MSWS, which is set to 50% of the total population by default. A high MSWS and a high maximum reported cluster size (MRCS) could result in an excessively large cluster. Some researchers try different MSWS values to obtain seemingly good results without knowing the MRCS. Repeatedly performing spatial cluster detection analyses using different values of MSWS leads to a multiple testing problem, as pointed out by Han et al. [11]. We can consider different values of MRCS with a fixed MSWS to avoid this problem. Still, we need to choose the optimal value of the MRCS. The clusters reported with a subjectively chosen MRCS may be different from the true clusters. Han et al. [11] proposed a criterion measure to optimize the MRCS for the Poisson-based spatial scan statistic. 
They defined the Gini coefficient to represent the degree of heterogeneity of disease clusters for count data. Their simulation study showed that the Gini coefficient can be useful for selecting the best MRCS to obtain a refined collection of clusters. Interestingly, by reporting an optimized and refined collection of clusters rather than a single large cluster, the Gini coefficient allows us to better identify irregularly shaped ones [12]. As the formulation of test statistics of the spatial scan statistic is different for different models, the Gini coefficient should be clearly and distinctly defined for each model and thoroughly evaluated. The Gini coefficients for the ordinal- and normal-based spatial scan statistics were proposed by Kim and Jung [13] and by Yoo and Jung [14], respectively. In this paper, we defined the Gini coefficient for the exponential-based spatial scan statistic, which is used for survival data. Through an extensive simulation study under various scenarios, we showed that the proposed method is very useful for optimizing the MRCS for the exponential-based spatial scan statistic. We illustrated the method using Community Health Survey data collected by the Korea Centers for Disease Control and Prevention.

Methods

Poisson model and the Gini coefficient

When we have count data, such as the number of occurrences of a certain disease together with the underlying population at risk in a study region, we can use the Poisson-based spatial scan statistic [1]. We are often interested in identifying areas with high disease incidence rates. The null and alternative hypotheses are written as

$$H_0: p = q \quad \text{vs.} \quad H_a: p > q \ \text{ for some } z \in Z,$$

where $p$ and $q$ are the intensities of the outcome variable inside and outside the scanning window $z$, respectively, and $Z$ denotes the collection of all scanning windows. The likelihood ratio test statistic given window $z$ is expressed as

$$\lambda_z = \frac{\left(\frac{c_z}{n_z}\right)^{c_z} \left(\frac{C - c_z}{N - n_z}\right)^{C - c_z}}{\left(\frac{C}{N}\right)^{C}}$$

if $c_z / n_z > (C - c_z)/(N - n_z)$, and $\lambda_z = 1$ otherwise. In the above equation, $c_z$ and $n_z$ denote the observed number of cases and the population within window $z$, and $C$ and $N$ denote the total number of cases and the population in the whole study area, respectively. The scanning window that maximizes the value of $\lambda_z$ is the most likely cluster. Statistical inference for the most likely cluster can be performed using Monte Carlo hypothesis testing. In addition, secondary clusters with high values of the likelihood ratio test statistic are often of interest. The p-values of the secondary clusters are typically obtained in the same manner, so that each secondary cluster rejects the null hypothesis on its own strength. When reporting the most likely and secondary clusters, the Gini coefficient can be used to find a more refined collection of non-overlapping clusters. In economics, the Gini coefficient was developed to indicate the degree of heterogeneity of wealth distribution [15]. As a summary measure of the Lorenz curve, the larger the Gini coefficient, the higher the heterogeneity in wealth. Han et al. [11] adopted the Gini coefficient in the spatial scan statistic for count data to measure the degree of heterogeneity in the spatial distribution of disease cases by defining the x-axis of the Lorenz curve as the cumulative proportion of the number of disease cases and the y-axis as the cumulative proportion of the population. 
Its value is calculated as twice the area between the Lorenz curve and the 45° line, which represents the case in which the number of cases is proportional to the population of each region. When there is only one significant cluster, the Lorenz curve is constructed as a line graph connecting the three points $(0,0)$, $(x_1, y_1)$, and $(1,1)$, where $x_1$ and $y_1$ are the proportions of observed cases and population (expected cases) in the cluster. As more cases are concentrated in the cluster than expected, $x_1$ increases and the Lorenz curve moves farther away from the reference line. The Gini coefficient also increases. When we have $K$ clusters, the Lorenz curve connects $K$ points between $(0,0)$ and $(1,1)$. The coordinates of the $k$-th point are defined as $x_k = \sum_{j=1}^{k} c_j / C$ and $y_k = \sum_{j=1}^{k} n_j / N$, where $c_j$ and $n_j$ are the number of cases and the population in the $j$-th cluster, and $C$ and $N$ are the total number of cases and the total population. The Gini coefficient can be calculated as

$$G = 1 - \sum_{k=1}^{K+1} (x_k - x_{k-1})(y_k + y_{k-1})$$

with $(x_0, y_0) = (0,0)$ and $(x_{K+1}, y_{K+1}) = (1,1)$. The Gini coefficient values range from 0 to 1. From among several competing sets of clusters, we select and report the collection with the highest Gini coefficient value. See Han et al. [11] for more detailed information. The Gini coefficient has been implemented in SaTScan™ for the Poisson and Bernoulli models.
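The two calculations described in this section can be sketched in Python. This is a minimal illustration, not SaTScan™ code: the function names are hypothetical, the log likelihood ratio follows the usual Poisson form for high-rate clusters, and the Gini coefficient is computed with the trapezoid rule as one minus twice the area under a Lorenz curve that passes through the cumulative (case share, population share) points of the reported clusters.

```python
import math


def poisson_llr(c_z, n_z, C, N):
    """Log likelihood ratio of one window (hypothetical helper).

    c_z, n_z: cases and population inside the window.
    C, N: total cases and total population.
    Returns 0.0 (lambda_z = 1) unless the rate inside exceeds the rate outside.
    """
    if n_z <= 0 or n_z >= N:
        return 0.0
    rate_in = c_z / n_z
    rate_out = (C - c_z) / (N - n_z)
    if rate_in <= rate_out:
        return 0.0

    def xlogy(x, y):
        # Convention 0 * log(0) = 0.
        return 0.0 if x == 0 else x * math.log(y)

    return xlogy(c_z, rate_in) + xlogy(C - c_z, rate_out) - C * math.log(C / N)


def gini_from_clusters(case_shares, pop_shares):
    """Gini coefficient from per-cluster (non-cumulative) shares.

    case_shares[k]: cases in cluster k divided by total cases (x-axis).
    pop_shares[k]: population in cluster k divided by total population (y-axis).
    Clusters are assumed to be listed in reporting order.
    """
    xs, ys = [0.0], [0.0]
    for dx, dy in zip(case_shares, pop_shares):
        xs.append(xs[-1] + dx)
        ys.append(ys[-1] + dy)
    xs.append(1.0)
    ys.append(1.0)
    twice_area_under = sum(
        (xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) for k in range(1, len(xs))
    )
    return 1.0 - twice_area_under


# Illustrative numbers only: one cluster with 30% of cases and 10% of population.
print(gini_from_clusters([0.3], [0.1]))
print(poisson_llr(30, 100, 100, 1000))
```

With a single cluster the expression reduces to the difference between the case share and the population share, which makes it easy to check by hand.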

Spatial scan statistic for survival data

Different spatial scan statistics for survival data have been proposed based on different models, including Weibull and generalized life distributions [16, 17]. Huang et al. [4] proposed a spatial scan statistic for survival data based on an exponential model. We focused on the exponential model. The exponential-based spatial scan statistic has been used to examine geographic disparities in survival in cancer patients [18-20]. Suppose we have survival data for $I$ subjects in a study area, such as time to death for cancer patients. Let $T_i$ and $C_i$ be the survival time and fixed censoring time for the $i$-th subject, respectively. We assume that $T_i$ is exponentially distributed with probability density function

$$f(t) = \frac{1}{\theta} \exp\left(-\frac{t}{\theta}\right), \quad t > 0.$$

The parameter $\theta$ represents the mean survival time. The observed time is $t_i = \min(T_i, C_i)$. Let $\delta_i$ be the censoring indicator, that is, $\delta_i = 1$ if $T_i \le C_i$ (non-censored) and $\delta_i = 0$ if $T_i > C_i$ (censored). To identify clusters of short survival, the null and alternative hypotheses are written as

$$H_0: \theta_{in} = \theta_{out} \quad \text{vs.} \quad H_a: \theta_{in} < \theta_{out},$$

where $\theta_{in}$ denotes the mean survival time for subjects within zone $z$, and $\theta_{out}$ is the mean survival time for subjects outside zone $z$. The exponential-based spatial scan statistic is defined as

$$\lambda = \max_{z \in Z} \frac{\left(\frac{r_{in}}{\sum_{i \in z} t_i}\right)^{r_{in}} \left(\frac{r_{out}}{\sum_{i \notin z} t_i}\right)^{r_{out}}}{\left(\frac{R}{\sum_{i=1}^{I} t_i}\right)^{R}},$$

where $r_{in} = \sum_{i \in z} \delta_i$ and $r_{out} = \sum_{i \notin z} \delta_i$ (the number of non-censored subjects inside and outside zone $z$, respectively). The total number of non-censored subjects in the entire study area is denoted by $R = r_{in} + r_{out}$. When there are no censored observations, $r_{in}$ and $r_{out}$ are replaced by the total number of subjects inside and outside zone $z$, $n_{in}$ and $n_{out}$, respectively, and $R$ by $I$ in the above test statistic. When searching for clusters of short survival time using SaTScan™, users can specify the maximum size for $z$. The default setting is 50% of the total population. When the most likely cluster is very large, one may want to know whether it contains smaller clusters that are statistically significant. To find out, we can try different values for the maximum reported cluster size (MRCS) rather than for the maximum scanning window size (MSWS). The MRCS is also set to 50% of the total population by default. 
It is not clear how to select the best MRCS for the exponential model. In the next section, we propose a Gini coefficient to optimize the MRCS for the exponential model.
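As a minimal sketch of the test statistic for a single candidate zone, the Python function below computes the log likelihood ratio from the observed times and censoring indicators. The function name and argument layout are hypothetical, and the one-sided restriction (return 0 unless the estimated mean survival inside the zone is shorter than outside) is our assumption, chosen to mirror the search for clusters of short survival; it is not taken from SaTScan™ itself.

```python
import math


def exponential_llr(times, events, in_zone):
    """Log likelihood ratio for one zone under the exponential model.

    times: observed times t_i (survival or censoring time, whichever is earlier).
    events: censoring indicators delta_i (1 = non-censored, 0 = censored).
    in_zone: booleans, True if subject i is located inside the zone.
    """
    r_in = sum(d for d, z in zip(events, in_zone) if z)
    r_out = sum(d for d, z in zip(events, in_zone) if not z)
    s_in = sum(t for t, z in zip(times, in_zone) if z)
    s_out = sum(t for t, z in zip(times, in_zone) if not z)
    R, S = r_in + r_out, s_in + s_out
    if R == 0 or s_in <= 0 or s_out <= 0:
        return 0.0
    # ASSUMPTION: one-sided search for short survival, i.e. the estimated
    # hazard inside (r_in / s_in) must exceed the estimated hazard outside.
    if r_in / s_in <= r_out / s_out:
        return 0.0

    def term(r, s):
        # r * log(r / s), with the convention 0 * log(0) = 0.
        return 0.0 if r == 0 else r * math.log(r / s)

    return term(r_in, s_in) + term(r_out, s_out) - term(R, S)


# Illustrative data: three short-lived subjects inside, three long-lived outside.
times = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]
events = [1, 1, 1, 1, 1, 1]
inside = [True, True, True, False, False, False]
print(exponential_llr(times, events, inside))
```

The scan statistic is then the maximum of this quantity over all candidate zones, and its significance would be assessed by Monte Carlo replication, which this sketch does not attempt.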

Gini coefficient for exponential model

To measure the disproportion of survival in each area, the Lorenz curve can be defined using the number of subjects and the sum of survival times. We define the x-axis as the cumulative proportion of the number of non-censored subjects and the y-axis as the cumulative proportion of the sum of observed times. If there is only one significant cluster, the Lorenz curve is constructed in the same way as that of the Poisson model. Specifically, the x- and y-coordinates of point $P$ for the cluster are defined as $x_1 = r_1 / R$ and $y_1 = s_1 / S$, where $r_1$ and $s_1$ are the number of non-censored subjects and the sum of observed times in the cluster, and $R$ and $S = \sum_{i=1}^{I} t_i$ are the corresponding totals in the whole study area. Considering the maximum likelihood estimates for the parameter of the exponential distribution under the null and alternative hypotheses, that is, $\hat{\theta} = S / R$ and $\hat{\theta}_{in} = \sum_{i \in z} t_i / r_{in}$, $\hat{\theta}_{out} = \sum_{i \notin z} t_i / r_{out}$ (with $r_{in}$ and $r_{out}$ the numbers of non-censored subjects inside and outside zone $z$), the cumulative proportion of the sum of the observed times would be proportional to the cumulative proportion of non-censored subjects in each region under the null hypothesis of no clusters. If there is a significant cluster of short survival, the proportion of the sum of observed times in the cluster to that in the whole study region would be less than the proportion of the number of subjects. As the sum of the observed times in the cluster decreases, the y-coordinate decreases and the Lorenz curve moves farther away from the reference line. Then, the value of the Gini coefficient, which is twice the area between the Lorenz curve and the reference line, increases. When there are $K$ clusters (ordered by their statistical significance), the coordinates of the $k$-th point are defined as $x_k = \sum_{j=1}^{k} r_j / R$ and $y_k = \sum_{j=1}^{k} s_j / S$, where $r_j$ and $s_j$ are the number of non-censored subjects and the sum of observed times in the $j$-th cluster. The Lorenz curve connects the $K$ points $(x_k, y_k)$, and the Gini coefficient is calculated in the same way as for the Poisson model,

$$G = 1 - \sum_{k=1}^{K+1} (x_k - x_{k-1})(y_k + y_{k-1})$$

with $(x_0, y_0) = (0,0)$ and $(x_{K+1}, y_{K+1}) = (1,1)$. Different values for the MRCS produce different sets of clusters with different values of the Gini coefficient. We can select the optimal collection of clusters with the highest dissimilarity in survival based on the Gini coefficient.
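The selection step can be sketched as follows, assuming the significant clusters reported at each candidate MRCS have already been obtained (for example from separate SaTScan™ runs) and summarised as pairs of non-censored counts and sums of observed times. The function names, the dictionary layout, and all numbers in the example are hypothetical; the Gini coefficient uses the trapezoid form, one minus twice the area under the Lorenz curve.

```python
def exponential_gini(clusters, R, S):
    """Gini coefficient for one collection of reported clusters.

    clusters: list of (r_k, s_k) pairs, ordered by statistical significance,
              where r_k is the number of non-censored subjects and s_k the
              sum of observed times in cluster k.
    R, S: total non-censored subjects and total observed time in the study area.
    """
    x_prev = y_prev = 0.0
    cum_r = cum_s = 0.0
    twice_area_under = 0.0
    for r_k, s_k in clusters:
        cum_r += r_k
        cum_s += s_k
        x, y = cum_r / R, cum_s / S
        twice_area_under += (x - x_prev) * (y + y_prev)
        x_prev, y_prev = x, y
    # Close the curve at (1, 1).
    twice_area_under += (1.0 - x_prev) * (1.0 + y_prev)
    return 1.0 - twice_area_under


def select_mrcs(clusters_by_mrcs, R, S):
    """Return the MRCS with the highest Gini coefficient and all Gini values."""
    ginis = {m: exponential_gini(c, R, S) for m, c in clusters_by_mrcs.items()}
    best = max(ginis, key=ginis.get)
    return best, ginis


# Hypothetical cluster summaries for three candidate MRCS values (in percent).
candidates = {
    10: [(20, 80.0)],
    20: [(20, 80.0), (10, 70.0)],
    50: [(45, 380.0)],
}
best, ginis = select_mrcs(candidates, R=100, S=1000.0)
print(best, ginis)
```

In this made-up example the two smaller clusters reported at an MRCS of 20% give a larger Gini coefficient than the single large cluster reported at 50%, so 20% would be selected.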

Simulation study

We conducted a simulation study to evaluate the performance of the Gini coefficient in the exponential model. We considered six cluster models, with Seoul and Gyeonggi Province in South Korea as the whole study region. True clusters of different shapes and sizes were assumed in the study region, which consists of 67 districts, as shown in Fig. 1. Since circular and elliptical windows are available in SaTScan™, we mainly considered these two shapes. We also included an irregularly shaped cluster to examine whether the proposed method could identify irregular clusters better than the default setting. Cluster models A and B assumed a circular true cluster of 10% (6 districts) and 30% (20 districts) of the entire study region, respectively. Cluster model C included two adjacent circular clusters, each of which accounted for 10% (6 districts). Models D and E consisted of elliptical clusters of 10% (6 districts) and 30% (20 districts), respectively. Model F included an irregularly shaped cluster of 20% (13 districts). For each model, we considered 12 scenarios, combining three mean survival times with four censoring rates. We varied the mean survival time for the true clusters as 2, 5, and 7, compared to 10 for areas outside the clusters. We adopted the parameter setting for the mean survival time from the study by Huang et al. [4]. The censoring rates were set to 10%, 30%, 50%, and 70% to examine how the performance of the proposed method is affected by the censoring rate.

Fig. 1

Cluster models used in the simulation. A one circular cluster of 10%, B one circular cluster of 30%, C two circular clusters of 10% each, D one elliptical cluster of 10%, E one elliptical cluster of 30%, F one irregular cluster of 20%

We generated 1,000 subjects and randomly assigned each of them to one of the 67 districts in the study region under each scenario. If a subject was in a district of the true cluster, the survival time was generated from an exponential distribution with a mean of 2, 5, or 7, depending on the scenario. Otherwise, the survival time was generated from an exponential distribution with a mean of 10. We censored the survival times of subjects randomly selected from the 1,000 subjects at the chosen censoring rate. We then searched for clusters with short survival using circular and elliptical scanning windows, with 15 MRCS values of 3%, 4%, 5%, 6%, 8%, 10%, 12%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50% in the SaTScan™ software. Using these values can be thought of as a grid search. These candidate MRCS values are the ones used for the Poisson and Bernoulli models in SaTScan™, and we adopted them for the exponential model for consistency. The MSWS was fixed at 50%. The Gini coefficient was calculated for each MRCS value. We selected the optimal MRCS with the highest Gini coefficient. The reported clusters were then compared with the true clusters. We repeated the simulation 1,000 times for each scenario. We counted the number of times the Gini coefficient selected each of the 15 MRCS values as the optimal. The performance of the proposed method was summarized using the sensitivity and positive predictive value (PPV). In the context of spatial cluster detection, sensitivity is the proportion of districts correctly detected among the districts in the true cluster, and PPV is the proportion of districts correctly detected among the districts in the detected cluster. 
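Higher values of these measures indicate more accurate detection. Specifically, the sensitivity and PPV were estimated from 1,000 datasets as

$$\text{Sensitivity} = \frac{1}{M} \sum_{m=1}^{M} \frac{|D_m \cap T|}{|T|}, \qquad \text{PPV} = \frac{1}{M} \sum_{m=1}^{M} \frac{|D_m \cap T|}{|D_m|},$$

where $T$ is the set of districts in the true cluster, $D_m$ is the set of districts detected in the $m$-th rejected dataset, and $M$ is the number of rejected datasets, that is, datasets in which the null hypothesis was rejected. We also calculated the accuracy measures under the default MRCS setting of 50% in SaTScan™.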
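The data-generating step and the accuracy measures can be sketched in Python with the standard library. Several details are assumptions made only for illustration: the text does not say how the censoring time of a censored subject is chosen, so the sketch draws it uniformly between 0 and the generated survival time; districts are labelled 0 to 66; and sensitivity and PPV are averaged over the datasets with a detected cluster. The function names are hypothetical, and the cluster search itself (done in SaTScan™) is not reproduced.

```python
import random


def simulate_dataset(n_subjects, n_districts, cluster_districts,
                     mean_in, mean_out, censor_rate, rng):
    """Return a list of (district, observed_time, event) tuples."""
    n_censored = round(censor_rate * n_subjects)
    censored_ids = set(rng.sample(range(n_subjects), n_censored))
    data = []
    for i in range(n_subjects):
        district = rng.randrange(n_districts)
        mean = mean_in if district in cluster_districts else mean_out
        t = rng.expovariate(1.0 / mean)  # exponential with the given mean
        if i in censored_ids:
            # ASSUMPTION: censoring time uniform on (0, t); not from the paper.
            data.append((district, rng.uniform(0.0, t), 0))
        else:
            data.append((district, t, 1))
    return data


def sensitivity_ppv(true_cluster, detected_per_dataset):
    """Average sensitivity and PPV over datasets with a detected cluster.

    detected_per_dataset: one set of detected districts per dataset;
    an empty set means the null hypothesis was not rejected.
    """
    sens_sum = ppv_sum = 0.0
    n_rejected = 0
    for detected in detected_per_dataset:
        if not detected:
            continue
        hits = len(true_cluster & detected)
        sens_sum += hits / len(true_cluster)
        ppv_sum += hits / len(detected)
        n_rejected += 1
    if n_rejected == 0:
        return float("nan"), float("nan")
    return sens_sum / n_rejected, ppv_sum / n_rejected


rng = random.Random(2021)
data = simulate_dataset(1000, 67, {0, 1, 2, 3, 4, 5}, 5.0, 10.0, 0.30, rng)
print(sum(1 for _, _, event in data if event == 0))  # number of censored subjects

# Hypothetical detection results for three datasets (the second is not rejected).
print(sensitivity_ppv({1, 2, 3, 4, 5, 6}, [{1, 2, 3, 4, 5, 6, 7, 8}, set(), {1, 2, 3}]))
```

A detected cluster that is larger than the true one keeps the sensitivity high but lowers the PPV, which is the pattern the accuracy measures are meant to expose.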

Korea community health survey data

To illustrate the proposed method, we used data from the Korea Community Health Survey (KCHS) conducted by the Korea Centers for Disease Control and Prevention [21]. This community-based cross-sectional survey has been conducted at 253 community health centres annually since 2008. The survey questionnaire includes topics related to health behaviour and prevention. We used the age of first drinking for males as the survival time in the 2017 survey data. If a person had never had a drink, his survival time was censored at his age at the time of the survey. The location information of each individual was available at the district level because each district in Korea has approximately one community health centre. In Seoul and Gyeonggi Province, we searched for areas with a low mean age of first drinking (i.e. spatial clusters of short survival time) using the exponential-based spatial scan statistic with both circular and elliptical scanning windows. The reported clusters selected optimally by the proposed method were compared with those at the default setting in SaTScan™.

Results

Simulation study results

Here, we present only a subset of the simulation results. The other results are included in Additional file 1. Tables 1 and 2 show that, when the true cluster was circular with a mean survival time of 5, the MRCS most often selected by the Gini coefficient was the same as the size of the true cluster in nearly all settings, with either circular or elliptical windows and across censoring rates (in Table 1, the one exception is the elliptical window with 50% censoring, where 12% was chosen most often). The detection accuracy was very high at the MRCS matching the true cluster size. Both the sensitivity and PPV were close to or above 0.95, higher than those at the default setting in most cases. The difference in the detection accuracy between the most often chosen MRCS and the default setting was larger when the true cluster was smaller (10%). The difference in PPV was even more pronounced. When the true cluster was medium sized (30%), the PPV was higher in every case at the most often chosen MRCS, while the sensitivity was slightly higher or similar. These results indicate that the spatial scan statistic without optimizing the MRCS tends to report a larger cluster than the true cluster, especially when the true cluster is small. A lower PPV implies that the detected cluster is larger, because the number of districts in the detected cluster is the denominator of the PPV. We also summarized the overall detection accuracy when using the Gini coefficient over all the chosen MRCSs. Still, the sensitivity and PPV were higher than or similar to those at the default setting.

Table 1

Simulation results for cluster model A (one circular cluster, 10% of total area) with a mean survival time of 5

			Maximum reported cluster size (MRCS)
Window	Censoring rate	Measure	3%	4%	5%	6%	8%	10%	12%	15%	20%	25%	30%	35%	40%	45%	50%	Overall	Default setting
Circular window	10%	Frequency	2	5	15	22	48	657	110	76	23	15	8	4	2	1	1		989
		Sensitivity	0.013	0.333	0.600	0.576	0.635	0.972	0.944	0.983	0.978	0.978	0.979	1.000	1.000	1.000	1.000	0.934	0.928
		PPV	0.500	1.000	0.931	0.957	0.885	0.989	0.814	0.670	0.503	0.428	0.322	0.286	0.235	0.222	0.188	0.909	0.903
	30%	Frequency	0	5	6	32	27	540	162	75	77	34	6	3	3	1	1		972
		Sensitivity	–	0.367	0.389	0.563	0.599	0.957	0.937	0.960	0.961	0.995	1.000	1.000	1.000	1.000	1.000	0.923	0.906
		PPV	–	0.950	0.911	0.974	0.783	0.968	0.812	0.700	0.524	0.430	0.339	0.271	0.234	0.214	0.188	0.852	0.845
	50%	Frequency	0	9	0	120	16	367	209	159	56	32	8	7	1	2	0		986
		Sensitivity	–	0.333	–	0.504	0.490	0.972	0.896	0.984	0.946	0.969	1.000	1.000	1.000	0.917	–	0.886	0.875
		PPV	–	1.000	–	0.991	0.716	0.977	0.774	0.711	0.479	0.419	0.327	0.301	0.231	0.165	–	0.830	0.824
	70%	Frequency	0	20	5	18	0	531	234	5	8	31	29	6	3	16	12		918
		Sensitivity	–	0.333	0.333	0.500	–	0.970	0.828	0.667	0.979	0.844	0.994	1.000	0.889	0.969	0.986	0.902	0.831
		PPV	–	1.000	1.000	0.986	–	0.997	0.708	0.500	0.522	0.349	0.319	0.269	0.219	0.208	0.191	0.841	0.837
Elliptical window	10%	Frequency	1	4	9	17	74	430	241	124	46	22	5	5	4	1	2		985
		Sensitivity	0.010	0.417	0.482	0.510	0.701	0.957	0.974	0.989	0.989	1.000	1.000	0.967	1.000	1.000	1.000	0.934	0.920
		PPV	1.000	1.000	0.963	1.000	0.962	0.969	0.823	0.689	0.522	0.399	0.326	0.279	0.238	0.207	0.185	0.853	0.847
	30%	Frequency	0	3	1	28	118	248	247	179	85	43	12	3	2	0	2		971
		Sensitivity	–	0.333	0.333	0.500	0.675	0.940	0.942	0.990	0.973	0.996	0.986	1.000	0.917	–	1.000	0.908	0.886
		PPV	–	1.000	1.000	0.988	0.968	0.949	0.799	0.690	0.538	0.411	0.331	0.287	0.220	–	0.164	0.794	0.788
	50%	Frequency	2	6	3	60	122	159	258	211	77	54	18	3	0	2	2		977
		Sensitivity	0.027	0.389	0.556	0.500	0.669	0.957	0.913	0.982	0.957	0.988	0.982	1.000	–	1.000	1.000	0.883	0.865
		PPV	1.000	1.000	0.933	0.992	0.986	0.955	0.791	0.696	0.522	0.411	0.321	0.288	–	0.182	0.185	0.781	0.779
	70%	Frequency	0	4	0	53	184	212	169	85	91	30	25	20	20	7	13		913
		Sensitivity	–	0.333	–	0.500	0.688	0.970	0.871	0.794	0.839	0.861	0.853	1.000	0.942	0.976	0.987	0.829	0.758
		PPV	–	1.000	–	1.000	0.987	0.986	0.770	0.535	0.460	0.340	0.275	0.280	0.227	0.207	0.182	0.762	0.761
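The Frequency rows of Table 1 can be summarised with a short Python sketch that reports, for each window shape and censoring rate, the MRCS chosen most often and its share of the selections. The frequencies are copied from the table; the function name is hypothetical. In every row the frequencies add up to the value in the default-setting column, which we read as the number of datasets in which the null hypothesis was rejected; that reading is our interpretation rather than something stated in the table.

```python
# Candidate MRCS values (%) and the Frequency rows copied from Table 1.
MRCS = [3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 30, 35, 40, 45, 50]
FREQ = {
    ("circular", 10): [2, 5, 15, 22, 48, 657, 110, 76, 23, 15, 8, 4, 2, 1, 1],
    ("circular", 30): [0, 5, 6, 32, 27, 540, 162, 75, 77, 34, 6, 3, 3, 1, 1],
    ("circular", 50): [0, 9, 0, 120, 16, 367, 209, 159, 56, 32, 8, 7, 1, 2, 0],
    ("circular", 70): [0, 20, 5, 18, 0, 531, 234, 5, 8, 31, 29, 6, 3, 16, 12],
    ("elliptical", 10): [1, 4, 9, 17, 74, 430, 241, 124, 46, 22, 5, 5, 4, 1, 2],
    ("elliptical", 30): [0, 3, 1, 28, 118, 248, 247, 179, 85, 43, 12, 3, 2, 0, 2],
    ("elliptical", 50): [2, 6, 3, 60, 122, 159, 258, 211, 77, 54, 18, 3, 0, 2, 2],
    ("elliptical", 70): [0, 4, 0, 53, 184, 212, 169, 85, 91, 30, 25, 20, 20, 7, 13],
}


def modal_mrcs(freq):
    """Return (most often chosen MRCS, its share of selections, row total)."""
    total = sum(freq)
    k = max(range(len(freq)), key=lambda i: freq[i])
    return MRCS[k], freq[k] / total, total


for (window, cens), freq in FREQ.items():
    mrcs, share, total = modal_mrcs(freq)
    print(f"{window:10s} censoring={cens}%  modal MRCS={mrcs}%  "
          f"share={share:.3f}  selections={total}")
```

For the circular window the true cluster size of 10% is the most frequent choice at every censoring rate, while for the elliptical window the choices are spread more widely around 10%.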